Robust Visual Tracking Based on Adaptive Convolutional Features and Offline Siamese Tracker

Robust and accurate visual tracking is one of the most challenging computer vision problems. Due to the inherent lack of training data, a robust approach for constructing a target appearance model is crucial. The existing spatially regularized discriminative correlation filter (SRDCF) method learns partial-target information or background information when experiencing rotation, out of view, and heavy occlusion. To reduce the computational complexity and enhance tracking ability, we first introduce an adaptive dimensionality reduction technique to extract the features from the image, based on the pre-trained VGG-Net. We then propose an adaptive model update that assigns weights during the update procedure depending on the peak-to-sidelobe ratio. Finally, we combine the online SRDCF-based tracker with the offline Siamese tracker to accomplish long-term tracking. Experimental results demonstrate that the proposed tracker has satisfactory performance in a wide range of challenging tracking scenarios.


Introduction
Target tracking is a classical computer vision problem with many applications. In generic tracking, the goal is to estimate the trajectory and size of a target in an image sequence, given only its initial information [1]. Target tracking has progressed significantly, but challenges remain due to appearance change, scale change, deformation, and occlusion. Researchers have been tackling these problems by learning a discriminative appearance model of the target, which describes the target and background appearance based on rich feature representations. As such, this paper investigates deep robust feature representations, adaptive model updates, and a Siamese offline tracker for robust visual tracking.
Danelljan et al. [2] proposed the spatially regularized discriminative correlation filter (SRDCF), which penalizes the correlation filter coefficients during learning depending on their spatial location. The SRDCF framework has been significantly improved by including scale estimation [3], non-linear kernels [4], long-term memory [5], and by removing the periodic effects of circular convolution [2,6,7]. However, three main problems limit the SRDCF formulation. First, the dimension of the deep features significantly limits the tracking speed. Second, short-term target tracking algorithms cannot handle the out-of-view problem. Third, online updates with a fixed rate cause drift under heavy occlusion.
Advances in visual tracking have been made using features learned from deep convolutional neural networks (DCNNs). However, high-performing deep features rely heavily on training on large-scale datasets. Thus, most state-of-the-art trackers use pre-trained networks to extract deep features. These improvements in robustness, however, cause significant reductions in tracking speed. Siamese networks have also been used to solve the tracking problem. The matching mechanism in Siamese network approaches prevents model contamination and achieves better tracking performance. To perform long-term tracking, some methods implement a failure detection mechanism to combine multiple detectors with complementary characteristics at the different tracking stages. However, these approaches only use online update tracking and do not integrate Siamese trackers.
Based on the discussion above, we propose a novel SRDCF tracking framework that jointly uses DCNN features and failure detection combined with Siamese trackers. The main contributions of this paper are as follows: (1) We propose a method to obtain a specific feature map considering the tradeoff between spatial information and semantic information through the convolutional feature response, and use an adaptive projection matrix to obtain the principal components of the corresponding feature map, which reduces the computational complexity during feature extraction. (2) We propose a novel adaptive model updating method. First, we compute the confidence of the target position from the peak-to-sidelobe ratio (PSR) of the confidence map, which is a highly reliable indicator. The weight is then calculated from the given PSR and is used to achieve adaptive model updating. (3) We combine the SRDCF framework with a Siamese tracker: by assigning a threshold, we infer the tracker status and warn of potential tracking failures, achieving long-term tracking by switching between the two trackers.
The rest of the paper is organized as follows: in Section 2, we review related research work. In Section 3, we present the proposed visual tracking framework in detail. Numerous experimental results and analyses are shown in Section 4. In Section 5, we provide the conclusions to our work.

Tracker with Correlation Filter
Discriminative Correlation Filters (DCFs) [2,8,9] have achieved outstanding results for visual tracking. This approach uses the circular correlation properties to train a regressor using a sliding window. At first, DCF methods [8,10] were limited to a single feature channel. Some approaches have extended the DCF framework to multi-channel feature maps [11][12][13]. The high-dimensional features are exploited in multi-channel DCF for improved tracking. The combination of the DCF framework and deep convolutional features [14] has significantly improved tracking ability. Danelljan et al. [3] proposed scale estimation to achieve spatial evaluation. Danelljan et al. [2] also introduced spatial regularization in order to alleviate the boundary effect in SRDCF. Valmadre et al. [15] constructed a convolutional neural network (CNN) that contains a correlation filter as part of the network and uses end-to-end representation learning based on the similarity between correlation and convolution operations.

Tracker with Deep Features
The introduction of CNNs has significantly advanced the field of computer vision, including visual tracking. Wang et al. [9] proposed a deep learning tracker (DLT) that is based on the combination of offline pre-training and online fine-tuning. Wang et al. [16] designed the structured output deep learning tracker (SO-DLT) within the particle filter framework. Other trackers learn target-specific CNNs without pre-training to avoid the problems caused by offline training, which treats the CNN as a black box [17,18]. In order to learn multiple correlation filters, Ma et al. [19] extracted the hierarchical convolutional features (HCF) from three layers of related networks. Danelljan et al. [20] proposed a tracker that learns continuous convolution operators (CCOT) to interpolate discrete features and train spatially continuous convolution filters, which enabled the efficient integration of multi-resolution deep feature maps. Danelljan et al. [21] also designed an efficient convolution operator (ECO) for visual tracking using a factorized convolution operation to overcome the low computational efficiency caused by CNN operation.

Trackers with Feature Dimensionality Reduction
Dimensionality reduction is widely used in visual tracking due to the computational complexity. Danelljan et al. [22] minimized the data term used in Principal Component Analysis (PCA) on the target appearance. In order to achieve sparse representation of the related target, Huang et al. [23] used sparse multi-manifold learning to achieve semi-supervised dimensionality reduction. Cai et al. [24] designed an adaptive dimensionality reduction method to handle the high-dimensional features extracted by deep convolutional networks. To model the mapping from high-dimensional SPD manifold to the low-dimensional manifold with an orthonormal projection, Harandi et al. [25] proposed a dimensionality reduction method to handle high-dimensional SPD matrices by constructing a lower-dimensional SPD manifold.

Trackers with Siamese Networks
Siamese architecture has been exploited in the tracking field, performing impressively without any model update. Tao et al. [26] trained a Siamese network to identify candidate image locations that match the initial object appearance, and called their method the Siamese Instance Search Tracker (SINT). In this approach, many candidate patches are passed through the network, and the patch with the highest matching score is selected as the tracking output. Held et al. [27] introduced GOTURN, which avoids the need to score many candidate patches and runs at 100 fps. However, a disadvantage of their approach is that it does not possess intrinsic invariance for translating the search image. Later, Bertinetto et al. [28] trained a similar Siamese network to locate an example image within a large search image. The network parameters were initialized by the pre-trained networks through ILSVRC2012 (Large Scale Visual Recognition Challenge) [29] image classification problem, and then fine-tuned for the similarity learning problem in the second offline phase.

Baseline
The SRDCF tracker [2] is a spatially regularized correlation filter obtained by exploiting the sparse nature of the proposed regularization in the Fourier domain. The tracker effectively reduces the boundary effect and has achieved better tracking performance on the OTB2015 benchmark than other correlation filter tracking algorithms.
In the learning stage, the SRDCF tracker introduces a spatial weight function $\omega$ to penalize the magnitude of the filter coefficients $f$. The regularization weights $\omega$ determine the importance of the correlation filter coefficients $f$ depending on their spatial locations. Coefficients in $f$ residing outside the target region are suppressed by assigning higher weights in $\omega$, and vice versa. The resulting optimization problem is expressed in Equation (1), where $x_k^l * f^l$ represents the convolution response of the filter to the sample $x_k$ and $l$ indexes the feature dimension. The desired output $y_k$ is a scalar-valued function over the domain that includes a label for each location in the sample, $k$ indexes the frames, $t$ represents the total number of samples, and $d$ denotes the dimension of the feature map.
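For reference, the SRDCF objective of [2] can be written in the notation of this section as shown below. This is a reconstruction from the definitions above rather than a verbatim copy of Equation (1); in particular, the per-sample weights $\alpha_k \geq 0$ are taken from the formulation in [2] and are an assumption here.

```latex
\varepsilon(f) = \sum_{k=1}^{t} \alpha_k \left\| \sum_{l=1}^{d} x_k^l * f^l - y_k \right\|^2
               + \sum_{l=1}^{d} \left\| \omega \cdot f^l \right\|^2
```

The first term measures how well the filter response matches the desired output on each training sample, and the second term applies the spatial penalty $\omega$ to every feature channel of the filter.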
By applying Parseval's theorem to Equation (1), the filter $f$ can equivalently be obtained by minimizing the resulting loss function in Equation (2) over the DFT coefficients $\hat{f}$. The hat symbol denotes the DFT, $M \times N$ represents the sample size, $\mathcal{D}(\hat{x}_k^l)$ denotes the diagonal matrix with the elements of the vector $\hat{x}_k^l$ on its diagonal, $\mathcal{C}(\hat{\omega})$ represents circular two-dimensional (2D) convolution with the function $\hat{\omega}$ (i.e., $\mathcal{C}(\hat{\omega})\hat{f}^l = \mathrm{vec}(\hat{\omega} * \hat{f}^l)$), and $\mathrm{vec}(\cdot)$ is the vector representation.
By applying a unitary $MN \times MN$ matrix $B$, we obtain the real-valued representation $\tilde{f}^l = B\hat{f}^l$. The loss function is then simplified by defining the fully vectorized real-valued filter, where $D_k = (D_k^1, \ldots, D_k^d)$, $D_k^l = B\,\mathcal{D}(\hat{x}_k^l)\,B^H$, $\tilde{y}_k = B\hat{y}_k$, and $C = B\,\mathcal{C}(\hat{\omega})\,B^H / MN$. We define $W$ as the $dMN \times dMN$ block diagonal matrix with each diagonal block equal to $C$.
Finally, the regularized correlation filter is obtained by solving the normal equation. The SRDCF model is updated by first extracting a new training sample $x_t$ centered at the target location, where $t$ denotes the current frame number. We then update $A_t$ in Equation (4) and $b_t$ in Equation (5) with a learning rate $\gamma \geq 0$.
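The update step can be illustrated with a minimal NumPy sketch. It assumes the linear-interpolation form used in SRDCF-style trackers, in which the new sample contributes $D_t^H D_t + W^H W$ to $A_t$ and $D_t^H \tilde{y}_t$ to $b_t$; since Equations (4) and (5) are not reproduced in this text, that form is an assumption. The function names are hypothetical, and the dense solver stands in for whatever iterative solver a real tracker would use.

```python
import numpy as np

def update_normal_equations(A_prev, b_prev, D_t, y_t, W, gamma):
    """Blend the previous normal equations with those of the newest sample."""
    # Contribution of the new sample alone: data term plus spatial regularizer.
    A_new = D_t.conj().T @ D_t + W.conj().T @ W
    b_new = D_t.conj().T @ y_t
    # Linear interpolation with learning rate gamma (assumed update form).
    A_t = (1.0 - gamma) * A_prev + gamma * A_new
    b_t = (1.0 - gamma) * b_prev + gamma * b_new
    return A_t, b_t

def solve_filter(A_t, b_t):
    """Solve A_t f = b_t for the filter; a dense solve, for small examples only."""
    return np.linalg.solve(A_t, b_t)
```

With `gamma = 0` the model is left unchanged, and with `gamma = 1` it is replaced entirely by the newest sample, which is why a small value such as the 0.025 reported in the implementation details gives a slowly adapting model.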

Adaptive Convolutional Features
By applying the convolutional features of the pre-trained VGG-Net [12], we used an adaptive dimension reduction method to construct the feature space, and then used the peak-to-sidelobe ratio to select more reliable results for updating the model. For long-term tracking, we designed a novel failure detection mechanism in the tracking procedure. By combining the online updating method and the offline tracker, we not only addressed the issues in the SRDCF framework, but also handled the occlusion, deformation, and out-of-view problems present in long-term tracking. The flow chart of the proposed tracking algorithm is shown in Figure 1.

Convolutional Features
Convolutional neural networks (CNNs), such as AlexNet [30], GoogleNet [31], ResNet [32], and VGG-Net [12], have been successfully applied to large-scale image classification and detection, either by extracting features or by directly performing the task. VGG-Net was trained on 1.3 million images from the ImageNet dataset, and achieved the best result in a classification challenge. Compared with most CNN models of only five to seven layers, VGG-Net has a deeper structure with up to 19 layers (16 convolutional and three fully-connected layers), which contain spatial information and semantic information, respectively, and can therefore identify deeper features.
Research indicates that the features extracted from convolutional layers are better than those extracted from fully-connected layers. As shown in Figure 2, the feature extracted by the Conv3-4 layer in the VGG-Net model maintains spatial details, especially some information that is useful for accurate tracking (Figure 2b). Figure 2d illustrates the Conv5-4 layer of the VGG-Net model, which contains more semantic information. The semantic information yields better feature extraction when the target experiences deformation during tracking. We chose the Conv3-4 feature in this paper considering the tradeoff between spatial information and semantic information. The feature map of Pool5 is only 7 × 7, and achieving accurate localization at such a low resolution is impossible. Bilinear interpolation is typically used to solve this problem in the mapping space, where the weight $\beta_{ki}$ depends on the location of the kth frame and the ith adjacent eigenvectors, and $h$ represents the feature space.
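To make the interpolation step concrete, the following is a small NumPy sketch of bilinear upsampling of a feature map, in which every output value is a weighted sum of its four nearest input feature vectors. The function name, the (height, width, channels) layout, and the corner-aligned sampling grid are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def bilinear_upsample(h, out_h, out_w):
    """Resize a feature map h of shape (H, W, C) to (out_h, out_w, C)."""
    in_h, in_w = h.shape[:2]
    # Sampling positions in input coordinates (corners aligned, an assumption).
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    # Interpolation weights: distance from the upper-left neighbour.
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = h[y0][:, x0] * (1.0 - wx) + h[y0][:, x1] * wx
    bottom = h[y1][:, x0] * (1.0 - wx) + h[y1][:, x1] * wx
    return top * (1.0 - wy) + bottom * wy
```

For example, a 7 × 7 map could be resized with `bilinear_upsample(h, 56, 56)` to match the spatial resolution of Conv3-4.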

Adaptive Dimensionality Reduction
The feature dimension of the Conv3-4 layer is 56 × 56 × 256; many of these channels carry little information and increase the computation time. We used adaptive dimensionality reduction to preserve the main components of Conv3-4, based on the principal component analysis (PCA) of the related layer. After applying this method, the feature dimension was reduced from 256 to 130. As shown in Figure 3, the contribution of the feature under adaptive dimensionality reduction was 98% in the MotorRolling sequence.
We used singular value decomposition (SVD) of the matrix $R_t$ to solve Equation (9). The projection matrix is chosen from the first $D_2$ feature vectors of the matrix $R_t$. Here, $x_t$ denotes the $D_1$-dimensional feature learned from frame $t$. Adaptive dimensionality reduction results in the projection matrix $P_t$, a $D_1 \times D_2$ matrix with orthonormal columns, so that $P_t^T P_t = I$. By applying the projection matrix $P_t$, we achieved the new $D_2$-dimensional feature space, where $\eta_1, \ldots, \eta_t$ denote weights, $G_t$ denotes the covariance matrix, and $\Lambda_t$ represents a $D_2 \times D_2$ diagonal matrix containing the $\xi$ terms. We obtain the adaptive projection matrix through a fixed learning rate $\lambda$. The matrix $R_t$ and the variance matrix $Q_t$ are updated using linear interpolation at every time step. The fixed learning rate $\gamma \geq 0$ is used to simultaneously update the appearance feature space $\hat{x}_t$, where $x_t$ denotes the feature space determined through Equation (8). Due to the pooling operation, the feature space contains more semantic information.
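The projection step can be sketched in NumPy as follows. This is a simplified illustration rather than the paper's exact procedure: it keeps a single running covariance $R_t$ updated by linear interpolation with rate $\lambda$ and omits the separate matrix $Q_t$, and the function names and the (pixels, channels) feature layout are assumptions.

```python
import numpy as np

def update_projection(R_prev, feats, d2, lam):
    """Update the running covariance and return the top-d2 projection matrix.

    feats has shape (n_pixels, D1), one D1-dimensional feature per location.
    """
    centered = feats - feats.mean(axis=0, keepdims=True)
    C_t = centered.T @ centered / feats.shape[0]  # covariance of this frame
    # Linear interpolation with learning rate lam (simplified, assumed form).
    R_t = C_t if R_prev is None else (1.0 - lam) * R_prev + lam * C_t
    U, s, _ = np.linalg.svd(R_t)
    P_t = U[:, :d2]                               # D1 x D2, orthonormal columns
    energy = s[:d2].sum() / s.sum()               # fraction of variance kept
    return R_t, P_t, energy

def project(feats, P_t):
    """Map D1-dimensional features to the D2-dimensional subspace."""
    return feats @ P_t
```

The returned `energy` plays the role of the contribution figure quoted above: it reports how much of the total variance the retained $D_2$ components explain.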

Fast Sub-Grid Detection
At the detection stage, the location of the target in a new frame $t$ is estimated by applying the filter $\hat{f}_{t-1}$ that was updated in the previous frame. The filter is applied at multiple resolutions to estimate changes in the target size.
In Equation (16), $i$ denotes the imaginary unit. We iteratively maximize Equation (16) using Newton's method, starting at the location $(u^{(0)}, v^{(0)}) \in \Omega$. The gradient and Hessian in each iteration are computed by analytically differentiating Equation (16), and the iteration converges to the location of the maximum score.
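A minimal NumPy sketch of this kind of sub-grid refinement is given below. It assumes that the score being maximized is the trigonometric (Fourier series) interpolation of the discrete response map, as in SRDCF [2]; because Equation (16) is not reproduced in this text, that form, the function name, and the choice of starting point (the best grid location) are assumptions.

```python
import numpy as np

def subgrid_peak(scores, n_iter=5):
    """Refine the peak of a 2D score map to sub-grid accuracy with Newton's method."""
    M, N = scores.shape
    S = np.fft.fft2(scores) / (M * N)          # Fourier coefficients of the scores
    wm = 2.0 * np.pi * np.fft.fftfreq(M)       # angular frequencies along rows
    wn = 2.0 * np.pi * np.fft.fftfreq(N)       # angular frequencies along columns
    WM, WN = wm[:, None], wn[None, :]
    # Start from the best grid location.
    u, v = map(float, np.unravel_index(np.argmax(scores), scores.shape))
    for _ in range(n_iter):
        E = S * np.exp(1j * (WM * u + WN * v))  # terms of the interpolated score
        grad = np.array([np.real(np.sum(1j * WM * E)),
                         np.real(np.sum(1j * WN * E))])
        huu = np.real(np.sum(-(WM ** 2) * E))
        hvv = np.real(np.sum(-(WN ** 2) * E))
        huv = np.real(np.sum(-(WM * WN) * E))
        if huu >= 0.0 or huu * hvv - huv ** 2 <= 0.0:
            break                               # not locally concave: stop refining
        step = np.linalg.solve(np.array([[huu, huv], [huv, hvv]]), grad)
        u, v = u - step[0], v - step[1]
    return u, v
```

Because the gradient and Hessian are available in closed form, a handful of iterations is usually enough, which is what makes the detection step cheap compared with evaluating the score on a much finer grid.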

Adaptive Model Update
The SRDCF framework uses a fixed learning rate to update the tracking model. Once the target is occluded, the appearance model is negatively affected, which leads to tracking drift. The proposed method uses the PSR, $R_{PSR}$, to compute the confidence of the target position [33], and the model is updated depending on this confidence. PSR has been widely used in signal processing to express the peak intensity of a signal: $R_{PSR} = (\max S_f(x_t) - \phi_t) / \sigma_t$, where $S_f(x_t)$ represents the convolution response of the correlation filter to the sample, and $\phi_t$ and $\sigma_t$ denote the mean and standard deviation of the convolution response to the sample $x_t$, respectively. The PSR distribution of the David3 sequence is shown in Figure 4. The higher the PSR, the higher the confidence score of the target location. The target is completely occluded by the tree in the 84th frame, so the corresponding PSR drops to a local minimum, as seen at point A in Figure 4. The PSR gradually increases in the following frames. When the target is completely occluded by the trees in the 188th frame, the corresponding PSR drops to a local minimum again, as shown by point B in Figure 4. The tracking results at points A and B are clearly unreliable and cannot be used to update the model. The experiments show that the tracking result is highly reliable when the PSR is around 10-18.
Therefore, it is possible to determine from the PSR whether the target is affected by occlusion, and to assign a weight to the model update accordingly. The model is then updated using the learning rate $\eta$.
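The idea can be sketched in a few lines of NumPy. The PSR is computed from the peak, mean, and standard deviation of the response map as defined above. The weighting rule, however, is only an assumed stand-in: the paper's Equations (19)-(22) are not reproduced in this text, so the sketch simply skips the update when the PSR falls below a threshold, and the function names are hypothetical.

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio: (peak - mean) / standard deviation of the response."""
    return (response.max() - response.mean()) / (response.std() + 1e-12)

def adaptive_rate(psr_value, psr_threshold=10.0, eta=0.01):
    """Learning rate for this frame; zero when the response looks unreliable.

    The hard threshold is an assumption, not the paper's weighting rule.
    """
    return eta if psr_value >= psr_threshold else 0.0

def update_model(model_prev, model_new, rate):
    """Linear interpolation between the old model and the new estimate."""
    return (1.0 - rate) * model_prev + rate * model_new
```

A sharp, isolated peak gives a high PSR and a normal update, whereas a flat or noisy response, as during the occlusions at points A and B, gives a low PSR and leaves the model untouched.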

Long-Term Tracking Mechanism Based on Siamese Offline Tracker
Studies have shown the impressive performance of Siamese networks without any model update [26][27][28]. Compared with online trackers, these Siamese-network-based offline trackers are more robust to noisy model updates. Moreover, state-of-the-art tracking performance was achieved with a rich representation model learned from the large ILSVRC15 dataset [29]. However, these Siamese-network-based offline trackers are prone to drift in the presence of distractors that are similar to the target or when the target appearance in the first frame is significantly different from that in the remaining frames. Motivated by the complementary traits of online and offline trackers, we equipped our online update method with an offline-trained fully convolutional Siamese network [28]. By using this method, the stability-plasticity dilemma was balanced.
In long-term tracking, tracking-learning-detection (TLD) [34] runs the long-term tracking mechanism in each frame of the image sequence. The proposed algorithm instead uses a threshold $\theta_{re}$ to activate the long-term tracking mechanism: when $\max(s_r) < \theta_{re}$, the tracking method switches to the offline Siamese tracker to track the target. The process is executed once each time $\max(s_r) < \theta_{re}$. The implementation details of the fully convolutional Siamese network were provided in a previous study [28]. The ablation study in Section 4.2 shows that the proposed offline tracker can avoid noisy model updates to achieve some improvements. The overall tracking algorithm is described in Algorithm 1.

Algorithm 1: Proposed tracking algorithm.
Input: Image $I$; initial target position $(u^{(0)}, v^{(0)})$ and scale $a^{r_0}$; previous target position $(u^{(t-1)}, v^{(t-1)})$ and scale $a^{r_{t-1}}$.
Output: Estimated object position $(u^{(t)}, v^{(t)})$ and scale $a^{r_t}$.

For each $I_t$
  Extract the deep feature space $\hat{x}_t$ through the pre-trained VGG-Net;
  Update the matrices $R_t$ and $Q_t$ by linear interpolation using Equations (13) and (14). The SVD is performed and a new $P_t$ is found;
  Update the low-dimensional appearance feature space $\hat{x}_t$ using Equation (15);
  Compute the confidence of the target position using Equation (18);
  Update the tracking model $A_t$, $b_t$, and $\hat{x}_t$ using Equations (19)-(22);
  Compute the estimated object position $(u^{(t)}, v^{(t)})$ and scale $a^{r_t}$ using fast sub-grid detection;
  If $\max(s_r) < \theta_{re}$, update the estimated object position and scale using the offline Siamese tracker;
  Else output the estimated object position and scale directly;
End
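The switching logic at the end of Algorithm 1 can be summarized with the following Python sketch. The two trackers are passed in as plain callables with hypothetical signatures (they are placeholders, not APIs from the paper or from any library), and the feature extraction and model update steps are assumed to happen inside the online tracker.

```python
def track_sequence(frames, init_state, online_tracker, siamese_tracker, theta_re=0.5):
    """Run the online tracker and fall back to the Siamese tracker on low confidence.

    online_tracker(frame, state) -> (new_state, max_score)   # hypothetical signature
    siamese_tracker(frame, state) -> new_state               # hypothetical signature
    """
    state = init_state
    results = []
    for frame in frames:
        online_state, max_score = online_tracker(frame, state)
        if max_score < theta_re:
            # Potential failure detected: re-estimate with the offline tracker.
            state = siamese_tracker(frame, state)
            source = "siamese"
        else:
            state = online_state
            source = "online"
        results.append((state, source))
    return results
```

Keeping the fallback outside the online tracker means the Siamese network is evaluated only on frames where the correlation response is weak, rather than on every frame as in TLD-style designs.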

Experimental Results and Analysis
This section presents a comprehensive experimental evaluation of the proposed tracker.

Implementation Details
The experiments were run on a standard desktop with an Intel(R) Core(TM) i7-4790 CPU at 3.6 GHz, 16 GB RAM, and an NVIDIA Tesla K20m GPU. The weight function $\omega$ was constructed by starting from a quadratic function $\omega(m, n) = \tau + \xi\left[(m/P)^2 + (n/Q)^2\right]$. The minimum value of $\omega$ was set to $\tau = 0.1$, and the impact of the regularizer was set to $\xi = 3$. $P \times Q$ denotes the target size. The number of scales was set to $S = 7$, and $a = 1.01$ denotes the scale increment factor. During adaptive dimensionality reduction, the feature dimension of Conv3-4 was $D_1 = 256$, which was reduced to $D_2 = 130$. During linear interpolation, the learning rates were set to $\lambda = 0.15$ and $\gamma = 0.025$. $\theta_{re} = 0.5$ was used to activate the offline Siamese tracker; the tracker used the same parameters as in a previous study [20]. The PSR threshold $R_{PSR,t}$ was set to 10 during model update, and the learning rate was set to $\eta = 0.01$. Our MATLAB implementation ran at 4.6 frames per second with MatConvNet [35].
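As an illustration of the quadratic weight function, the NumPy sketch below evaluates it on a grid centered on the target. It assumes that the coefficient multiplies both squared terms, as in SRDCF [2], that $m$ runs along rows and $n$ along columns, and that the coordinates are measured from the center of the window; the function name is hypothetical.

```python
import numpy as np

def spatial_weights(height, width, P, Q, tau=0.1, xi=3.0):
    """Quadratic spatial regularization weights on a height x width grid.

    P and Q are the target size along rows and columns (assumed orientation).
    """
    m = np.arange(height) - (height - 1) / 2.0   # row offsets from the center
    n = np.arange(width) - (width - 1) / 2.0     # column offsets from the center
    return tau + xi * ((m[:, None] / P) ** 2 + (n[None, :] / Q) ** 2)
```

The weights reach their minimum `tau` at the center and grow quadratically outward, so filter coefficients far outside the target region are penalized most strongly.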

Reliability Ablation Study
An ablation study on VOT2016 was conducted to evaluate the contribution of the adaptive dimensionality reduction, adaptive model update, and Siamese tracker in the proposed method. The results of the VOT primary measure, expected average overlap (EAO), and two supplementary measures, accuracy (A) and robustness (R), are summarized in Table 1. We provide the details of the performance measures and evaluation protocol of VOT2016 in Section 4.4. The performance of the various modifications of the proposed method is discussed in the following.
Applying the adaptive dimensionality reduction is equivalent to extracting the principal components from the original image feature space. It not only reduces the computational complexity, but also improves the semantic representation during the procedure. Without it, the EAO dropped by 11% compared with the proposed method. Replacing the adaptive model update means that Ours Adr− does not use the PSR ($R_{PSR}$) to compute the confidence of the target position or to weight the update procedure by that confidence. Since the updated filter drifted due to deformation and occlusion, which affect the appearance of the tracked object, this version reduced our tracker performance by over 22% in EAO. $R_{av}$ remained unchanged in this experiment, whereas the $A_{av}$ of this version dropped by over 40%.
Removing the Siamese tracker from the proposed method mainly affected the performance of long-term tracking. The performance drop in EAO compared with the proposed method was around 10%, and the $A_{av}$ dropped by 20% due to the lack of a failure detection mechanism. This clearly illustrates the importance of our combination of the online tracker and Siamese tracker as outlined in Section 3.3.

OTB-2015 Benchmark
The OTB100 [36] benchmark contains the results of 29 trackers evaluated on 100 sequences using a no-reset evaluation protocol. We measured the tracking quality using precision and success plots. The success plot shows, for each threshold value, the fraction of frames in which the overlap between the predicted and ground truth bounding boxes is greater than the threshold. The precision plot shows similar statistics on the center error. The results are summarized by the areas under the curve (AUC) in these plots. Here, we only show the results for top-performing recent baselines to avoid clutter, including Struck [8], TGPR [37], DSST [3], KCF [4], SAMF [38], RPT [39], LCT [5], and results for the recent top-performing state-of-the-art trackers SRDCF [2] and MUSTER [40]. The results are shown in Figure 5. The proposed method performed the best on OTB100 and outperformed the baseline tracker, SRDCF. The OTB success plots computed on these trajectories and summarized by the AUC values are equal to the average overlap [41].
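The success plot and its AUC summary can be computed as in the following plain-Python sketch. Boxes are assumed to be given as (x, y, width, height) tuples, the AUC is approximated by the mean of the curve over evenly spaced thresholds, and the function names are hypothetical rather than taken from the OTB toolkit.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(predicted, ground_truth, thresholds):
    """Fraction of frames whose overlap exceeds each threshold."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]

def area_under_curve(curve):
    """AUC approximated as the mean success rate over the sampled thresholds."""
    return sum(curve) / len(curve)
```

A typical call would sample the thresholds evenly between 0 and 1, for example `[i / 20 for i in range(21)]`, and report `area_under_curve(success_curve(pred, gt, thresholds))` per tracker.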

VOT2016 Benchmark
We compared the proposed tracker with other state-of-the-art trackers on VOT2016, which contains 60 sequences. The trackers were restarted at each failure. The set is diverse, with the top-performing trackers coming from various classes including correlation filter methods such as CCOT [20],
With an EAO score of 0.329, the proposed method outperformed all compared trackers except ECO and CCOT. The proposed method significantly outperformed the correlation filter approaches that apply deep ConvNets, and also outperformed the trackers that apply different detection-based approaches. The detailed performance scores for the 10 top-performing trackers are shown in Table 2.

Per-Attribute Analysis
The VOT2016 dataset is annotated per frame with visual attributes to allow the detailed analysis of per-attribute tracking performance. Figure 6 shows the per-attribute plot for the top-performing trackers on VOT2016 in EAO. The proposed method was consistently ranked among the top three trackers on the five attributes. The proposed method performed the best in terms of size change, occlusion, camera motion, and unassigned. Under illumination change, the proposed tracker was outperformed by four trackers: ECO, CCOT, MLDF, and SSAT.

Tracking Speed Analysis
Speed measurements on a single CPU were computed on a standard desktop with an Intel(R) Core(TM) i7-4790 CPU at 3.6 GHz, 16 GB RAM, and an NVIDIA Tesla K20m GPU. Compared with the two best-performing methods, ECO and CCOT, the proposed method was slower than ECO, while being four times faster than CCOT. Compared with other trackers that apply deep ConvNets, such as DeepSRDCF [14] and SiamFC, the proposed tracker had better tracking results and was twice as fast as DeepSRDCF. The proposed tracker was nearly two times slower than the baseline SRDCF, but achieved better tracking results. Compared with baseline real-time trackers like KCF, DSST, and Staple, the proposed tracker was slower, but its tracking performance was much better. The speed of the trackers in terms of frames per second is shown in Table 3. The average speed of the proposed tracker measured on the VOT2016 dataset was approximately 4.6 fps, or 217 ms/frame. Figure 7 shows the processing time required by each step of the proposed method. Among them, the fast sub-grid detection process required 173 ms, the adaptive model update required 67 ms, and the offline Siamese tracker required 136 ms. Whether the offline Siamese tracker is employed depends on the condition on $\max(s_r)$. Due to the adaptive dimensionality reduction, the proposed tracker requires less time than directly using the deep features.

Qualitative Evaluation on the OTB Benchmark
In this section, we focus on the tracking results for objects experiencing severe occlusion, illumination change, and in-plane rotation on OTB100. The compared trackers included the baseline SRDCF, MUSTER, LCT, RPT, and SAMF. The tracking results are shown in Figure 8. Given the rich representation of the deep ConvNet, the proposed tracker outperformed the other trackers on sequences with complex attributes. In the Car4 and CarDark sequences, the illumination change occurs in frames 205 and 333, respectively. In the FaceOcc2 sequence, the target is occluded by a cap and a book. In the Freeman sequence, the target undergoes severe in-plane rotation. With the adaptive model update, the model is updated based on the peak-to-sidelobe ratio, which prevents the correlation filter from learning background information and allows it to keep tracking the object. Owing to the deep ConvNet features, the proposed tracker has a rich representation that performs well under illumination change in the Car4 and CarDark sequences. Notably, the proposed tracker succeeds in tracking the target until the very end of the FaceOcc2 and Freeman sequences. The offline Siamese Tracker is activated to achieve long-term tracking and to prevent tracking failure caused by the online model update.
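How a peak-to-sidelobe ratio can gate the model update is sketched below. This is a minimal illustration under stated assumptions, not the weighting used in this paper: the PSR follows the common definition (peak minus sidelobe mean, divided by sidelobe standard deviation, with a small window around the peak excluded), while the linear ramp between `psr_low` and `psr_high`, the base rate, and all function names are hypothetical choices.

```python
import numpy as np


def peak_to_sidelobe_ratio(response, exclude: int = 5) -> float:
    """PSR of a 2-D correlation response map.

    The sidelobe is everything outside a (2*exclude+1)^2 window centred on
    the peak. Assumes the map is larger than that window.
    """
    r = np.asarray(response, dtype=float)
    py, px = np.unravel_index(np.argmax(r), r.shape)
    mask = np.ones(r.shape, dtype=bool)
    mask[max(py - exclude, 0):py + exclude + 1,
         max(px - exclude, 0):px + exclude + 1] = False
    sidelobe = r[mask]
    return float((r[py, px] - sidelobe.mean()) / (sidelobe.std() + 1e-12))


def adaptive_learning_rate(psr: float, base_rate: float = 0.025,
                           psr_low: float = 5.0,
                           psr_high: float = 20.0) -> float:
    """Scale the update rate by detection confidence (hypothetical ramp).

    Below psr_low the model is frozen; above psr_high the full base rate
    is used; in between the rate grows linearly.
    """
    weight = np.clip((psr - psr_low) / (psr_high - psr_low), 0.0, 1.0)
    return base_rate * float(weight)


def update_model(model, new_model, rate: float):
    """Standard linear interpolation between old and new filter models."""
    return (1.0 - rate) * np.asarray(model) + rate * np.asarray(new_model)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A sharp, isolated peak: confident detection, so the model is updated.
    sharp = 0.01 * rng.standard_normal((50, 50))
    sharp[25, 25] = 1.0
    # Unstructured noise, as might occur under heavy occlusion.
    noisy = rng.standard_normal((50, 50))
    for name, resp in [("sharp", sharp), ("noisy", noisy)]:
        psr = peak_to_sidelobe_ratio(resp)
        print(name, round(psr, 2), adaptive_learning_rate(psr))
```

A low PSR, as is typical when the target is occluded, drives the rate toward zero, so the filter is not contaminated by background or occluder appearance; a sharp peak restores the normal update.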

Qualitative Evaluation on VOT Benchmark
In this section, we focus on the tracking results of objects undergoing severe occlusion, scale change, and camera motion on VOT2016. The compared trackers included CCOT, ECO, Staple, SiamFC, and the baseline SRDCF. The tracking results are shown in Figure 9. The proposed tracker outperformed the other trackers in terms of occlusion, scale change, and camera motion, as illustrated in Section 4.5. In the Tiger sequence, the target is occluded frequently throughout the sequence. The trackers based on deep ConvNets performed well in this sequence, since the deep layers retain rich semantic information. In the Bolt1 and Dinosaur sequences, the target experiences scale change. Compared with the other trackers, the proposed tracker performed well, owing to the long-term mechanism of the offline Siamese tracker. In the Racing sequence, the camera moves throughout the sequence. Nearly all the trackers track the target successfully, but the proposed tracker achieved the most accurate tracking, as can be seen in Figure 9d.


Conclusions
In this paper, we proposed a visual tracking framework that combines deep ConvNet features, adaptive model updates, and an offline Siamese tracker. The proposed tracker outperformed other state-of-the-art methods on sequences with complex attributes. The adaptive dimensionality reduction provides low-dimensional features for the correlation filter, which reduces computational complexity. The adaptive model update improves tracking performance under occlusion. The offline Siamese tracker enables long-term tracking. Experimental results on the OTB and VOT benchmarks demonstrated that the proposed tracker outperforms state-of-the-art trackers, highlighting the benefits of our method.