A Robust Visual Tracking Algorithm Based on Spatial-Temporal Context Hierarchical Response Fusion

Abstract: Discriminative correlation filters (DCFs) have been shown to perform very well in visual object tracking. However, visual tracking remains challenging when the target undergoes complex scenarios such as occlusion, deformation, scale changes, and illumination changes. In this paper, we utilize the hierarchical features of convolutional neural networks (CNNs) and learn a spatial-temporal context correlation filter on each convolutional layer. The translation is then estimated by fusing the response scores of the filters on the three convolutional layers. For scale estimation, we learn a discriminative correlation filter that estimates the scale from the best confidence result. Furthermore, we propose a re-detection activation discrimination method to improve the robustness of visual tracking in the case of tracking failure, and an adaptive model update method to reduce tracking drift caused by noisy updates. We evaluate the proposed tracker with DCFs and deep features on the OTB benchmark datasets. The tracking results demonstrate that the proposed algorithm is superior to several state-of-the-art DCF methods in terms of accuracy and robustness.


Introduction
Visual object tracking is a basic task in computer vision, with a wide range of applications such as autonomous driving, robotics, video surveillance, and human-machine interaction [1,2]. Although the target is given in the initial frame, locating it reliably in subsequent frames remains a difficult problem. A tracking method must cope with various challenges, including background clutter, illumination changes, scale variation, motion blur, and partial occlusion.
In recent years, discriminative correlation filter (DCF) based tracking methods have shown prominent results on object tracking benchmarks [3][4][5][6]. Discriminative methods treat tracking as a binary classification problem. During tracking, a binary classifier is learned online to distinguish the target from its surrounding background, and the learned classifier is then used to classify the image blocks in the current frame, that is, to mark whether each pixel belongs to the target or to the background. The main goal is to find the region with the highest classifier confidence, which is taken as the target location, and to use the tracking result as a sample for updating the classifier. This approach is also called tracking-by-detection. Our work follows the DCF tracking methods built on the tracking-by-detection framework.
Moreover, deep convolutional neural networks (CNNs) have shown high performance in many tasks. Activations from the last convolutional layers have been successfully employed for image classification [7][8][9]. Features from these deep convolutional layers are effective at preserving spatial and structural information of the object. Ma et al. [10] proposed using hierarchical convolutional features from VGGNet [8] for visual tracking, where the main task is to extract and combine the features of each convolutional layer. On the one hand, shallow features can locate the target accurately, but they do not capture semantic information well. On the other hand, deep features capture semantic information well, but they cannot describe the fine spatial details needed to locate the target. Semantic information is particularly valuable for tracking after the appearance of the target changes. Therefore, the deep features of CNNs play an increasingly important role in visual object tracking. In this paper, we follow the above-mentioned methods and extract hierarchical convolutional features as the feature representation.
We learn a spatial-temporal context correlation filter on each convolutional layer and fuse the resulting correlation response scores to estimate the location of the target. Zhang et al. [11] modeled the spatial-temporal relationship between the target and its surrounding context regions, indicating that context information about the target and its surrounding background can effectively improve tracking results. Ma et al. [12] showed that exploiting the correlation between spatial-temporal contexts can improve tracking accuracy and robustness. It is therefore necessary to establish the spatial-temporal relationship between the target and its surroundings, and we employ a context-aware framework [13] based on the discriminative correlation filter as our spatial-temporal context model. On this basis, we obtain a powerful filter that produces a high response for the target image block and a near-zero response for the context region. To estimate scale changes adaptively, we use HOG features as the feature representation to train a discriminative correlation filter and estimate the object scale from the best confidence score. HOG [14] is a feature descriptor used for object detection in computer vision and image processing. It has a certain degree of translation, rotation, and illumination invariance, which helps it adapt to target deformation, scale changes, and occlusion.
It is very important to design an effective re-detection method to improve the robustness of visual tracking in the case of tracking failure. In this work, we employ EdgeBox [15] for online object re-detection. Using a predefined threshold as the activation condition for re-detection did not perform well on all video sequences. To this end, we propose a self-adaptive activation method to trigger the re-detection component: we compare the peak of the response map generated by the DCFs with its corresponding peak-to-sidelobe ratio (PSR) [5] score, so that the detector is activated when the condition is satisfied (Section 3.3.1). For model updating, most existing tracking algorithms update the tracking model at a fixed interval or frame-by-frame [6,12,16,17,18]. These approaches have obvious disadvantages. If the object undergoes complex appearance changes, for instance occlusion or disappearance in the current frame, the update introduces erroneous background information into the model. This wrong information is passed on to subsequent frames and, after accumulating for a long time, degrades tracking performance. The end result is tracking drift. Hence, we propose an effective model update method similar to the re-detection method: we again compare the response map peak with its corresponding PSR score. In this way, the tracking models are updated only when the result is reliable, which improves tracking robustness and effectively avoids the tracking drift caused by noisy updates (Section 3.4).
The main contributions of this work are as follows: (1) The hierarchical features of CNNs are used as the feature representation to handle large appearance variations, and we learn a spatial-temporal context correlation filter on each CNN layer as a discriminative classifier. We fuse the multi-level correlation response maps to infer the target location. For scale estimation, we train a DCF on a scale pyramid representation and estimate the object scale from the best confidence score. (2) We employ EdgeBox to re-detect the target when tracking failure occurs and propose a novel re-detection activation method. For model updating, we propose a novel update method to address noisy model updates. (3) We extensively validate our method on benchmark datasets with large-scale sequences, and the experimental results demonstrate that the proposed tracking algorithm is superior to state-of-the-art methods in terms of accuracy and robustness.

Related Works
In discriminative correlation filter based trackers, the filter is trained to predict the optimal response map by minimizing a least-squares loss over all circular shifts of a training sample. Since the costly convolution operations can be converted into simple element-wise multiplications in the Fourier domain, DCFs are computationally efficient. Bolme et al. [5] first proposed a tracker using the minimum output sum of squared error (MOSSE) filter, which uses a grayscale template. Henriques et al. [19] replaced the grayscale template with HOG [14] features and extended the filter to multiple feature channels, which further improved tracking accuracy and robustness. Danelljan et al. [16] learned separate filters for translation and scale, which locate the target and estimate its scale, respectively. Zhang et al. [20] incorporated context information into filter learning. Bertinetto et al. [17] proposed the STAPLE tracker, which combines a DCF with a color histogram based model. Danelljan et al. [21] introduced a spatial regularization component into the learning to penalize correlation filter coefficients depending on their spatial location, which effectively enhanced tracking robustness. Wang et al. [22] proposed a large margin visual tracking method with circulant feature maps, which employed a multi-modal detection technique to avoid tracking drift. C-COT [23] and ECO [24] adopted an implicit interpolation model to pose the learning problem in the continuous spatial domain, which enhanced tracking accuracy. Alam et al. [25] introduced a new metric called the peak-to-clutter mean (PCM), which provides sharp and high correlation peaks corresponding to targets and improves detection efficiency. Sidike et al. [26] introduced class-associative spectral fringe-adjusted joint transform correlation (CSFJTC), which builds on joint transform correlation (JTC) and employs class-associative filtering, modified Fourier plane image subtraction, and fringe-adjusted JTC techniques for object detection, with outstanding detection performance. To significantly reduce the time needed for online training on the object, Krieger et al. [27] proposed the Progressively Expanded Neural Network (PENNet) tracker, which employs a modified variant of the extreme learning machine. To overcome challenges such as distortions of object structural information and background variations, Krieger et al. [28] proposed the Directional Ringlet Intensity Feature Transform (DRIFT) method, which uses Kirsch kernel filtering for edge features and a ringlet feature mapping for rotational invariance. This method obtains accurate object boundaries and lowers computation time. Zhang et al. [29] introduced a spatial alignment module, which provides continuous feedback for transforming the target from the border to the center with a normalized aspect ratio and can thus handle undesired boundary effects. Song et al. [30] used adversarial learning to retain the most robust features of the target object and proposed a high-order cost-sensitive loss to decrease the effect of easy samples.
Visual representations are important for visual tracking. Most recent tracking algorithms employ deep features extracted from convolutional neural networks (CNNs) as the feature representation. Danelljan et al. [31] used deep features learned by a CNN as the representation within the DCF framework, which improved tracking performance. Fan et al. [32] learned a feature extractor with convolutional neural networks from an offline training set for visual tracking. To handle long training times and the large number of training samples required, DeepTrack [33] employed a single CNN to learn effective feature representations of the target object in a purely online manner. CNT [34] proposed a convolutional network tracking framework without pre-training and used a simple two-layer convolutional network to learn robust representations for visual tracking, which improved tracking accuracy. Song et al. [35] integrated feature extraction, response map generation, and model updating into a convolutional neural network for end-to-end training, which effectively improved tracking robustness. Zhu et al. [36] proposed a joint convolutional tracker, which treats both feature extraction and tracking as convolution operations. Yao et al. [37] proposed a representation learning and truncated inference model by modeling the representor as a CNN, and achieved competitive accuracy. Researchers have thus recognized the advantages of deep networks. Some existing tracking methods [10,23,24,35,38] use fewer convolutional layers to extract target features and effectively improve tracking robustness. Therefore, we make full use of the hierarchical features of the convolutional layers as the feature representation.

The Overall Flowchart of The Proposed Algorithm
The proposed framework for the visual tracking algorithm with spatial-temporal context hierarchical response fusion is shown in Figure 1. As shown in Figure 1, we decompose the tracking task into translation estimation and scale estimation. We first extract the hierarchical features of the convolutional layers and learn a spatial-temporal context correlation filter on each layer. Then, we fuse the correlation response maps of the translation filters to infer the target position, and we predict the scale change using the scale filter. The re-detection module is built on EdgeBox. When the PSR value is lower than the corresponding response map peak value R_max, we activate the re-detection module. When the PSR value is greater than the corresponding response map peak value R_max, we update the tracking model.


Hierarchical Feature of Convolution Layer
The hierarchical features of deep neural networks play an important role in visual object tracking and can enhance its robustness and accuracy. Existing tracking methods based on deep learning [33,34,39] find that the deep features of the convolutional layers encode the semantic information of targets, which is invariant to appearance changes. However, as the convolutional layers get deeper, the spatial resolution of the feature maps becomes lower and it is more difficult to estimate the location of the target. In contrast, features from shallow convolutional layers capture more fine-grained spatial details and locate the target accurately, but they are not robust to appearance changes. Our aim is to make full use of the semantic information in the deep convolutional layers to handle large appearance changes and to use the features of the shallow layers to locate the target precisely and prevent tracking drift.

We use the convolutional features of a CNN (VGGNet [8]) to encode target appearance. For visual tracking, the goal is to locate the target as accurately as possible. Semantic information remains effective after the appearance of the object changes and can be used to estimate the approximate position of the tracked object, after which the target can be positioned more accurately with finer features. In our work, the feature extractor is the pre-trained VGGNet-16 [8], and we select the third, fourth, and fifth convolutional layers to represent target objects (Figure 1). The features of the conv5 layer handle severe appearance changes but cannot locate the target accurately because of their low resolution. In contrast, the features of the conv4 and conv3 layers capture more spatial details and help locate the object more precisely. This property has also been exploited for image segmentation and fine-grained localization with CNNs [40]. Making good use of the hierarchical features of the convolutional layers is therefore very helpful for our experiments.

Spatial-Temporal Context Correlation Filter
Traditional discriminative correlation filters use cosine windows to handle the boundary effects that arise from the circulant assumption, which limits the contextual information available to DCF based trackers. This easily results in tracking drift when the target experiences complex scene changes such as occlusion, background clutter, and fast motion. In order to learn a powerful filter that produces a high response for the target image block and a near-zero response for the context region, the CACF framework [13] incorporates the background information around the target into the learned filter. In our work, we use the hierarchical features of the convolutional layers as the feature representation and employ the CACF framework [13] as our spatial-temporal context model for visual tracking.
In this framework, we aim to obtain an ideal correlation filter w. All training samples U_0 are generated by circular shifts using a sliding window on the three convolutional layers, and exploiting the properties of circulant matrices [6] makes the ridge regression problem in Equation (1) efficient to solve.
In Equation (1), the data matrix U_0 contains all circular shifts of the vectorized image patch u_0, and w denotes the learned correlation filter. The regression target y is a vectorized image of a 2D Gaussian, and λ_1 is the regularization weight.
We convolve the learned correlation filter w with the image block in the next frame in order to predict the position of the target. The location of the maximum of the response vector y_p(z, w) is the estimated position of the target. Given an image block z, the output response is given by Equation (2), in which F^{-1} denotes the inverse Fourier transform and the filter is applied to z by a convolution operation. The filter model is then updated by Equation (3), in which the subscript i is the index of the current frame, η is the learning rate, and x_i is the target appearance model.
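The training, detection, and update steps described above can be illustrated with a minimal single-channel NumPy sketch. It is not the authors' implementation: it assumes the standard closed-form context-aware solution from the CACF literature [13], in which context patches enter the denominator with a second weight (called `lam2` here, a name and value not given in this text), and it models the update of Equation (3) as plain linear interpolation with learning rate η. All function names are hypothetical.

```python
import numpy as np

def gaussian_target(shape, sigma=2.0):
    """2D Gaussian regression target y with its peak moved to index (0, 0)."""
    h, w = shape
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    g = np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2.0 * sigma ** 2))
    return np.fft.ifftshift(g)

def train_context_filter(target, contexts, y, lam1=1e-4, lam2=0.5):
    """Closed-form ridge regression in the Fourier domain.

    Assumed form (context-aware correlation filter):
        w_hat = conj(a0) * y_hat / (conj(a0) * a0 + lam1 + lam2 * sum_i conj(ai) * ai)
    where a0 is the target patch and ai are the context patches.
    """
    a0 = np.fft.fft2(target)
    denom = np.conj(a0) * a0 + lam1
    for patch in contexts:
        ai = np.fft.fft2(patch)
        denom = denom + lam2 * np.conj(ai) * ai
    return np.conj(a0) * np.fft.fft2(y) / denom

def respond(w_hat, patch):
    """Response map for a new patch z: inverse FFT of the element-wise product."""
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * w_hat))

def update_model(previous, current, eta=0.025):
    """Linear interpolation of a model with learning rate eta (assumed form of the update)."""
    return (1.0 - eta) * previous + eta * current
```

Under these assumptions, applying the filter to a circularly shifted copy of the training patch yields a response whose peak sits at the shift, while a context patch produces a much weaker response.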

Multi-Response Maps Fusion
The context-aware filter integrates the background context around the target into filter learning and, through training, yields a correlation filter with high discriminative power. Its advantage is that it effectively uses the context information in the area surrounding the target, which makes tracking more robust in complex scenes such as occlusion, background clutter, and fast motion.
In order to improve tracking robustness and take full advantage of the hierarchical features and the filter on each layer, we use context-aware correlation filters to compute response maps on the third, fourth, and fifth convolutional layers, denoted R_context3, R_context4, and R_context5, respectively. We then calculate the normalized weight of each layer's response map in frame t (Equation (4)), so that a response map that accounts for a larger proportion is assigned a higher weight. The original response weights are then updated with the weights of frame t (Equation (5)), where τ is the weight update parameter and context3_w_t, context4_w_t, and context5_w_t denote the original response weights in frame t. At frame t, the final response is obtained by fusing the three response maps R_context3, R_context4, and R_context5 (Equation (6)).
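The fusion step can be sketched as follows. Because the weighting formulas are not reproduced in this text, the sketch makes explicit assumptions: each layer's per-frame weight is its response peak divided by the sum of the three peaks, the running weights are blended with the per-frame weights using τ, and the fused map is the weighted sum of the layer responses. The function name and the value of `tau` are hypothetical.

```python
import numpy as np

def fuse_responses(responses, prev_weights, tau=0.5):
    """Fuse per-layer response maps with peak-normalized, smoothed weights.

    responses    : list of 2D arrays of identical shape (e.g. conv3, conv4, conv5)
    prev_weights : running weights from the previous frame, one per layer
    tau          : weight update parameter (value assumed for illustration)
    """
    peaks = np.array([r.max() for r in responses], dtype=float)
    # Assumption: peaks are positive, so the normalization is well defined.
    frame_weights = peaks / peaks.sum()
    weights = (1.0 - tau) * np.asarray(prev_weights, dtype=float) + tau * frame_weights
    fused = sum(w * r for w, r in zip(weights, responses))
    return fused, weights

# Usage: the layer with the strongest peak gains weight and dominates the fused map.
r3 = np.zeros((5, 5)); r3[2, 2] = 1.0
r4 = np.zeros((5, 5)); r4[2, 2] = 2.0
r5 = np.zeros((5, 5)); r5[2, 3] = 1.0
fused, weights = fuse_responses([r3, r4, r5], prev_weights=[1 / 3, 1 / 3, 1 / 3])
```

The estimated target position is then the location of the maximum of `fused`.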

The Scale Discriminative Correlation Filter
During tracking, the target passes through different scenes, and its appearance and scale change over time. Updating the target appearance and scale effectively and in a timely manner is therefore key to improving tracking performance. Danelljan et al. [16] proposed an accurate scale estimation method that estimates the target scale from the best score by training a scale discriminative correlation filter. Based on this, we learn a scale discriminative correlation filter to handle target scale changes and obtain an ideal scale correlation filter h from the cost function in Equation (7), where g represents the optimal correlation output, l indexes the feature dimension, and λ is a regularization coefficient. The solution in the frequency domain is given by Equation (8). For efficient computation, the numerator and denominator of H^l in Equation (8) are updated separately (Equations (9a) and (9b)), where η represents the learning rate. The response of the scale filter is then calculated by Equation (10). We estimate the target scale from the maximum scale response and use Equations (9a) and (9b) to update the scale discriminative filter model.
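To make the scale search concrete, the sketch below builds the candidate scale factors and picks the one with the highest scale response. It assumes the scale pyramid convention of the method in [16], in which the candidates are a^n for integer n centered on zero, and uses the settings reported in the implementation details (S = 33 scales, scale step 1.02). The function names and the example responses are illustrative only.

```python
def scale_factors(num_scales=33, step=1.02):
    """Candidate scale factors a**n for n = -(S-1)/2, ..., (S-1)/2."""
    half = (num_scales - 1) // 2
    return [step ** n for n in range(-half, half + 1)]

def best_scale(current_scale, scale_responses, num_scales=33, step=1.02):
    """Return the new target scale given one filter response per candidate scale."""
    factors = scale_factors(num_scales, step)
    if len(scale_responses) != len(factors):
        raise ValueError("expected one response per candidate scale")
    best_index = max(range(len(factors)), key=lambda i: scale_responses[i])
    return current_scale * factors[best_index]

# Usage: a response peak one step above the middle index enlarges the target by 2%.
responses = [0.0] * 33
responses[17] = 1.0
new_scale = best_scale(1.0, responses)
```

With these settings the search covers relative scales from roughly 0.73 to 1.37 around the current target size.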

Target Recovery
A re-detection module is essential for robust visual tracking. To cope with tracking failures and continue tracking in subsequent frames, we employ EdgeBox [15] as our detector and use it to generate candidate regions: the detection candidates C_d are generated across the entire image with a large step size. In addition, we need to calculate the confidence value of each candidate c, and we use a traditional learning rate to learn the spatial-temporal context correlation filters in order to maintain a long-term memory of the target appearance. Given a candidate c, we denote the maximum response value of the correlation filter by g(c).
We treat the response map peak as a dynamic threshold and compare it with the corresponding PSR score (Section 3.3.1). Re-detection is carried out only when the PSR score is lower than the corresponding response map peak R_max, which indicates that tracking failure has occurred. We then produce a set of candidate regions across the whole frame for recovering the target, and choose the re-detection result by minimizing the objective in Equation (11). In Equation (11), the weight factor α balances the confidence of the candidate regions against motion smoothness, and D denotes the distance between the center of each candidate c_t^i and the center of the bounding box c_{t−1}.
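The candidate selection can be illustrated with a small sketch. Since Equation (11) is not reproduced here, the cost below is an assumed form that matches the description: it rewards a high filter confidence g(c) and penalizes the center distance D to the previous bounding box, weighted by α. The function name, the tuple layout, and the value of `alpha` are hypothetical.

```python
import math

def select_candidate(candidates, previous_center, alpha=0.1):
    """Pick the candidate that minimizes  -g(c) + alpha * D(c, c_prev).

    candidates      : list of (center_x, center_y, confidence) tuples
    previous_center : (x, y) center of the bounding box in frame t-1
    alpha           : trade-off between confidence and motion smoothness (assumed value)
    """
    px, py = previous_center

    def cost(candidate):
        cx, cy, confidence = candidate
        distance = math.hypot(cx - px, cy - py)
        return -confidence + alpha * distance

    return min(candidates, key=cost)

# Usage: a slightly more confident but distant proposal loses to a nearby one.
proposals = [(10.0, 10.0, 0.80), (100.0, 100.0, 0.85)]
chosen = select_candidate(proposals, previous_center=(12.0, 11.0))
```

A larger `alpha` favors smooth motion, while a smaller one lets the filter confidence dominate.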

Detector Activation
When tracking failure occurs, the tracking result is far more robust if the detector can be triggered in time to re-detect the target. In this paper, we adopt the peak-to-sidelobe ratio (PSR) from [5] to evaluate tracking quality. The tracking quality is evaluated according to the strength of the response map peak: the higher the PSR score, the better the tracking quality. The PSR is defined as follows:
$$\mathrm{PSR}_t = \frac{R_{\max} - \mu_{s1}}{\sigma_{s1}}$$
where R_max is the peak value of the response map R_t. The subscript s1 denotes the sidelobe region around the peak, which is 15% of the response map area in this paper, and µ_s1 and σ_s1 are the mean and standard deviation of the sidelobe region. Figure 2 illustrates the distribution of the response map generated by the DCFs and its corresponding PSR score. Tracking clearly performs well when the PSR score is much larger than the response peak score (R_max = 7.126), as shown in Figure 2 (point A and point D), and the tracking results in these two frames are regarded as highly reliable. Therefore, there is no need to activate the detector in this case. However, when the target undergoes significant appearance changes, such as occlusion and deformation, the PSR score is lower than the response peak score (from point B to point C). As can be seen in Figure 3, the target is occluded from frame #70 to frame #79 in Jogging-1. In order to ensure accurate tracking of the target in subsequent frames, we activate the detector in time to re-detect the target and avoid tracking drift. In this work, we treat the response map peak as a dynamic threshold and compare it with its corresponding PSR score. The detector is activated online for re-detection only when the PSR score is lower than the corresponding response map peak R_max, which indicates that tracking failure has occurred. Otherwise, the detector is not activated.
In MOSSE [5], a PSR below 10 indicates that occlusion has appeared or that tracking has failed. Therefore, we first set a fixed threshold of 10 and kept the other parameters unchanged. We selected multiple video sequences from the OTB dataset for testing. The results are shown in Table 1. Using the response map peak of each sequence as a dynamic threshold performs better than using a fixed threshold in most video sequences.
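The PSR computation and the dynamic-threshold decision can be sketched in NumPy as follows. The exact shape of the sidelobe region is not specified beyond covering 15% of the response map, so this sketch assumes a rectangular window centered on the peak whose area is about 15% of the map, with the peak pixel itself excluded. The function names and the three action labels are hypothetical.

```python
import numpy as np

def psr(response, sidelobe_fraction=0.15):
    """Return (PSR, R_max) for a 2D response map.

    Assumption: the sidelobe region is a window around the peak covering roughly
    `sidelobe_fraction` of the map area, excluding the peak pixel.
    """
    r = np.asarray(response, dtype=float)
    h, w = r.shape
    py, px = np.unravel_index(np.argmax(r), r.shape)
    r_max = r[py, px]
    side = np.sqrt(sidelobe_fraction)          # window side relative to the map side
    half_h = max(1, int(round(h * side / 2)))
    half_w = max(1, int(round(w * side / 2)))
    y0, y1 = max(0, py - half_h), min(h, py + half_h + 1)
    x0, x1 = max(0, px - half_w), min(w, px + half_w + 1)
    window = r[y0:y1, x0:x1]
    mask = np.ones(window.shape, dtype=bool)
    mask[py - y0, px - x0] = False             # drop the peak itself
    sidelobe = window[mask]
    return (r_max - sidelobe.mean()) / (sidelobe.std() + 1e-12), r_max

def decide(psr_value, r_max):
    """Dynamic-threshold rule: compare the PSR with the response peak."""
    if psr_value < r_max:
        return "redetect"   # tracking is considered unreliable
    if psr_value > r_max:
        return "update"     # tracking is considered reliable
    return "hold"           # equal values: neither re-detect nor update
```

The same comparison drives both decisions in the tracker: a PSR below the peak triggers re-detection, and a PSR above the peak allows the model update.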

Model Update
Many tracking methods update their models frame by frame or at a fixed interval. However, these strategies have a drawback. When the target object goes through complicated changes such as occlusion, scale change, and deformation, the tracking model may absorb incorrect information. This information is passed on to subsequent frames and, as it accumulates, degrades tracking performance. The end result is tracking drift.
To address this problem, we propose a novel update method that works in an active mode. The response map peak is taken as a dynamic threshold. Since the response map peak score differs from frame to frame, we compare the PSR score with the corresponding response map peak score in each frame. The tracking result in the current frame is considered reliable only when the PSR score is greater than the maximum response peak score R max. In that case, the translation discriminative filter model (see Equation (2)) and the scale discriminative filter model (see Equation (8)) are updated online with the learning rate parameter η (see Equations (3) and (9)). Otherwise, the proposed approach does not update the tracking model in the current frame. This effectively prevents erroneous information from being passed to the next frame and causing tracking drift.
Figure 2 illustrates the distribution of the PSR scores. Figure 3 shows that the target object goes through challenges such as occlusion and background clutter from frame #61 to frame #79 (point A to point C), and the PSR score decreases markedly (from 8.445 to 3.427). It is not suitable to update the model when the PSR score is lower than the corresponding peak score. When the target object leaves the occluded area at frame #83 (point D), the PSR score increases markedly (from 3.427 to 8.694). The greater the peak scores are, the more robust the tracking performance is. In this case the update condition is met, because the PSR score is greater than the corresponding peak score; the tracking result in the current frame is considered highly reliable and the model should be updated. At point E (Figure 3), the PSR score and the corresponding peak score have similar values, so it is also not suitable to update the model.
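The update rule can be sketched as a gated linear interpolation. Equations (3) and (9) are not reproduced in this section, so the interpolation form below is an assumption based on the running-average update commonly used in DCF trackers; `update_model` and its flat-list model representation are hypothetical and chosen only for illustration.

```python
def update_model(model, new_model, psr, peak, eta=0.025):
    """Blend new_model into model only when the frame is judged reliable.

    model and new_model are equal-length lists of filter coefficients
    (a simplification); eta is the learning rate from the text.
    """
    if psr <= peak:
        # Unreliable frame: keep the previous model unchanged.
        return list(model)
    # Reliable frame: assumed running-average update with learning rate eta.
    return [(1.0 - eta) * m + eta * n for m, n in zip(model, new_model)]
```

With η = 0.025, a reliable frame moves each coefficient 2.5% of the way toward its newly estimated value, while an unreliable frame leaves the model untouched.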

Algorithm Flowchart
The proposed tracking algorithm is shown in Algorithm 1.
Repeat
    Crop out the image samples centered at P(x t, y t) and extract convolutional features and HOG features;
    For each layer l, compute the response map R context via Equation (2);
    Estimate the location of the target by computing the maximum of the fused response map R t via Equation (6);
    Construct a target scale pyramid around P(x t, y t) and estimate the optimal scale of the target as in Equation (10);
    Calculate the PSR score of the response map peak;
    If PSR t < R max, then
        Activate the re-detecting component D and find the possible candidate states C;
        For each state C d in C, compute the response score R context via Equation (2);
    End if
    If PSR t > R max, then
        Update the translation filter model and the scale filter model via Equations (3) and (9);
    End if
Until end of video sequence.
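The location step of Algorithm 1 can be illustrated with a short sketch. Equation (6) is not reproduced in this section, so the weighted sum used below is an assumption about how the per-layer response maps are fused; the function name and the `weights` argument are hypothetical.

```python
def fuse_and_locate(layer_responses, weights):
    """Return (row, col, peak) of the weighted sum of same-sized response maps.

    layer_responses: list of 2-D maps (lists of rows), one per layer.
    weights: one hypothetical fusion weight per layer.
    """
    rows, cols = len(layer_responses[0]), len(layer_responses[0][0])
    # Assumed fusion: element-wise weighted sum over the layers.
    fused = [
        [sum(w * resp[r][c] for w, resp in zip(weights, layer_responses))
         for c in range(cols)]
        for r in range(rows)
    ]
    # The estimated target position is the arg-max of the fused map.
    peak, r_best, c_best = max(
        (fused[r][c], r, c) for r in range(rows) for c in range(cols)
    )
    return r_best, c_best, peak
```

The returned peak value plays the role of R max in the two PSR comparisons that follow in the algorithm.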

Implementation Details
The proposed tracker is implemented in MATLAB 2014a on a PC with a 3.2 GHz i7 CPU and 16 GB of memory. The feature extractor is a VGGNet-16 pre-trained on the ImageNet dataset. We also use HOG features with a cell size of 4 × 4. The search window is set to 2.2 times the size of the target object. The spatial bandwidth is set to 1/10. The learning rate parameter η in Equations (3) and (9) is set to 0.025. The number of scales S is set to 33, with a scale step of 1.02. The initial PSR value, PSR 0, is set to 1.8. We use the same parameters for every video sequence.
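To make the scale settings concrete, the sketch below enumerates the scale factors implied by S = 33 and a scale step of 1.02, together with the padded search window. It assumes the DSST-style convention of symmetric integer exponents around the current scale; Equation (10) is not reproduced in this section, and the function names are illustrative.

```python
def scale_factors(num_scales=33, scale_step=1.02):
    """Return scale_step**n for n = -(S-1)/2, ..., (S-1)/2 (assumed layout)."""
    half = (num_scales - 1) // 2
    return [scale_step ** n for n in range(-half, half + 1)]


def search_window(target_w, target_h, padding=2.2):
    """Return the search window size as padding times the target size."""
    return target_w * padding, target_h * padding
```

With these settings the factors range from 1.02^-16 to 1.02^16, that is, roughly 0.73 to 1.37 times the current target size.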
The position at which accuracy degrades differs between video sequences (Table 2). We evaluate the proposed tracker against 13 state-of-the-art trackers, namely C-COT_HOG [23], SRDCF [21], CFNet_conv3 [41], SiamFC_3s [42], STAPLE_CA [13], Staple [17], LMCF [22], SAMF_AT [43], SAMF_CA [13], LCT [12], DSST [16], KCF [6], and CNN-based trackers, including CNN-SVM [44], on a large standard benchmark dataset [3,4] that contains 100 videos. We use three metrics provided in [3,4] to evaluate 11 trackers on OTB-100, and present the tracking results with two indicators: distance precision (DP) and overlap success precision (OP). The evaluation follows the one-pass evaluation (OPE) protocol provided by [3,4]. As shown in Figure 4, we present DP and OP plots and rank the trackers by the area under the curve (AUC) of the success plots.
Figure 4 shows the DP and OP plots of 12 trackers on the OTB-100 benchmark dataset. The proposed tracker performs favorably against state-of-the-art trackers in both DP and OP. DP is computed as the fraction of frames in a sequence in which the center location error is smaller than a given threshold; we report DP at a threshold of 20 pixels. OP is defined as the percentage of frames in which the bounding box overlap is larger than a given threshold, which is generally set to 0.5. Table 3 lists the results of the proposed tracker and the state-of-the-art trackers. The proposed tracker achieves a DP of 87.6% and an OP of 62.7%, surpassing the other trackers on both measures. These comparisons demonstrate the effectiveness of the proposed method. The best results, in the last column (Proposed), are shown in bold.
The proposed tracker mainly employs the hierarchical features of CNNs for feature representation and runs at 6 frames per second; the main computational cost is the extraction of deep convolutional features. In future work, we will aim to improve tracking speed while maintaining robustness.
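The two indicators can be computed as in the following minimal sketch. It assumes bounding boxes in (x, y, width, height) format and uses strict inequalities, following the wording above; the function names are illustrative and do not come from the OTB toolkit.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return math.hypot(ax - bx, ay - by)


def overlap(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(box_a[0] + box_a[2], box_b[0] + box_b[2]) - max(box_a[0], box_b[0])
    ih = min(box_a[1] + box_a[3], box_b[1] + box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def distance_precision(pred, truth, threshold=20.0):
    """Fraction of frames whose center location error is below threshold."""
    hits = sum(center_error(p, g) < threshold for p, g in zip(pred, truth))
    return hits / len(truth)


def overlap_precision(pred, truth, threshold=0.5):
    """Fraction of frames whose bounding box overlap exceeds threshold."""
    hits = sum(overlap(p, g) > threshold for p, g in zip(pred, truth))
    return hits / len(truth)
```

Sweeping `threshold` over a range of values and plotting the resulting fractions yields the precision and success curves of the kind shown in Figure 4.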

The Attribute-Based Tracking Evaluation
We also perform an attribute-based analysis of our approach. In the OTB dataset, all videos are annotated with 11 different attributes: illumination variation, motion blur, fast motion, in-plane rotation, scale variation, background clutter, deformation, out of view, out-of-plane rotation, occlusion, and low resolution. Due to limited space, we show results for only 5 representative attributes in Figure 5.
Figure 5 compares the proposed tracker with 13 state-of-the-art trackers, using DP and OP plots for five challenging attributes. The proposed tracker obtains superior DP and OP on background clutter (83.9%, 59.7%), occlusion (89.1%, 63.8%), in-plane rotation (86.2%, 61.0%), illumination variation (83.7%, 60.8%), and scale variation (86.6%, 60.2%). In sequences annotated with the background clutter, fast motion, and in-plane rotation attributes, our approach outperforms the compared trackers. Benefiting from the re-detection method and the adaptive model update method, the proposed tracker can adaptively decide when to activate re-detection and when to update the tracking model, which helps in the case of tracking failure. This shows that our tracker is highly robust and achieves superior performance in different challenging scenarios.

Conclusions
In this paper, we use the hierarchical features of CNNs for visual tracking. We make full use of the characteristics of each convolutional layer to locate the target object. The semantic information of the deep convolutional layers is of great significance for handling changes in the appearance of the target. The spatial details of the shallow convolutional layers enable the target position to be located accurately. In addition, we train a spatial-temporal context filter on the convolutional layers and predict the target position by fusing the response values of the filters on the three convolutional layers. Moreover, we propose a re-detection method and a model update method that compare the PSR with the corresponding response value to determine whether to re-detect the target or update the model. The experimental results demonstrate that the proposed algorithm with deep and HOG features outperforms most state-of-the-art DCF-based algorithms.

Figure 6. A qualitative comparison of our approach with six trackers: C-COT_HOG [23], SRDCF [21], STAPLE_CA [13], LMCF [22], SAMF_CA [13], and CNN-SVM [44]. Tracking results are shown on five video sequences from the OTB dataset (from top to bottom: Bird2, Dragon Baby, Ironman, Soccer, and Box). Our approach performs favorably compared to the existing trackers in these challenging situations.

Figure 1. The overall framework of the proposed visual tracking algorithm.

Figure 4. The overall tracking performance in terms of precision plots (a) and success plots (b), comparing the proposed tracker with state-of-the-art trackers on OTB-100 using OPE evaluation. The proposed tracker performs well against these algorithms.

Table 1. Comparison of fixed threshold and dynamic threshold.

Table 2. Different scale-step values and their corresponding precision.

Table 3. Comparison with state-of-the-art trackers on the OTB dataset. The results are presented in terms of distance precision (DP) and overlap success precision (OP).
