Learning Multifeature Correlation Filter and Saliency Redetection for Long-Term Object Tracking

Recently, due to its good balance between performance and tracking speed, the discriminative correlation filter (DCF) has become a popular and excellent method in short-term tracking. The correlation of a response map can be computed efficiently in the Fourier domain via the discrete Fourier transform (DFT) of the input, where the DFT of an image exhibits symmetry in the Fourier domain. However, most correlation filter (CF)-based trackers cannot evaluate their own tracking results and lack an effective mechanism to correct tracking errors during the tracking process, and thus usually perform poorly in long-term tracking. In this paper, we propose a long-term tracking framework that includes a tracking-by-detection part and a redetection part. The tracking-by-detection part is built on a DCF framework and integrates a multifeature fusion model, which effectively improves the discriminative ability of the correlation filter in challenging situations such as occlusion and color change. The redetection part searches for the tracked object in a larger region and refines the tracking results after a tracking failure. Benefiting from the proposed redetection strategy, the tracking results are re-evaluated and, if necessary, refined in each frame. Moreover, the reliability estimation module in the redetection part can effectively identify whether the tracking result is correct and determine whether the redetector needs to be activated. The redetection part employs a saliency detection algorithm, which is fast and effective for object detection in a limited region. These two parts can be integrated into DCF-based tracking methods to improve long-term tracking performance and robustness. Extensive experiments on the OTB2015 and VOT2016 benchmarks show that our proposed long-term tracking method achieves favorable effectiveness and high efficiency compared with various tracking methods.


Introduction
Visual object tracking is a fundamental task in computer vision and machine learning, aiming to localize the tracked object in the remaining frames of an image sequence given the object's location and scale in the first frame [1,2]. Although significant progress has been made in the last few decades [3][4][5][6][7], many challenges remain in the object tracking field, especially in long-term tracking, where the tracked object may suffer from challenging situations such as heavy occlusion, disappearance and reappearance, deformation, and color change. Object tracking is widely used in many applications, such as unmanned aerial vehicles (UAVs), self-driving cars, video monitoring systems, telemedicine, autonomous landing, military combat systems, and so on.
Tracking-by-detection methods [8,9] have gained much popularity and achieved great success in visual object tracking in recent years. These methods usually identify the object through a detector and update the detector online to keep up with its appearance changes. The main contributions of this work are summarized as follows:

• The proposed tracking-by-detection part integrates multiple features in the correlation filter, which is equipped with color and HOG features for tracking. As a crucial part of our tracker, the redetection part contains a reliability estimation module and a redetector module.

• The estimation module determines whether it is necessary to replace the previous tracking result and whether to start the redetection process. Considering tracking speed and performance, we employ a saliency detector for redetecting the tracked object, which is fast and effective for object detection in a limited region. This redetection module can locate the object after it reappears in the image.

• Our MSLT method is evaluated by extensive experiments that compare it with several state-of-the-art methods on two benchmarks, OTB2015 [20] and VOT2016 [21]. Both qualitative and quantitative experiments demonstrate the favorable effectiveness of our tracker.

Related Work
In recent years, many trackers have been proposed and achieved great success in visual object tracking field. Here we review three categories of trackers that are relevant to our work.

Correlation-Filter-Based Tracking
In the visual tracking field, DCF has gained much popularity and achieved impressive performance. In the DCF framework, correlation filters are trained by minimizing a least-squares loss over all circularly shifted samples, and the objective function is transformed into the Fourier domain to reduce the heavy computation. The first correlation filter framework was proposed by Bolme et al. [22], who used gray features to train a minimum output sum of squared error (MOSSE) filter at high speed. Henriques et al. [10] exploited the circulant structure of training patches and proposed the kernelized correlation filter (KCF) tracker by combining multidimensional features with kernels. Some trackers adapt to changes of object scale by using multiscale correlation filters, such as DSST [23] and SAMF [24]. The SRDCF [25] tracker addressed the boundary-effect problem by introducing a spatial regularization term that penalizes the correlation filter coefficients, enabling the filter to be learned on larger image regions and leading to a more discriminative appearance model. Similarly, the BACF tracker [26] exploited real background patches together with the target patch and used an online adaptation strategy to update the tracker model to alleviate boundary effects. Recently, with the development of convolutional neural networks (CNNs) in object detection and classification, some trackers have used CNN features pretrained on large object recognition datasets to replace or complement handcrafted features, such as C-COT [27], HCF [28], and ECO [29]. Finally, the CFNet [30] tracker proposed an end-to-end framework that interprets the correlation filter learner as a differentiable layer in a deep neural network.

Tracking-by-Detection
Tracking-by-detection methods treat tracking as a classification problem by learning a discriminative model, such as a support vector machine (SVM) or a particle-filter-based model. The TLD method [31] consisted of three tasks, tracking, learning, and detection, in which the tracking and detection tasks ran simultaneously. Inspired by TLD, many related trackers have been proposed. LMCF [32] employed a structured output SVM in a CF framework, combining the two kinds of algorithms. The MEEM method [33] collected snapshots and picked the best prediction from an SVM framework. The Struck tracker learned a structured output to update the detector. Lu et al. [34] proposed a robust object tracking algorithm using a collaborative model that exploits both holistic templates and local representations. Zhang et al. [35] proposed a novel circulant sparse tracker (CST), which exploits circulant target templates. These tracking-by-detection methods focus on short-term tracking and perform poorly in challenging situations.

Long-Term Tracking
Long-term tracking focuses on solving challenging situations, such as object disappearing and reappearing, partial occlusion, and full occlusion. MUSTer [36] maintained a short-term memory for detection and a long-term mechanism for searching the object via key-point matching. However, the MUSTer method needs to evaluate the integrated trackers in every frame. LCT [37] learned discriminative correlation filters for estimating the translation and scale variation, and the authors also developed a robust online detector using random ferns to redetect objects in case of tracking failure. The PTAV method [38] proposed a framework that contains three parts, including a base tracker T, verifier V and their coordination mechanism. The base tracker T used a DCF-based tracker, while the verifier V used a Siamese network to verify the similarity between two objects. Wang et al. [39] and Tang et al. [40] utilized a reliable redetection mechanism with a DCF-based tracker for long-term tracking.

Framework
The overall framework of our MSLT method is shown in Figure 1. It contains three modules: tracking-by-detection, reliability estimation, and redetection. First, the tracking-by-detection module processes a region-of-interest patch of twice the object size in the input frame and employs a DCF model based on HOG features and a color histogram model based on color features to obtain the related response maps. Then, we evaluate the responses by the peak-to-sidelobe ratio (PSR) and the color ratio of the related region, respectively. The reliability estimation decides whether to invoke the redetection module. If the tracking result passes the reliability estimation, we keep the original tracking result as the final result; if not, we go through the redetection process, where a saliency detector is invoked with a larger search region. We introduce the related modules in the following sections.


Correlation Filter Response with HOG Features
The standard DCF model learns a discriminative correlation filter on an image patch x with d channels, where the size of x is 2.5 times that of the tracked object. All training samples are generated by circularly shifting the tracked object, and 28-dimensional HOG features are extracted as the appearance description. The objective function of the DCF uses Tikhonov regularization and can be formulated as

\varepsilon = \Big\| \sum_{l=1}^{d} h^l \star x^l - y \Big\|^2 + \lambda_1 \sum_{l=1}^{d} \| h^l \|^2, (1)

where y is the desired response (generally a Gaussian-shaped ground truth), h^l is the filter of the l-th channel, \star denotes circular correlation, and \lambda_1 is a regularization factor. To reduce the computational cost, Equation (1) is transformed into the Fourier domain through Parseval's theorem. The objective function has a closed-form solution, and the solution for the l-th channel can be expressed as

\hat{h}^l = \frac{\hat{y}^* \odot \hat{x}^l}{\sum_{k=1}^{d} \hat{x}^{k*} \odot \hat{x}^k + \lambda_1} = \frac{\hat{A}^l}{\hat{B} + \lambda_1}, (2)

where \odot denotes the element-wise product, \hat{\cdot} stands for the discrete Fourier transform (DFT) of a vector, and \hat{x}^{l*} is the complex conjugate of \hat{x}^l. For efficient updates, we use a linear strategy to update the numerator \hat{A}^l_t and denominator \hat{B}_t of Equation (2),

\hat{A}^l_t = (1 - \eta_h)\, \hat{A}^l_{t-1} + \eta_h \, \hat{y}^* \odot \hat{x}^l_t, \qquad \hat{B}_t = (1 - \eta_h)\, \hat{B}_{t-1} + \eta_h \sum_{k=1}^{d} \hat{x}^{k*}_t \odot \hat{x}^k_t,

where \eta_h denotes the learning rate of the correlation filter. To reduce boundary effects during learning, we apply a Hann window to the samples [41]. During the tracking stage, an image patch z_t with the same size as the training sample x_{t-1} is cropped at the previous location, and the response map R^h_t is generated by correlating it with the filter learned in the previous frame,

R^h_t = \mathcal{F}^{-1} \Big\{ \frac{ \sum_{l=1}^{d} \hat{A}^{l*}_{t-1} \odot \hat{z}^l_t }{ \hat{B}_{t-1} + \lambda_1 } \Big\},

where \hat{A}^l_{t-1} and \hat{B}_{t-1} are the numerator and denominator of the filter in the previous frame, respectively, and \mathcal{F}^{-1} denotes the inverse DFT. The correlation filter response with HOG features is shown in Figure 2, where 28-dimensional HOG features are selected.

Color Histogram Response
For long-term tracking, a color histogram model is adopted to handle some challenging situations, such as color change and motion blur. Figure 3 shows the generation of the color histogram model. Inspired by [19], the histogram weight vector m is obtained by minimizing a regression error,

\varepsilon = \sum_{u \in \mathcal{O} \cup \mathcal{B}} \big( m^{\top} \phi_x(u) - y(u) \big)^2 + \lambda_2 \| m \|^2,

where \phi_x(u) denotes the feature of pixel u of patch x in the finite region \mathcal{O} \cup \mathcal{B} (object region \mathcal{O} and surrounding background region \mathcal{B}), y(u) is the corresponding label, and \lambda_2 is a regularization factor of the color histogram model. Following [41], the objective decomposes over the histogram bins and is solved by ridge regression; for each bin j = 1, 2, \ldots, M the solution is

m^j = \frac{\rho^j(\mathcal{O})}{\rho^j(\mathcal{O}) + \rho^j(\mathcal{B}) + \lambda_2},

where \rho^j(\mathcal{O}) and \rho^j(\mathcal{B}) are the proportions of pixels in \mathcal{O} and \mathcal{B} whose color falls into bin j. The update strategy can be expressed as

m_t = (1 - \eta_c)\, m_{t-1} + \eta_c\, m'_t,

where \eta_c is the learning rate of the color histogram model and m'_t is the weight vector computed on the current frame. Similar to the correlation filter model, after obtaining the histogram weight vector, we compute the color histogram response R_c for a given image patch z by back-projecting the per-bin weights onto the pixels of z.
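As an illustrative aside, the per-bin ridge-regression solution above reduces to a ratio of bin frequencies, which makes it very cheap to compute. A minimal NumPy sketch (our own naming; the paper's implementation is in MATLAB) could look like:

```python
import numpy as np

def histogram_weights(obj_pixels, bg_pixels, n_bins=32, lam=1e-3):
    """Per-bin ridge-regression solution of the color model:
    m_j = rho_O(j) / (rho_O(j) + rho_B(j) + lambda_2).
    obj_pixels / bg_pixels are 1-D arrays of integer bin indices."""
    rho_o = np.bincount(obj_pixels, minlength=n_bins) / max(len(obj_pixels), 1)
    rho_b = np.bincount(bg_pixels, minlength=n_bins) / max(len(bg_pixels), 1)
    return rho_o / (rho_o + rho_b + lam)

def histogram_response(patch_bins, m):
    """Back-project the per-bin weights onto a patch of bin indices
    to obtain the color histogram response R_c."""
    return m[patch_bins]
```

Colors that occur mostly on the object receive weights near 1, and colors that occur mostly in the background receive weights near 0.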


Response Fusion
The correlation filter response with HOG features and the color histogram response above can both be utilized for object tracking. For more accurate tracking, we combine them by a linear fusion,

R_{final} = (1 - \gamma)\, R_h + \gamma\, R_c, (12)

where \gamma is a fusion weight factor. The position of the tracked object in the current frame is defined by the maximal value of R_{final}, while the scale estimation follows the DSST method [23]. Figure 4 shows the fusion of the correlation filter response and the color histogram response. Among the tracking challenges, HOG features handle occlusion well, while color features handle deformation, color change, and so on. Therefore, the tracker with response fusion performs well when evaluated on the occlusion, color change, and deformation challenges.
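The linear fusion and the position readout are one-liners; the sketch below assumes the Staple-style convention that gamma weights the color response (the paper only states that the fusion parameter is 0.3, so the direction of the weighting is our assumption):

```python
import numpy as np

def fuse_responses(r_hog, r_color, gamma=0.3):
    """Linear fusion of the two response maps; gamma = 0.3 as in the paper,
    assumed here to weight the color histogram response."""
    return (1.0 - gamma) * r_hog + gamma * r_color

def locate(response):
    """Predicted position = coordinates of the maximal response value."""
    return np.unravel_index(response.argmax(), response.shape)
```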


Redetection Module
The redetection module contains two processes: reliability estimation and saliency detection. When a tracking result arrives, we first estimate its reliability from the correlation filter and color histogram responses. We then describe how to redetect the object when the tracking result is not reliable.

Reliability Estimation
For the correlation filter response with HOG features, we compute the peak-to-sidelobe ratio (PSR) to quantify the confidence of the result; a low PSR value means low correlation confidence. The PSR of the correlation filter response, S_t^h, can be expressed as

S_t^h = \frac{\max(R_t^h) - \mu_t}{\sigma_t}, (13)

where \mu_t and \sigma_t denote the mean and standard deviation of the correlation response R_t^h, respectively, and the superscript t denotes the t-th frame. If a large peak appears in the target area while the values elsewhere remain smooth and low, the tracking result is reliable and matches the target. On the contrary, when the tracking result is not reliable, the response map has multiple low peaks and the PSR value decreases significantly. Therefore, the PSR value reflects the quality of the tracking result to a certain extent.
Considering the small change between two consecutive frames, we define the average of the PSR values of the previous frames as a threshold. The set of PSR values of the previous frames is C_h = \{S_2^h, S_3^h, \ldots, S_{t-1}^h\}, and its average value is denoted M_h. The reliability estimation criterion of the correlation filter response is defined as

S_t^h \geq \tau_1 \cdot M_h, (14)

where \tau_1 is a constant less than 1. When the PSR value of the current frame does not satisfy this criterion, the target tracking is considered to have failed under the correlation filter module with HOG features. For the color histogram response, a color region is obtained by accumulating all pixels of the target region in the first frame, and the color score of each subsequent frame is defined as the proportion of this color region contained in the region given by the color histogram response,

S_t^c = \frac{|\Omega_t \cap \Omega_1|}{|\Omega_1|}, (15)

where \Omega_1 is the color region of the target in the first frame, \Omega_t is the color region obtained from the color histogram response in the t-th frame, and S_t^c denotes the color score of the t-th frame. Similarly, the reliability estimation criterion of the color histogram response is defined as

S_t^c \geq \tau_2 \cdot M_c, (16)

where \tau_2 is a constant less than 1 and M_c denotes the average value of the set C_c = \{S_1^c, S_2^c, \ldots, S_{t-1}^c\}, which includes the color score of the first frame.
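The PSR computation and the combined reliability check can be sketched as follows (a NumPy sketch with illustrative tau values; the paper does not report the exact constants):

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio of a correlation response map:
    (peak - mean) / std, large for a single sharp peak."""
    peak = response.max()
    mu, sigma = response.mean(), response.std()
    return (peak - mu) / (sigma + 1e-12)  # epsilon guards a flat map

def is_reliable(s_h, s_c, psr_hist, score_hist, tau1=0.6, tau2=0.6):
    """Tracking is declared failed (and redetection triggered) only when
    BOTH scores fall below a fraction of their running means."""
    m_h = np.mean(psr_hist) if psr_hist else s_h
    m_c = np.mean(score_hist) if score_hist else s_c
    return not (s_h < tau1 * m_h and s_c < tau2 * m_c)
```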

Saliency Detection and Candidate Sorting
For the saliency redetector, we use an existing algorithm [42] to obtain the saliency map efficiently. Since the tracked object may be out of view, we detect the object in a larger region with the saliency detection. If two or more salient objects are detected, we need to sort the candidate targets and choose the best match. Assuming that N salient candidates (z_1, z_2, \ldots, z_N) are obtained, we compute their correlations with the original correlation filter template H,

R_n = \mathcal{F}^{-1} \big( \hat{Z}_n \odot \hat{H}^* \big), \quad n = 1, 2, \ldots, N, (17)

where \hat{Z}_n is the feature description of candidate z_n in the frequency domain and \odot denotes the element-wise product. We sort the responses of all candidate boxes and set the candidate box with the maximum response as the final detected object,

n^* = \arg\max_{n} \, \max(R_n).

Through a thresholding process, we obtain the salient region S_t of the image patch, and its center coordinate (x_s, y_s) is computed as

x_s = \frac{\sum_{i=1}^{W} \sum_{j=1}^{H} i \cdot S_t(i, j)}{\sum_{i=1}^{W} \sum_{j=1}^{H} S_t(i, j)}, (19)

y_s = \frac{\sum_{i=1}^{W} \sum_{j=1}^{H} j \cdot S_t(i, j)}{\sum_{i=1}^{W} \sum_{j=1}^{H} S_t(i, j)}, (20)

where S_t(i, j) is the saliency value of pixel (i, j) in the target region of size [W, H]. With the computed coordinate (x_s, y_s) as the position, we use the DSST algorithm [23] to estimate the size of the candidate box.
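The saliency-weighted centroid of Equations (19) and (20) can be sketched in a few lines (the relative threshold of 0.5 is our illustrative choice; the paper does not specify the thresholding rule):

```python
import numpy as np

def salient_center(sal_map, thresh=0.5):
    """Threshold the saliency map at a fraction of its peak, then return the
    saliency-weighted centroid (x_s, y_s) of the surviving region.
    Assumes the map contains at least one positive value."""
    s = np.where(sal_map >= thresh * sal_map.max(), sal_map, 0.0)
    total = s.sum()
    ys, xs = np.indices(s.shape)          # row (y) and column (x) index grids
    return (xs * s).sum() / total, (ys * s).sum() / total
```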

Algorithm Description
The formal description of the proposed tracking method is given in Algorithm 1.

Algorithm 1: Long-term tracker with multiple features and saliency redetection (MSLT).

Input: The initial position l_0, the tracked object position l_{t-1}, and the scale of the (t-1)-th frame;
Output: The predicted object position l_t and the scale of the t-th frame;
Repeat:
1. Extract features and compute the HOG feature and color feature maps in the search region of the t-th frame.
2. Compute the correlation filter and color histogram responses, respectively.
3. Compute the reliability estimation PSR value S_t^h and the color score S_t^c using Equations (13) and (15).
4. If S_t^h < τ_1 · M_h and S_t^c < τ_2 · M_c, then
      Start the saliency detection in a larger search region, and obtain N salient candidates (z_1, z_2, …, z_N);
      If N = 1
         Take this object as the saliency detection result;
      Else
         Compute the correlations between the salient candidates and the original correlation filter template H using Equation (17); sort the responses of all candidate boxes, and set the maximum response as the final salient object;
      End if
      Compute the center coordinate (x_s, y_s) of the salient object using Equations (19) and (20), and estimate the related scale;
   Else
      Fuse the responses using Equation (12), and take the maximum of the fused response as the predicted position;
   End if
5. Update the correlation filter and color histogram models.
Until the end of the sequence.
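The per-frame control flow of the algorithm can be summarized in a short Python skeleton, with the sub-modules passed in as callables (all names and the stub interfaces are ours; this is a structural sketch of the loop, not the authors' implementation):

```python
import numpy as np

def track_frame(frame, state, tau1=0.6, tau2=0.6):
    """One iteration of the MSLT loop: compute both responses, check
    reliability against running means, then either fuse (normal path)
    or invoke saliency redetection (failure path)."""
    r_h = state['cf_response'](frame)          # correlation filter response
    r_c = state['color_response'](frame)       # color histogram response
    s_h = state['psr'](r_h)                    # reliability scores
    s_c = state['color_score'](r_c)
    if s_h < tau1 * np.mean(state['psr_hist']) and \
       s_c < tau2 * np.mean(state['score_hist']):
        pos = state['saliency_detect'](frame)  # redetect in a larger region
    else:
        r = 0.7 * r_h + 0.3 * r_c              # linear fusion, gamma = 0.3 assumed
        pos = np.unravel_index(r.argmax(), r.shape)
    state['psr_hist'].append(s_h)              # extend the threshold histories
    state['score_hist'].append(s_c)
    return pos
```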

Experiments
In this section, we describe our experiments. Our MSLT method was implemented on the MATLAB 2017a platform and run on a PC equipped with an Intel 3.7 GHz CPU and 16 GB of RAM. For the parameters, we set the HOG cell size to 4 and the number of color histogram bins to 32. The learning rates of the correlation filter with HOG features and of the color histogram were set to 0.01 and 0.04, respectively. The regularization parameters λ_1 and λ_2 were both set to 0.001, and the fusion response parameter was 0.3. The search region of the saliency detection module was set to four times the size of the tracked object. The scale parameters were set according to DSST [23]. We evaluated our MSLT on two benchmarks, OTB2015 [20] and VOT2016 [21]. For a fair comparison, we used the code or results provided by the respective authors.

OTB2015 Dataset
The OTB2015 [20] dataset is a popular and classical tracking dataset containing 100 video sequences; we used a subset of them in this experiment. It is fully annotated with 11 different attributes, including fast motion (FM), background clutter (BC), deformation (DEF), motion blur (MB), occlusion (OCC), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), low resolution (LR), illumination variation (IV), and scale variation (SV). For fairness, we evaluated all compared trackers under the one-pass evaluation (OPE) protocol provided in [43], with two metrics: distance precision and overlap success. In this experiment, we compared our tracker with seven state-of-the-art trackers: SRDCFdecon [44], STAPLE_CA [45], STAPLE [19], SAMF [24], DSST [23], KCF [10], and CSK [46]. Figure 5 shows the overall precision and success plots of the different trackers on the subset of the OTB2015 dataset [20], where the legends denote the average distance precision (DP) score at 20 pixels and the area-under-the-curve (AUC) score [43], respectively. Our MSLT tracker performs best, with a DP score of 0.874 and an AUC score of 0.639. Compared with the baseline STAPLE_CA tracker, our overall DP and AUC scores improve by 4.2% and 2.6%, respectively. Compared with the KCF tracker, our tracker has obvious advantages: the overall DP score improves by 14% and the AUC score by 13.3%. The tracking speed of our tracker reaches 31.2 fps, which meets the real-time requirement.
Figures 6 and 7 show the precision and success plots of OPE for the seven compared methods and our proposed method under different video sequence attributes. It can be seen that our tracker outperforms the other trackers in both DP and AUC scores on all attributes except the FM challenge. Taking the OCC attribute as an example, compared with the SRDCFdecon tracker (which ranks second), our tracker improves by 2.1% in the DP score and 0.9% in the AUC score.

VOT2016 Dataset
The Visual Object Tracking (VOT) challenge is a workshop held at the IEEE International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV), started in 2013; the VOT2016 dataset [21] consists of 60 video sequences. In VOT2016, ten new difficult sequences replace ten simple sequences of VOT2015, with no change to the evaluation metrics. We compared our MSLT model with six trackers, including KCF [10], SRDCF [25], DSST [23], Staple [19], DAT [47], and ECO [29]. We used three evaluation metrics [48] in this experiment: accuracy, robustness, and expected average overlap (EAO).
Tables 1 and 2 present the accuracy and robustness results of the seven compared trackers under nine attributes: camera motion, empty, illumination change, motion change, occlusion, size change, mean, weighted mean, and pooled [48]. A higher accuracy score indicates better performance, while a lower robustness score indicates better performance. Table 1 reports that our MSLT achieves the best performance on most attributes, such as occlusion, camera motion, and pooled. Similarly, Table 2 shows that our tracker performs well in the robustness results, which reflect the number of failures in the sequences. For the occlusion attribute, our tracker ranks first with an accuracy of 0.5253 and a robustness of 10. Moreover, our MSLT tracker ranks first on both the pooled and weighted mean attributes. The expected average overlap results are shown in Figure 8, and the related EAO values are listed in Table 3. It can be seen that the EAO value of our MSLT is 0.3737, which ranks first among all the compared trackers.

We can see that our MSLT tracker achieves good tracking performance compared with the other trackers. Taking the Skiing sequence (containing the IV, SV, DEF, IPR, and OPR attributes) as an example, most trackers tracked well in the twentieth frame, but only a few trackers could accurately track the object throughout; our tracker handled the fast motion well, thanks to the redetection module. For the Lemming sequence (containing the IV, SV, OCC, OPR, and OV attributes), our tracker performed well even in the 1336th frame, while the bounding boxes of the SRDCFdecon, CSK, and STAPLE_CA trackers had already drifted by the 859th frame.


Ablation Study
To verify the effectiveness of the saliency redetection module, we compared the tracking performance of the proposed algorithm with and without the redetection module, with the experimental conditions and parameters otherwise unchanged. For simplicity, we used the OTB2013 dataset [43] for testing and compared the tracking performance both overall and on several attributes under the OPE protocol. Figure 10 shows the overall precision and success plots of OPE with and without the redetection module. It can be seen that the tracker with the redetection module achieves better performance, with a DP score of 0.898 and an AUC score of 0.678, improvements of 5.9% and 6.4% over the tracker without the redetection module. Figure 11 shows the success plots of OPE under different video sequence attributes. We can see that the tracker with the redetection module performs better than the one without on all attributes.

Conclusions
This paper proposed a long-term tracker with multiple features and saliency redetection (MSLT). The MSLT tracker consists of two parts: a tracking-by-detection part and a redetection part. The tracking-by-detection part is built on the DCF framework and integrates a HOG and color histogram fusion model, which is effective in challenging situations such as occlusion and color change. Meanwhile, the saliency redetection part estimates the reliability of the tracking result and, if necessary, redetects the tracked object via saliency detection in a larger region. Compared with state-of-the-art trackers on two benchmarks, our MSLT method exhibits clear advantages on most evaluation metrics. Furthermore, the ablation study indicates that the tracker with our redetection module performs better than the one without it. In the future, we will attempt to introduce deep learning methods into the redetection module and utilize CNN features to improve the discriminative power of the correlation filter model. In addition, we will further explore applications of object tracking under blur and low-light challenges, as well as potential interdisciplinary applications such as ambient technologies [49], Industry 4.0, and so on.