Second-Order Spatial-Temporal Correlation Filters for Visual Tracking

Abstract: Discriminative correlation filters (DCFs) have been widely used in visual object tracking, but often suffer from two problems: the boundary effect and temporal filtering degradation. To deal with these issues, many DCF-based variants have been proposed and have improved the accuracy of visual object tracking. However, these trackers adopt only first-order data-fitting information and have difficulty maintaining robust tracking in unconstrained scenarios, especially in the case of complex appearance variations. In this paper, by introducing a second-order data-fitting term into the DCF, we propose a second-order spatial-temporal correlation filter (SSCF) learning model. To be specific, the SSCF tracker incorporates both first-order and second-order data-fitting terms into the DCF framework, making the learned correlation filter more discriminative. Meanwhile, spatial-temporal regularization was integrated to develop a model that is robust in tracking with complex appearance variations. Extensive experiments were conducted on the benchmark databases CVPR2013, OTB100, DTB70, UAV123, and UAVDT-M. The results demonstrated that our SSCF can achieve competitive performance compared to the state-of-the-art trackers. When the penalty parameter λ was set to 10^-5, our SSCF gained DP scores of 0.882, 0.868, 0.706, 0.676, and 0.928 on the CVPR2013, OTB100, DTB70, UAV123, and UAVDT-M databases, respectively.


Introduction
Visual object tracking is a fundamental problem in the field of computer vision, with a wide range of applications in human-computer interaction, video surveillance, unmanned driving, and so on. The task of visual object tracking always suffers from challenging appearance variations, such as illumination variation, fast motion, out-of-plane rotation, and in-plane rotation. To deal with these challenges, various innovative trackers have been proposed, achieving significant progress in tracking performance and robustness. Among these tracking methods, discriminative-filter-based trackers [1][2][3][4][5] have received significant attention due to their competitive performance.
The standard discriminative-correlation-filter (DCF)-based tracker treats filter learning as a ridge regression problem, and the objective function can be transferred to the frequency domain by the fast Fourier transform (FFT) for an efficient solution. Bolme et al. [6] first learned a correlation filter to perform the target tracking task and proposed the minimum output sum of squared error (MOSSE) model. MOSSE trains the filter by minimizing the sum of squared errors between the actual and desired correlation outputs over the sequence images. Inspired by MOSSE, Henriques et al. [7] observed that cyclic displacement could replace random sampling to achieve dense sampling and proposed a theoretical framework to explore the effect of dense sampling. The proposed framework formulates a kernelized correlation filter to improve the tracking performance. Zhang et al. [8] adopted the Bayesian principle to build a spatial-temporal context model for tracking. However, these CF-based trackers only utilize single-channel features, which is not robust in tracking scenarios with complex appearance variations. To tackle this issue, some CF-based methods [9][10][11][12][13][14][15][16][17][18][19] extract multiple features to learn the filters. The commonly used handcrafted features include the histogram of oriented gradients (HOG), color names (CNs), the local binary pattern (LBP), and the scale-invariant feature transform (SIFT). These features describe the shape and color information of the targets. Trackers using multiple features are more robust to the fast movement and deformation of targets. For instance, Galoogahi et al. [17] employed multi-channel HOG descriptors in the frequency domain for filter learning and proposed a multi-channel CF tracker (MCCF). Huang et al. [14] used hybrid color features to learn filters, in which compressed CN features and HOG features based on the opponent color space were extracted, and principal component analysis was used to reduce the computational cost. Li et al. [12] integrated raw pixel, HOG, and color label features into the DCF framework and presented an adaptive multiple-feature tracker. Kumar et al. [19] exploited the LBP, color histogram, and pyramid of the histogram of gradients to model the object's appearance and developed an adaptive multi-cue particle filter method for real-time visual tracking.
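The Fourier-domain filter learning that MOSSE popularized can be sketched in a few lines. The following is a minimal single-channel illustration, not the authors' implementation; the function names and the small regularizer `lam` are our own choices:

```python
import numpy as np

def train_mosse(patches, label, lam=1e-5):
    """Learn a single-channel correlation filter in the Fourier domain.

    patches: list of (M, N) grayscale training patches.
    label:   (M, N) desired Gaussian-shaped response.
    lam:     small regularizer to avoid division by zero.
    """
    G = np.fft.fft2(label)
    num = np.zeros_like(G)   # accumulates G . conj(F_i)
    den = np.zeros_like(G)   # accumulates F_i . conj(F_i)
    for p in patches:
        F = np.fft.fft2(p)
        num += G * np.conj(F)
        den += F * np.conj(F)
    return num / (den + lam)  # the learned filter (conjugate form)

def respond(filt, patch):
    """Correlation response map; its peak gives the target location."""
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * filt))
```

Dense-sampling (KCF-style) trackers replace the explicit sums with circulant-structure identities, but the per-frame pattern is the same: train in the Fourier domain, then localize the target at the peak of the response map.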
Even though these DCF-based trackers using multi-channel features succeed to some extent, some aspects, such as the redundancy of multi-channel features, the boundary effect, and data fitting, have not been fully explored. To tackle these issues, many structurally regularized DCF methods [20][21][22][23][24][25][26] have been presented. Zhu et al. [2] proposed an adaptive attribute-aware strategy to distinguish the importance of different channel features. Jain et al. [20] presented a channel graph regularized CF model by introducing a channel weighting strategy in which a channel regularizer was integrated into the CF framework to learn the channel weights. Xu et al. [22] proposed a channel selection scheme for multi-channel feature representations and adopted a low-rank approximation to learn filters in a low-dimensional manifold. In addition, many trackers propose a variety of strategies to alleviate the boundary effect. The SRDCF [23] incorporates a spatial regularizer into the DCF to deal with the problem caused by the periodic assumption. Li et al. [24] supplemented the SRDCF tracker [23] with a temporal regularization term and proposed a spatial-temporal regularized CF (STRCF) framework. To be specific, the STRCF integrates both temporal and spatial regularization into the standard DCF model and can perform model updating and DCF learning simultaneously. As a result, the STRCF can be regarded as an approximation of the SRDCF with multiple samples and achieves better tracking performance than the SRDCF. The BACF [25] utilizes a cropping matrix to extract patches densely from the background and expands the search area at a low computational cost. Xu et al. [26] combined temporal consistency constraints and spatial feature selection to propose an adaptive DCF model in which the multi-channel filters can be learned in a low-dimensional manifold space. However, the aforementioned trackers only employ the first-order data-fitting information of the feature maps. In other words, such methods do not consider high-order data-fitting information for tracking.
On the basis of the above analysis, we propose a novel CF-based tracker, the second-order spatial-temporal correlation filter (SSCF) learning model. We formulated our tracking algorithm by incorporating a second-order data-fitting term into the DCF framework, which helps to take full advantage of target features against surrounding background clutter. The main contributions of the SSCF are summarized as follows:

• We propose a new discriminative correlation filter model for visual tracking with complex appearance variations, unlike prior DCF-based trackers, which use only first-order data-fitting information. We incorporated second-order data fitting and spatial-temporal regularization into the DCF framework and developed a more robust tracker;
• An effective alternating-direction-method-of-multipliers (ADMM)-based algorithm was used to solve the proposed tracking model;
• Extensive experiments on the benchmark databases demonstrated that our SSCF can achieve competitive performance compared to the state-of-the-art trackers.
The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the detailed mathematical formulation of the proposed model and introduces the optimization algorithm. Section 4 reports the experimental results and the corresponding analysis. Finally, Section 5 draws the conclusions.

Related Work
In this section, we review three main categories of tracking methods: trackers based on target detection, trackers based on clustering, and channel-reliability learning trackers.
Since target detection techniques [27][28][29] have attracted wide attention in the computer vision field, many trackers based on target detection have been proposed. Guan et al. [30] proposed a joint detection and tracking framework for object tracking in which the detection threshold is adaptively modified according to the information fed back to the detector by the tracker. Zhang et al. [31] employed a faster recurrent convolutional neural network to extract candidate detection areas and proposed a multi-target tracking algorithm. In [32], Liu et al. combined motion detection with correlation filtering and presented a new model for object tracking. The presented model determines the object position via the weighted outputs of motion detection and the tracker. Considering that existing kernelized correlation filter tracking methods fail to identify occlusion, Min et al. [33] adopted a detector to assist the occlusion judgment and improve the tracking performance.
Clustering-based algorithms [34,35] have been commonly used in pattern recognition and computer vision, for tasks such as image segmentation [36] and pattern classification [37]. Inspired by this, many researchers use clustering algorithms to improve the performance of object tracking. For instance, Keuper et al. [38] combined motion segmentation with object tracking and presented a correlation co-clustering model to improve the performance. In [39], Li et al. developed an intuitionistic fuzzy clustering model for object tracking. Specifically, the local information of the targets is incorporated into the intuitionistic fuzzy clustering to improve the robustness. Considering that DBSCAN clustering does not require the number of clusters to be specified, He et al. [40] employed a DBSCAN-clustering-based track-to-track fusion strategy for multi-target tracking.
Recently, the idea of using different weights to distinguish the importance of different components has been widely used in pattern classification [41,42] and face recognition [43]. Similarly, some DCF-based channel-reliability learning trackers have been proposed to deal with the problem of model degradation. Du et al. [44] argued that different channels contribute differently to the tracking process and proposed a joint channel-reliability and correlation-filter learning model. The proposed tracker assigns each channel a weight to distinguish its importance. To exploit the interaction between different channels, Jain et al. [20] assigned similar weights to similar channels to emphasize important channels and developed a channel attention model. Li et al. [45] argued that existing trackers do not consider the complementary information of different channels and proposed a channel-feature integration method in which all channels of each feature share an importance map to avoid overfitting. In [46], the authors introduced channel and spatial reliability to the DCF framework and employed the reliability scores to weight the per-channel filter responses. The experiments showed that the channel weights were able to improve the tracking performance. These methods principally focus on overcoming model degradation by incorporating channel reliability and enhance the discriminative performance to some extent.

Objective Function Construction
As mentioned above, the existing DCF-based methods utilize only first-order data-fitting information and ignore high-order data-fitting information for tracking; hence, they cannot take full advantage of target features against surrounding background clutter and suffer from the stability-plasticity dilemma. To deal with these issues, we built a second-order spatial-temporal correlation-filter learning framework. Specifically, we incorporated a second-order data-fitting term and spatial-temporal regularization into the DCF framework and formulated a robust model. The objective function can be formulated as below.
We first denote the dataset S = {X_t}_{t=1}^{T}, where each frame X_t ∈ R^{M×N×K} contains K feature maps of size M × N, and Y ∈ R^{M×N} is the Gaussian-shaped label. Our aim was to learn a multi-channel convolution filter F ∈ R^{M×N×K} by minimizing the following objective function, where * represents the convolution operator and • denotes the Hadamard product. W is the spatial regularization matrix, and F_{t−1} is the correlation filter used in the (t−1)-th frame. λ and µ are penalty parameters. The first term is the first-order data-fitting term, which is the generic formulation for learning the filter in DCF-based trackers. The second term is the spatial regularizer, which alleviates the boundary effect. The third term is the second-order data-fitting term, which helps to make full use of discriminative target features. The last term is the temporal regularizer, which forces the current filter to stay close to the previous one and helps to prevent degradation caused by corrupted samples.
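Since the equation itself is not reproduced in this excerpt, the following is a plausible reconstruction matching the four terms described above. The first-order fit, spatial regularizer, and temporal regularizer follow the standard STRCF-style layout; the exact form of the second-order data-fitting term is not given here, so we leave it abstract as Φ₂:

```latex
\min_{F}\;
\frac{1}{2}\Big\| \sum_{k=1}^{K} X_t^{k} * F^{k} - Y \Big\|_F^{2}
\;+\; \frac{1}{2} \sum_{k=1}^{K} \big\| W \bullet F^{k} \big\|_F^{2}
\;+\; \frac{\lambda}{2}\, \Phi_{2}\big(F;\, X_t,\, Y\big)
\;+\; \frac{\mu}{2}\, \big\| F - F_{t-1} \big\|_F^{2}
```

Here, * is the convolution operator, • the Hadamard product, W the spatial regularization matrix, F_{t−1} the previous filter, and λ, µ the penalty parameters, as defined above.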

Optimization Algorithm
It can be noted that the objective function in Equation (1) is convex, and the minimization problem can be solved by the ADMM algorithm. To be specific, we introduced an auxiliary variable G ∈ R^{M×N×K} by enforcing the constraint F = G and constructed the augmented Lagrangian form of Equation (1) as Equation (2), where S is the Lagrange multiplier and γ is the step size. Letting H = (1/γ)S, Equation (2) can be rewritten as Equation (3). The optimization problem can then be divided into several subproblems.
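In scaled form, the ADMM construction described above takes the standard shape below; this is a sketch under the stated substitution H = (1/γ)S, writing E for the objective of Equation (1) evaluated on the split variables, since Equations (2) and (3) themselves are not reproduced in this excerpt:

```latex
\mathcal{L}(F, G, H)
= E(F, G)
+ \frac{\gamma}{2}\,\big\| F - G + H \big\|_F^{2}
- \frac{\gamma}{2}\,\big\| H \big\|_F^{2},
\qquad H = \tfrac{1}{\gamma} S .
```

Each ADMM iteration then alternates the three updates: minimize L over F with G, H fixed; minimize L over G with F, H fixed; and take the dual ascent step H ← H + F − G.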
Then, we can alternately solve each subproblem as follows.

Solving F: According to Parseval's theorem, the subproblem in Equation (4) can be formulated in the Fourier domain as Equation (7). Here, F̂ represents the discrete Fourier transform (DFT) of F. From Equation (7), it can be noted that the (i, j)-th element of Ŷ depends only on the (i, j)-th elements of F̂ and X̂_t across all K channels. Let v_ij(F̂) be the K-dimensional vector that contains the (i, j)-th elements of F̂ along all K channels. Optimizing the problem in Equation (7) is then equivalent to solving MN independent subproblems, as in Equation (8). Setting the derivative of Equation (8) with respect to v_ij(F̂) to zero yields a closed-form solution for each v_ij(F̂).

Solving G: From Equation (5), each element of G can be updated independently, and we adopted the same strategy as for solving F. Let v_ij(G) be the K-dimensional vector that contains the (i, j)-th elements of G along all K channels. Optimizing the problem in Equation (5) is equivalent to solving MN independent subproblems, as in Equation (10). Setting the derivative of Equation (10) with respect to v_ij(G) to zero yields the closed-form solution, where P is a diagonal matrix whose diagonal elements are w_ij.

Updating H: Let v_ij(H) be the K-dimensional vector that contains the (i, j)-th elements of H along all K channels. In the (l+1)-th iteration of the ADMM, the Lagrange multiplier vector v_ij(H) is updated by the dual ascent step. The details of the optimization procedure can be seen in Algorithm 1.
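The per-pixel structure above, MN independent K×K linear systems whose matrices are a rank-one outer product plus a scaled identity, admits a closed form via the Sherman-Morrison identity. The sketch below solves the generic pattern (x x^H + γI)v = b at every pixel; the actual right-hand sides of the F and G updates are not reproduced in this excerpt, so `rhs` stands in for them, and the function name is ours:

```python
import numpy as np

def solve_rank_one_systems(X_hat, rhs, gamma):
    """Solve (x_ij x_ij^H + gamma I) v_ij = b_ij at every pixel (i, j).

    X_hat: (M, N, K) complex DFT of the K feature channels.
    rhs:   (M, N, K) complex right-hand sides b_ij.
    gamma: positive scalar (e.g., the ADMM step size).
    """
    # Sherman-Morrison: (x x^H + g I)^-1 b = b/g - x (x^H b) / (g (g + x^H x))
    xHb = np.sum(np.conj(X_hat) * rhs, axis=2, keepdims=True)   # x^H b per pixel
    xHx = np.sum(np.abs(X_hat) ** 2, axis=2, keepdims=True)     # x^H x per pixel
    return rhs / gamma - X_hat * xHb / (gamma * (gamma + xHx))
```

This reduces each K×K solve from O(K^3) to O(K), which is what keeps the per-iteration cost of the filter update linear in KMN, as discussed in the complexity analysis below.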

Algorithm 1 SSCF algorithm
Input: Feature maps X_t, Gaussian-shaped label Y, previous correlation filter F_{t−1}, spatial regularization matrix W, and initial values G^(0) and H^(0).
Output: Estimated correlation filter F.
Repeat until convergence: update F̂ by solving the MN Fourier-domain subproblems; update G element-wise in the same fashion; update the Lagrange multiplier H.
Obtain the correlation filter F by applying the inverse DFT.

Computational Complexity
In this subsection, we discuss the computational complexity of the presented SSCF. As shown in Section 3.2, we divided the optimization problem into several subproblems. According to Parseval's theorem and the ADMM algorithm, the complexity of solving F is O(KMN) in each iteration. Taking the DFT and inverse DFT into account, the computational complexity of solving F is O(KMN log(MN)). Moreover, the complexity of the H and G subproblems is O(KMN). Supposing the number of iterations is T, the whole computational complexity of the proposed SSCF is O(TKMN(log(MN) + 1)). In view of this, our tracker is not fast.
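The per-frame cost is dominated by the DFT pair in the filter update. A back-of-the-envelope operation count (constants dropped, function name ours) illustrates how the O(TKMN(log(MN) + 1)) bound is assembled:

```python
import math

def sscf_ops(T, K, M, N):
    """Rough per-frame operation count implied by the complexity analysis.

    T: ADMM iterations, K: feature channels, M x N: filter size.
    Constants are dropped; only the asymptotic shape matters.
    """
    fft_ops = K * M * N * math.log(M * N)  # DFT + inverse DFT in the F step
    solve_ops = K * M * N                  # per-pixel updates of F, G, and H
    return T * (fft_ops + solve_ops)
```

The count is linear in both the number of ADMM iterations T and the number of channels K, which matches the analysis above.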
In the experiments, our tracker was implemented in MATLAB R2017a on a computer with an i7-8700K processor (3.7 GHz) and 48 GB of RAM. λ was set to 10^-5, and the other parameters were set to the same values as in the STRCF. Histogram of oriented gradients (HOG) features were used to conduct the comparative experiments. In addition, we followed the one-pass evaluation (OPE) protocol [53] to evaluate the performance of the different trackers. The success and precision plots are reported based on the bounding box overlap and the center location error. The AUC is the area under the curve of the success plot, and the distance precision (DP) is the percentage of frames whose center location error is within 20 px.
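Under the OPE protocol, the two summary numbers just described can be computed from per-frame bounding boxes as follows. This is a sketch: the box format [x, y, w, h] and the 101-point threshold grid are our assumptions, not details from the paper:

```python
import numpy as np

def center_errors(pred, gt):
    """Euclidean distance between predicted and ground-truth box centers."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def overlaps(pred, gt):
    """Intersection-over-union of axis-aligned boxes [x, y, w, h]."""
    tl = np.maximum(pred[:, :2], gt[:, :2])
    br = np.minimum(pred[:, :2] + pred[:, 2:], gt[:, :2] + gt[:, 2:])
    wh = np.clip(br - tl, 0.0, None)
    inter = wh[:, 0] * wh[:, 1]
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def dp_score(pred, gt, thresh=20.0):
    """Distance precision: fraction of frames with center error <= thresh px."""
    return float(np.mean(center_errors(pred, gt) <= thresh))

def auc_score(pred, gt):
    """Area under the success plot over overlap thresholds in [0, 1]."""
    ious = overlaps(pred, gt)
    ts = np.linspace(0.0, 1.0, 101)
    success = np.array([(ious >= t).mean() for t in ts])
    return float(np.trapz(success, ts))
```

A tracker's success plot is the curve `success` versus `ts`; the precision plot is the analogous curve over center-error thresholds, read off at 20 px for the DP score.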

Results on the CVPR2013 Database
The CVPR2013 database contains 50 fully annotated video sequences with 11 different attributes, such as background clutter, low resolution, occlusion, and out of view. The overall performance, summarized by the success and precision plots, is shown in Figure 1. It can be observed that the proposed SSCF achieved the top-ranking results. The area under the curve (AUC) and distance precision (DP) scores were 0.681 and 0.882, respectively. Specifically, the AUC and DP scores of the SSCF were higher by 1.2% and 0.9% than those of the STRCF. This indicates that incorporating the second-order data-fitting term is effective at improving the tracking performance. To evaluate the robustness of the proposed SSCF on different attributes, we constructed subsets with different dominant attributes for the experiments. The 11 challenging factors were background clutter (BC), low resolution (LR), illumination variation (IV), motion blur (MB), out of view (OV), fast motion (FM), deformation (DEF), occlusion (OCC), out-of-plane rotation (OPR), scale variation (SV), and in-plane rotation (IPR). Table 1 shows the AUC and DP scores of the proposed SSCF and the other trackers on the 11 attributes of the CVPR2013 database. Although not all scores of the proposed SSCF were the highest, our method achieved the best overall robustness. In particular, for the AUC scores on the different attributes, our SSCF outperformed all the other trackers except the LADCF.

Results on the OTB100 Database
OTB100 is a database containing 100 challenging video sequences, which together comprise more than 28,000 fully annotated frames. The results of the success and precision plots for all trackers are shown in Figure 2. As the figure shows, the proposed SSCF outperformed all the competing trackers in its overall performance. Our tracker achieved 0.664 and 0.868 in terms of the AUC and DP scores, respectively.
We also provide an attribute-based evaluation to validate the robustness of our SSCF. The AUC and DP scores of all trackers on the 11 different attributes are reported in Table 2. From the DP scores listed in the table, the proposed SSCF outperformed all competing trackers on eight attributes. In terms of the AUC scores, our tracker performed better than the other trackers on seven attributes. On the other attributes, the SSCF was among the top-three trackers. These results demonstrate that our SSCF is more robust than the other trackers.
Table 1. The area under the curve (AUC) and distance precision (DP) scores of the proposed SSCF and the other trackers on different attributes of the CVPR2013 database. The top-three methods on each attribute are denoted by different colors: red represents the best performance, blue the second best, and green the third best (AUC/DP).

Results on the UAV123 Database
The UAV123 dataset contains 123 video sequences and is the most commonly used and most comprehensive dataset for UAV tracking. The overall performance, summarized by the success and precision plots, is shown in Figure 6. It can be observed that the proposed SSCF achieved the top-ranking results. The area under the curve (AUC) and distance precision (DP) scores were 0.479 and 0.676, respectively.
To visually show the performance of the proposed SSCF in the tracking process, we selected three different types of video sequences, namely person, boat, and car sequences, to conduct the experiments. As shown in Figure 7, each column corresponds to three frames randomly selected from the video sequences. Five trackers were compared: our SSCF, AutoTrack, the MSCF, the STRCF, and the LADCF, marked in green, red, blue, yellow, and orange, respectively. It can be seen that our SSCF always tracked the correct target and had the best performance. The STRCF and LADCF were not robust in tracking small targets.

Results on the UAVDT-M Database
In this section, we compare our SSCF with the existing methods on the UAVDT-M database. We also report the running speed of these methods, measured in frames per second (FPS). Table 3 shows the comparison results. It can be observed that our SSCF achieved better performance than the existing trackers. The area under the curve (AUC) and distance precision (DP) scores were 0.667 and 0.928, respectively. However, it should be pointed out that the performance improvement of our tracker came at the expense of a speed reduction.

Conclusions
In this paper, we proposed a new model, the second-order spatial-temporal correlation filter (SSCF), for visual object tracking. The SSCF is a DCF framework combining a second-order data-fitting term with spatial-temporal regularization. To solve the proposed model, we divided the optimization problem into several subproblems and adopted the ADMM algorithm to solve each subproblem. By taking full advantage of the second-order data-fitting information, the SSCF becomes more discriminative and robust in addressing complex tracking situations. Extensive experiments on the benchmark databases demonstrated that our SSCF can achieve competitive performance compared to the state-of-the-art trackers.
It can be noted that the presented SSCF achieved better tracking results than the existing trackers on most of the attributes, but it was not robust on a few attributes, such as low resolution and occlusion. Recently, occlusion-processing methods have been presented in face recognition, such as occlusion dictionary learning [58,59] and the occlusion-invariant model [60]. Can these occlusion-processing methods be used for object tracking with occlusion? If so, how can we design a new model to enhance the performance? It should also be pointed out that the performance improvement of our tracker came at the expense of a speed reduction, so improving the running speed of our SSCF is an important problem. In addition, although the proposed SSCF achieved better results than the existing methods, its accuracy was not high when tracking small targets. Self-paced learning has been widely used in computer vision and machine learning [61]; combining self-paced learning and filter learning could potentially yield better performance in tracking small targets. In future work, we will focus on these topics.

Figure 1. Success plots (a) and precision plots (b) of the proposed SSCF and the other trackers on the CVPR2013 database.

Figure 2. Success plots (a) and precision plots (b) of the proposed SSCF and the other trackers on the OTB100 database.

Figure 3 lists the success plots comparing the presented method with the existing trackers on OTB50. The overall performance is summarized in Figure 3a. It can be seen that the proposed SSCF had the best success rates. The success plots of all trackers on the 11 different attributes are shown in Figure 3b-l. The proposed SSCF outperformed the existing trackers on eight attributes, i.e., fast motion, background clutter, motion blur, illumination variation, in-plane rotation, occlusion, out-of-plane rotation, and out of view. Our SSCF incorporates second-order data fitting and spatial-temporal regularization into the DCF framework to develop a robust tracking pattern. The tracking results of the SSCF on the other three attributes were among the top two. This also demonstrates the effectiveness and robustness of our tracker.

Figure 3. Success plots of the proposed SSCF and the other trackers on the OTB50 database. (a) Overall performance; (b-l) success plots on the 11 different attributes.

Figures 4 and 5 show the success plots and precision plots comparing the presented method with the existing trackers on the DTB70 database. The overall performance is summarized in Figures 4a and 5a. It can be observed that our SSCF achieved the best results in the overall performance. The success plots and precision plots of all trackers on the 11 different attributes are shown in Figures 4b-l and 5b-l. Our SSCF outperformed the existing trackers on nine attributes, the exceptions being motion blur and low resolution.

Figure 4. Success plots of the proposed SSCF and the other trackers on the DTB70 database. (a) Overall performance; (b-l) success plots on the 11 different attributes.

Figure 5. Precision plots of the proposed SSCF and the other trackers on the DTB70 database. (a) Overall performance; (b-l) precision plots on the 11 different attributes.

Figure 6. Success plots (a) and precision plots (b) of the proposed SSCF and the other trackers on the UAV123 database.

Figure 7. Qualitative analysis of different trackers on three video sequences.

Table 2. The area under the curve (AUC) and distance precision (DP) scores of the proposed SSCF and the other trackers on different attributes of the OTB100 database. The top-three methods on each attribute are denoted by different colors: red represents the best performance, blue the second best, and green the third best (AUC/DP).

Table 3. The area under the curve (AUC), distance precision (DP) scores, and FPS of the proposed SSCF and the other trackers on the UAVDT-M database.