Learning Spatial–Temporal Background-Aware Based Tracking

Abstract: Discriminative correlation filter (DCF) based tracking algorithms have achieved prominent speed and accuracy, attracting extensive attention and research. However, some unavoidable deficiencies remain. For example, the circulant shifted sampling process relies on a periodic assumption and causes boundary effects, which degrade the tracker's discriminative performance and make the target hard to locate under complex appearance changes. In this paper, a spatial–temporal regularization module based on the BACF (background-aware correlation filter) framework is proposed: a temporal regularization term is introduced to deal effectively with the boundary effect issue while improving the accuracy of target recognition. The model can be optimized effectively with the alternating direction method of multipliers (ADMM), and each sub-problem has a corresponding closed-form solution. In addition, in terms of feature representation, we linearly combine traditional hand-crafted features with deep convolutional features to enhance the discriminative performance of the filter. Extensive experiments on multiple well-known benchmarks show that the proposed algorithm performs favorably against many state-of-the-art trackers and achieves an AUC score of 64.4% on OTB-100.


Introduction
Visual object tracking is one of the popular problems in the computer vision community that has gained a wide spectrum of attention, and it is widely applied in practice, for example in intelligent video monitoring, autonomous driving, virtual reality, high-speed translation localization and UAV tracking in 6G technology. Although numerous efforts have been made over the past decade, various intractable challenges still exist, such as occlusion, fast motion and deformation. Recently, mainstream object tracking methods have generally been categorized into two types: tracking methods based on the discriminative correlation filter framework, and tracking methods based on deep learning networks. This paper mainly studies target tracking methods based on DCF.
The main contributions of this work are summarized as follows:
1. A spatial-temporal regularized background-aware DCF based model is proposed, which deals robustly with boundary effects and complex appearance changes.
2. The model can be solved effectively by the alternating direction method of multipliers (ADMM), and each sub-problem has a corresponding closed-form solution.
3. The proposed tracker achieves promising tracking performance and significantly outperforms other state-of-the-art DCF based trackers in precision and overlap success rate.
The remainder of this work is structured as follows: Section 2 introduces related works, Section 3 presents the proposed tracking framework, and Section 4 presents experimental verification and analysis. Finally, conclusions are drawn in Section 5.

Related Works
DCF based paradigms have been extensively employed in the field of visual tracking for a long time. Bolme et al. [26] first introduced correlation filter theory to visual tracking and proposed the single-channel minimum output sum of squared error (MOSSE) correlation filter. This method has often been chosen as a baseline tracker to extend and optimize from different viewpoints. Henriques et al. [27] proposed DCF with circulant structure kernels (CSK) and employed dense sampling to achieve high-efficiency tracking performance within a tracking-by-detection framework. Danelljan [28] introduced a kernel trick into the DCF framework and effectively extended single-channel to multi-channel features to achieve real-time and robust tracking performance. To handle variations in scale, Li et al. [29] proposed the IBCCF model, which deals well with aspect ratio variation during tracking and effectively improves tracking robustness and accuracy. In terms of feature fusion, [3] built an effective multi-cue analysis framework based on DCF for visual tracking and integrated deep convolutional features with traditional hand-crafted features to increase the discriminative performance of the classifier. Bhat et al. [30] systematically analyzed the characteristics of deep and shallow-level hand-crafted features and proposed a novel adaptive feature fusion scheme that balances the robustness and preciseness of the tracker by exploiting the complementary relationship between deep and shallow features. To handle the boundary effect problem, Li et al. [24] introduced spatial-temporal regularization based on SRDCF, which enables robustness against boundary effects. Dai et al. [31] proposed the ASRCF model to address the boundary effect problem by introducing an adaptive spatial regularization module; it learns reliable filter coefficients and makes the filter robust to complex appearance changes.
Currently, deep convolutional features are widely exploited in visual tasks [32][33][34][35], and the combination of the DCF based paradigm with deep network models is a significant trend in the development of visual tracking. With the help of deep network features, deep semantic information can be captured, improving tracking accuracy. Reference [19] proposed the CREST model, which reformulates DCF as a one-layer convolutional neural network and integrates feature extraction, response map generation and model update into the network for end-to-end training, achieving promising tracking results. Zhang et al. [36] incorporated geometric transformation into a correlation-filter-based network architecture and introduced a spatial alignment module, which deals effectively with a variety of complex appearance changes and geometric transformations. Zhu et al. [37] made full use of the rich optical flow information between consecutive frames to enhance feature representation, and treated optical flow estimation, feature extraction, correlation filtering and tracking as special layers of a deep network to achieve highly efficient tracking performance. In this work, we combine traditional hand-crafted features with deep network features (VGG-Net) within a DCF based framework, which takes full advantage of the complementary strengths of multiple feature types to further raise the accuracy and robustness of the tracker.

Background-Aware Correlation Filters Framework
In this work, we choose the background-aware correlation filter as the baseline tracker, and briefly review the basic principles of BACF below.
Denote by x_d ∈ R^T the d-th channel of a vectorized training sample with D channels, and by y ∈ R^T the ideal correlation response map. The main objective of BACF is to minimize

E(w) = (1/2) || y − Σ_{d=1}^{D} (B^T w_d) ⋆ x_d ||² + (λ/2) Σ_{d=1}^{D} || w_d ||²,    (1)

where B is a K × T binary cropping matrix that crops the K central elements of each T-dimensional channel, w_d ∈ R^K is the filter learned for the d-th channel, ⋆ denotes circular correlation, and λ is a regularization parameter.
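As an illustrative sketch of this objective (not the authors' implementation), the following NumPy snippet evaluates a single-channel version of the BACF objective above; the function name, the FFT-based circular correlation, and the toy shapes are our own assumptions:

```python
import numpy as np

def bacf_objective(w, x, y, B, lam):
    """Evaluate a single-channel BACF-style objective (illustrative sketch).

    w   : (K,) filter defined on the small cropped target support
    x   : (T,) vectorized search-region signal (T >> K)
    y   : (T,) desired Gaussian-shaped response
    B   : (K, T) binary cropping matrix selecting the target support
    lam : regularization weight
    """
    # B^T w zero-pads the small filter back to the full search region,
    # so correlation runs against real (non-padded) background samples.
    w_full = B.T @ w
    # circular correlation via the FFT
    response = np.real(np.fft.ifft(np.conj(np.fft.fft(w_full)) * np.fft.fft(x)))
    data_term = 0.5 * np.sum((y - response) ** 2)
    reg_term = 0.5 * lam * np.sum(w ** 2)
    return data_term + reg_term
```

With w = 0 the response is zero, so the objective reduces to half the energy of y, which gives a quick sanity check of the data term.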
BACF mainly exploits real negative training samples densely cropped from the background area to learn and update the filter, which deals effectively with appearance changes, but some imperfections remain in addressing boundary effects. First, because the BACF tracker mainly learns and trains the classifier on the current frame, it does not take into account the spatial-temporal information of historical frames, which limits the discriminative ability of the filter so that it cannot distinguish the target from similar backgrounds. Secondly, in terms of feature representation, BACF mainly employs 31-channel HOG features to represent the target. Such limited traditional hand-crafted features can hardly capture abstract semantic information, which directly affects the tracker's accuracy. From the above analysis, it is obvious that the BACF tracker has some deficiencies in dealing with appearance changes, which degrade its performance in handling boundary effects. In this work, to address these problems, a spatial-temporal regularized background-aware DCF based algorithm is proposed; the framework of the proposed method is shown in Figure 1.


The Objective Function of the Proposed Model
Inspired by the above analysis, we propose a spatial-temporal regularized background-aware discriminative correlation filter, which is achieved by introducing a spatial-temporal regularization module [24] into the background-aware filter framework. As shown in Figure 1, a spatial-temporal relation is effectively established between historical frames, which gains robustness to appearance changes. Our main objective can be expressed as

E(f) = (1/2) || y − Σ_{d=1}^{D} (B^T f_d) ⋆ x_d ||² + (α1/2) Σ_{d=1}^{D} || f_d ||² + (α2/2) Σ_{d=1}^{D} || f_d − f_d^{t−1} ||²,    (2)

where f_d^{t−1} denotes the filter of the d-th channel learned in the previous frame, and α1 and α2 are regularization parameters. The third term, Σ_d || f_d − f_d^{t−1} ||², is the introduced spatial-temporal regularization term, which makes the filter establish a spatial-temporal relationship between the current and historical frames.
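For intuition, the single-channel sketch below extends the BACF data term with the spatial (α1) and temporal (α2) regularization terms described above; all names and shapes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def st_bacf_objective(f, f_prev, x, y, B, a1, a2):
    """Single-channel sketch of the spatial-temporal regularized objective.

    f      : (K,) current filter on the cropped support
    f_prev : (K,) filter learned in the previous frame
    x, y   : (T,) search-region signal and desired response
    B      : (K, T) binary cropping matrix
    a1, a2 : spatial and temporal regularization weights
    """
    f_full = B.T @ f
    response = np.real(np.fft.ifft(np.conj(np.fft.fft(f_full)) * np.fft.fft(x)))
    data = 0.5 * np.sum((y - response) ** 2)
    spatial = 0.5 * a1 * np.sum(f ** 2)
    # ties the current filter to the previous frame's filter
    temporal = 0.5 * a2 * np.sum((f - f_prev) ** 2)
    return data + spatial + temporal
```

When the current filter equals the previous one, the temporal term vanishes and the objective falls back to the spatially regularized BACF form, which is exactly the behavior the temporal term is meant to encourage between consecutive frames.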

Optimization of the Proposed Model
Correlation filters usually benefit from the frequency domain for high-efficiency computation, which transforms complicated convolution operations into simple element-wise multiplications. Therefore, we convert Equation (2) into the corresponding frequency-domain form of Equation (3). The symbol ˆ denotes the discrete Fourier transform of a signal, for instance α̂ = √T Fα, where F is the orthonormal T × T matrix of complex basis vectors for mapping any T-dimensional vectorized signal into the Fourier domain. ĝ = (ĝ_1, ĝ_2, ..., ĝ_D) denotes the auxiliary variable, ⊗ denotes the Kronecker product, I_D is the D × D identity matrix, and (·)^T denotes the conjugate transpose of a complex vector or matrix. Equation (3) is a convex function and can be optimized by the ADMM method. The corresponding augmented Lagrangian can be reformulated as Equation (4), where μ is the introduced penalty parameter of the error term and ρ̂ = [ρ̂_1, ρ̂_2, ..., ρ̂_D]^T is the introduced DT × 1 Lagrange multiplier. Employing the ADMM method to solve Equation (4) iteratively, each subproblem ĝ* and f* has a corresponding closed-form solution. In subproblem f*, g and ρ can be obtained by inverse Fourier transform operations. In subproblem ĝ*, the computational complexity of Equation (7) is O(T³D³); solving it at every ADMM iteration would generate heavy computation and deteriorate real-time performance. By exploiting the sparse structure of X̂, each element ŷ(m) of ŷ = (ŷ(1), ŷ(2), ..., ŷ(T)) depends only on x̂(m) = [x̂_1(m), x̂_2(m), ..., x̂_D(m)]^T and ĝ(m) = [conj(ĝ_1(m)), conj(ĝ_2(m)), ..., conj(ĝ_D(m))]^T, where conj(·) denotes the complex conjugate of a complex vector.
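The ADMM alternation described above can be illustrated on a toy problem. The sketch below mirrors the same pattern (two closed-form subproblems, a Lagrange-multiplier update, and a capped geometric penalty schedule) on a simple ridge-regression split rather than the actual filter problem; all names and default parameters are our own:

```python
import numpy as np

def admm_ridge(A, y, lam, mu=1.0, beta=1.5, mu_max=10.0, iters=200):
    """Toy ADMM solver mirroring the paper's alternation pattern.

    Splits  min_f 0.5*||y - A f||^2 + 0.5*lam*||g||^2  s.t. f = g,
    then alternates closed-form f- and g-subproblems, a dual update on
    the Lagrange multiplier rho, and a capped geometric penalty schedule.
    The penalty growth here is milder than the tracker's settings so this
    small toy problem converges tightly.
    """
    D = A.shape[1]
    f = np.zeros(D)
    g = np.zeros(D)
    rho = np.zeros(D)  # Lagrange multiplier for the constraint f = g
    AtA, Aty = A.T @ A, A.T @ y
    for _ in range(iters):
        # f-subproblem (closed form): (A^T A + mu I) f = A^T y - rho + mu g
        f = np.linalg.solve(AtA + mu * np.eye(D), Aty - rho + mu * g)
        # g-subproblem (closed form): (lam + mu) g = rho + mu f
        g = (rho + mu * f) / (lam + mu)
        # dual ascent on the constraint residual
        rho = rho + mu * (f - g)
        # penalty schedule mu(i+1) = min(mu_max, beta * mu(i))
        mu = min(mu_max, beta * mu)
    return f
```

At convergence f equals the closed-form ridge solution (AᵀA + λI)⁻¹Aᵀy, confirming that the alternating closed-form updates reach the same minimizer as solving the coupled problem directly.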
Subproblem ĝ* can therefore be divided into M smaller per-element subproblems, and the solution of each is obtained in closed form by Equation (9). The computational complexity of Equation (9) is O(TD³); the matrix inversion consumes huge computational resources and affects real-time tracking performance, so the Sherman-Morrison formula [38],

(A + uv^T)^{−1} = A^{−1} − (A^{−1} u v^T A^{−1}) / (1 + v^T A^{−1} u),

is applied to further optimize and accelerate the calculation. In this situation, u = v = x̂_t(m) and A = (μ/(1 + γ)) I_D, so Equation (9) can be reformulated as Equation (10), where r̂_x(m) = x̂(m)^T x̂(m), r̂_ρ(m) = x̂(m)^T ρ̂(m), r̂_f(m) = x̂(m)^T f̂(m), and l = r̂_x(m) + Tμ. After this optimization, the computational complexity is reduced below O(TK), and all sub-problems are solved.
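The speed-up comes from the rank-one structure: since A is a scaled identity and u = v, the D × D inverse collapses to a few vector products. The sketch below (our own helper, not the paper's code) applies the identity and can be checked against a direct inverse:

```python
import numpy as np

def sherman_morrison_inv(A_inv, u, v):
    """(A + u v^T)^{-1} from a known A^{-1}, in O(D^2) instead of O(D^3).

    This is the rank-one identity applied in the g-subproblem, where
    u = v is the per-element feature vector and A is a scaled identity,
    so no explicit matrix inversion is needed per ADMM iteration.
    """
    Au = A_inv @ u           # A^{-1} u
    vA = v @ A_inv           # v^T A^{-1}
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```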

Lagrangian Parameter Update
In this work, we employ an online adaptive template update strategy; the model of the proposed tracker is updated by linear interpolation as

f_model^(m) = (1 − λ) f_model^(m−1) + λ f^(m),

where m and m − 1 represent the m-th and (m−1)-th frames of the video sequence, respectively, and λ is the learning rate of the model.
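This linear interpolation update, common to many DCF trackers, is a one-liner; the sketch below is ours, with the default learning rate taken from the implementation details reported later:

```python
import numpy as np

def update_model(model_prev, model_new, lr=0.0192):
    """Linear-interpolation template update: the previous model decays by
    (1 - lr) while the newly learned filter is blended in with weight lr."""
    return (1.0 - lr) * model_prev + lr * model_new
```

A small learning rate makes the template change slowly, which stabilizes the tracker against transient appearance changes such as brief occlusions.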

Implementation Details
The experiments are conducted using MATLAB2017a on a PC with an Intel i7-8700 3.2 GHz CPU with 16 GB RAM and an NVIDIA GeForce GTX 1070Ti GPU with 11 GB RAM. To obtain rich feature information, we employ 31-channel HOG features, color names, and hierarchical convolutional features (the conv4-4 and conv5-4 layers of VGG-19). The regularization weight factors α1 and α2 in Equation (2) are set to 0.01 and 0.08, respectively. For scale estimation, the number of scales is set to 7 and the scale step to 1.02. For the ADMM algorithm, the number of iterations is set to 5, the penalty factor μ in Equation (4) is initialized to 1, the learning rate λ is set to 0.0192, and μ is updated by μ(i + 1) = min(μ_max, βμ(i)), where β = 10 and μ_max = 10³.
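Two of the settings above are small deterministic schedules that are easy to make concrete; the sketch below (our own helper functions) generates the ADMM penalty sequence and the symmetric scale pyramid implied by these hyper-parameters:

```python
import numpy as np

def penalty_schedule(mu0=1.0, beta=10.0, mu_max=1e3, iterations=5):
    """Penalty values seen across the ADMM iterations:
    mu(i+1) = min(mu_max, beta * mu(i)), starting from mu0."""
    mus, mu = [], mu0
    for _ in range(iterations):
        mus.append(mu)
        mu = min(mu_max, beta * mu)
    return mus

def scale_factors(num_scales=7, step=1.02):
    """Symmetric scale pyramid around the current target size,
    e.g. step^{-3} ... step^{+3} for seven scales."""
    exponents = np.arange(num_scales) - num_scales // 2
    return step ** exponents
```

With the reported settings, the penalty grows 1 → 10 → 100 → 1000 and then saturates at μ_max for the remaining iterations, while the middle scale factor is exactly 1 (the unchanged target size).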

The Overall Tracking Results on OTB Dataset
We compare the proposed method with 10 object tracking methods, including ECO-HC [20], GradNet [39], MCPF [11], DeepSRDCF [40], DeepLMCF [41], CREST [19], UCT [42], ARCF [25], SRDCF [28] and STAPLE_CA [6], on the well-known tracking benchmarks [43,44], which contain 50 and 100 annotated video sequences, respectively, with 11 different attributes. We evaluate the trackers using the two metrics provided in [43] on the OTB-50 [44] and OTB-100 datasets, reporting overlap success and distance precision plots. Distance precision (DP) is the ratio of frames whose center location error is within a given threshold. Overlap success (OS) is the ratio of frames whose bounding-box overlap exceeds a given threshold, generally set to 0.5. Following the evaluation protocol of [43], the precision and success plots on these datasets are shown in Figures 2 and 3.

Among the compared trackers, the proposed tracker performs well, with a DP of 91.1% and an OS of 64.3%. Table 1 shows the mean FPS of the proposed tracker and several mainstream trackers. The proposed tracker combines traditional hand-crafted and deep network features, which significantly increases the amount of computation and makes it slower than STRCF and BACF, which use only hand-crafted features, but it still runs faster than trackers such as SRDCF and MCPF.

Figure 4 shows the attribute-based evaluation results on six video attributes of OTB-50. The results demonstrate that the proposed tracker outperforms the competing trackers, empirically showing how combining spatial-temporal regularization with the background-aware module improves reliability against complex boundary effects and appearance variations. Trackers such as MCPF, ECO-HC and DeepLMCF are shown to be less robust to background clutter, out-of-view, deformation, and illumination variation, respectively.
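The DP and OS metrics described above can be computed directly from per-frame bounding boxes. The sketch below is our own illustration; the [x, y, w, h] box convention and the function names are assumptions, and the default thresholds (20 pixels for DP, 0.5 IoU for OS) follow the standard OTB protocol:

```python
import numpy as np

def center_error(pred, gt):
    """pred, gt: (N, 4) boxes as [x, y, w, h]; per-frame center distance."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    """Per-frame intersection-over-union of [x, y, w, h] boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def distance_precision(pred, gt, threshold=20.0):
    """Fraction of frames with center location error within the threshold."""
    return np.mean(center_error(pred, gt) <= threshold)

def overlap_success(pred, gt, threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    return np.mean(iou(pred, gt) > threshold)
```

Sweeping the DP threshold yields the precision plot, and sweeping the OS threshold from 0 to 1 yields the success plot whose area under the curve is the AUC score reported in the abstract.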


The Overall Tracking Results on Temple-Color 128 Dataset
We evaluate the tracking results on the Temple-Color 128 dataset [45], which contains 128 color sequences, with comparisons to state-of-the-art methods including ECO [20], ECO-HC [20], STRCF [24], TADT [46], MCPF [11], DeepSRDCF [40], STAPLE [32], BACF [23], PTAV [45] and SRDCF [28]; overlap success rate and distance precision are taken as the evaluation indexes. Table 2 shows that the proposed tracker achieves good performance among the state-of-the-art trackers, with an OS of 56.6% and a DP of 78%, second only to ECO (60% OS). The average OS of 56.6% outperforms the baseline tracker BACF (51.9%) and the homogeneous tracker STRCF (55.3%), which demonstrates the effectiveness of combining the background-aware module with the spatial-temporal regularization term to deal with the boundary effect issue. This further illustrates that incorporating deep convolutional features can improve the recognition ability of the classifier: different feature types are complementary to a certain degree, where traditional hand-crafted features generally capture superficial appearance details, while deep features represent higher-level semantic information of the target. It also demonstrates that the introduced spatial-temporal regularization module significantly gains robustness to appearance variance caused by unwanted boundary effects and improves the discriminative ability of the learned filter.

The Overall Tracking Results on UAV123 Dataset
We evaluate the tracking results on the UAV123 dataset [47], which contains 123 challenging sequences, with comparisons to state-of-the-art methods including ECO-HC [20], DSST [1], SRDCF [28], BACF [23], STAPLE_CA [6], STAPLE [32], KCF [28], SAMF [2] and ARCF [25]. Table 3 shows that the proposed tracker performs well with an OS of 57.2%, second only to ARCF (60%), and the average OS of 57.2% outperforms the baseline tracker BACF (51.9%), which further demonstrates the effectiveness of combining spatial-temporal regularization with the background-aware module.
We also qualitatively compare the proposed tracker with several state-of-the-art trackers (including GradNet, ARCF, UCT, ECO-HC, DeepSRDCF, STAPLE_CA, DeepLMCF and CREST) on five representative challenging sequences (MotorRolling, DragonBaby, Skiing, Bolt2 and Matrix) from OTB-2015. From the figure, we can see that the proposed tracker deals well with motion blur, fast motion, deformation, scale variation, out-of-view, out-of-plane rotation, occlusion and background clutter challenges.
In the Matrix and MotorRolling sequences, the object mainly experiences significant appearance changes such as scale variation and background clutter. Most of the trackers, such as GradNet, ARCF, CREST, UCT, ECO-HC, DeepSRDCF, DeepLMCF and STAPLE_CA, lose the target and fail to recover from tracking drift; therefore, compared with traditional hand-crafted features or CNN features alone, it is more effective to fuse hand-crafted features with multiple powerful hierarchical convolutional features. Owing to this fusion, the proposed tracker can also deal effectively with the motion blur and occlusion challenges.
As for the Skiing, DragonBaby and Bolt2 sequences, the object undergoes a certain degree of appearance change, such as fast motion, deformation and scale variance. The proposed tracker locates the target accurately from #28 to #34 in the Skiing sequence, and performs more robustly than the ECO-HC, ARCF, SRDCF and DeepSRDCF trackers. The GradNet, MCPF, ECO-HC, UCT, ARCF, CREST, DeepSRDCF, DeepLMCF and STAPLE_CA trackers fail to track the target object from #46 to #81 in the DragonBaby sequence when fast motion and large scale change occur. GradNet, ARCF, ECO-HC, UCT, CREST and SRDCF perform poorly on the Bolt2 sequence, which involves deformation and background clutter. Even so, the proposed tracker can still locate the targets well, which further verifies the reliability of the proposed method.

Conclusions
In this work, we integrate a spatial-temporal regularization module into a background-aware correlation filter framework, adaptively balancing active and passive model learning and thus gaining robustness to target appearance variance. The proposed model can be solved effectively by the ADMM optimization algorithm, which accelerates convergence. In addition, from the perspective of feature representation, the proposed model effectively combines higher-level deep features with shallow hand-crafted features, which enables the filter to capture more abstract semantic information and improves the discriminative ability of the learned filter. Compared with many recent state-of-the-art trackers, the proposed tracker still performs favorably.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html (accessed on 7 June 2021).