1. Introduction
Visual object tracking is an important aspect of remote sensing that has become widely used [1]. It aims to automatically acquire the states of objects in subsequent video frames based on the initial state (center location and scale) of a target given in a video sequence. Vision-based tracking has attracted much attention in the field of computer vision. Over the past decade, with the rapid advancement of related research, numerous tracking methods have emerged, demonstrating highly effective outcomes [2,3,4]. Advances in visual tracking have been applied in many areas, such as UAV-based monitoring [5], intelligent surveillance, and airplane tracking [6]. However, improving the robustness and accuracy of target tracking algorithms remains challenging due to appearance changes caused by rapid motion, occlusion, and scale changes [7,8].
After years of research and development, target tracking algorithms have been divided into two main categories: generative models and discriminative models. With the development of feature extraction technology, the discriminative approach has become the mainstream research direction in target tracking because of its good performance. Discriminative models construct an objective function that clearly distinguishes the target from the background, allowing tracking models to obtain the exact position of the target object in each frame even in complex environments [9]. In addition, ensuring robust tracking over long sequences is difficult. Therefore, researchers have introduced the concept of online learning, in which the tracker learns updated image features online to distinguish the target object from the background. To address long-term tracking, Kalal et al. [10] proposed the tracking–learning–detection (TLD) model, which divides the tracking process into three parts: tracking, learning, and detection. The main advantage of this method is its ability to learn more information and avoid repeating mistakes. However, if the target object rotates out of the original plane, TLD cannot provide good results. Yu et al. [11] proposed a co-training technique that combines a generative model and a discriminative model. The technique uses online-learned subspace features to model the appearance of the target object and then uses a support vector machine (SVM) classifier to discriminate the objects. This model is very efficient but cannot address occlusion problems.
With the development of correlation filters, Bolme et al. introduced correlation filters (CFs) into the tracking process [12] and proposed a tracking model based on the minimum output sum of squared error (MOSSE) filter. Henriques et al. introduced the circulant structure kernel (CSK) algorithm [13], which relies on illumination-intensity features. Kernelized CFs (KCFs) were then developed to use more robust features such as the histogram of oriented gradients (HOG) [14]. The discriminative correlation filter (DCF)-based tracker in [15] mitigated two major problems in the existing paradigm: the spatial boundary effect and temporal filter degeneration. Efficient appearance learning models in DCFs have been proven effective in visual tracking [16]. The spatially regularized DCF (SRDCF) [17] learns filters from training examples with rigid spatial constraints. Spatial–temporal regularized correlation filters (STRCFs) [18] were then introduced, which incorporate spatial–temporal regularization to handle boundary effects and achieve performance superior to that of SRDCF. However, STRCF simply measures passive-aggressive learning between the current and previous filters via the Euclidean distance. Learning adaptive discriminative correlation filters (LADCFs) [15] exploit the complementary information of the target and background to adaptively select the most discriminative spatial features. This approach keeps the continuously updated tracker in a lower-dimensional manifold space by combining it with explicit distance-based temporal constraints. However, explicit distance-based measurements between two-frame filters inevitably amplify overfitting to the limited training samples available in visual tracking. Recently, joint spatial–temporal feature information has also been found to strongly affect tracking performance. Traditional tracking methods [19,20] were organized to avoid boundary effects and temporal degradation problems in a similar manner for visual tracking.
Although basing a tracking model on spatiotemporal features improves tracking performance, these methods embed only temporal filtering information and spatial weighting separately to regularize traditional DCF trackers. In general, the target position in a video sequence changes over time and cannot be predicted, especially under motion blur or partial occlusion. Because the features extracted by the tracking model are derived only from the current frame, the learned tracker depends heavily on the feature quality of offline training rather than on the spatiotemporal joint features of the image. Because the motion information between two video frames is disregarded, temporal and spatial consistency is not achieved. Here, we can impose a prior condition that the higher-order derivatives in the learned temporal feature space are small. Based on this prior, the concept of time-consistent slow feature analysis (SFA) [21] in a video can be employed as free supervision [22] to promote spatial–temporal consistency (that is, subtle feature differences between nearby frame pairs) and to ensure that the spatial representation changes smoothly over time. This observation motivates the modeling of consistent spatial–temporal dependency for visual tracking. Inspired by these findings, we introduce a model confidence term into our model to mitigate independent spatial–temporal dependency and integrate the spatial information of the previous frame to supplement the temporal dependency information of the current frame without negatively affecting online learning.
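To make the slowness prior concrete, the following minimal NumPy sketch (an illustration only, not part of the VOLCF implementation; the function name and array shapes are assumptions) computes a first-order temporal smoothness penalty between the feature maps of two nearby frames. A small value indicates that the spatial representation changes slowly over time, which is exactly the consistency encouraged by the SFA view.

```python
import numpy as np

def temporal_smoothness_penalty(feat_prev: np.ndarray, feat_curr: np.ndarray) -> float:
    """Mean squared first-order temporal difference between two feature maps.

    feat_prev, feat_curr: arrays of shape (H, W, D) extracted from nearby frames.
    A small value means the representation varies slowly over time, which is the
    prior exploited by slow feature analysis (SFA).
    """
    diff = feat_curr - feat_prev
    return float(np.mean(diff ** 2))

# Toy usage: two nearly identical feature maps yield a small penalty.
rng = np.random.default_rng(0)
f1 = rng.standard_normal((50, 50, 31))
f2 = f1 + 0.01 * rng.standard_normal((50, 50, 31))
print(temporal_smoothness_penalty(f1, f2))  # close to 1e-4
```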
Therefore, we propose an improved tracking model, namely the variational online learning correlation filter for visual tracking (VOLCF).
Figure 1 shows the process of the proposed variational online learning update in visual tracking. VOLCF improves the extraction and online learning of spatiotemporal features of image patches by filtering the general target model. VOLCF uses online learning to obtain the feature matrix of the target image block in the previous frame, that is, the trained convolution kernel. The convolution kernel of the previous target image block is then convolved with the candidates in the candidate set (which can be regarded as template matching) and compared with the filter response map; the candidate image block with the highest similarity to the original target image block is selected as the target block of the current frame. On this basis, the model confidence term introduced by VOLCF adjusts the filter to a reasonable object range suitable for tracking, better delimiting the area of the target candidate block and yielding a more accurate search range. The model confidence term therefore reflects the quality of the filter learned by the model, which enables more accurate target positioning and improves the scalability of the model under deformation, such as during fast motion. Furthermore, considering that in a complex environment a model relying only on first-order information easily fails when the target object changes abruptly, we extract second-order information related to the correlations between spatial features and temporal features. These correlation-related second-order features are used to improve the robustness of tracking in complex situations, mainly through the Kullback–Leibler (KL) divergence. The KL divergence is not a true distance; it measures the information loss of one distribution relative to another. Specifically, if the information loss between a candidate and the target in the previous frame reaches a minimum, the corresponding candidate block position is the new target position. By varying the parameters of the estimated distribution, we obtain different values of the KL divergence; when the KL divergence reaches its minimum, the corresponding parameters are the optimal parameters of interest. Our model obtains the specific parameter values through the covariance matrix. In generative adversarial networks (GANs) [23], the KL divergence is substituted into the objective function to solve a minimax game problem. Mapping this to the tracking problem, we build a loss function based on minimizing the information loss of the KL divergence, and the model can be optimized and updated by minimizing this loss function [24]. In [25], the KL divergence is minimized to train a regression network.
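To illustrate how the covariance matrix supplies the second-order information that enters the KL divergence, the sketch below computes the closed-form KL divergence between two multivariate Gaussians; in our setting, these could summarize the filter (or response) statistics of the previous target and of a candidate block. This is a generic, hedged sketch rather than the exact VOLCF objective; all function and variable names are illustrative.

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form.

    The covariance matrices carry the second-order (correlation) information
    discussed in the text; the KL value measures the information lost when the
    candidate distribution is used to approximate the previous target distribution.
    """
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    term_trace = np.trace(cov1_inv @ cov0)
    term_maha = diff @ cov1_inv @ diff
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (term_trace + term_maha - k + logdet1 - logdet0)

# Illustrative candidate selection: the candidate whose (mu, cov) minimizes
# kl_gaussian with respect to the previous target statistics is taken as the
# new target position.
```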
Furthermore, we introduce the alternating direction method of multipliers (ADMM) to solve the optimization through iterations over subproblems, enabling efficient online learning of VOLCF. Experimental results obtained on OTB [26], VOT [27] and other benchmarks demonstrate the accuracy and superiority of VOLCF, which makes considerable progress in terms of robustness and accuracy in comparison with state-of-the-art trackers. Compared with traditional online learning DCF trackers, VOLCF also has the advantage of being more interpretable. In summary, our main contributions are the following three items.
VOLCF introduces a model confidence term based on the general model and uses the spatial information of the previous frame to supplement the temporal information of the current frame to achieve precise positioning of the target, thereby ensuring the consistency of the model's spatial–temporal dependency.
The KL divergence is used to incorporate the second-order information of the model filter to ensure the consistency of the spatial–temporal information. The parameters are selected and adjusted through the covariance matrix to improve the robustness of the tracking model in complex environments.
The loss function constructed from the KL-divergence minimization game is optimized by the ADMM algorithm, which greatly reduces the computational complexity of the model. Our algorithm achieves outstanding performance in both accuracy and robustness against state-of-the-art trackers.
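Since this contribution rests on ADMM, the following sketch shows one generic scaled-form ADMM loop for a split problem min_f g(f) + h(q) subject to f = q, the pattern commonly used to decouple the data term from the regularizers in DCF trackers. The two proximal updates are placeholders (assumptions), not the actual VOLCF subproblem solutions; the step-size schedule mirrors the values reported later in Section 5.1.

```python
import numpy as np

def admm(prox_g, prox_h, shape, gamma=10.0, gamma_max=100.0, rate=1.2, iters=4):
    """Generic scaled-form ADMM for min_f g(f) + h(q) subject to f = q.

    prox_g / prox_h: callables returning the minimizers of the two subproblems
    (placeholders for the filter and auxiliary-variable updates in a DCF tracker).
    gamma is the penalty (step size), increased by `rate` up to `gamma_max`.
    """
    f = np.zeros(shape)
    q = np.zeros(shape)
    u = np.zeros(shape)                       # scaled Lagrange multiplier
    for _ in range(iters):
        f = prox_g(q - u, gamma)              # subproblem in f
        q = prox_h(f + u, gamma)              # subproblem in q
        u = u + f - q                         # multiplier (dual) update
        gamma = min(rate * gamma, gamma_max)  # step-size schedule
    return f
```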
The main structure of the paper is as follows: Section 1 is the introduction. Related works are discussed in Section 2. We introduce the STRCF tracking formulation in Section 3. The new method proposed in this paper, VOLCF, is derived in Section 4. Section 5 analyzes the experiments and results of VOLCF and the state-of-the-art trackers. A brief conclusion of our work is given in Section 6.
3. Spatial–Temporal Regularization Tracking Formulation
A training set of samples $x$ with Gaussian-shaped regression labels $y$ is given, and the filter is defined as $f$ with $D$ channels. Our goal is to learn a function that differentiates the target from the surrounding environment. DCF trackers can utilize the fast Fourier transform (FFT) and its inverse transform ($\mathcal{F}^{-1}$) to improve the computational efficiency in the Fourier domain:
$$x \otimes f = \mathcal{F}^{-1}\big(\hat{x} \odot \hat{f}^{\ast}\big),$$
where $\hat{x}$ is the Fourier representation of $x$ and $\hat{f}^{\ast}$ is the complex conjugate of $\hat{f}$ in the frequency domain, $\otimes$ denotes the convolution operator, and $\odot$ denotes elementwise multiplication. Upon acquiring the tracked feature of the target, the updated model is trained by minimizing the following loss function:
$$E(f) = L(X, y, f) + R(f),$$
where $L(X, y, f)$ is the objective and $R(f)$ is the regularization term. $X$ represents the labeled training samples created by the circulant matrix, with the base sample $x$ centered on the tracking result in the current frame. In the learning phase of traditional DCF target tracking, the ridge regression problem is usually formed with a quadratic loss and an $\ell_2$-norm penalty; thus, $L(X, y, f) = \frac{1}{2}\left\| Xf - y \right\|_{2}^{2}$ and $R(f) = \frac{\lambda}{2}\left\| f \right\|_{2}^{2}$ are commonly employed.
DCF trackers identify the optimal candidate by maximizing the discriminant function of a given filter based on the model parameters and prior knowledge:
$$x^{\ast} = \arg\max_{x} \, \mathcal{F}^{-1}\big(\hat{x} \odot \hat{f}^{\ast}\big),$$
where each candidate $x$ is a feature map extracted from the image.
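As a concrete, simplified instance of these formulas, the sketch below learns a single-channel correlation filter in closed form in the Fourier domain (ridge regression with an $\ell_2$ penalty, MOSSE-style) and locates the target by taking the peak of the response map of a candidate patch. It is a minimal single-channel illustration of the generic DCF pipeline, not the VOLCF solver; patch shapes and parameter values are assumptions.

```python
import numpy as np

def train_filter(x, y, lam=1e-2):
    """Closed-form single-channel ridge-regression filter in the Fourier domain.

    x: training patch (H, W); y: Gaussian-shaped label map (H, W).
    Returns H* = (Y . X*) / (X . X* + lam), the MOSSE-style solution.
    """
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    return (y_hat * np.conj(x_hat)) / (x_hat * np.conj(x_hat) + lam)

def detect(z, h_conj_hat):
    """Correlation response R = F^{-1}(Z . H*); the peak gives the target shift."""
    z_hat = np.fft.fft2(z)
    response = np.real(np.fft.ifft2(z_hat * h_conj_hat))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return response, (dy, dx)
```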
Because the tracking target is constantly moving, there is a potential relationship in the time series of the target during movement. To highlight the temporal characteristics of the filter, the subscript $t-1$ is used to denote the previous filter. Target tracking can benefit from a spatial–temporal regularization strategy [18] because of this continuous movement. The objective function of the improved spatial–temporal DCFs is expressed in the following general form:
$$\arg\min_{f}\;\frac{1}{2}\left\| \sum_{d=1}^{D} x^{d} \otimes f^{d} - y \right\|^{2} + \frac{1}{2}\sum_{d=1}^{D}\left\| w \odot f^{d} \right\|^{2} + \frac{\mu}{2}\left\| f - f_{t-1} \right\|^{2},$$
where $x$ denotes the input samples, $D$ is the number of channels, $f$ denotes the current-frame filter, $N$ is the size of each filter channel ($f^{d} \in \mathbb{R}^{N}$), $f_{t-1}$ denotes the previous-frame filter, $y$ denotes the Gaussian-shaped labels, $w$ is the spatial regularization weight matrix, and $\mu$ is the temporal regularization parameter.
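For reference, the sketch below numerically evaluates the three terms of this spatial–temporal regularized objective for a given multi-channel filter, using circular convolution via the FFT to stay consistent with the $\otimes$ operator above. It is an illustrative evaluation under assumed array shapes, not an optimization routine.

```python
import numpy as np

def strcf_objective(x, f, f_prev, y, w, mu):
    """Value of the spatial-temporal regularized DCF objective.

    x, f, f_prev: arrays of shape (H, W, D); y: (H, W) Gaussian label map;
    w: (H, W) spatial weight map; mu: temporal regularization parameter.
    """
    H, W, D = x.shape
    resp = np.zeros((H, W))
    for d in range(D):
        # circular convolution of channel d computed via the FFT
        resp += np.real(np.fft.ifft2(np.fft.fft2(x[..., d]) * np.fft.fft2(f[..., d])))
    data_term = 0.5 * np.sum((resp - y) ** 2)
    spatial_term = 0.5 * np.sum((w[..., None] * f) ** 2)
    temporal_term = 0.5 * mu * np.sum((f - f_prev) ** 2)
    return data_term + spatial_term + temporal_term
```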
5. Experiments and Results
This section describes experiments carried out on several benchmark datasets to evaluate the performance of VOLCF for visual tracking, including the OTB, Temple-Colour, VOT, and DTB datasets. Our VOLCF algorithm was implemented in MATLAB 2017a, and all experiments were run on a PC equipped with an Intel i7-7700 CPU, 32 GB of RAM and a single NVIDIA GTX 1070 GPU. To ensure the authenticity of the experimental data, we used the public code or results provided by the authors for a fair comparison.
5.1. Experimental Setting and Evaluation Criterion
Following the settings in [18], we set the side length of the square search region centered at the target proportionally to $\sqrt{WH}$ (W and H represent the width and height, respectively, of the target scale), so the corresponding image area is proportional to the area of the target bounding box. Then, we extracted HOG and deep features from the image region using a cell size of 4 × 4 pixels. To reduce boundary discontinuities, the features were further weighted by a cosine window. For the ADMM algorithm, the hyperparameters in Equation (7) were kept fixed throughout all the experiments; the initial step size parameter, its maximum value, and the learning rate were set to 10, 100, and 1.2, respectively.
In assessing the effectiveness of our VOLCF algorithm, we adopted the one-pass evaluation (OPE) metric, as recommended by the OTB benchmark. Precision plots depict the rate of frames in which the predicted position lies within a given distance threshold of the ground truth, across various thresholds. Success plots gauge performance based on average overlap, considering both the size and position of the bounding box [54].
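These two curves can be reproduced from per-frame results. The sketch below is a minimal version under an assumed (x, y, w, h) box convention, not the official OTB toolkit; it computes the precision at a center-error threshold (20 pixels is the commonly reported point) and the AUC of the success plot from per-frame IoU values.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are rows of (x, y, w, h)."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def iou(pred, gt):
    """Intersection over union of axis-aligned boxes (x, y, w, h)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision_at(pred, gt, thresh=20.0):
    """Fraction of frames whose center error is within `thresh` pixels."""
    return float(np.mean(center_error(pred, gt) <= thresh))

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Area under the success plot: mean success rate over overlap thresholds."""
    ov = iou(pred, gt)
    return float(np.mean([np.mean(ov > t) for t in thresholds]))
```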
5.2. Analysis of the OTB Dataset
In this section, we compare our VOLCF tracker against 25 state-of-the-art trackers on the Object Tracking Benchmark (OTB), including trackers based on HOG features (i.e., STRCF (HOG) and STRCF (HOGCN) [18], ECO-HC [36], SRDCFdecon [34], BACF [33], Staple+CA [55], SRDCF [17], Staple [28], SAMF+AT [56], SAMF [29], MEEM [57], DSST [30], LADCF [15], GFSDCF (HOGCN) [32] and KCF [14]) and trackers based on deep features (i.e., MDNet, ECO [36], DeepSTRCF [18], HDT [40], HCF [41], DeepSRDCF [31], SiamFC [38], and CF-Net [39]).
The OTB dataset in this section includes OTB50, CVPR2013, OTB100, and another general testing dataset, Temple-Colour. The OTB experiments focus on single-object tracking and use two metrics, precision plots and success plots, as mentioned above. The robustness of the experimental results on OTB is judged by 11 attributes, including illumination variation, occlusion, motion blur, in-plane rotation, out-of-view, low resolution, scale variation, deformation, fast motion, out-of-plane rotation and background clutter.
5.2.1. OTB50 Dataset
We compare our method with other state-of-the-art methods on the basis of HOG and deep features on the OTB50 dataset, which contains 50 video sequences.
Figure 2 shows the precision and success plots of our VOLCF tracker and 15 state-of-the-art trackers with HOG features, namely, STRCF (HOG) and STRCF (HOGCN) [18], ECO-HC [36], SRDCFdecon [34], BACF [33], Staple+CA [55], SRDCF [17], Staple [28], SAMF+AT [56], SAMF [29], MEEM [57], DSST [30], LADCF [15], GFSDCF (HOGCN) [32] and KCF [14]. Our proposed VOLCF tracker achieves a precision of 0.817 and a success score of 0.607, the best performance among all the trackers. Compared with STRCF, which has a precision of 0.811 and a success score of 0.600, our VOLCF tracker improves by almost 0.74% and 1.17%, respectively.
The success plots for four attributes, namely deformation, illumination variation, low resolution, and out-of-plane rotation, are shown in Figure 3. Our VOLCF tracker achieves the best performance on three of these attributes, with the exception of deformation, with improvements of 2.7%, 0.35%, and 4.2% over the second-best tracker. The other trackers do not perform well in such scenes because they overlook the potential link of the moving target in the time series. In addition, for the deformation attribute, VOLCF outperforms STRCF by 5.1% and has a score only 0.017% lower than that of the best tracker.
Next, we performed experiments with VOLCF and other trackers using deep features. As shown in Figure 4, VOLCF performs considerably better than most of the competing trackers, with the exception of ECO-HC [36] and MDNet, and surpasses its counterpart DeepSTRCF [18] by 3.6% in the success plots of OPE. Compared with Figure 2, the performance of VOLCF with deep features is much better than that with HOG features in both parts of the OPE.
5.2.2. CVPR2013 Dataset
For a more thorough evaluation, we evaluated our VOLCF tracker on the CVPR2013 dataset, which includes one more video than OTB50, in comparison with 15 state-of-the-art trackers, including ECO [36], CCOT [35], ECO-HC [36], DeepSTRCF, STRCF (HOG), STRCF (HOGCN) [18], LADCF [15], GFSDCF (HOGCN) [32] and DSST [30]. In Figure 5, we note that VOLCF achieves the best performance among the trackers with HOG features, with a score of 0.676, which is 0.45% higher than that of STRCF, and also attains a high success-plot score with deep features, 1.2% higher than that of DeepSTRCF. Compared with the results in Figure 2, VOLCF performs better on CVPR2013 with both HOG and deep features: the scores on CVPR2013 are 0.022 and 0.005 higher than those on OTB50 with HOG and deep features, respectively.
5.2.3. OTB100 Dataset
The OTB100 benchmark consists of 100 fully annotated video sequences. We compare our VOLCF model with the 13 trackers mentioned in OTB50.
Figure 6 compares the overlap success plots of four attributes, namely, deformation, illumination variation, out-of-plane rotation, and scale variation, with HOG features. Benefiting from the model confidence and KL divergence terms, our VOLCF algorithm can update the filters with the latest samples while staying close to the previously learned filters. Therefore, the VOLCF tracker achieves the best performance on these attributes. For example, our VOLCF method achieves an absolute gain of 1.6% over the second-best method on the out-of-plane rotation attribute.
Figure 7 shows a comparison of the overlap success plots of our VOLCF tracker with those of other trackers. As shown in Figure 7a, our VOLCF achieves the second-best performance, with a success plot score of 65.4%, outperforming its counterparts STRCF (HOGCN) [18] and STRCF (HOG) by gains of 0.3% and 2.3%, respectively. In Figure 7b, the VOLCF algorithm again achieves the second-best performance among all the trackers, with a score of 68.1%, which is 0.7% lower than that of ECO [36]. The score of VOLCF with deep features is much higher than its score with HOG features. The overlap success curves of these trackers with HOG features and deep features are ranked by the area under the curve (AUC) score.
As shown in Figure 6 and Figure 7, the superiority of the experimental results on the OTB100 benchmark is more significant than that on the OTB50 benchmark. The main reason is that OTB100 contains more videos, which is beneficial for target selection. For clarity, the results of VOLCF and five state-of-the-art trackers, namely, STRCF [18], ECO-HC [36], SRDCF [17], SRDCFdecon [34] and BACF [33], are shown in Figure 8, which contains the tracking results of six video sequences. As shown in Figure 8, VOLCF successfully captures the target, and its tracking accuracy is higher than that of the other trackers.
5.2.4. Temple-Colour Dataset
We performed VOLCF and DeepVOLCF experiments on the Temple-Colour dataset [58], which consists of 128 color sequences, in comparison with the 9 state-of-the-art trackers mentioned in the CVPR2013 experiments. Figure 9 shows a comparison of the overlap success plots for all trackers with HOG and deep features. Although the performance of VOLCF is not as good as that of ECO [36], the score of VOLCF with deep features equals that of its counterpart STRCF [18], and the success plot of VOLCF with HOG features is 0.55% higher than that of STRCF.
5.3. Analysis of Efficiency Experiment
Additionally, we report the tracking speed (FPS) comparison on the OTB dataset in Table 1 and Table 2. Our VOLCF tracker achieves 7.8 and 7.9 fps with hand-crafted features and deep features on OTB, respectively. Compared with other state-of-the-art trackers using hand-crafted features, the proposed VOLCF surpasses the SRDCFdecon, SRDCF, and SAMF+AT methods in terms of fps and is on par with the GFSDCF (HOGCN) method. With deep features, our VOLCF method outperforms most competing trackers, exceeding DeepSRDCF by 97.46% while remaining slightly below CF-Net.
5.4. Analysis of the VOT Dataset
For further evaluation, we performed experiments on three VOT datasets, namely, VOT2016, VOT2017 and VOT2018. We compared the VOLCF algorithm with other trackers, including STRCF, DeepSTRCF [18], Staple [28], DSST [30], SRDCF [17], SiamFC [38], MEEM [57], and KCF [14], with HOG and deep features. Unlike the OTB dataset, performance on the VOT datasets is reported using three metrics:
Accuracy measures the average overlap ratio between the ground truth and the predicted bounding box by trackers.
Robustness computes the average number of tracking failures over a sequence, which represents the failure rate; this metric covers six attributes (i.e., camera motion, illumination change, occlusion, motion change, size change, and the empty attribute).
The expected average overlap (EAO) estimates the accuracy of the estimated bounding box after a certain number of frames are processed since initialization and averages the no-reset overlap of a tracker.
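As a rough illustration of the first two measures (not the official VOT toolkit, which also re-initializes the tracker after each failure and skips a few burn-in frames), the sketch below summarizes one sequence from its per-frame overlaps, treating a zero overlap as a failure; the function name and this simplified failure criterion are assumptions.

```python
import numpy as np

def vot_accuracy_robustness(overlaps):
    """Simplified VOT-style summary from per-frame overlaps of one sequence.

    overlaps: array of IoU values between prediction and ground truth, where an
    overlap of 0 is treated as a tracking failure.
    Returns (accuracy, failure count): accuracy is the mean overlap over the
    successfully tracked frames, and the failure count drives the robustness score.
    """
    overlaps = np.asarray(overlaps, dtype=float)
    failures = int(np.sum(overlaps == 0.0))          # robustness: failure count
    valid = overlaps[overlaps > 0.0]
    accuracy = float(valid.mean()) if valid.size else 0.0
    return accuracy, failures
```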
5.4.1. VOT2016 Dataset
We report the results on VOT2016 [27], which consists of 60 challenging videos, in Table 3. The best results are marked in red, and the second-best results are marked in green. As shown in Table 3, DeepVOLCF has the best performance among all the trackers and outperforms the second-best tracker by 0.74%, 4.79% and 2.2% in terms of EAO, accuracy and robustness, respectively. VOLCF has the second-best performance in terms of EAO, achieving a great improvement over its counterpart STRCF. In addition, DeepVOLCF performs favorably against DeepSTRCF [18], with a gain of 8.6% in the EAO metric.
5.4.2. VOT2017 Dataset
The VOT2017 dataset contains 60 challenging sequences (some simple sequences in VOT2016 are replaced with more difficult ones) and has more accurate ground truths. We report the evaluation results on the VOT2017 benchmark in comparison with seven state-of-the-art trackers (i.e., STRCF, DeepSTRCF [18], Staple [28], DSST [30], SRDCF [17], SiamFC [38], and MEEM [57]) and add the failure metric relative to VOT2016. The results are summarized in Table 4, where the top three are marked in red, blue, and green, respectively. In terms of all the metrics, the performance of DeepVOLCF is much better than that of its counterpart DeepSTRCF [18], and it achieves the best performance among all the trackers on every metric except accuracy.
To better demonstrate the superior performance of DeepVOLCF, Figure 10 shows the accuracy–robustness (A-R) plots of the trackers, comparing VOLCF and DeepVOLCF with seven trackers denoted by different graphic symbols. A better-performing tracker lies closer to the top-right corner of the plot. Thus, DeepVOLCF, which occupies the top-right position, shows the best tracking performance, better than that of STRCF.
5.4.3. VOT2018 Dataset
The performance of our VOLCF tracker was also evaluated on VOT2018 in comparison with 6 trackers, namely, STRCF [18], Staple [28], DSST [30], SRDCF [17], SiamFC [38] and KCF [14]. The VOT2018 dataset consists of 60 videos, and the results in terms of EAO, accuracy, robustness and failure are presented in Table 5. VOLCF outperforms STRCF in terms of EAO, robustness and failure. In addition, DeepSTRCF obtained the highest scores, 0.1893 and 2.083, for EAO and robustness, respectively.
Figure 11 shows a comparison of the EAO at the experimental baseline among all the trackers. DeepVOLCF occupies the rightmost position among all the trackers, which means that it has the best accuracy, and VOLCF ranks third in terms of the EAO metric.
5.5. Analysis on the DTB70 Dataset
To better demonstrate the superiority of our model in tracking performance, we selected four video sequences from the DTB70 dataset to showcase the performance of our method. DTB70 contains 70 video sequences covering various challenging scenarios, including fast motion, scale variations, occlusions, blurriness, and changes in lighting conditions. In our experiments, the VOLCF model exhibited outstanding tracking performance on the DTB70 dataset. Compared with current state-of-the-art trackers, including LADCF [15], SRDCF, DeepSRDCF [17], STRCF, and DeepSTRCF [18], our approach performed excellently across multiple metrics, as shown in Table 6. LADCF suffers from severe tracking drift, leading to tracking failure, while STRCF also has issues with an expanding tracking area in the Bike video. In summary, our method consistently maintains good tracking performance, with the tracking frame always surrounding the target object. Particularly in handling small targets and complex backgrounds, the VOLCF model effectively maintains target tracking, even when the target and background are highly similar, thereby avoiding template drift and model degradation.
5.6. Analysis of Ablation Experiments
During the experiments, we found that the parameter of the loss function has an important effect on performance. We validated the effectiveness of VOLCF by comparing existing trackers with variants of VOLCF, where the variants are obtained by changing the parameter of Equation (11). Using the experiment on OTB100 as an example, we investigate the impact of the model confidence term on VOLCF. The overlap success plots of the different VOLCF variants and ECO-HC [36], STRCF [18], and SRDCF [17] are shown in Figure 12 for both HOG features and deep features. As shown in Figure 12, different parameters yield different experimental results. The VOLCF algorithm with HOG features outperforms the STRCF algorithm when the parameter is smaller than 0.01, and the best-performing variant of DeepVOLCF ranks second on OTB100.