A Learning Frequency-Aware Feature Siamese Network for Real-Time Visual Tracking

Abstract: Visual object tracking with Siamese networks has achieved favorable performance in both accuracy and speed. However, the features used in Siamese networks contain spatially redundant information, which increases computation and limits the discriminative ability of the networks. To address this issue, we present a novel frequency-aware feature (FAF) method for robust visual object tracking in complex scenes. Unlike previous works, which select features from different channels or layers, the proposed method factorizes the feature map into multiple frequencies and reduces the spatially redundant low-frequency information. Reducing the resolution of the low-frequency map saves computation and enlarges the receptive field of the layer, which yields more discriminative information. To further improve the performance of FAF, we design a data-independent augmentation for object tracking that improves the discriminative ability of the tracker by enhancing the linear representation among training samples through convex combinations of images and tags. Finally, a joint judgment strategy that combines intersection-over-union (IoU) and classification scores is proposed to adjust the bounding box result and improve tracking accuracy. Extensive experiments on 5 challenging benchmarks demonstrate that our FAF method performs favorably against state-of-the-art (SOTA) tracking methods while running at around 45 frames per second.


Introduction
In recent years, visual object tracking, a fundamental problem in computer vision, has been widely studied and applied to unmanned vehicles, traffic surveillance, and intelligent transportation. As a middle-level semantic problem, object tracking further extracts and processes low-level semantic features (such as image classification) to provide reliable target location and tracking information for high-level semantic problems (such as action recognition). A tracker analyzes a manually or automatically selected target in a video sequence and predicts the position and corresponding status of that target in the current frame. However, tracking targets have shifted from traditional large objects such as vehicles and pedestrians to arbitrary small objects in complex scenes (with background clutter, illumination variation, scale variation, low resolution, occlusion, and fast motion), which are harder to predict. To address this issue, deep learning models with strong discriminative ability have been introduced to design robust, real-time tracking methods for complex scenes.
Existing deep learning-based trackers obtain robust tracking results because deep models [1][2][3][4] acquire a strong ability to discriminate foreground from background by learning with massive numbers of parameters. However, the target changes during tracking, and these models perform heavy calculations to adapt to the current target, which limits the tracking speed and prevents them from meeting real-time requirements. Some methods try to solve this problem by choosing lightweight models [5], but they usually improve tracking speed at the expense of tracking accuracy.
Robust deep learning-based trackers rely on large numbers of labeled training samples, and the discriminative ability of a model grows with the size of the training dataset. Because expanding the training set requires additional manual annotation, some methods [6,7] generate new training samples through geometric transformations of the original samples, which improves the discriminative ability of the model. However, those data augmentation methods assume that the generated samples stay within the vicinity of a single class and do not consider the vicinity relation across different classes, which limits the improvement they can bring. Furthermore, classification-based or regression-based methods take the position with the highest predicted score as the object position, and some methods choose the position that has both high classification and high regression scores to improve tracking accuracy. However, when the classification and regression scores conflict, tracking robustness decreases and tracking may fail.
To address those issues, a novel robust real-time tracker, FAF, is proposed. Unlike existing tracking methods, which select features from different layers or channels, we introduce frequency-aware features into object tracking, which improves the discrimination ability of the model while reducing feature calculation. To further improve the ability of the model to distinguish between background and target, an effective training method based on data fusion is designed, which helps the model learn the vicinity relation across different classes. Finally, a joint judgment algorithm combining regression and classification scores is introduced to further improve the accuracy of the tracking model in complex scenes. Extensive experiments are conducted on 5 well-known benchmarks: OTB [8], LaSOT [9], GOT10K [10], TrackingNet [11], and VOT18 [12]. The results show that our tracker outperforms the state-of-the-art trackers.
The contributions of this paper include:
• This paper proposes a novel robust real-time tracker, FAF, that incorporates frequency-aware features. Existing tracking methods use linear combinations of shallow-layer and deep-layer features for tracking, which requires a more complex network. In contrast, we decompose the layer feature into high-frequency and low-frequency features, then compress the redundant low-frequency features and splice the two into multi-frequency features. Without increasing model complexity, the frequency-aware feature reduces feature calculations and improves feature discrimination ability.
• To enhance the ability of tracking models to distinguish between foreground and background, we design a training data fusion method that helps the model learn vicinity relations across different classes. Both samples and labels are fused by weighting to obtain fusion samples. Training with fusion samples exposes the model to the blurred boundaries between classes, which improves its discriminative ability.
• To improve the accuracy of the tracking bounding box, a joint judgment strategy combining the predicted regression and classification scores is proposed. Compared with existing trackers, which use independent or linearly combined classification and regression scores, the proposed strategy uses confidence estimation over both predicted scores to improve tracking accuracy. In particular, it resolves conflicts between classification and regression scores in complex environments and enhances the robustness of the model.
To comprehensively verify the efficiency of FAF, extensive experiments are conducted on 5 well-known benchmarks. The results show that the proposed FAF outperforms the state-of-the-art trackers while running at 45 fps.

Related Work
In this section, we will mainly introduce two categories of tracking methods: correlation filter (CF)-based methods and deep-learning-based methods.

Correlation Filter-Based Method
CF-based methods have been applied successfully in recent years thanks to their high-speed feature calculations. Ref. [13] represents objects through handcrafted features (such as HOG) and achieves high-speed tracking. To adapt to changes in object scale, scale CFs have also been designed [14]. Ref. [6] further improves tracking speed by mapping feature calculations into Fourier space. As deep learning has improved the discrimination ability of deep features, Refs. [15,16] take advantage of deep features to track objects. To improve the accuracy of the model, Refs. [17,18] combine the deep features of different layers, which carry semantic and spatial information, while Ref. [7] combines handcrafted and deep features to enhance the discriminative ability of the model. However, deep features with better discrimination ability are obtained through complex calculations in larger models, which limits the speed of CF-based methods.

Deep Learning-Based Methods
With the rapid development of deep learning in recent years, deep learning models have been widely used in computer vision, and their powerful learning and discrimination abilities have surpassed traditional methods and reached state-of-the-art performance. For object tracking, deep learning-based methods build the tracking framework on deep network models and perform supervised or semi-supervised pre-training on massive numbers of samples to obtain robust tracking models [19]. Later, deep reinforcement learning and Siamese networks were also introduced into object tracking. Deep reinforcement learning-based methods [20,21] can effectively transfer the knowledge gained in training to the tracking environment and quickly adapt to new scenes through self-learning. To accelerate the tracking process, Siamese networks [22][23][24] use template matching and a non-update model strategy to reduce the cost of feature calculation and model update. However, existing methods mainly balance speed and accuracy by selecting different deep models, without optimizing the deep features themselves. Meanwhile, complex models require massive and diverse training samples, while most tracking methods either apply no data processing or only use geometric transformations to increase sample diversity, which also limits the robustness of the model.

Proposed Method
To solve those issues, a novel robust real-time tracking method called FAF is proposed. The proposed tracking framework consists of four modules: the offline IoU modulation, the online IoU predictor, the online classifier, and the update module, as shown in Figure 1. In the offline training stage, the offline IoU modulation is independently pre-trained on massive training datasets to learn the relation between target scale and position. In the online tracking stage, the offline IoU modulation guides the online IoU predictor, which produces the IoU regression score, while the classifier produces the classification score. The joint judgment strategy then provides optimized target scale and position information based on the classification and regression scores, after which the IoU predictor and classifier are updated by the update module. In the proposed method, ResNet18 pre-trained on ImageNet [25] is chosen as the backbone. To improve the discrimination ability of the backbone, we optimize the original backbone through the feature decomposition and sample fusion methods.
For the offline training stage shown in Figure 1, the optimized ResNet18 extracts two-way frequency-aware features from fusion samples. Because the shallow-layer feature contains position information and the deep-layer feature contains semantic information, the two are connected and used to learn the scale and position of the target. The conv and pooling layers are used to further improve the discrimination ability of the features. The IoU modulation is trained offline on large video and image datasets and is not updated during online tracking. Our pre-training data fusion is described in Section 3.1 and the frequency-aware feature is detailed in Section 3.2.
For the online tracking stage, the first frame of the object, processed by data fusion, is used to initialize the IoU predictor and the classifier module. Unlike in the offline stage, the IoU predictor receives two-way features: the reference-frame guidance features from the IoU modulation and the target features from the current frame. The IoU predictor and classifier then give the IoU and classification scores of the object in the current frame. Finally, the proposed joint judgment strategy gives the final prediction based on these scores, and the IoU predictor and classifier are updated by the update module. The joint judgment strategy is detailed in Section 3.3.

Training Sample Fusion
Recent breakthroughs in large-scale deep learning have two points in common: first, more complex network structures are designed; second, larger training datasets are proposed. Because building a training dataset requires a great deal of manual labeling, data augmentation methods based on existing datasets are used to increase the amount of data. For the object tracking problem, some methods apply geometric transformations to increase the data and enhance the robustness of the model. However, existing data augmentation methods operate within a single class and do not consider the relationship between different classes, so they add little diversity to the data and their benefit is limited.
To solve this issue, we propose a training sample fusion method to increase data diversity. Unlike the classification problem, the object tracking problem contains only two classes, the target and the background, and pays less attention to which category the object belongs to. Inspired by [26], we augment the data by taking weighted combinations of both the samples and the sample labels. With such data augmentation, the model can learn vicinity relations across examples of different classes.
To be specific, we first generate candidate samples around the ground truth bounding box according to a Gaussian distribution. By calculating the intersection-over-union (IoU) overlap with the ground truth, the candidate samples are classified into positive and negative samples. Unlike existing methods, which directly use the classified samples for model training, we fuse the positive and negative samples to obtain fusion samples, as shown in Figure 2. The size of a fusion sample is the maximum of the sizes of the two images. The details are shown in Algorithm 1.
Algorithm 1 (only the final steps are recoverable here):
Obtain fusion sample (x̃, ỹ)
11: end for
12: Obtain N_fus fusion samples
13: Loss = λ * criterion(outputs, y_1) + (1 − λ) * criterion(outputs, y_2)
The hyperparameter α ∈ (0, ∞) controls the strength of interpolation between feature-target pairs, and the weight λ is drawn from a Beta distribution parameterized by α. Finally, when calculating the loss, we compute the loss separately for the labels of the two samples and then take the weighted sum of the two losses according to the weight λ. The experimental results show that data fusion effectively improves the robustness of the model.
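The fusion step and the weighted loss in step 13 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that λ is drawn from a symmetric Beta(α, α) distribution, as in mixup-style augmentation, that the two images have already been brought to the same size and flattened, and that labels are scalars. The names `fuse_samples`, `fused_loss`, and `squared_error` are hypothetical.

```python
import random


def fuse_samples(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Fuse two samples and their labels by a convex combination.

    x1, x2: flattened images of equal length (lists of floats).
    y1, y2: scalar labels, e.g. 1.0 for target and 0.0 for background.
    Assumption: lambda ~ Beta(alpha, alpha), as in mixup-style augmentation.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if len(x1) != len(x2):
        raise ValueError("samples must have the same size")
    lam = rng.betavariate(alpha, alpha)
    x_fused = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y_fused = lam * y1 + (1.0 - lam) * y2
    return x_fused, y_fused, lam


def fused_loss(criterion, output, y1, y2, lam):
    """Weighted sum of the losses against both original labels (step 13)."""
    return lam * criterion(output, y1) + (1.0 - lam) * criterion(output, y2)


def squared_error(output, target):
    """Stand-in criterion used only for this illustration."""
    return (output - target) ** 2


if __name__ == "__main__":
    positive, negative = [1.0, 0.8, 0.6], [0.1, 0.2, 0.3]
    x, y, lam = fuse_samples(positive, 1.0, negative, 0.0, alpha=1.0)
    print(lam, x, y)
    print(fused_loss(squared_error, 0.8, 1.0, 0.0, lam))
```

Computing the loss against both original labels and weighting by λ, rather than against the fused label ỹ, mirrors step 13; the two forms coincide only when the criterion is linear in the target.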

Frequency-Aware Feature
The models currently used for object tracking have fixed structures with fixed-scale convolutional layers. However, shallow convolution layers capture appearance features, while deep convolution layers capture high-level semantic features. As a result, the features produced by traditional convolution contain redundant information, which increases network calculation and reduces the ability of the network to discriminate targets.
To address this issue, we introduce frequency-aware features into object tracking. Inspired by [27], and unlike other tracking methods, which distinguish between the features of different convolution layers, we decompose the features of each convolution layer. The features in a convolution layer are divided into high-frequency features, which contain semantic details, and low-frequency features, which contain the rough structure. Combining the high-frequency features with compressed low-frequency features reduces network calculations and improves the ability of the network to identify targets, as shown in Figure 3.
In Figure 3, the common features are divided into high-frequency and low-frequency features. Compressing the low-frequency part, processing the high-frequency and low-frequency parts, and exchanging information between them reduces the storage and calculation consumed by the convolution operation. The size of the low-frequency part is (0.5h, 0.5w), so its length and width are exactly half of those of the high-frequency part (h, w). Although the low-frequency part is compressed, it effectively expands the receptive field in the original pixel space, which can improve recognition performance. We control the split ratio between high-frequency and low-frequency features through the hyperparameter α, where X denotes the common feature, w and h are the width and height of the feature, c is the channel number, and X^H and X^L are the high-frequency and low-frequency features, respectively.
In the feature update operation, the high-frequency and low-frequency features are updated within their own frequency. The feature exchange operation passes information between the two frequencies. Therefore, the high-frequency feature includes not only the result of its own processing but also the mapping from low frequency to high frequency, and vice versa. Another advantage of the frequency-aware feature is the large receptive field of the low-frequency feature maps. Compared with the ordinary feature, it effectively doubles the receptive field, which further helps each frequency-aware feature capture more contextual information and improve recognition performance. As far as we know, this is the first frequency-aware feature-based Siamese network designed for object tracking.
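To make the effect of the split concrete, the following sketch computes the shapes of the two parts and the resulting reduction in stored activations. It assumes, following the octave-convolution convention of [27], that α is the fraction of channels assigned to the low-frequency part and that the low-frequency map is obtained by 2 × 2 average pooling; neither detail is stated explicitly in the text above, and the function names are hypothetical.

```python
def split_feature_shape(c, h, w, alpha):
    """Return (channels, height, width) of the high- and low-frequency parts.

    Assumption: alpha is the fraction of channels assigned to low frequency.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    c_low = int(round(alpha * c))
    c_high = c - c_low
    return (c_high, h, w), (c_low, h // 2, w // 2)


def activation_ratio(c, h, w, alpha):
    """Activations stored after the split, relative to the common feature."""
    high, low = split_feature_shape(c, h, w, alpha)
    split_size = high[0] * high[1] * high[2] + low[0] * low[1] * low[2]
    return split_size / float(c * h * w)


def avg_pool_2x2(fmap):
    """Halve the resolution of one channel (a list of rows) by 2x2 averaging.

    Assumption: average pooling is used to compress the low-frequency map.
    A trailing odd row or column is dropped.
    """
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


if __name__ == "__main__":
    print(split_feature_shape(256, 32, 32, 0.5))
    print(activation_ratio(256, 32, 32, 0.5))
```

With c = 256, h = w = 32, and α = 0.5 under these assumptions, the split keeps 62.5% of the activations of the common feature, because each low-frequency channel holds a quarter of the spatial positions. Each position in the pooled map also summarizes a 2 × 2 block of the original map, which is why a convolution on the low-frequency part covers twice the extent in the original pixel space.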

Joint Judgment Strategy
The proposed strategy is motivated by the fact that tracking methods use the classification confidence (CC) and the regression confidence (RC) separately, which cannot reflect the positioning accuracy of the bounding box. Because the RC and the CC are not positively correlated, existing tracking methods can only handle the case of high CC with high RC. The other three cases, namely low CC with low RC, high CC with low RC, and low CC with high RC, cannot be handled.
To solve this problem, a joint judgment strategy is designed based on [28]. Through a joint analysis of the classification and regression confidences, the final prediction has both higher classification and higher regression confidence. We assume that the predicted bounding box follows a Gaussian distribution $P_\Theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - x_e)^2}{2\sigma^2}}$, and that the ground truth bounding box follows a Dirac delta distribution $P_D(x) = \delta(x - x_g)$. The KL divergence, which is an asymmetric measure, is used to quantify the difference between the two probability distributions. The localization problem is thus converted into minimizing the KL divergence between $P_D(x)$ and $P_\Theta(x)$: the closer the KL divergence is to 0, the more similar the two distributions are. Minimizing it drives the Gaussian distribution of the predicted bounding box toward the ground truth. The IoU of the predicted bounding box is regarded as the regression confidence. To further improve the accuracy of the bounding box, the candidate bounding boxes within the IoU threshold are averaged with their neighboring bounding boxes to obtain the final bounding box; the same procedure applies to each coordinate, for example the new x1 position of the i-th box, x1_i. The final bounding box obtained in this way has both higher RC and higher CC. By combining the RC and the CC, we effectively handle the three situations mentioned above. Furthermore, because the final bounding box is generated from the predicted neighboring bounding boxes, it is more accurate, which alleviates the loss of the object caused by interference and improves the robustness of the model in complex scenes.
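The two ingredients of the strategy can be sketched as follows. The first function evaluates the KL divergence between the Dirac delta and the Gaussian for one coordinate, dropping the terms that do not depend on the prediction, which leaves $(x_g - x_e)^2 / (2\sigma^2) + \frac{1}{2}\log\sigma^2$. The second averages neighboring candidate boxes. The weighting by the product of CC and RC and the default threshold of 0.5 are assumptions made for illustration only, and all function names are hypothetical.

```python
import math


def kl_regression_loss(x_g, x_e, sigma):
    """KL(P_D || P_Theta) for one coordinate, up to prediction-independent constants."""
    return (x_g - x_e) ** 2 / (2.0 * sigma ** 2) + 0.5 * math.log(sigma ** 2)


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def joint_refine(boxes, cls_scores, reg_scores, iou_threshold=0.5):
    """Average the neighbors of the best candidate, weighted by CC * RC.

    Assumptions: the best candidate maximizes CC * RC, and neighbors are the
    candidates whose IoU with it reaches iou_threshold.
    """
    joint = [c * r for c, r in zip(cls_scores, reg_scores)]
    best = max(range(len(boxes)), key=lambda i: joint[i])
    weights = [
        w if iou(boxes[best], box) >= iou_threshold else 0.0
        for w, box in zip(joint, boxes)
    ]
    total = sum(weights)
    if total <= 0:
        return tuple(boxes[best])
    return tuple(
        sum(w * box[k] for w, box in zip(weights, boxes)) / total for k in range(4)
    )


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    print(joint_refine(candidates, [0.9, 0.8, 0.7], [0.9, 0.8, 0.9]))
```

In this sketch a distant candidate with a high score, such as the third box above, does not pull the result away from the target, because only candidates that overlap the best one contribute to the average.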

Experiments
The proposed method is implemented in Python with the PyTorch toolbox and runs at 45 fps on a PC with a 4-core 4.2 GHz Intel 8700k CPU and two NVIDIA 2080 Ti GPUs with 11 GB of memory. The TrackingNet, OxUvA, and LaSOT datasets are used for pre-training, and the network parameters remain the same for all evaluation datasets. All hyperparameters are set according to related works. The training parameters are as follows. For the backbone network, we freeze all weights during training. For the network, the weight decay is 0.00005 and the momentum is 0.9. Dropout (50%) is used in the first two fc layers. We use the mean-squared error loss function and train for 40 epochs with 64 image pairs per batch. The ADAM optimizer is employed with an initial learning rate of $10^{-3}$, using a factor 0.2 decay every epochs. All experiments are designed with the same protocols and parameters.
Figure 4 shows the tracking results of state-of-the-art methods under one-pass evaluation (OPE) on OTB100. The proposed FAF exhibits high precision and success rates. Compared with the state-of-the-art real-time tracker ATOM, which runs at 30 FPS, our tracker achieves 90.1% and 67.3% in precision and success rates, which are 1.9% and 1.4% higher than ATOM, respectively. KCF uses a handcrafted feature and can track at 160 FPS, but its weak discrimination ability leads to lower tracking accuracy. ECO and MDNet both use deep models with optimization and achieve better tracking performance, but they cannot meet real-time tracking requirements. In addition, our tracker outperforms them in both speed and accuracy in the experiments on the following datasets.
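For readers unfamiliar with the two OPE measures, the sketch below shows how they are commonly computed. It assumes the usual OTB conventions, which are not restated in this paper: precision is the fraction of frames whose center location error is at most 20 pixels, and the success score is the area under the success plot, approximated by averaging the success rate over overlap thresholds from 0 to 1 in steps of 0.05. Boxes are given as (x, y, w, h), and all function names are hypothetical.

```python
import math


def center_error(pred, gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    dx = (pred[0] + pred[2] / 2.0) - (gt[0] + gt[2] / 2.0)
    dy = (pred[1] + pred[3] / 2.0) - (gt[1] + gt[3] / 2.0)
    return math.hypot(dx, dy)


def overlap(pred, gt):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(pred[0] + pred[2], gt[0] + gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[1] + pred[3], gt[1] + gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds, gts, threshold=20.0):
    """Fraction of frames with center error within the pixel threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if center_error(p, g) <= threshold)
    return hits / float(len(gts))


def success_auc(preds, gts, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [overlap(p, g) for p, g in zip(preds, gts)]
    rates = []
    for i in range(steps):
        t = i / float(steps - 1)
        rates.append(sum(1 for o in overlaps if o > t) / float(len(overlaps)))
    return sum(rates) / float(steps)


if __name__ == "__main__":
    gts = [(10, 10, 40, 40), (12, 11, 40, 40)]
    preds = [(12, 9, 38, 42), (30, 30, 40, 40)]
    print(precision_rate(preds, gts), success_auc(preds, gts))
```

Note that under the strict inequality used here a perfect tracker scores slightly below 1, because no overlap can exceed the final threshold of 1.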

Ablation Analysis
To analyze how accuracy and speed depend on alpha, we compare different alpha values on the OTB100 dataset. As shown in Table 1, when alpha is 0 or ∞ we only increase the number of positive or negative samples and no mixup samples are obtained. The tracker is faster without the mixup process. The tracker performs better when alpha = 1, gaining 0.015 and 0.013 over alpha = 0.5 on the precision and AUC rates, respectively. Because the mixup samples generated with alpha = 1 are hard samples, they help the trained model become more robust.
Table 1. Analysis of how the accuracy and speed of the proposed method depend on α. The best results are in bold.
To demonstrate the effectiveness of each component in the proposed method FAF, ablation experiments are performed on OTB2015. The baseline is the original model without any optimization, "I" denotes the baseline with training sample fusion, and "I + II" denotes the baseline with both training sample fusion and the frequency-aware feature. The full version, "I + II + III", denotes the complete model with training sample fusion, the frequency-aware feature, and the joint judgment strategy. The performance of all those variations is shown in Table 2, and every component improves the performance of the proposed method.
Training sample fusion: Training sample fusion increases the diversity of samples and enhances the ability of the model to learn the vicinity relation across different classes, which improves the discrimination ability of the model without extra cost. The results show improvements of 1.1% and 0.8% on the precision and AUC rates, respectively.
Frequency-aware feature: The frequency-aware feature increases the precision and AUC rates by 2.0% and 2.2%, and improves tracking speed by 1.5 times. This is because we decompose the layer feature into high-frequency and low-frequency features, compress the redundant low-frequency feature, and splice the two into multi-frequency features. Without increasing model complexity, the frequency-aware feature reduces the redundancy in low-frequency feature calculations and further improves the feature discrimination ability of the proposed model.
Joint judgment strategy: Finally, to obtain a more accurate target position, the joint judgment strategy considers both classification and regression results. As shown in Table 2, the precision and AUC rates are improved by 0.7% and 0.6%, respectively.

State-Of-The-Art Comparison
We compare our tracker FAF with state-of-the-art methods on four challenging tracking datasets.
VOT2018: VOT2018 consists of 60 test video sequences, and performance is evaluated by failure rate (R), average overlap (A), and Expected Average Overlap (EAO), which provides the overall performance ranking. We choose the short-term tracking tests and compare against state-of-the-art methods. As shown in Table 3, we compare our method with the five top methods on the VOT2018 dataset. Our method achieves the best R and EAO scores while having a competitive A score. Among the top trackers, only SiamRPN++ achieves a higher accuracy score than the proposed method, by 0.003. Compared with ATOM, our method obtains 2.1%, 2.5%, and 0.7% improvements on the EAO, R, and A scores, respectively.
GOT10K: GOT10K includes more than 10,000 video sequences with over 1.5 million target frames, all of which are manually annotated. The dataset consists of five categories: animals, man-made objects, people, natural scenery, and part, which can be subdivided into 563 target categories. Only the GOT10K dataset is used to train the model, and 180 test video sequences are used to evaluate the performance of FAF against five state-of-the-art methods. As shown in Table 4, FAF achieves the best scores with 0.581, 0.453, and 0.672 on the AUC, precision (0.5), and precision (0.75) rates. Compared with the non-real-time methods ECO and MDNet, the proposed method achieves large improvements on all three evaluation indexes.
TrackingNet: TrackingNet uses the video sequences in YouTube-BB and divides the original 23 categories into 27 categories. The video sequences are labeled with 15 attributes, which are either estimated automatically or inspected visually, and a DCF tracker is used to label the missing target boxes. There are 12 chunks of 2511 sequences for training and 1 chunk of 511 sequences for testing. Table 5 shows the results in terms of precision, normalized precision, and AUC. C-RPN achieves scores of 0.619, 0.749, and 0.669 on these three measures, respectively. The proposed method FAF outperforms the second-best method, ATOM, by 1.9%, 1.5%, and 2.4% in terms of precision, normalized precision, and AUC rates, respectively.
LaSOT: LaSOT collects 1,400 sequences and 3.52 million frames of YouTube videos, with an average video length of 2512 frames. It contains 70 categories, each with 20 sequences. The training subset contains 1120 videos with 2.83M frames, and the test subset contains 280 sequences with 690k frames. We evaluate the proposed method against five state-of-the-art methods on the test subset of 280 sequences. The results in terms of normalized precision and success are shown in Table 6. Among those state-of-the-art methods, FAF achieves the best AUC and precision scores, with 0.537 and 0.601. Compared with SiamRPN++, our method improves the AUC and precision rates by 4.1% and 3.2%, respectively.
Table 6. Comparison with state-of-the-art trackers on the LaSOT dataset. The results are presented in terms of precision and AUC. The best and second results are in red and blue, respectively.

Failure Case Analysis
Figure 5 shows two sequences on which the proposed method does not perform well: the Singer2 sequence in the first row and the Tran sequence in the second row. In the Singer2 sequence, the target and the background are too similar, so the proposed method does not distinguish between them accurately and loses the target. In the Tran sequence, the scale and appearance of the target change drastically during tracking. Our tracker does not learn the target characteristics accurately during these rapid and dramatic changes, which eventually causes the target to be lost. In future work, we will try to design a size-aware module and use handcrafted features to solve those problems.

Conclusions
In this paper, we present a novel tracking method, FAF, based on the frequency-aware feature and sample fusion. Our method factorizes the feature map into features of different frequencies and reduces the redundant information. The frequency-aware feature improves discrimination ability by enlarging the receptive field of the layers, while reducing calculations by compressing the low-frequency feature. Further, our method includes a data-independent augmentation for training the object tracking model. The model learns vicinity relations across different classes through convex combinations of both tags and images, which improves its discrimination ability. Finally, a joint judgment strategy based on regression and classification scores is proposed to fine-tune the bounding box of the target, which resolves conflicts between regression and classification scores in complex scenes and improves the robustness of the model. Extensive experiments on five well-known benchmarks show that the proposed FAF performs favorably against SOTA tracking methods while running at around 45 frames per second.