An Anchor-Free Siamese Network with Multi-Template Update for Object Tracking

Abstract: Siamese trackers are widely used in various fields because they balance speed and accuracy. Compared with anchor-based methods, anchor-free approaches can reach faster speeds without any drop in precision. Inspired by the Siamese network and the anchor-free idea, an anchor-free Siamese network (AFSN) with multi-template updates for object tracking is proposed. To improve tracking performance, a dual-fusion method is adopted in which the multi-layer features and the multiple prediction results are each combined. The low-level feature maps are concatenated with the high-level feature maps to make full use of both spatial and semantic information. To make the results as stable as possible, the final results are obtained by combining multiple prediction results. For the template update, a high-confidence multi-template update mechanism is used. The average peak-to-correlation energy is used to determine whether the template should be updated. We use the anchor-free network to implement object tracking in a per-pixel manner, which computes the object category and bounding boxes directly. Experimental results indicate that the average overlap and success rate of the proposed algorithm increase by about 5% and 10%, respectively, compared to the SiamRPN++ algorithm on the GOT-10k (Generic Object Tracking Benchmark) dataset.


Introduction
Visual object tracking is a fundamental research direction in computer vision. It is widely used in diverse fields such as visual surveillance, vehicle tracking, and human-computer interaction [1]. Visual object tracking, which detects and locates a specified target in a changing video sequence, depends on the ground-truth bounding box of the initial frame and then obtains the complete trajectory of the target. Recently, rapid progress has been made in visual object tracking. However, it is still a great challenge in real-world applications, as objects under unconstrained recording conditions often suffer from illumination variation, heavy occlusion, background clutter, and scale deformation, to name a few [1]. Modern trackers can be roughly divided into methods based on correlation filters and methods based on deep learning.

The main contributions of this work are as follows:

1. We propose an anchor-free Siamese network (AFSN) for object tracking, which can perform end-to-end offline training and online tracking. It changes the original strides and receptive field and eventually achieves powerful performance.
2. A dual-fusion method is designed to combine feature maps and prediction results. The high-level features are added to the low-level and middle-level features to make full use of both spatial and semantic information. Applying the weighted-sum method to multiple prediction results can improve accuracy and boost robustness.
3. A multi-template update mechanism is designed to determine whether the template should be updated. The score of the peak-to-correlation energy is used to measure the degree of occlusion of the object and ensure the effectiveness of the template.
4. We propose replacing the RPN module with the anchor-free prediction network, which can decrease the number of hyper-parameters, make the tracker simpler, speed up the tracking process, and enhance performance.
5. The proposed method achieves state-of-the-art performance on the GOT-10k [9], LaSOT (Large-Scale Single Object Tracking) [10], UAV123 (A Benchmark and Simulator for UAV Tracking) [11], and OTB100 (Object Tracking Benchmark 100) [12] tracking datasets.

Related Works
Visual object tracking has always been an active research direction. In recent years, tracking researchers have focused on improving speed and accuracy from different aspects such as feature extraction, template updating, classifier design, and location regression [6]. With the development of various methods, significant progress has been achieved. Siamese trackers have drawn great attention and achieved outstanding tracking performance. This section mainly introduces two aspects: Siamese trackers and anchor-free detection algorithms.

Siamese Network-Based Trackers
A Siamese network consists of two subnetworks with the same structure that share weights. The fully convolutional Siamese network (SiamFC) [2] first introduces the correlation operation to produce a similarity map by sliding windows. The highest score represents the target position. It is simple in construction but effective in improving accuracy. SiamRPN (Siamese Region Proposal Network) [3] constructs a Siamese network followed by an RPN module. It decomposes the tracking task into classification and regression for the object. SiamRPN sets a number of pre-defined anchor boxes to avoid the time-consuming step of extracting multi-scale feature maps for object scale invariance and achieves robust results. DaSiamRPN (Distractor-aware SiamRPN) [4] introduces a sample training strategy to train the distractor-aware module and a local-to-global search strategy for long-term tracking. By increasing the number of high-quality positive and hard negative samples, it improves the discrimination of the tracker and achieves robust tracking performance. Siamese C-RPN (Siamese Cascaded Region Proposal Network) [5] constructs a sequence of RPNs cascaded from deep to shallow layers in the backbone network. Such a cascade structure can balance training samples by encouraging hard negative sampling. The cascaded RPNs are more discriminative in distinguishing difficult and complex distractors. Up to this point, these Siamese trackers could not make use of features from deep networks. To deal with this problem, SiamRPN++ [6] proposes a spatial-aware sampling strategy to avoid putting a strong center bias on objects and then successfully trains the Siamese tracker by using ResNet as a feature extraction network. It also adopts a method to aggregate multiple output results obtained from multi-layer feature maps. Experimental results show that Siamese trackers driven by deep networks can achieve better tracking performance.
To address tedious and heuristic configurations, the Siamese Box Adaptive Network (SiamBAN) [7] views the tracking task as a parallel classification and regression problem, and directly classifies objects and regresses their bounding boxes in a unified FCN (Fully Convolutional Network). ATOM (Accurate Tracking by Overlap Maximization) [13] introduces a classification component that is trained online to guarantee high discriminative power in the presence of distractors. Xuesong Chen et al. [14] present a new optimization objective function with dual-attention mechanisms to generate adversarial perturbations for ensuring the efficiency of the one-shot attack. SiamAttn [15] introduces a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self-attention captures rich context information. The cross-attention aggregates contextual inter-dependencies between the template and the search region. Siam R-CNN (Siamese Re-detection Convolutional Neural Network) [16] takes advantage of re-detections of both the first-frame template and previous-frame predictions to model the target. Its tracklet-based dynamic programming algorithm enables targets to be re-detected after occlusion.
The Siamese trackers above adopt a set of pre-defined anchors to avoid time-consuming computations, which can significantly improve the tracking performance in terms of both accuracy and speed. However, since these trackers need to define anchor boxes with fixed sizes and aspect ratios, they still have difficulty in tracking targets with large-scale deformation and pose change.

Anchor-Free Method of Detection
The current mainstream detectors such as Faster-RCNN [17], SSD (Single Shot Multi-Box Detector) [18], and YOLOv3 (You Only Look Once version 3) [19] rely on a set of pre-defined anchor boxes and achieve state-of-the-art performance. However, these anchor-based detectors have several disadvantages. On the one hand, a set of anchor boxes needs to be pre-defined with large parameters and fixed hyper-parameters, and the detection performance is sensitive to the hyper-parameters related to anchors. On the other hand, to cope well with scale deformation problems, the detectors set a large number of anchor boxes, causing a serious imbalance between positive and negative samples [20].
Therefore, many detectors based on the anchor-free method have been proposed recently. A fully convolutional one-stage object detector (FCOS) [20] eliminates the pre-defined set of anchor boxes and solves object detection in a per-pixel prediction fashion. It completely escapes the large parameters and complex computations related to anchors. CenterNet [21] detects each object using a triplet, including one center keypoint and two corners. These anchor-free approaches can achieve comparable performance to the anchor-based method, but with faster speed.

Methodology
In this section, we introduce the proposed anchor-free Siamese network in detail. As shown in Figure 1, the AFSN adopts the Siamese network to extract multi-level deep features and computes similarity maps between the template and the search region, followed by multiple anchor-free prediction networks. The backbone network consists of two subnetworks with the same structure sharing weights. The proposed dual-fusion method is divided into combining feature maps and prediction results. The high-level features are added to the low-level and middle-level features. The final result is the sum of the multiple prediction results. The anchor-free prediction networks discard the pre-defined anchor boxes and consist of classification and regression networks in per-pixel prediction.

Feature Extraction with a Siamese Network
Previous Siamese networks were designed to be shallow to satisfy strict translation invariance. Recently, the backbone networks for object detection and semantic segmentation tasks have gradually been replaced by deep networks; in Siamese trackers, however, such a replacement leads to decreased performance [1]. The reason is that almost every modern network adds padding structures to make it go deeper, which destroys the strict translation invariance restriction. To this end, SiamRPN++ [6] has conducted detailed analysis experiments on intrinsic factors, obtaining the following conclusions:
(1) A padding structure leads to a spatial bias because of the violation of the restriction. A spatial-aware sampling strategy with a suitable shift can avoid putting a strong center bias on targets.
(2) The original modern networks generally have a larger stride, which is not suitable for Siamese trackers.
Based on the results of the above analysis, the proposed AFSN adopts the modified ResNet-50 [8] as its backbone network. We reduce the original strides of res3 and res4 blocks from 16 and 32 pixels to 8 pixels and increase the receptive field by dilated convolution operation. Meanwhile, a spatial-aware sampling strategy is adopted to train the whole network. To reduce the number of parameters, we change the channels of multilevel feature maps to 256 by a 1 × 1 convolution operation. The backbone network consists of two subnetworks: a target subnetwork, which extracts feature maps of the tracking template patch T, and a search subnetwork, which extracts feature maps of the search region S. The two subnetworks share the same convolutional neural network (CNN) architecture with the same parameters.
Compared with shallow networks such as AlexNet, ResNet-50 can aggregate multiple layers to retain richer information. Multi-level features can provide different representations. Low-level and middle-level features, which primarily focus on spatial features such as the edge, color, and shape of the target, are of great significance for estimating the target location more accurately. High-level features better represent semantic information, which is of vital importance for distinguishing similar distractors and can be beneficial in challenging scenarios such as huge deformation and fast motion. Compounding these representations can improve the inference of classification and localization.
In our network, we use feature maps extracted from the last three residual blocks of ResNet-50, represented as $\{F_2, F_3, F_4\}$. The process of fusion is shown in Figure 1. First, $F_3$ and $F_4$ are concatenated as a unity:

$$\hat{F}_3 = \mathrm{Concat}(F_3, F_4),$$

where $F_3$ and $F_4$ each include 256 channels. Hence $\hat{F}_3$ has 2 × 256 channels. Then $\hat{F}_3$ and $F_2$ are concatenated as a unity:

$$\hat{F}_2 = \mathrm{Concat}(\hat{F}_3, F_2),$$

where $\hat{F}_3$ and $F_2$ include 2 × 256 and 256 channels, respectively. Hence $\hat{F}_2$ has 3 × 256 channels.
Finally, the obtained feature maps $\{\hat{F}_2, \hat{F}_3, F_4\}$ will be used in the following tracking network.
To embed the information of the feature maps from the target patch T and the search region S, the response map $\mathbb{R}$ is obtained by a cross-correlation operation. We denote the extracted feature maps of the target patch T and the search region S as $R_T$ and $R_S$, respectively. Since the following prediction networks take the response map $\mathbb{R}$ as input to generate the location and classification information of the target, the abundant information of the response map is essential for object prediction. We use the depth-wise cross-correlation operation to calculate the response maps on $R_S$ with $R_T$ as a kernel to retain as much information as possible. We compute the response maps by using

$$\mathbb{R} = R_S * R_T,$$

where * denotes the channel-by-channel depth-wise cross-correlation operation. The response maps have the same number of channels as $R_T$ (3 × 256, 2 × 256, 256). Each response map is then convolved with a 1 × 1 kernel to change its channels to 256; as a result, the subsequent prediction computation can be significantly sped up. As shown in Figure 1, the three resulting response maps $\mathbb{R}_2$, $\mathbb{R}_3$, and $\mathbb{R}_4$ (one per feature level) are fed into three prediction subnetworks individually.
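As an illustration of the depth-wise cross-correlation described above, the following minimal sketch slides each channel of a template feature map over the matching channel of a search feature map. It uses plain nested lists rather than PyTorch tensors so that it stays self-contained; the function name `depthwise_xcorr` and the `[channel][row][column]` layout are assumptions made for this example, not the authors' implementation.

```python
def depthwise_xcorr(search, kernel):
    """Channel-by-channel cross-correlation without padding.

    search: nested list shaped [C][Hs][Ws] (search-region features)
    kernel: nested list shaped [C][Hk][Wk] (template features)
    Returns a response shaped [C][Hs - Hk + 1][Ws - Wk + 1].
    """
    if len(search) != len(kernel):
        raise ValueError("search and kernel need the same channel count")
    response = []
    for s_ch, k_ch in zip(search, kernel):
        hs, ws = len(s_ch), len(s_ch[0])
        hk, wk = len(k_ch), len(k_ch[0])
        # Each output value is the dot product of the kernel with one window.
        out = [
            [
                sum(s_ch[i + u][j + v] * k_ch[u][v]
                    for u in range(hk) for v in range(wk))
                for j in range(ws - wk + 1)
            ]
            for i in range(hs - hk + 1)
        ]
        response.append(out)
    return response


# Toy example: 2 channels, 3 x 3 search features, 2 x 2 template features.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
template = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
response = depthwise_xcorr(search, template)
```

Because every channel is correlated only with its own counterpart, the response keeps the channel count of the template features, which is why a 1 × 1 convolution is needed afterwards to bring the channels down to 256.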

Anchor-Free Prediction Network
Each point (i, j) in the response map ℝ can be mapped back onto the search region S as (x, y) by using the corresponding stride s, where s denotes the effective stride.
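The exact mapping formula is not reproduced in this text. A minimal sketch of one common choice, the FCOS-style mapping [20] that places each response-map point at the center of its stride-sized cell, is given below. The half-stride offset and the function name `map_to_search` are assumptions for illustration and may differ from the mapping used in the AFSN.

```python
def map_to_search(i, j, stride):
    """Map a response-map point (i, j) to search-region coordinates (x, y).

    Assumes an FCOS-style mapping: each point corresponds to the center of
    a stride x stride cell, hence the half-stride offset.
    """
    offset = stride // 2
    return offset + i * stride, offset + j * stride


# With an effective stride of 8 pixels, neighboring points are 8 pixels apart.
origin = map_to_search(0, 0, 8)
point = map_to_search(3, 2, 8)
```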
The anchor-free prediction operation completely eliminates the complicated computations and tricky parameter tuning related to anchor bounding boxes. If the point (x, y) falls into the ground-truth bounding box, it is considered a positive training sample; otherwise, it is a negative training sample. The anchor-free prediction subnetwork is shown in Figure 2. Each prediction subnetwork is divided into two branches: a classification branch that classifies the category of each point and a regression branch that regresses the target bounding box at this point. For each response map $\mathbb{R}_k$, the classification branch outputs a classification feature map $C_k$, whose two channels represent the foreground and background scores of the corresponding point, respectively. The regression branch outputs a regression feature map $R_k$, which encodes the location of a predicted bounding box at this point. We set the regression output $R_k(i, j, :) = (l, t, r, b)$ at the point (i, j), which represents the distances from this point to the four sides of the bounding box. To suppress numerous low-quality bounding boxes generated by locations far away from the center of the object, following [20], we add a center-ness branch in parallel with the classification branch to predict the distance between the location and the center of the object. The center-ness branch outputs a center-ness feature map $Cen_k$. Since the output feature maps of the three anchor-free prediction subnetworks have the same spatial resolution, a weighted sum is applied directly to the outputs:

$$C = \sum_{k} \alpha_k C_k, \quad R = \sum_{k} \beta_k R_k, \quad Cen = \sum_{k} \gamma_k Cen_k,$$

where C, R, and Cen denote the fused outputs of the classification, regression, and center-ness branches, respectively, and k indexes the three prediction subnetworks. The combination weights $(\alpha_k, \beta_k, \gamma_k)$ can be end-to-end trained offline together with the whole network.
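The per-pixel targets and the weighted fusion described in this section can be sketched as follows. The corner-style box format `(x0, y0, x1, y1)`, the strict-interior test for positive samples, and all function names are assumptions made for this example; in the network itself the fusion weights are learned, whereas here they are fixed numbers.

```python
def regression_target(x, y, box):
    """Distances (l, t, r, b) from point (x, y) to the four box sides.

    box is (x0, y0, x1, y1): top-left and bottom-right corners (assumed).
    """
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)


def is_positive(x, y, box):
    """A point is a positive sample when it lies inside the box."""
    return min(regression_target(x, y, box)) > 0


def fuse(maps, weights):
    """Weighted sum of equally sized 2D maps, one map per prediction head."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, maps)) for j in range(cols)]
        for i in range(rows)
    ]


box = (10, 10, 50, 40)
target = regression_target(28, 20, box)   # point inside the box
inside = is_positive(28, 20, box)
outside = is_positive(4, 4, box)
fused = fuse([[[1.0, 2.0]], [[3.0, 4.0]]], [0.5, 0.5])
```

The fusion only works element-wise because all three heads produce maps of the same spatial resolution, which follows from the shared stride of 8 in the modified backbone.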

Multi-Template Update Mechanism
Traditional Siamese trackers adopt the features extracted from the first frame as the template. The target often suffers from serious occlusions, large-scale variation, similar background clutters, etc. During the whole tracking process, it is difficult to solve the problems using only the one template extracted from the first frame. Therefore, we adopt a high-confidence multi-template update mechanism to update templates extracted from the last three residual blocks of the backbone network.
To prevent the features of distractors and background from being added into templates, we use the score of the peak-to-correlation energy to ensure the effectiveness of templates. The APCE (Average Peak-to-Correlation Energy) can reflect the degree of occlusion and can be calculated by

$$\mathrm{APCE} = \frac{\left| F_{\max} - F_{\min} \right|^2}{\mathrm{mean}\left( \sum_{w,h} \left( F_{w,h} - F_{\min} \right)^2 \right)},$$

where $F_{\max}$, $F_{\min}$, and $F_{w,h}$ denote the maximum, minimum, and corresponding values at the point (w, h) in the response map, respectively. The numerator reflects the peak value, and the denominator represents the fluctuation of the response map. The peak value and the fluctuation can indicate the confidence degree of the tracking results. When the target is not occluded, the APCE becomes larger and the response map shows only one sharp peak. Otherwise, the APCE dramatically decreases if the target is occluded or missing.
We compute the APCE on the sum of the multiple response maps ($\mathbb{R}_2$, $\mathbb{R}_3$, $\mathbb{R}_4$) and determine whether the threshold is exceeded. An APCE greater than the threshold indicates that the result is reliable. Then, we can update the templates by using

$$R_T \leftarrow (1 - \eta) R_T + \eta R_X,$$

where $\eta$ denotes the update ratio, $R_T$ denotes the template features, and $R_X$ denotes the features extracted from high-confidence results.
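The update rule can be illustrated with a short sketch: compute the APCE of a response map, and blend new features into the template only when the score exceeds a threshold. The linear-interpolation form of the update, the flat-list representation of features, and the names `apce`, `update_template`, and `maybe_update` are assumptions for this example; the threshold and update ratio are hypothetical parameters, since the text does not state their values.

```python
def apce(response):
    """Average peak-to-correlation energy of a 2D response map."""
    values = [v for row in response for v in row]
    f_max, f_min = max(values), min(values)
    # Mean squared deviation from the minimum measures the fluctuation.
    fluctuation = sum((v - f_min) ** 2 for v in values) / len(values)
    if fluctuation == 0:
        return 0.0  # a constant map carries no peak information
    return (f_max - f_min) ** 2 / fluctuation


def update_template(template, candidate, ratio):
    """Blend candidate features into the template with the given ratio."""
    return [(1 - ratio) * t + ratio * c for t, c in zip(template, candidate)]


def maybe_update(template, candidate, response, threshold, ratio):
    """Update the template only for high-confidence tracking results."""
    if apce(response) > threshold:
        return update_template(template, candidate, ratio)
    return template


sharp = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # single clear peak
noisy = [[0, 1, 1], [1, 1, 1], [1, 1, 1]]   # no distinct peak
blended = update_template([1.0, 2.0], [3.0, 4.0], 0.5)
```

In this toy case the single-peak map scores far higher than the noisy one, which matches the behavior described above: a clean peak signals an unoccluded target, so only such frames are allowed to modify the template.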

Loss Function
First, the IOU between the predicted results and the ground truth is computed. Here, we only compute the IOU for positive samples; otherwise, it is set to 0. We then calculate the regression loss from the IOU between the predicted bounding box and g(x, y), the ground-truth bounding box. The score Cen(i, j) in the center-ness output is defined by

$$Cen(i, j) = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}.$$

If the point (i, j) is a negative sample, it should be regressed here. The center-ness loss is computed on this score. The loss function of the AFSN consists of the classification loss function, the center-ness loss function, and the regression loss function, where the classification loss is the cross-entropy loss for the classification branch.
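Since the loss equations are not fully legible in this text, the following is only a hedged sketch of how the center-ness score and an IOU-based regression loss are commonly computed for (l, t, r, b) outputs, following the FCOS formulation [20]. The square root in `centerness` and the negative-log form of `iou_loss` are assumptions borrowed from that line of work and may differ from the exact losses used in the AFSN.

```python
import math


def centerness(l, t, r, b):
    """Center-ness of a positive point from its distances to the box sides."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def iou_from_distances(pred, target):
    """IOU of two boxes that share the same anchor point.

    Both boxes are given as (l, t, r, b) distances from that point.
    """
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    area_pred = (pl + pr) * (pt + pb)
    area_target = (tl + tr) * (tt + tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter_h = min(pt, tt) + min(pb, tb)
    inter = inter_w * inter_h
    union = area_pred + area_target - inter
    return inter / union


def iou_loss(pred, target):
    """Negative-log IOU loss (assumed form); zero for a perfect match."""
    return -math.log(iou_from_distances(pred, target))


center_score = centerness(10, 10, 10, 10)   # point at the box center
edge_score = centerness(5, 10, 20, 10)      # point closer to the left side
half_overlap = iou_from_distances((5, 10, 5, 10), (10, 10, 10, 10))
```

These functions are meant for positive samples only, where all four distances are strictly positive; a point outside the box would need to be masked out before calling them.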

Results and Discussion
Our experiments were implemented in Python with PyTorch on one Nvidia Titan 1080Ti GPU (Graphics Processing Unit). The backbone network of our architecture is pretrained on ImageNet [22], and its parameters are used as the initialization to retrain our model. We trained the whole network on the training sets of COCO (Microsoft Common Objects in Context) [23], ImageNet DET [24], ImageNet VID [24], and YouTube-BoundingBoxes [25] for experiments on GOT-10k [9], LaSOT [10], UAV123 [11], and OTB100 [12]. Specifically, we set a 127 × 127 region centered on the ground-truth bounding box as the template patch and a 255 × 255 region as the search region.
During the training process, the proposed AFSN can be trained end to end. Training runs for 50 epochs in total, with 6000 sample pairs per epoch, using stochastic gradient descent (SGD) with a learning rate exponentially decayed from 0.01 to 0.0001. A weight decay of 0.0005 and a momentum of 0.9 are used. To substantially boost performance, the parameters of the backbone network are frozen while training the anchor-free prediction subnetwork in the first 10 epochs. In the last 40 epochs, the last three blocks of ResNet-50 are unfrozen and trained together with a learning rate 10 times smaller than that of the anchor-free prediction parts. In particular, we use a warm-up learning rate of 0.001 in the first five epochs to train the anchor-free prediction network.
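One possible reading of this schedule is sketched below. The text does not specify how the warm-up and the exponential decay are combined, so the sketch assumes a constant warm-up rate for the first five epochs followed by an exponential decay from 0.01 to 0.0001 over the remaining epochs. The function names and this particular combination are assumptions, not the authors' training code.

```python
def head_learning_rate(epoch, total_epochs=50, warmup_epochs=5,
                       warmup_lr=0.001, start_lr=0.01, end_lr=0.0001):
    """Learning rate of the prediction heads for a zero-based epoch index."""
    if epoch < warmup_epochs:
        return warmup_lr  # assumed constant during warm-up
    steps = total_epochs - warmup_epochs - 1
    progress = (epoch - warmup_epochs) / steps
    # Exponential (geometric) interpolation between start_lr and end_lr.
    return start_lr * (end_lr / start_lr) ** progress


def backbone_learning_rate(epoch, frozen_epochs=10):
    """Backbone rate: frozen at first, then 10 times smaller than the heads."""
    if epoch < frozen_epochs:
        return 0.0
    return head_learning_rate(epoch) / 10


schedule = [(e, head_learning_rate(e), backbone_learning_rate(e))
            for e in range(50)]
```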

Results on GOT-10k
GOT-10k is a recently released large, high-diversity benchmark for generic object tracking in the wild [9]. It collects over 10,000 video segments and manually annotates more than 1.5 million high-precision bounding boxes. The classes in the training and testing sets do not overlap. It provides the class-balanced metrics mAO and mSR for the evaluation of generic object trackers and builds an official website that offers an online evaluation server [9]. Authors need to test their models on the given testing dataset and upload the tracking results to the website. The provided evaluation indicators include success plots, the average overlap (AO), and the success rate (SR). The AO denotes the average of the overlaps between all ground-truth boxes and estimated bounding boxes. The SR denotes the percentage of successfully tracked frames where the overlaps exceed a threshold. SR0.5 and SR0.75 represent the rates of successfully tracked frames whose overlap exceeds 0.5 and 0.75, respectively.
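The AO and SR definitions above can be made concrete with a small sketch. It assumes boxes in `(x, y, width, height)` format and computes the metrics over a single toy sequence; the official GOT-10k toolkit additionally averages per class, which is not reproduced here, and the function names are chosen for this example only.

```python
def iou_xywh(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(gts, preds):
    """AO: mean overlap between ground-truth and estimated boxes."""
    overlaps = [iou_xywh(g, p) for g, p in zip(gts, preds)]
    return sum(overlaps) / len(overlaps)


def success_rate(gts, preds, threshold):
    """SR: fraction of frames whose overlap exceeds the threshold."""
    overlaps = [iou_xywh(g, p) for g, p in zip(gts, preds)]
    return sum(o > threshold for o in overlaps) / len(overlaps)


gts = [(0, 0, 10, 10)] * 3
preds = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 5, 5)]
ao = average_overlap(gts, preds)
sr50 = success_rate(gts, preds, 0.5)
sr75 = success_rate(gts, preds, 0.75)
```

In this toy sequence the three frames have overlaps of 1, 1/3, and 0, so only the first frame counts as a success at either threshold.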
In this experiment, we compared our method with several trackers, including SiamRPN++ [6], CCOT [26], SPM [27], and Staple [28]. All the results are provided by the official website of GOT-10k. Table 1 lists the comparison details of different indicators. It shows that the proposed AFSN ranks first in AO, SR0.5, and SR0.75. Lines 3 and 4 of the table suggest that, compared with SiamRPN++, the proposed AFSN improves the score by 5.8% on SR0.5 and 10% on SR0.75. As shown in Figure 3, the proposed AFSN outperforms the other trackers and obtains a 5% gain over SiamRPN++ in the overlap success rate. These results validate that the AFSN tracker can estimate more precise bounding boxes and has good generalization for a visual object.
When the dual-fusion method and multi-template update mechanism are removed, the proposed algorithm can run at 42 fps. Under the same conditions, the SiamRPN++ algorithm runs at 38.71 fps. The running speed of the proposed algorithm is higher than that of the anchor-based algorithm. This indicates that the anchor-free prediction network can decrease the number of hyper-parameters, make the tracker simpler, and speed up the tracking process.
In this part, we qualitatively compare the proposed AFSN with four different trackers in Figure 4. As our AFSN concatenates deeper semantic information and low-level spatial information, the tracking results shown in the figure can distinguish targets and background accurately. Furthermore, the AFSN can locate the target more accurately with minimum error.

Results on LaSOT
To further validate the proposed AFSN on a more challenging dataset, we conducted experiments on LaSOT. The dataset contains more than 3.52 million manually annotated frames and 1400 videos [10]. It contains 70 classes, and each class includes 20 tracking sequences. The official website of LaSOT provides 280 videos with high-quality dense annotations in the testing set and 35 algorithms as baselines. The provided evaluation indicators include normalized precision plots, precision plots, and success plots in one-pass evaluation (OPE).
We compared our tracker with the top 20 trackers, including SiamRPN++ [6], MDNet [36], MEEM [33], ECO [34], SiamFC [2], DSiam [37], and other baselines. Figure 5 reports the overall performance of our AFSN on the testing set. From Figure 5, we observe that the proposed AFSN outperforms the best tracker by 0.2%, 0.4%, and 0.7%, respectively, for the three indicators. Notably, compared with the provided baselines, our AFSN raises the scores by more than 11%, 12%, and 10%, respectively, for the three indicators over MDNet, which is the best tracker reported on the official website of the LaSOT dataset.

Results on UAV123
The UAV123 dataset includes in total 123 video sequences comprising more than 110,000 frames [11]. The objects in the dataset mainly suffer from an aspect ratio change, background clutter, fast motion, and illumination variation, which make tracking challenging.
We tested the proposed tracker on the UAV123 dataset in comparison with several representative trackers, including ATOM [13], RLS-RTMDNet [38], DaSiamRPN [4], SiamRPN [3], ECO [34], SRDCF [30], MEEM [33], and KCF [29]. We show the results of the comparison on the UAV123 dataset in Table 2 and Figure 6. As shown in Table 2, our AFSN obtains the best results, with an AUC of 61.3%, OP0.50 of 75.8%, and OP0.75 of 55.4%. Our AFSN improves the scores by 9.7%, 12.5%, and 23.4%, respectively, for the three indicators over RLS-RTMDNet. Compared with the ATOM algorithm with Siamese-based architecture, our AFSN improves the scores by 0.7% on OP0.50 and 7.8% on OP0.75. From Figure 6, we can observe that the proposed AFSN tracker achieves the best performance on the sequences with the camera motion and viewpoint change attributes. Compared with SiamCAR [39], our AFSN improves the scores by 0.1% and 0.4%, respectively, for the two attributes. Compared with DaSiamRPN, our AFSN improves the scores by 1.4% and 3.9%, respectively, for the two attributes. The results demonstrate that our proposed network represents a visual object well, with deeper and richer feature representation.

Results on OTB100
OTB100 contains 100 fully annotated sequences tagged with 11 attributes to represent the challenging aspects, including illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out of view, background clutter, and low resolution [12]. Each attribute represents a specific challenging factor in object tracking. The evaluation is based on two metrics: the precision plot and the success plot. The precision plot represents the percentage of frames in which the estimated location is within a given threshold distance of the ground-truth position. The success plot represents the ratio of successful frames whose overlap is larger than a given threshold ranging from 0 to 1. The area under the curve (AUC) of each success plot is used to rank trackers. In this experiment, we compared our network with eight representative approaches, including SiamRPN [3], Staple [28], and SiamFC [2]. Figure 7 illustrates the precision and success plots of the compared trackers. Compared with SiamRPN, the proposed AFSN improves the success rate by 0.7% and the precision by 0.6%. As shown in Table 3, the proposed AFSN significantly improves the tracking success rate for the attributes of background clutter, fast motion, motion blur, out of view, and scale variation. The results demonstrate that the AFSN can better deal with similar distractors and large pose variation, which evidences the importance of richer features in enhancing the robustness and effectiveness of our anchor-free prediction network.
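As a rough sketch of the two OTB-style metrics defined above, the code below computes precision from center location errors and approximates the success-plot AUC as the mean success rate over evenly spaced overlap thresholds. The 20-pixel default threshold, the 21 threshold samples, the `(x, y, width, height)` box format, and the function names are assumptions for this example rather than values stated in the text.

```python
import math


def center(box):
    """Center point of an (x, y, width, height) box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def precision(gts, preds, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    errors = [math.dist(center(g), center(p)) for g, p in zip(gts, preds)]
    return sum(e <= threshold for e in errors) / len(errors)


def success_auc(overlaps, steps=21):
    """Mean success rate over thresholds evenly spaced in [0, 1]."""
    thresholds = [k / (steps - 1) for k in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


gts = [(0, 0, 10, 10)] * 3
preds = [(0, 0, 10, 10), (5, 0, 10, 10), (40, 40, 10, 10)]
prec = precision(gts, preds)
auc = success_auc([1.0, 0.5, 0.0])
```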

Ablation Study
To study the effectiveness of different levels of feature maps, we performed ablation experiments on OTB100. As shown in Table 4, multi-layer feature maps can improve tracking performance effectively. A model using one single feature map achieves a performance of 0.603 and 0.827 in the AUC and precision, respectively, on OTB100. A model using two feature maps obtains performance gains of 0.022 (0.625 vs. 0.603) in the AUC and 0.018 (0.845 vs. 0.827) in precision on OTB100. A model using three feature maps obtains performance gains of 0.017 (0.642 vs. 0.625) in the AUC and 0.007 (0.852 vs. 0.845) in precision on OTB100. The proposed model with a multi-template update mechanism obtains performance gains of 0.002 (0.644 vs. 0.642) in the AUC and 0.005 (0.857 vs. 0.852) in precision. Overall, adding more feature maps contributes clearly to further improvements, and the multi-template mechanism is effective for improving precision.

Conclusions
In this paper, an anchor-free Siamese tracking framework (AFSN) for visual object tracking is proposed. The proposed AFSN adopts the deep network ResNet-50 as its feature extraction network. A dual-fusion method is used to take advantage of multi-layer information, which creates more distinguishable feature representations for dealing with complex distractors. For the template update, a high-confidence multi-template update mechanism is used to determine whether the template should be updated. In addition, the pre-defined set of anchor boxes is omitted and the object-tracking task is implemented in a per-pixel fashion. The experimental results on the GOT-10k [9], LaSOT [10], UAV123 [11], and OTB100 [12] datasets show that the proposed AFSN can achieve state-of-the-art results and run in real time, which demonstrates the generalizability and robustness of our AFSN. We hope that our AFSN can be improved to achieve better performance after replacing some specific modules in future work.