Siamese Tracking Network with Spatial-Semantic-Aware Attention and Flexible Spatiotemporal Constraint

Abstract: Siamese trackers based on classification and regression have drawn extensive attention due to their appropriate balance between accuracy and efficiency. However, most of them are prone to failure in the face of abrupt motion or appearance changes. This paper proposes a Siamese-based tracker that incorporates spatial-semantic-aware attention and a flexible spatiotemporal constraint. First, we develop a spatial-semantic-aware attention model, which identifies the importance of each feature region and channel to target representation through a single-convolution attention network with a loss function and increases the corresponding weights in the spatial and channel dimensions to reinforce the target region and semantic information on the target feature map. Second, considering that the traditional method unreasonably weights the target response under abrupt motion, we design a flexible spatiotemporal constraint. This constraint adaptively adjusts the constraint weights on the response map by evaluating the tracking result. Finally, we propose a new template-updating strategy. This strategy adaptively adjusts the contribution weights of the tracking result to the new template using depth correlation assessment criteria, thereby enhancing the reliability of the template. The Siamese network used in this paper is a symmetric neural network with dual input branches sharing weights. The experimental results on five challenging datasets show that our method outperformed other advanced algorithms.


Introduction
In recent years, the performance of target-tracking algorithms has been greatly improved with the development of handcrafted feature-based trackers and deep learning-based trackers. Target-tracking technology has found widespread applications in computer vision fields such as smart cities, autonomous driving, and video surveillance [1][2][3][4]. However, challenges such as target appearance changes during tracking, complex backgrounds, and the presence of similar objects can lead to tracking drift. Therefore, it is still crucial to design a robust tracking algorithm that can effectively handle the target's abrupt motion.
Siamese network-based tracking has drawn extensive attention due to its appropriate balance between accuracy and efficiency. SiamFC [5] maps the search patch into multiple scales and selects the scale with the highest classification score as the target scale for the current frame to predict the bounding box (Bbox). Zhang et al. [6] leveraged deeper and wider convolutional neural networks to further improve the tracking robustness and accuracy. However, the multiple-scale strategy adapts poorly to targets undergoing deformation and increases the number of model parameters. Li et al. [7] combined the Siamese network and the region proposal network to predict the scale variation of the target, which improved the model speed and enhanced the adaptability to deformed targets. To further simplify the model and reduce the computational complexity, some studies [8,9] introduced the anchor-free mechanism into the tracking field, easing the tuning of complex parameters in anchor-based methods. The above studies are devoted to optimizing the feature extraction network or regression function to improve the accuracy of the Bbox and the tracking efficiency. However, such methods still have some limitations.
In real tracking scenarios, a complex background can cause the predicted box to deviate from the ground truth or even drift toward distractors, especially when the target undergoes drastic appearance changes or moves suddenly over long distances. To address these challenges, existing Siamese network-based trackers [10,11] introduce a centerness or quality estimation branch independent of the classification branch to suppress excessive displacement, which solves the problem of performance degradation caused by using classification confidence directly for bounding box selection. Chen et al. [12] further proposed the Siamese center prediction network. This model predicts an object's location by correcting the target position appropriately through the offset branch. Some recent methods [13,14] build links between classification and regression, optimizing them in a synchronized manner for consistent inference. Most of these methods add extra branches or networks to improve the accuracy of target localization. In addition, in order to improve the confidence of the response map, some researchers introduced a series of fixed-window penalty functions [15][16][17] into the tracking model to alleviate the boundary effect, and these methods suppress the interference response to a certain extent. However, a pretrained deep network is not enough to model arbitrary forms of target features when the target state changes significantly, and the extracted target depth features may be redundant. Therefore, it is crucial to adaptively adjust the target features based on different target poses. Additionally, when the target undergoes sudden long-distance movements, the incorrect spatial penalty term can result in the response value of the distractor being higher than that of the target, significantly increasing the probability of tracking drift. Moreover, the absence of a robust target template update mechanism can lead to model degradation during the tracking process.
This work presents a Siamese-based method that addresses the aforementioned limitations. A Siamese network is a symmetric network with two input branches that share the same network structure and weights and is widely used in tracking algorithms. Our method contains a spatial-semantic-aware attention model, a flexible spatiotemporal constraint strategy, and an adaptive weight template update model. The proposed algorithm combines the response results of low-level feature maps and high-level feature maps to determine the target. While multilayer features contain richer target information, the deep features of a pretrained network contribute unequally to visual tracking. We establish a spatial-semantic-aware attention model that focuses on the most informative region of the target feature map. This model strengthens feature channels with rich target semantic information by assigning them higher weights. Secondly, we observe that the fixed-window penalty function may decrease the confidence value of the correct target on the score map. To overcome this issue, we designed a flexible spatiotemporal constraint strategy which adaptively adjusts the penalty weights on the confidence map to reduce the probability of tracking failure. In order to further adapt to the target deformation, we designed an adaptive weight template-updating strategy to enhance the robustness of the tracking model. The contributions of this work can be summarized as follows:
(1) A spatial-semantic-aware attention model is proposed for visual tracking. We employ a single convolutional spatially aware attention model to adaptively adjust the significance of various feature regions, thereby emphasizing the most informative location on the target feature map. Additionally, the single convolutional channel attention network is used to strengthen target-specific channels that have more target semantic information, which is achieved by increasing their weights. This approach facilitates the learning of effective feature representations for high tracking performance.
(2) We propose a flexible spatiotemporal constraint which adaptively adjusts the constraint weights on the response map by evaluating the tracking result features. This constraint addresses the issue of the fixed-window function incorrectly penalizing the target confidence when tracking fails. By incorporating the flexible spatiotemporal constraint, we can obtain a more reliable confidence score for the target location and avoid low-quality but high-scoring tracking results.
(3) We designed an adaptive weight template-updating strategy to mitigate model degradation caused by target appearance changes. This update mechanism evaluates the correlation between the target templates and tracking results using the depth-correlation assessment criteria and thus adaptively assigns weights to both the templates and tracking results to gather reliable template samples. Our update mechanism prevents template contamination while enriching template information.
In this paper, we first briefly review some classical tracking algorithms from recent years, especially those involving attention mechanisms and spatiotemporal constraints, and discuss some disadvantages of the current approaches. Next, we describe in detail our proposed spatial-semantic-aware attention Siamese tracker with a flexible spatiotemporal constraint. Extensive experiments on the OTB100, NFS, UAV123, VOT2016, and TC128 datasets demonstrate the superiority of our approach. Finally, the advantages of the proposed method are summarized, and its limitations are discussed.

Siamese-Based Trackers
The Siamese structure poses a similarity learning problem in which the similarity loss between two input images is calculated, and the shared parameters of two parallel convolutional streams are trained by backpropagation. Siamese-based trackers [18][19][20] solve the similarity matching problem between the target and the search area well and have become popular in the tracking field (see Table 1). SiamFC [5] performs similarity learning through deep cross-correlation, which transforms the tracking process into template matching. SiamFC has greatly improved tracking accuracy and efficiency compared with previous studies, but it is not well adapted to various challenging scenarios. To further improve the tracking performance, researchers have designed a number of Siamese trackers based on SiamFC that can adapt to more tracking challenges. SiamRPN [7] first embedded the region proposal network (RPN) [21] into Siamese trackers to transform the global search into a region-specific detection task [22]. The bounding box regression reduces the amount of computation while improving the accuracy. In order to obtain more abundant target information, SiamMCF [20] and DSiam [18] incorporate cross-correlation on multiple layers to enhance the discriminative ability of the model. SiamBAN [8] and SiamCAR [10] designed an anchor-free strategy to avoid the complex parameters caused by setting anchor boxes and further enhance the flexibility of the model.

Trackers with the Attentional Mechanism
The attentional mechanism was first applied in neuroscience and has expanded into other fields, such as image detection [23,24] and target tracking [25]. DVAT [26] proposed the concept of 'attention regions', which applies spatial attention to two different discriminative domains (local and semi-local), effectively focusing the attention of the tracker. RTT [27] developed recurrent neural networks (RNNs) to identify components that are useful for target modeling and then used the resulting confidence map to suppress background clutter. Wang et al. [28] constructed a Residual Attentional Siamese Network (RASNet) using different types of attention mechanisms to improve the discriminative ability of the tracking model. Rahman et al. [29] added channel attention and spatial attention behind the pretrained features to further reduce the distracting information in the target template features. In contrast to these attention mechanisms, we propose to learn attention through an online-trained deep network. We use the single convolutional attention network framework to adaptively adjust the importance of spatial and channel features to target representation.

Trackers with Spatiotemporal Constraint
Since the target motion was mostly smooth in early target tracking, researchers developed spatiotemporal constraint strategies to suppress tracker drift. MOSSE [30] and KCF [31] avoid boundary effects in the tracking process by introducing cosine window constraints. In addition to cosine windows, other tracking algorithms also introduce window functions such as the Blackman window [15], rectangular window [16], and Hamming window [17] to prevent boundary effects. When the target displacement between two frames is small and the background is simple, these fixed spatiotemporal constraint functions generally improve the tracking results. However, when the tracker fails due to a large displacement of the target between two frames in the video, these fixed spatiotemporal constraints may cause a large weight loss for the confidence of the correct target on the response map, and thus the tracker cannot be corrected. Li et al. [32] developed the NA window to suppress these customized incorrect weights, which improves the SNR of windowed ROIs by adaptively suppressing the variable noise according to the observation of similarity maps. Different from the above methods, this paper focuses on adjusting the spatiotemporal constraint under different conditions according to the quality of the tracking result box, which adaptively adjusts the penalty weights during the tracking process to reduce the probability of tracking failure.

The Proposed Method
To address the problem that most tracking methods easily fail during abrupt motion and target appearance changes, we propose a tracking framework that can handle these cases. In specific tasks, the depth features acquired by the pretrained network differ in their importance to the target representation, and treating them equally weakens the tracker's ability to discriminate between target and background information. In addition, we find that the fixed-window function unreasonably weights the confidence values on the response map and thereby produces lower-quality but higher-scoring tracking results. Finally, we develop a flexible template-updating strategy to mitigate model degradation.
Figure 1 shows the overall scheme of our proposed algorithm. It is a tracker based on classification and regression that uses ResNet-50 as the feature extraction network. In our work, Conv3, Conv4, and Conv5 from the ResNet-50 [33] network were selected to extract image features. Since the shallow features contain more spatial structure information while the deep features contain rich semantic information [34], we introduced spatial attention in Conv3 and channel attention in Conv4 and Conv5 to highlight the information that was valuable for target representation. Then, the classification (Cls) map and regression (Reg) map were obtained by correlation matching between the target template feature and the search area feature. The Cls map estimated the probability that each position in the search area was the target, and the Reg map performed bounding box prediction. Next, the multilayer depth features of each frame's tracking result were compared with the template based on the Euclidean distance to determine whether the flexible spatiotemporal constraint strategy should be activated. Once activated, this strategy gradually increased the weight at the edge of the confidence map with the duration of target loss to help the tracker recover. Finally, the adaptive weight template-updating strategy was used to generate a new template for tracking in the next frame.

Spatial-Semantic-Aware Attention Model
Human visual perception usually does not need to focus on the whole environment but rather on part of the target to gather comprehensive information about it and thus understand the corresponding visual patterns [35]. Coordinate attention [36] enables mobile networks to focus on a larger area by embedding positional information into the channel domain. Yang et al. [23] proposed that dual wavelet attention can coordinate spatial and structural attention for different channels to prevent the loss of feature information and structural features. Since single-target tracking is similar to focusing on the most salient features, it is advantageous to focus on the critical regions of the target feature map. Unlike other trackers with attention mechanisms, we propose a spatial-semantic-aware attention model where the spatial-aware attention model focuses on prominent target regions in the shallow feature map, while the semantic-aware attention model distinguishes the importance of different channels of deeper features for target representation.

Spatial-Aware Attention Model
For the tracking target, the depth features are constructed from multiple two-dimensional feature maps. However, not all regions of the depth features obtained by the pretrained network contribute equally to the tracking task, and only the locations related to the task need to be focused upon.
Spatial attention focuses on 'where' an informative part is and enhances the informative features of the target in the image to facilitate target localization. To implement this attention, we performed global max pooling $GP_{\max}(\cdot)$ and average pooling $GP_{\mathrm{avg}}(\cdot)$ on the Conv3 feature map $F_M^{H\times W\times C}$ and fused the resulting pooled features $F_{\max}^{H\times W\times 1}$ and $F_{\mathrm{avg}}^{H\times W\times 1}$ in the channel domain. This kind of local convolution operation can focus on the desired information on the feature map.
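As a rough illustration of the pooling and fusion step, the sketch below computes max and average maps for a tiny feature map stored as nested lists. It assumes that the pooling runs along the channel axis, as the H×W×1 output sizes suggest; the function names `channel_pool` and `fuse` and the toy values are hypothetical and are not taken from the paper's implementation.

```python
def channel_pool(feature_map):
    """Return (max_map, avg_map), each H x W, pooled over the channel axis.

    feature_map is a nested list indexed as [row][col][channel].
    """
    max_map = [[max(pixel) for pixel in row] for row in feature_map]
    avg_map = [[sum(pixel) / len(pixel) for pixel in row] for row in feature_map]
    return max_map, avg_map


def fuse(max_map, avg_map):
    """Stack the two pooled maps into an H x W x 2 tensor (channel concat)."""
    return [
        [[m, a] for m, a in zip(max_row, avg_row)]
        for max_row, avg_row in zip(max_map, avg_map)
    ]


# Toy 2 x 2 feature map with 3 channels (values are illustrative only).
toy = [
    [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]],
    [[4.0, 4.0, 4.0], [2.0, 8.0, 5.0]],
]
max_map, avg_map = channel_pool(toy)
fused = fuse(max_map, avg_map)
```

The fused two-channel map is what a subsequent convolution would reduce to a single-channel attention map.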
After fusing the two pooled features, we used a convolution layer $\psi^{3\times 3}_{1}$ to reduce the number of feature channels to one and obtain a single-channel feature map (the 3 × 3 convolution filter gave the best result in our experiments). The single-channel feature map was then passed through a sigmoid operation and multiplied, with broadcasting along the channel dimension, by the Conv3 feature map $F_M^{H\times W\times C}$ to obtain the spatial attention feature map $S_A^{H\times W\times C}$, with the ultimate effect shown in Figure 2. In this computation, concat[·] represents the concatenation operation, $\psi^{3\times 3}_{1}$ is the convolution operation with a 3 × 3 kernel and a padding and stride of one, and $\delta$ represents the usual sigmoid function $f(x) = \frac{1}{1+e^{-x}}$.
In addition, to make our aware attention mechanism more compatible with different targets, we adjusted the feature weights online using a single convolutional layer. Specifically, all the samples $S_A^{H\times W\times C}$ acquired by the attention mechanism were convolved into one-dimensional features and regressed to a Gaussian label map defined over $(i, j)$, the offset against the target, with kernel width $\sigma$. Then, the new aware attention weight $\phi_s'(\cdot)^{H\times W\times 1}$ was obtained by minimizing a regularized regression objective in which ⊙ denotes the convolution operation, $W$ is the regression weight, and $\lambda$ is a regularization parameter that inhibits overfitting during training.
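The computation described in this subsection can be written compactly as below. This is a reconstruction from the prose and the symbol definitions rather than a verbatim copy of the paper's equations; in particular, the label symbol $Y$, the element-wise product $\otimes$, and the squared-norm form of the objective are assumptions.

```latex
\begin{aligned}
\phi_s(F_M)^{H\times W\times 1} &= \delta\!\left(\psi^{3\times 3}_{1}\!\left(\mathrm{concat}\!\left[GP_{\max}(F_M),\, GP_{\mathrm{avg}}(F_M)\right]\right)\right), \\
S_A^{H\times W\times C} &= \phi_s(F_M)^{H\times W\times 1} \otimes F_M^{H\times W\times C}, \\
Y(i,j) &= \exp\!\left(-\frac{i^2 + j^2}{2\sigma^2}\right), \\
\min_{W}\; & \left\lVert W \odot S_A^{H\times W\times C} - Y \right\rVert^2 + \lambda \left\lVert W \right\rVert^2 .
\end{aligned}
```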
After online training on the target in the first frame, we could find better attention weights $\phi_s'(\cdot)^{H\times W\times 1}$. Lastly, the features reinforcing the target area were obtained from the spatial-aware attention module using these weights.

Semantic-Aware Attention Model
Some feature channels have a more prominent contribution to modeling the visual pattern of an object; that is, different channels contain different semantic information about the target. Therefore, each channel should not be treated equally when using these depth features for tracking.
For the Conv4 and Conv5 features obtained from the backbone network, a global average pooling operation was performed (the squeeze process), which averages each channel over all spatial positions. Here, $F_{sq}(\cdot)$ represents the squeeze process, $F_M^{H\times W\times C}$ represents the Conv4 or Conv5 features, and $W$ and $H$ are the width and height of the feature map, respectively.
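Written out, the squeeze step is the standard global average pooling over spatial positions. The index notation $F_M(i, j)$ below is an assumption introduced for illustration:

```latex
F_{sq}\!\left(F_M^{H\times W\times C}\right) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_M(i, j)
```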
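We obtained two feature vectors, $F^{1\times 1\times 1024}$ for Conv4 and $F^{1\times 1\times 2048}$ for Conv5, from which the channel weights were computed, where $\sigma$ represents the usual sigmoid function $f(x) = \frac{1}{1+e^{-x}}$ and ReLU(·) is a rectified linear unit layer.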
Then, the weights $\phi_c(\cdot)^{1\times 1\times 1024}$ and $\phi_c(\cdot)^{1\times 1\times 2048}$ were multiplied by the corresponding channels of the Conv4 and Conv5 features to acquire the output features $C_A^{H\times W\times C}$ of the channel attention model.
Similar to Section 3.1.1, we assigned different weights to each channel using a single convolutional layer. We convolved all the multi-channel $C_A^{H\times W\times C}$ values acquired by the channel attention mechanism into one-dimensional features and regressed them to a Gaussian label map. The improved aware attention weight $\phi_c'(\cdot)^{H\times W\times 1}$ was obtained by minimizing a corresponding objective function, and the target semantic features were then obtained from the semantic-aware attention module using this weight.
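The squeeze, weighting, and rescaling steps of the channel attention can be sketched in a few lines of plain Python. This is a minimal illustration assuming a squeeze-and-excitation style bottleneck (one ReLU layer followed by a sigmoid layer); the helper names and the toy weight matrices `w_down` and `w_up` are hypothetical, and the online regression step is not included.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one mean value per channel of an H x W x C map."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[r][c][k] for r in range(height) for c in range(width))
        / (height * width)
        for k in range(channels)
    ]


def dense(vector, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excite(squeezed, w_down, w_up):
    """ReLU bottleneck followed by a sigmoid, giving one weight per channel."""
    hidden = [max(0.0, h) for h in dense(squeezed, w_down)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w_up)]


def reweight(feature_map, channel_weights):
    """Scale every channel of the feature map by its attention weight."""
    return [
        [[value * w for value, w in zip(pixel, channel_weights)] for pixel in row]
        for row in feature_map
    ]


# Toy 1 x 2 feature map with 2 channels; the weights below are hypothetical.
toy = [[[1.0, 4.0], [3.0, 0.0]]]
w_down = [[1.0, -0.5]]      # 2 channels -> 1 hidden unit
w_up = [[2.0], [-2.0]]      # 1 hidden unit -> 2 channels
squeezed = squeeze(toy)
weights = excite(squeezed, w_down, w_up)
attended = reweight(toy, weights)
```

With these toy weights the first channel receives the larger attention weight, so its values are attenuated less than those of the second channel.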

Flexible Spatiotemporal Constraint
Most of the existing trackers were proposed under the assumption of smoothness; that is, researchers assume that the target displacement between two frames will not be too large, and thus various window functions were proposed to penalize the final response map (assigning a value in [0, 1] according to the distance between the sample center and the target in the previous frame). This can improve the confidence of the target response to a certain extent. But in actual tracking scenes, there will always be some similar targets or other interference information that leads to tracker drift. Once the tracking fails, the response of the correct target location will be continuously suppressed under the action of the fixed-window function, resulting in low-quality but high-scoring tracking results. The fixed-window function (Hanning window) fails to correct the tracker when the target deviates too far from the center of the search area, as shown in Figure 4. Therefore, to reduce the continuous negative impact of fixed spatiotemporal constraints on the target when the tracker fails, we developed a flexible spatiotemporal constraint strategy.
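The failure mode described above can be reproduced on a one-dimensional toy response. The sketch assumes the additive blend between response and window that is common in Siamese trackers; the `influence` value and the response scores are hypothetical and chosen only to make the effect visible.

```python
import math


def hanning(length):
    """Symmetric Hanning window of the given length (values in [0, 1])."""
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
        for i in range(length)
    ]


def penalize(response, window, influence):
    """Blend the raw response with the fixed window (assumed additive form)."""
    return [(1.0 - influence) * r + influence * w for r, w in zip(response, window)]


# 1-D response over 9 positions: a distractor at the centre (index 4) and
# the true target at the edge (index 8) with a higher raw score.
response = [0.05, 0.05, 0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.80]
window = hanning(len(response))
scores = penalize(response, window, influence=0.4)

raw_peak = max(range(len(response)), key=response.__getitem__)
windowed_peak = max(range(len(scores)), key=scores.__getitem__)
```

Here the raw response peaks at the edge target, while the windowed score peaks at the central distractor, which is the low-quality but high-scoring result that a fixed window can produce.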

[Figure 4 legend: Hanning window; flexible spatiotemporal constraint; ground truth.]

Generally speaking, due to the smoothness assumption, the depth features of the target will not change greatly between adjacent frames. Therefore, when the tracker produces low-quality tracking results, the depth features of the tracking result will be significantly different from the template features. Based on this, we can decide whether to switch the spatiotemporal constraint by evaluating the depth features of both the tracking result and the target template. We resized the tracking results to the same size as the target template and used the backbone network ResNet-50 to obtain the three-layer depth features of the tracking results. For the tracking result and target template, we compared the depth features of their corresponding layers based on the Euclidean distance. We switch the spatiotemporal constraint when the criterion in Equation (12) is met, where $T^{G}_{0,L}$ is the initial template feature, $T^{R}_{t,L}$ is the tracking result feature ($t$ is the sequence number of the frame, while $L$ is the layer index of the features), and $T^{G}_{l,x,y}$ and $T^{R}_{l,x,y}$ are the feature pixel values of the template and tracking result, respectively ($l$ is the channel ordinal number, while $x$ and $y$ represent the pixel position index).
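The layer-wise comparison described above can be sketched as follows. The per-layer thresholds and the rule that every layer must exceed its threshold are hypothetical stand-ins for the criterion in Equation (12), which is not reproduced here; features are represented as flattened lists for simplicity.

```python
import math


def euclidean_distance(features_a, features_b):
    """Euclidean distance between two equally sized, flattened feature lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))


def should_switch(template_layers, result_layers, thresholds):
    """Return (switch, distances) for the three feature layers.

    switch is True when every layer's distance exceeds its threshold.
    Both the thresholds and this 'all layers' rule are assumptions.
    """
    distances = [
        euclidean_distance(t, r) for t, r in zip(template_layers, result_layers)
    ]
    switch = all(d > limit for d, limit in zip(distances, thresholds))
    return switch, distances


# Toy features for three layers (illustrative values only).
template = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
good_result = [[1.0, 0.1], [0.0, 0.9], [1.0, 1.0]]
bad_result = [[4.0, 4.0], [3.0, 5.0], [4.0, 5.0]]
thresholds = [1.0, 1.0, 1.0]

switch_good, _ = should_switch(template, good_result, thresholds)
switch_bad, bad_distances = should_switch(template, bad_result, thresholds)
```

A result that stays close to the template leaves the fixed window in place, while a result that is far from the template in every layer triggers the switch.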
We observed that when the tracking error was caused by a change in target appearance, the confidence score of the correct target was higher on the response map before the window penalty function was applied; however, due to the fixed spatiotemporal constraint, the response of a target far from the center of the search area was suppressed, and thus the tracker could not recover to the correct target. In most cases, though, the window function reduced the likelihood that the tracker would follow similar objects far from the center point of the search area. Based on this, we established a flexible spatiotemporal constraint to penalize the target confidence score $S_{con}$ on the response map. More details are shown in Figure 4. In our strategy, $\rho_{con}$ is a predefined hyperparameter that determines the degree to which the flexible spatiotemporal constraint affects the original response map. If $\rho_{con}$ is set to a large value, then the flexible spatiotemporal constraint has minimal impact on the original response map, which may cause the response value far from the center of the response map to be too large, resulting in the boundary effect. On the other hand, if $\rho_{con}$ is set to a small value, then the final response map is primarily determined by the flexible spatiotemporal constraint, and the initial response map output by the tracker is largely disregarded, which greatly reduces the confidence of the response map. In the expansion curve of the constraint weights, $\alpha$ is the expansion rate indicating the distance penalty; $\sigma$ represents the amount of translation to the left, which allows the value to continue expanding from any position without having to start from $Q_0$ ($Q_0$ is one); $Q_i$, $n$, and $Q_f$ represent the initial value, the expansion time length, and the final value, respectively; and each $Q_i$ represents the spatiotemporal constraint weight of the original position, so that different $Q_i$ values form different expansion curves.
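To make the role of $\rho_{con}$ and of the expanding weights concrete, the sketch below relaxes a Hanning window toward a flat window as the number of frames since target loss grows. It is an illustration under stated assumptions, not the paper's formula: the linear expansion, the blend `rho_con * response + (1 - rho_con) * window`, and all numeric values are hypothetical.

```python
import math


def hanning(length):
    """Symmetric Hanning window of the given length (values in [0, 1])."""
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
        for i in range(length)
    ]


def flexible_window(length, frames_lost, expansion_length, final_value=1.0):
    """Raise each weight linearly from its Hanning value toward final_value."""
    progress = min(max(frames_lost / expansion_length, 0.0), 1.0)
    return [q + (final_value - q) * progress for q in hanning(length)]


def constrain(response, window, rho_con):
    """Blend response and window; a large rho_con keeps the raw response dominant."""
    return [rho_con * r + (1.0 - rho_con) * q for r, q in zip(response, window)]


# Distractor at the centre (index 4), true target at the edge (index 8).
response = [0.05, 0.05, 0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.80]
peaks = []
for frames_lost in (0, 5, 10):
    window = flexible_window(len(response), frames_lost, expansion_length=10)
    scores = constrain(response, window, rho_con=0.6)
    peaks.append(max(range(len(scores)), key=scores.__getitem__))
```

In this toy setting the peak stays on the central distractor while the window is still close to the Hanning shape and moves to the edge target once the edge weights have fully expanded.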

Adaptive Weight Template Updating
In practical tracking tasks, most tracker models continuously degrade due to the constant change in target appearance, resulting in tracker drift. Some Siamese trackers utilize the target state given in the first frame to obtain an initial template and do not update it again [5,7]. Most update functions are limited to linear combinations with previous templates, and fixed combination weights severely limit the universality of the update mechanism. In order to make the template update dynamically, reducing model degradation and preventing contamination of the template from undifferentiated updates, we developed an adaptive weight template-updating strategy which can dynamically fuse the tracking results to generate the cumulative template for subsequent frame tracking.
First, the object defined by the ground truth in the initial frame carries the most reliable original information, and thus we used the appearance features of the initial template as a baseline for the tracking results of the subsequent frames and generated cumulative templates by using a convolutional neural network to learn the target information contained in the initial template. A new cumulative template was generated for each frame during the tracking process. For each frame to be tracked, its corresponding template $T^{C}_{t}$ was generated from three components: the initial template, the cumulative template $T^{C}_{t-1}$, and the tracking result $T^{R}_{t-1}$ for the previous frame. This gives the template richer temporal information. In the formalization of this generation process, conv[·] represents the convolutional operation and $t$ is the sequence index of the frame.
Furthermore, as can be seen in Figure 5, the tracking result features of different frames differed significantly from the initial template features due to the constant changes in the target appearance. Even for the same object, updates to the target template needed to change dynamically based on the tracking state. If all tracking results were used indiscriminately to update the template, this could lead to redundancy or contamination of the template.
Therefore, we estimated the correlation between the cumulative template and the initial template and assigned weights to both by means of depth correlation assessment criteria. Since the depth features of different layers had different contributions to the final response map, we conducted depth cross-correlation between the three depth features of the initial template and the corresponding layers of the cumulative template. Then, we took the ratio of this result to the autocorrelation of the initial template features to generate the weights of the corresponding layers. According to the weight of the corresponding layer, the feature of the tracking result and the feature of the cumulative template were fused to generate a new cumulative template for tracking in the next frame. Note that the target from the first frame provided the most reliable information, and therefore we set the shrink parameter so that the template retained more of the initial information of the target. In the recursive formula for the template update, $L$ is the layer index of the features and $t$ is the sequence index of the frame. The operator ⊛ denotes the cross-correlation operation (i.e., the former is used as a convolution kernel to perform convolution operations on the latter).
It can be seen in Figure 5 that our model can mitigate the template degradation caused by target deformation or target background changes.
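The idea of weighting an update by a correlation ratio can be sketched for a single feature layer as below. This is a simplified illustration and not the paper's recursive formula: an inner product of flattened features stands in for depth cross-correlation, the weight is computed between the initial template and the tracking result, and the `shrink` value of 0.5 is hypothetical.

```python
def correlation(a, b):
    """Inner product of two flattened feature lists (stand-in for cross-correlation)."""
    return sum(x * y for x, y in zip(a, b))


def correlation_weight(initial, candidate):
    """Cross-correlation divided by the initial autocorrelation, clipped to [0, 1]."""
    ratio = correlation(initial, candidate) / correlation(initial, initial)
    return min(max(ratio, 0.0), 1.0)


def update_template(initial, cumulative, result, shrink=0.5):
    """Fuse the tracking result into the cumulative template for one layer.

    The shrink factor caps the contribution of the result so that the
    template keeps more of its earlier information (assumed form).
    """
    weight = shrink * correlation_weight(initial, result)
    return [(1.0 - weight) * c + weight * r for c, r in zip(cumulative, result)]


initial = [1.0, 0.0, 1.0]
cumulative = [1.0, 0.0, 1.0]
drifted_result = [0.0, 2.0, 0.0]      # uncorrelated with the initial template
partial_result = [0.5, 1.0, 0.5]      # partly correlated

kept = update_template(initial, cumulative, drifted_result)
blended = update_template(initial, cumulative, partial_result)
```

A result that is uncorrelated with the initial template leaves the cumulative template unchanged, which is one way to avoid template contamination, while a partly correlated result is blended in with a reduced weight.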

Settings and Datasets
The method presented in this article was implemented with PyTorch. The experiment was conducted on a PC with 16.0 GB of RAM, an Intel(R) Core(TM) i7-10700 CPU at 2.90 GHz, and an NVIDIA GeForce GTX 1660 SUPER GPU. Our tracker was evaluated on five datasets: OTB100 [37], UAV123 [38], NFS [39], VOT2016 [40], and TC128 [41]. The Conv3, Conv4, and Conv5 layers of ResNet-50 [33] were used to extract the depth features of the target template and the search region. We enhanced the target representation by introducing spatial attention and channel attention, which could suppress the depth features of interference information. The number of iterations during online training affected the quality of the aware attention features. If the number of iterations is too low, then the loss value will not decrease effectively. Consequently, the distractor information in the target template features will not be adequately suppressed, and the tracking success rate will not improve. Conversely, if there are too many iterations, then the training time will be prolonged, thereby negatively affecting the tracking speed. To determine the optimal number of iterations, we conducted a comparative experiment within the range of 100-800 iterations. The experiment utilized the OTB100 dataset, and the results are presented in Table 2. Considering the balance between tracking accuracy and speed, and taking into account the resource limitations, we set the maximum number of iterations during training of the aware attention models to 500 based on the experimental findings. The learning rates of the neural network were 0.02 (Conv3), 0.05 (Conv4), and 0.0006 (Conv5).

Results on OTB100
OTB100 is one of the most widely used datasets in the field of tracking, and it consists of 100 video sequences. The test sequences include deformation, illumination variation, out-of-plane rotation, scale variation, in-plane rotation, occlusion, motion blur, fast motion, and other challenging aspects. The evaluation was based on two metrics: the success plot and the precision plot. The precision plot shows the percentage of video frames whose center position error is less than a given threshold. The success plot shows the percentage of video frames whose bounding box overlap is greater than a given threshold. We compared our tracker with some state-of-the-art trackers (SiamFC++ [11], SiamBAN [8], DaSiamRPN [42], GradNet [43], DeepSRDCF [44], SiamRPN [7], SiamDW-FC [6], SRDCF [45], SiamFC [5], and fDSST [46]). As shown in Figure 6, the performance of our tracker on both metrics was at an advanced level. Our tracker provided varying degrees of gain compared with the SiamBAN tracker.
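The two metrics can be computed for a single threshold as in the sketch below. Boxes are assumed to be given as (x, y, width, height); the default thresholds of 20 pixels and 0.5 overlap are common reporting choices used here for illustration, and the toy boxes are hypothetical.

```python
import math


def center_error(box_a, box_b):
    """Distance between box centres; boxes are (x, y, width, height)."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    bottom = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(predictions, ground_truth, threshold=20.0):
    """Fraction of frames whose centre error is below the threshold."""
    hits = sum(center_error(p, g) < threshold for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def success(predictions, ground_truth, threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    hits = sum(overlap(p, g) > threshold for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


# Four toy frames with the same ground-truth box.
truth = [(0, 0, 10, 10)] * 4
preds = [(0, 0, 10, 10), (2, 0, 10, 10), (30, 30, 10, 10), (5, 0, 10, 10)]
precision_at_20 = precision(preds, truth)
success_at_half = success(preds, truth)
```

Sweeping the threshold over a range of values and plotting the resulting fractions yields the precision and success curves.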

Attribute-Based Comparison
We tested the tracking results of the proposed method on the OTB100 dataset under six common challenges: background clutter, fast motion, motion blur, occlusion, out of view, and scale variation. Figure 7 shows the comparison results between our tracker and other trackers for the six different attributes. It can be seen that the proposed method coped better with various tracking challenges. In particular, compared with SiamBAN [8], our tracker performed better under the background clutter challenge, with a 1.6% improvement in the success rate and 1.4% higher precision, which indicates that the spatial-semantic attention learning model effectively distinguished the targets from the background and similar objects. Moreover, when the target underwent fast motion and scale variation, the excellent results produced by our method show that the flexible spatiotemporal constraint can reduce the error response caused by the abrupt change in the target.

Results on UAV123
The UAV123 dataset consists of 123 videos captured by a low-altitude drone and is characterized by a large number of viewpoint changes. We tested our algorithm on UAV123 using the same evaluation metrics as for OTB: the precision is the percentage of video frames whose center position error is less than a given threshold, and the success rate is the percentage of video frames whose bounding-box overlap is greater than a given threshold. Table 3 compares our method with ATOM [47], SiamBAN [8], SiamRPN++ [48], DaSiamRPN [42], SiamRPN [7], ECO [49], TCTrack++ [49], SRDCF [45], SiamTHN [50], LGFF [51], and SAMF [52]. The results show that our tracker had 1.9% higher precision and a 2.6% higher success rate than SiamBAN. However, ATOM achieved better precision than most classification-and-regression-based trackers. This is because ATOM introduced an online-trained classification component, which allowed the network to estimate the target state with higher classification confidence. As a result, the center of its bounding box was closer to the true position.

Results on NFS
The Need for Speed (NFS) dataset consists of 100 video sequences captured from real-world scenarios, with a total of up to 380,000 frames. All sequences are pre-annotated with visual attributes such as occlusion, fast motion, and background clutter. We evaluated our tracker on the 30 FPS version of the dataset. The area under the curve (AUC) score of the success plot reflects the overall tracking results. Table 4 lists the evaluation results of our approach as well as MDNet [53], ECO [54], C-COT [55], UPDT [56], ATOM [47], SiamBAN [8], and LGFF [51]. Our tracker ranked second among the compared methods.

Results on VOT2016
VOT2016 consists of 60 video sequences covering different scenarios. Unlike OTB's evaluation system, VOT2016's evaluation metrics include robustness (the failure rate, where lower is better), accuracy (the average overlap during successful tracking periods, where higher is better), and EAO (the expected average overlap, which quantitatively reflects both robustness and accuracy, where higher is better). In the VOT evaluation protocol, the tracker is reinitialized when no overlap between the predicted bounding box and the ground truth is detected. We compared our tracker with SiamRPN [7], C-COT [55], MDNet [53], SiamRN [57], D3S [58], ROAM [59], SPS [60], SiamRNE [61], SiamTHN [50], and SiamBAN [8]. Table 5 shows the evaluation results for each tracker. Compared with SiamBAN, the EAO of our tracker was further improved and the failure rate was reduced, which shows that our spatial-semantic-aware attention model, adaptive template updating, and flexible spatiotemporal constraint strategy can effectively reduce the probability of tracking failure. However, the accuracy of our tracker was worse than that of D3S. This is because the bounding box in the VOT evaluation system can be rotated, whereas the bounding box that our tracker predicts is axis-aligned.
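The reset-based evaluation described above can be illustrated with a simplified sketch that tallies accuracy and the failure count from a per-frame overlap sequence. The function name, the input format, and the five-frame gap before reinitialization are assumptions made for illustration. The sketch does not compute EAO and is not a substitute for the official VOT toolkit.

```python
def accuracy_and_failures(overlaps, skip=5):
    """Tally accuracy and failures for one sequence under a reset protocol.

    overlaps: per-frame overlap (IoU) between prediction and ground truth.
    skip:     number of frames to jump ahead after a failure before the
              tracker is assumed to be re-initialized (an assumed default).
    """
    failures = 0
    valid = []
    i = 0
    while i < len(overlaps):
        if overlaps[i] <= 0.0:
            # Zero overlap counts as a tracking failure; frames until the
            # assumed re-initialization point are excluded from accuracy.
            failures += 1
            i += skip
            continue
        valid.append(overlaps[i])
        i += 1
    accuracy = sum(valid) / len(valid) if valid else 0.0
    return accuracy, failures
```

Under these assumptions, a lower failure count corresponds to better robustness, and the accuracy is averaged only over frames in which the tracker was running successfully.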

Results on TC128
To further demonstrate the generality of the proposed tracking method in various scenarios, we tested our tracker on the TC128 dataset, which is more complex than OTB. TC128 has 128 color video sequences, and its tracking scenarios are more varied than those of OTB. We compared our method with other trackers, including SiamBAN [8], ADMT [62], SiamCAR [10], SiamGAT [25], MEEN [63], and Struck [64]. Figure 8 shows the precision plots and success plots of the seven trackers on the TC128 dataset. The scores of the proposed method on AUC and DP were 79.9% and 58.0%, respectively. Compared with the second-best tracker (SiamBAN), our method improved the AUC and DP by 1.6% and 1.8%, respectively.

Visual Evaluation
To further demonstrate the performance of our tracker in the face of various challenges, we visualized the tracking results of our tracker and other trackers in real-world scenarios. The scenarios in Figure 9 include tracking challenges such as occlusion (in Matrix and Soccer), scale variation (in Biker, Soccer, Walking2, and Trans), illumination changes (in Matrix and Soccer), deformation (in Trans), and other changes in the appearance of the target, as well as fast motion (in Matrix and Biker) and background clutter (in Matrix, Soccer, and Walking2). Trackers such as SiamBAN [8], DaSiamRPN [42], and ADMT [62] produced less accurate prediction boxes and even drifted when faced with these challenges. In contrast, our tracking method adapted better to target appearance changes and the various challenges owing to the spatial-semantic-aware attention model and the spatiotemporal constraint strategy.

Ablation Study
In this section, we perform an ablation analysis of the spatial-semantic attention model, the flexible spatiotemporal constraint strategy, and the adaptive weight template-updating model. To illustrate the effectiveness of the proposed components, we analyzed our algorithm on the OTB100 dataset using one-pass evaluation. The baseline method adopted the original deep features of Conv3, Conv4, and Conv5 from ResNet. The precision (DP) and success rate (OP) are shown in Table 6. Siambase denotes the basic tracker used by the algorithm. SiamSA denotes the addition of the spatial-semantic-aware attention model. SiamSAST denotes the tracker that combines the spatial-semantic-aware attention model and the flexible spatiotemporal constraint strategy. SiamSDP denotes the final tracker with all three components. In Table 6, it can be seen that both the DP and OP gradually improved as each of the three components was added to the basic tracker. This shows that our spatial-semantic-aware attention model can effectively enhance the sensitive features of the target and improve the ability of the tracker to distinguish the target from background information. Thanks to the flexible spatiotemporal constraint strategy, the tracker can react in case of tracking drift. Therefore, the DP of SiamSAST was higher than that of the basic method. After adding the adaptive weight template update, the accuracy and success rate of SiamSDP increased by 0.33% and 0.29%, respectively, compared with SiamSAST, indicating that the adaptive weight template update can adapt to appearance changes in the target and reduce the degradation of the tracker.

In addition, we tested the tracking speed of the proposed algorithm. Because both our spatial-semantic-aware attention model and adaptive weight template update model contain convolutional neural networks, the number of parameters and the amount of computation increased compared with the base tracker, which led to a decrease in tracking speed. Given these resource and speed limitations, we have not yet considered applying our method to real-time tracking.

Conclusions
This paper proposes Siamese tracking with spatial-semantic-aware attention and adaptive template updating to suppress irrelevant information about an object's appearance and reduce model degradation. We used the spatial-semantic-aware attention model to enhance the feature representation ability and improve the tracking performance. The proposed spatial-aware attention module is responsible for highlighting the location of the target, and the semantic-aware attention module focuses on important feature channels. Then, the flexible spatiotemporal constraint strategy was proposed to remove the incorrect penalty that a fixed spatiotemporal constraint imposes on the correct target response in case of tracking failure. Finally, we proposed an adaptive weight template-updating strategy to adapt to changes in target appearance during tracking. It can adaptively generate new reliable templates using the tracking results of each frame. We conducted extensive experiments on several challenging datasets (OTB100, VOT2016, NFS, UAV123, and TC128) to validate the effectiveness of the proposed method.
In this work, our primary focus was on addressing the challenges associated with target tracking in scenarios involving abrupt motion. While our spatial-semantic-aware attention model improved the tracking accuracy, the global average pooling and convolutional network used in the model may result in the loss of certain feature information. Additionally, the increased number of model parameters can lead to a decrease in tracking speed. In future work, we will explore alternative attention mechanisms and consider developing lightweight models to reduce the overall number of model parameters. Furthermore, tracking models trained and tested on specific datasets have limitations in their generalization ability, and our study is no exception. Although the dataset samples used for training the proposed method are sufficiently varied, the capturing device acquires video sequences under unbalanced illumination, from a particular viewing angle, and so on, which may lead to capture bias. Ambiguous definitions of visual semantic facts can also lead to labeling and category bias. The limited nature of the dataset may lead to erroneous conclusions when the model is confronted with new, unseen samples. Studying the differences between existing datasets and debiasing methods to improve the generalization ability of tracking algorithms will be our future research direction.

Figure 1. The proposed tracking framework, where FST represents the flexible spatiotemporal constraint strategy, EDJ represents the judgment based on the Euclidean distance, and AWU stands for adaptive weight template update.
avg and $F^{1\times1\times2048}_{avg}$ through two fully connected layers (the excitation process). The first fully connected layer $fc_1$ compresses the $C$ channels into $C/r$ channels to reduce computation, and the second fully connected layer $fc_2$ restores the $C$ channels. The excitation process can be expressed as follows:
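As a hedged illustration of this compress-and-restore step, and not the paper's own formulation, the sketch below applies two fully connected layers to a pooled channel descriptor and uses the result to reweight the channels. The ReLU after the first layer, the sigmoid after the second, the omission of bias terms, and all function names are assumptions borrowed from common squeeze-and-excitation designs.

```python
import math

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def excitation(pooled, w1, w2):
    """Map a length-C pooled descriptor to C channel weights in (0, 1).

    w1 has shape (C/r) x C and compresses the channels (first layer).
    w2 has shape C x (C/r) and restores the channels (second layer).
    """
    hidden = [max(0.0, h) for h in matvec(w1, pooled)]    # ReLU (assumed)
    logits = matvec(w2, hidden)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]   # sigmoid (assumed)

def reweight(feature_map, channel_weights):
    """Scale each channel (a flat list of values) by its weight."""
    return [[w * x for x in channel]
            for channel, w in zip(feature_map, channel_weights)]
```

With C = 2048 and a reduction ratio r, `w1` would have 2048/r rows, so the two layers together use far fewer parameters than a single C x C layer.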

Figures 2 and 3 show our spatial-aware attention model and semantic-aware attention model frameworks, respectively. The method enhances the effective features online and weakens those that are redundant or even interfere with tracking.

Figure 4. Tracking results and target-scored heat map visualization results under the Hanning window and flexible spatiotemporal constraint.

Figure 5. Visualization of the features of the initial template, tracking results, and the cumulative templates. The green box represents the tracking result when the template is not updated. The yellow box represents the tracking result under our updating mechanism.

Figure 7. The precision plots and success plots for six challenging attributes on the OTB100 dataset.

Figure 9. The qualitative results of our approach and other trackers on five challenging real-world scenarios (from top to bottom: Matrix, Biker, Soccer, Walking2, and Trans).

Table 1. Related works overview.

Table 2. The experimental results for different iteration numbers on OTB100.

Table 3. Comparison of results of other trackers with ours on UAV123.

Table 4. Comparison with state-of-the-art trackers on the NFS dataset in terms of AUC.

Table 5. Details on the state-of-the-art trackers in VOT2016.

Table 6. Ablation studies on the OTB100 dataset.