3.1.1. Obtaining Multi-Scale Object Templates
To adapt to appearance deformation during object tracking, this paper constructs a two-stage network model that integrates multi-scale templates, combining multi-scale features with a template update mechanism. By introducing multi-scale object features, the model captures object information at different scales more comprehensively, which improves its ability to perceive object deformation.
The search image is obtained by applying color jittering and cropping to the original image before it enters the network. The original template is produced by further cropping around the target. As shown in Figure 3, after a period of tracking, the position and shape of the object have changed significantly, and actual tracking can involve even more drastic changes, such as rapid movement, rotation, or scale change. These transformations help the model remain stable on input data at different scales. This paper introduces a center template into the model, obtained by further scaling the original template about the object center so that it covers only the object itself. It provides a stable reference when the object changes, especially when the change is drastic. The basic shape and features of the object are the key to tracking, and the center template is cropped to retain these basic features. In such cases the center template is particularly effective, ensuring that the model can accurately lock onto the object.
An image at 0.5 times the scale of the original template effectively balances the integrity of the object against the inclusion of background information. At this scale, the image captures the key visual features of the object without losing detail because the scale is too small, without scattering the features because the scale is too large, and without introducing too much irrelevant background noise, which helps the algorithm identify and track the object more accurately. Therefore, the multi-scale information introduced in this paper comes mainly from the search image, the center template (0.5 times the original template range), and the original template, as shown in Figure 4.
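As a minimal sketch of how a center template region could be derived from the original template, the snippet below shrinks a box about its center. It assumes that "0.5 times the original template range" refers to side length rather than area, and the function name `center_crop_box` and the 128 × 128 template size are hypothetical choices for illustration only.

```python
def center_crop_box(box, scale=0.5):
    """Shrink an (x0, y0, x1, y1) box about its center by `scale` (per side)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Hypothetical 128x128 original template occupying the whole patch.
original_template = (0, 0, 128, 128)
center_template = center_crop_box(original_template, scale=0.5)
print(center_template)  # (32.0, 32.0, 96.0, 96.0)
```

The resulting region keeps the object center fixed while discarding the outer band of background, which is the property the center template relies on.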
3.1.2. Two-Stage Fusion Multi-Scale Template Network
If only the original template is used, the object may be located inaccurately because the original template contains a large amount of background information. If only the center template is used, without the assistance of background information, the tracking effect may be poor. Therefore, this paper adopts a two-stage template fusion method based on the Transformer attention mechanism. By reasonably combining the search image, the original template, and the center template, the model can use the center template to lock onto the object while also using some background information to locate and track the object more accurately.
In the Transformer attention mechanism, when two sets of features are fused, the assignment of weights is closely tied to the relationship between the object and the background. If the model focuses more on the object, features related to the object receive higher weights during fusion, so the model concentrates on capturing key information about the object and can better understand and track it. Conversely, if the model focuses more on the background, features related to the background receive higher weights. Since this paper focuses on object tracking, the object can first be located through the attention mechanism, with background information then used as an aid so that the model tracks the object accurately. Because the center template focuses only on the object itself, it can be placed in the first stage to locate the object, and a template containing background information can be added in the second stage to assist tracking. There are two ways to achieve this. In the first, cross-attention is computed between the center template and the original template, and the result then undergoes cross-attention with the search image to introduce background information. In the second, cross-attention is computed between the search image and the center template, and the result then undergoes cross-attention with the original template. Of the two, since the original template is object-centric and has a more accurate tracking range, it is better to use the original template to enhance the template representation in the second stage. This is confirmed by the experimental results, in which this arrangement achieves higher accuracy and success rates on the OTB-100 dataset. This paper thus introduces template information with different amounts of background content. By allocating weights reasonably, the model can better understand the scene and perform object tracking effectively.
However, introducing too many template scales can lead to information redundancy, especially when the templates are highly similar. This redundancy not only consumes computational resources but may also interfere with the model learning more robust feature representations. Therefore, this paper uses a single center template to keep the network simple, which reduces unnecessary duplication of information, lets the model focus on learning and tracking information directly related to the task, and improves the transparency and interpretability of the module. As shown in Figure 5, this paper designs two fusion methods for multi-scale templates. The first combines the center template and the original template through cross-attention and then associates the enhanced original template with the search features through a second cross-attention operation. The second first applies cross-attention between the center template and the search image features to obtain the first-stage search features, and then applies cross-attention between that output and the original template to obtain the second-stage search features.
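The difference between the two fusion orders can be sketched with a simplified cross-attention operator. This is an illustration under stated assumptions, not the paper's implementation: it uses a single attention head, omits learned projections, positional encoding, and feed-forward layers, and assumes a residual connection. The function names `cross_attention`, `fusion_mode_1`, and `fusion_mode_2` and the token counts are hypothetical.

```python
import numpy as np


def cross_attention(query, context):
    """Single-head scaled dot-product cross-attention with a residual add.

    query:   (Nq, d) tokens that get updated
    context: (Nk, d) tokens that are attended to
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)          # (Nq, Nk)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over context
    return query + weights @ context                 # residual is an assumption


def fusion_mode_1(search, original, center):
    # Stage 1: enhance the original template with the center template.
    enhanced_template = cross_attention(original, center)
    # Stage 2: associate the search features with the enhanced template.
    return cross_attention(search, enhanced_template)


def fusion_mode_2(search, original, center):
    # Stage 1: focus the search features on the object via the center template.
    search_stage1 = cross_attention(search, center)
    # Stage 2: bring in background context from the original template.
    return cross_attention(search_stage1, original)


rng = np.random.default_rng(0)
d = 256                                    # 256-dimensional features
search = rng.standard_normal((1024, d))    # hypothetical 32x32 search tokens
original = rng.standard_normal((256, d))   # hypothetical 16x16 template tokens
center = rng.standard_normal((256, d))     # hypothetical 16x16 center tokens

out1 = fusion_mode_1(search, original, center)
out2 = fusion_mode_2(search, original, center)
print(out1.shape, out2.shape)
```

Both orders return features with the shape of the search tokens, but in the second order the search features are updated at every stage, whereas in the first order the center template influences the search features only indirectly through the enhanced template.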
Both fusion models were trained with the same parameter settings, and both performed well during training: the final loss value and Intersection over Union (IoU) differed by only 0.01, indicating that the two models performed similarly during training. To verify the tracking performance of the two fusion methods, the trained models were tested on the OTB-100 dataset, where the results differed significantly, as shown in Table 1.
The accuracy of the fusion mode (II) model is 6.6% higher than that of the fusion mode (I) model, and its success rate is 7.4% higher. This is because fusion mode (I) relied too heavily on the feature distribution of the training data and had already fit that data before enough parameters were learned, which weakened the model’s ability to generalize to new scenes. Fusion mode (II) allowed the model to re-evaluate and adjust its attention to features at each stage during training, improving its generalization ability. With enough learned parameters, the model can balance its attention between the object and the background, avoiding the performance degradation or overfitting caused by excessive attention to one kind of information. Therefore, when designing fusion methods, it is necessary to balance the model’s attention to object and background information and to ensure that the model generalizes effectively to new data. The network framework in this paper follows fusion mode (II).
This paper adopts a Transformer-based two-stage multi-scale template fusion method, named TransT-C. Figure 6 shows the network structure of the two-stage fusion multi-scale template network.
Step 1: Feature Extraction. Use the ResNet50 network to extract features from the original template, center template, and search image. Output the features from the third layer of the network, resulting in three feature maps.
Step 2: Positional Encoding of Feature Maps. Convert the feature maps into 256-dimensional feature vectors through a 1 × 1 convolution layer. Each output feature vector is used to initialize three values, the query, key, and value, denoted Q, K, and V, so that the features are projected into different spaces. Apply positional encoding to the Q and K of each feature vector to explicitly retain positional information.
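Step 2 can be sketched as follows. This is a hedged illustration rather than the paper's code: the 1024 input channels (the standard output width of ResNet-50's third stage), the 16 × 16 spatial size, and the sine/cosine form of the positional encoding are assumptions, the helper names `conv1x1` and `sine_position_encoding` are hypothetical, and the separate learned projections that a Transformer layer applies to Q, K, and V are omitted.

```python
import numpy as np


def conv1x1(feature_map, weight, bias):
    """1x1 convolution: the same linear map applied at every spatial position.

    feature_map: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, feature_map) + bias[:, None, None]


def sine_position_encoding(h, w, dim):
    """2D sine/cosine encoding: half the channels encode y, half encode x."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(0, half, 2) / half))    # (half/2,)
    ys = np.arange(h)[:, None] * freqs                         # (h, half/2)
    xs = np.arange(w)[:, None] * freqs                         # (w, half/2)
    pos_y = np.concatenate([np.sin(ys), np.cos(ys)], axis=1)   # (h, half)
    pos_x = np.concatenate([np.sin(xs), np.cos(xs)], axis=1)   # (w, half)
    pos = np.concatenate(
        [np.broadcast_to(pos_y[:, None, :], (h, w, half)),
         np.broadcast_to(pos_x[None, :, :], (h, w, half))],
        axis=-1,
    )
    return pos.reshape(h * w, dim)                             # (h*w, dim)


rng = np.random.default_rng(0)
c_in, h, w, dim = 1024, 16, 16, 256      # 1024 and 16x16 are assumptions
backbone_out = rng.standard_normal((c_in, h, w))
weight = rng.standard_normal((dim, c_in)) * 0.01
bias = np.zeros(dim)

# Project to 256 channels, then flatten the grid into a token sequence.
tokens = conv1x1(backbone_out, weight, bias).reshape(dim, h * w).T  # (h*w, dim)
pos = sine_position_encoding(h, w, dim)

q = tokens + pos   # queries carry position
k = tokens + pos   # keys carry position
v = tokens         # values stay position-free
print(q.shape, k.shape, v.shape)
```

Adding the encoding only to Q and K lets position influence where attention is placed without mixing positional values into the content that is aggregated.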
Step 3: Self-Attention Learning on Feature Vectors. Each of the three feature vectors undergoes self-attention learning, as given in Equation (1). This helps the model capture global dependencies within the input sequence and thus extract, for each position, a representation that reflects the overall context. By applying self-attention to the two templates and the search image, the model can understand the semantic structure of the input at a global scale, which aids the extraction of richer feature representations.
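As a hedged sketch of what such a self-attention step typically looks like (not necessarily the exact form of Equation (1)), the standard scaled dot-product attention and a TransT-style residual self-attention update can be written as below. The symbols X (input features), P (positional encoding), d_k (key dimension), and the residual connection are assumptions; only the use of positional encoding on the queries and keys comes from Step 2.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \\
\hat{X} &= X + \mathrm{MultiHead}\left(X + P,\; X + P,\; X\right)
\end{aligned}
```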
          
Step 4: First Stage Feature Fusion. Perform cross-attention fusion between the center template and the search image, as shown in Equation (2). This primarily associates the features of the object itself with the search image, so that attention in the search scene concentrates on object-related features. The learned search features are then passed to the second-stage fusion structure.
          
In Equation (2), the inputs are the spatial position encoding, the center template features, and the search image features, and the output is the search image features of this stage.
Step 5: Second Stage Feature Fusion. Perform cross-attention fusion between the original template and the enhanced search image features, as shown in Equation (3). This allows the model to further associate the search image with the surrounding information of the object’s position, enabling the model to more accurately understand the object’s location and environment within the entire image.
          
In Equation (3), the inputs are the search image features after the first stage of fusion learning and the feature information of the original template, and the output is the search image features after the second stage of learning, which are used for the final classification and regression prediction.
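One plausible way to write the two fusion stages, assuming a TransT-style cross-attention with a residual connection, is sketched below. All symbols are assumptions introduced for illustration: X_s, X_c, and X_t denote the search image, center template, and original template features; P_s, P_c, and P_t denote their spatial position encodings; and X_s^{(1)} and X_s^{(2)} denote the search features after the first and second stages. Feed-forward and normalization layers are omitted.

```latex
\begin{aligned}
X_s^{(1)} &= X_s + \mathrm{MultiHead}\left(X_s + P_s,\; X_c + P_c,\; X_c\right), \\
X_s^{(2)} &= X_s^{(1)} + \mathrm{MultiHead}\left(X_s^{(1)} + P_s,\; X_t + P_t,\; X_t\right)
\end{aligned}
```

In both stages the search features act as the query, so the output always keeps the spatial layout of the search image; only the source of keys and values changes from the center template to the original template.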
Step 6: Output of the Second Stage. The output of the second stage is mapped to 2-dimensional and 4-dimensional outputs by a multi-layer perceptron (MLP), providing the numerical predictions for the classification and regression tasks. The predicted values are input into the classifier and regressor, respectively. The classifier applies the sigmoid function to the output tensor, converting each class score into a probability and selecting the class with the highest probability as the prediction result. The regressor converts the output tensor, which represents attributes of the object such as the coordinates and size of the bounding box, directly into a NumPy array to obtain the regression prediction.
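The classification step described above can be sketched in a few lines. The logit and box values below are made up for illustration, the meaning of each class index and the layout of the four box values are not specified here and are left uninterpreted, and the helper names `sigmoid` and `classify` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(logits):
    """Apply sigmoid to each class score and pick the most probable class."""
    probs = [sigmoid(z) for z in logits]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical head outputs for one search-feature vector.
class_logits = [2.0, -1.0]              # 2-dimensional classification output
box_values = [0.48, 0.52, 0.20, 0.30]   # 4-dimensional regression output

label, probs = classify(class_logits)
print(label, [round(p, 3) for p in probs])  # 0 [0.881, 0.269]
print(box_values)                           # regression values are used as-is
```

Because the sigmoid is monotonic, selecting the class with the highest probability is equivalent to selecting the class with the highest raw score; the sigmoid only rescales the scores into probabilities.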