Article

Object Tracking Algorithm Based on Multi-Layer Feature Fusion and Semantic Enhancement

1
School of Printing, Packaging and Digital Media, Xi’an University of Technology, Xi’an 710048, China
2
School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7228; https://doi.org/10.3390/app15137228
Submission received: 2 April 2025 / Revised: 14 June 2025 / Accepted: 24 June 2025 / Published: 26 June 2025

Abstract

The TransT object tracking algorithm, built on the Transformer architecture, effectively integrates deep feature extraction with attention mechanisms, thereby enhancing tracking stability and accuracy. However, the algorithm suffers from insufficient tracking accuracy and bounding box drift under similar-background clutter, which directly degrades the subsequent tracking process. To overcome this problem, this paper constructs a semantic enhancement model that exploits multi-layer feature representations extracted from a deep network and uses cross-attention to correlate and fuse shallow features with deep features. In addition, to adapt to changes in the object's surroundings and to establish clear discrimination from similar objects, this paper proposes a dynamic mask strategy that optimizes the attention allocation mechanism. Finally, an object template update mechanism compares the spatio-temporal information of successive frames to update the object template in time, improving the model's adaptability and further enhancing its tracking performance in complex scenes. Experimental comparisons demonstrate that the proposed algorithm effectively handles similar-background clutter, leading to a significant improvement in the overall performance of the tracking model.

1. Introduction

Object tracking technology primarily involves analyzing dynamic objects in video or image sequences to achieve the accurate estimation and prediction of their continuous states. Object tracking technology typically relies on deep learning methods, which train models to recognize the features of objects and track their motion trajectories. Research on object tracking not only drives technological advancements in the field of computer vision but also serves as a fundamental support for a wide range of applications, including automated surveillance, autonomous driving, interactive media, and augmented reality.
At present, deep learning-based tracking methods are primarily categorized into two types: fully convolutional Siamese networks and Transformer-based tracking approaches [1]. Siamese networks utilize a dual-branch structure with shared weights to achieve fast and precise similarity matching. The SiamFC [2] network was the first to introduce a fully convolutional architecture for single-object tracking, achieving efficient end-to-end tracking. Subsequent works such as SiamRPN [3] and DaSiamRPN [4] enhanced model robustness by incorporating region proposal networks and optimizing the backbone structure. SiamRPN++ [5] and SiamMask [6] further improved feature fusion and task integration to better adapt to complex scenarios. SiamFCA [7] applies local histogram equalization to each patch to enhance image contrast and suppress noise, while employing an improved AlexNet to extract deep features and introducing a coordinated attention mechanism, enabling the more accurate modeling of relationships between different objects and improving performance in complex visual scenes. Despite significant advances in tracking accuracy and speed, Siamese networks still face challenges in handling fast motion or heavily deformed objects, as well as maintaining stability and efficiency in long sequences [8]. To enable comprehensive analysis and understanding of the input data, Transformer models have been introduced into various stages of the tracking pipeline, serving both as modules for specific functions and as the foundation for overall tracking frameworks. Transformer-based tracking methods leverage the attention mechanism to establish long-range dependencies in spatial and temporal dimensions, enhancing the modeling capacity for dynamic object variations. In terms of temporal information modeling, STARK [9] and ProContEXT [10] incorporate online updating and context-aware modules to improve adaptability to object changes. VideoTrack [11] further optimizes the representation of temporal features by integrating a hierarchical structure and attention mechanisms. SwinTrack [12] integrates a Transformer architecture and motion tokens into the Siamese framework to enhance feature interaction and tracking robustness by incorporating temporal context. For spatial modeling, SiamTPN [13] uses a pyramid structure to fuse multi-layer features, improving adaptability to objects of varying scales. HiFT [14] utilizes feature modulation to enhance the recognition capability for small objects. MixFormer [15] constructs its framework through progressive embedding and a localization head, introduces an asymmetric attention mechanism to reduce computational cost, and designs an effective score prediction module for high-quality template selection. TrDimp [16], TaTrack [17], and MMT [18] enhance model robustness under extreme conditions by introducing multi-mask mechanisms. DropMAE [19] optimizes spatial information processing through a mask reconstruction approach that reduces redundancy and strengthens temporal information alignment. The dual-branch CNN–Transformer network TabCtNet [20] applies an object-aware erasure strategy, generating attenuated heatmaps from template masks in a data-driven manner to reduce the impact of similar distractors, and designs a corner-based pixel-level refinement module to extract finer-grained spatial information, achieving more accurate bounding box estimation and mitigating the effects of object deformation.
TransT [21], as a representative approach that combines Siamese networks with Transformer mechanisms, makes full use of cross-attention modules to model the interaction between the template and the search region, effectively overcoming traditional Siamese networks’ limitation in capturing global dependencies. TransT was tested on the OTB100 dataset, with the success rates (Suc) under various challenging factors presented in Table 1. The full names of each factor are introduced in Section 3.2 of this paper. As observed from Table 1, the success rate in background clutter (BC) scenarios is relatively low, with the highest number of failures. Background clutter refers to scenes where the background contains colors or textures similar to the object, which may cause the tracker to mistakenly identify background regions as the object, leading to tracking failure.
Through an in-depth analysis of the TransT network structure, it was observed that while its main strength lies in the comprehensive feature fusion enabled by the Transformer, it lacks detailed information necessary for accurately distinguishing and localizing the object and background. In addition, it does not effectively utilize the multi-layer outputs of the backbone network to perform a thorough analysis of scene content. Furthermore, the absence of temporal information integration to accommodate changes in the object’s appearance and position leads to unstable tracking results, thereby limiting the algorithm’s robustness and applicability. Therefore, this paper performs the following work to solve these problems:
(1)
The object semantic enhancement model is constructed by using a multi-layer feature fusion method. Specifically, multi-layer features are fused through the cross-attention mechanism to ensure that the detailed description of the object is associated with the background context and enhance the expression ability of the search features.
(2)
The dynamic mask strategy is proposed. We introduce dynamic masks into the output features of cross-attention, further focusing on object information and significantly reducing the impact of background clutter.
(3)
The object template update mechanism guided by the judgment conditions [22] is employed. In particular, when the prediction result of the current frame exhibits high correlation with those of the two adjacent frames, a new template image is generated by cropping the current frame around the predicted bounding box. This updated template is then fed back into the network input, enabling the template to better adapt to appearance variations of the object and changes in the scene background throughout the video sequence, thus maintaining tracking effectiveness over time.

2. Method

When objects with the same appearance and texture frequently appear around an object, the predicted bounding boxes are easily transferred to similar objects. In the long-term tracking process, the model may fail to accurately locate the true object, eventually deviating from it. To address the issues of insufficient feature representation and error accumulation in TransT, this paper proposes a model that focuses explicitly on the object itself, thereby effectively distinguishing the object from similar entities in the background. Since cross-attention can enhance focus on key regions by leveraging the interrelations between different images and suppress irrelevant background information, and the mask mechanism can further filter critical regions and eliminate redundant features, this work introduces an object semantic enhancement model to ensure that the algorithm concentrates on the object, reducing background interference. Furthermore, to improve the model’s adaptability to variations in object appearance and spatial position, the object template update mechanism guided by the judgment conditions [22] is adopted. This enhances tracking accuracy and stability in complex scenes with background distractions, providing a more robust solution for object tracking tasks. The overall tracking framework of the algorithm proposed in this paper is shown in Figure 1.

2.1. Construction of Object Semantic Enhancement Model

2.1.1. Extraction of Multi-Layer Fusion Features

The feature vectors used by TransT are extracted from the ResNet50 [23] backbone network. ResNet50 is a variant of the deep Residual Network, with a structure of 50 layers, used to solve the degradation problem in deep convolutional neural networks (CNNs), where the training set error increases as the network depth increases. The main hierarchical structure of ResNet50 is [Layer1, Layer2, Layer3, Layer4]. Through the progressive deepening from preliminary features to advanced features, it provides rich and hierarchical feature representations for the model. Due to their different levels of abstraction, features at different levels can have different application values for tracking tasks. Low-level features capture fine-grained details, whereas high-level features convey more abstract and semantic information. This hierarchical feature representation improves the model’s ability to adapt to external variations in the object, such as changes in lighting, occlusion, and fast motion. At the same time, it also improves the robustness of the model in different scenarios and background conditions.
From Figure 2, it can be observed that the outputs of [Layer1, Layer2, Layer3, Layer4] in ResNet50 provide continuous feature representations from low to high levels.
Layer1 features contain fewer channels but can capture finer feature details such as finer edges and texture features. As the depth increases, Layer2 can capture more abstract and high-level features, and the feature map begins to display partial shape and key structural parts, and the basic composition of the object can be understood. The number and depth of channels in Layer3 are further deepened to capture more complex features and relationships. Feature maps contain higher-level abstract features, which can provide more stable feature representations for objects with significant changes, helping to maintain the continuous tracking of objects in complex scenes. Layer4 provides the highest level of abstract features, which are more global and semantic, helping to identify the category of the object. In object tracking tasks, the appearance of the object often undergoes significant changes across frames due to factors such as illumination variation, occlusion, and viewpoint changes. Using single-level features may result in insufficient representation capacity or poor robustness. While low-level features can capture fine details like edges and textures, they lack semantic abstraction and are prone to background interference; high-level features, on the other hand, offer strong semantic representation but often lose spatial structure and local information. Hence, integrating multi-level features allows the model to capitalize on their complementary advantages in capturing spatial details and semantic information, thereby enhancing its ability to adapt to object appearance changes and improving the stability and accuracy of tracking. Since template and search images often differ due to viewpoint changes, occlusions, or background clutter, directly combining multi-level features using concatenation or weighted averaging may introduce redundant or misleading information, causing the model to focus on incorrect regions. In contrast, the cross-attention mechanism adaptively captures corresponding regions between the template and search images based on their mutual relationships, enhancing attention to key areas while suppressing irrelevant backgrounds. This facilitates more accurate and distinctive feature integration, which in turn enhances the model’s robustness and stability when dealing with complex scenarios. Therefore, this paper adopts cross-attention to fuse multi-level features. In order to balance the specificity and abstraction of features, this paper selects the outputs of Layer2 and Layer3 as the feature information for this tracking task. These two layers provide a level of abstraction between low-level features and high-level semantic features, which are highly effective in capturing the shape, texture, and partial structural details of an object. Moreover, it excludes both fine-grained low-level features (e.g., Layer1 edge features) and overly broad high-level semantics (e.g., Layer4 global features), since they lack stability under fast motion and struggle to capture detail in complex environments. Therefore, choosing Layer2 and Layer3 can ensure sufficient information while avoiding unnecessary computational costs.
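To make this feature-selection choice concrete, the following minimal sketch shows how the Layer2 and Layer3 maps can be taken from a standard torchvision ResNet-50. It is illustrative only and does not reproduce the modified backbone used by TransT; the input size and the omission of pretrained weights are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiLayerBackbone(nn.Module):
    """Returns the Layer2 (512-channel) and Layer3 (1024-channel) feature maps."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # pretrained weights would normally be loaded
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2, self.layer3 = net.layer1, net.layer2, net.layer3

    def forward(self, x):
        x = self.layer1(self.stem(x))
        f2 = self.layer2(x)    # C2 = 512,  stride 8
        f3 = self.layer3(f2)   # C3 = 1024, stride 16
        return f2, f3

if __name__ == "__main__":
    backbone = MultiLayerBackbone()
    template = torch.randn(1, 3, 128, 128)   # example template crop (size assumed)
    f_z2, f_z3 = backbone(template)
    print(f_z2.shape, f_z3.shape)            # [1, 512, 16, 16], [1, 1024, 8, 8]
```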

2.1.2. Object Semantic Model Under Multiple Cross-Attention

This paper proposes an enhanced appearance semantic model of multi-layer feature fusion. By utilizing the multi-level feature representations of Layer2 and Layer3 of ResNet50 and adopting cross-attention learning and feature fusion strategies, the model achieves deep understanding and the accurate tracking of objects. The model can effectively extract and emphasize the most relevant features to an object, which helps the model accurately recognize objects among many similar objects and improve feature differentiation ability. Figure 3 shows the internal components of the multi-layer feature fusion network.
(1)
Cross-attention learning
There has always been an inherent correlation between template images and search images, which can be explicitly learned through cross-attention to improve object recognition ability. ResNet50 is used to extract the second- and third-layer features of the template image and the search image, yielding template features [Layer2: $f_{Z2} \in \mathbb{R}^{C_2 \times H_{Z2} \times W_{Z2}}$; Layer3: $f_{Z3} \in \mathbb{R}^{C_3 \times H_{Z3} \times W_{Z3}}$] and search features [Layer2: $f_{X2} \in \mathbb{R}^{C_2 \times H_{X2} \times W_{X2}}$; Layer3: $f_{X3} \in \mathbb{R}^{C_3 \times H_{X3} \times W_{X3}}$], where $C$, $H$, and $W$ represent the number of channels, height, and width of the feature map, respectively, with $C_2 = 512$ and $C_3 = 1024$. $Z$ denotes a template image, and $X$ denotes a search image.
Positional encodings are applied to the query Q and key K of the extracted features, with the specific position of each element in the feature sequence marked by sine–cosine encoding. The feature maps of the two layers are then flattened along the spatial dimensions.
Firstly, cross-attention is applied to the Layer2 features of the template image and the search image to learn the detail and shape matching between them. As shown in Equations (1) and (2), $X_2$ and $Z_2$ denote the search feature and template feature, respectively, after cross-attention is applied to the Layer2 features; $P_{X2}$ and $P_{Z2}$ denote the position encodings of the Layer2 search feature and template feature, respectively; and MultiHead represents the 'Cross-Attention' part in Figure 3. After the Layer2 features are processed through cross-attention, the feature representation of the search image highlights detailed features similar to those of the template image, such as edges and textures. This enables the model to better match the object and background at the level of detail; when facing clutter from similar objects, it reduces the model's distraction by non-target regions and improves tracking accuracy.
$X_2 = f_{X2} + \mathrm{MultiHead}(f_{X2} + P_{X2},\ f_{Z2} + P_{Z2},\ f_{Z2})$ (1)
$Z_2 = f_{Z2} + \mathrm{MultiHead}(f_{Z2} + P_{Z2},\ f_{X2} + P_{X2},\ f_{X2})$ (2)
At the same time, cross-attention is applied to the Layer3 features to enhance the understanding of the high-level semantic information of the object. As shown in Equations (3) and (4), $X_3$ and $Z_3$ denote the search feature and template feature, respectively, after cross-attention is applied to the Layer3 features; $P_{X3}$ and $P_{Z3}$ denote the position encodings of the Layer3 search feature and template feature, respectively; and MultiHead represents the 'Cross-Attention' part in Figure 3. After applying cross-attention at Layer3, the feature representation of the search image focuses more on the global, high-level semantic information of the object, such as the overall shape and object category features. This focus on high-level semantics enhances the model's overall visual grasp of the object, allowing it to discriminate on the basis of global features rather than only local or low-level features when encountering objects with appearances similar to the target, thereby maintaining stable tracking.
$X_3 = f_{X3} + \mathrm{MultiHead}(f_{X3} + P_{X3},\ f_{Z3} + P_{Z3},\ f_{Z3})$ (3)
$Z_3 = f_{Z3} + \mathrm{MultiHead}(f_{Z3} + P_{Z3},\ f_{X3} + P_{X3},\ f_{X3})$ (4)
To integrate attention information across multiple levels and boost the model’s capacity to represent features, the quantity of stacked cross-attention modules in the model is aligned with that in TransT, specifically set at N = 4. The model learns the dynamic relationship between template images and search images through the cross-attention mechanism, which enables the model to dynamically adjust its attention based on real-time changes in the object and background. In complex scenes, especially when the object is surrounded by similar objects, the model can update its understanding of the object in real time, thereby more effectively resisting the clutter of similar objects.
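As a concrete illustration of Equations (1)–(4), the sketch below shows one residual cross-attention exchange between the flattened template and search sequences using PyTorch's nn.MultiheadAttention. The module and variable names, the channel width, and the number of heads are illustrative assumptions and do not reproduce the released TransT/Transtfpn code.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One residual cross-attention step, as in Eqs. (1)-(4)."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads)

    def forward(self, f_q, f_kv, pos_q, pos_kv):
        # Q = f_q + pos_q, K = f_kv + pos_kv, V = f_kv (positions are not added to the values)
        out, _ = self.attn(query=f_q + pos_q, key=f_kv + pos_kv, value=f_kv)
        return f_q + out  # residual connection

def flatten_map(f):
    """Flatten a (B, C, H, W) map into the (H*W, B, C) sequence layout
    expected by nn.MultiheadAttention (batch_first=False by default)."""
    b, c, h, w = f.shape
    return f.flatten(2).permute(2, 0, 1)

if __name__ == "__main__":
    # Example of one Layer3 exchange; features are assumed to have been
    # projected to a common width (here 256 channels) beforehand.
    block = CrossAttentionBlock()
    f_x3, f_z3 = torch.randn(256, 1, 256), torch.randn(64, 1, 256)
    p_x3, p_z3 = torch.randn_like(f_x3), torch.randn_like(f_z3)
    x3 = block(f_x3, f_z3, p_x3, p_z3)   # Eq. (3)
    z3 = block(f_z3, f_x3, p_z3, p_x3)   # Eq. (4)
```

In the full model, four such blocks (N = 4) are stacked, mirroring the stacking depth used in TransT.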
(2)
Feature fusion strategy
The search features fused through Layer2 and Layer3 can integrate information from different network depths, including local details and global semantics. This integration strategy provides rich and comprehensive representation information for the model, enabling the model to better understand the essential attributes of the object, including shape, size, texture, etc., thereby improving the accuracy of object recognition. The output search feature $X_2$ from Layer2 and the output search feature $X_3$ from Layer3 are merged, and the main output process is shown in Equations (5)–(8).
$S_1(X_3) = \mathrm{norm}\left(X_3 + \mathrm{dropout}(X_2)\right)$ (5)
$S_2(X_3) = \mathrm{relu}\left(\mathrm{linear}_1(S_1(X_3))\right)$ (6)
$S_3(X_3) = \mathrm{linear}_2\left(\mathrm{dropout}(S_2(X_3))\right)$ (7)
$X_3 = \mathrm{norm}\left(S_3(X_3) + \mathrm{dropout}(S_3(X_3))\right)$ (8)
By leveraging linear layers, activation layers, and normalization layers, the model is capable of effectively integrating features from various levels. The application of dropout operations aids in preventing model overfitting and enhances its generalization capability across different scenarios. Ultimately, cross-attention correlation is conducted between $X_3$ and the template feature $Z_3$, which refines the model's attention allocation toward the object. This process ensures that the attention is more concentrated on the object's key areas, thereby improving the model's accuracy in predicting the object's position. The resulting search features are then utilized for the final prediction task.
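The fusion described by Equations (5)–(8) can be sketched as the small residual feed-forward module below. The hidden width, dropout rate, and the assumption that $X_2$ and $X_3$ have already been projected to a common channel width (e.g., 256) are illustrative choices, not values taken from the paper.

```python
import torch.nn as nn

class LayerFusion(nn.Module):
    """Merges the Layer2 output X2 into the Layer3 output X3 (Eqs. (5)-(8))."""
    def __init__(self, dim=256, hidden=2048, p=0.1):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.drop1, self.drop2, self.drop3 = nn.Dropout(p), nn.Dropout(p), nn.Dropout(p)
        self.linear1, self.linear2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)

    def forward(self, x3, x2):
        s1 = self.norm1(x3 + self.drop1(x2))      # Eq. (5)
        s2 = self.linear1(s1).relu()              # Eq. (6)
        s3 = self.linear2(self.drop2(s2))         # Eq. (7)
        return self.norm2(s3 + self.drop3(s3))    # Eq. (8)
```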
(3)
Feature visualization
In this paper, the model that integrates multi-layer semantic information on top of TransT is named Transtfpn. Because the Basketball and Bird video sequences in the OTB-100 dataset suffer from similar-background clutter, they are used to visually present the effect of multi-layer feature fusion: for the first frame of each sequence, the multi-level attention feature maps of TransT's Layer3 features and of Transtfpn's fused features are visualized as heat maps. This is shown in Figure 4, where N represents the number of stacked cross-attention modules.
As shown in Figure 4, when N = 4, the model is able to effectively synthesize attention information from multiple levels and boost its feature representation capability. Moreover, it is evident that the attention of TransT is rather dispersed, lacking sufficient focus on the object itself; its grasp of the object's state is not detailed enough, and the distribution of influence values is more susceptible to background factors. The results of Transtfpn are more accurate in both localization and in delimiting the object's extent. More importantly, Transtfpn captures the actual shape of the object, which greatly helps to distinguish the object from similar objects. Therefore, the model can improve its judgment accuracy during tracking through this learning.

2.1.3. An Object Enhancement Model Integrating Mask Features

While the cross-attention mechanism does improve the model’s focus on object regions to a certain degree, it may still cause attention to spread out or become misaligned in complex backgrounds or when there are multiple similar objects present. To tackle this problem, this paper incorporates a masking strategy during the cross-attention learning process of Layer2 and Layer3. The newly proposed model is called Transtfpn-Mask, with the goal of further filtering out key regions and suppressing unnecessary information. After each cross-attention fusion stage, the feature influence value—representing the association strength between each position in the search features and the template—is computed. These values are then sorted in descending order, and the top 60% of positions with the highest influence values are retained as salient regions. The influence values of the discarded feature points are set to zero, while those of the retained regions remain unchanged to form an updated search feature, which is then used in the next round of cross-attention learning. This mask is dynamically updated after each fusion stage, ensuring that the model consistently attends to the most relevant regions associated with the object. By incorporating this masking mechanism, the model is able to automatically identify and retain the most informative regions before each round of cross-attention computation, effectively suppressing background noise and irrelevant features. Compared to treating all regions equally, the dynamic filtering provided by the mask offers several advantages: first, it improves the focus of attention distribution, allowing the model to concentrate more effectively on the core area of the object and reducing the risk of attention drift; second, by diminishing the influence of low-relevance features, it enhances the model’s discriminative capability in scenarios with similar objects or cluttered backgrounds; third, the mask helps reduce redundant computation, thereby improving feature utilization efficiency. Overall, the masking strategy not only improves the quality of feature fusion but also enhances the model’s tracking robustness and stability in dynamic and complex scenes. Figure 5 takes Layer3 as an example to show the processing process of using a mask strategy for Layer3 features.
The shape of the search feature X is H × W, and the value of each position represents the degree of association between the feature point and the template feature, which is called the feature influence value in this paper. The search features are flattened into a sequence of the length H × W. In the previous section, it is explained that the features of Layer2 and Layer3 undergo four rounds of cross-attention fusion. This paper dynamically generates feature masks according to the feature influence values of the search features output by each cross-attention and further optimizes the original feature influence values, which can effectively adjust the distribution of feature influence values.
The feature points of the search feature are sorted from high to low according to their feature influence values, and the sorted feature influence values are denoted as $v_i$, where $i$ is the sorted index used to distinguish different feature points.
Keeping a certain proportion of the feature points with high feature influence values helps improve the recognition and tracking performance of the model. Therefore, to determine the specific retention ratio, this paper conducted tests on the OTB-100 dataset, with the results shown in Table 2. As can be seen from the table, when 40%, 50%, 80%, or 90% of the feature points were retained, the success rate of the model was low. Furthermore, compared with retaining 70% of the feature points, retaining 60% required less computation, achieved a higher success rate, and gave the best tracking effect, striking a balance between feature retention and suppression.
We label the mask of a retained feature point with $M_i = 1$ and the mask of a removed feature point with $M_i = 0$. During calculation, the feature influence value of feature points whose mask equals 0 is set to 0, while that of feature points whose mask equals 1 is unchanged. The feature influence value $V_i$ of each feature point after applying the mask is calculated as shown in Equation (9).
$V_i = v_i \cdot M_i$ (9)
Finally, the feature points after applying the mask are reassembled into a new search feature according to their original positions. After the mask strategy is applied, the influence values of the features within the search feature are updated, and the updated search features are used for the next round of cross-attention learning. The search features produced by each cross-attention stage dynamically generate masks based on the updated feature influence values, ensuring that the model only focuses on the features labeled as important in subsequent processing.
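A minimal sketch of the dynamic mask of Equation (9) is given below. The way the per-position influence value $v_i$ is derived from the fused search feature (here, its channel-wise norm) and the exact tensor layout are assumptions made for illustration; only the top-60% retention rule follows the text.

```python
import torch

def apply_dynamic_mask(search_feat, keep_ratio=0.6):
    """search_feat: (B, L, C) flattened search-feature sequence after one
    cross-attention stage. Keeps the top `keep_ratio` positions by influence
    value and zeroes out the rest (V_i = v_i * M_i)."""
    influence = search_feat.norm(dim=-1)            # per-position v_i (assumed proxy)
    k = max(1, int(keep_ratio * influence.shape[1]))
    topk_idx = influence.topk(k, dim=1).indices     # indices of retained positions
    mask = torch.zeros_like(influence)              # M_i = 0 by default
    mask.scatter_(1, topk_idx, 1.0)                 # M_i = 1 for retained positions
    return search_feat * mask.unsqueeze(-1)         # masked feature for the next stage
```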
This paper visualizes the feature influence value distribution diagram of the search feature after each cross-attention learning; the masks were dynamically generated based on the feature influence value distribution, and the feature influence value distribution maps after the application of the masks are shown in Table 3 (taking Layer3 features as an example).
From Table 3, it can be observed that applying feature masks after each layer's cross-attention learning effectively highlights the features most relevant to the object. The feature mask enhances the semantic understanding ability of the model; by filtering for the most influential features, it better supports effective semantic enhancement and object recognition in various complex environments. At the same time, the mask optimizes attention allocation at the feature level: by retaining only a certain proportion of high-influence features, it helps the model more accurately locate the object among numerous candidate regions. This optimized attention allocation mechanism can effectively handle the problem of similarity clutter during tracking, as it ensures that the model's attention is more focused on the true object.

2.2. Object Template Update Mechanism

During the tracking process, the object's appearance and posture may change, and tracking can also be influenced by external factors. If the template is not updated in a timely manner, the tracking model may fail because it cannot adapt to the object's new appearance. Common template update methods include updating the template periodically at preset time intervals or frame counts, and large-scale temporal modeling methods that predict the future state by learning the dynamic changes of the object from rich contextual information. However, the former may suffer from low template image quality, while the latter incurs a high computational cost. In practical applications, both the real-time performance and the accuracy of tracking must be considered. Therefore, this paper adopts the previously proposed object template update mechanism guided by judgment conditions [22], which triggers a template update only under specific conditions, achieving timely and accurate updating of the object template. Specifically, Transtfpn-Mask is taken as the baseline and combined with a trained IoU prediction module to judge the tracking results; the combined model is named Transtfpn-T. The IoU prediction module leverages the classification and regression outputs to estimate the IoU value of the current frame. By gauging the correlation between the current frame and the two preceding frames, it decides whether an update of the object template is warranted. The detailed template update model is illustrated in Figure 6.
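The update decision can be summarized by the small sketch below, which assumes the IoU prediction head outputs one scalar per frame; the confidence and consistency thresholds are illustrative placeholders, not values specified in [22].

```python
def should_update_template(iou_t, iou_t1, iou_t2, iou_thr=0.7, diff_thr=0.1):
    """iou_t, iou_t1, iou_t2: predicted IoU of the current frame and the two
    preceding frames. Returns True when the current prediction is both confident
    and consistent with (highly correlated to) the adjacent frames."""
    confident = min(iou_t, iou_t1, iou_t2) > iou_thr
    consistent = abs(iou_t - iou_t1) < diff_thr and abs(iou_t - iou_t2) < diff_thr
    return confident and consistent

# If True, a new template is cropped around the currently predicted bounding box
# and fed back into the template branch; otherwise the existing template is kept.
```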

3. Experimental Results and Analysis

3.1. Experimental Environment and Settings

In this research, the experimental foundation was laid using the Ubuntu 20.04.5 operating system, with the PyTorch 1.13.1 deep learning framework constructed upon it. For the development and training of the network model, the Python 3.9 programming language was predominantly employed, while model testing was collaboratively executed using Python and MATLAB. Throughout the experimental process, a suite of development modules, including opencv-python, pycocotools, pandas, torch, pytz, tqdm, and torchvision, were utilized to facilitate the implementation of various functionalities. The primary software involved was PyCharm-community-2022.3.3 and MATLAB 2020b.
The key specifications of the computer utilized in the experiment were as follows: It was equipped with the 12th Gen Intel (R) Core (TM) i7-12700 processor (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 3080 Ti graphics card (NVIDIA Corporation, Santa Clara, CA, USA), and 32 GB of memory. Given the constraints of available equipment, these hardware configurations robustly supported the experimental operations, ensuring the smooth execution of deep learning model training and testing. They met the fundamental computational performance requirements of this research and contributed to the acquisition of precise and dependable experimental results.
In terms of model training, this paper adopted a strategy similar to that of TransT, choosing AdamW [24] as the optimizer. The specific training process was divided into two stages:
The first stage: We concentrated on training the network backbone of Transtfpn-Mask. In TransT's original setup, the backbone's learning rate was set to 1 × 10−5, the learning rate of the other parameters to 1 × 10−4, and the weight decay to 1 × 10−4, with a batch size of 38; training spanned 1000 epochs of 1000 iterations each, and the learning rate was reduced to one tenth of its initial value after 500 epochs. Given the constraints of our equipment, we kept the same learning rates (1 × 10−5 for the backbone, 1 × 10−4 for the other parameters) and weight decay (1 × 10−4), but the batch size was capped at 6. We trained for a total of 665 epochs, each still comprising 1000 iterations, because the model was found to converge at around 665 epochs, indicating that further training would not significantly improve performance. The learning rate was likewise reduced to one tenth of its original value after 500 epochs.
The second stage: Building on the Transtfpn-Mask foundation, we froze all weights of its trunk and shifted the training focus to the IoU prediction head. The training dataset remained consistent with the first stage. Here, the learning rate was raised to 1 × 10−3, the weight decay stayed at 1 × 10−4, and the batch size was set to 24. We trained for a total of 150 epochs of 200 iterations each, and the learning rate was reduced by a factor of 10 after 50 epochs.
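For reference, the first-stage optimizer and schedule described above map onto a standard PyTorch setup such as the sketch below; the parameter-group filter on the name "backbone" is a placeholder, not the naming used in any released code.

```python
import torch

def build_stage1_optimizer(model):
    # Backbone parameters use a lower learning rate than the rest of the model.
    backbone = [p for n, p in model.named_parameters() if "backbone" in n]
    others = [p for n, p in model.named_parameters() if "backbone" not in n]
    optimizer = torch.optim.AdamW(
        [{"params": backbone, "lr": 1e-5},
         {"params": others, "lr": 1e-4}],
        weight_decay=1e-4)
    # Learning rate drops to one tenth after epoch 500.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.1)
    return optimizer, scheduler
```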
Owing to these equipment limitations, the TransT model reproduced in this paper performs slightly worse than the originally reported results. It should be emphasized, however, that the reproduced TransT and all of the other algorithms in this paper were trained and tested under the same experimental environment, training datasets, and parameters. Therefore, the TransT results used for comparison in this paper are those obtained after reproduction.

3.2. Datasets

To precisely assess the strengths and weaknesses of the proposed tracking algorithm relative to other algorithms, this study chose the globally recognized LaSOT, OTB-100, and UAV123 datasets for evaluation. The video images of these datasets were collected by a single camera, which realistically simulated the real tracking scenes under fixed and non-fixed camera situations. The application scenarios are extremely wide-ranging, involving public area monitoring, natural environments, urban traffic, sports events, unmanned aerial vehicle tracking, and other realistic fields and even cover some animation scenes. Furthermore, the dataset encompasses a variety of challenging factors, including changes in illumination, occlusion, object deformation, and cluttered backgrounds. These elements effectively replicate the diverse real-world issues that can arise in tracking tasks, thus offering ample and realistic data for rigorous algorithm testing.
Regarding the configuration of training parameters, the training process primarily utilized the LaSOT training set, with a sampling approach that randomly selects two frames from each image sequence. In the test phase, the OTB-100, LaSOT, and UAV123 test sets were used. These datasets were selected because their scenarios are rich and diverse, allowing algorithm performance to be tested comprehensively; at the same time, their memory consumption is small, which enables efficient training and testing and allows algorithm parameters to be adjusted flexibly and promptly according to test results. The details of these datasets are as follows.
(1)
LaSOT [25]: Launched at CVPR in 2019, the LaSOT (Large-scale Single Object Tracking) dataset has become a cornerstone in the realm of single-object tracking, offering a demanding and representative benchmark for evaluating object tracking algorithms. This dataset comprises 1550 video sequences, which are thoughtfully divided among training, testing, and validation sets. Spanning 85 object categories, with each category containing 20 sequences, the dataset amasses over 3.87 million frames, making it one of the most extensive collections in the single-object tracking domain at the time of its release. The LaSOT dataset is distinguished by its comprehensive attribute labeling, with each sequence meticulously annotated with 14 attributes. These attributes encompass a wide range of challenges, such as illumination variation (IV), fast motion (FM), full occlusion (FOC), object deformation (DEF), partial occlusion (POC), motion blur (MB), background clutter (BC), scale variation (SV), out-of-view (OV), camera motion (CM), rotation (ROT), low resolution (LR), aspect ratio change (ARC), and viewpoint change (VC). This detailed labeling accurately captures the diverse and complex scenarios that objects may encounter within video sequences. In constructing the dataset, the creators adhered to principles of high-quality dense annotation, long-term tracking sequences, balanced categorization, and thorough labeling. These principles ensured the dataset's high quality and richness, providing a robust and reliable foundation for single-object tracking research. As a result, the LaSOT dataset has significantly propelled advancements and breakthroughs in this field.
(2)
OTB-100 [26]: As a classic dataset in the field of single-object tracking, the OTB-100 (Visual Object Tracking Benchmark) builds upon the foundation of OTB-2013 by adding 50 new video sequences, bringing the total number of videos to 100. These sequences encompass 11 challenging attributes: low resolution (LR), illumination variation (IV), occlusion (OCC), fast motion (FM), motion blur (MB), in-plane rotation (IPR), background clutter (BC), out-of-plane rotation (OPR), deformation (DEF), out-of-view (OV), and scale variation (SV). The dataset includes both grayscale and color images. Notably, each video sequence features at least 3–5 of these challenging attributes, creating a demanding environment for evaluating and comparing various object tracking algorithms. This setup effectively tests the robustness and performance of these algorithms.
(3)
UAV123 [27]: Comprising 123 video sequences and over 110,000 frames, the UAV123 dataset is designed to evaluate the performance of object tracking algorithms on unmanned aerial vehicles (UAVs). It provides videos captured in diverse and complex environments, assisting researchers in developing and testing robust and reliable tracking algorithms. The videos feature a wide range of scenes, including cities, villages, and coastlines, as well as various object types such as pedestrians, vehicles, and bicycles, which closely simulate real-world scenarios that drones may encounter during object tracking. The dataset’s video sequences are annotated with 12 challenging attributes: aspect ratio change (ARC), background clutter (BC), out-of-view (OV), camera motion (CM), fast motion (FM), similar objects (SOB), full occlusion (FOC), illumination variation (IV), viewpoint change (VC), low resolution (LR), partial occlusion (POC), and scale variation (SV).

3.3. Analysis of Experimental Results

Due to the large number of finely annotated and publicly available short video sequences in the OTB-100 dataset, it remains challenging while offering high operability and strong visual presentation, making it well suited for visualizing detailed changes during the tracking process. In contrast, although LaSOT and UAV123 are more challenging and representative in terms of performance evaluation, their video sequences are generally longer, with frequent variations in object scale and scene content, making them less suitable for complete or efficient visual presentation within limited space. Moreover, the complex scenes in LaSOT and the aerial perspectives in UAV123 are primarily intended for the quantitative evaluation of algorithms' overall performance in complex, real-world scenarios; when it comes to showcasing detailed aspects, such as distinguishing objects from similar backgrounds, the typically short sequences in OTB-100 are more conducive to highlighting the fine-grained improvements achieved by the proposed method. Therefore, in order to effectively test whether the proposed algorithm makes progress in handling similar-background clutter during tracking, this study carefully selected video sequences with complicated backgrounds and similar-object clutter from the OTB-100 dataset, tested the proposed algorithm on these sequences, and visualized the tracking results so that the algorithm's performance can be observed intuitively. In addition, a quantitative analysis of the proposed algorithm was carried out on the OTB-100, LaSOT, and UAV123 datasets. On the LaSOT dataset, on the one hand, video sequences with background interference were tested to explore in depth the algorithm's behavior in such complex scenes; on the other hand, the algorithm presented in this paper was benchmarked against other comparable algorithms in terms of overall performance. The objective evaluation metrics utilized were accuracy and success rate. Through accurate quantitative evaluation, the advantages and disadvantages of each algorithm were clearly identified.
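The two metrics mentioned above are commonly computed in the standard OTB style: accuracy (precision) is the fraction of frames whose predicted center lies within 20 pixels of the ground-truth center, and the success rate is the area under the curve of the overlap (IoU) success plot. The sketch below follows these common benchmark definitions rather than any code released with this paper.

```python
import numpy as np

def iou(pred, gt):
    """pred, gt: (N, 4) arrays of boxes in [x, y, w, h] format."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision_at_20px(pred, gt):
    """Fraction of frames with center-location error of at most 20 pixels."""
    err = np.linalg.norm((pred[:, :2] + pred[:, 2:] / 2) -
                         (gt[:, :2] + gt[:, 2:] / 2), axis=1)
    return float((err <= 20).mean())

def success_auc(pred, gt, steps=21):
    """Area under the success plot: mean success rate over IoU thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0, 1, steps)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))
```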

3.3.1. Qualitative Analysis

As shown in Figure 7, this paper tested video sequences containing background clutter factors (including similar-object clutter) from the OTB-100 dataset, namely (a) Bolt, (b) Deer, (c) Football, (d) Liquor, (e) Matrix, and (f) Soccer. The focus was on testing the original TransT algorithm and the improved algorithms based on it and visualizing the predicted bounding boxes. In the (a) video sequence, a distractor with the same texture, color, and appearance as the object is always present nearby. Even with the fusion of hierarchical features, the prediction still deviated in the later stage of tracking; however, after adding the template update module and the mask module, the model achieved stable tracking, indicating that the initial information alone is insufficient to cope with changes in scene information and that key guidance information needs to be obtained in a timely manner during the intermediate process. In the (b) video sequence, TransT could not focus its attention on the regions of interest and instead included the surrounding non-object regions in its prediction. In contrast, the algorithm presented in this paper concentrated solely on the regions of interest, and its localization accuracy remained unaffected even when encountering objects with similar appearances, demonstrating the algorithm's strong capability in capturing the distinctive shape characteristics of the object.
In the (c) video sequence, there are many similar objects around the object, and TransT, Transtfpn, and Transtfpn-Mask all exhibited bounding box drift. Only Transtfpn-T adapted well to environmental changes during tracking, because the model makes good use of temporal information to adapt to changes in the object's shape and spatial position and assigns higher influence to information near the object's center, keeping it stable in complex scenes. In the (d) video sequence, both TransT and Transtfpn exhibited varying degrees of bounding box deviation; Transtfpn-Mask was disturbed by clutter during object movement, while Transtfpn-T predicted the object extent most accurately. It is evident that, by consistently updating the object's latest state and narrowing the scope of object features to a certain extent, the tracker could adapt to the object's continuous positional changes and mitigate the clutter of background information. The biggest problem of TransT in the (e) video sequence was its inability to adapt to rapidly changing scenes and achieve precise localization, resulting in bounding box drift. After integrating multiple layers of features, the model could already adapt to most scene changes, and with temporal information and concentrated feature attention it adapted to all of them. In the (f) video sequence, the object frequently disappears and reappears, and as the viewpoint changes, the number of similar objects around it gradually increases. Neither TransT nor Transtfpn could adapt well to this complex scene, and Transtfpn-Mask experienced a brief bounding box drift. After introducing temporal information, even if the object disappeared from the image for a period of time, Transtfpn-T could continue to accurately locate it once it reappeared; its analysis of the scene was relatively accurate, and it could distinguish the object from similar objects around it regardless of the viewpoint.

3.3.2. Quantitative Analysis

This paper tested the original TransT algorithm and the improved algorithms Transtfpn, Transtfpn-T, and Transtfpn-Mask based on it on the public datasets OTB-100 and LaSOT. The algorithm was objectively evaluated by evaluating the accuracy and success rate values of the evaluation indicators, as shown in Table 4.
When comparing the Transtfpn algorithm to the TransT algorithm, the success rate on the OTB-100 dataset saw an increase of 0.9%, while the accuracy on the LaSOT dataset improved by 1.1% and the success rate by 0.4%. This study demonstrates that combining features from Layer2 and Layer3 allows the model to capture both rich semantic information and finer spatial details simultaneously. This fusion of hierarchical features enhances the model’s ability to distinguish between objects and backgrounds, thereby improving tracking accuracy. The Transtfpn-Mask algorithm, which incorporates dynamic masks based on Transtfpn, showed significant improvements over the TransT algorithm. On the OTB-100 dataset, it achieved a 3.7% increase in accuracy and a 5% increase in the success rate. On the LaSOT dataset, the average accuracy improved by 9.4% and the success rate by 7.9%. The addition of masks helps reduce background clutter and enhances focus on the object, thereby improving tracking accuracy and robustness. The Transtfpn-T algorithm, which integrates temporal information into Transtfpn-Mask, further enhanced performance. Compared to the TransT algorithm, it achieved a 4.3% increase in accuracy and a 4.5% increase in the success rate on the OTB-100 dataset. On the LaSOT dataset, the average accuracy improved by 11.4% and the success rate by 10.8%. The incorporation of temporal information fusion makes the algorithm more robust in handling scenarios such as rapid object movement, occlusion, and lighting changes, thereby improving overall tracking performance.
In addition, using the UAV123 dataset, this paper compared the final algorithm Transtfpn-T with other object tracking algorithms, such as ViTCRT [28], SiamAttn [29], HCAT [30], and DiMP [31]. The results are shown in Table 5.
As can be seen from Table 5, compared to some classic Transformer-based object tracking algorithms, Transtfpn-T has the highest success rate and the best tracking performance. The reason lies in the effective combination of multi-layer feature fusion, the dynamic masking strategy, and the template update module. These innovative methods enable algorithms to have higher tracking accuracy and robustness in complex scenes, especially when dealing with small-object scenes and similar-object clutter problems, showing significant advantages.
The problem under investigation was further evaluated on the OTB-100 and LaSOT datasets. Table 6 reports the results on video sequences in which the object appears against background clutter.
Comparing the results on resistance to similar-background clutter, Transtfpn, Transtfpn-Mask, and Transtfpn-T improved the success rate by 2.3%, 6%, and 9%, respectively, on the OTB-100 dataset and by 0.2%, 9.6%, and 10.7% on the LaSOT dataset. The proposed algorithm achieves a deep understanding and accurate tracking of objects. By enhancing appearance semantics, the model is able to focus more effectively on an object's key features, resulting in improved adaptability and accuracy in complex environments.
As presented in Table 7, this study conducted evaluations on the long-term tracking dataset LaSOT. In addition to the methods discussed in this paper, comparisons were made with several well-established object tracking algorithms, including SiamFC, SiamRPN++, SiamMask, and DiMP, as well as Transformer-based tracking algorithms such as SiamTPN, HCAT, and TrSiam.
This paper used a curve graph to plot the comparative tracking results of all the algorithms mentioned above, as shown in Figure 8.
Transtfpn-T, proposed in this paper, surpassed the average accuracy of the classic fully convolutional Siamese networks SiamFC, SiamMask, and SiamRPN++ by 31.8%, 18.6%, and 16.8%, respectively, and improved the success rate by 31.3%, 18.2%, and 15.4%. Compared to Transformer-based tracking algorithms, it also achieved a 2% to 6% improvement in average accuracy and success rate.
Compared to the classic fully convolutional Siamese network algorithms, Transtfpn-T, proposed in this paper, applies a self-attention mechanism, allowing the algorithm to consider all elements in the input sequence when processing each element. This integration of global information is particularly important for tracking, as it enables the model to better understand the relationship between the object and the background, as well as the interaction between objects, thereby greatly improving the accuracy of tracking. Therefore, it has better adaptability and generalization ability than fully convolutional Siamese networks in tracking results. This paper further optimizes the model based on the scalability of the Transformer to ensure that it can adapt to various tracking tasks and improves the algorithm’s understanding of deep semantics and contextual information. Therefore, the algorithm proposed in this paper performs more accurately and robustly in handling complex video sequences and dynamically changing environments.

3.4. Analysis of Algorithm Limitations

The comparative experimental results indicate that the proposed algorithm can still accurately track an object even in the presence of similar background interference, thereby verifying its robustness in complex environments. To further evaluate the generalization capability and practical value of the algorithm in real-world complex scenarios, this paper conducted tests on the representative OTB-100, LaSOT, and UAV123 datasets for scenarios such as rapid movement, occlusion, and illumination changes. Among them, the OTB-100 dataset represents classic tracking scenarios in real-world settings; the LaSOT dataset features longer sequences and richer interference factors; and the UAV123 dataset focuses on aerial video scenarios captured from UAV platforms, characterized by large scene variations, small object sizes, and frequent occlusions. The corresponding experimental results are shown in Table 8. Among them, “-” indicates that there is no corresponding scenario in the dataset for experimental testing.
From the table, it can be seen that the proposed algorithm demonstrated strong tracking performance across multiple representative datasets. In the LaSOT dataset, the proposed algorithm significantly outperformed TransT in both tracking accuracy and the success rate under all testing conditions, verifying its robustness and stability in long-term and complex scenarios. It showed particularly strong adaptability when dealing with challenges such as fast motion, motion blur, and illumination changes. On the OTB-100 dataset, the proposed algorithm generally achieved higher tracking accuracy and success rates than TransT but showed slight performance degradation in certain cases such as fast motion, motion blur, out-of-view scenarios, and scale variation: under fast motion, the average accuracy was 0.005% lower than that of TransT; under motion blur, it was 0.032% lower; for out-of-view scenarios, the average accuracy and success rate were 0.132% and 0.066% lower, respectively; and under scale variation, the average accuracy was 0.007% lower. This can be attributed to the algorithm’s suboptimal responsiveness in sudden and severe change scenarios, particularly when an object undergoes rapid morphological changes or temporary disappearance within a short timeframe. The model’s adaptability to such feature variations still requires improvement. On the UAV123 dataset, the proposed algorithm outperformed TransT in most scenarios in terms of both tracking accuracy and the success rate, indicating its advantage in identifying and tracking objects from aerial viewpoints and at long distances. However, under illumination variation, there was a slight decline, with the average accuracy and success rate being 0.008% and 0.005% lower than those of TransT, respectively. This was due to the dramatic lighting changes affecting the matching accuracy between the feature template and the current object.
Overall, the proposed algorithm exhibited strong robustness and generalization capability in addressing diverse tracking challenges such as fast motion, low resolution, and scale variation, with particularly outstanding performance on more challenging datasets like LaSOT and UAV123, fully validating its effectiveness. However, the algorithm still showed limitations in handling abrupt changes such as motion blur and out-of-view situations on the OTB-100 dataset. Future improvements may include developing a more efficient motion prediction module to enhance responsiveness and stability under rapidly changing scenarios; further optimizing the template update strategy to better adapt to variations in illumination, scale, and viewpoint; and integrating temporal context information to strengthen recognition capabilities in out-of-view cases.

4. Summary

This paper proposes an object tracking algorithm built on a Transformer attention mechanism. To address the problem of bounding box drift when objects with the same texture or appearance appear around the target, a hierarchical feature fusion method is proposed to enhance semantic information, together with attention to the temporal feature information of the object. By using the features of Layer2 and Layer3 in ResNet50, information about the object at different levels of abstraction can be captured; this fusion strategy helps the model better understand and distinguish subtle differences between the object and the background or similar objects. By dynamically adding masks, the model's attention can be focused more on the object itself, reducing the negative impact of similar objects or complex backgrounds on tracking performance. Finally, an update module is added at the end of the network to reduce tracking loss or false tracking caused by background clutter. The experiments show that the proposed improvements not only significantly improve the tracking performance of the model in scenarios with background clutter but also improve its overall tracking performance.
However, experimental results reveal that the proposed algorithm performed variably across different scenarios in different datasets. For instance, its performance was relatively poor in scenarios such as motion blur, scale variation, and fast motion within the OTB-100 dataset, while it demonstrated better effectiveness in other scenarios and on the LaSOT and UAV123 datasets. Moreover, compared with other Transformer-based object tracking algorithms such as STARK, SwinTrack, and MixFormer, the performance of the proposed method was still inferior. Therefore, future improvements to the algorithm can be made in the following aspects:
(1)
Introducing an occlusion detection mechanism to dynamically control the frequency and content of template updates, enabling the model to better adapt to various scenarios and enhance robustness.
(2)
Incorporating temporal contextual information to improve the model’s ability to recognize objects when they temporarily go out of view.
(3)
Embedding the mask learning process into the training pipeline to achieve end-to-end optimization and enable the adaptive adjustment of the mask threshold. Instead of relying on manually set fixed thresholds, the model will be able to dynamically learn the optimal masking strategy for different scenarios during training, further improving the flexibility and generalization capability of the algorithm.
Through these improvements and optimizations, the generalization ability, adaptability, and robustness of the proposed algorithm are expected to be further enhanced, providing stronger support for practical applications in complex and dynamic environments.
While the proposed object tracking algorithm demonstrates promising performance in various challenging scenarios, it is important to recognize its potential societal impacts, particularly concerning privacy and ethical risks. Object tracking technology, if misused, can lead to unauthorized surveillance, privacy infringement, or other malicious applications. To mitigate such risks, it is imperative that the development and deployment of object tracking algorithms strictly adhere to relevant legal frameworks, ethical guidelines, and responsible research practices. Future work should also consider incorporating privacy-preserving mechanisms, such as differential privacy or federated learning, to reduce the possibility of misuse in sensitive application domains. Furthermore, we emphasize that the research in this paper was conducted for general-purpose and positive applications, including intelligent transportation, robotics, and sports analytics. We strongly oppose any use of the proposed technology for illegal or unethical purposes. By acknowledging these potential risks and taking proactive measures, we aim to promote the sustainable and responsible development of object tracking technologies.

Author Contributions

Conceptualization, J.W.; methodology, J.W.; software, Y.W. (Yanru Wang) and D.Y.; validation, Y.W. (Yuan Wei) and Y.Q.; formal analysis, J.W. and D.Y.; investigation, W.H.; resources, Y.W. (Yanru Wang) and D.Y.; data curation, W.H. and Y.Q.; writing—original draft preparation, J.W. and Y.W. (Yuan Wei); writing—review and editing, J.W.; visualization, W.H. and Y.W. (Yanru Wang); supervision, J.W. and W.H.; project administration, J.W., Y.W. (Yuan Wei), and Y.Q.; funding acquisition, J.W. and W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation (NNSF) of China (No. 62127809) and the National Science Basic Research Program of Shaanxi (2024JC-YBMS-552).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available in Transtfpn at https://pan.baidu.com/s/1BKVfZwQBS67dK1_OJoIwUw?pwd=lp5s (accessed on 1 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4 December 2017; pp. 5998–6008. [Google Scholar]
  2. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-Convolutional Siamese Networks For Object Tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
  3. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking With Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  4. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-Aware Siamese Networks For Visual Object Tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  5. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution Of Siamese Visual Tracking With Very Deep Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
  6. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast Online Object Tracking And Segmentation: A Unifying Approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1328–1338. [Google Scholar]
  7. Mei, Y.; Yan, N.; Qin, H.; Yang, T.; Chen, Y. SiamFCA: A new fish single object tracking method based on siamese network with coordinate attention in aquaculture. Comput. Electron. Agric. 2024, 216, 108542. [Google Scholar] [CrossRef]
  8. Lu, J.; Li, S.; Guo, W.; Zhao, M.; Yang, J.; Liu, Y.; Zhou, Z. Siamese graph attention networks for robust visual object tracking. Comput. Vis. Image Underst. 2023, 229, 103634. [Google Scholar] [CrossRef]
  9. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer For Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
  10. Lan, J.P.; Cheng, Z.Q.; He, J.Y.; Li, C.; Luo, B.; Bao, X.; Xiang, W.; Geng, Y.; Xie, X. Procontext: Exploring Progressive Context Transformer For Tracking. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  11. Xie, F.; Chu, L.; Li, J.; Lu, Y.; Ma, C. VideoTrack: Learning To Track Objects via Video Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22826–22835. [Google Scholar]
  12. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754. [Google Scholar]
  13. Xing, D.; Evangeliou, N.; Tsoukalas, A.; Tzes, A. Siamese Transformer Pyramid Networks For Real-Time UAV Tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2139–2148. [Google Scholar]
  14. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. Hift: Hierarchical Feature Transformer For Aerial Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15457–15466. [Google Scholar]
  15. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  16. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context For Robust Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1571–1580. [Google Scholar]
  17. Zheng, Y.; Zhang, Y.; Xiao, B. Target-Aware Transformer Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4542–4551. [Google Scholar] [CrossRef]
  18. Wang, Z.; Kamata, S.-I. Multiple Mask Enhanced Transformer For Robust Visual Tracking. In Proceedings of the IEEE 2022 4th International Conference on Robotics and Computer Vision, Wuhan, China, 25–27 September 2022; pp. 43–48. [Google Scholar]
  19. Wu, Q.; Yang, T.; Liu, Z.; Wu, B.; Shan, Y.; Chan, A.B. Dropmae: Masked Autoencoders With Spatial-Attention Dropout For Tracking Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14561–14571. [Google Scholar]
  20. Zhu, Q.; Huang, X.; Guan, Q. TabCtNet: Target-aware bilateral CNN-transformer network for single object tracking in satellite videos. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103723. [Google Scholar] [CrossRef]
  21. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8126–8135. [Google Scholar]
  22. Wang, J.; Wang, Y.; Que, Y.; Huang, W.; Wei, Y. Object Tracking Algorithm Based on Integrated Multi-Scale Templates Guided by Judgment Mechanism. Electronics 2024, 13, 4309. [Google Scholar] [CrossRef]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning For Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  25. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A High-Quality Benchmark For Large-Scale Single Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
  26. Wu, Y.; Lim, J.; Yang, M.-H. Online Object Tracking: A Benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
  27. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461. [Google Scholar]
  28. Di Nardo, E.; Ciaramella, A. Tracking Vision Transformer With Class And Regression Tokens. Inf. Sci. 2023, 619, 276–287. [Google Scholar] [CrossRef]
  29. Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable Siamese Attention Networks For Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6728–6737. [Google Scholar]
  30. Chen, X.; Kang, B.; Wang, D.; Li, D.; Lu, H. Efficient Visual Tracking Via Hierarchical Cross-Attention Transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 461–477. [Google Scholar]
  31. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning Discriminative Model Prediction For Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
  32. Blatter, P.; Kanakis, M.; Danelljan, M.; Van Gool, L. Efficient Visual Tracking With Exemplar Transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1571–1581. [Google Scholar]
  33. Liang, L.; Chen, Z.; Dai, L.; Wang, S. Target signature network for small object tracking. Eng. Appl. Artif. Intell. 2024, 138, 109445. [Google Scholar] [CrossRef]
  34. Nai, K.; Chen, S. Learning a novel ensemble tracker for robust visual tracking. IEEE Trans. Multimed. 2023, 26, 3194–3206. [Google Scholar] [CrossRef]
  35. Yang, X.; Zeng, D.; Wang, X.; Wu, Y.; Ye, H.; Zhao, Q.; Li, S. Adaptively bypassing vision transformer blocks for efficient visual tracking. Pattern Recognit. 2025, 161, 111278. [Google Scholar] [CrossRef]
  36. Wei, Z.; He, Y.; Cai, Z. Enhanced Object Tracking by Self-Supervised Auxiliary Depth Estimation Learning. arXiv 2024, arXiv:2405.14195. [Google Scholar]
Figure 1. The overall framework of the tracking algorithm proposed in this paper.

Figure 2. ResNet50 output heat maps for different layers.

Figure 3. Network framework of attention mechanism fused with multi-layer deep features.

Figure 4. Visual comparison of attention feature maps between Transtfpn and TransT at different values of N (N represents the number of stacked cross-attention modules).

Figure 5. Feature mask generation and processing process (taking Layer3 as an example).

Figure 6. The object template update module. (The green, blue, and brown modules correspond to the feature vectors of the frames t − 1, t, and t − 2, respectively. Meanwhile, the gray module signifies the feature vector output by the IOU prediction module.)

Figure 7. Performance comparison of the proposed algorithm and TransT on the OTB-100 dataset.

Figure 8. Graph comparing the results of the proposed algorithm with other algorithms on the LaSOT dataset. (a) Norm precision; (b) success rate value.
Table 1. The success rate of TransT under 11 tracking factors on the OTB-100 dataset.

Factor:   IV    SV    OCC   DEF   MB    IPR   OPR   OV    BC    LR    FM
Suc (%):  63.8  67.2  62.5  60.6  68.4  64.5  62.8  63.2  56.4  58.6  66.2
Table 2. The accuracy and success rate of the model when retaining different rates of high feature influence values.

Feature Points Retention Rate   Precision (Pre)   Success (Suc)
40%                             0.886             0.677
50%                             0.892             0.681
60%                             0.888             0.689
70%                             0.895             0.686
80%                             0.887             0.679
90%                             0.690             0.479
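For context, the retention rates swept in Table 2 amount to keeping only the top fraction of feature influence values and suppressing the rest. A minimal sketch of such a top-p% mask is given below, assuming a PyTorch implementation; the helper name and sweep values are illustrative.

```python
# Minimal sketch (PyTorch assumed) of a top-p% retention mask of the kind
# swept in Table 2. The helper name and sweep values are illustrative.
import torch


def retain_top_fraction(influence: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Zero out all but the highest `keep_ratio` fraction of values per sample."""
    # influence: B x N feature influence values.
    cutoff = torch.quantile(influence, 1.0 - keep_ratio, dim=1, keepdim=True)
    return influence * (influence >= cutoff).float()


# Example sweep mirroring the retention rates in Table 2:
# for ratio in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
#     masked = retain_top_fraction(torch.rand(1, 1024), ratio)
```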
Table 3. A visualization of the effect of mask application on each cross-attention search feature. (Each row corresponds to the output of the first, second, third, and fourth cross-attention modules and shows the feature influence value distribution map of the search feature, the generated mask map, and the feature influence value distribution map after mask application.)
Table 4. Comparison of tracking results between TransT and the proposed algorithm variants on the OTB-100 and LaSOT datasets.

Algorithm        Depth Features     Template Type                   OTB-100 Pre   OTB-100 Suc   LaSOT Pre   LaSOT Suc
TransT           Layer3             Initial frame                   0.851         0.639         0.624       0.541
Transtfpn        Layer2 + Layer3    Initial frame                   0.848         0.648         0.635       0.545
Transtfpn-Mask   Layer2 + Layer3    Initial frame + update frame    0.888         0.689         0.718       0.620
Transtfpn-T      Layer2 + Layer3    Initial frame + update frame    0.894         0.684         0.738       0.649
Table 5. A performance comparison of the proposed algorithm with other algorithms on the UAV123 dataset.

Tracking Algorithm   Norm Precision   Success
Transtfpn-T          0.885            0.687
TransT               0.876            0.681
TaTrack [17]         -                0.641
SiamRPN++ [5]        0.804            0.611
SiamFC [2]           0.694            0.468
ViTCRT [28]          -                0.686
TrSiam [16]          0.865            0.663
SiamAttn [29]        0.845            0.650
SiamTPN [13]         0.823            0.636
HCAT [30]            0.812            0.627
DiMP [31]            0.849            0.642
E.T.Track [32]       -                0.623
TSN [33]             0.791            0.613
HSET [34]            0.762            0.544
ABTrack [35]         0.823            0.652
HiFT [14]            0.787            0.589
Table 6. Comparison of tracking results between TransT and the proposed algorithm in background clutter scenes.

Algorithm Name   OTB-100 Norm Precision   OTB-100 Success   LaSOT Norm Precision   LaSOT Success
TransT           0.752                    0.564             0.544                  0.473
Transtfpn        0.762                    0.587             0.550                  0.475
Transtfpn-Mask   0.801                    0.624             0.653                  0.569
Transtfpn-T      0.860                    0.654             0.660                  0.580
Table 7. Comparison of tracking performance between the proposed algorithm and other methods on the LaSOT dataset.

Tracking Algorithm   Norm Precision   Success
Transtfpn-T          0.738            0.649
TransT               0.624            0.541
TrSiam [16]          0.718            0.629
HCAT [30]            0.687            0.593
ViTCRT [28]          0.678            0.646
E.T.Track [32]       -                0.591
SiamTPN [13]         0.683            0.581
TaTrack [17]         0.661            0.578
DiMP [31]            0.642            0.560
SiamRPN++ [5]        0.570            0.495
SiamMask [6]         0.552            0.467
SiamAttn [29]        0.648            0.560
SiamFC [2]           0.420            0.336
TSN [33]             0.532            0.325
HSET [34]            0.354            0.372
ABTrack [35]         0.733            0.634
MDETrack [36]        0.661            0.591
Table 8. Comparison of results between TransT and the proposed algorithm under other influencing factors.

Factors                  Algorithm     OTB-100 Pre   OTB-100 Suc   LaSOT Pre   LaSOT Suc   UAV123 Pre   UAV123 Suc
Fast motion              TransT        0.871         0.662         0.434       0.373       0.860        0.656
                         Transtfpn-T   0.866         0.684         0.583       0.505       0.863        0.658
Motion blur              TransT        0.883         0.684         0.562       0.487       -            -
                         Transtfpn-T   0.851         0.685         0.713       0.623       -            -
Deformation              TransT        0.852         0.606         0.640       0.559       -            -
                         Transtfpn-T   0.877         0.658         0.754       0.670       -            -
Illumination variation   TransT        0.837         0.638         0.609       0.524       0.816        0.617
                         Transtfpn-T   0.841         0.673         0.775       0.675       0.808        0.612
Low resolution           TransT        0.808         0.586         0.506       0.431       0.772        0.542
                         Transtfpn-T   0.956         0.714         0.651       0.564       0.794        0.558
Occlusion                TransT        0.814         0.605         -           -           -            -
                         Transtfpn-T   0.851         0.656         -           -           -            -
Out of view              TransT        0.850         0.632         0.512       0.439       0.857        0.663
                         Transtfpn-T   0.718         0.566         0.656       0.575       0.867        0.672
Scale variation          TransT        0.896         0.672         0.621       0.538       0.860        0.667
                         Transtfpn-T   0.889         0.694         0.737       0.647       0.870        0.674