Article

An Object Tracking Algorithm Based on Multi-Scale Attention and Adaptive Fusion

College of Information Science and Engineering, Shenyang Ligong University, Shenyang 110159, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 2646; https://doi.org/10.3390/app16062646
Submission received: 21 January 2026 / Revised: 4 March 2026 / Accepted: 6 March 2026 / Published: 10 March 2026

Abstract

Single-object tracking in complex scenes faces challenges such as drastic target scale variation and strong background interference. To address these issues, an object tracking algorithm based on multi-scale attention and adaptive fusion is proposed. The method integrates a multi-scale attention module and an adaptive gated fusion module, enabling the adaptive mining of key features and dynamic adjustment of fusion weights across multi-level features. This effectively highlights target regions, suppresses redundant information, and enhances the model’s discriminative capability and robustness under complex backgrounds and occlusion. Experiments are conducted on the OTB100 and UAV123 datasets. Results show that, compared with the baseline model, the proposed algorithm improves the success rate and precision by 1.9% and 3.3%, respectively, on OTB100, and by 2.9% and 3.5%, respectively, on UAV123. Moreover, it achieves superior performance when facing typical challenging attributes such as occlusion, scale variation, and background clutter. In summary, the proposed algorithm enhances both tracking accuracy and robustness, offering a viable approach for object tracking under complex conditions.

1. Introduction

With the rapid development of intelligent vision systems, visible-light object tracking, as an important branch of computer vision, has been widely used in video surveillance [1], intelligent transportation [2], and autonomous driving [3], and has attracted sustained attention from researchers worldwide. Recent advances in object tracking have witnessed a transition from hand-crafted features to deep neural networks [4]. We provide a systematic analysis of prior work from the following aspects.
Traditional visible-light tracking methods are mainly based on CF (Correlation Filter) approaches and hand-crafted feature-based algorithms, such as MOSSE (Minimum Output Sum of Squared Error) [5], KCF (Kernelized Correlation Filter) [6], and DSST (Discriminative Scale Space Tracking) [7]. Such methods are computationally efficient and easy to implement, making real-time tracking feasible in many settings and providing a solid foundation for subsequent low-latency tracking pipelines. Moreover, ECO [8] substantially reduces parameter complexity and mitigates the risk of overfitting in correlation filter-based trackers by introducing factorized convolution operators and a compact generative model, achieving state-of-the-art performance on multiple benchmarks at the time. However, due to their reliance on hand-crafted features, these methods often struggle to cope with occlusion, scale variation, and appearance changes in complex scenes, which can easily lead to tracking drift or failure. Consequently, their robustness and generalization ability remain limited.
With recent advances in deep learning, Transformers [9] have gained traction in object tracking because they can model global context and capture long-range dependencies through multi-head self-attention. Methods such as TransT [10] and STARK [11] incorporate Transformer modules to relate the target to the surrounding background, which can improve tracking accuracy and robustness. Subsequently, MixFormer [12] further unifies feature extraction and target information fusion within a single Transformer framework, thereby avoiding the information loss induced by the conventional two-stage “extract-then-match” pipeline. In contrast, OSTrack [13] introduces a one-stage architecture that jointly performs feature extraction and relational modeling; with a candidate elimination strategy, it significantly accelerates inference while maintaining high tracking accuracy. However, Transformer-based trackers are often parameter-heavy and computationally expensive, which makes it challenging to satisfy real-time requirements. As a result, their adoption in practical engineering applications remains limited.
Siamese networks, due to their concise structure and computational efficiency, have become a mainstream paradigm in object tracking. Typical Siamese tracking frameworks include SiamFC, SiamRPN, and SiamRPN++, which extract features through weight-shared branches and perform efficient matching between the template and the search region. SiamFC [14] adopts a fully convolutional architecture for end-to-end feature matching, featuring a simple design and high speed; however, its performance is limited under target scale variation and pose variation. SiamRPN [15] introduces a Region Proposal Network (RPN), which better handles target scale variation and partial appearance variations. Yet, under complex backgrounds or severe target occlusion, this method still suffers from inaccurate localization or even tracking failure. SiamRPN++ [16] further enhances tracking performance through deep feature fusion, but it can be sensitive to training data quality and shows limited generalization capability. SiamBAN [17] boosts localization accuracy by introducing a bounding box regression module; however, the model becomes heavier, and drift may still occur in long-term tracking and frequent occlusion scenarios, while adaptation to diverse target categories remains insufficient. SiamCAR (Siamese fully convolutional Classification And Regression) [18] simplifies the candidate generation and selection procedure by introducing a center-aware mechanism, which enhances real-time performance and localization accuracy.
Although such methods offer advantages in terms of computational speed, prior studies and surveys [19] indicate that their ability to model global context and multi-scale variations remains limited. Moreover, in cluttered backgrounds or in the presence of visually similar distractors, they are prone to spurious responses, which can lead to localization drift and even tracking failure. To alleviate the aforementioned limitations, related works have been improved from perspectives including online updating and sample mining. SGSiamAttn [20] leverages saliency maps to generate spatial attention weights, thereby enhancing target representations and suppressing interference from sea clutter. However, its limitation lies in the fact that the saliency-guided mechanism can be prone to failure in low-contrast conditions or under severe occlusion. CLNet [21] enhances generalization through sample mining and conditional updating schemes, but its adaptability to multi-scale object scenarios remains to be further improved. SiamDQCFA [22] measures the semantic similarity between the target and the search region using cosine similarity. However, it insufficiently models the global dependency between the target and the background, which easily weakens the target response in cluttered backgrounds and thus leads to tracking drift.
To address the above issues, an object tracking algorithm based on multi-scale attention and adaptive gated fusion is proposed, with SiamCAR adopted as the baseline framework. The main contributions of this work are summarized as follows:
(1) A Multi-Scale Attention Module (MSAM). In Siamese networks, features at different channels and spatial positions are often treated equally, which may weaken critical cues while amplifying redundant information, making it difficult to accurately distinguish the target from the background. To mitigate this issue, a multi-scale attention module is introduced, where multi-scale receptive fields are formed using branches with varied kernel sizes and dilated convolutions. Softmax-normalized learnable branch weights are further adopted for adaptive fusion, strengthening target responses in scale-varying scenarios and mitigating redundant background interference.
(2) An Adaptive Gated Fusion Module (AGFM). To mitigate the issues of redundancy and lack of adaptive complementarity in common fusion operations (e.g., concatenation), we propose a novel gating-based weighted fusion module. This module dynamically assigns weights to multi-level features through a learnable gating mechanism, which selectively amplifies useful information and suppresses less informative or contradictory signals. This adaptive approach leads to more effective feature integration and enhanced representation capability.
With these improvements, the proposed method enhances tracking stability and localization accuracy under complex scenarios on top of the SiamCAR baseline, and provides a feasible implementation for feature enhancement and adaptive fusion in lightweight Siamese tracking frameworks.

2. Preliminary Work

2.1. SiamCAR Network Model

SiamCAR is an anchor-free tracking framework built upon a fully convolutional Siamese network. In this work, we adopt SiamCAR as the baseline, primarily due to its efficient feature extraction and matching paradigm. Specifically, the method employs ResNet-50 as the backbone to extract multi-level features, and applies depth-wise cross-correlation [23] to produce multi-scale similarity response maps. The response maps from different feature levels are then concatenated along the channel dimension and subsequently fed into the prediction head for target localization. However, this baseline architecture exhibits several limitations in complex scenarios. First, the backbone treats features at all spatial locations with equal importance, making it difficult to effectively suppress background noise when confronted with similar-object distractors or cluttered backgrounds. Second, the naive static concatenation of multi-level features overlooks potential semantic discrepancies and information redundancy across feature hierarchies; consequently, it cannot adaptively reweight features according to the scene, which constrains further improvements in tracking robustness.
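For illustration, the depth-wise cross-correlation step can be sketched in PyTorch as follows. The grouped-convolution formulation is the standard way to implement this operation; the tensor sizes and the two-image batch are illustrative rather than the exact settings used by SiamCAR.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Depth-wise cross-correlation between search-region and template features.

    search_feat:   (B, C, Hs, Ws) features of the search region
    template_feat: (B, C, Ht, Wt) template features, used as per-channel kernels
    returns:       (B, C, Hs-Ht+1, Ws-Wt+1) channel-wise similarity response maps
    """
    b, c, hs, ws = search_feat.shape
    _, _, ht, wt = template_feat.shape
    # Fold the batch into the channel dimension so that each (image, channel) pair
    # is correlated with its own template kernel via a grouped convolution.
    search = search_feat.reshape(1, b * c, hs, ws)
    kernel = template_feat.reshape(b * c, 1, ht, wt)
    response = F.conv2d(search, kernel, groups=b * c)
    return response.reshape(b, c, response.size(2), response.size(3))

# Example with illustrative sizes for one backbone level (C = 256 channels).
resp = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 7, 7))
print(resp.shape)  # torch.Size([2, 256, 25, 25])
```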

2.2. Convolutional Block Attention Module

The Convolutional Block Attention Module (CBAM) [24] consists of a channel-attention submodule and a spatial-attention submodule, which selectively enhance or suppress features along the channel and spatial dimensions, respectively, and are therefore strongly complementary. The channel-attention submodule aggregates spatial information via global pooling and learns channel-wise dependency weights using a multilayer perceptron (MLP). In contrast, the spatial-attention submodule generates a spatial weight map through channel-wise compression followed by convolution, thereby highlighting salient regions on the feature map. However, directly applying CBAM to visual object tracking still has limitations. Its spatial attention relies on single-scale convolutional operations, and such a fixed receptive field is difficult to adapt to the pronounced target scale variations that occur during tracking. Moreover, due to the lack of multi-scale contextual modeling, it is difficult to simultaneously preserve local details and capture global semantics under complex background interference, which in turn limits the discriminative capability of the tracker.
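For reference, a minimal CBAM-style block is sketched below. The reduction ratio and the 7 × 7 spatial kernel are common defaults rather than values prescribed here; the single fixed-size spatial convolution is precisely the limitation that motivates the multi-scale extension in Section 3.2.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM-style block: channel attention followed by single-scale spatial attention."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: a shared MLP applied to global average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        # Spatial attention: a single fixed-size convolution over pooled channel statistics.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention weights from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention weights from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

print(CBAM(256)(torch.randn(1, 256, 25, 25)).shape)  # torch.Size([1, 256, 25, 25])
```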

3. Model Improvements

3.1. Overall Algorithm Framework

When deployed in complex tracking scenarios characterized by occlusion, scale variation, and background clutters, SiamCAR exhibits limitations in its inherent feature representation and fusion mechanisms. These constraints often lead to pronounced tracking drift and eventual failure, highlighting the need for more robust feature integration. To address the aforementioned issues, we enhance the SiamCAR model through targeted modifications in both feature representation and fusion. During feature extraction, an MSAM (Multi-Scale Attention Module) is introduced to integrate channel and spatial attention across different scales, thereby strengthening the expressiveness of discriminative features. For feature fusion, an AGFM (Adaptive Gated Fusion Module) is designed to dynamically recalibrate the contributions from different feature levels, which effectively suppresses redundant or conflicting information while preserving useful cues. The feature fusion strategy is further refined to jointly improve discriminative capability and localization accuracy. The architecture of the enhanced model is depicted in Figure 1.

3.2. Multi-Scale Attention Module

In practical visual tracking, efficiently extracting target features and enhancing the model’s discriminative capability are essential for achieving robust performance. This is particularly critical in scenarios characterized by rapid scale variations and complex backgrounds, where conventional convolutional architectures encounter significant challenges [25]. SiamCAR employs a standard convolutional architecture as its backbone network. With increasing network depth, the semantic information pertaining to small targets in the deep feature representations tends to diminish, making it challenging to effectively balance the modeling of global contextual cues and the preservation of fine-grained local details. In addition, SiamCAR fails to differentiate the importance of features across channels and space, attenuating critical signals and exaggerating redundant ones, which degrades overall feature selectivity. To address the challenges posed by scale variations, a range of variants have been proposed in visual object tracking, such as multi-branch attention and dilated spatial attention, aiming to strengthen the model’s capacity to represent multi-scale features. However, most of these methods remain limited to architectural extensions or feature aggregation with fixed weights; even when weighting mechanisms are introduced, some still lack adaptive learning capability. In challenging tracking scenarios with pronounced target scale variations and cluttered backgrounds, these approaches still tend to suffer from feature redundancy and attenuated responses over the target region.
Therefore, MSAM is proposed based on CBAM. As shown in Figure 2, the module consists of two components: a channel attention submodule and a multi-scale spatial attention submodule. First, the channel attention submodule models the channel dimension of the feature map. By learning the importance weight of each channel, it achieves adaptive enhancement of key semantic channels and suppression of redundant channels. Second, the multi-scale spatial attention submodule performs modeling along the spatial dimensions of the feature maps. It adopts a multi-branch structure to extract spatial features at different scales, fuses the results from each branch, and highlights critical regions through a spatial attention mechanism, thereby balancing global context and local details. Finally, the features are processed within the classification-regression subnetwork of the SiamCAR single-object tracking algorithm to accomplish object tracking.
Although the above spatial attention mechanism has achieved favorable results in feature enhancement, its reliance on single-scale pooling operations limits its capacity to adequately capture saliency variations across different spatial scales. This limitation is particularly pronounced in scenarios involving significant scale variations, complex target morphologies, or substantial background interference, where its performance is consequently constrained. To overcome this issue, a multi-scale spatial-attention submodule is introduced, as shown in Figure 3. It is able to mine salient regional features under different receptive fields and performs adaptive inter-branch fusion through learnable weights, further enhancing the model’s adaptability to scale variations and the discriminative capability of the target region. The structure and workflow are described as follows.
(1)
Multi-scale branch design
The core principle of spatial attention is to compress the input feature map along the channel dimension to derive spatially distributed saliency information. To capture spatial relationships across different scales, multiple parallel branches are designed [26], each employing convolutional kernels of varying sizes (e.g., 3 × 3 and 7 × 7) with stride set to 1 to accommodate the modeling requirements for both local and global spatial information. Specifically, convolutional kernels with smaller receptive fields are well-suited for capturing fine-grained local structures, while kernels with larger receptive fields facilitate the modeling of broader contextual dependencies. Furthermore, a dilated convolution branch [27] is incorporated into the module. By configuring an appropriate dilation rate, this branch further expands the receptive field, thereby effectively enhancing the model’s capacity to adapt to complex spatial patterns.
(2)
Spatial feature encoding
Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, average pooling and max pooling are first performed along the channel dimension to obtain two spatial feature maps $F^{s}_{\mathrm{avg}}$ and $F^{s}_{\mathrm{max}}$, both of size $\mathbb{R}^{1 \times H \times W}$. These operations capture the overall distribution and the local extreme responses, respectively, which facilitates subsequent spatial-attention generation. The two maps are then concatenated along the channel dimension to form a fused feature $F^{s}_{\mathrm{cat}}$:

$$F^{s}_{\mathrm{cat}} = \mathrm{Concat}\left(F^{s}_{\mathrm{avg}}, F^{s}_{\mathrm{max}}\right), \quad F^{s}_{\mathrm{cat}} \in \mathbb{R}^{2 \times H \times W}$$
(3)
Multi-branch spatial attention generation
The concatenated fused feature $F^{s}_{\mathrm{cat}}$ is taken as input and processed by the convolution layer in each branch. Each branch consists of a convolution operation (with a specific kernel size or dilation rate) followed by a Sigmoid activation function: the convolution extracts spatial features, and the Sigmoid normalizes the output into a weight map. For the $i$-th branch (kernel size $k_i$ and dilation rate $d_i$), the spatial-attention map is generated as

$$M_i = \sigma\left(\mathrm{Conv}_{k_i, d_i}\left(F^{s}_{\mathrm{cat}}\right)\right), \quad M_i \in \mathbb{R}^{1 \times H \times W}$$

where $\mathrm{Conv}_{k_i, d_i}$ denotes the convolution operation with the corresponding parameters, and $\sigma(\cdot)$ is the Sigmoid activation function.
(4)
Adaptive branch-weight fusion
To achieve adaptive fusion across branches, a set of learnable weight parameters $\omega \in \mathbb{R}^{K}$ is introduced, where $K$ is the number of branches. Softmax normalization is applied to ensure that the weights sum to 1. The branch-wise spatial-attention maps are then combined via a weighted summation to obtain the final fused spatial-attention map $M_s$:

$$M_s = \sum_{i=1}^{K} \alpha_i M_i$$

where $\alpha_i = \mathrm{Softmax}(\omega)_i$ is the normalized weight for the $i$-th branch. This mechanism allows the model to dynamically adjust the contribution of spatial features at different scales according to task requirements, improving the flexibility and expressive power of feature fusion.
(5)
Spatial attention recalibration
Finally, the fused spatial-attention map $M_s$ is applied to the input feature map via element-wise multiplication to perform spatial recalibration. This operation effectively highlights key regions and suppresses redundant background responses, thereby optimizing the feature distribution and providing more discriminative representations for subsequent tasks. The spatially recalibrated feature map $F_s$ is obtained as

$$F_s = F \odot M_s$$

where $\odot$ denotes element-wise multiplication; a consolidated code sketch of steps (2)-(5) is given below.
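The following PyTorch sketch consolidates steps (2)-(5). The branch configuration (3 × 3, 7 × 7, and a dilated 3 × 3 branch with rate 2) follows the description above, but the exact number of branches and dilation rates in the actual implementation may differ.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialAttention(nn.Module):
    """Sketch of the multi-scale spatial-attention submodule, covering steps (2)-(5) above.

    The branch configuration (3x3, 7x7, and a dilated 3x3 convolution with rate 2)
    is illustrative; the actual number of branches and dilation rates may differ.
    """

    def __init__(self, branches=((3, 1), (7, 1), (3, 2))):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(2, 1, k, padding=d * (k - 1) // 2, dilation=d, bias=False)
            for k, d in branches
        ])
        # Learnable branch weights omega, normalized by Softmax in the forward pass.
        self.omega = nn.Parameter(torch.zeros(len(branches)))

    def forward(self, f):
        # Step (2): channel-wise average and max pooling, concatenated to F_cat (B, 2, H, W).
        f_cat = torch.cat([f.mean(dim=1, keepdim=True), f.amax(dim=1, keepdim=True)], dim=1)
        # Step (3): per-branch spatial-attention maps M_i, each (B, 1, H, W).
        maps = [torch.sigmoid(conv(f_cat)) for conv in self.branches]
        # Step (4): adaptive fusion with Softmax-normalized weights alpha_i.
        alpha = torch.softmax(self.omega, dim=0)
        m_s = sum(a * m for a, m in zip(alpha, maps))
        # Step (5): spatial recalibration of the input feature map.
        return f * m_s

print(MultiScaleSpatialAttention()(torch.randn(1, 256, 25, 25)).shape)  # torch.Size([1, 256, 25, 25])
```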
Compared with existing multi-scale spatial attention mechanisms, the MSAM introduces learnable branch weights to achieve dynamic, adaptive fusion of multi-scale features, enabling the network to automatically adjust the contribution of each branch across different tracking scenarios rather than relying on fixed weighting schemes or naive architectural stacking. In addition, MSAM makes channel attention a prerequisite to spatial attention, first enhancing semantically informative channels and then performing multi-scale spatial modeling. This design establishes an end-to-end coupling between the channel and spatial dimensions and effectively suppresses the propagation of redundant information. Moreover, by synergistically combining parallel multi-branch modeling with learnable weight selection, MSAM alleviates the semantic inconsistencies introduced by straightforward multi-scale feature concatenation or weighted summation. This facilitates effective cooperation among branch-specific representations during fusion, thereby adaptively preserving the scale features that are most discriminative for the target region. With the above design, features at different scales can be dynamically adjusted according to the discriminative requirements of the target region, highlighting critical cues while suppressing redundancy and interference, thereby significantly improving the model’s robustness and discriminative capability in complex scenarios.

3.3. Adaptive Gated Fusion Module

In response to issues such as information redundancy, insufficient adaptivity, and the overlooking of hierarchical differences in multi-level feature fusion, AGFM is proposed in this work. This module performs progressive, level-wise fusion over three stages of backbone features, corresponding to the ResNet-50 outputs of conv3, conv4, and the conv5 feature enhanced by the proposed MSAM, respectively. It further dynamically learns the fusion weights across feature levels to promote information complementarity while suppressing redundancy. Let the input feature at level $l$ be $F^{l}$, and denote the fused output from the previous level as $F^{l-1}_{\mathrm{fusion}}$, with initialization $F^{1}_{\mathrm{fusion}} = F^{1}$. The overall architecture of AGFM is depicted in Figure 4.
The detailed procedure can be divided into three steps.
The first step is feature aggregation. To build a comprehensive fusion context, two inputs are concatenated along the channel dimension: the fused feature from the previous level, $F^{l-1}_{\mathrm{fusion}}$, and the input feature at the current level, $F^{l}$. The concatenation yields the joint feature $F^{l}_{\mathrm{cat}}$:

$$F^{l}_{\mathrm{cat}} = \mathrm{Concat}\left(F^{l-1}_{\mathrm{fusion}}, F^{l}\right)$$

where $F^{l}_{\mathrm{cat}} \in \mathbb{R}^{2C \times H \times W}$, $C$ denotes the number of channels, $H$ and $W$ represent the height and width of the feature map, and $\mathrm{Concat}(\cdot)$ indicates concatenation along the channel dimension. This step aggregates the historical fused representation with the current-level features, thereby providing full contextual information for subsequent dynamic weight generation.
The second step is dynamic weight generation. The joint feature $F^{l}_{\mathrm{cat}}$ is fed into a gating network $G(\cdot)$ to dynamically generate fusion weights $W$. The gating network consists of two $1 \times 1$ convolution layers and a nonlinear activation function, and is formulated as

$$W = \sigma\left(\mathrm{Conv}^{2}_{1 \times 1}\left(\mathrm{ReLU}\left(\mathrm{Conv}^{C}_{1 \times 1}\left(F^{l}_{\mathrm{cat}}\right)\right)\right)\right)$$

where $\mathrm{Conv}^{C}_{1 \times 1}$ denotes a $1 \times 1$ convolution that compresses the channel dimension from $2C$ to $C$, and $\mathrm{Conv}^{2}_{1 \times 1}$ denotes a $1 \times 1$ convolution that further projects the $C$-channel feature to a 2-channel output. Therefore, $W \in \mathbb{R}^{2 \times H \times W}$ and $W = \mathrm{Concat}(W_1, W_2)$. $\sigma$ is the Sigmoid activation function, which constrains the output weights to the range $[0, 1]$.
The final step is adaptive weighted fusion. The two-channel weight map $W$ output by the gating network is split along the channel dimension into $W_1$ and $W_2$. Using channel-wise broadcasting, each weight map is expanded from 1 to $C$ channels, multiplied element-wise with its corresponding input feature, and the results are summed to obtain the final fused feature $F^{l}_{\mathrm{fusion}}$:

$$F^{l}_{\mathrm{fusion}} = F^{l-1}_{\mathrm{fusion}} \odot W_1 + F^{l} \odot W_2$$
This design allows the network to dynamically modulate the relative contributions of historical and newly introduced information at each spatial location, thereby achieving more flexible and accurate feature fusion.
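A minimal sketch of one AGFM fusion step is given below. It assumes the backbone levels have already been projected to a common channel count and spatial size; this projection and the channel count of 256 are illustrative assumptions rather than details specified above.

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Sketch of one AGFM step fusing the previous fused feature with the current level."""

    def __init__(self, channels):
        super().__init__()
        # Gating network: 1x1 conv compressing 2C -> C, ReLU, then 1x1 conv projecting C -> 2.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),
        )

    def forward(self, f_prev, f_cur):
        # Step 1: aggregate the historical fused feature and the current-level feature.
        f_cat = torch.cat([f_prev, f_cur], dim=1)            # (B, 2C, H, W)
        # Step 2: generate per-location fusion weights W = [W1, W2] in [0, 1].
        w = torch.sigmoid(self.gate(f_cat))                  # (B, 2, H, W)
        w1, w2 = w[:, 0:1], w[:, 1:2]
        # Step 3: adaptive weighted fusion with channel-wise broadcasting.
        return f_prev * w1 + f_cur * w2

# Progressive fusion over three levels (e.g., conv3, conv4, and the MSAM-enhanced conv5),
# assuming all levels have already been projected to a common channel count and resolution.
f3, f4, f5 = (torch.randn(1, 256, 25, 25) for _ in range(3))
agfm = AdaptiveGatedFusion(256)
print(agfm(agfm(f3, f4), f5).shape)  # torch.Size([1, 256, 25, 25])
```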
Through the above gating mechanism, the module can not only adaptively select and strengthen useful features, but also effectively suppress redundant and conflicting information, fully exploiting the complementary strengths of multi-level features [28]. The learnability of gating weights allows the model to automatically adjust its fusion strategy according to real-world scenarios, which is also one of the core advantages of gating mechanisms such as SENet [29].

4. Experiments and Results

4.1. Experimental Setup

To validate the effectiveness of the proposed single-object tracking model, SiamCAR is selected as the baseline. All experiments are conducted on Ubuntu 18.04. The GPU is an NVIDIA GeForce RTX 3090 with 24 GB memory, and the CPU is a 36-core Intel i9-10980XE. The implementation is based on PyTorch 1.13.1 with Python 3.8 and CUDA 11.7, as shown in Table 1.

4.2. Training Parameters and Strategy Settings

To ensure stable and effective training, we adopt SiamCAR as the baseline and configure the training hyperparameters as follows. The model is trained for 20 epochs with a batch size of 128. The initial learning rate is set to 0.001, warmed up during the first five epochs and then decayed logarithmically to 0.0005. We use the SGD optimizer with momentum 0.9 and weight decay $5 \times 10^{-4}$. Gradient clipping is applied with a threshold of 3.0 to improve training stability. Input images are normalized and resized to a unified resolution. Data augmentation includes translation, scaling, and color jittering for template images, with additional Gaussian blur for search images. Model parameters are initialized using Xavier uniform initialization (bias terms set to 0.01). To assess result stability, all experiments are repeated five times with fixed random seeds across Python, NumPy, and PyTorch. The reported mean, standard deviation, and 95% confidence intervals are calculated over these five runs. The confidence intervals are computed as mean $\pm\, t_{0.025,4} \times SD/\sqrt{5}$, where $t_{0.025,4} = 2.776$ is the critical value of the t-distribution with 4 degrees of freedom. A summary of the training settings is provided in Table 2.
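The interval computation can be reproduced with a few lines of NumPy; the five success-rate values below are placeholders used only to illustrate the formula, not measured results.

```python
import numpy as np

# Illustrative computation of the reported mean, SD, and 95% confidence interval
# over five repeated runs; the values below are placeholders, not measured results.
runs = np.array([0.651, 0.654, 0.652, 0.655, 0.653])

mean = runs.mean()
sd = runs.std(ddof=1)          # sample standard deviation over the five runs
t_crit = 2.776                 # t critical value for 4 degrees of freedom (alpha = 0.05)
half_width = t_crit * sd / np.sqrt(len(runs))

print(f"mean = {mean:.4f}, SD = {sd:.4f}, "
      f"95% CI = [{mean - half_width:.4f}, {mean + half_width:.4f}]")
```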

4.3. Datasets and Evaluation Metrics

In single-object tracking, the choice of datasets and evaluation metrics is crucial for algorithm development and performance assessment. To comprehensively verify the effectiveness of the proposed method, the model is trained on the GOT-10K dataset and tested on the OTB100 and UAV123 datasets. The characteristics of these datasets and the commonly used evaluation metrics in single-object tracking are described in this section.
GOT-10K (Generic Object Tracking Benchmark) [30] is one of the largest and most diverse datasets used in single-object tracking. It contains over 10,000 video sequences, covering 563 object classes and 87 motion classes, which greatly enriches the diversity of tracking scenarios. All sequences are captured in real-world environments and include various challenging factors such as background clutters, occlusion, motion blur, and scale variation.
OTB100 (Object Tracking Benchmark 100) [31] is one of the earliest and most widely used benchmarks in object tracking. It consists of 100 video sequences covering a wide range of scenes and target types. The benchmark defines a set of 11 challenge attributes for systematic evaluation: occlusion (OCC), illumination variation (IV), scale variation (SV), fast motion (FM), background clutters (BC), deformation (DEF), motion blur (MB), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), and low resolution (LR). These attributes provide a standardized platform for performance comparison and analysis.
UAV123 (Unmanned Aerial Vehicle 123) [32] focuses on object tracking from aerial UAV viewpoints. It contains 123 high-definition video sequences covering various objects such as vehicles, pedestrians, and boats. The benchmark defines a set of 12 challenge attributes: illumination variation (IV), scale variation (SV), partial occlusion (POC), full occlusion (FOC), out-of-view (OV), fast motion (FM), camera motion (CM), background clutters (BC), similar object (SOB), aspect ratio change (ARC), viewpoint change (VC), and low resolution (LR). Owing to this structured attribute annotation, UAV123 serves as a critical benchmark for assessing and comparing the robustness of tracking algorithms in UAV-based application scenarios.
To scientifically evaluate tracking performance, success rate (SR) and precision rate (PR) are used as evaluation metrics [33].
Success rate measures tracking accuracy by computing the Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box:
$$\mathrm{IoU} = \frac{|G \cap P|}{|G \cup P|}$$
where $G$ denotes the ground-truth bounding box and $P$ denotes the predicted bounding box produced by the tracker. IoU ranges from 0 to 1. A frame is considered successfully tracked when its IoU exceeds a threshold $\tau$. To avoid the randomness introduced by using a single threshold, the success plot is computed over a set of thresholds $\tau \in [0, 1]$, and the area under the curve (AUC) is used as the final success score. A larger AUC indicates that the model performs consistently across the full range of overlap thresholds.
Precision rate measures localization accuracy by computing the Euclidean distance between the center of the predicted bounding box and that of the ground-truth bounding box:
$$d = \sqrt{\left(x_{\mathrm{pred}} - x_{\mathrm{gt}}\right)^2 + \left(y_{\mathrm{pred}} - y_{\mathrm{gt}}\right)^2}$$
where $(x_{\mathrm{pred}}, y_{\mathrm{pred}})$ are the predicted center coordinates and $(x_{\mathrm{gt}}, y_{\mathrm{gt}})$ are the ground-truth center coordinates. A frame is considered precisely localized when the center distance $d$ is smaller than a threshold $\tau$ (typically 20 pixels).
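For clarity, the two metrics can be computed from per-frame bounding boxes as sketched below; the 21-point threshold grid for the success AUC is an assumption, as benchmark toolkits may use a finer grid.

```python
import numpy as np

def iou(box_gt, box_pred):
    """IoU between two axis-aligned boxes given as (x, y, w, h)."""
    xg, yg, wg, hg = box_gt
    xp, yp, wp, hp = box_pred
    iw = max(0.0, min(xg + wg, xp + wp) - max(xg, xp))
    ih = max(0.0, min(yg + hg, yp + hp) - max(yg, yp))
    inter = iw * ih
    union = wg * hg + wp * hp - inter
    return inter / union if union > 0 else 0.0

def center_error(box_gt, box_pred):
    """Euclidean distance between box centers, in pixels."""
    dx = (box_gt[0] + box_gt[2] / 2.0) - (box_pred[0] + box_pred[2] / 2.0)
    dy = (box_gt[1] + box_gt[3] / 2.0) - (box_pred[1] + box_pred[3] / 2.0)
    return float(np.hypot(dx, dy))

def success_auc(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    """Success score: fraction of frames with IoU above tau, averaged over thresholds (AUC)."""
    ious = np.asarray(ious)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision_at(errors, tau=20.0):
    """Precision: fraction of frames whose center error is below tau pixels."""
    return float((np.asarray(errors) < tau).mean())

# Toy example on three frames.
print(success_auc([0.72, 0.55, 0.10]), precision_at([4.0, 11.0, 35.0]))
```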
Parameters: The total number of learnable parameters in the model, reported in millions (M). This metric directly reflects the model’s storage requirement and structural complexity.
GFLOPs: The total number of floating-point operations required for a single forward pass, reported in billions of FLOPs. It is commonly used to quantify computational complexity and inference cost.
Frames Per Second (FPS): The number of image frames the model can process per second, used to evaluate real-time inference capability.
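Parameter count and FPS can be measured directly in PyTorch as sketched below; the stand-in network and input size are illustrative, and GFLOPs are typically obtained with a profiling tool such as thop or fvcore rather than computed by hand.

```python
import time
import torch
import torch.nn as nn

def count_parameters_m(model):
    """Number of learnable parameters, in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, dummy_input, warmup=20, iters=100):
    """Average frames per second for single-image forward passes."""
    model.eval()
    for _ in range(warmup):
        model(dummy_input)
    if dummy_input.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(dummy_input)
    if dummy_input.is_cuda:
        torch.cuda.synchronize()
    return iters / (time.time() - start)

# Stand-in network and input size for illustration; a real tracker consumes a template/search pair.
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 64, 3, padding=1))
x = torch.randn(1, 3, 255, 255)
print(f"{count_parameters_m(net):.2f} M parameters, {measure_fps(net, x):.1f} FPS")
```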

4.4. Ablation Studies

To further validate the effectiveness of the proposed modules for object tracking, ablation studies are conducted on the OTB100 and UAV123 datasets. The goal of the ablation experiments is to evaluate the contribution of each key component to the overall performance and to analyze how different modules affect tracking accuracy and robustness. SiamCAR is adopted as the baseline. Both the proposed multi-scale attention module and Adaptive Gated Fusion Module (AGFM) are integrated individually and jointly to construct multiple model variants for systematic ablation analysis. The experimental configurations are detailed as follows:
(1)
Model A. As a control, the original SiamCAR framework is used as the baseline for subsequent comparisons.
(2)
Model B. The AGFM is applied at the feature fusion stage to enhance feature fusion quality.
(3)
Model C. A multi-scale attention module is added to the backbone network to improve feature representation capability.
(4)
Model D. Both the multi-scale attention module and the AGFM are integrated, forming the complete model proposed in this paper.
As shown in Table 3, the progressive integration of the multi-scale attention module and the AGFM leads to consistent and substantial improvements in overall performance. After incorporating the AGFM, Model B attains an SR of 0.640 and a PR of 0.866 on OTB100, and an SR of 0.582 with a PR of 0.797 on UAV123. Meanwhile, the model contains 51.45 M parameters and requires 59.39 GFLOPs per forward pass, while maintaining an inference speed of over 83 FPS. These results suggest that dynamic fusion and selection of multi-level features improve localization accuracy and robustness without introducing a substantial computational burden. Model C further raises the SR to 0.649 and the PR to 0.865 on OTB100, while achieving an SR of 0.586 and a PR of 0.790 on UAV123. In terms of efficiency, Model C has 51.91 M parameters with virtually unchanged GFLOPs and maintains an inference speed of over 82 FPS, indicating that the enhanced representational capacity stems from efficient multi-scale attention modeling rather than redundant computation. By integrating both modules, Model D delivers the best overall performance, with SR and PR improvements of 0.019 and 0.033 on OTB100, and of 0.029 and 0.035 on UAV123. The parameter count is 51.97 M (a total increase of 1.15%), the GFLOPs are 59.4 (a total increase of 0.08%), and the model still maintains real-time tracking capability above 81 FPS. It is worth noting that although Model D's SR gain over Model C on OTB100 is only 0.004, the results across both datasets show that its SR improvement of 0.009 on UAV123 is noticeably larger than the corresponding gain on OTB100, and the PR metric achieves stable gains on both datasets, with improvements of 0.019 on OTB100 and 0.023 on UAV123. This consistent improvement across datasets and metrics confirms that the two modules provide complementary gains: the multi-scale discriminative features enhanced by MSAM provide a higher-quality input for AGFM's dynamic fusion, while AGFM's adaptive selection mechanism further amplifies the contribution of key features. Together, they realize a joint optimization of feature enhancement and precise fusion rather than a simple superposition of individual gains. Overall, the experimental results demonstrate that MSAM and AGFM achieve a good balance between performance improvement and computational cost through efficient feature enhancement and fusion design.
As shown in Figure 5 and Figure 6, on both the OTB100 and UAV123 datasets, all model variants exhibit consistent trends in success rate and precision rate as the evaluation threshold varies. Specifically, as the overlap threshold gradually increases, the success rates of all models decrease; however, the complete Model D demonstrates superior performance across the vast majority of threshold intervals, with its success rate curve positioned above those of other comparative models overall. This indicates that Model D provides more stable target location predictions and demonstrates stronger robustness in challenging scenarios such as occlusion and scale variation. Meanwhile, in the precision plots, as the localization error threshold increases, the precision rates of all models improve. The complete Model D also exhibits a measurable performance gain, with the improvement being particularly pronounced in the high-threshold regime. Notably, it performs slightly better than the variants that incorporate only a single module. The consistent improvements across multiple metrics and datasets suggest that the two modules provide complementary benefits rather than redundant contributions.
To more intuitively illustrate the performance differences among tracking algorithms under various challenge attributes, we further analyze the experimental results through visualization. Three video sequences, Coupon, Liquor, and Vase, are selected from the OTB100 dataset; and three sequences, car5, person20, and truck1, are chosen from the UAV123 dataset. Both the baseline tracker and the proposed method are applied to these videos, and the tracking bounding boxes of each tracker are overlaid on the video frames to enable a more direct comparison of tracking quality. In the visualizations, the magenta boxes denote the baseline method, while the green boxes denote the proposed method.
As shown in Figure 7, the improved model demonstrates more accurate and stable tracking performance across different sequences. In the Coupon sequence, despite challenging conditions such as occlusion and fast motion, the proposed method can still track the target steadily, showing strong robustness. In the Liquor sequence, the object moves rapidly, causing the baseline model to fail in tracking; in contrast, the improved model can accurately capture and continuously track the target, significantly improving tracking success. In the Vase sequence, which is characterized by challenging factors such as background clutters, the baseline model suffers from insufficient localization accuracy, whereas the improved model can effectively distinguish the target from the background and achieve more precise tracking. The 4.4% improvement under the BC attribute reflects the average gain across all sequences annotated with BC; however, since this subset includes a substantial number of relatively easy sequences that the baseline already handles well, the performance gains on more difficult cases are partially diluted. As shown in Figure 8, the car5, person20, and truck1 sequences exhibit pronounced viewpoint changes. In such scenarios, the baseline model often produces inaccurate tracking results or even loses the target, while the improved model adapts better to viewpoint change and tracks the target more accurately and consistently. Therefore, under complex conditions such as fast motion, occlusion, background clutters, and viewpoint change, the improved model consistently achieves higher tracking accuracy and robustness, which fully validates the effectiveness and broad applicability of the proposed method for achieving stable object tracking in complex visual scenarios.
Further analysis indicates that the multi-scale attention module can capture salient target regions at different spatial scales, thereby improving the model’s adaptability to scale variation and background clutters. The adaptive gated fusion module dynamically adjusts the fusion weights to effectively select informative features while suppressing redundant and conflicting information, which strengthens feature complementarity. Working together, these two modules substantially enhance the model’s feature representation capability and fusion flexibility, providing a solid technical foundation for object tracking in complex scenarios.

4.5. Comparative Experiments

To validate the performance of the proposed method for object tracking, we select several representative Siamese-network-based trackers, including SiamFC, SiamBAN, SiamRPN, SiamRPN++, and SiamCAR. All methods are trained on the GOT-10K dataset, and their tracking performance is evaluated on the OTB100 and UAV123 datasets. The experimental results, detailed in Table 4, demonstrate the superiority of our method. On the OTB100 dataset, it attains an SR of 0.653 and a PR of 0.884; on UAV123, it further reaches an SR of 0.595 and a PR of 0.813. Under the same training settings, the proposed method exhibits a pronounced performance advantage over the SiamCAR baseline as well as other mainstream Siamese trackers. These results suggest that the proposed MSAM and AGFM effectively refine feature representation and fusion in complex scenarios, enabling the tracker to maintain robust object tracking while further improving bounding-box localization accuracy. In terms of efficiency, the proposed method runs slower than lightweight baselines such as SiamFC and SiamRPN, but delivers more substantial performance gains. Compared with more complex models, our approach maintains competitive and stable success and precision while reducing the number of parameters and GFLOPs. Overall, the proposed method achieves strong tracking performance while preserving a lightweight design, thereby striking an effective balance between accuracy and computational cost. This trade-off makes it a practical solution for the engineering deployment of Siamese-network-based tracking algorithms.
As shown in Figure 9 and Figure 10, on both the OTB100 and UAV123 datasets, all models exhibit consistent trends in success rate and precision rate as the evaluation thresholds vary. Specifically, as the overlap threshold increases, the success rates of all methods gradually decrease.
However, within the selected range of thresholds, the proposed method achieves superior tracking performance across the majority of evaluation thresholds. The precision plot shows a similar, consistent advantage, indicating that the proposed improvements positively contribute to enhanced target localization accuracy. The experimental results indicate that our approach exhibits better generalization capability and stability in complex environments.
To further analyze the performance of the proposed tracker, we investigate the success rate and precision rate under 11 challenge attributes on the OTB100 dataset and 12 challenge attributes on the UAV123 dataset, and compare our method with several representative Siamese-network-based trackers.
As reported in Table 5, our method achieves superior success rates compared to the baseline in the majority of the 11 OTB100 challenge attributes, specifically including OCC, IV, SV, FM, BC, DEF, MB, IPR, OPR and OV. Notably, the gains are most pronounced under deformation and background clutters, where the success rates are 4.5% and 4.4% higher than those of SiamCAR, respectively. This indicates that the proposed method has clear advantages in handling target deformation and interference from complex backgrounds. In addition, it performs strongly in challenging conditions such as motion blur and illumination variation, demonstrating strong environmental adaptability. Furthermore, the relationship between success rate and overlap threshold for different trackers under these 11 challenge attributes is shown in Figure 11. This figure provides a detailed comparison of performance trends and differences as the overlap threshold varies across different challenge attributes on OTB100.
As shown in Table 6, the proposed method achieves improvements in both success rate and precision across most of the 12 challenge attributes on the UAV123 dataset, including IV, SV, POC, FOC, OV, FM, CM, BC, SOB, ARC and VC. Notably, the largest gains in success rate are observed under illumination variation and viewpoint change, where our method outperforms SiamCAR by 4.7% and 3.7%, respectively. These results indicate that the proposed tracker has clear advantages in handling illumination and viewpoint variations. However, under the low-resolution (LR) attribute, the success rate is marginally lower than that of the baseline, indicating a slight limitation when target texture information is severely degraded. This degradation attenuates effective target features and increases the difficulty of extracting discriminative representations, thereby leading to a modest performance drop. These observations also point to a clear direction for subsequent, targeted improvements. Furthermore, the relationship between success rate and overlap threshold for different trackers under these 12 challenge attributes is illustrated in Figure 12. This figure provides a detailed comparison of performance trends and differences as the overlap threshold varies across different challenge attributes on UAV123.
As shown in Table 7, the proposed method achieves higher precision under most of the 11 challenge attributes on the OTB100 dataset, including OCC, IV, SV, FM, BC, DEF, MB, IPR, OPR and LR. Notably, the improvements are particularly significant under deformation and background clutters, where the precision is 5.9% and 4.8% higher than that of SiamCAR, respectively. This suggests that our method not only tracks the target reliably, but also localizes the target boundaries more accurately. Furthermore, the relationship between precision and the location error threshold for different trackers under these 11 challenge attributes is illustrated in Figure 13. This figure provides a detailed comparison of performance trends and differences as the location error threshold varies across different challenge attributes on OTB100. Overall, the proposed approach demonstrates stronger robustness and adaptability in a wide range of challenging scenarios, effectively mitigating the susceptibility of traditional methods to disturbances in extreme environments.
As shown in Table 8, the proposed method achieves higher precision across most of the 12 challenge attributes on the UAV123 dataset, including IV, SV, POC, FOC, OV, FM, CM, BC, SOB, ARC and VC. Notably, the gains are particularly significant under illumination variation and viewpoint change, where the precision is 4.7% and 4.5% higher than that of SiamCAR, respectively. Furthermore, the relationship between precision and the location error threshold for different trackers under these 12 challenge attributes is illustrated in Figure 14. This figure provides a detailed comparison of performance trends and differences as the location error threshold varies across different challenge attributes on UAV123.
To provide a more intuitive visualization of the performance differences among tracking algorithms under various challenge attributes, we further analyze the experimental results through qualitative visualization. As shown in Table 9, three video sequences from the OTB100 dataset are selected, Basketball, Box, and Suv, each containing multiple challenges. As illustrated in Figure 15, the predicted bounding boxes of different trackers are overlaid on the video frames to enable a direct visual inspection of tracking quality. In Figure 15, the black bounding box denotes the proposed method; SiamFC is indicated by a red box, SiamBAN by a green box, SiamRPN by a blue box, SiamRPN++ by a bright cyan box, and SiamCAR by a magenta box. For the Basketball sequence, when the target undergoes occlusion or abrupt motion, noticeable differences can be observed among the predicted boxes of different methods. Only SiamRPN fails to track the target, while the other trackers remain successful. For the Box sequence, under motion blur or occlusion, only the proposed method is able to track the target consistently. The other trackers fail to identify the correct target and exhibit varying degrees of drift or inaccurate localization, which eventually leads to tracking failure. For the Suv sequence, during rapid motion, only the proposed tracker and SiamCAR can track the target correctly, whereas the remaining trackers suffer from inaccurate localization.
As presented in Table 10, three video sequences, specifically bike2, building1, and wakeboard3, are selected from the UAV123 dataset, each encompassing multiple challenging conditions.
As illustrated in Figure 16, the tracking bounding boxes of different trackers are visualized on video frames, enabling a more intuitive comparison of their tracking performance. In Figure 16, the black bounding box indicates the proposed method; SiamFC is indicated by a solid red box, SiamBAN by a green box, SiamRPN by a blue box, SiamRPN++ by a bright cyan box, and SiamCAR by a magenta box. For the bike2 sequence, when the target undergoes scale variation, aspect ratio change, fast motion, illumination variation, viewpoint change, camera motion, or the presence of similar objects, the predicted bounding boxes differ noticeably across methods.
Only the proposed method and SiamCAR successfully track the target, while the other trackers fail due to distraction from similar objects. For the building1 sequence, under scale variation, low resolution, full occlusion, or viewpoint change, all trackers are able to locate the correct target; however, the competing trackers exhibit inaccurate localization to varying degrees. For the wakeboard3 sequence, during viewpoint change, all trackers can follow the target, but the competing trackers still suffer from inaccurate localization.

5. Discussion

The effectiveness of MSAM and AGFM has been validated in the preceding experiments. This section provides an in-depth discussion from three perspectives: feature fusion mechanisms, typical scenario analysis, and limitations.

5.1. Dynamic Adaptation Mechanism of the MSAM

Traditional Siamese trackers are constrained by a fixed receptive field and are unable to preserve multi-scale information simultaneously during feature extraction. When the target scale variation exceeds the training distribution, a single receptive field can cause the target representation in the feature space to shift, subsequently leading to tracking drift. By integrating parallel branches with different receptive fields, MSAM constructs a multi-scale feature space at the feature extraction stage, enabling the network to perceive both local texture and global semantics simultaneously. The core function of the attention mechanism lies in dynamically selecting the most adaptive feature subspace based on the current scale state of the target, thereby achieving scale invariance at the feature level.
Taking the 174th and 414th frames of the Box sequence in Figure 15 as an example, the effectiveness of this mechanism is illustrated. In the 174th frame, the box moves to the foreground. Here, the large receptive field branch of MSAM captures its global contour, while the small receptive field branch preserves local details. The attention mechanism dynamically adjusts the weight distribution, significantly improving the tightness of the tracking bounding box. In the 414th frame, the box is located in the background, appearing small in scale and cluttered by background noise. Constrained by a fixed receptive field, traditional single-scale trackers fail to capture the surface texture details of the box, resulting in low tracking response scores and a tendency for slight drift. In contrast, for MSAM, the small receptive field branch captures local details such as edges and surface texture, while the large receptive field branch captures the surrounding background context to distinguish the target from the background. The attention mechanism adaptively allocates weights, effectively enhancing the accuracy of the tracking response and preventing drift. This dynamic weight allocation mechanism enables MSAM to achieve a notable performance improvement specifically on the scale variation attribute.

5.2. Suppressing Feature Redundancy with the AGFM

To address the semantic conflicts and background redundancy introduced by traditional static fusion strategies, AGFM employs a gating mechanism to generate independent fusion weights for each spatial location, which essentially transforms feature fusion from a global operation into a local adaptive decision-making process. In target regions, the gating signal enhances the consistency of multi-level features; in background regions, it suppresses responses across different levels to mitigate the risk of false detection. In the 645th frame of the Basketball sequence illustrated in Figure 15, the athlete is situated in a cluttered background, where redundant elements such as spectators and court sidelines interfere with target localization. Traditional static fusion methods, which are unable to filter out such redundancy, are prone to false detections. In contrast, AGFM, via its spatial gating mechanism, substantially suppresses responses in background regions while enhancing features in the target region. This effectively improves the discriminability between the target and the background, thereby enhancing tracking precision.

5.3. Limitation Analysis

On the low-resolution attribute subset of UAV123, the proposed method achieves a success score (0.431) slightly lower than that of the SiamCAR baseline (0.442). This performance gap is primarily attributed to the dependence of the attention mechanism on the feature signal-to-noise ratio. When image textures are severely degraded, the weight learning process becomes susceptible to noise interference, diminishing the effectiveness of dynamic feature selection. In contrast, the simple concatenation strategy employed by the baseline model exhibits a degree of statistical robustness under extreme blurring conditions. Nevertheless, the proposed method maintains a leading performance on the remaining challenging attributes and overall metrics, achieving a favorable balance between robustness and overall effectiveness in general tracking scenarios.

6. Conclusions

To address the limited representational capacity of feature embeddings and the lack of adaptivity in multi-level fusion in Siamese trackers, MSAM and AGFM are proposed. Specifically, MSAM performs scale-adaptive modeling during feature extraction by leveraging parallel receptive-field branches coupled with dynamic weight allocation. Meanwhile, AGFM introduces a spatial gating mechanism to enable locally adaptive selection during feature fusion. Working in concert, these two modules transform the conventional static processing pipeline into a dynamic, adaptive decision-making paradigm. Moreover, the proposed design exhibits generality and can be readily adapted to alternative Siamese tracking frameworks.
Experimental results demonstrate that the proposed method achieves a success score of 65.3% and a precision score of 88.4% on OTB100, outperforming the SiamCAR baseline by 1.9% and 3.3%, respectively. On UAV123, it attains a success score of 59.5% and a precision score of 81.3%, yielding improvements of 2.9% and 3.5% over the baseline, respectively. Across most challenging attributes, such as SV, BC, and FM, the proposed approach exhibits consistent and substantial gains. However, under scenarios with severe texture degradation (e.g., low-resolution targets), the performance improvement remains limited, suggesting that attention-based mechanisms are inherently dependent on the signal-to-noise ratio of the underlying features.
In the future, efforts can be devoted to module lightweighting and inference efficiency optimization. For example, structural re-parameterization, pruning or more efficient implementations of convolution and attention can be adopted to reduce the parameter count and computational overhead, thereby facilitating deployment in resource-constrained settings.

Author Contributions

Conceptualization, D.Z.; methodology, D.Z.; software, D.Z. and H.L.; validation, H.L.; formal analysis, Y.L.; investigation, H.L. and Y.L.; resources, D.Z. and Y.L.; data curation, H.L. and Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, H.L. and Y.L.; visualization, H.L.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Joint Program Project of the Liaoning Provincial Department of Science and Technology in 2025, grant number 2025JH2/101800311.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Srinivasan, K. Object tracking using optimized Dual interactive Wasserstein generative adversarial network from surveillance video. Knowl.-Based Syst. 2025, 311, 113084.
2. Elassy, M.; Al-Hattab, M.; Takruri, M.; Badawi, S. Intelligent transportation systems for sustainable smart cities. Transp. Eng. 2024, 16, 100252.
3. Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17853–17862.
4. Abdelaziz, O.; Shehata, M.; Mohamed, M. Beyond traditional visual object tracking: A survey. Int. J. Mach. Learn. Cybern. 2025, 16, 1435–1460.
5. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550.
6. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596.
7. Danelljan, M.; Häger, G.; Khan, F.; Felsberg, M. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference, Nottingham, UK, 1–5 September 2014.
8. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646.
9. Li, S.; Yang, X.; Wang, X.; Zeng, D.; Ye, H.; Zhao, Q. Learning target-aware vision transformers for real-time UAV tracking. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18.
10. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135.
11. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457.
12. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618.
13. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 341–357.
14. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional Siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 850–865.
15. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with Siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
16. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291.
17. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677.
18. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277.
19. Meibodi, F.A.; Alijani, S.; Najjaran, H. A Deep Dive into Generic Object Tracking: A Survey. arXiv 2025, arXiv:2507.23251.
20. Wang, Q.; Zhou, L.; Xu, C.; Shang, Y.; Jin, P.; Cao, C.; Shen, T. Progress and Perspectives on UAV Visual Object Tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20214–20239.
21. Shajeena, J.; Shiny, R.M.; Palas, P.B.; Vespa, M.M.; Stanley, B.F.; Kumar, R.J.R. Siamese Deep Q-Learning Based Online Correlation Filter Adaptation for Visual Object Tracking in Complex Scenarios. Circuits Syst. Signal Process. 2025, 44, 6913–6956.
22. Dong, X.; Shen, J.; Porikli, F.; Luo, J.; Shao, L. Adaptive Siamese Tracking with a Compact Latent Network. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8049–8062.
  22. Dong, X.; Shen, J.; Porikli, F.; Luo, J.; Shao, L. Adaptive Siamese Tracking with a Compact Latent Network. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8049–8062. [Google Scholar] [CrossRef] [PubMed]
  23. Ashikuzzaman, M.; Héroux, A.; Tang, A.; Cloutier, G.; Rivaz, H. Displacement tracking techniques in ultrasound elastography: From cross correlation to deep learning. IEEE Trans. Ultrason. Ferroelectr. Freq. Control. 2024, 71, 842–871. [Google Scholar] [CrossRef] [PubMed]
  24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  25. Zhang, W.; Li, X.; Liu, X.; Lu, S.; Tang, H. Facing challenges: A survey of object tracking. Digit. Signal Process. 2025, 161, 105082. [Google Scholar] [CrossRef]
  26. Wu, Y.; Lin, Y.; Xu, T.; Kang, T.; Meng, X.; Liu, H. Multi-scale feature integration and spatial attention for accurate lesion segmentation. In Proceedings of the 2025 6th International Conference on Electronic Communication and Artificial Intelligence, Chengdu, China, 20–22 June 2025; pp. 736–740. [Google Scholar]
  27. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  28. Xiao, H.; Zhang, W.; Zuo, L.; Wen, L.; Li, Q.; Li, X. WFF-Net: Trainable weight feature fusion convolutional neural networks for surface defect detection. Adv. Eng. Inform. 2025, 64, 103073. [Google Scholar] [CrossRef]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  31. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
  32. Diana, B.A.; Rail, R.M.; Boris, V.V. Object localization for subsequent UAV tracking. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, XLII-2, 9–14. [Google Scholar]
  33. Sun, N.; Zhao, J.; Shi, Q.; Liu, C.; Liu, P. Moving target tracking by unmanned aerial vehicle: A survey and taxonomy. IEEE Trans. Ind. Inform. 2024, 20, 7056–7068. [Google Scholar] [CrossRef]
Figure 1. Architecture of the improved SiamCAR algorithm. The network consists of a Siamese subnetwork (left) and a classification-regression subnetwork (right). The regression branch predicts the L (left), T (top), R (right), and B (bottom) distances for anchor-free bounding-box regression.
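To make the anchor-free regression in Figure 1 concrete: each spatial location on the response map predicts its distances to the four sides of the target box, from which the box can be decoded as below. This is a generic sketch of the L/T/R/B parameterization; the names and the example numbers are illustrative only.

```python
def decode_ltrb(cx, cy, l, t, r, b):
    """Decode an axis-aligned box from a location (cx, cy) and its predicted
    distances to the left, top, right, and bottom sides of the target."""
    x1, y1 = cx - l, cy - t
    x2, y2 = cx + r, cy + b
    return x1, y1, x2 - x1, y2 - y1   # (x, y, w, h)

# Example: a location at (120, 95) predicting distances (30, 20, 34, 26)
print(decode_ltrb(120, 95, 30, 20, 34, 26))   # -> (90, 75, 64, 46)
```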
Figure 2. Illustration of the multi-scale attention module: the module sequentially applies channel attention and multi-scale spatial attention, with each stage refining the input feature through element-wise multiplication.
Figure 3. Multi-scale spatial attention submodule.
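The captions of Figures 2 and 3 only summarize the module, so the following is a hedged sketch of one way a channel-attention stage followed by a multi-scale spatial-attention stage can be composed, with element-wise multiplication used for refinement. The kernel sizes, branch count, and reduction ratio here are assumptions for illustration, not the exact configuration of the proposed module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):          # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)                    # (B, C, 1, 1) channel weights

class MultiScaleSpatialAttention(nn.Module):
    def __init__(self, kernel_sizes=(3, 5, 7)):           # kernel sizes are assumptions
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(2, 1, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)
        return torch.sigmoid(sum(branch(pooled) for branch in self.branches))

class MultiScaleAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = MultiScaleSpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)        # channel refinement via element-wise multiplication
        return x * self.sa(x)     # multi-scale spatial refinement

feat = torch.randn(2, 256, 25, 25)
print(MultiScaleAttention(256)(feat).shape)   # torch.Size([2, 256, 25, 25])
```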
Figure 4. The structure of the adaptive gated fusion module.
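Since Figure 4 is only named here, the sketch below shows one common form of adaptive gated fusion, in which a learned sigmoid gate weighs two feature maps before they are summed. It is an illustrative realization under that assumption, not the exact structure of the proposed module.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse two same-shape feature maps with a spatially adaptive, learned gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, a, b):
        g = self.gate(torch.cat([a, b], dim=1))   # per-position, per-channel weights in [0, 1]
        return g * a + (1 - g) * b

shallow, deep = torch.randn(2, 256, 25, 25), torch.randn(2, 256, 25, 25)
print(GatedFusion(256)(shallow, deep).shape)      # torch.Size([2, 256, 25, 25])
```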
Figure 5. Ablation study results on the OTB100 dataset.
Figure 6. Ablation study results on the UAV123 dataset.
Figure 7. Comparison of tracking results on different sequences in OTB100. The top-left corner indicates the frame number.
Figure 8. Comparison of tracking results on different sequences in UAV123. The top-left corner indicates the frame number.
Figure 9. Success and precision plots of different models on OTB100.
Figure 10. Success and precision plots of different models on UAV123.
Figure 11. Success plots under OTB100 challenge attributes.
Figure 12. Success plots under UAV123 challenge attributes.
Figure 13. Precision plots under OTB100 challenge attributes.
Figure 14. Precision plots under UAV123 challenge attributes.
Figure 15. Visualization results of object tracking on the OTB100 dataset. The top-left corner indicates the frame number.
Figure 16. Visualization results of object tracking on the UAV123 dataset. The top-left corner indicates the frame number.
Table 1. Experimental Environment Settings.
Parameter | Value
OS | Ubuntu 18.04
GPU | NVIDIA GeForce RTX 3090 (24 GB)
CPU | Intel i9-10980XE
PyTorch | 1.13.1
Python | 3.8
CUDA | 11.7
Table 2. Training Hyperparameter Settings.
Parameter | Value
Epochs | 20
Batch size | 128
Learning rate (initial) | 0.001
Warm-up | 5 epochs
Learning rate (final) | 0.0005
Optimizer | SGD
Momentum | 0.9
Weight decay | 0.0005
Gradient clipping | 3
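The settings in Table 2 can be wired together in PyTorch roughly as follows. The warm-up and decay shapes are assumptions (the table only fixes the endpoints), and the placeholder model stands in for any trainable tracker.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3)   # placeholder for the tracker being trained
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)

def lr_factor(epoch, warmup=5, total=20, final_ratio=0.5):
    """Linear warm-up for 5 epochs, then linear decay toward the final LR (0.0005)."""
    if epoch < warmup:
        return (epoch + 1) / warmup
    progress = (epoch - warmup) / max(total - warmup, 1)
    return 1.0 - (1.0 - final_ratio) * progress

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

for epoch in range(20):
    # ... in practice, iterate over batches of 128 template/search pairs ...
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 8, 8)).sum()   # stand-in loss for illustration
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3)   # gradient clipping = 3
    optimizer.step()
    scheduler.step()
```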
Table 3. Performance comparison of ablation study results across different model variants. Bold values indicate the method used in this paper.
Dataset | Method | SR (Mean ± SD) [95% CI] | PR (Mean ± SD) [95% CI] | Parameters (M) | GFLOPs | FPS
OTB100 | A | 0.634 ± 0.016 [0.614, 0.654] | 0.851 ± 0.014 [0.834, 0.868] | 51.38 | 59.35 | 85.36
OTB100 | B | 0.640 ± 0.018 [0.618, 0.662] | 0.866 ± 0.015 [0.847, 0.885] | 51.45 | 59.39 | 83.48
OTB100 | C | 0.649 ± 0.012 [0.634, 0.664] | 0.865 ± 0.009 [0.854, 0.876] | 51.91 | 59.36 | 83.3
OTB100 | D | 0.653 ± 0.009 [0.642, 0.664] | 0.884 ± 0.008 [0.874, 0.894] | 51.97 | 59.4 | 81.08
UAV123 | A | 0.566 ± 0.018 [0.544, 0.588] | 0.778 ± 0.018 [0.756, 0.800] | 51.38 | 59.35 | 85.46
UAV123 | B | 0.582 ± 0.025 [0.551, 0.613] | 0.797 ± 0.020 [0.772, 0.822] | 51.45 | 59.39 | 83.04
UAV123 | C | 0.586 ± 0.015 [0.567, 0.605] | 0.790 ± 0.015 [0.771, 0.809] | 51.91 | 59.36 | 82.98
UAV123 | D | 0.595 ± 0.012 [0.580, 0.610] | 0.813 ± 0.011 [0.799, 0.827] | 51.97 | 59.4 | 81.16
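For reproducibility of the reporting format used in Tables 3 and 4, the mean, standard deviation, and 95% confidence interval over repeated runs can be computed as below. The use of a Student-t interval and the example scores are assumptions, since the number of runs is not stated in this excerpt.

```python
import numpy as np
from scipy import stats

def summarize(runs):
    """Return (mean, sample SD, 95% t-based confidence interval) for per-run scores."""
    runs = np.asarray(runs, dtype=float)
    mean, sd = runs.mean(), runs.std(ddof=1)
    half = stats.t.ppf(0.975, df=len(runs) - 1) * sd / np.sqrt(len(runs))
    return mean, sd, (mean - half, mean + half)

# Hypothetical success scores from five repeated evaluations
print(summarize([0.641, 0.655, 0.660, 0.648, 0.661]))
```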
Table 4. Tracking results of different models. Bold values indicate the method used in this paper.
Dataset | Tracker | SR (Mean ± SD) [95% CI] | PR (Mean ± SD) [95% CI] | Parameters (M) | GFLOPs | FPS
OTB100 | SiamFC | 0.580 ± 0.032 [0.540, 0.620] | 0.777 ± 0.037 [0.731, 0.823] | 2.33 | 3.18 | 151.92
OTB100 | SiamRPN | 0.592 ± 0.021 [0.566, 0.618] | 0.791 ± 0.025 [0.760, 0.822] | 6.25 | 5.57 | 120.97
OTB100 | SiamRPN++ | 0.622 ± 0.011 [0.608, 0.636] | 0.833 ± 0.011 [0.819, 0.847] | 53.95 | 59.6 | 81.29
OTB100 | SiamBAN | 0.613 ± 0.015 [0.594, 0.632] | 0.834 ± 0.015 [0.815, 0.853] | 53.53 | 59.59 | 81.32
OTB100 | SiamCAR | 0.634 ± 0.016 [0.614, 0.654] | 0.851 ± 0.014 [0.834, 0.868] | 51.38 | 59.35 | 85.36
OTB100 | Ours | 0.653 ± 0.009 [0.642, 0.664] | 0.884 ± 0.008 [0.874, 0.894] | 51.97 | 59.4 | 81.08
UAV123 | SiamFC | 0.448 ± 0.028 [0.413, 0.483] | 0.639 ± 0.027 [0.605, 0.673] | 2.33 | 3.18 | 135.62
UAV123 | SiamRPN | 0.511 ± 0.019 [0.487, 0.535] | 0.727 ± 0.016 [0.707, 0.747] | 6.25 | 5.57 | 119.07
UAV123 | SiamRPN++ | 0.575 ± 0.011 [0.561, 0.589] | 0.792 ± 0.009 [0.781, 0.803] | 53.95 | 59.6 | 80.97
UAV123 | SiamBAN | 0.555 ± 0.011 [0.541, 0.569] | 0.771 ± 0.012 [0.756, 0.786] | 53.53 | 59.59 | 84.30
UAV123 | SiamCAR | 0.566 ± 0.018 [0.544, 0.588] | 0.778 ± 0.018 [0.756, 0.800] | 51.38 | 59.35 | 85.46
UAV123 | Ours | 0.595 ± 0.012 [0.580, 0.610] | 0.813 ± 0.011 [0.799, 0.827] | 51.97 | 59.4 | 81.16
Table 5. Model success rates under OTB100 challenge attributes. Bold values indicate the best performance.
Attribute | SiamFC | SiamRPN | SiamRPN++ | SiamBAN | SiamCAR | Ours
IV | 0.544 | 0.586 | 0.631 | 0.614 | 0.638 | 0.669
SV | 0.559 | 0.584 | 0.625 | 0.611 | 0.639 | 0.647
OCC | 0.525 | 0.530 | 0.586 | 0.567 | 0.581 | 0.606
DEF | 0.530 | 0.576 | 0.587 | 0.592 | 0.577 | 0.622
MB | 0.575 | 0.569 | 0.622 | 0.631 | 0.667 | 0.681
FM | 0.571 | 0.573 | 0.612 | 0.630 | 0.641 | 0.655
IPR | 0.559 | 0.597 | 0.651 | 0.645 | 0.648 | 0.673
OPR | 0.553 | 0.583 | 0.628 | 0.620 | 0.614 | 0.642
OV | 0.471 | 0.505 | 0.531 | 0.504 | 0.569 | 0.573
BC | 0.531 | 0.527 | 0.570 | 0.525 | 0.569 | 0.613
LR | 0.637 | 0.531 | 0.547 | 0.617 | 0.685 | 0.656
Table 6. Model success rates under UAV123 challenge attributes. Bold values indicate the best performance.
Attribute | SiamFC | SiamRPN | SiamRPN++ | SiamBAN | SiamCAR | Ours
IV | 0.305 | 0.449 | 0.554 | 0.501 | 0.504 | 0.551
SV | 0.426 | 0.488 | 0.564 | 0.542 | 0.554 | 0.584
POC | 0.347 | 0.404 | 0.485 | 0.470 | 0.478 | 0.503
FOC | 0.222 | 0.263 | 0.344 | 0.329 | 0.367 | 0.383
OV | 0.403 | 0.474 | 0.533 | 0.502 | 0.519 | 0.523
FM | 0.348 | 0.487 | 0.512 | 0.551 | 0.536 | 0.552
CM | 0.430 | 0.535 | 0.587 | 0.558 | 0.557 | 0.591
BC | 0.255 | 0.292 | 0.374 | 0.333 | 0.361 | 0.381
SOB | 0.408 | 0.436 | 0.504 | 0.509 | 0.489 | 0.519
ARC | 0.378 | 0.459 | 0.545 | 0.529 | 0.533 | 0.554
VC | 0.411 | 0.540 | 0.619 | 0.604 | 0.582 | 0.619
LR | 0.331 | 0.337 | 0.395 | 0.389 | 0.442 | 0.431
Table 7. Model precision under OTB100 challenge attributes. Bold values indicate the best performance.
Attribute | SiamFC | SiamRPN | SiamRPN++ | SiamBAN | SiamCAR | Ours
IV | 0.730 | 0.803 | 0.833 | 0.823 | 0.859 | 0.886
SV | 0.760 | 0.794 | 0.835 | 0.832 | 0.849 | 0.871
OCC | 0.695 | 0.720 | 0.791 | 0.760 | 0.789 | 0.829
DEF | 0.722 | 0.812 | 0.829 | 0.838 | 0.816 | 0.875
MB | 0.736 | 0.748 | 0.802 | 0.822 | 0.870 | 0.895
FM | 0.744 | 0.761 | 0.802 | 0.830 | 0.848 | 0.872
IPR | 0.754 | 0.811 | 0.877 | 0.886 | 0.880 | 0.921
OPR | 0.762 | 0.803 | 0.857 | 0.865 | 0.854 | 0.888
BC | 0.715 | 0.709 | 0.769 | 0.733 | 0.806 | 0.854
OV | 0.614 | 0.699 | 0.687 | 0.668 | 0.807 | 0.794
LR | 0.915 | 0.810 | 0.800 | 0.926 | 0.954 | 0.957
Table 8. Model precision under UAV123 challenge attributes. Bold values indicate the best performance.
Attribute | SiamFC | SiamRPN | SiamRPN++ | SiamBAN | SiamCAR | Ours
IV | 0.443 | 0.643 | 0.791 | 0.713 | 0.724 | 0.771
SV | 0.612 | 0.699 | 0.776 | 0.751 | 0.758 | 0.797
POC | 0.530 | 0.617 | 0.701 | 0.687 | 0.692 | 0.721
FOC | 0.451 | 0.489 | 0.583 | 0.712 | 0.609 | 0.631
OV | 0.600 | 0.683 | 0.735 | 0.713 | 0.711 | 0.718
FM | 0.537 | 0.712 | 0.711 | 0.508 | 0.730 | 0.756
CM | 0.622 | 0.764 | 0.804 | 0.772 | 0.767 | 0.805
BC | 0.434 | 0.486 | 0.616 | 0.527 | 0.590 | 0.599
SOB | 0.601 | 0.658 | 0.718 | 0.727 | 0.727 | 0.749
ARC | 0.556 | 0.669 | 0.756 | 0.738 | 0.740 | 0.768
VC | 0.573 | 0.724 | 0.815 | 0.795 | 0.760 | 0.805
LR | 0.566 | 0.578 | 0.630 | 0.616 | 0.679 | 0.676
Table 9. Video sequences and challenge categories in the OTB100 dataset.
Sequence | Challenge Categories
Basketball | IV, OPR, OCC, DEF, BC
Box | IV, OPR, SV, OCC, MB, IPR, OV, BC
Suv | OCC, IPR, OV
Table 10. Video sequences and challenge categories in the UAV123 dataset.
Sequence | Challenge Categories
bike2 | SV, ARC, FM, IV, VC, CM, SOB
building1 | SV, LR, POC, VC
wakeboard3 | SV
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
