4.3. Datasets and Evaluation Metrics
In single-object tracking, the choice of datasets and evaluation metrics is crucial for algorithm development and performance assessment. To comprehensively verify the effectiveness of the proposed method, the model is trained on the GOT-10K dataset and tested on the OTB100 and UAV123 datasets. The characteristics of these datasets and the commonly used evaluation metrics in single-object tracking are described in this section.
GOT-10K (Generic Object Tracking Benchmark) [30] is one of the largest and most diverse datasets used in single-object tracking. It contains over 10,000 video sequences, covering 563 object classes and 87 motion classes, which greatly enriches the diversity of tracking scenarios. All sequences are captured in real-world environments and include various challenging factors such as background clutters, occlusion, motion blur, and scale variation.
OTB100 (Object Tracking Benchmark 100) [31] is one of the earliest and most widely used benchmarks in object tracking. It consists of 100 video sequences covering a wide range of scenes and target types. The benchmark defines a set of 11 challenge attributes for systematic evaluation: occlusion (OCC), illumination variation (IV), scale variation (SV), fast motion (FM), background clutters (BC), deformation (DEF), motion blur (MB), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), and low resolution (LR). These attributes provide a standardized platform for performance comparison and analysis.
UAV123 (Unmanned Aerial Vehicle 123) [32] focuses on object tracking from aerial UAV viewpoints. It contains 123 high-definition video sequences covering various objects such as vehicles, pedestrians, and boats. The benchmark defines a set of 12 challenge attributes: illumination variation (IV), scale variation (SV), partial occlusion (POC), full occlusion (FOC), out-of-view (OV), fast motion (FM), camera motion (CM), background clutters (BC), similar object (SOB), aspect ratio change (ARC), viewpoint change (VC), and low resolution (LR). Owing to this structured attribute annotation, UAV123 serves as a critical benchmark for assessing and comparing the robustness of tracking algorithms in UAV-based application scenarios.
To scientifically evaluate tracking performance, success rate (SR) and precision rate (PR) are used as evaluation metrics [33].
Success rate measures tracking accuracy by computing the Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box:

$$\mathrm{IoU} = \frac{\left| B_p \cap B_g \right|}{\left| B_p \cup B_g \right|},$$

where $B_g$ denotes the ground-truth bounding box and $B_p$ denotes the predicted bounding box produced by the tracker. IoU ranges from 0 to 1, and a frame is considered successfully tracked when its IoU exceeds a threshold $t$. To avoid the randomness introduced by using a single threshold, the success plot is computed over a set of thresholds $t \in [0, 1]$, and the area under the curve (AUC) is used as the final success-score metric. A larger AUC indicates that the model performs consistently well across the full range of overlap thresholds.
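For concreteness, the computation can be sketched as follows. This is a minimal NumPy illustration rather than the benchmark toolkits' official code; the (x, y, w, h) box format, the function names, and the 21-point threshold grid are assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU between two axis-aligned boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Width/height of the intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0.0, 1.0, 21)):
    """Success rate at each overlap threshold, averaged into the AUC score."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    success = np.array([(ious > t).mean() for t in thresholds])
    return float(success.mean())
```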
Precision rate measures localization accuracy by computing the Euclidean distance between the center of the predicted bounding box and that of the ground-truth bounding box:

$$d = \sqrt{\left( x_p - x_g \right)^2 + \left( y_p - y_g \right)^2},$$

where $(x_p, y_p)$ are the predicted center coordinates and $(x_g, y_g)$ are the ground-truth center coordinates. A frame is considered precisely localized when the center distance $d$ is smaller than a threshold (typically 20 pixels).
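Analogously, the precision rate at the conventional 20-pixel threshold can be computed as below (again a minimal NumPy sketch; the function name and the array-of-centers input format are assumptions):

```python
import numpy as np

def precision_rate(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center location error is below `threshold` pixels."""
    errors = np.linalg.norm(
        np.asarray(pred_centers, dtype=float) - np.asarray(gt_centers, dtype=float),
        axis=1,
    )
    return float((errors < threshold).mean())
```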
In addition to accuracy, model efficiency is assessed with the following metrics. Parameters: the total number of learnable parameters in the model, reported in millions (M). This metric directly reflects the model’s storage requirement and structural complexity.
GFLOPs: The total number of floating-point operations required for a single forward pass, reported in billions of FLOPs. It is commonly used to quantify computational complexity and inference cost.
Frames Per Second (FPS): The number of image frames the model can process per second, used to evaluate real-time inference capability.
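As a rough illustration, parameter count and FPS can be measured as sketched below. This is a minimal PyTorch sketch under stated assumptions: the model is a placeholder taking a single input tensor (a Siamese tracker's forward pass actually takes a template and a search region), the 255×255 input shape is illustrative, and GFLOPs would in practice be obtained from a profiling tool rather than computed by hand.

```python
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Total learnable parameters, in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_shape=(1, 3, 255, 255), n_iters=100):
    """Average frames per second over n_iters forward passes on dummy input."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(10):  # warm-up so one-time initialization does not skew timing
        model(x)
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    return n_iters / (time.perf_counter() - start)
```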
4.4. Ablation Studies
To further validate the effectiveness of the proposed modules for object tracking, ablation studies are conducted on the OTB100 and UAV123 datasets. The goal of the ablation experiments is to evaluate the contribution of each key component to the overall performance and to analyze how the different modules affect tracking accuracy and robustness. SiamCAR is adopted as the baseline, and the proposed multi-scale attention module (MSAM) and Adaptive Gated Fusion Module (AGFM) are integrated both individually and jointly to construct multiple model variants for systematic ablation analysis. The experimental configurations are detailed as follows:
- (1) Model A. As a control, the original SiamCAR framework is used as the baseline for subsequent comparisons.
- (2) Model B. The AGFM is applied at the feature fusion stage to enhance feature fusion quality.
- (3) Model C. The MSAM is added to the backbone network to improve feature representation capability.
- (4) Model D. Both the MSAM and the AGFM are integrated, forming the complete model proposed in this paper.
As shown in Table 3, the progressive integration of the MSAM and the AGFM leads to consistent and substantial improvements in overall performance. After incorporating the AGFM, Model B attains an SR of 0.640 and a PR of 0.866 on OTB100, and an SR of 0.582 with a PR of 0.797 on UAV123. Meanwhile, the model contains 51.45 M parameters and requires 59.39 GFLOPs per forward pass, while maintaining an inference speed of over 83 FPS. These results suggest that dynamic fusion and selection of multi-level features improve localization accuracy and robustness without introducing a substantial computational burden. Model C further raises the SR to 0.649 and the PR to 0.865 on OTB100, while achieving an SR of 0.586 and a PR of 0.790 on UAV123. In terms of efficiency, Model C has 51.91 M parameters with virtually unchanged GFLOPs and maintains an inference speed of over 82 FPS, indicating that the enhanced representational capacity stems from efficient multi-scale attention modeling rather than redundant computation. By integrating both modules, Model D delivers the best overall performance, with SR and PR improvements over the baseline of 0.019 and 0.033 on OTB100, and of 0.029 and 0.035 on UAV123. Its parameter count is 51.97 M (a total increase of 1.15%) and its cost is 59.4 GFLOPs (a total increase of 0.08%), while the inference speed remains above 81 FPS, preserving real-time tracking capability. It is worth noting that, although the SR gain of Model D over Model C on OTB100 is only 0.004, the results across both datasets show that the corresponding SR improvement on UAV123 (0.009) is noticeably larger. Moreover, the PR metric achieves stable gains over Model C on both datasets, with improvements of 0.019 on OTB100 and 0.023 on UAV123. This consistent improvement across datasets and metrics validates the complementary gains of the two modules: the multi-scale discriminative features enhanced by the MSAM provide a higher-quality input for the AGFM’s dynamic fusion, while the AGFM’s adaptive selection mechanism further amplifies the contribution of key features. Together, they realize a synergy of feature enhancement and precise fusion rather than a simple superposition of performance. Overall, the experimental results demonstrate that the MSAM and AGFM achieve a good balance between performance improvement and computational cost through efficient feature enhancement and fusion design.
As shown in Figure 5 and Figure 6, on both the OTB100 and UAV123 datasets, all model variants exhibit consistent trends in success rate and precision rate as the evaluation threshold varies. Specifically, as the overlap threshold gradually increases, the success rates of all models decrease; however, the complete Model D demonstrates superior performance across the vast majority of threshold intervals, with its success rate curve positioned above those of the other models overall. This indicates that Model D provides more stable target location predictions and demonstrates stronger robustness in challenging scenarios such as occlusion and scale variation. Meanwhile, in the precision plots, as the localization error threshold increases, the precision rates of all models improve. The complete Model D also exhibits a measurable performance gain, with the improvement being particularly pronounced in the high-threshold regime. Notably, it performs slightly better than the variants that incorporate only a single module. The consistent improvements across multiple metrics and datasets suggest that the two modules provide complementary benefits rather than redundant contributions.
To more intuitively illustrate the performance differences among tracking algorithms under various challenge attributes, we further analyze the experimental results through visualization. Three video sequences, Coupon, Liquor, and Vase, are selected from the OTB100 dataset, and three sequences, car5, person20, and truck1, are chosen from the UAV123 dataset. Both the baseline tracker and the proposed method are applied to these videos, and the tracking bounding boxes of each tracker are overlaid on the video frames to enable a more direct comparison of tracking quality. In the visualizations, the magenta boxes denote the baseline method, while the green boxes denote the proposed method.
As shown in Figure 7, the improved model demonstrates more accurate and stable tracking performance across the different sequences. In the Coupon sequence, despite challenging conditions such as occlusion and fast motion, the proposed method can still track the target steadily, showing strong robustness. In the Liquor sequence, the object moves rapidly, causing the baseline model to fail; in contrast, the improved model can accurately capture and continuously track the target, significantly improving tracking success. In the Vase sequence, which is characterized by challenging factors such as background clutters, the baseline model suffers from insufficient localization accuracy, whereas the improved model can effectively distinguish the target from the background and achieve more precise tracking. The 4.4% improvement under the BC attribute reflects the average gain across all sequences annotated with BC; however, since this subset includes a substantial number of relatively easy sequences that the baseline already handles well, the performance gains on more difficult cases are partially diluted. As shown in Figure 8, the car5, person20, and truck1 sequences exhibit pronounced viewpoint changes. In such scenarios, the baseline model often produces inaccurate tracking results or even loses the target, while the improved model adapts better to viewpoint change and tracks the target more accurately and consistently. Therefore, under complex conditions such as fast motion, occlusion, background clutters, and viewpoint change, the improved model consistently achieves higher tracking accuracy and robustness, which fully validates the effectiveness and broad applicability of the proposed method for achieving stable object tracking in complex visual scenarios.
Further analysis indicates that the multi-scale attention module can capture salient target regions at different spatial scales, thereby improving the model’s adaptability to scale variation and background clutters. The adaptive gated fusion module dynamically adjusts the fusion weights to effectively select informative features while suppressing redundant and conflicting information, which strengthens feature complementarity. Working together, these two modules substantially enhance the model’s feature representation capability and fusion flexibility, providing a solid technical foundation for object tracking in complex scenarios.
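As an intuition aid, the following is a minimal sketch of what such an adaptive, spatially varying gated fusion could look like. The class name, the 1×1-convolution gate, and the softmax normalization are illustrative assumptions, not the paper’s exact AGFM design.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blends multi-level feature maps with per-position, per-level gates."""

    def __init__(self, channels: int, num_levels: int = 3):
        super().__init__()
        # A 1x1 convolution predicts one gate map per input level from the
        # concatenated features; softmax makes the levels compete for weight.
        self.gate = nn.Conv2d(channels * num_levels, num_levels, kernel_size=1)

    def forward(self, feats):
        # feats: list of num_levels tensors, each of shape (B, C, H, W).
        stacked = torch.cat(feats, dim=1)                   # (B, C*L, H, W)
        weights = torch.softmax(self.gate(stacked), dim=1)  # (B, L, H, W)
        fused = sum(w.unsqueeze(1) * f                      # broadcast over C
                    for w, f in zip(weights.unbind(dim=1), feats))
        return fused                                        # (B, C, H, W)
```

With, say, three backbone stages projected to a common width of 256 channels, `GatedFusion(256)` would mix them with weights that vary per spatial position, which is the kind of dynamic selection and suppression of conflicting features described above.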
4.5. Comparative Experiments
To validate the performance of the proposed method for object tracking, we select several representative Siamese-network-based trackers for comparison, including SiamFC, SiamBAN, SiamRPN, SiamRPN++, and SiamCAR. All methods are trained on the GOT-10K dataset, and their tracking performance is evaluated on the OTB100 and UAV123 datasets. The experimental results, detailed in Table 4, demonstrate the superiority of our method. On the OTB100 dataset, it attains an SR of 0.653 and a PR of 0.884; on UAV123, it further reaches an SR of 0.595 and a PR of 0.813. Under the same training settings, the proposed method exhibits a pronounced performance advantage over the SiamCAR baseline as well as other mainstream Siamese trackers. These results suggest that the proposed MSAM and AGFM effectively refine feature representation and fusion in complex scenarios, enabling the tracker to maintain robust object tracking while further improving bounding-box localization accuracy. In terms of efficiency, the proposed method runs slower than lightweight baselines such as SiamFC and SiamRPN, but delivers more substantial performance gains. Compared with more complex models, our approach maintains competitive and stable success and precision rates while reducing the number of parameters and GFLOPs. Overall, the proposed method achieves strong tracking performance while preserving a lightweight design, thereby striking an effective balance between accuracy and computational cost. This trade-off makes it a practical solution for the engineering deployment of Siamese-network-based tracking algorithms.
As shown in Figure 9 and Figure 10, on both the OTB100 and UAV123 datasets, all models exhibit consistent trends in success rate and precision rate as the evaluation thresholds vary. Specifically, as the overlap threshold increases, the success rates of all methods gradually decrease.
However, within the selected range of thresholds, the proposed method achieves superior tracking performance across the majority of evaluation thresholds. The precision plot shows a similar, consistent advantage, indicating that the proposed improvements positively contribute to enhanced target localization accuracy. The experimental results indicate that our approach exhibits better generalization capability and stability in complex environments.
To further analyze the performance of the proposed tracker, we investigate the success rate and precision rate under the 11 challenge attributes of the OTB100 dataset and the 12 challenge attributes of the UAV123 dataset, and compare our method with several representative Siamese-network-based trackers.
As reported in Table 5, our method achieves superior success rates compared to the baseline in the majority of the 11 OTB100 challenge attributes, specifically OCC, IV, SV, FM, BC, DEF, MB, IPR, OPR, and OV. Notably, the gains are most pronounced under deformation and background clutters, where the success rates are 4.5% and 4.4% higher than those of SiamCAR, respectively. This indicates that the proposed method has clear advantages in handling target deformation and interference from complex backgrounds. In addition, it performs strongly under challenging conditions such as motion blur and illumination variation, demonstrating strong environmental adaptability. Furthermore, the relationship between success rate and overlap threshold for the different trackers under these 11 challenge attributes is shown in Figure 11, which provides a detailed comparison of performance trends and differences as the overlap threshold varies across the challenge attributes on OTB100.
As shown in Table 6, the proposed method achieves improvements in both success rate and precision across most of the 12 challenge attributes on the UAV123 dataset, including IV, SV, POC, FOC, OV, FM, CM, BC, SOB, ARC, and VC. Notably, the largest gains in success rate are observed under illumination variation and viewpoint change, where our method outperforms SiamCAR by 4.7% and 3.7%, respectively. These results indicate that the proposed tracker has clear advantages in handling illumination and viewpoint variations. However, under the low-resolution (LR) attribute, the success rate is marginally lower than that of the baseline, indicating a slight limitation when target texture information is severely degraded. This degradation attenuates effective target features and increases the difficulty of extracting discriminative representations, thereby leading to a modest performance drop. These observations also point to a clear direction for subsequent, targeted improvements. Furthermore, the relationship between success rate and overlap threshold for the different trackers under these 12 challenge attributes is illustrated in Figure 12, which provides a detailed comparison of performance trends and differences as the overlap threshold varies across the challenge attributes on UAV123.
As shown in Table 7, the proposed method achieves higher precision under most of the 11 challenge attributes on the OTB100 dataset, including OCC, IV, SV, FM, BC, DEF, MB, IPR, OPR, and LR. Notably, the improvements are particularly significant under deformation and background clutters, where the precision is 5.9% and 4.8% higher than that of SiamCAR, respectively. This suggests that our method not only tracks the target reliably, but also localizes the target boundaries more accurately. Furthermore, the relationship between precision and the location error threshold for the different trackers under these 11 challenge attributes is illustrated in Figure 13, which provides a detailed comparison of performance trends and differences as the location error threshold varies across the challenge attributes on OTB100. Overall, the proposed approach demonstrates stronger robustness and adaptability in a wide range of challenging scenarios, effectively mitigating the susceptibility of traditional methods to disturbances in extreme environments.
As shown in Table 8, the proposed method achieves higher precision across most of the 12 challenge attributes on the UAV123 dataset, including IV, SV, POC, FOC, OV, FM, CM, BC, SOB, ARC, and VC. Notably, the gains are particularly significant under illumination variation and viewpoint change, where the precision is 4.7% and 4.5% higher than that of SiamCAR, respectively. Furthermore, the relationship between precision and the location error threshold for the different trackers under these 12 challenge attributes is illustrated in Figure 14, which provides a detailed comparison of performance trends and differences as the location error threshold varies across the challenge attributes on UAV123.
To provide a more intuitive view of the performance differences among the tracking algorithms under various challenge attributes, we further analyze the experimental results through qualitative visualization. As shown in Table 9, three video sequences are selected from the OTB100 dataset, Basketball, Box, and Suv, each containing multiple challenge attributes. As illustrated in Figure 15, the predicted bounding boxes of the different trackers are overlaid on the video frames to enable a direct visual inspection of tracking quality. In Figure 15, the black bounding box denotes the proposed method; SiamFC is indicated by a red box, SiamBAN by a green box, SiamRPN by a blue box, SiamRPN++ by a bright cyan box, and SiamCAR by a magenta box. For the Basketball sequence, when the target undergoes occlusion or abrupt motion, noticeable differences can be observed among the predicted boxes of the different methods: only SiamRPN fails to track the target, while the other trackers remain successful. For the Box sequence, under motion blur and occlusion, only the proposed method is able to track the target consistently; the other trackers fail to identify the correct target and exhibit varying degrees of drift or inaccurate localization, which eventually leads to tracking failure. For the Suv sequence, during rapid motion, only the proposed tracker and SiamCAR track the target correctly, whereas the remaining trackers suffer from inaccurate localization.
As presented in Table 10, three video sequences, specifically bike2, building1, and wakeboard3, are selected from the UAV123 dataset, each encompassing multiple challenging conditions.
As illustrated in Figure 16, the tracking bounding boxes of the different trackers are visualized on the video frames, enabling a more intuitive comparison of their tracking performance. In Figure 16, the black bounding box indicates the proposed method; SiamFC is indicated by a solid red box, SiamBAN by a green box, SiamRPN by a blue box, SiamRPN++ by a bright cyan box, and SiamCAR by a magenta box. For the bike2 sequence, when the target undergoes scale variation, aspect ratio change, fast motion, illumination variation, viewpoint change, camera motion, or the presence of similar objects, the predicted bounding boxes differ noticeably across methods: only the proposed method and SiamCAR successfully track the target, while the other trackers fail due to distraction from similar objects. For the building1 sequence, under scale variation, low resolution, full occlusion, and viewpoint change, all trackers are able to locate the correct target; however, the trackers other than the proposed method exhibit inaccurate localization to varying degrees. For the wakeboard3 sequence, during viewpoint change, all trackers can follow the target, but again the trackers other than ours suffer from inaccurate localization.