In this section, we first introduce the experimental settings and implementation details of the proposed GTUTrack. We then compare GTUTrack with a broad range of state-of-the-art multi-modal tracking methods on multiple benchmarks. Finally, extensive ablation studies and qualitative analyses are conducted to validate the effectiveness of each component and to further investigate the behavior of the proposed search region-guided adaptive template update framework.
3.2. Implementation Details
GTUTrack is implemented based on the OSTrack framework, and the overall tracking architecture follows the standard template-search matching pipeline. During tracking, the initial template pair is fixed as the anchor reference, while dynamic templates are updated online through the proposed search region-guided adaptive template update strategy.
For the Guided Template Selection Transformer (GTST), we use a batch size of 32 for training and adopt the AdamW optimizer with an initial learning rate of . A cosine annealing schedule is employed to ensure stable convergence. The selector is trained to estimate the relative quality of candidate templates with respect to the current search region, so that it can provide reliable template selection during inference.
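The cosine annealing schedule mentioned above can be sketched as a multiplier applied to the initial learning rate. This is a minimal illustration of the standard formula only; the function name `cosine_factor` and the step count are hypothetical, and no learning-rate value from the paper is assumed.

```python
import math

def cosine_factor(step, total_steps):
    """Multiplier in [0, 1] applied to the initial learning rate.

    Standard cosine annealing to zero: 0.5 * (1 + cos(pi * t / T)).
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# TOTAL_STEPS is an illustrative placeholder, not a setting from the paper.
TOTAL_STEPS = 100
factors = [cosine_factor(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

The factor starts at 1, decays smoothly, and reaches 0 at the final step, which avoids abrupt changes in the step size late in training.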
To alleviate the severe class imbalance in template selection, where only one template is optimal and the remaining templates are non-optimal, we assign adaptive weights to different samples during training. Specifically, the positive sample (i.e., the optimal template) is assigned a weight of , while each negative sample is assigned a weight of 1. This strategy balances their contributions during optimization and improves the discriminative ability of the selector.
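One way to realize this sample weighting is a weighted binary cross-entropy over the candidate templates. The sketch below is an assumption made for illustration: the text states only that the positive sample has its own weight and each negative has weight 1, so the loss form, the function name `weighted_bce`, and any value passed as `pos_weight` are hypothetical.

```python
import math

def weighted_bce(scores, best_index, pos_weight):
    """Weighted binary cross-entropy over candidate-template scores.

    scores:     predicted probabilities in (0, 1), one per candidate template.
    best_index: index of the optimal template (the single positive sample).
    pos_weight: weight of the positive sample (hypothetical parameter);
                every negative sample uses weight 1.
    """
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for i, p in enumerate(scores):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if i == best_index:
            weight, loss = pos_weight, -math.log(p)
        else:
            weight, loss = 1.0, -math.log(1.0 - p)
        total += weight * loss
        weight_sum += weight
    return total / weight_sum

# Illustrative call: four candidates, the first one is optimal.
example_scores = [0.9, 0.1, 0.1, 0.1]
loss_correct = weighted_bce(example_scores, best_index=0, pos_weight=3.0)
loss_wrong = weighted_bce(example_scores, best_index=1, pos_weight=3.0)
```

With a larger `pos_weight`, an error on the single optimal template costs as much as errors on several non-optimal ones, which is the balancing effect described above.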
For the Dynamic Threshold Module (DTM), the update threshold of each sequence is estimated from the confidence statistics of the first several frames using the truncated mean strategy. This design introduces no additional trainable parameters and incurs negligible computational overhead. For the Dynamic Template Memory Module (DTMM), the fixed, candidate, and historical template memories are maintained online during tracking to support adaptive template selection and update. All experiments are conducted under the same evaluation protocol as the compared methods to ensure fair comparison.
3.3. Quantitative Comparison
We first evaluate GTUTrack on VTUAV, a standard benchmark for multi-modal UAV tracking, to verify its effectiveness in UAV-specific scenarios. We then further evaluate the proposed method on three additional multi-modal tracking benchmarks collected in typical surveillance scenarios, including RGBT210, RGBT234, and LasHeR, in order to validate its generalization capability. The overall comparison results are summarized in Table 1, where the best, second-best, and third-best results are highlighted in bold, underline, and italic, respectively.
A clear observation from Table 1 is that GTUTrack consistently achieves the best performance across all four datasets. Compared with recent strong baselines, including CAFormer, CGATrack, and TATrack, the proposed method yields consistent gains in both localization and overlap-based metrics. These results indicate that the proposed search region-guided adaptive template update mechanism improves not only target localization accuracy, but also the stability of template matching under complex appearance variation.
(1) Evaluation on VTUAV.
We first evaluate GTUTrack on VTUAV, which is specifically designed for multi-modal UAV tracking and contains challenges such as small target size, fast motion, significant viewpoint variation, and cluttered backgrounds. As shown in Table 1, GTUTrack achieves 91.4% in PR and 78.4% in SR, significantly outperforming all competing methods. Compared with CGATrack, GTUTrack improves PR and SR by 2.4% and 1.8%, respectively. Compared with CAFormer, the gains further increase to 2.8% in PR and 2.2% in SR. These improvements are particularly meaningful on VTUAV, where the target appearance often changes abruptly due to UAV motion and viewpoint variation.
The strong performance on VTUAV demonstrates that GTUTrack is well suited to multi-modal UAV tracking. On the one hand, GTST can adaptively select the template that best matches the current search region, which is crucial when the target undergoes a rapid appearance change. On the other hand, DTM prevents unreliable search regions from being updated into the template pool, while DTMM preserves both stable and high-quality historical target representations. As a result, the tracker can better maintain robust target representation throughout the UAV tracking process.
(2) Generalization on Surveillance Benchmarks.
To further validate the generalization capability of GTUTrack beyond UAV scenarios, we additionally evaluate it on three widely used RGBT tracking benchmarks from typical surveillance scenarios, namely RGBT210, RGBT234, and LasHeR.
On RGBT210, GTUTrack achieves the best performance on both PR and SR, reaching 90.9% and 66.3%, respectively. Compared with the second-best method CGATrack, GTUTrack improves PR and SR by 3.1% and 2.0%, respectively. It also surpasses CAFormer by 5.3% and 3.1% in PR and SR, respectively, and outperforms QAT by 4.1% in PR and 4.4% in SR. Since RGBT210 contains imperfectly aligned RGB and thermal pairs, these gains suggest that the proposed adaptive template selection mechanism can effectively alleviate the impact of cross-modal inconsistency and noisy template updates.
On RGBT234, GTUTrack achieves 92.0% in PR and 68.8% in SR, consistently outperforming all competing methods. Compared with CGATrack, GTUTrack achieves gains of 3.0% in PR and 2.2% in SR. Compared with CAFormer and USTrack, GTUTrack also shows clear improvements. Since RGBT234 provides better cross-modal alignment and more accurate annotations, the results on this dataset further verify that the proposed framework can effectively exploit multi-modal complementary information while maintaining accurate and adaptive template update.
On LasHeR, a large-scale and highly challenging benchmark, GTUTrack achieves the best performance across all three metrics, reaching 75.0% in PR, 71.1% in NPR, and 59.3% in SR. Compared with CGATrack, GTUTrack achieves improvements of 2.8%, 2.8%, and 1.8% in PR, NPR, and SR, respectively. Compared with TATrack and BAT, GTUTrack also yields consistent gains. Because LasHeR contains diverse challenges such as occlusion, low resolution, deformation, and background clutter, these results indicate that the proposed adaptive template memory and selection mechanism can effectively handle substantial appearance variation over long tracking sequences.
Overall, the quantitative results on VTUAV verify the effectiveness of GTUTrack for multi-modal UAV tracking, while the consistent improvements on RGBT210, RGBT234, and LasHeR demonstrate its strong generalization capability across typical surveillance scenarios. More importantly, the superiority of GTUTrack across both UAV and surveillance benchmarks suggests that search region-guided adaptive template update is a generally effective solution for improving multi-modal tracking robustness under dynamic scene variation. Meanwhile, as shown in Table 2, GTUTrack maintains favorable efficiency with 182.52M parameters, 76.36G FLOPs, and 62 FPS, offering a good balance between accuracy and speed for real-time UAV tracking.
3.4. Ablation Study
(1) Component Analysis.
We conduct incremental ablation experiments to evaluate the contribution of each component in GTUTrack. As shown in Table 3, the baseline tracker (OSTrack+RGBT) achieves 67.8%/64.3%/54.0% in PR/NPR/SR on LasHeR and 86.4%/64.5% in PR/SR on RGBT234. After introducing GTST, the performance increases to 72.5%/68.5%/57.5% on LasHeR and 88.8%/66.2% on RGBT234. This corresponds to gains of 4.7%, 4.2%, and 3.5% in PR, NPR, and SR on LasHeR, as well as 2.4% in PR on RGBT234, demonstrating that search region-guided adaptive template selection plays a dominant role in improving tracking performance.
After further incorporating the Dynamic Threshold Module (DTM), the performance improves to 73.3%/69.1%/58.2% on LasHeR and 89.8%/67.0% on RGBT234. These gains indicate that adaptive thresholding can effectively filter unreliable candidate templates and reduce the risk of noisy template updates. Compared with fixed update strategies, DTM provides sequence-specific update criteria, which are better suited to varying scene conditions.
When the Historical Template Memory (HTM) is added, the performance further increases to 74.4%/70.8%/59.0% on LasHeR and 91.5%/68.4% on RGBT234. This result shows that preserving high-quality historical templates is beneficial for maintaining stable target representation, especially when recent candidate templates are affected by temporary noise or local appearance degradation.
Finally, by introducing the Candidate Template Memory (CTM), the complete GTUTrack achieves the best performance across all metrics, i.e., 75.0%/71.1%/59.3% on LasHeR and 92.0%/68.8% on RGBT234. This final improvement demonstrates that recent candidate templates and reliable historical templates are complementary. Together, they provide both adaptability to current appearance variation and stability against noisy updates. Overall, the ablation results confirm that GTST, DTM, and DTMM each make a positive contribution, and their combination yields the strongest performance.
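The incremental gains on LasHeR can be checked directly from the figures quoted above. The following sketch only recomputes differences between consecutive rows; the names `ABLATION_LASHER` and `stepwise_gains` are introduced here for illustration.

```python
# LasHeR results (PR, NPR, SR) quoted from the component analysis above.
ABLATION_LASHER = [
    ("baseline", (67.8, 64.3, 54.0)),
    ("+GTST", (72.5, 68.5, 57.5)),
    ("+DTM", (73.3, 69.1, 58.2)),
    ("+HTM", (74.4, 70.8, 59.0)),
    ("+CTM (full)", (75.0, 71.1, 59.3)),
]

def stepwise_gains(rows):
    """Gain of each row over the previous one, rounded to one decimal."""
    gains = []
    for (_, prev), (name, curr) in zip(rows, rows[1:]):
        delta = tuple(round(c - p, 1) for p, c in zip(prev, curr))
        gains.append((name, delta))
    return gains

for name, delta in stepwise_gains(ABLATION_LASHER):
    print(f"{name:12s} dPR={delta[0]:+.1f} dNPR={delta[1]:+.1f} dSR={delta[2]:+.1f}")
```

The largest single step is the introduction of GTST, and the four steps together account for the full gap between the baseline and the complete model.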
(2) Threshold Analysis.
We further analyze the impact of different template update thresholds. As shown in Table 4, fixed thresholds lead to noticeable performance fluctuations. On LasHeR, the best fixed-threshold result is obtained at 0.75, with 73.3% PR, 69.0% NPR, and 58.3% SR. However, when the threshold is increased to 0.80 and 0.85, the performance drops consistently. A similar trend can be observed on RGBT234, where the best fixed-threshold setting still underperforms the proposed adaptive strategy.
In contrast, the proposed DTM consistently outperforms all fixed-threshold settings. Compared with the best fixed threshold (0.75), DTM improves PR/NPR/SR by 1.7%/2.1%/1.0% on LasHeR and improves PR/SR by 2.2%/1.9% on RGBT234. These results demonstrate that adaptive thresholding can better balance template quality and update frequency. Instead of relying on a globally fixed rule, DTM adjusts the update criterion according to the confidence statistics of each sequence, thereby producing more reliable candidate templates and improving overall tracking robustness.
We further perform ablation analysis on the initial threshold of the DTM module. As shown in Experiment 5, tracking performance declines significantly without adopting an initial threshold, where the template update threshold is entirely determined by the confidence scores of the first 20 frames within each tracking sequence. This phenomenon mainly arises because the calculated threshold tends to be unreliable when the early frames suffer from complex tracking difficulties. By comparing Experiments 6, 7 and 8, we verify that setting the initial threshold to 0.75 yields the optimal configuration for template updating.
The threshold analysis also provides further evidence for the necessity of adaptive template update. A fixed threshold that works reasonably well on one dataset or sequence may not generalize to others with different confidence distributions. By contrast, the proposed DTM offers a lightweight yet effective way to improve update reliability without introducing additional learnable parameters.
(3) Memory Capacity Analysis.
We conduct ablation experiments to investigate how different template memory configurations affect tracking performance on three benchmarks, namely LasHeR, RGBT234, and VTUAV. The triplet in the “Methods” column denotes the capacities of the historical template memory, candidate template memory, and fixed template memory, respectively. As shown in Table 5, the configuration with 2 historical templates, 5 candidate templates, and 1 fixed template achieves the best overall performance, obtaining the highest scores across all datasets. Reducing the historical memory to 0 or 1 leads to a noticeable performance drop, indicating that a small but effective historical template pool is important for modeling long-term target appearance variations. In contrast, increasing the historical memory to 3 while reducing the candidate memory to 4 slightly degrades performance, as a smaller candidate pool limits the model’s ability to adapt to recent appearance changes. These results suggest that balancing historical and candidate memories is crucial. Therefore, we adopt the setting of 2 historical templates, 5 candidate templates, and 1 fixed template in our full model.
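To make the three memories concrete, the sketch below models a template pool with the adopted capacities (2 historical, 5 candidate, 1 fixed). It is an illustration under stated assumptions rather than the authors' implementation: the class name `TemplateMemory`, the first-in-first-out policy for candidates, and the keep-the-highest-confidence policy for historical templates are all hypothetical, since the replacement rules are not specified here.

```python
from collections import deque

class TemplateMemory:
    """Fixed, candidate, and historical template memories (illustrative)."""

    def __init__(self, fixed_templates, candidate_capacity=5, historical_capacity=2):
        self.fixed = list(fixed_templates)                # never replaced
        self.candidates = deque(maxlen=candidate_capacity)  # assumed FIFO
        self.historical = []                              # assumed top-confidence
        self.historical_capacity = historical_capacity

    def update(self, template, confidence, threshold):
        """Store a template only if its confidence passes the update threshold."""
        if confidence < threshold:
            return False
        self.candidates.append((confidence, template))
        self.historical.append((confidence, template))
        self.historical.sort(key=lambda item: item[0], reverse=True)
        del self.historical[self.historical_capacity:]
        return True

    def pool(self):
        """All templates offered to the selector, without duplicates."""
        ordered = (self.fixed
                   + [t for _, t in self.historical]
                   + [t for _, t in self.candidates])
        seen, unique = set(), []
        for t in ordered:
            if id(t) not in seen:
                seen.add(id(t))
                unique.append(t)
        return unique

# Usage with placeholder templates (strings) and a placeholder threshold.
memory = TemplateMemory(["t0"])
confidences = [0.9, 0.5, 0.8, 0.95, 0.85, 0.78, 0.76, 0.99]
accepted = [memory.update(f"f{i + 1}", c, 0.75) for i, c in enumerate(confidences)]
```

Under these assumptions the candidate memory tracks recent appearance while the historical memory retains the most reliable past templates, which mirrors the complementary roles discussed above.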
(4) Analysis of Threshold Initialization Parameters.
We conduct ablation experiments in Table 6 to explore the influence of the hyperparameters n and m in the dynamic threshold module. Here, the two values in the “Methods” column follow the format (n, m), where n denotes the number of initial frames used to calculate the adaptive threshold, and m represents the number of maximum and minimum confidence scores discarded to remove outliers.
Experimental results show that inappropriate combinations of n and m degrade tracking accuracy. When n is too small, the early-frame information is insufficient, so the estimated threshold cannot reflect the real scene difficulty. When n is excessively large, redundant frames introduce more interference and increase computation cost. Meanwhile, an m that is too small cannot effectively filter abnormal confidence values caused by occlusion and background clutter, while an m that is too large discards valid information.
The optimal performance is achieved when n = 20 and m = 2. This setting can stably exclude outlier scores, accurately fit the baseline confidence level of the tracking sequence, and generate reliable adaptive update thresholds. It successfully balances scene adaptability and computational simplicity, so we adopt this parameter combination in all subsequent experiments.
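The truncated mean with n = 20 and m = 2 can be sketched as follows. Two points are assumptions: m is read as the number of scores dropped at each end (m highest and m lowest), and the function returns the truncated mean itself, since how this statistic is combined with the initial threshold is not detailed here. The name `truncated_mean_threshold` is hypothetical.

```python
def truncated_mean_threshold(confidences, n=20, m=2):
    """Truncated mean of the first n confidence scores.

    Drops the m lowest and m highest scores (assumed reading of m),
    then averages the remainder.
    """
    head = sorted(confidences[:n])
    if len(head) <= 2 * m:
        raise ValueError("need more than 2 * m scores to compute a truncated mean")
    kept = head[m:len(head) - m]
    return sum(kept) / len(kept)

# Illustrative sequence: mostly stable confidence with four outliers.
scores = [0.8] * 16 + [0.0, 0.1, 1.0, 0.99]
threshold = truncated_mean_threshold(scores)
```

In this example the four outliers are removed before averaging, so the estimate stays at the stable confidence level instead of being pulled toward the extremes.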
(5) Perturbation Analysis.
We evaluate the robustness of our tracker against spatial misalignment between RGB and thermal modalities through perturbation experiments. As shown in Table 7, our method (row 1) corresponds to perfectly aligned data, yielding the highest performance on both LasHeR and VTUAV. When translational offsets (row 2) or rotational deviations (row 3) are introduced, the tracking metrics gradually degrade, and performance drops further under combined translation and rotation perturbations (row 4). This is expected, since our model is trained on well-aligned multi-modal data; any spatial misalignment breaks the learned cross-modal correspondence and thus weakens the tracking ability.
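The geometry of such a perturbation can be sketched as a single affine transform that rotates one modality about the image center and then translates it. This is an illustrative sketch only: the function names and the offset and angle values in the example are hypothetical and are not the magnitudes used in Table 7.

```python
import math

def misalignment_matrix(dx, dy, angle_deg, center):
    """2x3 affine matrix: rotate by angle_deg about center, then shift by (dx, dy).

    Maps a point p to R (p - c) + c + t.
    """
    cx, cy = center
    angle = math.radians(angle_deg)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    tx = cx - cos_a * cx + sin_a * cy + dx
    ty = cy - sin_a * cx - cos_a * cy + dy
    return [[cos_a, -sin_a, tx], [sin_a, cos_a, ty]]

def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    (a, b, tx), (c, d, ty) = matrix
    return (a * x + b * y + tx, c * x + d * y + ty)

# Placeholder perturbations on a 100 x 100 image (values are illustrative).
center = (50.0, 50.0)
shift_only = misalignment_matrix(4.0, -3.0, 0.0, center)
rotate_only = misalignment_matrix(0.0, 0.0, 90.0, center)
shifted_point = apply_affine(shift_only, (10.0, 20.0))
rotated_point = apply_affine(rotate_only, (60.0, 50.0))
```

Applying such a transform to the thermal frame while leaving the RGB frame untouched produces a controlled cross-modal misalignment of the kind evaluated above.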
3.5. Attribute Analysis
We evaluate the attribute-based tracking performance of GTUTrack on the challenging RGBT234 dataset. The radar chart reports the overall performance across 12 typical attribute subsets, including background clutter (BC), scale variation (SC), partial occlusion (PO), thermal crossover (TC), no occlusion (NO), motion blur (MB), low resolution (LR), low illumination (LI), heavy occlusion (HO), fast motion (FM), deformation (DEF), and camera motion (CM). As shown in Figure 4, GTUTrack consistently achieves the best performance across all attribute scenarios compared with other state-of-the-art trackers, including BAT, CGATrack, SDSTrack, ViPT, and TBSI. The polygon corresponding to GTUTrack forms the outermost boundary in the radar chart, indicating superior robustness under diverse challenging conditions.
GTUTrack also exhibits strong adaptability to both RGB-degraded and TIR-degraded scenarios. When RGB information is severely impaired, such as in low illumination (LI), partial occlusion (PO), and fast motion (FM), GTUTrack can effectively exploit complementary thermal information to maintain stable tracking. Meanwhile, when thermal information becomes unreliable, such as under thermal crossover (TC) and background clutter (BC), GTUTrack remains robust by leveraging discriminative RGB features. Moreover, GTUTrack shows clear advantages in handling deformation (DEF), camera motion (CM), and scale variation (SC), demonstrating its ability to cope with complex geometric variations and motion patterns. Overall, the attribute-based evaluation confirms that GTUTrack is robust to a wide range of adverse conditions, and its cross-modal adaptive mechanism enables consistent performance gains over existing trackers.
3.6. Qualitative Analysis
We visualize the tracking results on representative sequences from the VTUAV and LasHeR datasets, covering several challenging scenarios, including occlusion (OCC), small targets (ST), low illumination (LI), and high reflection (HR). Figure 5 shows that GTUTrack consistently produces more accurate and stable tracking results than competing methods in these challenging situations.
In bus_14, the target undergoes extreme illumination variation together with noticeable shape change during tracking. Under such conditions, several competing methods gradually drift away from the target or fail to maintain stable localization. In contrast, GTUTrack consistently remains locked onto the target throughout the sequence. This result demonstrates that the proposed method can effectively adapt to severe appearance variation and maintain reliable target representation even when the visual characteristics of the target change significantly.
In rightdarksingleman, GTUTrack successfully adapts to illumination changes and achieves consistent target localization throughout the sequence. Compared with competing methods that gradually deviate from the target under changing brightness conditions, GTUTrack remains more stable due to its ability to update templates adaptively according to the current search region.
Overall, these qualitative results demonstrate that GTUTrack exhibits superior robustness compared with existing multi-modal tracking methods, particularly in dynamic and challenging scenarios involving severe appearance variation, environmental interference, and modality degradation. The observations are also consistent with the quantitative and ablation results, further validating the effectiveness of the proposed search region-guided adaptive template update framework.