Article

Search Region-Guided Adaptive Template Update for Robust Multi-Modal UAV Tracking

1 School of Artificial Intelligence, Anhui University of Science & Technology, Huainan 232001, China
2 Key Laboratory of Intelligent Computing & Signal Processing, Ministry of Education, Anhui University, Hefei 230601, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(11), 1817; https://doi.org/10.3390/rs18111817
Submission received: 12 April 2026 / Revised: 21 May 2026 / Accepted: 29 May 2026 / Published: 2 June 2026

Highlights

What are the main findings?
  • We propose GTUTrack, a search region-guided adaptive template update framework for robust multi-modal UAV tracking, which jointly performs adaptive template selection, thresholding, and memory management.
  • GTUTrack achieves state-of-the-art performance on VTUAV and also shows strong generalization on RGBT210, RGBT234, and LasHeR under challenging conditions such as occlusion, illumination variation, and scale change.
What are the implications of the main findings?
  • The results show that template update in multi-modal UAV tracking should be conditioned on the current search region rather than relying on fixed update intervals and manually designed thresholds.
  • The proposed framework provides an effective solution for robust target tracking in UAV remote sensing and other multi-modal surveillance scenarios with complex appearance variation and modality discrepancy.

Abstract

Existing multi-modal UAV tracking methods typically rely on fixed-interval dynamic template update strategies to capture diverse target appearances, together with predefined thresholds to select high-quality search regions for template update. However, due to the irregular motion of targets and the complexity of real-world scenarios, such passive update mechanisms suffer from notable limitations. Fixed sampling intervals often fail to adequately capture appearance variations, while fixed threshold-based selection is insufficient to accommodate diverse imaging conditions, leading to ineffective updates or the introduction of noisy templates, thereby degrading tracking robustness and accuracy. To address these issues, we propose a search region-guided adaptive dynamic template update framework for robust multi-modal UAV tracking, aiming to improve both scene adaptability and target matching capability. Specifically, we design a Guided Template Selection Transformer, which dynamically matches templates conditioned on the current search region, enabling the tracker to autonomously select the most suitable template for the target’s current state. Furthermore, we introduce a Dynamic Threshold Module that adaptively adjusts template selection criteria according to different tracking scenarios, ensuring the reliability and contextual relevance of candidate templates. In addition, we develop a Dynamic Template Memory Module to maintain an ordered repository of target templates under different target states, providing a structured and high-quality template pool for the proposed selection mechanism. Extensive experiments on a standard multi-modal UAV tracking benchmark demonstrate that the proposed method significantly outperforms existing approaches, effectively overcoming the limitations of conventional fixed update strategies. 
Moreover, the proposed approach exhibits strong generalization capability across three additional multi-modal tracking datasets from typical surveillance scenarios.

1. Introduction

Vision-based remote sensing image analysis is an active research area, with substantial recent work on tasks such as the analysis of SAR images. Xue et al. [1] propose a lightweight modality compensation network that transfers multimodal knowledge to single-SAR models via knowledge distillation, addressing missing AIS data and high computational cost in order to balance ship detection accuracy and real-time performance. Ai et al. [2] present MKSFF-CNN, which employs multi-scale parallel convolutions to extract and fuse diverse features, improving feature integrity and achieving superior SAR target classification performance on the MSTAR dataset. Ai et al. [3] propose a SAR multi-target ship detection algorithm based on MSRIHL-CNN, which adopts TCS-JCFAR for efficient prescreening and fuses MSRI-HL low-level features with CNN high-level features for discrimination, achieving excellent performance on Gaofen-3 SAR images. Multi-modal UAV tracking [4] has attracted increasing attention in recent years due to its wide applications in aerial surveillance, disaster response, intelligent transportation systems, border patrol, and public safety monitoring. Compared with conventional ground-view tracking, UAV-based tracking [5,6,7] provides a broader field of view and more flexible observation perspectives, making it particularly suitable for large-scale dynamic scene perception. By leveraging complementary information from multiple modalities, such as RGB and thermal infrared [8,9], multi-modal UAV tracking is expected to achieve more robust performance under challenging conditions, including illumination variation, partial occlusion, background clutter, and adverse weather. In particular, RGB data provide rich texture and structural cues under favorable lighting conditions, while thermal infrared data offer stronger target saliency in low-light or visually degraded environments. 
The integration of these modalities therefore provides an effective way to improve target perception in complex UAV tracking scenarios.
Despite these advantages, robust multi-modal UAV tracking remains highly challenging. In practical UAV scenarios, targets often exhibit rapid motion, abrupt scale variation, significant viewpoint changes, and frequent occlusion due to camera motion, altitude changes, or scene complexity. Moreover, UAV platforms typically capture targets at relatively low resolution, making the target appearance more ambiguous and unstable than in conventional surveillance settings. These characteristics make it difficult for a tracker to maintain a reliable and discriminative target representation over time. As a result, effectively modeling dynamic target appearance variations remains a critical challenge in multi-modal UAV tracking.
To alleviate this issue, dynamic template updating has been widely adopted as a key component in modern multi-modal UAV trackers [9]. The underlying motivation is intuitive: as the target appearance evolves over time, the tracker should continuously update its target representation by incorporating newly observed appearances under varying poses, scales, and illumination conditions. In this way, a dynamic template pool can be maintained to cover target appearance variations throughout the tracking process. Dynamic template update is particularly important for UAV tracking, where target states can change rapidly across frames and the mismatch between outdated templates and the current target appearance may easily lead to degraded matching quality or even tracking drift. Therefore, a reliable template update mechanism is essential for preserving robust target representation and improving long-term tracking stability.
However, due to the intrinsic complexity of multi-modal UAV tracking scenarios and the stochastic nature of target motion, existing template update strategies remain largely heuristic and exhibit limited adaptability. Specifically, current approaches [9,10,11,12] typically rely on fixed update intervals and predefined thresholds for template selection, which suffer from several inherent limitations. First, the irregular motion patterns and nonlinear appearance changes of targets make fixed update intervals fundamentally suboptimal. They may introduce redundant samples when the target state is stable, leading to unnecessary computation and noise accumulation, while failing to capture critical appearance variations under abrupt motion or scale changes, resulting in insufficient coverage of the target appearance distribution and potential tracking drift. Second, the significant variation in imaging quality across different scenarios [13,14] further limits the effectiveness of fixed threshold-based selection. For example, the thermal modality often dominates under low-light conditions, whereas both modalities may degrade in adverse environments such as fog or smoke. In UAV tracking, this issue is often more pronounced because aerial viewpoints, altitude variation, and camera motion can further amplify image blur, background interference, and modality inconsistency. Such variability makes predefined thresholds difficult to generalize: overly strict thresholds may discard informative samples, while loose thresholds tend to introduce low-quality templates, undermining the reliability of the template pool. Furthermore, existing template memory mechanisms typically lack structured management [9,15], leading to redundancy and disorder within the template pool. This not only hinders efficient template retrieval and matching but also degrades tracking performance. 
In many cases, newly collected templates may contain overlapping or low-quality target appearances, while historically reliable templates are not explicitly preserved and organized. As a result, the template pool may gradually accumulate noisy or redundant templates, weakening its ability to provide stable target representations for subsequent matching. Meanwhile, the inherent modality discrepancy between RGB and thermal infrared data introduces additional challenges in feature alignment, making template selection more susceptible to cross-modal noise and further reducing tracking accuracy.
From this perspective, the key difficulty of multi-modal UAV tracking is not merely determining how to update templates more frequently, but how to update them more appropriately. An effective template update strategy should answer three closely related questions: when to update, which template should be selected, and how the selected templates should be stored and managed over time. Existing passive strategies based on fixed rules often fail to address these questions in a unified and adaptive manner. This observation motivates us to rethink template update from the perspective of search-region-guided decision making, where the current search region serves as an active cue for evaluating template relevance, estimating update reliability, and organizing template memory.
To address the above challenges, we propose a search region-guided adaptive template update framework for robust multi-modal UAV tracking, which establishes a fully adaptive pipeline for template selection, update, and memory management, as shown in Figure 1. Instead of relying on passive and predefined update rules, the proposed framework actively leverages the current search region to guide template-related decisions. In this way, template update becomes a context-aware process that is dynamically adjusted according to the current target state and scene characteristics.
Specifically, we design a Guided Template Selection Transformer (GTST) that formulates template selection as a search region-conditioned attention matching process. This enables the tracker to dynamically select the most relevant template according to the current target state, achieving precise alignment between the template and the search region. Unlike the CSIL [16] framework designed based on the center-to-surrounding interaction structure, the GTST proposed in this paper is built on the Vision Transformer architecture, focusing on the multi-modal UAV tracking task and improving tracking robustness through an adaptive template selection mechanism. Furthermore, we introduce a Dynamic Threshold Module (DTM) to adaptively adjust template selection criteria by modeling scene characteristics and tracking confidence, thereby overcoming the limitations of fixed thresholds. In addition, we develop a Dynamic Template Memory Module (DTMM) to maintain an ordered and structured template pool, supporting efficient template storage, update, and replacement while avoiding redundancy and disorder. The three components work collaboratively: GTST improves template–target matching, DTM ensures reliable candidate template generation, and DTMM provides a structured memory basis for robust adaptive template update.
Extensive experiments on a standard multi-modal UAV tracking benchmark, i.e., VTUAV [4], demonstrate that the proposed method consistently outperforms state-of-the-art approaches. These results verify the effectiveness of the proposed framework in UAV-specific scenarios characterized by small targets, fast motion, and viewpoint variation. Moreover, our method generalizes well to other multi-modal tracking datasets from diverse surveillance scenarios, including RGBT210 [17], RGBT234 [18], and LasHeR [19], validating its effectiveness and adaptability.
The contributions of this paper are summarized as follows:
  • We propose a search region-guided adaptive template update framework that enables fully adaptive template selection, update, and memory management for multi-modal UAV tracking.
  • We design a Guided Template Selection Transformer (GTST) to dynamically match templates conditioned on the current search region, improving target representation and matching accuracy.
  • We introduce a Dynamic Threshold Module (DTM) to adaptively control template selection across diverse scenarios, enhancing robustness.
  • We develop a Dynamic Template Memory Module (DTMM) to construct a structured and high-quality template pool, alleviating redundancy and improving efficiency.
  • Extensive experiments on four benchmark datasets demonstrate that our tracker achieves state-of-the-art performance.

2. Method

In this section, we present the proposed multi-modal UAV tracking framework, termed GTUTrack, which incorporates a search region-guided adaptive template update mechanism. We first introduce the overall architecture and data flow of GTUTrack. We then describe the proposed Guided Template Selection Transformer, Dynamic Threshold Module, and Dynamic Template Memory Module in detail. Finally, we present the loss functions used for model optimization.

2.1. Overall Framework

The overall architecture of GTUTrack is illustrated in Figure 2. The model takes as input an initial template pair, a dynamic template pair, and a search region pair. The overall tracking architecture follows the baseline tracker OSTrack [20]. During tracking, the Dynamic Threshold Module (DTM) adaptively determines the confidence threshold for template update according to the current tracking scenario. When the tracking confidence score of the current search region exceeds the corresponding update threshold, the search region is cropped into template format and stored in the Dynamic Template Memory Module (DTMM), which consists of Historical Template Memory, Candidate Template Memory, and Fixed Template Memory.
Specifically, the Historical Template Memory stores template pairs that have been selected as dynamic templates so as to accumulate reliable historical target appearance variations. The Candidate Template Memory stores candidate templates obtained by cropping reliable search region pairs. The Fixed Template Memory preserves the unique initial template pair as a stable fallback throughout tracking. Subsequently, the Guided Template Selection Transformer (GTST) takes the templates stored in the DTMM together with the current search region pair as input, and selects the most suitable dynamic template pair from the Historical and Candidate Template Memories for object tracking.

2.2. Guided Template Selection Transformer

The Guided Template Selection Transformer is built upon a Vision Transformer architecture and is formulated as a template selection module, as illustrated in Figure 2. Its goal is to select the template that yields the best tracking performance for the current search region, thereby improving the robustness of multi-modal UAV tracking. By learning the matching relationship between candidate templates and the current search region, the module establishes a mapping from input features to template quality, enabling adaptive template selection.
The training process of the GTST is shown in Figure 3. Given a search region pair $S = \{I_x^r, I_x^t\}$ containing both visible and thermal modalities, and a candidate template set $T = \{T_1, T_2, \ldots, T_K\}$, where $T_i = \{I_z^r, I_z^t\}$ denotes the $i$-th multi-modal template pair, the tracking performance of each template–search pair is evaluated independently: each pair $(T_i, S)$ is fed into the tracker to obtain the corresponding Intersection over Union (IoU) score $O_i = \mathrm{IoU}(T_i, S)$. This score directly reflects the tracking quality of template $T_i$ on the current search region. The IoU scores of all candidate templates are organized into a vector $O = [O_1, O_2, \ldots, O_K]$, and the optimal template index is determined as:
$$i^* = \arg\max_{i \in \{1, \ldots, K\}} O_i.$$
The optimal index is then converted into a one-hot label vector $y \in \{0, 1\}^K$, in which the entry corresponding to the optimal template is set to 1 and all others are set to 0. Notably, the number of candidate templates remains fixed throughout training and inference, since the dimension of the classification head is predefined and cannot adapt to variable template pool sizes.
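The label construction described above can be illustrated with a minimal Python sketch. It assumes that each candidate template yields one predicted box on the search region and that the IoU score is measured against an annotated box; the corner box format `(x1, y1, x2, y2)` and the names `iou` and `one_hot_label` are illustrative assumptions rather than details given in the paper.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_hot_label(predicted: Sequence[Box], annotated: Box) -> List[int]:
    """Build y in {0,1}^K with a single 1 at i* = argmax_i O_i.

    predicted[i] is the box obtained when template T_i is used on the
    current search region (hypothetical input for this sketch).
    """
    scores = [iou(p, annotated) for p in predicted]
    # Ties are broken in favour of the lowest index (an assumption).
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(scores))]
```

In this sketch the label always has exactly one positive entry, which matches the fixed-size classification head described above.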
The GTST adopts a standard Vision Transformer architecture. In practical implementation, we traverse all candidates in a loop and process each template–search pair separately. For each candidate template $T_i$ and search region $S$, their multi-modal data are concatenated to form an input sample $X_i = \mathrm{Concat}(T_i, S)$. Each sample $X_i$ is independently sent into the network for self-attention and feed-forward feature extraction following the standard image classification design, and all candidate samples $\{X_1, X_2, \ldots, X_K\}$ are processed in a batch. The model maps each input sample $X_i$ to a classification logit vector $z_i = f_\theta(X_i)$, where $f_\theta$ denotes the Transformer classifier parameterized by $\theta$. All $K$ independent scalar logits are assembled into a unified logit vector, and the assembled logits are then passed through a Softmax function to obtain the classification probabilities:
$$\hat{y}_i = \mathrm{Softmax}(z_i) = \left[ \frac{e^{z_{i,1}}}{\sum_{k=1}^{K} e^{z_{i,k}}}, \ldots, \frac{e^{z_{i,K}}}{\sum_{k=1}^{K} e^{z_{i,k}}} \right].$$
This probability distribution reflects the model’s confidence that a candidate template $T_i$ is the optimal template for the current search region. During inference, the template with the highest predicted confidence, $T_{\hat{i}}$, is selected and fed into the tracker for subsequent target localization, enabling adaptive template selection during tracking.
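The inference-time selection step can be sketched as follows. The sketch assumes that the selector has already produced one scalar logit per candidate template, assembled into a list of length K; the function names and the max-subtraction used for numerical stability are assumptions of this illustration, not details stated in the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Softmax over the K assembled logits (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_template(logits: Sequence[float]) -> int:
    """Return the index of the candidate with the highest predicted confidence."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

Because Softmax is monotonic, the selected index equals the index of the largest logit; the probabilities are only needed when the confidence value itself is inspected.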

2.3. Dynamic Threshold Module

The Dynamic Threshold Module is a key component of GTUTrack, designed to generate adaptive thresholds for template update under different tracking scenarios. This module addresses the limitations of fixed thresholds, which often fail to generalize across complex tracking conditions and may either suppress effective updates or introduce excessive noisy templates.
The prediction head of the tracker produces a confidence response map $F_{score\_map} \in \mathbb{R}^{H \times W}$ for the input search region, where $H$ and $W$ denote the spatial dimensions of the feature map. Each element in this map represents the confidence that the corresponding location belongs to the target, and the maximum response is taken as the core tracking confidence score of the current frame. This score reflects the reliability of matching the current template to the target. Since different tracking sequences may exhibit substantially different scene characteristics, including target appearance complexity, background interference, and illumination variation, a fixed threshold is often inappropriate. An overly high threshold may prevent valid template updates in low-confidence but reliable sequences, whereas an overly low threshold may introduce noisy templates and cause tracking drift.
To address this issue, we adopt a robust statistical strategy based on the truncated mean to estimate a personalized template update threshold for each tracking sequence. By removing extreme values, this strategy reduces the influence of outlier frames and improves both robustness and scene adaptability.
For a given sequence, the dynamic threshold is computed from the confidence scores of the first $n$ frames. Specifically, the confidence scores of the first $n$ frames are collected into a set $S = \{s_1, s_2, \ldots, s_n\}$, where $s_t$ denotes the highest confidence score of the $t$-th frame. We then remove the largest $m$ and smallest $m$ elements from $S$ and retain the remaining $n - 2m$ valid scores in the set $S_{\mathrm{valid}}$. The dynamic threshold $\theta$ for template update is computed as the mean of the valid scores:
$$\theta = \frac{1}{|S_{\mathrm{valid}}|} \sum_{s \in S_{\mathrm{valid}}} s.$$
In our implementation, we set $n = 20$ and $m = 2$. We further set a fixed initial baseline threshold of 0.7. If the statistically calculated dynamic threshold $\theta$ is higher than 0.7, we adopt $\theta$ as the final update threshold; otherwise, we keep the initial threshold of 0.7. Importantly, the threshold is calculated only once from the early-stage frames and is not updated again during the rest of the tracking process. The resulting final threshold $\theta$ is used as the criterion for template update in the current sequence. When the core tracking confidence score $F_s$ of a subsequent frame exceeds $\theta$, the corresponding search region is cropped into a candidate template and stored in memory. Otherwise, template update is suspended to avoid introducing unreliable templates.
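The threshold computation described above reduces to a truncated mean followed by a comparison with the baseline. A minimal Python sketch is given below, assuming the per-frame confidence scores are available as a plain list; the function names `dynamic_threshold` and `should_update` are hypothetical and chosen only for this illustration.

```python
from typing import Sequence


def dynamic_threshold(scores: Sequence[float], n: int = 20, m: int = 2,
                      baseline: float = 0.7) -> float:
    """Truncated-mean update threshold from the first n confidence scores.

    The m largest and m smallest scores are discarded, the remaining
    n - 2m scores are averaged, and the result is used only if it is
    higher than the baseline.
    """
    if len(scores) < n:
        raise ValueError("need at least n confidence scores")
    if n <= 2 * m:
        raise ValueError("n must be larger than 2 * m")
    window = sorted(scores[:n])
    valid = window[m:n - m]           # drop m lowest and m highest
    theta = sum(valid) / len(valid)
    return max(theta, baseline)       # keep the baseline unless theta is higher


def should_update(frame_score: float, theta: float) -> bool:
    """A frame yields a candidate template only if its score exceeds theta."""
    return frame_score > theta
```

For example, a sequence whose first 20 scores are all 0.5 keeps the baseline of 0.7, while a sequence with consistently high early scores receives a stricter threshold.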
This design is motivated by two considerations. First, the truncated mean is a classical robust statistical estimator with strong resistance to outliers, such as abnormally high scores caused by occasional mismatches or low scores caused by temporary occlusion, thus providing a more reliable estimate of the baseline confidence level of the sequence. Second, the first n frames usually capture the initial scene characteristics of the sequence, and the threshold estimated from this stage can better reflect the intrinsic difficulty of the sequence. For example, low-texture targets tend to yield lower baseline confidence, whereas high-contrast targets generally produce higher-confidence responses. In this way, the proposed DTM effectively overcomes the poor generalization of fixed thresholds.
Notably, the DTM only requires lightweight statistical computation on the first n frames and introduces neither additional trainable parameters nor extra optimization objectives. It therefore achieves a favorable balance between computational efficiency and scene adaptability. For sequences with simple backgrounds and stable target motion, the threshold is automatically increased to avoid redundant updates. In contrast, for challenging sequences with cluttered backgrounds and overall lower confidence, the threshold is automatically reduced to preserve valid template updates. This fully automatic and personalized threshold generation strategy provides a reliable foundation for adaptive template updates.

2.4. Dynamic Template Memory Module

The Dynamic Template Memory Module is the core component responsible for storing and managing the lifecycle of templates in GTUTrack. It is implemented using a double-ended queue (Deque), which supports efficient insertion and deletion operations with $O(1)$ time complexity. This design enables efficient template storage, update, and retrieval while satisfying the real-time requirements of UAV tracking. To control memory usage and maintain retrieval efficiency, the module adopts a fixed-capacity queue design. The DTMM consists of three memory components, each serving a distinct role in template storage and update.
(1) Fixed Template Memory
The Fixed Template Memory provides the fundamental reference throughout the entire tracking process. Its role is to store the initial template pair $T_{\mathrm{init}} = \{T_{\mathrm{init}}^{rgb}, T_{\mathrm{init}}^{tir}\}$, which contains both visible and thermal modalities. This memory has a fixed capacity of one and remains unchanged during tracking, i.e., no update, replacement, or eviction is performed.
The main advantage of this design is that it provides a stable tracking anchor. Since the initial template preserves the original target appearance, it can serve as a reliable fallback when dynamic templates become unreliable due to occlusion, abrupt appearance variation, or severe tracking drift. In such cases, the fixed template helps prevent catastrophic tracking failure by maintaining a consistent reference for target localization.
(2) Candidate Template Memory
The Candidate Template Memory serves as a dynamic pool for accumulating valid candidate templates during tracking. After each frame, if the core tracking confidence score $F_s$ exceeds the dynamic threshold $\theta$, the current search region is regarded as reliable and is assumed to contain valid target information. The search region is then cropped according to the predicted target center and resized into template format, forming a candidate template $T_{ct} = \{T_{ct}^{rgb}, T_{ct}^{tir}\}$, which is appended to the tail of the candidate memory.
To maintain both efficiency and relevance, the Candidate Template Memory adopts a first-in-first-out (FIFO) strategy with a fixed capacity. When the memory is full, the oldest template is removed from the head of the queue. As a result, the memory always maintains a set of recent candidate templates $T_{ct} = \{T_{ct}^{1}, T_{ct}^{2}, \ldots, T_{ct}^{N}\}$, where $N$ denotes the memory capacity. This design allows the module to continuously capture target appearance variations, including pose, scale, and viewpoint changes, thereby providing up-to-date candidates for adaptive template selection and alleviating the adverse effect of outdated templates.
(3) Historical Template Memory
The Historical Template Memory acts as a repository of high-quality templates. Templates stored in this memory are those selected as optimal by the Guided Template Selection Transformer and therefore exhibit strong matching reliability and robustness.
During tracking, if the GTST selects an optimal template $T_{\mathrm{best},t}$ from the candidate memory, this template is further inserted into the historical memory. Similar to the candidate memory, the historical memory is implemented as a fixed-capacity FIFO queue, maintaining a template set $T_{ht} = \{T_{ht}^{1}, T_{ht}^{2}, \ldots, T_{ht}^{M}\}$, where $M$ denotes the memory capacity.
The main advantage of the Historical Template Memory lies in its ability to provide reliable fallback templates. Since all stored templates have been validated through previous matching performance, they can robustly represent the target appearance across different stages of tracking. Even when the candidate memory is contaminated by low-quality templates caused by sudden scene changes, the historical memory can still provide reliable alternatives, thereby reducing the risk of failure due to erroneous template selection.
Moreover, the interaction between the Candidate Template Memory and the Historical Template Memory forms a positive feedback mechanism. The candidate memory captures recent appearance variations of the target, while high-quality templates selected through matching are preserved in the historical memory. This design ensures both adaptability and stability during tracking.
During inference, all components of the DTMM are initialized with the initial template. The capacities of the fixed, candidate, and historical template memories are set to 1, 5, and 2, respectively, giving a total capacity of 8, which matches the output dimension of the GTST. During the first 20 frames, the initial threshold defined in the DTM decides whether a dynamic template is inserted into the candidate template memory. In addition, a template that is already present in the historical template memory is not stored again.
Overall, the collaboration of the three memory components enables the DTMM to balance adaptability and reliability. The fixed memory provides a stable reference, the candidate memory captures real-time appearance variations, and the historical memory preserves reliable high-quality templates. Together, they form a structured and hierarchical template memory system that supports robust and accurate template selection.
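The three memories can be illustrated with a short Python sketch built on `collections.deque`, whose `maxlen` argument provides the fixed-capacity FIFO behaviour. Two points are assumptions of this sketch rather than statements from the paper: both queues are padded with the initial template up to their capacity so that the pool always exposes 8 entries, and templates are represented by simple hashable identifiers so that duplicates can be detected with `in`. The class and method names are hypothetical.

```python
from collections import deque
from typing import Any, Deque, List


class DynamicTemplateMemory:
    """Fixed (1) + candidate (5) + historical (2) template memories."""

    def __init__(self, initial: Any, n_candidates: int = 5,
                 n_historical: int = 2) -> None:
        self.fixed = initial  # never updated, replaced, or evicted
        # Assumption: pad both queues with the initial template.
        self.candidates: Deque[Any] = deque([initial] * n_candidates,
                                            maxlen=n_candidates)
        self.historical: Deque[Any] = deque([initial] * n_historical,
                                            maxlen=n_historical)

    def add_candidate(self, template: Any, score: float,
                      threshold: float) -> bool:
        """Append a cropped search region if its confidence exceeds the threshold."""
        if score > threshold:
            self.candidates.append(template)  # oldest entry drops out when full
            return True
        return False

    def promote(self, template: Any) -> bool:
        """Store a template chosen by the selector, skipping duplicates."""
        if template in self.historical:
            return False
        self.historical.append(template)
        return True

    def pool(self) -> List[Any]:
        """All templates offered to the selector, in a fixed order."""
        return [self.fixed, *self.candidates, *self.historical]
```

With real image tensors, the duplicate check would need an explicit identifier (for example a frame index) instead of direct equality on the template data.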

2.5. Loss Function

Given the backbone-extracted search region tokens from both modalities, we first perform element-wise fusion between the RGB and thermal tokens, and then reshape the fused tokens into a 2D spatial feature map. This fused representation effectively integrates complementary cross-modal information for subsequent prediction.
The resulting feature map is then fed into a series of Conv-BN-ReLU layers to produce three prediction branches, including the target classification score map, local offset prediction, and normalized bounding box size prediction.
Following OSTrack [20], the tracking head is optimized using both classification and regression losses. The overall tracking loss is formulated as:
$$L_{tracking} = L_{cls} + \lambda_{giou} L_{giou} + \lambda_{L_1} L_1,$$
where $L_{cls}$ denotes the weighted Focal Loss for classification, $L_{giou}$ denotes the Generalized IoU loss, and $L_1$ denotes the $L_1$ loss. The balancing coefficients are set to $\lambda_{giou} = 2$ and $\lambda_{L_1} = 5$.
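To make the weighting concrete, the following Python sketch combines the three terms for a single predicted box. It is only an illustration under stated assumptions: boxes use the corner format `(x1, y1, x2, y2)`, the $L_1$ term is averaged over the four coordinates, the classification loss is passed in as an already computed value (the weighted Focal Loss is not reproduced here), and degenerate boxes are assigned a loss of 1 by convention.

```python
from typing import Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def _area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou_loss(pred: Box, target: Box) -> float:
    """1 - GIoU, where GIoU = IoU - (|C| - |A u B|) / |C| and C encloses both boxes."""
    iw = min(pred[2], target[2]) - max(pred[0], target[0])
    ih = min(pred[3], target[3]) - max(pred[1], target[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = _area(pred) + _area(target) - inter
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    enclose = cw * ch
    if union <= 0.0 or enclose <= 0.0:
        return 1.0  # convention of this sketch for degenerate boxes
    return 1.0 - (inter / union - (enclose - union) / enclose)


def l1_loss(pred: Box, target: Box) -> float:
    """Mean absolute coordinate difference (averaging is an assumption)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 4.0


def tracking_loss(cls_loss: float, pred: Box, target: Box,
                  lambda_giou: float = 2.0, lambda_l1: float = 5.0) -> float:
    """L_tracking = L_cls + lambda_giou * L_giou + lambda_L1 * L_1."""
    return (cls_loss + lambda_giou * giou_loss(pred, target)
            + lambda_l1 * l1_loss(pred, target))
```

When the predicted and target boxes coincide, both regression terms vanish and the total reduces to the classification loss.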
For the Guided Template Selection Transformer, which is formulated as a template classification task, we adopt the Binary Cross-Entropy (BCE) loss for optimization.
Specifically, the input to the selector is defined as the concatenated representation of a candidate template and the current search region, denoted as $X_i = \mathrm{Concat}(T_i, S)$, where $T_i$ is the $i$-th candidate template and $S$ is the current search region. The model outputs a probability $\hat{p}_i \in [0, 1]$ indicating the likelihood that $T_i$ is the optimal template. The ground-truth label $p_i$ is defined using one-hot encoding, where the optimal template is assigned 1 and all others are assigned 0.
The selector loss is formulated as:
$$L_{selector} = -\frac{1}{K} \sum_{i=1}^{K} \left[ (K - 1) \cdot p_i \log(\hat{p}_i) + 1 \cdot (1 - p_i) \log(1 - \hat{p}_i) \right],$$
where $K$ is the number of candidate templates, $p_i$ is the ground-truth label, and $\hat{p}_i$ is the predicted probability. Specifically, the positive sample is assigned a weight of $K - 1$, while each negative sample is assigned a weight of 1. In our experiments, we set $K = 8$.
The BCE loss encourages the selector to assign higher confidence to the optimal template while suppressing non-optimal templates. In particular, it penalizes low predicted probabilities for positive samples and high predicted probabilities for negative samples, thereby enabling the model to learn discriminative matching features between candidate templates and the current search region.
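The weighted BCE objective can be written out directly. The sketch below follows the selector loss as defined above, with the positive sample weighted by K - 1 and each negative sample by 1; the clamping constant `eps`, which avoids taking the logarithm of zero, and the function name are assumptions of this illustration.

```python
import math
from typing import Sequence


def selector_loss(probs: Sequence[float], labels: Sequence[int],
                  eps: float = 1e-12) -> float:
    """Weighted binary cross-entropy over K candidate templates.

    probs[i]  : predicted probability that template i is optimal
    labels[i] : one-hot ground truth (1 for the optimal template, else 0)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    k = len(probs)
    total = 0.0
    for p_hat, p in zip(probs, labels):
        p_hat = min(max(p_hat, eps), 1.0 - eps)  # keep log() finite
        total += (k - 1) * p * math.log(p_hat) + (1 - p) * math.log(1.0 - p_hat)
    return -total / k
```

With one positive and K - 1 negatives, the K - 1 weight gives the single positive sample the same total weight as all negatives combined, which counteracts the class imbalance of a one-hot target.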
To jointly optimize target localization and adaptive template selection, we combine the tracking loss and the selector loss into a unified multi-task objective. The overall loss function of GTUTrack is defined as:
$$L_{total} = L_{tracking} + L_{selector}.$$
By jointly optimizing these two objectives, the proposed framework simultaneously enhances tracking accuracy and template selection reliability. The tracking loss ensures precise target localization and bounding box regression, while the selector loss improves the ability of the Guided Template Selection Transformer to identify the most suitable template for the current search region. As a result, the tracking network and the template selector can be trained in an end-to-end manner and mutually reinforce each other during optimization.

2.6. Differences from Existing Methods

Unlike existing RGBT tracking methods [8,9,21], which mainly regard multimodal feature fusion as their core contribution and focus on cross-modal alignment and interaction, we focus on the adaptive optimization of template resources. Motivated by advanced template updating schemes [22], we move beyond the fusion-dominated paradigm and explore a more effective template updating strategy. In contrast to most passive template updating methods [23,24], which rely on fixed rules and refresh templates mechanically according to preset thresholds, we propose an active template updating strategy. Specifically, the search region dynamically perceives target-state changes and guides the selection of the most suitable dynamic templates during online tracking. In this way, we can effectively filter out invalid interference templates and enhance the discriminability of target templates in complex scenarios.

3. Experiments

In this section, we first introduce the experimental settings and implementation details of the proposed GTUTrack. We then compare GTUTrack with a broad range of state-of-the-art multi-modal tracking methods on multiple benchmarks. Finally, extensive ablation studies and qualitative analyses are conducted to validate the effectiveness of each component and to further investigate the behavior of the proposed search region-guided adaptive template update framework.

3.1. Experimental Settings

3.1.1. Evaluation Datasets

To comprehensively evaluate the effectiveness and generalization capability of GTUTrack, we conduct experiments on four public multi-modal tracking benchmarks, including one UAV-oriented benchmark and three surveillance-oriented benchmarks.
VTUAV [4] is collected in unmanned aerial vehicle (UAV) scenarios. It focuses on UAV-specific tracking challenges, such as small target size, fast motion, viewpoint variation, scale change, and cluttered backgrounds. The dataset contains 500 aligned RGBT (RGB–Thermal) video pairs with approximately 1.7 million frames in total. It is divided into a training set and a test set, each containing 250 sequences. Moreover, VTUAV is further divided into long-term and short-term tracking subsets. Since our work focuses on short-term tracking, all training and evaluation are conducted on the short-term subset. VTUAV serves as the primary benchmark in our experiments because it directly corresponds to the target task of multi-modal UAV tracking.
RGBT210 [17] is a widely used RGBT tracking benchmark consisting of 210 video pairs with approximately 104,700 frames. It is divided into 12 subsets according to different attributes, enabling detailed analysis of tracker performance under diverse conditions. However, the alignment between RGB and thermal image pairs in this dataset is not sufficiently accurate, making it more challenging for methods that rely heavily on precise cross-modal correspondence.
RGBT234 [18] contains 234 RGBT video pairs with approximately 116,700 frames. Compared with RGBT210, this dataset provides more accurate alignment between RGB and thermal modalities as well as more precise target annotations. It is therefore suitable for evaluating the robustness of multi-modal tracking methods under relatively well-aligned surveillance scenarios.
LasHeR [19] is currently the largest RGBT tracking dataset for surveillance scenarios. It contains 1224 aligned RGBT video sequences with approximately 734,800 frames, including 979 training sequences and 245 test sequences. The dataset covers a wide range of challenging attributes and can be divided into 19 subsets, including occlusion, illumination variation, low resolution, deformation, and background clutter. Due to its large scale and rich challenge diversity, LasHeR provides a comprehensive benchmark for assessing the robustness and generalization ability of tracking methods.
Overall, these four datasets provide complementary evaluation perspectives. VTUAV is used to verify the effectiveness of the proposed method in UAV-specific scenarios, while RGBT210, RGBT234, and LasHeR are adopted to assess the generalization capability of GTUTrack in representative surveillance scenarios.

3.1.2. Evaluation Metrics

Following previous methods [8], we adopt precision rate (PR), success rate (SR), and normalized precision rate (NPR) as the quantitative evaluation metrics under the one-pass evaluation (OPE) protocol.
PR measures the percentage of frames in which the Euclidean distance between the center of the predicted bounding box and that of the ground-truth bounding box is smaller than a threshold τ . It mainly evaluates the localization accuracy of the tracker. A higher PR indicates that the tracker can more accurately estimate the target center across frames.
SR measures the percentage of frames in which the intersection over union (IoU) between the predicted bounding box and the ground-truth bounding box is larger than a threshold δ . It mainly evaluates the ability of the tracker to estimate target scale and spatial extent. By varying δ , we obtain the success curve, and the representative SR score is computed as the area under the curve.
NPR is a normalized version of PR that alleviates the influence of image resolution and target size. By varying the normalization threshold, a normalized precision curve is obtained, and the area under the curve within $[0, 0.5]$ is used as the representative NPR score. This metric is especially useful when the target size varies substantially across different sequences and datasets.
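The three metrics can be sketched as follows for boxes given as `(x, y, w, h)`. This is a minimal illustration rather than an official evaluation toolkit: the default threshold `tau=20.0` pixels, the number of sampled thresholds, and the choice to normalize the center error by the ground-truth width and height are assumptions of this sketch, and all function names are hypothetical.

```python
import math

def _center_offset(pred, gt):
    """Offset (dx, dy) between the centers of two (x, y, w, h) boxes."""
    dx = (pred[0] + pred[2] / 2.0) - (gt[0] + gt[2] / 2.0)
    dy = (pred[1] + pred[3] / 2.0) - (gt[1] + gt[3] / 2.0)
    return dx, dy

def iou(pred, gt):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision_rate(preds, gts, tau=20.0):
    """Fraction of frames whose center error is smaller than tau pixels."""
    errors = [math.hypot(*_center_offset(p, g)) for p, g in zip(preds, gts)]
    return sum(e < tau for e in errors) / len(errors)

def success_rate(preds, gts, steps=21):
    """Area under the success curve, sampled at IoU thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    curve = [sum(o > d for o in overlaps) / len(overlaps) for d in thresholds]
    return sum(curve) / len(curve)

def normalized_precision_rate(preds, gts, steps=51, max_threshold=0.5):
    """Area under the normalized precision curve within [0, max_threshold]."""
    errors = []
    for p, g in zip(preds, gts):
        dx, dy = _center_offset(p, g)
        # Assumed normalization: divide each axis by the ground-truth box size.
        errors.append(math.hypot(dx / g[2], dy / g[3]))
    thresholds = [max_threshold * i / (steps - 1) for i in range(steps)]
    curve = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    return sum(curve) / len(curve)

gts = [(0, 0, 10, 10)]
preds = [(5, 0, 10, 10)]  # shifted by half the box width
print(precision_rate(preds, gts), success_rate(preds, gts))
```

Because PR uses an absolute pixel threshold, the same 5-pixel offset counts as accurate for any target size, whereas NPR treats it as a large error for a 10-pixel-wide target; this is why NPR is the more informative metric when target sizes vary.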

3.2. Implementation Details

GTUTrack is implemented based on the OSTrack framework, and the overall tracking architecture follows the standard template-search matching pipeline. During tracking, the initial template pair is fixed as the anchor reference, while dynamic templates are updated online through the proposed search region-guided adaptive template update strategy.
For the Guided Template Selection Transformer (GTST), we use a batch size of 32 for training and adopt the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$. A cosine annealing schedule is employed to ensure stable convergence. The selector is trained to estimate the relative quality of candidate templates with respect to the current search region, so that it can provide reliable template selection during inference.
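For reference, a standard cosine annealing curve can be sketched as below. The initial learning rate is read as 1e-4; the total number of steps and the minimum learning rate are not reported here, so `total_steps` and `min_lr` are placeholder assumptions, and `cosine_annealing_lr` is a hypothetical helper rather than a call into any training framework.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` under a single-cycle cosine annealing schedule."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay on the curve's ends.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The rate starts at base_lr, passes through the midpoint, and decays to min_lr.
for step in (0, 50, 100):
    print(step, cosine_annealing_lr(step, total_steps=100))
```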
To alleviate the severe class imbalance in template selection, where only one template is optimal and the remaining $K-1$ templates are non-optimal, we assign adaptive weights to different samples during training. Specifically, the positive sample (i.e., the optimal template) is assigned a weight of $K-1$, while each negative sample is assigned a weight of 1. This strategy balances their contributions during optimization and improves the discriminative ability of the selector.
For the Dynamic Threshold Module (DTM), the update threshold of each sequence is estimated from the confidence statistics of the first several frames using the truncated mean strategy. This design introduces no additional trainable parameters and incurs negligible computational overhead. For the Dynamic Template Memory Module (DTMM), the fixed, candidate, and historical template memories are maintained online during tracking to support adaptive template selection and update. All experiments are conducted under the same evaluation protocol as the compared methods to ensure fair comparison.

3.3. Quantitative Comparison

We first evaluate GTUTrack on VTUAV, a standard benchmark for multi-modal UAV tracking, to verify its effectiveness in UAV-specific scenarios. We then further evaluate the proposed method on three additional multi-modal tracking benchmarks collected in typical surveillance scenarios, including RGBT210, RGBT234, and LasHeR, in order to validate its generalization capability. The overall comparison results are summarized in Table 1, where the best, second-best, and third-best results are highlighted in bold, underline, and italic, respectively.
A clear observation from Table 1 is that GTUTrack consistently achieves the best performance across all four datasets. Compared with recent strong baselines, including CAFormer, CGATrack, and TATrack, the proposed method yields consistent gains in both localization and overlap-based metrics. These results indicate that the proposed search region-guided adaptive template update mechanism improves not only target localization accuracy, but also the stability of template matching under complex appearance variation.
(1) Evaluation on VTUAV.
We first evaluate GTUTrack on VTUAV, which is specifically designed for multi-modal UAV tracking and contains challenges such as small target size, fast motion, significant viewpoint variation, and cluttered backgrounds. As shown in Table 1, GTUTrack achieves 91.4% in PR and 78.4% in SR, significantly outperforming all competing methods. Compared with CGATrack, GTUTrack improves PR and SR by 2.4% and 1.8%, respectively. Compared with CAFormer, the gains further increase to 2.8% in PR and 2.2% in SR. These improvements are particularly meaningful on VTUAV, where the target appearance often changes abruptly due to UAV motion and viewpoint variation.
The strong performance on VTUAV demonstrates that GTUTrack is well suited to multi-modal UAV tracking. On the one hand, GTST can adaptively select the template that best matches the current search region, which is crucial when the target undergoes a rapid appearance change. On the other hand, DTM prevents unreliable search regions from being updated into the template pool, while DTMM preserves both stable and high-quality historical target representations. As a result, the tracker can better maintain robust target representation throughout the UAV tracking process.
(2) Generalization on Surveillance Benchmarks.
To further validate the generalization capability of GTUTrack beyond UAV scenarios, we additionally evaluate it on three widely used RGBT tracking benchmarks from typical surveillance scenarios, namely RGBT210, RGBT234, and LasHeR.
On RGBT210, GTUTrack achieves the best performance on both PR and SR, reaching 90.9% and 66.3%, respectively. Compared with the second-best method CGATrack, GTUTrack improves PR and SR by 3.1% and 2.0%, respectively. It also surpasses CAFormer by 5.3% and 3.1% in PR and SR, respectively, and outperforms QAT by 4.1% in PR and 4.4% in SR. Since RGBT210 contains imperfectly aligned RGB and thermal pairs, these gains suggest that the proposed adaptive template selection mechanism can effectively alleviate the impact of cross-modal inconsistency and noisy template updates.
On RGBT234, GTUTrack achieves 92.0% in PR and 68.8% in SR, consistently outperforming all competing methods. Compared with CGATrack, GTUTrack achieves gains of 3.0% in PR and 2.2% in SR. Compared with CAFormer and USTrack, GTUTrack also shows clear improvements. Since RGBT234 provides better cross-modal alignment and more accurate annotations, the results on this dataset further verify that the proposed framework can effectively exploit multi-modal complementary information while maintaining accurate and adaptive template update.
On LasHeR, a large-scale and highly challenging benchmark, GTUTrack achieves the best performance across all three metrics, reaching 75.0% in PR, 71.1% in NPR, and 59.3% in SR. Compared with CGATrack, GTUTrack achieves improvements of 2.8%, 2.8%, and 1.8% in PR, NPR, and SR, respectively. Compared with TATrack and BAT, GTUTrack also yields consistent gains. Because LasHeR contains diverse challenges such as occlusion, low resolution, deformation, and background clutter, these results indicate that the proposed adaptive template memory and selection mechanism can effectively handle substantial appearance variation over long tracking sequences.
Overall, the quantitative results on VTUAV verify the effectiveness of GTUTrack for multi-modal UAV tracking, while the consistent improvements on RGBT210, RGBT234, and LasHeR demonstrate its strong generalization capability across typical surveillance scenarios. More importantly, the superiority of GTUTrack across both UAV and surveillance benchmarks suggests that search region-guided adaptive template update is a generally effective solution for improving multi-modal tracking robustness under dynamic scene variation. Meanwhile, as shown in Table 2, GTUTrack maintains favorable efficiency with 182.52M parameters, 76.36G FLOPs, and 62 FPS, offering a good balance between accuracy and speed for real-time UAV tracking.

3.4. Ablation Study

(1) Component Analysis.
We conduct incremental ablation experiments to evaluate the contribution of each component in GTUTrack. As shown in Table 3, the baseline tracker (OSTrack+RGBT) achieves 67.8%/64.3%/54.0% in PR/NPR/SR on LasHeR and 86.4%/64.5% in PR/SR on RGBT234. After introducing GTST, the performance increases to 72.5%/68.5%/57.5% on LasHeR and 88.8%/66.2% on RGBT234. This corresponds to gains of 4.7% in PR and 3.5% in NPR on LasHeR, as well as 2.4% in PR on RGBT234, demonstrating that search region-guided adaptive template selection plays a dominant role in improving tracking performance.
After further incorporating the Dynamic Threshold Module (DTM), the performance improves to 73.3%/69.1%/58.2% on LasHeR and 89.8%/67.0% on RGBT234. These gains indicate that adaptive thresholding can effectively filter unreliable candidate templates and reduce the risk of noisy template updates. Compared with fixed update strategies, DTM provides sequence-specific update criteria, which are better suited to varying scene conditions.
When the Historical Template Memory (HTM) is added, the performance further increases to 74.4%/70.8%/59.0% on LasHeR and 91.5%/68.4% on RGBT234. This result shows that preserving high-quality historical templates is beneficial for maintaining stable target representation, especially when recent candidate templates are affected by temporary noise or local appearance degradation.
Finally, by introducing the Candidate Template Memory (CTM), the complete GTUTrack achieves the best performance across all metrics, i.e., 75.0%/71.1%/59.3% on LasHeR and 92.0%/68.8% on RGBT234. This final improvement demonstrates that recent candidate templates and reliable historical templates are complementary. Together, they provide both adaptability to current appearance variation and stability against noisy updates. Overall, the ablation results confirm that GTST, DTM, and DTMM each make a positive contribution, and their combination yields the strongest performance.
(2) Threshold Analysis.
We further analyze the impact of different template update thresholds. As shown in Table 4, fixed thresholds lead to noticeable performance fluctuations. On LasHeR, the best fixed-threshold result is obtained at 0.75, with 73.3% PR, 69.0% NPR, and 58.3% SR. However, when the threshold is increased to 0.80 and 0.85, the performance drops consistently. A similar trend can be observed on RGBT234, where the best fixed-threshold setting still underperforms the proposed adaptive strategy.
In contrast, the proposed DTM consistently outperforms all fixed-threshold settings. Compared with the best fixed threshold (0.75), DTM improves PR/NPR/SR by 1.7%/2.1%/1.0% on LasHeR and improves PR/SR by 2.2%/1.9% on RGBT234. These results demonstrate that adaptive thresholding can better balance template quality and update frequency. Instead of relying on a globally fixed rule, DTM adjusts the update criterion according to the confidence statistics of each sequence, thereby producing more reliable candidate templates and improving overall tracking robustness.
We further perform ablation analysis on the initial threshold of the DTM module. As shown in Experiment 5, tracking performance declines significantly without adopting an initial threshold, where the template update threshold is entirely determined by the confidence scores of the first 20 frames within each tracking sequence. This phenomenon mainly arises because the calculated threshold tends to be unreliable when the early frames suffer from complex tracking difficulties. By comparing Experiments 6, 7 and 8, we verify that setting the initial threshold to 0.75 yields the optimal configuration for template updating.
The threshold analysis also provides further evidence for the necessity of adaptive template update. A fixed threshold that works reasonably well on one dataset or sequence may not generalize to others with different confidence distributions. By contrast, the proposed DTM offers a lightweight yet effective way to improve update reliability without introducing additional learnable parameters.
(3) Memory Capacity Analysis.
We conduct ablation experiments to investigate how different template memory configurations affect tracking performance on three benchmarks, namely LasHeR, RGBT234, and VTUAV. The triplet in the “Methods” column denotes the capacities of the historical template memory, candidate template memory, and fixed template memory, respectively. As shown in Table 5, the configuration $[2, 5, 1]$ achieves the best overall performance, obtaining the highest scores across all datasets. Reducing the historical memory to 0 or 1, as in $[0, 7, 1]$ and $[1, 6, 1]$, leads to a noticeable performance drop, indicating that a small but effective historical template pool is important for modeling long-term target appearance variations. In contrast, increasing the historical memory to 3 while reducing the candidate memory to 4 ($[3, 4, 1]$) slightly degrades performance, as a smaller candidate pool limits the model’s ability to adapt to recent appearance changes. These results suggest that balancing historical and candidate memories is crucial. Therefore, we adopt the setting of 2 historical templates, 5 candidate templates, and 1 fixed template in our full model.
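To make the capacity triplet concrete, the following toy sketch maintains a pool with 1 fixed, 2 historical, and 5 candidate slots, for a total of 8 templates, which is consistent with $K = 8$. The replacement policy is an assumption made only for illustration: candidates are kept first-in-first-out, and a candidate evicted from that queue is retained in the historical memory only if its confidence ranks among the best seen so far. The class name `TemplateMemory` and its methods are hypothetical.

```python
from collections import deque

class TemplateMemory:
    """Toy template pool with fixed, historical, and candidate slots."""

    def __init__(self, initial_template, n_hist=2, n_cand=5):
        self.fixed = initial_template   # anchor template, never replaced
        self.n_hist = n_hist
        self.n_cand = n_cand
        self.historical = []            # (score, template), best score first
        self.candidates = deque()       # (score, template), oldest first

    def update(self, template, score, threshold):
        """Store a new template if its confidence reaches the update threshold."""
        if score < threshold:
            return False
        if self.n_cand > 0 and len(self.candidates) == self.n_cand:
            # Assumed policy: the oldest candidate competes for a historical slot.
            self.historical.append(self.candidates.popleft())
            self.historical.sort(key=lambda entry: entry[0], reverse=True)
            del self.historical[self.n_hist:]
        if self.n_cand > 0:
            self.candidates.append((score, template))
        return True

    def pool(self):
        """All templates currently offered to the selector, fixed template first."""
        hist = [t for _, t in self.historical]
        cand = [t for _, t in self.candidates]
        return [self.fixed] + hist + cand

mem = TemplateMemory("init")
for i in range(10):
    mem.update(f"t{i}", score=0.80 + 0.01 * i, threshold=0.75)
print(mem.pool())
```

Under this policy the candidate queue tracks recent appearance, while the historical slots act as a small buffer of high-confidence templates that survive after they leave the queue.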
(4) Analysis of Threshold Initialization Parameters.
We conduct ablation experiments in Table 6 to explore the influence of the hyperparameters $n$ and $m$ in the dynamic threshold module. Here, the two values in the “Methods” column follow the format $[n, m]$, where $n$ denotes the number of initial frames used to calculate the adaptive threshold, and $m$ represents the number of maximum and minimum confidence scores discarded as outliers.
Experimental results show that inappropriate combinations of $n$ and $m$ degrade tracking accuracy. When $n$ is too small, the early frame information is insufficient, so the estimated threshold cannot reflect the real scene difficulty. When $n$ is excessively large, redundant frames introduce more interference and increase computation cost. Meanwhile, an overly small $m$ cannot effectively filter abnormal confidence values caused by occlusion and background clutter, while an overly large $m$ discards valid confidence information.
The optimal performance is achieved when n = 20 and m = 2. This setting can stably exclude outlier scores, accurately fit the baseline confidence level of the tracking sequence, and generate reliable adaptive update thresholds. It successfully balances scene adaptability and computational simplicity, so we adopt this parameter combination in all subsequent experiments.
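The truncated mean itself is simple to sketch. The version below assumes that $m$ scores are discarded from each end of the sorted list (the text can also be read as $m$ in total), and it does not model how the estimate is combined with the initial threshold, since that rule is not restated in this section. The function name `truncated_mean_threshold` is hypothetical.

```python
def truncated_mean_threshold(confidences, n=20, m=2):
    """Mean of the first n confidence scores after dropping the m lowest and m highest."""
    head = sorted(confidences[:n])
    if len(head) <= 2 * m:
        raise ValueError("need more than 2 * m scores to compute a truncated mean")
    trimmed = head[m:len(head) - m]
    return sum(trimmed) / len(trimmed)

# Two very low scores (e.g. a brief occlusion) and two very high scores are ignored.
scores = [0.10, 0.20] + [0.80] * 16 + [0.99, 1.00]
print(truncated_mean_threshold(scores))
```

Trimming both tails keeps a short occlusion or a spuriously confident frame in the first $n$ frames from shifting the sequence-level threshold.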
(5) Perturbation Analysis.
We evaluate the robustness of our tracker against spatial misalignment between RGB and thermal modalities through perturbation experiments. As shown in Table 7, our method evaluated on perfectly aligned data (row 1) yields the highest performance on both LasHeR and VTUAV. When introducing translational offsets of $\pm 10$ pixels (row 2) or rotational deviations of $\pm 5^{\circ}$ (row 3), the tracking metrics gradually degrade, and performance drops further under combined translation and rotation perturbations (row 4). This is expected, since our model is trained on well-aligned multi-modal data; any spatial misalignment breaks the learned cross-modal correspondence and thus weakens the tracking ability.
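One way to simulate such misalignment is to warp one modality with a small affine transform. The sketch below builds a 2x3 matrix that rotates about the image center and then translates. It assumes that offsets are sampled uniformly within the stated ranges and does not specify which modality is warped, since neither detail is given here; `misalignment_matrix`, `apply_to_point`, and `sample_perturbation` are hypothetical names.

```python
import math
import random

def misalignment_matrix(width, height, dx=0.0, dy=0.0, angle_deg=0.0):
    """2x3 affine matrix: rotate about the image center, then translate by (dx, dy)."""
    cx, cy = width / 2.0, height / 2.0
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    # p' = R (p - center) + center + t, expanded into matrix form.
    return [[c, -s, cx - c * cx + s * cy + dx],
            [s, c, cy - s * cx - c * cy + dy]]

def apply_to_point(matrix, x, y):
    """Map a pixel coordinate through the affine matrix."""
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

def sample_perturbation(rng, max_shift=10.0, max_angle=5.0):
    """Draw (dx, dy, angle_deg) uniformly within the given bounds (assumed sampling)."""
    return (rng.uniform(-max_shift, max_shift),
            rng.uniform(-max_shift, max_shift),
            rng.uniform(-max_angle, max_angle))

rng = random.Random(0)
dx, dy, angle = sample_perturbation(rng)
matrix = misalignment_matrix(640, 480, dx, dy, angle)
print(apply_to_point(matrix, 320, 240))  # the image center moves by exactly (dx, dy)
```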

3.5. Attribute Analysis

We evaluate the attribute-based tracking performance of GTUTrack on the challenging RGBT234 dataset. The radar chart reports the overall performance across 12 typical attribute subsets, including background clutter (BC), scale variation (SC), partial occlusion (PO), thermal crossover (TC), no occlusion (NO), motion blur (MB), low resolution (LR), low illumination (LI), heavy occlusion (HO), fast motion (FM), deformation (DEF), and camera motion (CM). As shown in Figure 4, GTUTrack consistently achieves the best performance across all attribute scenarios compared with other state-of-the-art trackers, including BAT, CGATrack, SDSTrack, ViPT, and TBSI. The polygon corresponding to GTUTrack forms the outermost boundary in the radar chart, indicating superior robustness under diverse challenging conditions.
GTUTrack also exhibits strong adaptability to both RGB-degraded and TIR-degraded scenarios. When RGB information is severely impaired, such as in low illumination (LI), partial occlusion (PO), and fast motion (FM), GTUTrack can effectively exploit complementary thermal information to maintain stable tracking. Meanwhile, when thermal information becomes unreliable, such as under thermal crossover (TC) and background clutter (BC), GTUTrack remains robust by leveraging discriminative RGB features. Moreover, GTUTrack shows clear advantages in handling non-rigid objects (NO), camera motion (CM), and scale variation (SC), demonstrating its ability to cope with complex geometric variations and motion patterns. Overall, the attribute-based evaluation confirms that GTUTrack is robust to a wide range of adverse conditions, and its cross-modal adaptive mechanism enables consistent performance gains over existing trackers.

3.6. Qualitative Analysis

We visualize the tracking results on representative sequences from VTUAV and LasHeR datasets, covering several challenging scenarios, including occlusion (OCC), small targets (ST), low illumination (LI), and high reflection (HR). Figure 5 shows that GTUTrack consistently produces more accurate and stable tracking results than competing methods in these challenging situations.
In bus_14, the target undergoes extreme illumination variation together with noticeable shape change during tracking. Under such conditions, several competing methods gradually drift away from the target or fail to maintain stable localization. In contrast, GTUTrack consistently remains locked onto the target throughout the sequence. This result demonstrates that the proposed method can effectively adapt to severe appearance variation and maintain reliable target representation even when the visual characteristics of the target change significantly.
In rightdarksingleman, GTUTrack successfully adapts to illumination changes and achieves consistent target localization throughout the sequence. Compared with competing methods that gradually deviate from the target under changing brightness conditions, GTUTrack remains more stable due to its ability to update templates adaptively according to the current search region.
Overall, these qualitative results demonstrate that GTUTrack exhibits superior robustness compared with existing multi-modal tracking methods, particularly in dynamic and challenging scenarios involving severe appearance variation, environmental interference, and modality degradation. The observations are also consistent with the quantitative and ablation results, further validating the effectiveness of the proposed search region-guided adaptive template update framework.

4. Discussion

The superior performance of GTUTrack can be mainly attributed to the proposed search region-guided adaptive template update mechanism. Unlike conventional methods that rely on fixed update intervals and manually designed thresholds, GTUTrack actively leverages the current search region to guide template selection and update decisions. This allows the tracker to better adapt to irregular target motion and complex scene variation.
The quantitative comparison results show that GTUTrack consistently improves performance on both UAV and surveillance benchmarks. This suggests that the proposed framework is not limited to a specific scenario, but instead captures a more general principle for multi-modal tracking: the template update should be conditioned on the current search context rather than predefined rules. In UAV tracking, this is particularly important because targets often undergo abrupt appearance variation caused by scale change, viewpoint transition, and background motion. In surveillance scenarios, the same mechanism also improves robustness against occlusion, low resolution, and modality inconsistency.
The ablation and comparison results further show that the three proposed components play complementary roles. GTST improves the matching quality between templates and the current search region, DTM provides adaptive and reliable update criteria under different scenarios, and DTMM maintains a structured template pool that balances adaptability and stability. Their collaboration enables GTUTrack to effectively avoid noisy template updates while preserving useful historical target appearances.
Nevertheless, the proposed method still has room for improvement. In extremely challenging cases, such as severe cross-modal degradation, long-term full occlusion, or dramatic target deformation, the quality of both the current search region and candidate templates may deteriorate simultaneously, which may limit the benefit of adaptive template selection. In addition, although the proposed template memory mechanism is lightweight, further efficiency optimization may still be desirable for resource-constrained UAV platforms. Future work will investigate stronger cross-modal interaction mechanisms, more robust template quality estimation strategies, and more efficient online memory management to further improve performance in such scenarios.
In future work, we will explore the weakly supervised paradigm inspired by ITER [49] to alleviate the reliance on expensive pixel-level annotations. Moreover, we will incorporate the online spectral compensation strategy proposed in OSICN [50] to promote more effective representation and fusion of spectral–spatial information.

5. Conclusions

In this paper, we propose a search region-guided adaptive template update framework for robust multi-modal UAV tracking. The proposed method enables adaptive template selection conditioned on the current search region, allowing the tracker to dynamically match templates with the most relevant target appearance. Specifically, we introduce a Guided Template Selection Transformer (GTST) to perform template quality evaluation and adaptive selection in a learnable manner. We further design a Dynamic Threshold Module (DTM) to adaptively determine template update criteria under different tracking scenarios, ensuring reliable candidate template generation. To improve stability and robustness, we develop a Dynamic Template Memory Module (DTMM), which includes a Fixed Template Memory that preserves the initial target appearance as a stable anchor, a Candidate Template Memory that captures recent appearance variations, and a Historical Template Memory that retains high-quality templates to prevent performance degradation caused by noisy updates. Extensive experiments on multiple benchmarks demonstrate that the proposed method achieves superior performance under significant target appearance variations. These results verify the effectiveness and generalization capability of GTUTrack across diverse and challenging UAV and surveillance scenarios.

Author Contributions

Conceptualization, L.L. and Q.L.; Methodology, Q.L.; Validation, Q.L. and J.L.; Formal analysis, J.L.; Investigation, Q.L.; Resources, J.W.; Writing (original draft preparation), L.L.; Writing (review and editing), Q.L.; Visualization, Q.L.; Supervision, J.W.; Project administration, J.W.; Funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Foundation for High-level Talents of Anhui University of Science and Technology (2024yjrc94), the Open Laboratory Project of the Key Laboratory of Intelligent Computing & Signal Processing, Ministry of Education, Anhui University (2024A003), and the National Natural Science Foundation of China (Grant 62506004).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors are grateful to the editor and anonymous reviewers for the suggestions they made to help us improve the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xue, W.; Ai, J.; Zhu, Y.; Sun, X.; Zhang, Y.; Gao, G. LMCNet: Lightweight Modality Compensation Network Via Knowledge Distillation for Salient Ship Detection Under Missing-Modality Conditions. IEEE Trans. Aerosp. Electron. Syst. 2026, 62, 6547–6560.
  2. Ai, J.; Mao, Y.; Luo, Q.; Jia, L.; Xing, M. SAR Target Classification Using the Multikernel-Size Feature Fusion-Based Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
  3. Ai, J.; Tian, R.; Luo, Q.; Jin, J.; Tang, B. Multi-Scale Rotation-Invariant Haar-Like Feature Integrated CNN-Based Ship Detection Algorithm of Multiple-Target Environment in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10070–10087.
  4. Zhang, P.; Zhao, J.; Wang, D.; Lu, H.; Ruan, X. Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 8886–8895.
  5. Deng, A.; Han, G.; Zhang, Z.; Chen, D.; Ma, T.; Liu, Z. Cross-parallel attention and efficient match transformer for aerial tracking. Remote Sens. 2024, 16, 961.
  6. Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight aware enhancement transformer and multiple matching network for real-time UAV tracking. Remote Sens. 2023, 15, 2857.
  7. Zhang, H.; Kuang, Y.; Wang, J.; Jin, L.; Xu, C.; Meng, Y.; Huang, B. SiamDiff: A Diffusion-Driven Siamese Network for Scale-Aware Anti-UAV Tracking. Remote Sens. 2025, 18, 18.
  8. Hui, T.; Xun, Z.; Peng, F.; Huang, J.; Wei, X.; Wei, X.; Dai, J.; Han, J.; Liu, S. Bridging Search Region Interaction With Template for RGB-T Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 13630–13639.
  9. Wang, H.; Liu, X.; Li, Y.; Sun, M.; Yuan, D.; Liu, J. Temporal adaptive rgbt tracking with modality prompt. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5436–5444.
  10. Zhang, L.; Gonzalez-Garcia, A.; Weijer, J.V.D.; Danelljan, M.; Khan, F.S. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4010–4019.
  11. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Conference, 11–17 October 2021; pp. 10448–10457.
  12. Chen, X.; Yan, B.; Zhu, J.; Lu, H.; Ruan, X.; Wang, D. High-performance transformer tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 8507–8523.
  13. Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9516–9526.
  14. Cao, B.; Guo, J.; Zhu, P.; Hu, Q. Bi-directional adapter for multimodal tracking. Proc. AAAI Conf. Artif. Intell. 2024, 38, 927–935.
  15. Li, B.; Peng, F.; Hui, T.; Wei, X.; Wei, X.; Zhang, L.; Shi, H.; Liu, S. RGB-T Tracking With Template-Bridged Search Interaction and Target-Preserved Template Updating. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 634–649.
  16. Yang, J.; Du, B.; Zhang, L. From center to surrounding: An interactive learning framework for hyperspectral image classification. ISPRS J. Photogramm. Remote Sens. 2023, 197, 145–166.
  17. Li, C.; Zhao, N.; Lu, Y.; Zhu, C.; Tang, J. Weighted sparse representation regularized graph learning for RGB-T object tracking. In Proceedings of the ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1856–1864.
  18. Li, C.; Liang, X.; Lu, Y.; Zhao, N.; Tang, J. RGB-T object tracking: Benchmark and baseline. Pattern Recognit. 2019, 96, 106977.
  19. Li, C.; Xue, W.; Jia, Y.; Qu, Z.; Luo, B.; Tang, J.; Sun, D. LasHeR: A Large-scale High-diversity Benchmark for RGBT Tracking. IEEE Trans. Image Process. 2022, 33, 392–404.
  20. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 341–357.
  21. Xiao, Y.; Zhao, J.; Lu, A.; Li, C.; Lin, Y.; Yin, B.; Liu, C. Cross-modulated Attention Transformer for RGBT Tracking. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8682–8690.
  22. Zheng, Y.; Zhong, B.; Liang, Q.; Mo, Z.; Zhang, S.; Li, X. Odtrack: Online dense temporal token learning for visual tracking. Proc. AAAI Conf. Artif. Intell. 2024, 38, 7588–7596.
  23. Sun, M.; Xiao, J.; Lim, E.G.; Zhang, B.; Zhao, Y. Fast template matching and update for video object tracking and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 14–19 June 2020; pp. 10791–10799.
  24. Wang, Y.; Ye, B.; Cai, Z. Dynamic template updating using spatial-temporal information in siamese trackers. IEEE Trans. Multimed. 2023, 26, 2006–2015. [Google Scholar] [CrossRef]
  25. Li, C.; Liu, L.; Lu, A.; Ji, Q.; Tang, J. Challenge-aware rgbt tracking. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 222–237. [Google Scholar]
  26. Wang, C.; Xu, C.; Cui, Z.; Zhou, L.; Zhang, T.; Zhang, X.; Yang, J. Cross-modal pattern-propagation for RGB-T tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 14–19 June 2020; pp. 7064–7073. [Google Scholar]
  27. Zhang, P.; Wang, D.; Lu, H.; Yang, X. Learning Adaptive Attribute-Driven Representation for Real-Time RGB-T Tracking. Int. J. Comput. Vis. 2021, 129, 2714–2729. [Google Scholar] [CrossRef]
  28. Lu, A.; Li, C.; Yan, Y.; Tang, J.; Luo, B. RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence Loss. IEEE Trans. Image Process. 2021, 30, 5613–5625. [Google Scholar] [CrossRef]
  29. Xiao, Y.; Yang, M.; Li, C.; Liu, L.; Tang, J. Attribute-based Progressive Fusion Network for RGBT Tracking. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2831–2838. [Google Scholar] [CrossRef]
  30. Zhu, Y.; Li, C.; Tang, J.; Luo, B.; Wang, L. RGBT tracking by trident fusion network. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 579–592. [Google Scholar] [CrossRef]
  31. Lu, A.; Qian, C.; Li, C.; Tang, J.; Wang, L. Duality-gated mutual condition network for RGBT tracking. IEEE Trans. Neural Netw. Learn. Syst. 2022, 36, 4118–4131. [Google Scholar] [CrossRef] [PubMed]
  32. Wang, X.; Shu, X.; Zhang, S.; Jiang, B.; Wang, Y.; Tian, Y.; Wu, F. MFGNet: Dynamic modality-aware filter generation for RGB-T tracking. IEEE Trans. Multimed. 2023, 25, 4335–4348. [Google Scholar] [CrossRef]
  33. Hou, R.; Ren, T.; Wu, G. MIRNet: A Robust RGBT Tracking Jointly with Multi-Modal Interaction and Refinement. In Proceedings of the International Conference on Multimedia and Expo, Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  34. Mei, J.; Zhou, D.; Cao, J.; Nie, R.; He, K. Differential Reinforcement and Global Collaboration Network for RGBT Tracking. IEEE Sens. J. 2023, 23, 7301–7311. [Google Scholar] [CrossRef]
  35. Liu, L.; Li, C.; Xiao, Y.; Ruan, R.; Fan, M. RGBT Tracking via Challenge-based Appearance Disentanglement and Interaction. IEEE Trans. Image Process. 2024, 33, 1753–1767. [Google Scholar] [CrossRef]
  36. Zhang, X.; Ye, P.; Peng, S.; Liu, J.; Xiao, G. DSiamMFT: An RGB-T fusion tracking method via dynamic Siamese networks using multi-layer feature fusion. Signal Process. Image Commun. 2020, 84, 115756. [Google Scholar] [CrossRef]
  37. Zhang, T.; Liu, X.; Zhang, Q.; Han, J. SiamCDA: Complementarity-and distractor-aware RGB-T tracking based on Siamese network. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1403–1417. [Google Scholar]
  38. Feng, M.; Su, J. Learning Multi-Layer Attention Aggregation Siamese Network for Robust RGBT Tracking. IEEE Trans. Multimed. 2024, 26, 3378–3391. [Google Scholar] [CrossRef]
  39. Liu, L.; Li, C.; Xiao, Y.; Tang, J. Quality-Aware RGBT Tracking via Supervised Reliability Learning and Weighted Residual Guidance. In Proceedings of the ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3129–3137. [Google Scholar]
  40. Hou, R.; Xu, B.; Ren, T.; Wu, G. MTNet: Learning modality-aware representation with transformer for RGBT tracking. In Proceedings of the International Conference on Multimedia and Expo, Brisbane, Australia, 10–14 July 2023; pp. 1163–1168. [Google Scholar]
  41. Luo, Y.; Guo, X.; Dong, M.; Yu, J. Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking. Sensors 2023, 23, 6609. [Google Scholar] [CrossRef]
  42. Hong, L.; Yan, S.; Zhang, R.; Li, W.; Zhou, X.; Guo, P.; Jiang, K.; Chen, Y.; Li, J.; Chen, Z.; et al. OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  43. Wu, Z.; Zheng, J.; Ren, X.; Vasluianu, F.A.; Ma, C.; Paudel, D.P.; Van Gool, L.; Timofte, R. Single-model and any-modality for video object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19156–19166. [Google Scholar]
  44. Hou, X.; Xing, J.; Qian, Y.; Guo, Y.; Xin, S.; Chen, J.; Tang, K.; Wang, M.; Jiang, Z.; Liu, L.; et al. Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26551–26561. [Google Scholar]
  45. Xia, J.; Shi, D.; Song, K.; Song, L.; Wang, X.; Jin, S.; Zhao, C.; Cheng, Y.; Jin, L.; Zhu, Z.; et al. Unified Single-Stage Transformer Network for Efficient RGB-T Tracking. In Proceedings of the International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar]
  46. Gao, Z.; Zhou, D.; Cao, J.; Liu, Y.; Shan, Q. Enhanced RGBT Tracking Network With Semantic Generation and Historical Context. IEEE Trans. Instrum. Meas. 2025, 74, 5017817. [Google Scholar] [CrossRef]
  47. Liu, Y.; Gao, Z.; Cao, Y.; Zhou, D. Two-stage Unidirectional Fusion Network for RGBT tracking. Knowl.-Based Syst. 2025, 310, 112983. [Google Scholar] [CrossRef]
  48. Xiao, Y.; Li, Q.; Liu, L.; Li, C. Cross-modal Guiding Attention for RGBT Tracking. Inf. Fusion 2026, 129, 104008. [Google Scholar] [CrossRef]
  49. Yang, J.; Du, B.; Wang, D.; Zhang, L. ITER: Image-to-Pixel Representation for Weakly Supervised HSI Classification. IEEE Trans. Image Process. 2024, 33, 257–272. [Google Scholar] [CrossRef] [PubMed]
  50. Yang, J.; Du, B.; Xu, Y.; Zhang, L. Can Spectral Information Work While Extracting Spatial Distribution?—An Online Spectral Information Compensation Network for HSI Classification. IEEE Trans. Image Process. 2023, 32, 2360–2373. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of the proposed search region-guided adaptive template update framework.
Figure 2. The overall framework of GTUTrack. During tracking, the initial template, dynamic templates, and the search frame are fed into the tracker to complete the tracking process. Subsequently, the dynamic threshold module determines whether the search frame qualifies as a new dynamic template. If eligible, the search frame is cropped into a template sample and stored in the dynamic template bank. This sample then serves as input to the guided template selection Transformer, enabling the dynamic template selection workflow for tracking the next frame.
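The per-frame workflow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `TrackerState`, `track_step`, and the callables `tracker`, `is_eligible`, `crop`, and `select` are hypothetical names, the bank capacity and first-in-first-out eviction are placeholder assumptions, and running the selection step on every frame is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h); hypothetical box format


@dataclass
class TrackerState:
    init_template: Any
    bank: List[Any] = field(default_factory=list)      # dynamic template bank
    selected: List[Any] = field(default_factory=list)  # templates used for the next frame
    capacity: int = 8                                   # assumed bank size, not from the paper


def track_step(
    state: TrackerState,
    frame: Any,
    tracker: Callable[[Any, List[Any], Any], Tuple[Box, float]],
    is_eligible: Callable[[float], bool],
    crop: Callable[[Any, Box], Any],
    select: Callable[[List[Any], Any], List[Any]],
) -> Tuple[Box, float]:
    # 1. Track using the initial template, the currently selected
    #    dynamic templates, and the search frame.
    box, score = tracker(state.init_template, state.selected, frame)

    # 2. The dynamic threshold module (stand-in: is_eligible) decides
    #    whether this frame may become a new dynamic template.
    if is_eligible(score):
        state.bank.append(crop(frame, box))
        if len(state.bank) > state.capacity:
            state.bank.pop(0)  # FIFO eviction: a placeholder policy

    # 3. The guided selection step (stand-in: select) picks the dynamic
    #    templates for the next frame, conditioned on the search frame.
    state.selected = select(state.bank, frame)
    return box, score
```

Passing the components as callables keeps the sketch independent of any particular backbone; a real system would replace them with the tracker, the threshold module, and the selection Transformer.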
Figure 3. Training pipeline of GTST.
Figure 4. Attribute-based evaluation on the RGBT234 dataset. The values in parentheses indicate the minimum PR on the left and the maximum PR on the right.
Figure 5. Qualitative comparison between GTUTrack and several representative trackers. (a) shows sequences from the VTUAV dataset and (b) shows sequences from the LasHeR dataset. The visualization mainly demonstrates the robustness of the proposed method when targets are occluded and then reappear.
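Table 1. Comparison of GTUTrack with advanced trackers on four public datasets. The highest and second-best results are highlighted in bold and underlined, respectively.

| Algorithms | Publication | RGBT210 PR | RGBT210 SR | RGBT234 PR | RGBT234 SR | LasHeR PR | LasHeR NPR | LasHeR SR | VTUAV PR | VTUAV SR |
|---|---|---|---|---|---|---|---|---|---|---|
| CAT [25] | ECCV 20 | 79.2 | 53.3 | 80.4 | 56.1 | 45.0 | 39.5 | 31.4 | - | - |
| CMPP [26] | CVPR 20 | - | - | 82.3 | 57.5 | - | - | - | - | - |
| ADRNet [27] | IJCV 21 | - | - | 80.7 | 57.0 | - | - | - | 62.2 | 46.6 |
| MANet++ [28] | TIP 21 | - | - | 80.0 | 55.4 | 46.7 | 40.4 | 31.4 | - | - |
| APFNet [29] | AAAI 22 | - | - | 82.7 | 57.9 | 50.0 | - | 36.2 | - | - |
| TFNet [30] | TCSVT 22 | 77.7 | 52.9 | 80.6 | 56.0 | - | - | - | - | - |
| DMCNet [31] | TNNLS 22 | 79.7 | 55.5 | 83.9 | 59.3 | 49.0 | 43.1 | 35.5 | - | - |
| MFGNet [32] | TMM 22 | 74.9 | 49.4 | 78.3 | 53.5 | - | - | - | - | - |
| MIRNet [33] | ICME 22 | - | - | 81.6 | 58.9 | - | - | - | - | - |
| DRGCNet [34] | SENS J 23 | - | - | 82.5 | 58.1 | 48.3 | 42.3 | 33.8 | - | - |
| CAT++ [35] | TIP 24 | 82.2 | 56.1 | 84.0 | 59.2 | 50.9 | 44.4 | 35.6 | - | - |
| DSiamMFT [36] | SPIC 20 | 64.2 | 43.6 | - | - | - | - | - | - | - |
| SiamCDA [37] | TCSVT 21 | - | - | 76.0 | 56.9 | - | - | - | - | - |
| HMFT [4] | CVPR 22 | 78.6 | 53.5 | 78.8 | 56.8 | - | - | - | 75.8 | 62.7 |
| SiamMLAA [38] | TMM 23 | 77.9 | 56.7 | 79.5 | 58.4 | 53.8 | - | 43.1 | - | - |
| QAT [39] | ACM MM 23 | 86.8 | 61.9 | 88.4 | 64.3 | 64.2 | 59.6 | 50.1 | 80.1 | 66.7 |
| MTNet [40] | ICME 23 | - | - | 85.0 | 61.9 | 60.8 | - | 47.4 | - | - |
| ViPT [13] | CVPR 23 | - | - | 83.5 | 61.7 | 65.1 | - | 52.5 | - | - |
| TBSI [8] | CVPR 23 | 85.3 | 62.5 | 87.1 | 63.7 | 69.2 | 65.7 | 55.6 | - | - |
| MACFT [41] | Sensors 23 | - | - | 85.7 | 62.2 | 65.3 | - | 51.4 | 80.1 | 66.8 |
| OneTracker [42] | CVPR 24 | - | - | 85.7 | 64.2 | 67.2 | - | 53.8 | - | - |
| BAT [14] | AAAI 24 | - | - | 86.8 | 64.1 | 70.2 | - | 56.3 | - | - |
| TATrack [9] | AAAI 24 | 85.3 | 61.8 | 87.2 | 64.4 | 70.2 | 66.7 | 56.1 | - | - |
| Un-Track [43] | CVPR 24 | - | - | 84.2 | 62.5 | 66.7 | - | 53.6 | - | - |
| SDSTrack [44] | CVPR 24 | - | - | 84.8 | 62.5 | 66.5 | - | 53.1 | - | - |
| USTrack [45] | IJCAI 24 | - | - | 87.4 | 65.8 | - | - | - | 86.9 | 74.4 |
| SHT [46] | TIM 25 | 86.4 | 63.7 | 88.1 | 65.6 | 70.1 | 66.1 | 56.1 | - | - |
| TUFNet [47] | KBS 25 | 86.7 | 61.9 | 88.2 | 64.1 | 70.8 | - | 55.7 | - | - |
| CAFormer | AAAI 25 | 85.6 | 63.2 | 88.3 | 66.4 | 70.0 | 66.1 | 55.6 | 88.6 | 76.2 |
| CGATrack [48] | IFFUS 25 | 87.8 | 64.3 | 89.0 | 66.6 | 72.2 | 68.3 | 57.5 | 89.0 | 76.6 |
| Ours | - | 90.9 | 66.3 | 92.0 | 68.8 | 75.0 | 71.1 | 59.3 | 91.4 | 78.4 |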
Table 2. Comprehensive comparison of GTUTrack with advanced trackers. The highest and second-best results are highlighted in bold and underlined, respectively.

| Algorithms | Publication | LasHeR PR | LasHeR NPR | LasHeR SR | VTUAV PR | VTUAV SR | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| ViPT [13] | CVPR 23 | 65.1 | - | 52.5 | - | - | 92.96 | 21.80 | 70.7 |
| TBSI [8] | CVPR 23 | 69.2 | 65.7 | 55.6 | - | - | 201.98 | 82.52 | 36.2 |
| MACFT [41] | Sensors 23 | 65.3 | - | 51.4 | 80.1 | 66.8 | - | - | 33.3 |
| OneTracker [42] | CVPR 24 | 67.2 | - | 53.8 | - | - | 94.92 | - | - |
| BAT [14] | AAAI 24 | 70.2 | - | 56.3 | - | - | 92.44 | 56.68 | 67.2 |
| TATrack [9] | AAAI 24 | 70.2 | 66.7 | 56.1 | - | - | - | - | 26.1 |
| Un-Track [43] | CVPR 24 | 66.7 | - | 53.6 | - | - | 98.72 | - | - |
| SDSTrack [44] | CVPR 24 | 66.5 | - | 53.1 | - | - | 102.18 | 108.40 | 21 |
| USTrack [45] | IJCAI 24 | - | - | - | 86.9 | 74.4 | - | - | 84.2 |
| SHT [46] | TIM 25 | 70.1 | 66.1 | 56.1 | - | - | - | - | 28 |
| TUFNet [47] | KBS 25 | 70.8 | - | 55.7 | - | - | 97.35 | - | 27 |
| CAFormer | AAAI 25 | 70.0 | 66.1 | 55.6 | 88.6 | 76.2 | 97.50 | 42.91 | 83.6 |
| CGATrack [48] | IFFUS 25 | 72.2 | 68.3 | 57.5 | 89.0 | 76.6 | 95.96 | 58.71 | 88.8 |
| Ours | - | 75.0 | 71.1 | 59.3 | 91.4 | 78.4 | 182.52 | 76.36 | 62 |
Table 3. Ablation study on the components of GTUTrack. The highest results are highlighted in bold.

| # | Methods | LasHeR PR | LasHeR NPR | LasHeR SR | RGBT234 PR | RGBT234 SR | VTUAV PR | VTUAV SR | FPS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | OSTrack+RGBT | 67.8 | 64.3 | 54.0 | 86.4 | 64.5 | 85.3 | 73.2 | 96.4 |
| 2 | + GTST | 72.5 | 68.5 | 57.5 | 88.8 | 66.2 | 88.9 | 76.5 | 66.3 |
| 3 | + DTM | 73.3 | 69.1 | 58.2 | 89.8 | 67.0 | 89.8 | 77.6 | 64.1 |
| 4 | + HTM | 74.4 | 70.8 | 59.0 | 91.5 | 68.4 | 90.7 | 78.1 | 62.6 |
| 5 | + CTM | 75.0 | 71.1 | 59.3 | 92.0 | 68.8 | 91.4 | 78.4 | 62 |
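As a reading aid for Table 3, the short script below works out how much LasHeR PR each added component contributes and what the full model costs in speed. The PR and FPS figures are copied from the table; the names `ABLATION` and `incremental_gains` are illustrative only.

```python
from typing import List, Tuple

# (variant, LasHeR PR, FPS), copied from Table 3
ABLATION: List[Tuple[str, float, float]] = [
    ("OSTrack+RGBT", 67.8, 96.4),
    ("+ GTST", 72.5, 66.3),
    ("+ DTM", 73.3, 64.1),
    ("+ HTM", 74.4, 62.6),
    ("+ CTM", 75.0, 62.0),
]


def incremental_gains(rows: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """PR gain of each variant over the row directly above it."""
    return [
        (name, round(pr - prev_pr, 1))
        for (_, prev_pr, _), (name, pr, _) in zip(rows, rows[1:])
    ]


gains = incremental_gains(ABLATION)
# Cumulative PR gain of the full model over the baseline.
total_gain = round(ABLATION[-1][1] - ABLATION[0][1], 1)
# Fraction of the baseline frame rate retained by the full model.
fps_ratio = round(ABLATION[-1][2] / ABLATION[0][2], 2)

for name, gain in gains:
    print(f"{name}: +{gain} PR")
print(f"total: +{total_gain} PR at {fps_ratio:.0%} of baseline FPS")
```

Read this way, the first added component (GTST) accounts for most of the PR gain and most of the speed drop, while the three later modules add smaller gains at little extra cost.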
Table 4. Ablation study on template update threshold. The highest results are highlighted in bold.

| # | Thresholds | LasHeR PR | LasHeR NPR | LasHeR SR | RGBT234 PR | RGBT234 SR | VTUAV PR | VTUAV SR |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.70 | 72.8 | 68.3 | 58.0 | 89.4 | 66.8 | 89.1 | 76.4 |
| 2 | 0.75 | 73.3 | 69.0 | 58.3 | 89.8 | 66.9 | 89.5 | 76.6 |
| 3 | 0.80 | 72.6 | 68.2 | 57.8 | 89.1 | 66.7 | 89.0 | 76.4 |
| 4 | 0.85 | 72.0 | 67.9 | 57.6 | 88.9 | 66.5 | 88.9 | 76.3 |
| 5 | $DTM_i = \mathrm{null}$ | 70.7 | 67.0 | 55.7 | 87.0 | 65.1 | 87.6 | 75.3 |
| 6 | $DTM_i = 0.65$ | 73.6 | 69.7 | 58.3 | 91.5 | 68.5 | 89.6 | 76.8 |
| 7 | $DTM_i = 0.70$ | 75.0 | 71.1 | 59.3 | 92.0 | 68.8 | 91.4 | 78.4 |
| 8 | $DTM_i = 0.75$ | 74.6 | 70.5 | 58.9 | 91.5 | 68.4 | 91.0 | 77.8 |
Table 5. Ablation study on dynamic template memory capacity. The highest results are highlighted in bold.

| # | Methods | LasHeR PR | LasHeR NPR | LasHeR SR | RGBT234 PR | RGBT234 SR | VTUAV PR | VTUAV SR |
|---|---|---|---|---|---|---|---|---|
| 1 | [0, 7, 1] | 71.6 | 67.8 | 56.7 | 89.9 | 66.7 | 88.6 | 76.2 |
| 2 | [1, 6, 1] | 72.4 | 68.6 | 57.1 | 90.1 | 66.9 | 88.9 | 77.3 |
| 3 | [2, 5, 1] | 75.0 | 71.1 | 59.3 | 92.0 | 68.8 | 91.4 | 78.4 |
| 4 | [3, 4, 1] | 74.6 | 70.7 | 59.0 | 91.6 | 68.5 | 90.6 | 77.8 |
Table 6. Ablation study on dynamic threshold initialization parameters. The highest results are highlighted in bold.

| # | Methods | LasHeR PR | LasHeR NPR | LasHeR SR | RGBT234 PR | RGBT234 SR | VTUAV PR | VTUAV SR |
|---|---|---|---|---|---|---|---|---|
| 1 | [10, 0] | 72.6 | 68.9 | 57.2 | 91.4 | 68.3 | 89.5 | 76.5 |
| 2 | [10, 1] | 73.2 | 69.4 | 58.1 | 90.7 | 67.7 | 89.9 | 77.1 |
| 3 | [10, 2] | 73.1 | 69.3 | 58.1 | 90.7 | 67.8 | 89.6 | 77.0 |
| 4 | [20, 0] | 74.9 | 71.0 | 59.4 | 91.6 | 68.6 | 91.5 | 78.4 |
| 5 | [20, 2] | 75.0 | 71.1 | 59.3 | 92.0 | 68.8 | 91.4 | 78.4 |
| 6 | [20, 4] | 74.1 | 70.3 | 58.8 | 92.2 | 69.2 | 91.6 | 78.3 |
Table 7. Ablation study on perturbation. The highest results are highlighted in bold.

| # | Translational deviation | Rotational deviation | LasHeR PR | LasHeR NPR | LasHeR SR | VTUAV PR | VTUAV SR |
|---|---|---|---|---|---|---|---|
| 1 | - | - | 75.0 | 71.1 | 59.3 | 91.4 | 78.4 |
| 2 | ±10 px | - | 72.2 | 68.2 | 57.3 | 88.4 | 75.8 |
| 3 | - | ±5° | 70.9 | 67.3 | 56.2 | 86.8 | 73.6 |
| 4 | ±10 px | ±5° | 67.5 | 63.9 | 53.8 | 80.1 | 67.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, L.; Li, Q.; Lv, J.; Wang, J. Search Region-Guided Adaptive Template Update for Robust Multi-Modal UAV Tracking. Remote Sens. 2026, 18, 1817. https://doi.org/10.3390/rs18111817

AMA Style

Liu L, Li Q, Lv J, Wang J. Search Region-Guided Adaptive Template Update for Robust Multi-Modal UAV Tracking. Remote Sensing. 2026; 18(11):1817. https://doi.org/10.3390/rs18111817

Chicago/Turabian Style

Liu, Lei, Qi Li, Jiaxin Lv, and Jiaxiang Wang. 2026. "Search Region-Guided Adaptive Template Update for Robust Multi-Modal UAV Tracking" Remote Sensing 18, no. 11: 1817. https://doi.org/10.3390/rs18111817

APA Style

Liu, L., Li, Q., Lv, J., & Wang, J. (2026). Search Region-Guided Adaptive Template Update for Robust Multi-Modal UAV Tracking. Remote Sensing, 18(11), 1817. https://doi.org/10.3390/rs18111817

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.