Article

FSTC-DiMP: Advanced Feature Processing and Spatio-Temporal Consistency for Anti-UAV Tracking

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2902; https://doi.org/10.3390/rs17162902
Submission received: 7 July 2025 / Revised: 11 August 2025 / Accepted: 18 August 2025 / Published: 20 August 2025
(This article belongs to the Special Issue Recent Advances in Infrared Target Detection)

Abstract

The widespread application of UAV technology has brought significant security concerns that cannot be ignored, driving considerable attention to anti-unmanned aerial vehicle (UAV) tracking technologies. Anti-UAV tracking faces challenges including target entry into and exit from the field of view, thermal crossover, and interference from similar objects, under which Siamese network trackers exhibit notable limitations. To address these issues, we propose FSTC-DiMP, an anti-UAV tracking algorithm. To better handle feature extraction in low-Signal-to-Clutter-Ratio (SCR) images and expand receptive fields, we introduce the Large Selective Kernel (LSK) attention mechanism, achieving a balance between local feature focus and global information integration. A spatio-temporal consistency-guided re-detection mechanism is designed to mitigate tracking failures caused by target entry into and exit from the field of view or similar-object interference through spatio-temporal relationship analysis. Additionally, a background augmentation module has been developed to more efficiently utilise initial frame information, effectively capturing the semantic features of both targets and their surrounding environments. Experimental results on the AntiUAV410 and AntiUAV600 datasets demonstrate that FSTC-DiMP achieves significant performance improvements in anti-UAV tracking tasks, validating the algorithm’s strong robustness and adaptability to complex environments.

1. Introduction

With the rapid advancement of deep learning and automation technologies, UAVs have been extensively applied across various domains including aerial photography, logistics transportation, and environmental monitoring [1,2]. However, the widespread deployment of drones has simultaneously posed significant challenges to low-altitude security governance due to improper operational practices. To prevent unauthorised drone activities from endangering public safety and jeopardising people’s lives and property, it is imperative to implement timely detection and countermeasures against malicious UAV operations [3]. Anti-UAV tracking is a specialised branch of single-object tracking (SOT) in computer vision, dedicated to the continuous detection and monitoring of designated targets across video sequences. While the constraints on UAV targets alleviate the need for multi-class recognition, they simultaneously introduce unique technical challenges that go beyond traditional SOT methods [4,5].
Considering low-light conditions, anti-UAV tracking typically utilises thermal infrared (TIR) data, which mitigates the impact of adverse weather and low illumination. However, the absence of colour and texture features from RGB images increases the difficulty of target discrimination [6]. The physical mechanism of infrared imaging is based on the energy conversion process of thermal radiation. When a target object actively emits heat, an infrared detector reconstructs thermal radiation signals of varying intensities into a thermal distribution image [7]. According to the Stefan–Boltzmann law, the radiative power of a target is proportional to the fourth power of its absolute temperature. This nonlinear relationship causes the grey-value distribution of targets with significant thermal contrast to exhibit distinct gradient variations [8].
However, due to factors such as the small size of the UAV target, long shooting distance, and complex background environments, the target imaging area usually has a low SCR. Figure 1 presents an analysis of the grey-value characteristics of the UAV target.
Figure 1b illustrates the calculation region of the SCR for the target area. The SCR is obtained as the ratio of the difference between the mean pixel values of the target region and the background region to the standard deviation of the background pixels. According to [9], the boundary distance, d, is set to 20 pixels. This study organises representative sequences from the AntiUAV410 dataset and selects four types of backgrounds (sky, urban, mountain, and lake) for SCR analysis of the UAV target region. The results are shown in Figure 2. The sky background is relatively clean, and the target region exhibits a relatively high SCR. In contrast, urban, mountain, and lake backgrounds exhibit complex texture features, resulting in a relatively low SCR for the target region. Through in-depth analysis, we infer that when thermal crossover effects occur in the image, the target and background exhibit similar greyscale values, which significantly increases the difficulty of distinguishing the target from the background.
Figure 2. Line-chart analysis of the Signal-to-Clutter Ratio (SCR) on the AntiUAV410 dataset, marking the lowest-SCR region with corresponding image examples.
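The SCR computation described above can be summarised with a short routine; the following is a minimal NumPy sketch, assuming a rectangular target box and the 20-pixel boundary distance d from [9] (array and function names are illustrative):

```python
import numpy as np

def compute_scr(image, bbox, d=20):
    """Signal-to-Clutter Ratio: |mean(target) - mean(background)| / std(background).

    image : 2-D greyscale array; bbox : (x, y, w, h) target box;
    d     : boundary distance in pixels defining the surrounding background region.
    """
    x, y, w, h = bbox
    target = image[y:y + h, x:x + w]

    # Background: a window extended by d pixels around the target, target pixels excluded.
    y0, y1 = max(0, y - d), min(image.shape[0], y + h + d)
    x0, x1 = max(0, x - d), min(image.shape[1], x + w + d)
    window = image[y0:y1, x0:x1].astype(np.float64)

    mask = np.ones_like(window, dtype=bool)
    mask[(y - y0):(y - y0 + h), (x - x0):(x - x0 + w)] = False  # remove target pixels
    background = window[mask]

    return abs(target.mean() - background.mean()) / (background.std() + 1e-12)
```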
From a visual perspective, cluttered textures, flickering light and shadow, and other factors in complex backgrounds can generate visual noise. This noise often resembles the contour features of UAVs, leading to tracking errors. Backgrounds such as clouds, mountains, and forests are particularly susceptible to visual interference due to variations in lighting conditions and weather. Appearance-based tracking methods typically update their models based on the target’s visual features to adapt to changes in the target. As shown in Figure 3a, when similar objects (e.g., birds) share comparable contours or appearances with the actual target, they may trigger erroneous updates of the appearance model, compromising tracking robustness.
Meanwhile, Figure 3b illustrates the thermal crossover issue in anti-UAV tracking tasks. Surface temperature is influenced by environmental fluctuations and lighting conditions, often causing the thermal radiation values of the target and background to converge. This results in the target’s features being overwhelmed by the background, leading to tracking inaccuracies—especially in complex urban thermal environments [10]. When the target moves out of the field of view, the loss of appearance features increases the difficulty of calculating the similarity between the target and candidate regions. As background features continuously accumulate through iterative updates, the tracker may experience drift when the target re-enters the field of view, ultimately causing tracking failure. This phenomenon is demonstrated in Figure 3c.
To effectively address the issues above and achieve robust anti-UAV tracking, this study proposes a novel tracker, FSTC-DiMP. First, to address the issues of insufficient feature extraction and limited receptive fields in low-SCR images, we enhance the feature extractor by incorporating the LSK attention mechanism. This achieves a balance between local feature focus and global information integration, significantly improving the model’s ability to represent targets at different scales. Second, to mitigate tracking failures caused by the target exiting and re-entering the field of view or interference from similar targets, this study designs a spatio-temporal consistency-guided re-detection mechanism. By analysing spatio-temporal consistency relationships, newly detected targets are verified, effectively mitigating tracking loss. Additionally, to more efficiently utilise the initial frame information and learn richer feature representations, we design a background augmentation module that effectively captures the semantic features of the target and its surrounding environment.
In the challenging field of anti-UAV tracking, our method was evaluated on two benchmark datasets (AntiUAV410 [11] and AntiUAV600 [12]), achieving area under the curve (AUC) scores of 67.7% and 53.6%, respectively. Compared to the baseline model Super-DiMP, this represents an improvement of 8.1% and 4.3%. These results fully validate the adaptability and effectiveness of our proposed algorithm in complex environments. The main contributions of this paper can be summarised as follows:
  • To address the insufficient feature representation of initial frame images and inadequate utilisation of background information, we designed an enhanced feature extraction method and background augmentation module to improve the utilisation efficiency of initial frame images.
  • A spatio-temporal consistency-guided re-detection mechanism was proposed, which effectively mitigates tracking loss in complex environments.
  • Extensive experimental results demonstrate that FSTC-DiMP exhibits superior performance and strong robustness in complex scenarios.

2. Materials and Methods

2.1. Visual Tracking

Vision-guided tracking technology plays a crucial role in the field of anti-UAV systems. As a specialised branch of visual tracking, SOT warrants in-depth research. In recent years, deep learning techniques have achieved remarkable success in SOT, which can be broadly categorised into generative trackers [11,13,14,15,16,17] and discriminative trackers [18,19,20,21] based on their core mechanisms.
Generative trackers approach tracking from the perspective of cross-correlation in deep feature space, employing offline learning strategies to train dual-branch networks for similarity measurement. The two branches process target images and template images, respectively. Within this framework, Siamese network architectures have gained significant attention due to their end-to-end training approach and specialised template-matching mechanism.
SiamFC [13], as a classic tracking algorithm based on Siamese networks, brought template-matching concepts into mainstream focus. Li et al. [14] proposed SiamRPN, which improves operational speed by integrating a Region Proposal Network (RPN) into the Siamese architecture. DaSiamRPN [15] proposed an adversarial interference recognition mechanism that significantly enhanced tracking performance in complex scenarios by incorporating background interference features into the optimisation process of deep feature representation learning. SiamRPN++ [16] enhanced feature representation capability by integrating ResNet into the Siamese architecture, achieving superior performance across multiple benchmarks. SiamDT [11] introduced a Dual-Semantic Feature Extraction mechanism and a background interference suppression mechanism to capture the semantic saliency of targets. This semantic saliency exhibits distinct discriminative characteristics in search templates, facilitating target localisation and significantly enhancing the robustness of UAV target tracking in complex background environments. SiamCAR [17] incorporates a centrality assessment branch that operates in parallel with the classification task to evaluate the actual tracking quality.
However, Siamese networks exhibit inherent limitations when handling prolonged target motion. On the one hand, they are prone to template “contamination” when encountering interfering objects, causing the tracker to mistakenly identify the interference as the target itself, ultimately leading to tracking failure. On the other hand, as the background continuously changes during motion, Siamese networks rely solely on target feature information and fail to utilise background context fully. This significantly constrains their tracking accuracy and robustness in complex scenarios.
Discriminative model-based approaches train classifiers to distinguish targets from backgrounds through online learning. The ATOM [18] framework employs a two-stage tracking strategy: a target classification network performs coarse localisation to rapidly estimate the target position, while a target estimation network conducts precise position refinement. Inspired by ATOM, DiMP50 [19] and PrDiMP50 [20] address the insufficient background feature learning in Siamese architectures. DiMP comprises two components: a target classification branch for target–background discrimination and a bounding box prediction branch for position estimation. PrDiMP introduces a probabilistic regression formulation for state estimation confidence scores, enhancing the model’s tolerance to annotation errors and ambiguous tracking scenarios. Ocean [21] proposes a feature alignment module that implements spatial optimisation through an expansion strategy, adaptively mapping convolutionally extracted local features to complete prediction regions.
Furthermore, KYS [22] develops an end-to-end deep learning framework that learns and propagates temporal feature encodings via cross-frame dense association matching. Specifically, this method fuses explicit contextual representations with deep appearance features to generate more robust target predictions. STRCFD [23] designs a guided local contrast mechanism for more efficient target–background discrimination.

2.2. Attention Mechanism

Vision Transformer (ViT) [24], developed based on the Transformer architecture, has demonstrated outstanding performance in various vision tasks due to its global receptive field characteristics. Unlike traditional methods that compress image information into fixed feature representations, the attention mechanism enables the network to focus on crucial visual features adaptively.
HiT [25] is a lightweight Transformer architecture that features a cross-layer feature interaction module for the deep integration of high-level and low-level features. TransT [26] introduces an attention-based feature fusion module that, unlike traditional correlation computation, directly integrates template and search-region feature representations through attention-based interaction.
With the successful application of Transformer in the field of computer vision, refs. [27,28,29] explore how to deeply integrate the core attention mechanism with traditional target tracking frameworks. Conventional Siamese tracking networks primarily rely on cross-correlation operations for feature matching, which, while computationally efficient, often exhibit limitations in handling complex scenarios. In contrast, the self-attention mechanism of the Transformer can model long-range dependencies, offering a more effective approach to feature fusion for tracking tasks. Wang et al. [27] further proposed an encoder–decoder structure applied to the template branch and search branch of the Siamese network, respectively, to enhance feature discriminability. Specifically, the encoder in the template branch aggregates historical frame information to generate more expressive dynamic template representations, thereby improving tracking performance. CRM-DiMP [28] introduced a sparse attention mechanism into Super-DiMP, effectively enhancing the accuracy of the target tracking system. SiamCAP [29] proposed an approach incorporating enhanced contextual awareness and a pixel-level attention mechanism.
In addition, SiamGAT [30] employs a graph-structured modelling approach, representing features from the template branch and search branch of the Siamese network as graph nodes. Feature interaction is achieved through a Graph Attention Module (GAM). Compared to the global matching mechanism of traditional cross-correlation methods, GAM establishes dynamic correspondences between local nodes, effectively capturing geometric deformation characteristics of targets and thereby enhancing adaptability to pose variations. Gao et al. [31] proposed an improved strategy based on hierarchical attention mechanisms. By synergistically utilising inter-frame and intra-frame attention relationships across convolutional layers, it significantly enhances feature representation capability while suppressing background interference. FocTrack [32] introduced a local template update strategy that selectively updates features to mine valid visual cues.
Thus, the introduction of attention mechanisms in visual object tracking enables networks to focus on critical features, improving both discriminative capability and adaptability.

2.3. Re-Detection

In the process of anti-UAV tracking, the high uncertainty of target motion trajectories and the variability of complex battlefield environments often lead to target loss or feature degradation due to occlusion or leaving the field of view. This makes the design of re-detection modules critical for improving tracking performance. Meanwhile, dynamic updates of target features during online tracking can significantly enhance the tracker’s adaptability to complex scenarios and variations in target appearance. In this direction, TLD (Tracking–Learning–Detection) [33] stands as one of the most classic trackers, employing a modular design that integrates three core components: tracking, learning, and detection. The tracking module employs motion prediction strategies to estimate target position changes, while the detection module conducts global searches across the entire image using a cascade classifier to identify potential target locations probabilistically.
Given the high robustness demonstrated by Siamese networks in target tracking, numerous tracking algorithms based on this architecture have emerged in recent years [11,34,35,36,37,38,39]. The MMLT tracker [34] was specifically designed to address target deformation and disappearance issues, offering long-term robust tracking capabilities. For re-detection, this method abandons reliance on historical position information and adopts a full-image search mode to ensure target reacquisition capability. Siam R-CNN [35] is an improved Siamese network tracking algorithm based on the Faster R-CNN [36] framework, with its main innovations in two aspects. First, it introduces a hard example mining mechanism, specifically optimising the re-detector’s ability to discriminate against challenging interference targets. Second, the algorithm employs dynamic programming techniques to comprehensively analyse the short-term trajectory information of all candidate targets and interference items, thereby enabling optimal target selection within the current frame. These improvements allow Siam R-CNN to effectively mitigate tracking drift while demonstrating strong target reacquisition capabilities, thereby excelling in long-term tracking tasks. Inspired by Siam R-CNN, SiamDT [11] adopts a dual-semantic extraction mechanism comprising a dual-semantic RPN subnetwork and a multi-functional R-CNN subnetwork. The dual-semantic RPN subnetwork predicts candidate bounding boxes through a Siamese branch RPN structure that models both the similarity relationship between target regions and templates and the probability of foreground UAV objects. The multi-functional R-CNN subnetwork refines predicted candidate boxes through a weight-shared R-CNN based on fused information. Nocal-Siam [37] was developed to collaboratively optimise feature representations of targets and search regions in tracking tasks, thereby suppressing background interference. Building upon this, OSTrack [38] was constructed as a new tracking framework that dynamically detects distractors and incorporates an “explanation elimination” reasoning mechanism, considering both target and distractor features during target localisation. This approach effectively reduces mismatches between target models and background regions, improving tracking accuracy. Hou et al. [39] proposed an efficient and accurate global re-detection module.
Relying solely on single features or fixed search strategies proves inadequate for addressing challenges such as rapid target deformation and background interference. To overcome these limitations, refs. [40,41,42,43] proposed multiple technical approaches. RLT-DiMP [40] introduced a random erasure strategy that mitigates the influence of irrelevant background elements during tracking. Bo Gao et al. [41] incorporated target shape cues into post-re-detection precise localisation, thereby improving target positioning accuracy. PromptVT [42] effectively enhances the tracker’s target recognition and representation capabilities by fusing spatial features between search regions and templates across multiple scales and levels. AODiMP-TIR [43] introduces an anti-occlusion strategy for infrared target tracking, significantly improving robustness under occlusion conditions.
These approaches collectively address the limitations of conventional trackers from multiple perspectives, providing crucial technical support for stable tracking in complex scenarios.

3. Methodology

3.1. Overall Framework

The proposed FSTC-DiMP method in this study adopts Super-DiMP as its baseline model. FSTC-DiMP comprises two branches: a classification branch for target localisation and a regression branch for estimating target size. When foreground and background conditions change, its online update mode can effectively learn and utilise feature information. Combined with the model weight optimisation approach through offline training, this hybrid method achieves exceptional robustness in anti-UAV tracking, particularly in challenging combat scenarios.
As shown in Figure 4, the overall architecture of FSTC-DiMP performs anti-UAV tracking through the following process: The Enhanced Feature Learning Based on Background Augmentation (ELB) module first conducts data augmentation on the first-frame real image with reliable labels. The backbone network then extracts features from all training samples, followed by feature enhancement and focus processing through LSK attention on feature maps from the additional convolutional block (Cls Feat). The model predictor computes weights based on these feature maps to perform target classification on the feature maps extracted from test frames. The resulting score map is fed into the spatio-temporal consistency-guided re-detection (STR) module for confidence evaluation, which determines whether to activate the re-detection function and ultimately produces the final feature score map.
FSTC-DiMP enables more comprehensive learning of target and background information from the first frame. Through ELB, it proactively learns various potential backgrounds where the UAV target may appear, making more efficient use of global information from the initial frame to understand feature variations across different backgrounds. The incorporation of LSK attention enhances the discriminative capability between target and background features under low-SCR conditions. Meanwhile, STR effectively mitigates adverse effects caused by feature degradation, significantly improving the robustness of anti-UAV tracking.

3.2. Enhanced Feature Extractor

Existing target tracking algorithms face numerous challenges when processing infrared images, primarily due to the unique properties of infrared imaging. Specifically, as single-channel greyscale images, infrared data possess significantly reduced information dimensionality compared to three-channel RGB visible-light images. Furthermore, the small-scale and highly manoeuvrable motion characteristics of UAV targets often result in blurred imaging and large positional displacements. These challenges necessitate the development of a feature enhancement mechanism that can simultaneously address the limitations of local receptive fields and the redundancy in global attention.
Both Convolutional Neural Networks (CNNs) and self-attention mechanisms exhibit inherent limitations in handling these issues: CNNs, constrained by their local receptive field design, primarily focus on extracting features from target subregions, demonstrating notable deficiencies in global information integration efficiency. While self-attention mechanisms can capture global contextual information, their full-input receptive fields often lead to insufficient focus on target features and excessive attention to background information, thereby compromising foreground–background discrimination. In contrast, the LSK attention mechanism achieves an optimal balance between local feature concentration and global information integration through dynamic adjustment of receptive field ranges and selectivity [44]. As illustrated in Figure 5, this adaptive capability enables effective capture of spatial dependencies between target objects and their backgrounds in infrared scenarios.
The LSK attention mechanism achieves precise coupling between local details and global context through coordinated multi-scale dynamic convolutional kernel modelling, effectively balancing fine-grained localisation of small weak targets and robust suppression of complex backgrounds in infrared imagery. Furthermore, addressing the high-manoeuvrability characteristics of UAV targets, LSK performs dynamic adaptive receptive field adjustment while preserving local feature extraction capabilities, thereby establishing accurate correlations between target regions and relevant contextual information. This approach delivers three critical enhancements: (1) alleviating feature representation limitations caused by insufficient information in single-channel images; (2) suppressing irrelevant background interference through selective attention mechanisms; (3) optimising target distinguishability against complex backgrounds.
Specifically, this study designs an Enhanced Feature Extractor based on the Super-DiMP tracking framework by incorporating the LSK attention mechanism at the back end of feature classification to enhance focus on critical infrared target characteristics. The LSK module adopts a hierarchical large-kernel convolution decomposition strategy, which decomposes conventional large-size convolution kernels into multi-branch depthwise separable convolution structures to extract spatial contextual information across different receptive fields in parallel. Subsequently, a channel–spatial dual-attention fusion mechanism adaptively weights and integrates these multi-scale features, thereby preserving the advantages of large receptive fields while suppressing common noise interference in infrared imagery.
The adaptive feature selection mechanism employed in LSK originates from a multi-scale, long-range contextual model. This method structurally modifies conventional spatial attention modules by introducing kernels of progressively growing size and a sequence of dilated depthwise convolutions, thereby constructing convolutional kernels with expanded receptive fields. Specifically, for the i-th depthwise convolution layer, the expansion patterns of the kernel size ($k_i$), dilation rate ($d_i$), and receptive field ($RF_i$) can be expressed as
$$k_{i-1} \leq k_i; \quad d_1 = 1, \quad d_{i-1} < d_i \leq RF_{i-1},$$
$$RF_1 = k_1, \quad RF_i = d_i (k_i - 1) + RF_{i-1}.$$
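As a worked illustration (assuming, as in the settings of Section 4.2, a 5 × 5 kernel followed by a 7 × 7 kernel with dilation rate 3 applied to the second branch), the recursion gives
$$RF_1 = k_1 = 5, \qquad RF_2 = d_2 (k_2 - 1) + RF_1 = 3 \times (7 - 1) + 5 = 23.$$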
A single large convolutional kernel can be decomposed into a cascaded structure of 2–3 depthwise separable convolutions. The spatial selection mechanism in LSK enhances the network’s focus on target-relevant spatial background regions by performing multi-scale spatial filtering on feature maps generated by large kernels. Specifically, LSK first concatenates multi-scale feature tensors from large-kernel convolutions with different receptive fields:
$$\mathbf{U} = [\mathbf{U}_1; \ldots; \mathbf{U}_i],$$
where $\mathbf{U}_i$ represents the features extracted by the different convolutional kernels. These features subsequently undergo spatial max-pooling and spatial min-pooling operations to capture spatial relationships:
$$SA_{\max} = P_{\max}(\mathbf{U}), \quad SA_{\min} = P_{\min}(\mathbf{U}),$$
$$S = \mathcal{F}\!\left( \sum_{i=1}^{N} SA_i \cdot \mathbf{U}_i \right),$$
$$Y = S \cdot X.$$
$SA_{\max}$ and $SA_{\min}$ characterise the spatial features extracted through max-pooling and min-pooling operations, respectively. The pooled feature tensors are then processed through convolutional operations, where outputs from larger-sized kernels undergo optimisation via tensor decomposition and merging. The processed results are ultimately multiplied with the input features through element-wise product operations. This design significantly enhances the feature extraction module’s adaptability to low-SCR infrared imagery, providing more discriminative feature representations for subsequent tracking processes. Consequently, it achieves more robust target tracking performance in complex infrared scenarios.
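For illustration, the following is a minimal PyTorch sketch of an LSK-style block combining the large-kernel decomposition and the spatial selection described above; the channel counts, kernel configuration, and layer names are assumptions rather than the exact FSTC-DiMP implementation:

```python
import torch
import torch.nn as nn

class LSKBlock(nn.Module):
    """Simplified LSK-style block: large-kernel decomposition + spatial selection.

    `channels` is assumed to be even; the kernel configuration (5x5, then 7x7 with
    dilation 3) follows the settings reported in Section 4.2.
    """

    def __init__(self, channels):
        super().__init__()
        # Large kernel decomposed into two depthwise convolutions with a growing
        # receptive field: 5x5 (RF = 5) followed by 7x7 with dilation 3 (RF = 23).
        self.dw_small = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_large = nn.Conv2d(channels, channels, 7, padding=9,
                                  dilation=3, groups=channels)
        self.proj1 = nn.Conv2d(channels, channels // 2, 1)
        self.proj2 = nn.Conv2d(channels, channels // 2, 1)
        # Spatial selection: two pooled maps -> one attention map per branch.
        self.select = nn.Conv2d(2, 2, 7, padding=3)
        self.fuse = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):
        small = self.dw_small(x)
        u1 = self.proj1(small)                  # local-context branch U_1
        u2 = self.proj2(self.dw_large(small))   # long-range-context branch U_2
        u = torch.cat([u1, u2], dim=1)

        # Channel-wise max- and min-pooling give two spatial descriptors of U.
        sa_max = u.max(dim=1, keepdim=True).values
        sa_min = u.min(dim=1, keepdim=True).values
        attn = torch.sigmoid(self.select(torch.cat([sa_max, sa_min], dim=1)))

        # Weight each branch by its spatial attention map, fuse, and modulate the input.
        s = self.fuse(u1 * attn[:, 0:1] + u2 * attn[:, 1:2])
        return x * s
```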
As shown in Table 1, we compared our method with the feature extraction strategies and limitations of Dual-Semantic Feature Extraction [11], Centre-Prediction Feature Extraction [45], and Band Selection Algorithm Based on Multi-Feature and Affinity Propagation Clustering (GE-AP) [46].
Dual-Semantic Feature Extraction achieves high target tracking accuracy in high-contrast scenarios by parallel extraction of dual-branch features, including “matching semantics” and “foreground semantics.” However, when the target and background share similar semantics, the dual-branch structure may amplify misjudgment risks due to feature confusion, making it more suitable for scenarios with distinct background differences. Centre-Prediction Feature Extraction simplifies the target localisation process by directly mapping CNN features into a single-channel centre confidence heatmap. Nevertheless, its single-channel heatmap performs poorly in adapting to complex backgrounds. The GE-AP method enhances hyperspectral image processing by constructing a multi-feature similarity matrix, excelling in handling high-dimensional data. However, its reliance on texture feature calculations limits its applicability in scenarios requiring high real-time performance. In contrast, our method dynamically adjusts the receptive field range and selectivity, making it more suitable for scenarios with a low SCR. However, the increased computational complexity poses challenges for deployment.

3.3. Spatio-Temporal Consistency-Guided Re-Detection

To address tracking failures caused by the target entering and exiting the field of view or by interference from similar targets during tracking, this study designs a spatio-temporal consistency-guided re-detection mechanism. Traditional re-detection methods overlook the spatio-temporal relationship between targets and interfering objects, rendering them susceptible to false detections due to appearance similarity or occlusion interference. This study validates newly detected targets by analysing spatio-temporal consistency relationships. If a freshly detected target shows excessive spatial deviation from the predicted position in the previous frame or exhibits an unreasonable disappearance–reappearance time interval, its matching confidence is significantly reduced. This mechanism effectively mitigates false tracking caused by background noise or similar targets.
As shown in Figure 6, the confidence division judges initial confidence based on the maximum peak value of the feature response map. For samples that fail to meet the preset confidence threshold, the system initiates a random re-detection process to obtain updated feature response maps through resampling and recalculation. Subsequently, the updated feature responses undergo spatio-temporal consistency verification to complete secondary confidence evaluation.
Samples that meet the confidence threshold are passed to the classification branch for precise target position regression, which enables the final target localisation. Conversely, if samples still fail to meet the confidence standard during secondary verification, the system will determine them as target loss states. This dual-stage confidence evaluation architecture effectively balances tracking accuracy and robustness, enabling adaptive response to complex tracking scenario changes.

3.3.1. Confidence Classification

During the tracking process, the tracker performs feature extraction on the input image X and outputs a discriminative response map S ( X ) R H × W , expressed as
$$S(X) = A \star \phi(X),$$
where $\phi(X)$ denotes the deep features extracted from the input image, $A$ denotes the target appearance model, i.e., the online-trained convolutional kernel, and $\star$ indicates the cross-correlation operation.
The feature response map localises the target by computing the similarity between the appearance model and the deep features of the current frame. In object tracking tasks, this process generates a 2D response distribution map by computing similarity scores. The target’s most probable position is determined by identifying the global maximum value in the feature response map.
The primary peak response evaluation function is defined as
$$\phi_{\text{state}}(s_{\max}) = \begin{cases} \text{Missing}, & s_{\max} < \tau_m \\ \text{Challenging negative}, & \tau_m \leq s_{\max} < \tau_{cn} \\ \text{Ambiguous}, & \tau_{cn} \leq s_{\max} < \tau_a \\ \text{Confirmed}, & s_{\max} \geq \tau_a \end{cases}$$
where $\tau_m$, $\tau_{cn}$ and $\tau_a$ are threshold coefficients, set to 0.25, 0.5, and 0.8, respectively. In this study, $s_{\max}$ represents the global maximum value of the response map. The four intervals defined by these thresholds correspond to target missing, challenging negative samples, ambiguous state, and target confirmation.
Under ideal conditions, when target features are relatively prominent, the response map exhibits a single dominant peak, indicating high-confidence target localisation. However, in practical scenarios, the score map often demonstrates complex multi-peak distribution characteristics due to potential interference from objects with similar appearances. These secondary peaks reflect candidate regions in the scene that share feature similarities with the target. Through systematic analysis of these peaks—including their spatial distribution, intensity contrast, and temporal evolution characteristics—we can effectively establish discriminative criteria between targets and distractors, thereby enhancing the robustness of the tracker in complex environments. Specifically, relative intensity variations between peaks reflect the distinguishability between targets and background interference, while spatial relationships among peak positions reveal potential target–distractor spatial distribution patterns.
It should be noted that when the maximum response value of the primary peak satisfies $s_{\max} < \tau_m$, the target is directly determined as “Missing”. This study exclusively analyses secondary peak interference under the condition $s_{\max} \geq \tau_m$, aiming to promptly reflect the target’s confidence state and provide a basis for subsequent operations, including model cognition and learning rate adjustment. Figure 7 demonstrates examples of interfering targets.
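As a minimal sketch, the primary-peak state function above can be written directly with the stated thresholds (function and variable names are illustrative):

```python
TAU_M, TAU_CN, TAU_A = 0.25, 0.5, 0.8   # thresholds stated above

def primary_peak_state(s_max):
    """Map the global maximum of the response map to a confidence state."""
    if s_max < TAU_M:
        return "Missing"
    if s_max < TAU_CN:
        return "Challenging negative"
    if s_max < TAU_A:
        return "Ambiguous"
    return "Confirmed"
```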
When the maximum response value $s_d$ of the secondary peak exceeds the preset threshold $\theta$, the displacement consistency metric is calculated:
$$\theta = \gamma_d \cdot s_{\max},$$
$$\Delta_k = \left\| p_k - p_{\text{prev}} \right\|_2, \quad k \in \{1, 2\},$$
where $\gamma_d$ is the interference discrimination coefficient, $p_{\text{prev}}$ represents the predicted target position in the previous frame, and $p_k$ denotes the peak position in the current frame.
The secondary peak interference analysis is determined based on displacement conditions formulated as follows:
$$\Phi_d = \begin{cases} \text{Challenging negative}, & (\Delta_1 < \tau_d) \oplus (\Delta_2 < \tau_d) \\ \text{Ambiguous}, & (\Delta_1 > \tau_d) \wedge (\Delta_2 > \tau_d) \end{cases}$$
$$\tau_d = \eta \cdot W H / 2,$$
where $\oplus$ denotes the exclusive OR operation, $\wedge$ denotes the logical AND, $\tau_d$ represents the displacement threshold, and $\eta$ indicates the displacement sensitivity coefficient. A target is classified as “Challenging negative” (Scenario One in Figure 7) when exactly one peak is consistent with the previous frame position, i.e., the primary peak’s distance to the previous frame position exceeds $\tau_d$ while the secondary peak’s distance is below it, or vice versa. The “Ambiguous” state (Scenario Two in Figure 7) occurs when the distances from both the primary and secondary peaks to the previous frame position exceed $\tau_d$.
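A minimal sketch of this secondary-peak displacement analysis, assuming the primary and secondary peak positions have already been extracted from the response map and that W and H are the search-region dimensions (names are illustrative):

```python
import numpy as np

def displacement_state(p1, p2, p_prev, W, H, eta=0.1):
    """Classify a multi-peak response using the displacement-consistency rule above.

    p1, p2 : (x, y) positions of the primary and secondary peaks;
    p_prev : predicted target position in the previous frame;
    W, H   : search-region width and height; eta : displacement sensitivity coefficient.
    """
    tau_d = eta * W * H / 2          # displacement threshold as given in the text
    d1 = float(np.linalg.norm(np.asarray(p1) - np.asarray(p_prev)))
    d2 = float(np.linalg.norm(np.asarray(p2) - np.asarray(p_prev)))

    near1, near2 = d1 < tau_d, d2 < tau_d
    if near1 != near2:               # exclusive OR: exactly one peak stays consistent
        return "Challenging negative"
    if not (near1 or near2):         # both peaks far from the previous position
        return "Ambiguous"
    return None                      # both peaks close: no distractor state assigned
```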

3.3.2. Spatio-Temporal Consistency Relationship

Image search region partitioning serves as a critical component in the re-detection process. Traditional global sliding-window search methods not only incur high computational costs but also struggle to capture spatio-temporal consistency relationships in complex interference environments effectively. To address this, we propose a random sampling-based adaptive search strategy. As demonstrated in [47], random search strategies exhibit superior efficiency compared to global grid search methods. Unlike traditional global sliding-window searches, the random approach can locate target image patches with fewer iterations, thereby accelerating target detection.
As shown in Figure 6, our method first partitions the image globally at fixed intervals to construct multi-scale search units. The search iteration count is then dynamically adjusted based on the size ratio between the target and the image. Specifically, when dealing with larger targets, we reduce the number of iterations to lower computational complexity, whereas, for smaller targets, we increase the number of iterations to enhance detection accuracy. This adaptive mechanism maintains efficiency while ensuring detection precision, significantly improving re-detection performance.
For target verification, when a newly detected target’s confidence score $s_n$ exceeds a predefined threshold, the video frame where it first appears is marked as a new initial frame to restart tracking. However, this re-detection approach has notable limitations: even when a newly detected object shows substantial spatial deviation from the original target post-loss, the system may still incorrectly identify it as the same target. As illustrated in Figure 7, when the target moves out of view at frame 514, interference objects might be falsely detected as new targets. To mitigate this, we introduce spatio-temporal consistency constraints [40] to refine confidence scores, mathematically expressed as
$$s_n' = w_b \left( 1 - \alpha \cdot \frac{\Delta_k}{d_{\max}} \cdot e^{-\beta (t_{\text{new}} - t_{\text{old}})} \right) \cdot s_n,$$
where $s_n'$ is the refined confidence score, $w_b$ represents the re-detection hyperparameter, $\alpha$ denotes the spatial consistency hyperparameter, $\beta$ indicates the temporal consistency hyperparameter, and $d_{\max}$, $p$, and $t$ correspond to the image diagonal length, position vector, and frame count, respectively. When a newly detected target position exhibits a significant spatial deviation from its previous location in the frame, its matching score decreases due to the spatial discrepancy. This mechanism effectively prevents mismatches at distant locations by adapting to spatial distance. Simultaneously, if the target remains undetected across consecutive frames, the influence of spatial factors becomes progressively balanced by temporal considerations. This accounts for scenarios where targets may reappear at more distant positions after a prolonged absence. The incorporation of temporal factors permits target detection at relatively remote locations while maintaining spatio-temporal coherence.
The spatio-temporal consistency-guided re-detection workflow proposed in this study is formalised in Algorithm 1, which takes as input images where the system confidence state is classified as “Missing”, thereby initiating our re-detection approach.
Algorithm 1: Spatio-temporal consistency-guided re-detection.
Input: i-th image and the number of consecutively lost frames c
Output: The bounding box P n e w of the i-th image and its new confidence ϕ i
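Because the pseudocode of Algorithm 1 is provided as a figure in the published article, the following is only a minimal Python sketch of the re-detection flow described above; the tracker methods (`sample_random_patch`, `classify`, `peak_to_box`) and the numeric hyperparameter values are hypothetical placeholders, not the authors' implementation:

```python
import math

# Illustrative hyperparameters; the paper does not restate numeric values for w_b, alpha, beta.
W_B, ALPHA, BETA = 1.0, 0.5, 0.05
TAU_REDETECT = 0.5   # confidence required to accept a re-detected target (illustrative)

def redetect(image, tracker, p_old, t_old, t_now, d_max, n_iters):
    """Sketch of spatio-temporal consistency-guided re-detection (Algorithm 1).

    p_old, t_old : last confident target position and its frame index;
    t_now        : current frame index; d_max : image diagonal length;
    n_iters      : random-search iterations, adapted to the target/image size ratio.
    """
    best_box, best_score = None, 0.0
    for _ in range(n_iters):
        patch, offset = tracker.sample_random_patch(image)   # random search unit
        score_map = tracker.classify(patch)                  # response map on the patch
        s_n, box = tracker.peak_to_box(score_map, offset)    # candidate box + raw confidence

        # Spatio-temporal consistency refinement of the raw confidence (equation above).
        delta = math.hypot(box.center[0] - p_old[0], box.center[1] - p_old[1])
        s_refined = W_B * (1.0 - ALPHA * (delta / d_max)
                           * math.exp(-BETA * (t_now - t_old))) * s_n

        if s_refined > best_score:
            best_box, best_score = box, s_refined

    if best_score >= TAU_REDETECT:
        return best_box, best_score   # accept and restart tracking from this frame
    return None, best_score           # target remains in the "Missing" state
```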

3.4. Enhanced Feature Learning Based on Background Augmentation

The annotated initial frame serves as reliable prior information, and its effective utilisation is crucial for tracking performance. However, existing end-to-end tracking paradigms often suffer from inefficient use of initial frame information, primarily due to their limited exploitation of only the current frame’s search region. To address this issue, this study designs an Enhanced Feature Learning Based on Background Augmentation (ELB) module, as illustrated in Figure 8, which consists of two components: a data augmentation strategy and a feature learning strategy.
Data augmentation strategy. During the initial frame feature learning phase, we implement a multi-dimensional data augmentation approach on the original search region images to enhance the tracker’s representational capacity and robustness, storing the augmented data in internal memory. As detailed in Table 2, our augmentation strategy incorporates six image processing techniques: horizontal flipping, scale variation, viewpoint translation, rotation, Gaussian blurring, and salt-and-pepper noise. This multi-dimensional augmentation methodology not only substantially increases training sample diversity but also effectively improves the model’s adaptability to target appearance variations.
Feature learning strategy. This study proposes a cross-region semantic fusion mechanism to exploit global background information in images fully. This mechanism adopts a multi-scale perspective to comprehensively integrate target features with global background representations, thereby enhancing the model’s regional perception and target localisation capabilities. Specifically, to address the insufficient utilisation of background information in traditional tracking algorithms, we introduce a background fusion-based sample enhancement method that prevents tracking drift caused by limited positive sample training. The detailed workflow of this method is as follows:
In the annotated first frame, a set of candidate regions $S = [p_1, p_2, p_3, \ldots, p_n]$ is randomly sampled from background areas outside the given target search region, with each $p_i$ maintaining an equal area to the target search region to ensure proper scale matching. The ground-truth target $x$ is obtained by applying binary mask processing to the given target bounding box in the first frame. The target $x$ is then combined with each $p_i$ to form an augmented sample set $P = [p_1, p_2, p_3, \ldots, p_n]$, where each element now represents the target appearing against the corresponding background. To further enhance training data diversity, the augmented sample set $P$ undergoes the data augmentation methods defined in Table 2, simulating real-world background interference. To prevent excessive computational burden from sample generation, a background fusion parameter $\lambda$ is introduced: samples exceeding $\lambda$ are not stored in memory. Through the ELB module, the model maintains stable tracking performance across varying conditions, enabling robust handling of rapid motion scenarios.
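A minimal NumPy sketch of this background-fusion step, assuming the first-frame image, the target bounding box, and a binary target mask are available (names and the overlap test are illustrative simplifications):

```python
import numpy as np

def build_augmented_samples(frame, target_box, target_mask, n_samples, lam=80, rng=None):
    """Composite the masked first-frame target onto randomly sampled background crops.

    frame       : 2-D greyscale first frame; target_box : (x, y, w, h);
    target_mask : (h, w) boolean mask of the target inside its box;
    lam         : background fusion parameter capping the number of stored samples.
    """
    if rng is None:
        rng = np.random.default_rng()
    x, y, w, h = target_box
    target = frame[y:y + h, x:x + w]
    H, W = frame.shape[:2]

    samples = []
    for _ in range(min(n_samples, lam)):          # samples beyond lam are not stored
        # Sample a crop of the same size, roughly outside the target region.
        while True:
            bx = int(rng.integers(0, W - w))
            by = int(rng.integers(0, H - h))
            if abs(bx - x) > w or abs(by - y) > h:
                break
        crop = frame[by:by + h, bx:bx + w].copy()
        crop[target_mask] = target[target_mask]   # paste the masked target onto the crop
        samples.append(crop)                      # Table 2 augmentations would follow here
    return samples
```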

4. Experiments

4.1. Dataset and Metrics

4.1.1. Dataset

The AntiUAV410 [11] dataset contains 410 thermal infrared video sequences, totalling over 438k annotated frames. These sequences are divided into training (200 sequences), validation (90 sequences), and testing (120 sequences) subsets. On average, each sequence consists of 1069 frames. The dataset encompasses two seasons (summer and autumn) and lighting conditions (day and night) while featuring diverse complex backgrounds including buildings, mountains, forests, urban areas, clouds, and water surfaces. The dataset encompasses six challenging scenarios, including an occluded target, a target out of view, a target in fast motion, scale variation, thermal crossover, and dynamic background clutter. More than half of the targets are smaller than 50 pixels, including some cases where targets are fewer than 10 pixels in size.
The AntiUAV600 [12] dataset consists of 300 thermal infrared training sequences, 50 validation sequences, and 250 test sequences, containing 337K frames for training, 56K for validation, and 330K for testing. All frames were captured by a thermal infrared camera with a resolution of 640 × 512 pixels at a frame rate of 25 fps. The dataset encompasses six challenging scenarios, including an occluded target, a target out of view, a target in fast motion, scale variation, thermal crossover, and dynamic background clutter. Compared to AntiUAV410, the AntiUAV600 dataset contains a significant number of fast-motion sequences. It should be noted that the AntiUAV600 test set is not publicly available. Therefore, we conducted scene partitioning on the validation set, following the partitioning method of AntiUAV410. Ultimately, we organised 49 sequences from the validation set.

4.1.2. Metrics

To validate FSTC-DiMP’s performance by comparing it with state-of-the-art models, this study employs the One-Pass Evaluation (OPE) method, utilising three standard metrics: precision rate, success rate, and state accuracy.
The precision rate (P) refers to the percentage of frames in which the difference between the tracking bounding box and the ground-truth centre position is below a threshold.
$$P = \frac{n_{\Delta d < T_p}}{N}.$$
The success rate (S) refers to the percentage of frames in which the Intersection over Union (IoU) ratio between the tracking bounding box and the ground truth exceeds a specified threshold.
$$S = \frac{n_{IoU_t > T_s}}{N}.$$
Here, $IoU_t$ denotes the Intersection over Union between the tracking bounding box and the ground truth in frame $t$. The threshold $T_s$ takes values in the range [0, 1]. The success rate versus the threshold is plotted in a success plot, and the AUC of this plot serves as a metric for ranking tracker performance.
The state accuracy is obtained by computing the average overlap rate between the tracking bounding box and ground truth across the test sequences.
$$SA = \frac{\sum_{t} \left[ IoU_t \times \delta(v_t > 0) + p_t \times \left(1 - \delta(v_t > 0)\right) \right]}{T},$$
where $v_t$ is the ground-truth visibility flag of the target in frame $t$ and $\delta(\cdot)$ is the indicator function. State accuracy (SA) evaluates alignment precision while leveraging the tracker’s predicted visibility $p_t$.
This metric requires the tracker to not only accurately predict the target’s position and scale but also to correctly determine whether the target is within the field of view, imposing stringent requirements on target existence judgment.
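The precision and success metrics above can be summarised with the following NumPy sketch; the 20-pixel precision threshold and the 101-point threshold sweep for the AUC are common conventions assumed here rather than values stated in the paper:

```python
import numpy as np

def precision_rate(center_errors, tp=20.0):
    """Fraction of frames whose centre-location error is below tp pixels."""
    return float((np.asarray(center_errors, dtype=float) < tp).mean())

def success_rate(ious, ts=0.5):
    """Fraction of frames whose IoU with the ground truth exceeds ts."""
    return float((np.asarray(ious, dtype=float) > ts).mean())

def success_auc(ious, num_thresholds=101):
    """Area under the success plot: mean success rate over thresholds in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([success_rate(ious, t) for t in thresholds]))
```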

4.2. Implementation Details

Training. The FSTC-DiMP builds upon Super-DiMP, integrating the standard classifier from DiMP with the probabilistic bounding box regressor of PrDiMP. Our tracker is implemented in Python 3.7, PyTorch 1.13.1, and CUDA 11.8. The experimental platform runs on Ubuntu 20.04 with dual 24 GB NVIDIA RTX 4090 GPUs. The model utilises a ResNet-50 [48] backbone and is trained using the Adam optimiser for 50 epochs, each consisting of 2000 iterations. The training logs indicate that when the number of training epochs exceeds 45, the training loss stabilises while the validation loss begins to increase significantly. Therefore, we infer that the model’s optimal weights are obtained at the 45th epoch. During the training process, we set the batch size to 16.
Inference. In the feature enhancement extraction stage, this study employs convolutional kernels of sizes 5 × 5 and 7 × 7 to capture detailed and global contextual information, respectively. The dilation rate $d$ is set to 3 to expand the model’s receptive field. In the re-detection stage, the interference discrimination coefficient $\gamma_d$ and the displacement sensitivity coefficient $\eta$ are configured as 0.8 and 0.1, respectively. During the data augmentation phase, the salt-and-pepper noise ratio is set to 0.001, the polygonal node rectangular blocks are set to 1–3 pixels, and the background fusion parameter $\lambda$ is set to 80.
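For reference, these inference-time settings can be collected into a single configuration; the key names below are illustrative:

```python
# Inference hyperparameters reported in Section 4.2 (key names are illustrative).
FSTC_DIMP_CONFIG = {
    "lsk_kernel_sizes": (5, 7),        # detail / global-context convolution kernels
    "lsk_dilation": 3,                 # dilation rate of the large-kernel branch
    "gamma_d": 0.8,                    # interference discrimination coefficient
    "eta": 0.1,                        # displacement sensitivity coefficient
    "salt_pepper_ratio": 0.001,        # salt-and-pepper noise ratio for augmentation
    "background_fusion_lambda": 80,    # maximum number of background-fused samples
}
```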

4.3. Comparison with State-of-the-Art Methods

In this section, we conduct a comprehensive performance comparison between the proposed FSTC-DiMP tracker and several state-of-the-art approaches in single-object tracking. The evaluation includes two baseline trackers (SiamBAN [49] and ATOM [18]), representative trackers from different paradigms (Super-DiMP, DiMP50 [19], PrDiMP50 [20], KYS [22], SiamCAR [17], Stark-ST101 [50], AIATrack [51], ToMP50 [52], ToMP101 [52] and DropTrack [53]), along with SiamDT [11], the official state-of-the-art reference algorithm for the AntiUAV410 benchmark dataset.
AntiUAV410 Dataset. To verify the performance of FSTC-DiMP in anti-UAV tracking, we conducted comparative experiments with representative single-object trackers on the AntiUAV410 training and testing sets. To ensure the rigour of the experiments, all participating trackers used their original pre-trained weights and default parameter configurations. The experimental results are presented in Table 3. FSTC-DiMP achieved AUC, P, and SA values of 67.7%, 91.3%, and 69.1%, respectively. Compared to SiamDT, the official state-of-the-art reference algorithm for the AntiUAV410 benchmark, FSTC-DiMP improved AUC by 0.9%, P by 1.8%, and SA by 0.9%. This advantage is further illustrated in Figure 9, which displays the success and precision plots of the retrained trackers. Notably, FSTC-DiMP outperformed ATOM, DiMP50, PrDiMP50, and Super-DiMP, all of which are based on the two-stage target tracking framework, by more than 10%.
AntiUAV600 Dataset. We also conducted comparative experiments of FSTC-DiMP against other SOT algorithms on the AntiUAV600 dataset, where all trackers were evaluated using their original pre-trained weights and default parameter settings. The experimental results are presented in Table 3. FSTC-DiMP achieved 53.6% AUC, 79.9% P, and 54.4% SA. On the AntiUAV600 dataset, the proposed FSTC-DiMP algorithm outperforms most state-of-the-art single-object tracking methods, being slightly inferior only to SiamDT. Figure 9 displays the success and precision plots of the retrained trackers, illustrating that the proposed method not only achieves higher success rates but also improved tracking precision.
Meanwhile, this study evaluates the stability of the FSTC-DiMP algorithm using the AntiUAV410 test set and the AntiUAV600 validation set. As shown in Table 4, all metrics are presented with 95% confidence intervals (CIs). The analysis reveals that the AntiUAV410 test set exhibits narrower confidence intervals (e.g., a CI width of 6.6% for AUC), which not only validates the method’s stable performance in complex scenarios but also reflects the enhanced statistical reliability due to the larger sample size. In contrast, the AntiUAV600 validation set, which has fewer sequences and includes some highly challenging ones, demonstrates significantly wider confidence intervals. The detailed performance of this method under different scenario attributes will be thoroughly discussed in Section 4.4.
Visualisation. To evaluate tracking performance, this study presents a visual comparison of the assessment results, as illustrated in Figure 10. A comparative analysis was conducted against state-of-the-art algorithms across six typical challenging scenarios. The experimental results demonstrate the significant advantages of FSTC-DiMP, including rapid re-detection of lost targets, avoidance of local optima traps, and reduced recovery time after tracking failures. Under complex scene conditions, the proposed algorithm maintains high localisation accuracy while exhibiting superior robustness against various visual disturbances.

4.4. Attribute-Based Analysis

To comprehensively verify the robustness of FSTC-DiMP in scenarios with targets of different scales, we evaluated the results in Table 3 based on four critical size categories: normal size (NS), medium size (MS), small size (SS), and tiny size (TS). Figure 11 displays the precision and success rate evaluation curves of various tracking algorithms across these scale categories. Quantitative analysis demonstrates that FSTC-DiMP achieves excellent performance in the vast majority of test scenarios. Specifically, FSTC-DiMP outperforms other algorithms in NS, MS, and SS categories. For TS, FSTC-DiMP surpasses most single-object trackers, with only SiamDT showing slightly better performance. In terms of both precision and success rate, FSTC-DiMP achieves 8% and 12% improvements over Super-DiMP, respectively. These results conclusively demonstrate that our tracker exhibits significant performance advantages in handling target-tracking tasks across various scales.
Subsequently, we evaluated the results in Table 3 according to different scenario attributes, categorised as follows: dynamic background clutter (DBC), scale variation (SV), fast motion (FM), occlusion (OC), thermal crossover (TC), and out of view (OV). The results in Figure 12 demonstrate that our tracker achieves strong performance across most scenarios. In particular, for the fast motion (FM) attribute, our tracker demonstrates significant improvements over the baseline Super-DiMP, with a 24% higher precision score and a 25% higher success rate. This enhancement stems from our work in feature processing. For the out-of-view (OV) attribute, FSTC-DiMP achieves a remarkable performance boost in tracking precision (over 15% improvement compared to Super-DiMP) and also exhibits a clear advantage in success rate (over 19% improvement compared to Super-DiMP).
These results conclusively demonstrate the efficacy of our proposed spatio-temporal consistency-guided re-detection mechanism in handling target exits and re-entries, thereby significantly enhancing the tracker’s capability to cope with complex scenarios. Extensive experimental results provide compelling evidence that FSTC-DiMP can effectively address the diverse challenges encountered in real-world tracking environments.
The radar chart in Figure 13 presents the overall performance of various trackers across multiple attributes. The visualisation shows that the FSTC-DiMP tracking algorithm achieves excellent rankings in most attributes, with its radar chart covering a larger area, which intuitively reflects the algorithm’s superior overall performance.
In addition, to demonstrate the effectiveness of our module, the IoU curves of FSTC-DiMP and the baseline Super-DiMP are evaluated as shown in Figure 14. The FSTC-DiMP algorithm exhibits stable tracking performance in complex scenarios (such as target movement beyond the field of view, occlusion, and camera motion). In cases where the baseline algorithm fails to track, FSTC-DiMP can maintain good tracking performance, thanks to the spatio-temporal consistency re-detection module.

4.5. Ablation Studies

4.5.1. Module Effectiveness Analysis

To investigate the effectiveness of different components in FSTC-DiMP, we conducted comprehensive ablation experiments on both the AntiUAV410 validation set and the AntiUAV600 validation set, accompanied by detailed analysis. To ensure the reliability of experimental results, we employed Super-DiMP as the baseline model while maintaining consistent hyperparameter configurations during training.
As shown in Table 5, the ablation study results demonstrate the varying degrees of contribution from our proposed modules. Taking the AntiUAV410 dataset as an example, the introduction of the LSK attention mechanism improves performance to 64.9% AUC (+0.7), confirming its effectiveness in enhancing feature extraction. Subsequent incorporation of the STR module further boosts performance to 67.4% AUC (+3.5), highlighting the critical role of spatio-temporal consistency-based re-detection in object tracking tasks. Ultimately, our complete proposed architecture achieves additional performance gains, reaching 68.1% AUC (+4.2), which validates the efficacy of the full FSTC-DiMP algorithm.
At the same time, we analysed the GPU memory usage of FSTC-DiMP when processing images of different resolutions and the inference time required for processing a single frame. The results are presented in Table 6. As the image resolution increases, the resource consumption and computation time of the model rise significantly; at a resolution of 512 × 512, it requires only 2.15 GB of memory and 0.152 s of inference time. Processing images at a resolution of 4096 × 4096 causes memory overflow, indicating that this resolution exceeds the hardware’s processing capacity. These data reflect the growth relationship between image resolution and computational resources in computer vision tasks, providing a valuable reference.

4.5.2. Analysis of Attention Mechanisms

The investigation into the influence of different attention mechanisms on our feature extraction enhancement module employed Super-DiMP as the baseline for comparative experiments conducted on the AntiUAV410 dataset, with results presented in Table 7. We compared several mainstream attention mechanisms, namely Convolutional Block Attention Module (CBAM) [54], Efficient Channel Attention (ECA) [55], and Expectation-Maximisation Attention (EMA) [56].
The LSK attention mechanism performs best on both the test and validation sets of AntiUAV410, consistently outperforming the other attention mechanisms on every evaluation metric. For instance, on the AntiUAV410 test set, integrating LSK attention raises the baseline to 61.4% AUC (+1.8) and 83.1% precision (+1.3). These findings confirm that the multi-scale receptive fields provided by LSK attention are well suited to anti-UAV tracking.
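To make the mechanism concrete, the following is a simplified sketch of a large-selective-kernel style attention block in the spirit of LSK: two depthwise branches with different effective receptive fields are fused by a per-pixel selection mask. The kernel sizes, dilation, and layer names are illustrative assumptions and do not reproduce the authors' exact module configuration.

```python
import torch
import torch.nn as nn

class LSKBlockSketch(nn.Module):
    """Simplified large-selective-kernel attention: a local branch and a
    large-receptive-field branch, combined by a spatial selection mask."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)                 # small receptive field
        self.context = nn.Conv2d(channels, channels, 7, padding=9, dilation=3, groups=channels)   # large receptive field
        self.squeeze = nn.Conv2d(2, 2, 7, padding=3)   # spatial attention over pooled descriptors
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a = self.local(x)
        b = self.context(a)
        pooled = torch.cat([                            # channel-wise average and max pooling
            (a + b).mean(dim=1, keepdim=True),
            (a + b).amax(dim=1, keepdim=True)], dim=1)
        sel = torch.sigmoid(self.squeeze(pooled))       # per-pixel selection weights for the two branches
        fused = a * sel[:, 0:1] + b * sel[:, 1:2]
        return x * self.proj(fused)                     # modulate the input features
```

The key design choice this sketch captures is that the selection mask is spatial, so each pixel can emphasise either the local or the long-range branch, balancing local focus with global context.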

4.5.3. Qualitative Analysis of the ELB Module

Background augmentation learning plays a crucial role in strengthening the model's feature learning: it improves the perception of potential target regions and prevents the tracking drift that can arise when training with limited positive samples. The background fusion parameter λ governs how much supplementary feature information is learned, making it a critical hyperparameter.
To investigate the impact of the background fusion parameter λ on overall model performance, we conducted experiments with varying λ values on both the AntiUAV410 and AntiUAV600 datasets while maintaining consistent hyperparameter configurations. The experimental results are presented in Table 8.
Specifically, when λ is set to 80, FSTC-DiMP achieves the best performance on both benchmarks, obtaining the highest AUC (67.7%) and precision (91.3%) on the AntiUAV410 test set and the highest AUC (53.6%) and precision (79.9%) on the AntiUAV600 validation set. When λ is set to 20 or 40, the model learns the supplementary feature representations inadequately and performance degrades; conversely, increasing λ to 100 causes overfitting and reduces tracking accuracy.
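Purely as an illustration, the snippet below treats λ as a percentage-style mixing weight between a target-centred crop and a background-only crop; this interpretation, the function name, and the usage line are assumptions for the sake of the example and do not reproduce the paper's exact fusion formulation.

```python
import numpy as np

def fuse_background(target_crop: np.ndarray, background_crop: np.ndarray, lam: float) -> np.ndarray:
    """Blend a target-centred crop with a background-only crop.

    `lam` is treated here as a percentage-style weight (e.g. 80 -> 0.8);
    this is an illustrative interpretation, not the paper's exact definition.
    """
    w = np.clip(lam / 100.0, 0.0, 1.0)
    fused = w * target_crop.astype(np.float32) + (1.0 - w) * background_crop.astype(np.float32)
    return fused.clip(0, 255).astype(target_crop.dtype)

# Hypothetical usage:
# augmented = fuse_background(init_frame_crop, distant_background_crop, lam=80)
```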

5. Discussion

Although the FSTC-DiMP algorithm performs well in thermal infrared anti-UAV tracking, several aspects of its tracking behaviour still require optimisation. The first is the slow re-detection of targets after prolonged occlusion in complex scenes. Figure 15 includes a typical challenging sequence, “20190925_134301_1_6”: the target reappears from behind an occluding building at frame 102, yet FSTC-DiMP does not reacquire it until frame 111. This delay indicates that the current algorithm recovers from occlusion inefficiently.
The second issue concerns target scale in long-range imaging. When the target is far from the camera, it occupies only a few pixels. According to the scale classification criteria of the AntiUAV410 dataset, targets whose bounding-box diagonal is shorter than 10 pixels are categorised as “Tiny size”. As shown in Figure 15, in the sequence “3700000000002_144152_1” the target approaches this threshold (approximately 10 pixels) at frame 160, where the proposed method still tracks stably; however, once the target shrinks below 8 pixels (e.g., at frames 495 and 663), tracking fails because of the extremely small target scale.
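The size criterion quoted above is straightforward to apply in code. The sketch below encodes only the “Tiny size” rule from AntiUAV410 (diagonal shorter than 10 pixels); the [x, y, w, h] box format and the example values are assumptions.

```python
import math

def is_tiny(box) -> bool:
    """AntiUAV410 'Tiny size' rule: bounding-box diagonal shorter than 10 pixels.
    Box format is assumed to be [x, y, w, h]."""
    _, _, w, h = box
    return math.hypot(w, h) < 10.0

# e.g. is_tiny([412, 236, 6, 5]) -> True (diagonal ~7.8 px)
```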
Future research should therefore focus on developing a more robust tracking algorithm, particularly for occluded targets and extreme scale variation in low-resolution imagery.

6. Conclusions

This study proposes a novel anti-UAV tracking algorithm, FSTC-DiMP. To address insufficient feature extraction and limited receptive fields in low-SCR images, we improve the feature extraction process with an LSK attention mechanism, which balances local feature focus with global information integration. To address tracking loss caused by the target entering and exiting the field of view or by interference from similar objects, we design a spatio-temporal consistency-guided re-detection mechanism that verifies newly detected targets through spatio-temporal consistency analysis. In addition, to make better use of initial-frame information and learn richer feature representations, we incorporate a background augmentation module that captures the semantic features of both the target and its surrounding environment. The proposed method was systematically evaluated on the challenging AntiUAV410 and AntiUAV600 benchmarks and compared with existing state-of-the-art trackers. Experimental results demonstrate that FSTC-DiMP achieves significant performance improvements in anti-UAV tracking, confirming its robustness and adaptability to complex environments.

Author Contributions

Conceptualisation, D.B. and X.T.; methodology, D.B. and B.S.; validation, X.S. and S.S.; formal analysis, D.B. and R.G.; investigation, B.D. and X.S.; resources, D.B. and X.T.; data curation, B.D. and R.G.; software implementation, D.B.; writing—original draft preparation, D.B. and B.D.; writing—review and editing, D.B. and B.D.; supervision, B.S. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Z.; Hu, Y.; Yang, J.; Zhou, G.; Liu, F.; Liu, Y. A Contrastive-Augmented Memory Network for Anti-UAV Tracking in TIR Videos. Remote Sens. 2024, 16, 4775. [Google Scholar] [CrossRef]
  2. Dang, Z.; Sun, X.; Sun, B.; Guo, R.; Li, C. OMCTrack: Integrating Occlusion Perception and Motion Compensation for UAV Multi-Object Tracking. Drones 2024, 8, 480. [Google Scholar] [CrossRef]
  3. Liu, S.; Xu, T.; Zhu, X.F.; Wu, X.J.; Kittler, J. Learning adaptive detection and tracking collaborations with augmented UAV synthesis for accurate anti-UAV system. Expert Syst. Appl. 2025, 282, 127679. [Google Scholar] [CrossRef]
  4. Huchuan, L.U.; Peixia, L.I.; Dong, W. Visual Object Tracking: A Survey. Pattern Recognit. Artif. Intell. 2018, 222, 103508. [Google Scholar] [CrossRef]
  5. Abdelaziz, O.; Shehata, M.; Mohamed, M. Beyond traditional visual object tracking: A survey. Int. J. Mach. Learn. Cybern. 2025, 16, 1435–1460. [Google Scholar] [CrossRef]
  6. Jiang, W.; Pan, H.; Wang, Y.; Li, Y.; Lin, Y.; Bi, F. A Multi-Level Cross-Attention Image Registration Method for Visible and Infrared Small Unmanned Aerial Vehicle Targets via Image Style Transfer. Remote Sens. 2024, 16, 2880. [Google Scholar] [CrossRef]
  7. Guo, L.; Rao, P.; Gao, C.; Su, Y.; Li, F.; Chen, X. Adaptive Differential Event Detection for Space-Based Infrared Aerial Targets. Remote Sens. 2025, 17, 845. [Google Scholar] [CrossRef]
  8. Tong, X.; Sun, X.; Zuo, Z.; Su, S.; Wu, P.; Wei, J.; Guo, R. GSFNet: Gyro-Aided Spatial-Frequency Network for Motion Deblurring of UAV Infrared Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5003718. [Google Scholar] [CrossRef]
  9. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared Patch-Image Model for Small Target Detection in a Single Image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef] [PubMed]
  10. Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Guo, G.; Ye, Q.; Jiao, J.; et al. Anti-UAV: A Large-Scale Benchmark for Vision-Based UAV Tracking. IEEE Trans. Multimed. 2023, 25, 486–500. [Google Scholar] [CrossRef]
  11. Huang, B.; Li, J.; Chen, J.; Wang, G.; Zhao, J.; Xu, T. Anti-UAV410: A Thermal Infrared Benchmark and Customized Scheme for Tracking Drones in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2852–2865. [Google Scholar] [CrossRef] [PubMed]
  12. Zhu, X.F.; Xu, T.; Zhao, J.; Liu, J.W.; Wang, K.; Wang, G.; Li, J.; Wang, Q.; Jin, L.; Zhu, Z.; et al. Evidential detection and tracking collaboration: New problem, benchmark and algorithm for robust anti-uav system. arXiv 2023, arXiv:2306.15767. [Google Scholar] [CrossRef]
  13. Dong, X.; Shen, J. Triplet Loss in Siamese Network for Object Tracking. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 472–488. [Google Scholar]
  14. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar] [CrossRef]
  15. Zha, Y.; Wu, M.; Qiu, Z.; Dong, S.; Yang, F.; Zhang, P. Distractor-Aware Visual Tracking by Online Siamese Network. IEEE Access 2019, 7, 89777–89788. [Google Scholar] [CrossRef]
  16. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4277–4286. [Google Scholar] [CrossRef]
  17. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6268–6276. [Google Scholar] [CrossRef]
  18. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ATOM: Accurate Tracking by Overlap Maximization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4655–4664. [Google Scholar] [CrossRef]
  19. Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Learning Discriminative Model Prediction for Tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6181–6190. [Google Scholar] [CrossRef]
  20. Danelljan, M.; Van Gool, L.; Timofte, R. Probabilistic Regression for Visual Tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7181–7190. [Google Scholar] [CrossRef]
  21. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-Aware Anchor-Free Tracking. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 771–787. [Google Scholar]
  22. Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Know Your Surroundings: Exploiting Scene Information for Object Tracking. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 205–221. [Google Scholar]
  23. Qian, K.; Wang, J.S.; Zhang, S.J. STRCFD: Small Maneuvering Object Tracking via Improved STRCF and Redetection in Near Infrared Videos. IEEE Access 2024, 12, 2901–2913. [Google Scholar] [CrossRef]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  25. Kang, B.; Chen, X.; Wang, D.; Peng, H.; Lu, H. Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9578–9587. [Google Scholar] [CrossRef]
  26. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8122–8131. [Google Scholar] [CrossRef]
  27. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580. [Google Scholar]
  28. Xue, Y.; Jin, G.; Shen, T.; Tan, L.; Wang, N.; Gao, J.; Wang, L. Consistent Representation Mining for Multi-Drone Single Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 10845–10859. [Google Scholar] [CrossRef]
  29. Fang, H.; Wu, C.; Wang, X.; Zhou, F.; Chang, Y.; Yan, L. Online Infrared UAV Target Tracking with Enhanced Context-Awareness and Pixel-Wise Attention Modulation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005417. [Google Scholar] [CrossRef]
  30. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph Attention Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9538–9547. [Google Scholar] [CrossRef]
  31. Gao, P.; Zhang, Q.; Wang, F.; Xiao, L.; Fujita, H.; Zhang, Y. Learning reinforced attentional representation for end-to-end visual tracking. Inf. Sci. 2020, 517, 52–67. [Google Scholar] [CrossRef]
  32. Tao, J.; Chan, S.; Shi, Z.; Bai, C.; Chen, S. FocTrack: Focus attention for visual tracking. Pattern Recognit. 2025, 160, 111128. [Google Scholar] [CrossRef]
  33. Kalal, Z.; Mikolajczyk, K.; Matas, J. Tracking-Learning-Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1409–1422. [Google Scholar] [CrossRef]
  34. Lee, H.; Choi, S.; Kim, C. A Memory Model Based on the Siamese Network for Long-Term Tracking. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Leal-Taixé, L., Roth, S., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 100–115. [Google Scholar]
  35. Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam R-CNN: Visual Tracking by Re-Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6577–6587. [Google Scholar] [CrossRef]
  36. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  37. Tan, H.; Zhang, X.; Zhang, Z.; Lan, L.; Zhang, W.; Luo, Z. Nocal-Siam: Refining Visual Features and Response With Advanced Non-Local Blocks for Real-Time Siamese Tracking. IEEE Trans. Image Process. 2021, 30, 2656–2668. [Google Scholar] [CrossRef] [PubMed]
  38. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. arXiv 2022, arXiv:2203.11991. [Google Scholar] [CrossRef]
  39. Hou, Z.; Han, R.; Ma, J.; Ma, S.; Yu, W.; Fan, J. A global re-detection method based on Siamese network in long-term visual tracking. J. Real-Time Image Process. 2023, 20, 112. [Google Scholar] [CrossRef]
  40. Choi, S.; Lee, J.; Lee, Y.; Hauptmann, A. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction. In Proceedings of the Computer Vision—ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Bartoli, A., Fusiello, A., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 602–617. [Google Scholar]
  41. Gao, B.; Spratling, M.W. Explaining away results in more robust visual tracking. Vis. Comput. Int. J. Comput. Graph. 2023, 39, 2081–2095. [Google Scholar] [CrossRef]
  42. Zhang, M.; Zhang, Q.; Song, W.; Huang, D.; He, Q. PromptVT: Prompting for Efficient and Accurate Visual Tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7373–7385. [Google Scholar] [CrossRef]
  43. Ma, S.; Yang, Y.; Chen, G. AODiMP-TIR: Anti-occlusion thermal infrared targets tracker based on SuperDiMP. IET Image Process. 2024, 18, 1780–1795. [Google Scholar] [CrossRef]
  44. Chen, M.; Zhang, Z.; Jiang, N.; Li, X.; Zhang, X. YOLO-SRW: An Enhanced YOLO Algorithm for Detecting Prohibited Items in X-Ray Security Images. IEEE Access 2025, 13, 68323–68339. [Google Scholar] [CrossRef]
  45. Chen, D.; Tang, F.; Dong, W.; Yao, H.; Xu, C. SiamCPN: Visual tracking with the Siamese center-prediction network. Comput. Vis. Media 2021, 7, 253–265. [Google Scholar] [CrossRef]
  46. Zhuang, J.; Chen, W.; Huang, X.; Yan, Y. Band Selection Algorithm Based on Multi-Feature and Affinity Propagation Clustering. Remote Sens. 2025, 17, 193. [Google Scholar] [CrossRef]
  47. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  49. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese Box Adaptive Network for Visual Tracking. arXiv 2020, arXiv:2003.06761. [Google Scholar] [CrossRef]
  50. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. arXiv 2021, arXiv:2103.17154. [Google Scholar] [CrossRef]
  51. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. AiATrack: Attention in Attention for Transformer Visual Tracking. arXiv 2022, arXiv:2207.09603. [Google Scholar] [CrossRef]
  52. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Gool, L.V. Transforming Model Prediction for Tracking. arXiv 2022, arXiv:2203.11192. [Google Scholar] [CrossRef]
  53. Durve, M.; Tiribocchi, A.; Bonaccorso, F.; Montessori, A.; Lauricella, M.; Bogdan, M.; Guzowski, J.; Succi, S. DropTrack—Automatic droplet tracking with YOLOv5 and DeepSORT for microfluidic applications. Phys. Fluids 2022, 34, 082003. [Google Scholar] [CrossRef]
  54. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  55. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar] [CrossRef]
  56. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
Figure 1. Analysis of grey-value characteristics of UAV target. A comparison between (c,d) demonstrates that the features of the UAV target in complex backgrounds are prone to being obscured by background noise. (a) Original image. (b) SCR calculation region. (c) Global 3D grayscale image. (d) 3D grayscale image of the target region.
Figure 2. The line chart analyses the Signal-to-Clutter Ratio (SCR) on the AntiUAV410 dataset. It marks the lowest-SCR region and provides corresponding image examples.
Figure 3. Illustration of challenging scenarios in anti-UAV tracking. # indicates frame number.
Figure 4. The general framework of FSTC-DiMP.
Figure 5. Perceptual capabilities of CNN (a), self-attention (b), and LSK (c).
Figure 6. Overall framework of spatio-temporal consistency-guided re-detection.
Figure 7. When peak interference occurs, the system first retrieves the result from the previous frame (denoted by a red bounding box). The primary peak corresponds to the target region represented by either a blue bounding box or a yellow bounding box, while the other box indicates the secondary peak’s target region. "?" indicates target lost.
Figure 8. Enhanced Feature Learning Based on Background Augmentation.
Figure 9. The overall success plots (a) and precision plots (b) of FSTC-DiMP and other trackers on the AntiUAV410 test set, along with the overall success plots (c) and precision plots (d) on the AntiUAV600 validation set.
Figure 10. Qualitative comparison of five trackers on the AntiUAV410 dataset, covering six challenging sequences: DBC (dynamic background clutter), FM (fast motion), OC (occlusion), TC (thermal crossover), OV (out of view), and SV (scale variation).
Figure 11. Evaluation of FSTC-DiMP and other trackers on the AntiUAV410 test set in terms of target size, including normal size, medium size, small size, and tiny size. Precision plots (a–d) and success plots (e–h).
Figure 12. Attribute evaluation on the AntiUAV410 test set. In the precision plots (a–f), the legend values indicate the precision scores of the corresponding trackers, while in the success plots (g–l), the legend values represent the success AUC scores of the respective trackers.
Figure 13. Attribute-specific evaluation on the AntiUAV410 test set, with trackers ranked according to their AUC scores.
Figure 14. The IoU curves of six representative test sequences reflect the tracking quality. Ground truth is indicated by green boxes, red boxes represent the predictions of FSTC-DiMP, and blue boxes denote the predictions of Super-DiMP. Super-DiMP performs poorly in scenarios involving camera motion, the target leaving the field of view, and occlusion; FSTC-DiMP overcomes these challenges through the spatio-temporal consistency-guided re-detection mechanism.
Figure 15. Manifestations of our method’s limitations (green boxes indicate the ground truth, and red boxes represent the predictions of FSTC-DiMP).
Table 1. Comparison of feature extraction methods.
Method | Strategy | Limitations
Dual-Semantic Feature Extraction | Parallel extraction of “matching” semantics and “foreground” semantics | When target–background semantics are similar, the dual-branch structure amplifies misclassification
Centre-Prediction Feature Extraction | One-step mapping of CNN features into a single-channel centre confidence heatmap | Heatmap robustness degrades in complex scenes
GE-AP | Multi-feature similarity matrix construction | Computational overhead from texture feature processing
Our Method | Dynamic adjustment of receptive field range and selectivity | Elevated computational complexity
Table 2. Detailed explanation of image data augmentation methods.
Method | Content
Horizontal flipping | Mirror transformation is performed using the vertical central axis as the symmetry axis to enhance the perspective diversity of samples.
Scale variation | The scaling factors are set to 0.7 and 0.9 to simulate visual variations when the UAV target moves away from the camera.
Viewpoint translation | Slight viewpoint offsets are simulated during the shooting process.
Rotation | Fixed rotation transformations at ±45° are applied to simulate pitch variations induced by rapid target motion in real-world scenarios.
Gaussian blurring | Gaussian noise is introduced to simulate three optical degradation scenarios caused by UAV movement and sensor noise: slight defocus, image blurring, and moderate out-of-focus effects.
Salt-and-pepper noise | By configuring random cluster nodes to aggregate discrete noise into randomly distributed rectangular patches, realistic background noise interference in actual scenarios is effectively replicated.
Table 3. Quantitative comparison of FSTC-DiMP with single-object trackers on the AntiUAV410 test set and AntiUAV600 validation set (best results in bold, second-best in red, and third-best in blue).
Method | Source | AntiUAV410 AUC | AntiUAV410 P | AntiUAV410 SA | AntiUAV600 AUC | AntiUAV600 P | AntiUAV600 SA
ATOM | CVPR19 | 50.4 | 70.1 | 51.4 | 41.2 | 61.7 | 41.9
DiMP50 | ICCV19 | 55.5 | 75.9 | 56.7 | 47.5 | 70.7 | 48.3
PrDiMP50 | CVPR20 | 53.6 | 75.1 | 54.7 | 51.0 | 76.4 | 51.9
Super-DiMP | - | 59.6 | 81.8 | 60.8 | 49.3 | 74.3 | 50.2
KYS | ECCV20 | 44.1 | 63.9 | 44.9 | 39.5 | 61.8 | 39.9
SiamCAR | CVPR20 | 46.0 | 64.7 | 46.9 | 33.8 | 53.5 | 34.3
SiamBAN | CVPR20 | 46.5 | 67.3 | 47.3 | 28.2 | 47.0 | 28.5
Stark-ST101 | ICCV21 | 56.1 | 78.6 | 57.2 | 49.1 | 74.3 | 49.8
AiATrack | ECCV22 | 58.4 | 83.4 | 59.6 | 47.7 | 73.4 | 48.5
ToMP50 | CVPR22 | 54.0 | 74.0 | 55.1 | 46.3 | 69.8 | 47.1
ToMP101 | CVPR22 | 54.0 | 75.2 | 55.1 | 50.6 | 75.2 | 51.5
DropTrack | CVPR23 | 59.0 | 82.3 | 60.2 | 50.0 | 77.1 | 50.7
SiamDT | PAMI2024 | 66.8 | 89.5 | 68.2 | 54.0 | 82.3 | 54.8
Ours | - | 67.7 | 91.3 | 69.1 | 53.6 | 79.9 | 54.4
Table 4. Confidence interval analysis.
Dataset | Sequences | AUC (95% CI) | P (95% CI)
AntiUAV410 Test | 120 | 67.7 [64.4, 71.0] | 91.3 [87.8, 94.8]
AntiUAV600 Validation | 49 | 53.6 [45.7, 61.4] | 79.9 [72.0, 87.8]
Table 5. Comparison results of ablation experiments for each module, with the optimal performance metrics highlighted in bold.
Dataset | Baseline | LSK | STR | ELB | AUC | Δ
AntiUAV410 Validation | ✓ |  |  |  | 63.9 | -
AntiUAV410 Validation | ✓ | ✓ |  |  | 64.6 | +0.7
AntiUAV410 Validation | ✓ | ✓ | ✓ |  | 67.4 | +3.5
AntiUAV410 Validation | ✓ | ✓ | ✓ | ✓ | 68.1 | +4.2
AntiUAV600 Validation | ✓ |  |  |  | 49.3 | -
AntiUAV600 Validation | ✓ | ✓ |  |  | 49.9 | +0.6
AntiUAV600 Validation | ✓ | ✓ | ✓ |  | 53.1 | +3.8
AntiUAV600 Validation | ✓ | ✓ | ✓ | ✓ | 53.6 | +4.3
Table 6. Analysis of GPU memory usage and inference time across different image resolutions.
Model | Image Resolution | Memory Usage | Inference Time (s)
FSTC-DiMP | 512 × 512 | 2.15 GB | 0.152
FSTC-DiMP | 1024 × 1024 | 5.86 GB | 0.396
FSTC-DiMP | 2048 × 2048 | 20.38 GB | 1.451
FSTC-DiMP | 4096 × 4096 | Out of Memory | -
Table 7. Comparison results of experiments with different attention mechanisms, with the optimal performance metrics highlighted in bold.
Dataset | Baseline | Attention | AUC | P
AntiUAV410 Test | Super-DiMP | - | 59.6 | 81.8
AntiUAV410 Test | Super-DiMP | CBAM | 52.6 | 72.2
AntiUAV410 Test | Super-DiMP | ECA | 58.4 | 80.7
AntiUAV410 Test | Super-DiMP | EMA | 58.6 | 80.6
AntiUAV410 Test | Super-DiMP | LSK | 61.4 | 83.1
AntiUAV410 Validation | Super-DiMP | - | 63.9 | 85.5
AntiUAV410 Validation | Super-DiMP | CBAM | 55.5 | 74.6
AntiUAV410 Validation | Super-DiMP | ECA | 62.3 | 83.4
AntiUAV410 Validation | Super-DiMP | EMA | 64.1 | 86.2
AntiUAV410 Validation | Super-DiMP | LSK | 64.6 | 86.5
Table 8. Performance comparison under different background fusion parameters λ, with the best results highlighted in bold.
Dataset | Model | λ | AUC | P
AntiUAV410 Test | FSTC-DiMP | 40 | 67.0 | 90.2
AntiUAV410 Test | FSTC-DiMP | 60 | 67.4 | 90.8
AntiUAV410 Test | FSTC-DiMP | 80 | 67.7 | 91.3
AntiUAV410 Test | FSTC-DiMP | 100 | 67.4 | 90.8
AntiUAV600 Validation | FSTC-DiMP | 40 | 52.8 | 77.7
AntiUAV600 Validation | FSTC-DiMP | 60 | 52.6 | 78.2
AntiUAV600 Validation | FSTC-DiMP | 80 | 53.6 | 79.9
AntiUAV600 Validation | FSTC-DiMP | 100 | 53.4 | 79.7
