Article

MSTFT: Mamba-Based Spatio-Temporal Fusion for Small Object Tracking in UAV Videos

School of Automation and Electrical Engineering, Lanzhou University of Technology, Lanzhou 730050, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 256; https://doi.org/10.3390/electronics15020256
Submission received: 8 December 2025 / Revised: 2 January 2026 / Accepted: 4 January 2026 / Published: 6 January 2026

Abstract

Unmanned Aerial Vehicle (UAV) visual tracking is widely used but continues to face challenges such as unpredictable target motion, error accumulation, and the sparse appearance of small targets. To address these issues, we propose a Mamba-based Spatio-Temporal Fusion Tracker (MSTFT). First, to mitigate tracking drift caused by large displacements and abrupt pose changes, we introduce a Bidirectional Spatio-Temporal Mamba module, which employs bidirectional spatial scanning to capture discriminative local features and temporal scanning to model dynamic motion patterns. Second, to suppress error accumulation in complex scenes, we develop a Dynamic Template Fusion module with Adaptive Attention. This module integrates a threefold safety verification mechanism (based on response peak, temporal consistency, and motion stability) with a scale-aware strategy to enable robust template updates. Moreover, we design a Small-Target-Aware Context Prediction Head that utilizes a Gaussian-weighted prior to guide feature fusion and refines the loss function, significantly improving localization accuracy under sparse target features and strong background interference. On three major UAV tracking benchmarks (UAV123, UAV123@10fps, and UAV20L), MSTFT establishes a new state of the art with success AUCs of 79.4%, 76.5%, and 75.8%, respectively. More importantly, it maintains a tracking speed of 45 FPS, demonstrating a superior balance between precision and efficiency.

1. Introduction

UAV visual object tracking has been widely applied across numerous fields, including intelligent tasks [1], search and rescue in complex terrains [2], environmental monitoring [3], integrated detection–tracking systems [4], and autonomous operations [5]. Compared with traditional ground-view tracking, the complex relative motion patterns between the UAV and the target introduce significant inter-frame displacements, which places higher demands on the spatio-temporal modeling capability of trackers. Moreover, the high-altitude perspective of UAVs results in small, low-resolution targets with sparse features that easily blend into the background, and this inherent difficulty is further exacerbated by frequent environmental challenges such as background clutter, similar-object interference, and illumination variations [6,7]. Therefore, UAV visual object tracking remains an open research issue in the field of object tracking. Early discriminative correlation filter (DCF) methods offered computational efficiency but were limited by handcrafted features, resulting in poor performance in complex UAV scenes. Studies have shown that their fixed spatial modeling fails to adapt to target motion [8], they are ineffective for feature-sparse small targets [9], and they suffer severe drift in cluttered environments [10]. DCFs cannot capture discriminative information for low-resolution targets [11] and are inadequate for large-displacement tracking [12]. Deep learning-based methods substantially improved accuracy by leveraging hierarchical [13], end-to-end [14], and multi-scale [15] deep features. Siamese networks, in particular, demonstrated effectiveness through feature adaptation [16] and enhanced target-background discrimination [17]. However, CNN-based trackers are inherently limited by the local receptive field of convolution, which hinders global motion modeling [18], causes drift under large displacements [13], and struggles with dynamic scale changes [15]. The advent of Vision Transformers (ViTs) addressed this limitation via self-attention’s global modeling capability [19]. Transformer-based trackers subsequently achieved state-of-the-art results on UAV benchmarks through local-global fusion [20], lightweight designs [21], high-performance architectures [22], and structures balancing modeling with efficiency [23]. However, the quadratic computational complexity of Transformers significantly increases computational overhead on long sequences, making real-time tracking challenging—particularly when modeling long-range temporal dependencies [24].
Recently, Mamba models have garnered attention for their linear complexity and superior long-sequence modeling [25,26], prompting exploration into Mamba-based trackers. These include models utilizing hidden states for long-term context [27], Mamba-in-Mamba architectures for spatio-temporal modeling [28], methods decoupling motion and appearance with trajectory tokens [29], Motion Mamba for multi-object tracking [30], and visual state space models designed for tracking [31]. Nevertheless, existing methods still confront a set of critical limitations in tackling UAV-based small object tracking tasks: First, the lack of a dedicated architecture designed for the sparse and transient nature of small objects impedes robust modeling of their weak features; Second, their spatio-temporal modeling strategies are relatively simplistic and fail to effectively synergize spatial details with temporal dynamics, rendering them susceptible to tracking drift when targets exhibit rapid motion, large displacement, or abrupt pose changes; Third, the template update mechanism suffers from insufficient robustness under complex background interference, lacking reliable quality control and adaptive fusion mechanisms, which consequently introduces error accumulation in long-term tracking; Finally, the prediction head is not optimized for the high sensitivity of small object localization errors, thereby constraining further improvement in tracking accuracy.
The State Space Model (SSM) is the mathematical basis of the Mamba architecture, featuring linear computational complexity $O(N)$ that is well suited for UAV long-sequence tracking. We discretize the SSM via the Zero-Order Hold (ZOH) method with $\Delta = 0.01$ (matching UAV inter-frame motion speed) and hidden state dimension $N = 256$ (balancing feature representation and memory constraints), with input features enhanced by adaptive contrast normalization. Mamba extends the SSM with a selective scanning mechanism, and its visual adaptations (VMamba [32], Vision Mamba [33]) enable multi-directional scanning for 2D image tasks, which is key for capturing small-target local features. Template update is another critical foundation, aiming to balance target adaptivity and error suppression; for UAV small targets, it faces unique challenges of sparse feature discrimination, large displacement-induced unreliability, and long-term error accumulation.
To address the core challenges of small-object tracking in UAV videos, this study focuses on four interconnected research questions: (1) Designing a spatio-temporal modeling mechanism capable of capturing sparse local features and adapting to large inter-frame displacements, while maintaining linear computational complexity. (2) Developing a robust template update strategy that integrates quality verification and scale-aware adaptation to suppress error accumulation caused by occlusion, background clutter, and scale variation. (3) Optimizing the localization head to reduce the sensitivity of small targets to positional errors, enhancing accuracy under conditions of sparse features and strong background interference. (4) Integrating the above mechanisms into a unified framework to achieve an optimal balance between tracking accuracy, robustness, and computational efficiency, thereby providing a deployable solution for practical UAV applications.
We propose a Mamba-based Spatio-Temporal Fusion Tracker (MSTFT) specifically designed for UAV small-target tracking tasks. MSTFT follows a closed-loop framework of Feature Enhancement–Spatio-Temporal Modeling–Template Optimization–Accurate Prediction to systematically address the core challenges in UAV tracking. Specifically, the main contributions of this paper are summarized as follows:
(1)
Bidirectional Spatio-Temporal Mamba Module: It captures local spatial features of small targets through horizontal–vertical bidirectional scanning and models dynamic motion trends via forward–backward temporal scanning, achieving synergistic modeling of spatial details and temporal dynamics.
(2)
Dynamic Template Fusion module based on Adaptive Attention: It introduces a triple safety verification mechanism (response peak, consistency, and motion stability) and a size-aware attention fusion strategy to suppress erroneous updates and cumulative error propagation in complex backgrounds.
(3)
Small-Target-Aware Context Prediction Head: It employs a Gaussian-weighted small-target prior to guide feature fusion and optimizes the loss function specifically for small targets, thereby enhancing localization accuracy under complex backgrounds.
(4)
Extensive experiments demonstrate that MSTFT achieves state-of-the-art performance on multiple UAV tracking benchmarks, attaining a superior balance among accuracy, robustness, and efficiency, and providing an effective and efficient solution for UAV small-target tracking.
The remainder of this paper is organized as follows: Section 2 reviews related work on UAV visual tracking, Mamba-based tracking methods, and dynamic template update mechanisms. Section 3 details the overall architecture of MSTFT and the design principles of its three core modules. Section 4 presents comprehensive experimental settings, comparative results with mainstream algorithms, ablation studies, qualitative analysis, and computational efficiency evaluation. Section 5 discusses the architectural advantages of MSTFT and its practical deployment value on UAV platforms. Finally, Section 6 concludes the paper and outlines future research directions.

2. Related Work

Visual object tracking for Unmanned Aerial Vehicle (UAV) applications faces a series of unique challenges. This section first reviews the evolution of dedicated UAV tracking methods, focusing on how different technical paradigms have attempted to address core challenges such as small target size, dynamic viewpoints, and computational constraints. Subsequently, we survey the emerging application of State Space Models (SSMs), particularly Mamba-based architectures, in visual tracking, assessing their potential to overcome the inherent efficiency bottlenecks of traditional methods. Finally, we examine dynamic template update mechanisms and context-aware modeling strategies, which are crucial for maintaining tracking robustness under appearance variations.

2.1. UAV Visual Tracking

UAV-based tracking differs significantly from ground-view tracking: it confronts unique challenges, including sparse features of small targets, large inter-frame displacements, and constraints on onboard resources. These distinct challenges have further driven the development of dedicated tracking methods across various technical paradigms. Early approaches primarily relied on discriminative correlation filters, but their dependence on handcrafted features limits robustness in complex scenes [8,9,12,34]. These methods fail to model the subtle discriminative features of small UAV targets, leading to rapid tracking drift when targets blend into the background. With the rapid advancement of deep learning, convolutional neural network–based Siamese trackers have progressively become dominant [35,36,37,38,39,40,41,42,43]. For instance, SiamBAN [44] and SiamRPN++ [45] improve tracking performance by integrating region proposal networks and multi-level feature fusion, whereas SiamFC++ [46], designed with UAV resource constraints in mind, maintains real-time performance at the cost of reduced tracking accuracy. Recently, Transformer-based tracking approaches have achieved remarkable progress in UAV vision tasks [20,21,23,47]. STARK [48] employs an encoder–decoder structure to capture global spatio-temporal dependencies; and TATrack [49] adopts a Vision Transformer (ViT) backbone to construct a single-stream tracking framework supporting end-to-end feature learning. Nevertheless, the quadratic computational complexity of Transformers introduces severe efficiency bottlenecks when processing high-resolution images and long video sequences. Furthermore, most existing methods lack specialized designs to handle the sparse and easily lost features of UAV small targets, limiting their performance in such scenarios. In contrast to existing methods, this work proposes the Small-Target-Aware Context Prediction Head (CAPH), which integrates a Gaussian-weighted prior and small-target-optimized loss function to enhance localization robustness under sparse feature and strong background interference.

2.2. Visual Tracking Based on Mamba Models

In tracking tasks, MambaLCT [27] constructs long-term contextual representations using Mamba’s hidden states, integrating historical appearance features in an autoregressive manner. While it achieves efficient long-sequence modeling, it lacks dedicated designs for small UAV targets—relying on unidirectional temporal scanning that fails to capture complete motion patterns. TrackingMiM [28] introduces a Mamba-in-Mamba architecture with nested scanning to model intra-frame spatial structure and inter-frame temporal relationships. Despite its enhanced spatial–temporal synergy, it adopts a size-agnostic feature fusion strategy that dilutes core small-target features with background information, limiting discriminability for sparse target representations. TemTrack [29] introduces trajectory tokens, enabling decoupled learning of temporal dynamics and appearance representations. These studies effectively demonstrate the potential of Mamba in visual tracking tasks. However, existing methods mainly focus on general tracking scenarios and do not fully address the unique challenges posed by small-target tracking in UAV applications, such as sparse and easily lost features. As a result, there is room for improvement in scanning strategies, template update mechanisms, and localization accuracy. In contrast, the proposed bidirectional spatio-temporal Mamba module enhances the extraction of subtle local features through bidirectional spatial scanning, effectively addressing large inter-frame displacements and sudden pose changes caused by UAV platform motion. Simultaneously, the Mamba-based dynamic template fusion module with adaptive attention integrates a triple safety verification mechanism to significantly reduce error accumulation during template updates.

2.3. Dynamic Template Update-Based Context-Aware Modeling

Template update is a critical component for enhancing long-term tracking robustness, as it must balance adaptivity to target appearance changes and suppression of error propagation [50,51,52]. Early methods such as UpdateNet [53] depend on manually crafted selection strategies to identify suitable templates from historical frames, limiting flexibility. MixFormer [54] introduces learnable queries for dynamic updating but uses a fixed confidence threshold—unsuitable for UAV small targets with inherently lower response peaks, leading to frequent false positive or negative updates. SeqTrack [55] enhances performance through sequence-level modeling of multiple templates, albeit at the cost of considerable computational overhead. CTTrack [56] employs temporal attention to fuse historical template features yet lacks a reliable verification mechanism for update quality. For temporal context modeling, KeepTrack [57] and EVPTrack [58] build temporal context by linking motion cues across consecutive frames, but their frame-wise modeling causes fragmented contextual information. ODTrack [59] strengthens inter-frame associations via a token propagation mechanism, though it remains constrained by a fixed temporal window. AQATrack [60] adopts a sliding-window strategy, whose effective length is limited by the quadratic complexity of Transformers. MambaLCT [27] leverages Mamba’s linear complexity to achieve global context modeling and break the computational bottleneck, but still requires improvements in update safety and adaptation to small targets. In response to these limitations, this work proposes a Mamba-based Dynamic Template Fusion module with Adaptive Attention (DTF-AA). It integrates three key innovations: a triple safety verification mechanism (response peak, temporal consistency, motion stability) to ensure update reliability, scale-aware attention to focus on small-target core regions, and a conservative update strategy to suppress error propagation. Combined with the Bi-STM module’s joint spatial-temporal modeling, DTF-AA achieves robust template fusion while maintaining temporal context integrity—effectively addressing the template update challenges in complex UAV scenarios.

3. Mamba-Based Spatio-Temporal Fusion Tracker (MSTFT) Architecture

UAV visual tracking faces several key challenges, including sparse feature representations for small targets, modeling distortions caused by fast target motion, error accumulation under complex background interference, and the difficulty of balancing accuracy and efficiency given the limited resources of onboard platforms. To address these challenges, we propose a Mamba-based Spatio-Temporal Fusion Tracker (MSTFT) that achieves joint optimization of multiple modules through a dedicated collaborative mechanism. As shown in Figure 1, MSTFT is built on an adaptive feature-enhancement backbone that enhances the discriminative capability of features for low-resolution targets. Three core modules are designed: a Bidirectional Spatio-Temporal Mamba (Bi-STM) module that models complex motion patterns via a bidirectional scanning mechanism; a Dynamic Template Fusion module with Adaptive Attention (DTF-AA) featuring a triple verification mechanism to suppress background interference and tracking drift; and a Context-Aware Prediction Head (CAPH) that leverages a Gaussian-weighted prior to refine small-target localization and alleviate ambiguity caused by sparse feature representations.
As illustrated in Figure 1, given a UAV aerial video sequence $\{x_1, x_2, \dots, x_T\}$, the initial template frame is denoted as $x_t \in \mathbb{R}^{3 \times H_t \times W_t}$, and an arbitrary test frame is denoted as $x_s \in \mathbb{R}^{3 \times H_s \times W_s}$. First, a lightweight Vision Mamba backbone network performs feature extraction and enhancement on both frames, producing multi-scale feature maps $F_t \in \mathbb{R}^{D \times H_t \times W_t}$ and $F_s \in \mathbb{R}^{D \times H_s \times W_s}$, where $D = 512$ is the feature dimension adapted for small targets, and $H_t, W_t, H_s, W_s$ are the spatial dimensions after downsampling. The concatenated feature $F_{\mathrm{mix}} = [F_t, F_s]$ is then fed into the Bi-STM module, which employs a bidirectional scanning mechanism to capture fine-grained spatial details of small targets and temporal dependencies of dynamic motions, outputting an enhanced spatio-temporal feature $F_{\mathrm{stm}} \in \mathbb{R}^{D \times (H_t W_t + H_s W_s)}$. Subsequently, the DTF-AA module adaptively fuses the initial template feature $F_t^{i}$ and the dynamically updated template feature $F_t^{d}$ based on tracking confidence and motion stability, generating a target-aware template $F_t^{*} \in \mathbb{R}^{D \times H_t \times W_t}$ to suppress error accumulation. Finally, the CAPH module incorporates a small-target prior, fuses $F_{\mathrm{stm}}$ and $F_t^{*}$ to perform response map prediction and bounding box regression, and outputs the target state $\hat{y} = [l, t, r, b, c]$ (containing the bounding box coordinates and a confidence score $c$).

3.1. Bidirectional Spatio-Temporal Mamba (Bi-STM) Module

The Bi-STM module employs a dual-focus architecture designed to address the inherent spatial and temporal challenges in UAV-based tracking. This design is motivated by two key observations of UAV operational characteristics: first, the discriminative features of small targets are often concentrated in limited spatial regions, requiring specialized scanning mechanisms to extract fine-grained local information; second, the rapid maneuvering and significant inter-frame displacement of UAV targets make unidirectional temporal modeling insufficient, as it often fails to capture complete motion patterns and is prone to accumulating tracking drift. Our bidirectional temporal scanning method effectively addresses these limitations by simultaneously capturing forward motion trends and backward error correction, forming a complementary modeling framework that improves prediction accuracy and tracking stability.
Building on the State Space Model (SSM) foundation introduced in Section 1, we optimize its key parameters for UAV scenarios to balance modeling accuracy and computational cost. Specifically, we set the timescale parameter $\Delta = 0.01$ (matching typical UAV inter-frame motion speeds, where target displacement is within 5% of the image size at 10–30 FPS) and the hidden state dimension $N = 256$ (balancing small-target feature representation against the 16 GB memory constraint of typical UAV onboard GPUs). To enhance the discriminability of low-resolution small-target features, the input feature token $x_t$ is preprocessed via adaptive contrast normalization:
$\hat{x}_t = \dfrac{x_t - \mu}{\sigma + \epsilon} \cdot \gamma + \beta$
where $\mu$ and $\sigma$ are local feature statistics adapted to small-target feature distributions, $\gamma$ and $\beta$ are learnable parameters, and $\epsilon = 10^{-6}$ is a small constant to avoid division by zero.
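As a concrete illustration of the two operations above, the following PyTorch sketch implements the adaptive contrast normalization and a zero-order-hold (ZOH) discretization for a diagonal SSM. It is a minimal sketch under our own assumptions (per-token statistics for $\mu$ and $\sigma$, an elementwise diagonal state matrix); the class and function names are illustrative and not taken from the authors' code.

import torch
import torch.nn as nn

class AdaptiveContrastNorm(nn.Module):
    """Adaptive contrast normalization: x_hat = (x - mu) / (sigma + eps) * gamma + beta."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift
        self.eps = eps

    def forward(self, x):                            # x: (B, L, D) feature tokens
        mu = x.mean(dim=-1, keepdim=True)            # local statistics per token (assumed)
        sigma = x.std(dim=-1, keepdim=True)
        return (x - mu) / (sigma + self.eps) * self.gamma + self.beta

def zoh_discretize(A, B, delta=0.01):
    """ZOH discretization of a diagonal continuous SSM with timescale delta:
    A_bar = exp(delta * A); B_bar = (A_bar - 1) / A * B (elementwise)."""
    A_bar = torch.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar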

3.1.1. Bidirectional Scanning Mechanism

The spatial scanning employs a horizontal–vertical bidirectional strategy tailored to the "locally dominant features" characteristic of small targets, ensuring comprehensive capture of core feature information. First, a "small-window rearrangement" is applied to the concatenated feature $F_{\mathrm{mix}}$ instead of a global rearrangement: the small-target feature region is cropped according to the initial template's proportions, and this region undergoes $16 \times 16$ window rearrangement to reduce background noise interference. Then, SSM modeling is applied separately along the horizontal and vertical directions, generating spatial hidden states $h_{\mathrm{spatial}}^{h}$ and $h_{\mathrm{spatial}}^{v}$. Finally, a small-target feature enhancement branch output $F_{\mathrm{mix}}^{\mathrm{small}}$, obtained by compressing channels with a $1 \times 1$ convolution followed by a $3 \times 3$ depthwise convolution to capture local correlations, is introduced. This output is fused with the bidirectional spatial states and normalized via LayerNorm to produce the final spatial feature $h_{\mathrm{spatial}}$, formulated as
$h_{\mathrm{spatial}}^{h}[i] = \bar{A}_{h}\, h_{\mathrm{spatial}}^{h}[i-1] + \bar{B}_{h}\, \mathrm{Permute}_{h}(F_{\mathrm{mix}})[i]$
$h_{\mathrm{spatial}}^{v}[i] = \bar{A}_{v}\, h_{\mathrm{spatial}}^{v}[i-1] + \bar{B}_{v}\, \mathrm{Permute}_{v}(F_{\mathrm{mix}})[i]$
$h_{\mathrm{spatial}} = \mathrm{LayerNorm}\big(h_{\mathrm{spatial}}^{h} + h_{\mathrm{spatial}}^{v} + F_{\mathrm{mix}}^{\mathrm{small}}\big)$
where $\bar{A}_{h}, \bar{B}_{h}, \bar{A}_{v}, \bar{B}_{v}$ are the learnable parameters of the spatial SSM, $i$ indexes the scanning position, and $\mathrm{Permute}_{h}(\cdot)$ and $\mathrm{Permute}_{v}(\cdot)$ denote the feature rearrangement operations along the horizontal and vertical directions, respectively.
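A simplified sketch of the horizontal and vertical scans and their fusion with the small-target branch is given below. It uses a plain sequential recurrence in place of Mamba's optimized selective-scan kernel, and the module layout (parameter shapes, window handling) is our own assumption rather than the released implementation.

import torch
import torch.nn as nn

def ssm_scan(x, A_bar, B_bar):
    """Diagonal SSM recurrence h[i] = A_bar * h[i-1] + B_bar * x[i] over a token sequence."""
    batch, length, dim = x.shape
    h = x.new_zeros(batch, dim)
    states = []
    for i in range(length):
        h = A_bar * h + B_bar * x[:, i]
        states.append(h)
    return torch.stack(states, dim=1)                      # (batch, length, dim)

class BiSpatialScan(nn.Module):
    """Horizontal + vertical spatial scans fused with a 1x1 conv / 3x3 depthwise branch."""
    def __init__(self, dim):
        super().__init__()
        self.A_h = nn.Parameter(torch.full((dim,), 0.9)); self.B_h = nn.Parameter(torch.ones(dim))
        self.A_v = nn.Parameter(torch.full((dim,), 0.9)); self.B_v = nn.Parameter(torch.ones(dim))
        self.small = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),                          # channel compression
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),   # depthwise local context
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, f):                                  # f: (B, D, H, W)
        b, d, h, w = f.shape
        rows = f.permute(0, 2, 3, 1).reshape(b, h * w, d)  # horizontal (row-major) order
        cols = f.permute(0, 3, 2, 1).reshape(b, h * w, d)  # vertical (column-major) order
        h_h = ssm_scan(rows, self.A_h, self.B_h)
        h_v = ssm_scan(cols, self.A_v, self.B_v)
        h_v = h_v.reshape(b, w, h, d).permute(0, 2, 1, 3).reshape(b, h * w, d)  # back to row-major
        small = self.small(f).permute(0, 2, 3, 1).reshape(b, h * w, d)
        return self.norm(h_h + h_v + small)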
The temporal scanning employs a forward–backward bidirectional strategy to handle the rapid motion and sudden pose changes of UAV targets. As illustrated in Figure 2, the inner spatial Mamba performs horizontal–vertical bidirectional spatial scanning, while the outer temporal Mamba executes forward–backward bidirectional temporal scanning. The forward scan predicts the current target’s motion trend based on historical frames, adapting to the target’s motion inertia. The backward scan corrects the current frame’s modeling error using subsequent frames, addressing drift in the forward scan caused by abrupt pose changes.
$h_{\mathrm{temporal}}^{f}[t] = \bar{A}_{f}\, h_{\mathrm{temporal}}^{f}[t-1] + \bar{B}_{f}\, F_{\mathrm{spatial}}[t]$
$h_{\mathrm{temporal}}^{b}[t] = \bar{A}_{b}\, h_{\mathrm{temporal}}^{b}[t+1] + \bar{B}_{b}\, F_{\mathrm{spatial}}[t]$
$h_{\mathrm{temporal}} = \mathrm{LayerNorm}\big(h_{\mathrm{temporal}}^{f} + h_{\mathrm{temporal}}^{b}\big)$
where $F_{\mathrm{spatial}}$ is the output feature from the spatial scan, $t$ indexes the frame, $\bar{A}_{f}, \bar{B}_{f}, \bar{A}_{b}, \bar{B}_{b}$ are the learnable parameters of the temporal SSM, and $f$ and $b$ denote the forward and backward scanning directions, respectively.
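The forward and backward temporal passes can be sketched in the same style; the backward pass is realized here by scanning the reversed frame sequence and flipping the result back, which corresponds to the recurrence over $t+1$ above. Per-frame features are assumed to be pooled to a single token per frame, and the class name is ours.

import torch
import torch.nn as nn

class BiTemporalScan(nn.Module):
    """Forward-backward temporal SSM over a short window of per-frame features."""
    def __init__(self, dim):
        super().__init__()
        self.A_f = nn.Parameter(torch.full((dim,), 0.9)); self.B_f = nn.Parameter(torch.ones(dim))
        self.A_b = nn.Parameter(torch.full((dim,), 0.9)); self.B_b = nn.Parameter(torch.ones(dim))
        self.norm = nn.LayerNorm(dim)

    @staticmethod
    def _scan(x, A, B):                                   # x: (batch, T, dim)
        h = x.new_zeros(x.shape[0], x.shape[2])
        out = []
        for t in range(x.shape[1]):
            h = A * h + B * x[:, t]                       # h[t] = A_bar*h[t-1] + B_bar*x[t]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, f_spatial):                         # (batch, T, dim) spatial-scan outputs
        h_f = self._scan(f_spatial, self.A_f, self.B_f)                    # forward motion trend
        h_b = self._scan(f_spatial.flip(1), self.A_b, self.B_b).flip(1)    # backward correction
        return self.norm(h_f + h_b)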
Compared to existing Mamba trackers (e.g., TrackingMiM’s unidirectional temporal scan), this bidirectional temporal scanning achieves a 2.7% AUC improvement on the UAV123@10fps dataset (characterized by low frame rate and large displacements), validating its advantage in dynamic motion modeling.

3.1.2. Adaptive Fusion of Spatio-Temporal Features

To prevent small-target features from being diluted during spatio-temporal feature fusion, Bi-STM employs a fusion strategy combining “residual connections and dynamic weight allocation”, which adaptively adjusts the contribution weights of spatial and temporal features based on target size and motion state.
$F_{\mathrm{stm}} = \mathrm{LayerNorm}\big(w_{s}\, h_{\mathrm{spatial}} + w_{t}\, h_{\mathrm{temporal}} + F_{\mathrm{mix}}\big)$
where $w_{s} = \mathrm{TargetSize}(x_t) / (H_s W_s)$ represents the relative size of the small target, and $w_{t} = 1 - w_{s}$ is the weight for the temporal features.
When the target is small ($w_s$ is large), the model emphasizes spatial local features to prevent the loss of core information. When the target undergoes large motion displacement ($w_t$ is large), the model prioritizes temporal motion features to enhance adaptability to dynamic changes. This enables stable feature representation under the dynamically varying target sizes and motion states typical in UAV scenarios.
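The size-driven fusion can be expressed compactly as below; `target_area / search_area` follows the relative-size definition above, and the function name is illustrative only.

import torch
import torch.nn as nn

def fuse_spatio_temporal(h_spatial, h_temporal, f_mix, target_area, search_area,
                         norm: nn.LayerNorm):
    """Weighted residual fusion of spatial and temporal states with the mixed feature."""
    w_s = target_area / search_area      # relative target size (scalar)
    w_t = 1.0 - w_s                      # complementary temporal weight
    return norm(w_s * h_spatial + w_t * h_temporal + f_mix)

# Example: norm = nn.LayerNorm(512); all features shaped (B, L, 512).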

3.2. Dynamic Template Fusion Module Based on Adaptive Attention (DTF-AA)

The core objective of the DTF-AA module is to achieve safe template updates in complex UAV scenarios. This update strategy must adapt to target appearance changes while preventing erroneous updates caused by UAV viewpoint changes and background clutter. The module operates through an iterative process of feature quality assessment, update safety verification, and adaptive fusion, leveraging a multi-constraint mechanism to suppress the propagation of cumulative errors. The feature representations processed in this module build upon the refined outputs from the Mamba Layer (as illustrated in Figure 3), which integrates In Attention, Out Attention, and FFN to enhance temporal–spatial feature coherence for reliable template fusion.

3.2.1. Triple Safety Verification Mechanism for Template Update

In UAV remote sensing scenarios, target deformation caused by UAV pose changes and distracting features introduced by background clutter are often mistaken for genuine appearance variations of the target, thereby inducing tracking drift and resulting in false feature updates during the template update process. To address this challenge, the DTF-AA method incorporates a triple safety verification mechanism comprising response peak inspection, response consistency validation, and motion stability checking, collaboratively ensuring the reliability and robustness of the template update.
$F_t^{d} = \begin{cases} F_t^{i}, & k = 1, \\ \alpha F_s^{*} + (1-\alpha) F_t^{d}, & R_{\mathrm{peak}} > \theta \ \wedge\ \Delta R < \epsilon \ \wedge\ \mathrm{MotionStab}(x_s) < \delta, \\ F_t^{d}, & \text{otherwise.} \end{cases}$
Here, $F_t^{i}$ is the initial template feature, $F_t^{d}$ is the dynamic template feature, $k$ is the frame index, $\alpha = 0.2$ is the template update rate (lower than the 0.3 typically used for general scenes, adopting a "slow update" strategy to reduce error accumulation for small targets), $F_s^{*}$ is the feature from the small-target region in the current frame (obtained using 1.5× extended cropping to retain contextual information and prevent feature loss), and $\epsilon = 0.75$ is the threshold for temporal consistency verification (controlling the allowable deviation between the current frame's response peak and the mean value of the last 5 frames).
The design of each verification criterion is adapted to UAV characteristics. The response peak $R_{\mathrm{peak}}$ reflects the credibility of the target features in the current frame, with $\theta = 0.92$ (lower than the 0.95 used in general scenarios) because UAV small targets typically exhibit lower response peaks, thus avoiding false negatives. The response consistency $\Delta R = |R_{\mathrm{peak}} - \bar{R}_{t-5:t-1}|$ (the deviation from the mean response peak over the last 5 frames) ensures coherence between current and historical features, filtering transient disturbances. Motion stability is measured as $\mathrm{MotionStab}(x_s) = \frac{1}{3}\sum_{i=1}^{3} \mathrm{SSIM}(x_s, x_{s-i})$ (the mean structural similarity over the last 3 frames); with $\delta = 0.2$, this check filters out frames blurred by sudden UAV pose changes, preventing updates based on spurious features.
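A minimal sketch of the verification-then-update step is given below. It reuses the SSIM routine from the torchmetrics library mentioned in Section 3.4; for the motion-stability test it follows the "SSIM above threshold" convention spelled out in Algorithm 1 and Section 4, and the default thresholds are taken from the text. The function name and argument layout are our own.

import torch
from torchmetrics.functional import structural_similarity_index_measure as ssim

def verify_and_update(F_td, F_s_star, r_peak, peak_hist, crops,
                      theta=0.92, eps_c=0.75, delta=0.7, alpha=0.2):
    """Triple safety verification before a conservative template update.
    peak_hist: response peaks of the last 5 frames; crops: current target crop followed
    by the previous 3 crops, each a float tensor of shape (1, 3, H, W)."""
    c_peak = r_peak > theta                                             # response peak check
    c_consist = abs(r_peak - sum(peak_hist) / len(peak_hist)) < eps_c   # temporal consistency
    stability = sum(float(ssim(crops[0], c)) for c in crops[1:]) / (len(crops) - 1)
    c_motion = stability > delta                                        # motion stability
    if c_peak and c_consist and c_motion:
        return alpha * F_s_star + (1 - alpha) * F_td                    # slow, verified update
    return F_td                                                         # otherwise keep template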

3.2.2. Size-Aware Adaptive Attention Fusion

Traditional attention fusion methods do not account for the dynamic size changes of UAV targets, leading to the dilution of core small-target features by background information. DTF-AA introduces a size-aware attention mechanism that dynamically adjusts the attention window size and temperature parameter to focus on the core region of small targets. This mechanism leverages the feature refinement capabilities of the Mamba Layer (Figure 3), where In Attention and Out Attention modules help prioritize and fuse target-centric features while suppressing background noise.
The implementation is as follows: first, compute the size ratios $s_t = \mathrm{TargetSize}(x_t) / (H_t W_t)$ and $s_s = \mathrm{TargetSize}(x_s) / (H_s W_s)$ for the initial template and the current frame target, respectively. Then, dynamically adjust the attention temperature parameter $\tau$ based on the size ratio; for small targets, $\tau$ is decreased ($\tau = 0.1$) to sharpen the focus. Simultaneously, the attention window size is set to $w = \max(3, 0.5 \cdot \mathrm{TargetSize}(x))$, adapting to the target size to minimize background interference.
$w_{i} = \mathrm{Softmax}\!\left(\dfrac{\mathrm{SAtt}(F_t^{i}, R_{\mathrm{peak}})}{\tau \cdot s_t}\right)$
$w_{d} = \mathrm{Softmax}\!\left(\dfrac{\mathrm{SAtt}(F_t^{d}, R_{\mathrm{peak}})}{\tau \cdot s_s}\right)$
$F_t^{*} = \dfrac{w_{i} \otimes F_t^{i} + w_{d} \otimes F_t^{d}}{w_{i} + w_{d}}$
Here, $\mathrm{SAtt}(\cdot,\cdot)$ is the spatial attention computation function, $\otimes$ denotes element-wise multiplication, and $w_{i}$ and $w_{d}$ are the attention weights for the initial and dynamic templates, respectively. This mechanism enhances the core features of small targets and suppresses background interference, with its feature processing pipeline aligning with the multi-frame enhancement logic of the Mamba Layer (Figure 3).
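The temperature- and size-scaled fusion of the two templates can be sketched as follows; the raw spatial attention maps $\mathrm{SAtt}(\cdot, R_{\mathrm{peak}})$ are assumed to be computed elsewhere and passed in, and the softmax is taken over spatial positions.

import torch

def size_aware_fuse(F_ti, F_td, att_i, att_d, s_t, s_s, tau=0.1, eps=1e-6):
    """Fuse initial and dynamic templates with temperature- and size-scaled attention.
    F_ti, F_td: (B, D, H, W) template features; att_i, att_d: (B, 1, H, W) raw attention maps."""
    w_i = torch.softmax((att_i / (tau * s_t)).flatten(2), dim=-1).view_as(att_i)
    w_d = torch.softmax((att_d / (tau * s_s)).flatten(2), dim=-1).view_as(att_d)
    return (w_i * F_ti + w_d * F_td) / (w_i + w_d + eps)   # normalized element-wise fusion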

3.3. Small-Target-Aware Context Prediction Head (CAPH)

The CAPH module adopts a small-target-centric design philosophy, integrating multi-source contextual information to enhance localization robustness in complex backgrounds. Its core innovations lie in two aspects: small-target-prioritized feature fusion and a loss function specifically designed for small targets. These advancements directly address the critical challenges of small UAV targets being vulnerable to background clutter and highly sensitive to localization errors.

3.3.1. Small-Target Prior-Guided Feature Fusion

To address the issue of weak feature signals and susceptibility to background noise for UAV small targets in complex backgrounds, CAPH introduces a "Gaussian-weighted small-target prior" that guides the attention mechanism to focus on the target region, enhancing the discriminability between target features and the background. This prior uses the target center $(c_x, c_y)$ from the previous frame as the Gaussian mean, with $\sigma = \max(2, 0.3 \cdot \mathrm{TargetSize}(x_s))$. The value of $\sigma$ is decreased for small targets to enhance focus and increased for large targets to cover the entire target area. It is mathematically defined as
$G(i,j) = \exp\!\left(-\dfrac{(i - c_x)^2 + (j - c_y)^2}{2\sigma^2}\right)$
The feature fusion process based on this prior is
$F_{\mathrm{fusion}} = \mathrm{CrossAttn}\big(F_{\mathrm{stm}}, F_t^{*}, G(x_s)\big) + F_{\mathrm{stm}}$
Here, $\mathrm{CrossAttn}(\cdot,\cdot,\cdot)$ is a cross-attention layer that incorporates the small-target prior. The query comes from the template feature $F_t^{*}$ to ensure target-centric attention, while the key and value come from the spatio-temporal feature $F_{\mathrm{stm}}$ to incorporate contextual information, achieving precise localization through "template guidance and context constraint".
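One plausible realization of the prior-guided fusion is to build the Gaussian map from the previous target center and add its logarithm to the cross-attention logits, so that attention mass is biased toward the target region. The sketch below follows that reading; the bias scheme and function names are our assumptions, not the authors' released code.

import torch

def gaussian_prior(height, width, cx, cy, target_size, device="cpu"):
    """Gaussian-weighted small-target prior centered on the previous frame's target center."""
    sigma = max(2.0, 0.3 * target_size)                   # tighter focus for smaller targets
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32, device=device),
        torch.arange(width, dtype=torch.float32, device=device),
        indexing="ij")
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def prior_biased_cross_attention(q, k, v, prior, eps=1e-6):
    """q: (B, Lq, D) template queries; k, v: (B, Lk, D) spatio-temporal tokens;
    prior: (B, Lk) Gaussian prior flattened over the search grid."""
    logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    logits = logits + torch.log(prior + eps).unsqueeze(1)  # bias logits toward the target region
    return torch.softmax(logits, dim=-1) @ v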

3.3.2. Loss Function Optimized for Small Targets

Bounding box regression for small targets is more sensitive to errors: a 1-pixel localization error can reduce the IoU of a 20 × 20 small target by up to 15%, while for a 100 × 100 target the impact is only about 2%. Therefore, CAPH optimizes both the parameters and formulation of the loss function to increase the training weight for small-target classification and regression.
The total loss function is a weighted sum of the classification and regression losses:
$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cls}} + \lambda_{1} \mathcal{L}_{1} + \lambda_{2} \mathcal{L}_{\mathrm{giou}}$
Here, $\mathcal{L}_{\mathrm{cls}}$ is a modified Focal Loss adapted to the class imbalance problem common with small targets (where background pixels vastly outnumber target pixels). Its formulation is
$\mathcal{L}_{\mathrm{cls}} = -\sum_{i,j}\Big[ y_{i,j}^{c} (1 - R_{i,j})^{\gamma} \log(R_{i,j}) + (1 - y_{i,j}^{c})\, R_{i,j}^{\gamma} \log(1 - R_{i,j}) \Big]$
where $\gamma = 2.5$ (higher than the standard 2.0) further suppresses the gradient contribution from background pixels and increases the training weight for small-target pixels, and $y_{i,j}^{c}$ is a Gaussian classification label with $\sigma_{y} = 0.2 \cdot \mathrm{TargetSize}(x_s)$, adapting to the target's dynamic size changes.
The regression loss combines an L1 loss and a GIoU loss, with weights $\lambda_{1} = 6$ and $\lambda_{2} = 3$ to emphasize bounding box localization accuracy. To address the IoU evaluation bias for small targets, an area penalty term is added to the GIoU calculation for targets smaller than $30 \times 30$ pixels, mitigating the IoU distortion caused by the small absolute area of these targets.
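A compact PyTorch version of the classification term and the weighted total loss is given below; the GIoU and L1 terms are assumed to be computed by standard routines and passed in as scalars, and the parameter values follow the text ($\gamma = 2.5$, $\lambda_1 = 6$, $\lambda_2 = 3$). This is an illustrative sketch rather than the released implementation.

import torch

def small_target_focal_loss(R, y, gamma=2.5, eps=1e-6):
    """Modified focal loss over the response map R (values in (0, 1)) against the
    Gaussian-shaped classification label y."""
    pos = y * (1.0 - R) ** gamma * torch.log(R + eps)
    neg = (1.0 - y) * R ** gamma * torch.log(1.0 - R + eps)
    return -(pos + neg).sum()

def total_loss(l_cls, l_l1, l_giou, lam1=6.0, lam2=3.0):
    """Weighted sum of classification, L1, and GIoU losses."""
    return l_cls + lam1 * l_l1 + lam2 * l_giou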

3.4. Overall Framework and Core Algorithm

The Mamba-based Spatio-Temporal Fusion Tracker (MSTFT) is designed for robust UAV tracking, particularly under the challenges of fast motion and small targets. Its effectiveness stems from the synergistic integration of its core modules: the Bidirectional Spatio-Temporal Mamba module (Bi-STM), the Dynamic Template Fusion with Adaptive Attention (DTF-AA) together with its Triple Safety Verification Mechanism (TSVM), and the Small-Target-Aware Context Prediction Head (CAPH). The complete end-to-end workflow of MSTFT is formalized in Algorithm 1, which details the processes of target initialization, multi-scale feature extraction, spatio-temporal fusion, adaptive template management, and final localization.
Implementation Details for Reproducibility:
  • Vision Mamba Backbone: We adopt the open-source Vision Mamba (ViM) implementation from [33] with input resolution of 256 × 256 pixels.
  • Cross-Correlation: Computed using standard 2D cross-correlation in PyTorch 2.1.0.
  • SSIM Calculation: Computed via torchmetrics library with patches resized to 64 × 64 .
Algorithm 1 Mamba-based Spatio-Temporal Fusion Tracker (MSTFT)
  • Input: Initial frame $I_0$, initial target bounding box $B_0$, video sequence $\{I_t\}_{t=1}^{T}$, hyperparameters $\theta = 0.92$, $\varepsilon = 0.75$, $\delta = 0.7$, $\alpha = 0.2$
  • Output: Target bounding box $B_t$ for each frame $t$
  1: Initialization
  2: Extract initial target template $T_0$ from $I_0$ using $B_0$
  3: Initialize spatio-temporal fusion template $T_{st} \leftarrow T_0$
  4: Initialize response peak buffer $P_{buf} \leftarrow \varnothing$
  5: Initialize motion stability buffer $S_{buf} \leftarrow \varnothing$
  6: for $t = 1$ to $T$ do
  7:     Step 1: Multi-scale Feature Extraction
  8:     Extract 3-scale feature maps $F_t = \{F_t^{1}, F_t^{2}, F_t^{3}\}$ from $I_t$ via the Vision Mamba backbone
  9:     Step 2: Bidirectional Spatio-Temporal Fusion
 10:     $F_t^{st} = \text{Bi-STM}(F_t, F_{t-1}^{st})$    # Fuse current and historical features
 11:     Step 3: Dynamic Template Fusion with Adaptive Attention
 12:     Compute scale-aware attention weights $w_s$ based on the size of $B_{t-1}$
 13:     $F_t^{fused} = \sum_{s=1}^{3} w_s \cdot F_t^{st,s}$    # Adaptive multi-scale fusion
 14:     Compute target response map $R_t = \mathrm{CrossCorrelation}(F_t^{fused}, T_{st})$
 15:     Step 4: Target Localization
 16:     Find peak position $p_t = \arg\max(R_t)$ and value $v_t = R_t(p_t)$
 17:     Regress $B_t$ from $R_t$ using the optimized regression head
 18:     Step 5: Triple Safety Verification Mechanism
 19:     Compute SSIM $s_t$ between $I_t(B_t)$ and $I_{t-1}(B_{t-1})$
 20:     if $t \geq 5$ then
 21:         $C_1 = (v_t > \theta \cdot \mathrm{mean}(P_{buf}))$    # Peak sufficiency
 22:         $C_2 = (|v_t - \mathrm{mean}(P_{buf})| < \varepsilon)$    # Temporal consistency
 23:     else
 24:         $C_1 = \text{True}$; $C_2 = \text{True}$    # Bypass for initial frames
 25:     end if
 26:     $C_3 = (s_t > \delta)$    # Motion stability
 27:     Update buffers: $P_{buf} \leftarrow [P_{buf}[1{:}], v_t]$; $S_{buf} \leftarrow [S_{buf}[1{:}], s_t]$
 28:     Step 6: Adaptive Template Update
 29:     if $C_1 \wedge C_2 \wedge C_3$ then
 30:         Extract new target template $T_t$ from $I_t$ using $B_t$
 31:         $T_{st} \leftarrow (1-\alpha)\cdot T_{st} + \alpha \cdot T_t$    # Conservative update
 32:     end if
 33:     Output $B_t$ for frame $t$
 34: end for
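To connect Algorithm 1 with the implementation notes above, the following sketch shows how the cross-correlation response map (Step 3), peak localization (Step 4), and the SSIM-based stability score (Step 5) can be computed with standard PyTorch and torchmetrics calls. The function names are ours, and the backbone features and regression head are omitted.

import torch
import torch.nn.functional as F
from torchmetrics.functional import structural_similarity_index_measure as ssim

def response_map(search_feat, template_feat):
    """2D cross-correlation between a fused search feature (1, D, Hs, Ws) and the
    template (1, D, Ht, Wt), implemented as a convolution with the template as kernel."""
    return F.conv2d(search_feat, template_feat)            # (1, 1, Hs-Ht+1, Ws-Wt+1)

def peak_location(R):
    """Peak position p_t and value v_t of the response map R."""
    v, idx = R.flatten().max(dim=0)
    w = R.shape[-1]
    return (int(idx) // w, int(idx) % w), float(v)

def motion_stability(crop_t, crop_prev):
    """SSIM between consecutive target crops, resized to 64x64 as in the implementation notes."""
    a = F.interpolate(crop_t, size=(64, 64), mode="bilinear", align_corners=False)
    b = F.interpolate(crop_prev, size=(64, 64), mode="bilinear", align_corners=False)
    return float(ssim(a, b))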

4. Experiments

To systematically evaluate the proposed Mamba-based Spatio-Temporal Fusion Tracker (MSTFT) in addressing the core challenges of UAV small-target tracking—sparse features, large inter-frame displacements, error accumulation in complex backgrounds, and the accuracy–efficiency trade-off under resource constraints—we conducted comprehensive experiments involving dataset adaptation, multi-dimensional comparisons, module ablation studies, and qualitative visualization to thoroughly validate MSTFT’s performance and practicality.

4.1. Experimental Setup

To ensure an objective and fair comparison, all algorithms were evaluated using the public tracker_benchmark_v1.1 toolbox, a standard automated tool for UAV tracking. It processes algorithm outputs, calculates standard metrics (precision, success rate, AUC), and generates corresponding plots for visual analysis. Performance data were compiled from multiple sources: results from our runs of publicly available code and results reported in the original publications. This multi-source approach guarantees the comprehensiveness and validity of our comparative study.

4.1.1. Datasets

We selected three mainstream UAV tracking benchmark datasets, covering diverse scene complexities, target scales, and motion patterns, to ensure targeted evaluation of MSTFT’s core design objectives.
The UAV123 dataset contains 123 UAV aerial videos (total >110k frames) spanning urban streets (32 seq.), rural farmland (28 seq.), coastal waters (25 seq.), and mountainous terrain (38 seq.). Targets are primarily small ($5 \times 5$ to $30 \times 30$ pixels, 72% of targets), with challenges including illumination variation (21 seq.), background clutter (18 seq.), and similar object distraction (15 seq.); it is used to evaluate MSTFT's general tracking performance.
UAV123@10fps is a low-frame-rate downsampled version of UAV123, simulating large inter-frame displacements caused by high-speed UAV motion (e.g., >15 m/s), where target displacement often exceeds 5% (up to 12%) of the image size. It primarily tests the temporal dynamic modeling capability of the Bi-STM module.
The UAV20L dataset contains 20 long sequences (avg. 5200 frames/seq., max 12,000 frames) with challenging scenarios: long-term full occlusion (8 seq., 100–300 frames), target leaving the view (6 seq., 200–500 frames), and significant scale variation (6 seq., >200% size change). It validates the long-term error suppression capability of the DTF-AA module.

4.1.2. Evaluation Metrics

We employed Success Rate, Precision, and the Area Under the Curve (AUC) of the success plot as the primary evaluation metrics.
Success Rate measures tracking continuity by calculating the proportion of frames where the Intersection over Union (IoU) between the predicted bounding box R t p and the ground truth R t g meets or exceeds 0.5:
$\mathrm{Success} = \dfrac{1}{N} \sum_{t=1}^{N} \mathbb{I}\big[\mathrm{IoU}(R_t^{p}, R_t^{g}) \geq 0.5\big]$
Precision accounts for the high sensitivity of small targets to localization error. It is defined as the proportion of frames where the Euclidean distance between the predicted and ground truth bounding box centers is within 20 pixels:
$\mathrm{Precision} = \dfrac{1}{N} \sum_{t=1}^{N} \mathbb{I}\big[\, \| (x_{c,t}^{p}, y_{c,t}^{p}) - (x_{c,t}^{g}, y_{c,t}^{g}) \| \leq 20 \,\big]$
where N denotes the total number of frames in the evaluated video sequence, and I ( · ) is the indicator function (returning 1 if the condition inside is satisfied, otherwise 0).
Success AUC provides a comprehensive robustness assessment across different IoU thresholds. It is computed as the area under the success plot, where the x-axis is the IoU threshold (ranging from 0 to 1) and the y-axis is the corresponding success rate.
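For reference, the three metrics can be computed from per-frame predicted and ground-truth boxes as in the short NumPy sketch below (boxes in (x, y, w, h) format). It mirrors what the benchmark toolbox reports but is not the toolbox code itself.

import numpy as np

def box_iou(b1, b2):
    """IoU of two boxes given as (x, y, w, h)."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def evaluate(pred, gt, iou_thr=0.5, dist_thr=20.0):
    """Returns (success rate, precision, success AUC) for arrays pred, gt of shape (N, 4)."""
    ious = np.array([box_iou(p, g) for p, g in zip(pred, gt)])
    centers_p = pred[:, :2] + pred[:, 2:] / 2.0
    centers_g = gt[:, :2] + gt[:, 2:] / 2.0
    dists = np.linalg.norm(centers_p - centers_g, axis=1)
    success = float((ious >= iou_thr).mean())
    precision = float((dists <= dist_thr).mean())
    auc = float(np.mean([(ious >= t).mean() for t in np.linspace(0.0, 1.0, 21)]))
    return success, precision, auc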

4.1.3. Implementation Details

All experiments are conducted on a server with 4× NVIDIA Tesla A100 GPUs. Input images are Z-score normalized with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225]. Data augmentation includes random flipping ($p = 0.5$), brightness ($\pm 15\%$), and contrast ($\pm 10\%$) adjustments. The template size is $127 \times 127$ pixels and the search region is $320 \times 320$ pixels. Training uses SGD (momentum 0.9, weight decay $1 \times 10^{-4}$) with a batch size of 32 for 50 epochs. Learning rates follow cosine annealing: backbone 0.005, core modules 0.01, decaying to 0.0005. We employ the Vision Mamba-B backbone (70 M parameters) with channels [64, 128, 256] and $16\times$ downsampling. In Bi-STM, the spatial window is $16 \times 16$, the temporal window is 8 frames, and the hidden dimension is 256. DTF-AA uses an update rate of 0.2 with triple verification thresholds for peak (0.92), consistency (0.75), and stability (0.7). CAPH employs a Gaussian prior variance of $0.1 \times \min(w, h)$ and a modified Focal Loss weight of 3.0.
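For quick reference, the hyperparameters listed above can be collected into a single configuration object; the key names below are ours and purely illustrative.

TRAIN_CFG = {
    "template_size": 127, "search_size": 320,
    "optimizer": {"type": "SGD", "momentum": 0.9, "weight_decay": 1e-4},
    "batch_size": 32, "epochs": 50,
    "lr": {"backbone": 0.005, "core_modules": 0.01, "final": 0.0005, "schedule": "cosine"},
    "augmentation": {"flip_p": 0.5, "brightness": 0.15, "contrast": 0.10},
    "bi_stm": {"spatial_window": 16, "temporal_window": 8, "hidden_dim": 256},
    "dtf_aa": {"update_rate": 0.2, "peak_thr": 0.92, "consistency_thr": 0.75, "stability_thr": 0.7},
    "caph": {"gaussian_sigma_scale": 0.1, "focal_weight": 3.0},
}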

4.2. Comparative Experiments and Results Analysis

To comprehensively evaluate the performance of the proposed MSTFT method, we selected representative tracking algorithms from three mainstream paradigms: Siamese-based SiamAPN [61], SiamFM [62], and SiamTADT [63]; Transformer-based SGLATrack [64], OSTrack [65], TransT [47], MCTrack [66], and STARK [48]; and Mamba-based TemTrack [29], MambaLCT [27], TrackingMiM [28], and TADMT [67]. By incorporating representative methods featuring diverse architectural designs and modeling mechanisms, we ensure a systematic and targeted comparison with MSTFT in terms of architectural design, modeling strategies, and task adaptability.

4.2.1. Overall Performance Comparison

The performance of various UAV tracking algorithms is benchmarked on three mainstream datasets: UAV123, UAV123@10fps, and UAV20L. Quantitative results are summarized in Table 1, with precision and success curves visualized under the One-Pass Evaluation (OPE) protocol in Figure 4 for intuitive comparison. The evaluation was conducted using the tracker_benchmark_v1.1 toolbox, which calculates standard metrics (e.g., AUC, precision) from result files. Data for comparison were obtained from our experiments, open-source code, and published results.
  • General Scenario Performance. On the UAV123 dataset, MSTFT achieved a success AUC of 79.4% and a precision of 84.5% (see Table 1), outperforming the Mamba-based tracker TrackingMiM (70.8% success AUC, 83.5% precision) by 8.6 and 1.0 percentage points, respectively. This leading performance is also reflected in the visualization results of Figure 4: MSTFT’s precision curve (left subfigure) and success curve (right subfigure) both maintain the highest position across all threshold ranges. This improvement is driven by two core modules: the Bi-STM module uses a horizontal–vertical bidirectional spatial scanning strategy, reorganizing features in small target regions via 16 × 16 windows to suppress background noise (e.g., farmland textures) and enhance discriminative feature extraction; the CAPH module introduces a Gaussian-weighted prior to focus on target core areas (e.g., pedestrian heads in urban scenes), avoiding feature dilution in cluttered backgrounds.
  • Low-Frame-Rate and Large-Displacement Robustness. On the UAV123@10fps benchmark, MSTFT achieved a success AUC of 76.5% and a precision of 84.1% (Table 1), surpassing the Transformer-based representative STARK (63.8% AUC, 82.5% precision) by 12.7 and 1.6 percentage points. The advantage in low-frame-rate scenarios is further verified by the curve trend in Figure 4: even at small location error thresholds (≤10 pixels), MSTFT's precision remains higher than that of the other competitors, indicating strong resistance to inter-frame displacement. This benefit comes from the Bi-STM module's forward–backward temporal scanning: forward scanning leverages historical frames ($t-8$ to $t-1$) to model motion inertia, while backward scanning integrates future-frame information ($t+1$ to $t+8$) via caching, correcting deviations from sudden UAV attitude changes. This allows robust handling of inter-frame displacements exceeding 5% of the image size under 10 FPS conditions.
  • Long-Term Tracking Stability. On the UAV20L dataset (designed for long-term tracking tasks), MSTFT achieved a success AUC of 75.8% and a precision of 83.6% (Table 1), a 7.2 percentage point improvement in AUC over the Mamba-based TADMT (68.6%). As shown in Figure 4, MSTFT’s success curve maintains a significant lead when the overlap threshold is between 0.5 and 0.8, which is a key range for evaluating long-term tracking stability. Its robustness stems from the DTF-AA module’s triple adaptive verification: response peak verification ( θ peak = 0.92 ) filters low-confidence updates (e.g., foggy frames); response consistency verification ( θ consist = 0.75 ) ensures temporal feature coherence to avoid background clutter inclusion; motion stability verification ( θ motion = 0.7 ) rejects occlusion-induced erroneous features. This multi-verification synergy suppresses error accumulation in long-term tracking.

4.2.2. Performance Analysis Under Different Challenge Attributes

To further validate the robustness of MSTFT against core UAV small-target challenges, we evaluated its performance under key challenge attributes of the UAV123 dataset, including small object, low resolution, fast motion, occlusion, and background clutter. The quantitative AUC results for each challenge are summarized in Table 2. For an intuitive comparison, Figure 5 and Figure 6 present the precision and success plots across all 12 extended attributes. The parenthetical numbers in the subfigure captions (e.g., "Scale Variation (109)") indicate the number of sequences annotated with that attribute, which corresponds to the "Sample Sequences" column in Table 2.
  • Occlusion Robustness. In scenarios involving occlusion, MSTFT achieved a success AUC of 51.2 % under full occlusion (Table 2), exceeding TrackingMiM by 3.3 percentage points. For partial occlusion, it reached 69.2 % , outperforming all compared methods. As visualized in Figure 6, MSTFT’s success curve retains a clear lead within the 0.4–0.7 overlap threshold range, which is critical for evaluating occlusion recovery. This capability is enabled by the motion stability verification mechanism in the DTF-AA module, which computes the mean structural similarity index across the most recent 3 frames and suspends template updates when this value falls below 0.7 . Concurrently, the system retains only the initial template and reliable dynamic templates acquired before occlusion, effectively preventing the assimilation of corrupted features. Experimental results confirm that MSTFT’s template update error rate during full occlusion is only 11 % , substantially lower than the 32 % observed in MambaLCT.
  • Resistance to Background Clutter. When confronted with background clutter, MSTFT attained a success AUC of 60.9% (Table 2), surpassing TrackingMiM by 2.6 percentage points. Figure 5 shows that its precision remains above 0.8 even at small location error thresholds (≤10 pixels), highlighting strong clutter rejection. This performance is attributed to the synergistic operation of the Bi-STM and DTF-AA modules: the Bi-STM captures the target's motion trend (e.g., a pedestrian's walking direction in a crowd) through temporal dynamic modeling, distinguishing it from static distractors; the DTF-AA employs a size-aware attention mechanism that dynamically adjusts the attention window relative to the target size (e.g., a $30 \times 30$ pixel window for a $20 \times 20$ pixel target), thereby minimizing interference from semantically similar background objects during template fusion.

4.3. Ablation Study

4.3.1. Study of the Components

To quantitatively evaluate the contributions of the three core modules in the MSTFT framework—Bidirectional Spatio-Temporal Mamba (Bi-STM), Dynamic Template Fusion module based on Adaptive Attention (DTF-AA), and Small-Target-Aware Context Prediction Head (CAPH)—we conducted systematic ablation experiments based on the Vision Mamba-B backbone network. The baseline model (A1) employs only the backbone network for multi-scale feature extraction without any proposed modules. Model A2 incorporates the Bi-STM module to enhance spatio-temporal representation. Model A3 further integrates the DTF-AA module for dynamic template fusion. The complete MSTFT model (A4) includes all three modules to validate their synergistic effects.
The ablation results demonstrate the progressive contributions of each module. As shown in Table 3, adding the Bi-STM module (A2) increased the overall success AUC by 4.5 percentage points compared to the baseline, with particularly notable gains in fast-motion scenarios (+8.3%). This improvement stems from its bidirectional scanning mechanism. Spatial scanning ensures comprehensive coverage of small target regions, while temporal scanning corrects inter-frame motion errors, reducing motion prediction error from 18 to 11 pixels. The module adds minimal overhead—only 2M parameters and 2.2G FLOPs—while maintaining real-time performance at 48 FPS.
Integrating the DTF-AA module (A3) further improved success AUC by 2.6 percentage points, with significant gains in full occlusion (+4.7%) and background clutter scenarios. Its triple safety verification mechanism reduces the template update error rate from 25% to 15%, while the size-aware attention strategy enhances target-background distinguishability by 18%. This module adds only 1M parameters and 1.2G FLOPs, demonstrating high efficiency.
The complete MSTFT model (A4) with the CAPH module achieves the best performance, increasing overall AUC to 75.2% (+2.3%) with remarkable improvements in small-target and low-resolution scenarios. The CAPH module enhances feature response intensity in small target regions by 2.1 times through Gaussian prior feature fusion and employs a modified Focal Loss ( γ = 3.0 ) to improve localization accuracy. Notably, it introduces no additional parameters while adding only 1.1G FLOPs, showing excellent cost-effectiveness.
The three modules exhibit strong synergy through their deeply coupled architecture. The backbone’s enhanced features provide high-quality input for Bi-STM, whose spatio-temporal features support both CAPH’s localization and DTF-AA’s verification. DTF-AA’s target-aware templates serve as precise queries for CAPH, while CAPH’s outputs inform DTF-AA’s update decisions. This closed-loop design enables mutual support among modules, collectively building a robust and efficient tracking system for complex UAV scenarios.

4.3.2. Memory Consumption Analysis of Bidirectional Scanning

We conducted a series of experiments to measure the memory footprint of different scanning configurations and component combinations. Specifically, we measured the peak GPU memory consumption during the inference phase on the UAV123 dataset; the experiments were performed using a single NVIDIA Tesla A100 GPU (equipped with 40 GB VRAM) and adopted a fixed input size (template: 127 × 127, search region: 320 × 320). Table 4 presents a comprehensive breakdown of the memory consumption under these various configurations.
  • Spatial Scanning Overhead: As summarized in Table 4, full bidirectional spatial scanning (H + V) incurs an overhead of 15 MB compared to the average unidirectional scan (710 MB vs. ∼696.5 MB). This cost arises from maintaining parallel computation graphs for both scanning directions.
  • Temporal Scanning Overhead: Bidirectional temporal scanning (F + B) introduces a 20 MB overhead versus its unidirectional counterpart (725 MB vs. ∼706 MB). The larger increment is attributed to the caching mechanism required for future-frame context in backward scanning.
  • Synergistic Integration: The complete Bi-STM module (750 MB) consumes less memory than the sum of its isolated spatial (710 MB) and temporal (725 MB) bidirectional components. This indicates efficient memory sharing, particularly in shared normalization layers and residual connections.
  • Architectural Efficiency: Despite the advanced bidirectional design, MSTFT maintains high memory efficiency (780 MB). This contrasts sharply with Transformer-based trackers, whose quadratic attention mechanism demands significantly more memory. The linear complexity of the Mamba architecture is key to this efficiency.

4.3.3. Ablation Analysis of Triple Safety Verification Mechanism (TSVM)

To evaluate the intrinsic contribution of the Triple Safety Verification Mechanism (TSVM) integrated within the DTF-AA module to both computational overhead and tracking robustness, we construct two model variants. Both variants share an identical backbone architecture (Bi-STM, CAPH, and Vision Mamba) and differ only in the inclusion of TSVM:
  • Variant A (w/o TSVM): Employs the standard size-aware attention with a naïve template update strategy (update coefficient α = 0.2 ) and without any safety verification.
  • Variant B (w/TSVM): Incorporates the complete TSVM, comprising response peak assessment, temporal consistency checking, and motion stability evaluation (parameters: θ = 0.92 , ε = 0.75 , δ = 0.7 ).
The results demonstrate that TSVM introduces a controllable overhead (Table 5), adding only 34.2K parameters (approximately 0.05% of the total 70 M model parameters) and 1.98 G FLOPs per frame (a 7.4% increase relative to the baseline 26.8 G FLOPs). The consequent speed reduction is minimal, preserving real-time performance compatible with UAV resource constraints.
More importantly, TSVM yields a significant robustness gain. The 8.7 percentage-point improvement in AUC under full occlusion scenarios (Table 5) directly addresses a critical challenge in UAV tracking: the accumulation of errors caused by unreliable template updates. This substantial enhancement in occlusion handling justifies the modest computational cost incurred by the safety verification mechanism.

4.3.4. Ablation Analysis of Individual Loss Function Modifications

To validate the independent contributions of each proposed modification to the loss function, we conducted a controlled ablation study. This experiment isolated the two core innovations—the improved Focal Loss and the optimized regression loss—to quantify their individual and combined effects on small-target tracking performance. All other components (Bi-STM, DTF-AA, Vision Mamba backbone) and hyperparameters remained fixed across variants. Evaluation was performed on small-target sequences from the UAV123 dataset (42 sequences with target size < 30 × 30 pixels), using three key metrics: tracking AUC for small targets, average localization error (ALE in pixels), and average intersection over union (AIoU).
The results, summarized in Table 6, clearly validate both the independent value and the synergistic effect of each loss modification:
  • Improved Focal Loss. Compared to the baseline, using only the improved Focal Loss increases AUC by 3.6 percentage points, reduces localization error by 1.3 pixels, and improves AIoU by 3.5%. This confirms that the adjusted focusing parameter ( γ = 2.5 ) and the Gaussian-shaped classification label more effectively suppress background interference, thereby enhancing the discrimination of sparse small-target features.
  • Optimized Regression Loss. Employing only the optimized regression loss yields even greater gains: a 3.9-point AUC improvement, a 2.5-pixel reduction in ALE, and a 4.1% increase in AIoU versus the baseline. The integration of GIoU and an area-based penalty term successfully mitigates the evaluation bias inherent to small target sizes, leading to more precise bounding box regression.
  • Synergistic Effect. The full proposed loss achieves the best performance across all metrics (e.g., 71.6% AUC), with total gains exceeding the simple sum of individual improvements. This synergy occurs because the improved Focal Loss provides better foreground-background distinction and region priors, which in turn enables the optimized regression loss to perform more accurate localization. The components are mutually reinforcing, maximizing overall tracking performance for small UAV targets.

4.4. Qualitative Analysis and Visual Verification

Three typical challenging scenarios for UAV small targets were selected to visualize the tracking results of MSTFT against representative trackers (SiamRPN++, SiamBAN, STARK, OSTrack, MambaLCT, and TADMT), providing an intuitive validation of the proposed modules' effectiveness.

4.4.1. Low-Resolution Small Target Scenario

As illustrated in Figure 7, most comparative algorithms struggled to extract sufficient features in the low-resolution small target scenario. SiamRPN++ failed to capture effective discriminative information from the low-resolution target, resulting in a tracking-box center error exceeding 25 pixels after frame 1268. SiamBAN's shallow feature fusion mechanism could not suppress background noise, and its IoU dropped below 0.35 at frame 1852. MambaLCT, despite its efficient sequence modeling, lacked targeted enhancement for small-target features and drifted at frame 2416.
In contrast, MSTFT achieved stable tracking through the CAPH module's multi-scale feature enhancement. The module adaptively amplified the high-frequency details of the small target (e.g., the edge contours of clothing) and suppressed low-frequency background noise, keeping the center error below 8 pixels throughout the sequence. In addition, the Bi-STM module compensated for motion blur by modeling the target's slow walking trend (about 3 pixels/frame), so the IoU remained above 0.6 even at frame 3020, confirming the effectiveness of CAPH for low-resolution small-target feature extraction.

4.4.2. Low-Frame-Rate and Fast-Motion Scenario

As shown in Figure 8, the algorithms differed markedly on the rapidly maneuvering boat sequence. When the target underwent a sudden attitude change at frame 582, STARK failed to capture the inter-frame motion change in time because the high computational complexity of its Transformer self-attention introduced a response delay; its tracking box deviated to the boat's stern, with a center error exceeding 30 pixels. MambaLCT, limited by its unidirectional temporal scanning, could not correct the error introduced by the attitude change after frame 664, so its tracking box consistently lagged behind the boat's actual position, with a center error reaching 22 pixels. TADMT, disturbed by water splashes at frame 678, incorrectly incorporated splash features into its template update, producing a tracking box that included non-target areas and an IoU drop to 0.42.
In contrast, MSTFT demonstrated exceptional tracking stability: its Bi-STM module accurately predicted the boat’s linear motion inertia through forward scanning, while simultaneously utilizing backward scanning to correct the attitude change error promptly based on cached frame features, resulting in a center error consistently below 10 pixels throughout the sequence. Furthermore, the response peak verification mechanism of the DTF-AA module detected that the response value dropped to 0.88 at frame 664 and decisively rejected the template update for the splash-interfered frame, thereby ensuring the IoU remained stable above 0.6.

4.4.3. Long-Sequence Full Occlusion Scenario

As shown in Figure 9, the trackers differed notably when the target was occluded and later re-emerged in this long-term sequence. During the occlusion at frame 134, MambaLCT still performed a template update even though its response peak had dropped to 0.85 (below the 0.92 threshold adopted by MSTFT), because it lacks an effective verification mechanism; background road-marking features were therefore erroneously introduced into its template. When the target reappeared at frame 378, its tracking box locked onto an interfering road marking and the IoU dropped below 0.1. OSTrack, limited by the computational cost of the Transformer architecture on long sequences, exhibited significant delays after frame 652, could not generate the search region in time, and required 15 frames to recapture the target after its reappearance.
In contrast, MSTFT demonstrated exceptional robustness through the collaboration of its modules. The triple verification mechanism of the DTF-AA module detected at frame 500 that both the response peak (0.82) and the motion-stability score (0.65) fell below their thresholds and decisively refused to update the template. Meanwhile, the Bi-STM module continuously modeled the target's pre-occlusion motion pattern (uniform walking at about 5 pixels/frame), providing an accurate trajectory prediction for the target's re-emergence. Once the target reappeared, the CAPH module quickly matched reliable features from the historical template, converging within only 3 frames (IoU ≥ 0.55 at frame 623). Over the entire sequence, MSTFT incurred only one tracking drift versus six for MambaLCT, fully validating its stability in long-term occlusion scenarios with road-marking interference.

4.5. Computational Efficiency Analysis

As shown in Table 7, MSTFT achieves an excellent balance between performance and computational efficiency through several targeted optimizations. In terms of parameter efficiency, the adoption of Vision Mamba as the backbone network reduces parameter count to 70 M—a 62.8% reduction compared to SiamRPN++ (188 M). The DTF-AA module further enhances memory efficiency through an innovative dual-template storage strategy (single initial template + single dynamic template), replacing traditional multi-template approaches and reducing GPU memory usage from 2943 MB to 780 MB (a 73.5% reduction compared to OSTrack).
Computational efficiency is improved through the Bi-STM module's local scanning strategy, which processes only the target and its surrounding 1.5× region instead of performing a global scan. This design reduces the scanning computation by roughly 40% and brings the overall cost to 26.8 G FLOPs, compared with 35.8 G for SiamRPN++. In addition, TensorRT FP16 quantization during inference raises the processing speed to 45 FPS, compared with 38 FPS for MambaLCT, while maintaining tracking accuracy.
A key advantage of MSTFT is its linear computational complexity O(HW), which contrasts with the quadratic complexity O((HW)²) of Transformer-based methods and pays off on long sequences. On the 8000-frame sequence from the UAV20L dataset, OSTrack required 444 s, whereas MSTFT completed processing in only 178 s, a 60% reduction in inference time. The absolute saving grows with sequence length: for a 12,000-frame sequence, MSTFT requires 267 s compared with OSTrack's 666 s, demonstrating strong scalability for long-term tracking applications.
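As a sanity check, these long-sequence times follow directly from the per-frame throughput reported in Table 7, under the assumption that the per-frame cost stays constant over the sequence:

```python
# Assumed model: long-sequence inference time = number of frames / measured FPS (Table 7).

def inference_time_s(num_frames: int, fps: float) -> float:
    """Estimated wall-clock time for a sequence processed at a constant frame rate."""
    return num_frames / fps

for frames in (8000, 12000):
    t_mstft = inference_time_s(frames, 45)    # MSTFT runs at 45 FPS
    t_ostrack = inference_time_s(frames, 18)  # OSTrack runs at 18 FPS
    saving = 100 * (1 - t_mstft / t_ostrack)
    print(f"{frames} frames: MSTFT ~{t_mstft:.0f} s, OSTrack ~{t_ostrack:.0f} s, saving ~{saving:.0f}%")

# 8000 frames:  ~178 s vs. ~444 s (60% saving); 12000 frames: ~267 s vs. ~667 s,
# matching the reported 178 s / 444 s and 267 s / 666 s up to rounding.
```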

4.6. Analysis of Target Size Generalization

To comprehensively evaluate the generalization capability across scales, we categorized the sequences in the UAV123 dataset into three groups based on the average target size. The tracking success rates (Area Under Curve, AUC) of the proposed MSTFT and three representative baselines on these subsets are compared in Table 8.
Results in Table 8 show that MSTFT achieves the best performance on all subsets. Its clear lead on small targets (+8.2 AUC points over the strongest baseline) confirms the efficacy of the dedicated designs (e.g., CAPH's Gaussian prior). Significantly, it also outperforms the baselines on medium and large targets, showing that these designs are adaptive and do not degrade larger-target tracking. The core architecture (Bi-STM, DTF-AA) provides general benefits in spatio-temporal coherence and template reliability. MSTFT is therefore a broadly robust tracker whose small-target optimizations enhance, rather than compromise, general performance.
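The listing below is a toy illustration of the kind of Gaussian-weighted spatial prior that CAPH uses to guide feature fusion toward the expected small-target location; the prior width, the multiplicative fusion, and all names are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

# Toy Gaussian-weighted spatial prior: peaks far from the expected target location
# (e.g., background clutter) are attenuated before localization.

def gaussian_prior(h, w, cy, cx, sigma=4.0):
    """2D Gaussian map centred on the expected target position."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def apply_prior(response_map, prev_center, sigma=4.0):
    """Re-weight a correlation response with the Gaussian prior (assumed multiplicative fusion)."""
    h, w = response_map.shape
    prior = gaussian_prior(h, w, prev_center[0], prev_center[1], sigma)
    return response_map * prior

# Example: a stronger distractor peak in the background is suppressed relative to the
# true target peak located near the previous position (12, 12).
resp = np.zeros((25, 25))
resp[12, 13] = 0.9   # true target
resp[3, 22] = 1.0    # background distractor
weighted = apply_prior(resp, prev_center=(12, 12))
assert weighted[12, 13] > weighted[3, 22]
```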

4.7. Power and Energy Efficiency for Onboard UAV Deployment

To address the stringent power and energy constraints of UAV onboard computing, we conducted dedicated experiments to quantify the deployment efficiency of MSTFT on mainstream UAV-edge hardware. The objective was to evaluate its practical feasibility for real-world aerial missions.
Benchmarks were performed on two representative NVIDIA Jetson modules: the high-performance AGX Orin (64 GB) and the power-efficient Xavier NX. The evaluation used five representative 1000-frame sequences from the UAV123 dataset, covering challenging scenarios including small targets, fast motion, and background clutter. We measured four key metrics: average power (P_avg), peak power (P_peak), energy per frame (E_frame), and the achieved frames per second (FPS). Additionally, the theoretical mission endurance was estimated based on a standard 200 Wh lithium-polymer battery commonly used in professional UAVs for tracking tasks. The results are summarized in Table 9.
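For reference, the derived quantities in Table 9 are related by simple ratios. The sketch below reproduces them under two assumptions: energy per frame equals average power divided by achieved throughput (taken here as roughly 45 FPS on the AGX Orin, consistent with the table), and the endurance estimate covers only the compute module, not propulsion or payload power draw.

```python
# Assumed relations: E_frame = P_avg / FPS and endurance = battery energy / P_avg.

def energy_per_frame_j(p_avg_w: float, fps: float) -> float:
    """Energy consumed by the compute module per processed frame, in joules."""
    return p_avg_w / fps

def endurance_h(battery_wh: float, p_avg_w: float) -> float:
    """Theoretical endurance of the compute module alone, in hours."""
    return battery_wh / p_avg_w

print(round(energy_per_frame_j(28.7, 45.0), 2))  # 0.64 J/frame (AGX Orin)
print(round(endurance_h(200.0, 28.7), 2))        # 6.97 h  (AGX Orin, 200 Wh battery)
print(round(endurance_h(200.0, 14.3), 2))        # 13.99 h (Xavier NX, 200 Wh battery)
```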
The measured average power consumption of MSTFT ranges from 14.3 W on the Xavier NX to 28.7 W on the AGX Orin. This range aligns well with the typical power budgets of mainstream commercial and professional UAVs (generally ≤30 W). The Xavier NX variant is particularly suitable for micro-UAV platforms with stricter power limits (typically ≤20 W). Notably, the observed peak power (22.8–41.3 W) remains within the safe operating limits of standard UAV power systems, thereby mitigating risks of performance throttling or thermal overload during prolonged operation.
The low energy consumption per frame (0.33–0.64 J/frame) stems directly from two core architectural innovations in MSTFT. First, the linear computational complexity (O(N)) of the Mamba backbone eliminates the quadratic overhead associated with global attention mechanisms, substantially reducing redundant computations. Second, the localized scanning strategy of the Bi-STM module minimizes costly data transfers and memory access overhead. These design choices are particularly effective in lowering energy consumption on resource-constrained edge devices.

5. Discussion

Architectural Choices and Deployment Cost

The Mamba architecture adopted in this work uniquely combines linear scaling complexity (O(N)) with a global receptive field and superior long-sequence modeling capabilities. Its core innovation, the selective scanning mechanism, enables a dynamic focus on informative spatial regions while efficiently propagating context across time. As demonstrated in our experiments and memory analysis, this design achieves an optimal balance: it attains accuracy competitive with or surpassing leading Transformer-based trackers, while maintaining efficiency close to lightweight CNNs. The comparative advantages are summarized in Table 10.
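To make the distinction in Table 10 concrete, the toy listing below implements a simplified selective state-space scan with input-dependent parameters. It is a conceptual illustration only: real Mamba kernels use hardware-aware parallel scans and a more elaborate parameterization, and all dimensions and weight names here are arbitrary assumptions.

```python
import numpy as np

# Toy selective state-space scan (in the spirit of Mamba's S6 recurrence), assuming a
# diagonal state transition and input-dependent (B, C, delta) parameters.

rng = np.random.default_rng(0)
D, N, T = 8, 16, 100          # feature dim, state dim, sequence length (tokens)
A = -np.exp(rng.normal(size=(D, N)))            # stable diagonal dynamics per channel
W_delta = rng.normal(scale=0.1, size=(D, D))
W_B = rng.normal(scale=0.1, size=(D, N))
W_C = rng.normal(scale=0.1, size=(D, N))

def selective_scan(x: np.ndarray) -> np.ndarray:
    """x: (T, D) token sequence -> (T, D) outputs in O(T) time with O(D*N) state memory."""
    h = np.zeros((D, N))
    ys = np.empty_like(x)
    for t, x_t in enumerate(x):
        delta = np.log1p(np.exp(x_t @ W_delta))[:, None]    # input-dependent step size (softplus)
        B_t = x_t @ W_B                                      # input-dependent input projection
        C_t = x_t @ W_C                                      # input-dependent output projection
        A_bar = np.exp(delta * A)                            # discretized transition
        h = A_bar * h + delta * x_t[:, None] * B_t[None, :]  # selective state update
        ys[t] = h @ C_t                                      # readout
    return ys

y = selective_scan(rng.normal(size=(T, D)))
print(y.shape)  # (T, D): cost grows linearly with T, unlike the T^2 cost of full self-attention
```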
Practical deployment of visual tracking algorithms on UAV platforms is constrained by multifaceted considerations including computational cost, power consumption, and hardware limitations. The proposed MSTFT framework, by achieving high accuracy with linear complexity, significantly reduces the hardware requirements for high-performance UAV tracking. This enables deployment on mid-tier edge computing platforms (e.g., NVIDIA Jetson series) while maintaining performance levels previously attainable only with server-grade GPUs. Consequently, MSTFT facilitates substantial reductions in overall system cost, weight, and power consumption. Our memory analysis confirms efficient resource utilization, with peak memory consumption of approximately 780 MB comfortably accommodated within the 8–16 GB RAM configurations typical of modern edge computing modules.
While this study has achieved notable progress, several limitations remain. First, although the proposed DTF-AA template update mechanism with triple safety verification effectively suppresses error accumulation under moderate occlusion and background clutter, its performance in more challenging scenarios requires further optimization. For instance, under extreme long-term occlusion, where the target is completely blocked by large obstacles for more than 50 frames, the already sparse feature representation of small UAV targets degrades further, which may lead to temporary tracking drift during the recovery phase. Second, the current MSTFT framework is designed for single small-target tracking in UAV scenarios, optimizing performance for individual targets with sparse features; it has not yet been extended to multi-object tracking. Finally, this work relies solely on RGB visual features for target representation, whose discriminative power may be insufficient under severe visual degradation such as nighttime tracking. Although the modular design of the framework theoretically supports the integration of multi-modal data (e.g., infrared or thermal imaging), the compatibility of such fusion and the corresponding performance gains remain to be validated in future research.

6. Conclusions

This study addresses the core challenges in small object tracking in unmanned aerial vehicle (UAV) videos, including sparse target features, large inter-frame displacements, error accumulation in template updates, and the need to balance accuracy and efficiency. A Mamba-based spatio-temporal fusion tracker (MSTFT) is proposed. Three key modules are designed: the Bidirectional Spatio-Temporal Mamba (Bi-STM), the Dynamic Template Fusion with Adaptive Attention (DTF-AA), and the Small-Target-Aware Context Prediction Head (CAPH). Experimental results on three mainstream UAV datasets (UAV123, UAV123@10fps, and UAV20L) demonstrate that MSTFT achieves state-of-the-art tracking performance (success AUC of 79.4%, 76.5%, and 75.8%, respectively) while maintaining real-time inference speed (45 FPS) and lightweight characteristics (70 M parameters, 780 MB peak GPU memory).
The designed Bi-STM enables synergistic modeling of spatial details and temporal dynamics with linear computational complexity, reducing motion prediction error by 38.9%. DTF-AA suppresses template update error to 11% under full occlusion through multi-dimensional verification and scale-aware attention mechanisms. CAPH enhances small-target localization accuracy via Gaussian-weighted priors and an optimized loss function, reducing the average localization error to 8.7 pixels.
This study contributes a closed-loop framework for UAV small object tracking and enriches the theoretical landscape of Mamba-based tracking. Practically, it provides a deployable solution for UAV applications such as search and rescue and environmental monitoring. In terms of performance, it offers advantages in accuracy, efficiency, or both over existing Siamese-, Transformer-, and Mamba-based methods. During implementation, key challenges, such as balancing feature enhancement against computational efficiency, designing scenario-adaptive template update mechanisms, and ensuring experimental consistency, were addressed through repeated ablation studies, cross-dataset parameter tuning, and standardized evaluation tools.
Future work will focus on extending the approach to multi-object tracking, integrating multi-modal data to enhance adaptability in extreme environments, further reducing the model footprint for deployment on micro-UAV platforms, and improving dynamic adaptation to extreme target scale variations, thereby broadening the method's applicability and practical value.

Author Contributions

Conceptualization, H.C. and H.Z.; Methodology, K.S. and H.Z.; Software, K.S.; Validation, H.C. and H.Z.; Formal analysis, K.S.; Investigation, H.Z.; Writing—original draft preparation, K.S.; Writing—review and editing, H.C. and H.Z.; Supervision, H.C.; Project administration, H.C.; Funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (62163023, 61873116, 62366031, 62363023), the Gansu Provincial Basic Research Innovation Group of China (25JRRA058), the Central Government’s Funds for Guiding Local Science and Technology Development of China (25ZYJA040), the Gansu Provincial Key Talent Project of China (2024RCXM86) and the Gansu Provincial Special Fund for Military-Civilian Integration Development of China.

Data Availability Statement

The datasets analyzed in this study include publicly available UAV tracking benchmarks, specifically the UAV123 dataset, which can be accessed via the official link: https://ivul.kaust.edu.sa/benchmark-and-simulator-uav-tracking-dataset (accessed on 3 January 2026). The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that the research content of this paper is entirely original and that there are no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUC: Area Under the Curve
Bi-STM: Bidirectional Spatio-Temporal Mamba module
CAPH: Small-Target-Aware Context Prediction Head
CNN: Convolutional Neural Network
DCF: Discriminative Correlation Filter
DTF-AA: Dynamic Template Fusion module based on Adaptive Attention
FPS: Frames Per Second
GIoU: Generalized Intersection over Union
IoU: Intersection over Union
MSTFT: Mamba-Based Spatio-Temporal Fusion Tracker
UAV: Unmanned Aerial Vehicle
ViM: Vision Mamba

References

  1. Haque, A.; Chowdhury, M.N.U.R.; Hassanalian, M. A Review of Classification and Application of Machine Learning in Drone Technology. AI Comput. Sci. Robot. Technol. 2025, 4, 1–32. [Google Scholar] [CrossRef]
  2. Lyu, M.; Zhao, Y.; Huang, C.; Huang, H. Unmanned Aerial Vehicles for Search and Rescue: A Survey. Remote Sens. 2023, 15, 3266. [Google Scholar] [CrossRef]
  3. Mohsan, S.A.H.; Othman, N.Q.H.; Li, Y.; Alsharif, M.H.; Khan, M.A. Unmanned Aerial Vehicles (UAVs): Practical Aspects, Applications, Open Challenges, Security Issues, and Future Trends. Intell. Serv. Rob. 2023, 16, 109–137. [Google Scholar] [CrossRef] [PubMed]
  4. Laghari, A.A.; Jumani, A.K.; Laghari, R.A.; Li, H.; Karim, S.; Khan, A.A. Unmanned Aerial Vehicles Advances in Object Detection and Communication Security Review. Cognit. Rob. 2024, 4, 128–141. [Google Scholar] [CrossRef]
  5. Ariante, G.; Core, G.D. Unmanned Aircraft Systems (UASs): Current State, Emerging Technologies, and Future Trends. Drones 2025, 9, 59. [Google Scholar] [CrossRef]
  6. Quamar, M.M.; Al-Ramadan, B.; Khan, K.; Shafiullah, M.; Ferik, S.E. Advancements and Applications of Drone-Integrated Geographic Information System Technology—A Review. Remote Sens. 2023, 15, 5039. [Google Scholar] [CrossRef]
  7. Wang, Q.; Zhou, L.; Xu, C.; Shang, Y.; Jin, P.; Cao, C.; Shen, T. Progress and Perspectives on UAV Visual Object Tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20214–20239. [Google Scholar] [CrossRef]
  8. Cao, Y.; Dong, S.; Zhang, J.; Xu, H.; Zhang, Y.; Zheng, Y. Adaptive Spatial Regularization Correlation Filters for UAV Tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7867–7877. [Google Scholar] [CrossRef]
  9. Su, Y.; Xu, F.; Wang, Z.; Sun, M.; Zhao, H. A Context Constraint and Sparse Learning Based on Correlation Filter for High-Confidence Coarse-to-Fine Visual Tracking. Expert Syst. Appl. 2025, 268, 126225. [Google Scholar] [CrossRef]
  10. Wang, K.; Wang, Z.; Zhang, X.; Liu, M. BSTrack: Robust UAV Tracking Using Feature Extraction of Bilateral Filters and Sparse Attention Mechanism. Expert Syst. Appl. 2025, 267, 126202. [Google Scholar] [CrossRef]
  11. Kumar, A.; Vohra, R.; Jain, R.; Li, M.; Gan, C.; Jain, D.K. Correlation Filter Based Single Object Tracking: A Review. Inf. Fusion 2024, 112, 102562. [Google Scholar] [CrossRef]
  12. Du, S.; Wang, S. An Overview of Correlation-Filter-Based Object Tracking. IEEE Trans. Comput. Social Syst. 2022, 9, 18–31. [Google Scholar] [CrossRef]
  13. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep Learning for Unmanned Aerial Vehicle-Based Object Detection and Tracking: A Survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 91–124. [Google Scholar] [CrossRef]
  14. Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Deep Learning for Visual Tracking: A Comprehensive Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 3943–3968. [Google Scholar] [CrossRef]
  15. Jiao, L.; Wang, D.; Bai, Y.; Chen, P.; Liu, F. Deep Learning in Visual Tracking: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5497–5516. [Google Scholar] [CrossRef]
  16. Lu, X.; Li, F.; Yang, W. Siamada: Visual Tracking Based on Siamese Adaptive Learning Network. Neural Comput. Appl. 2024, 36, 7639–7656. [Google Scholar] [CrossRef]
  17. Sabeeh Hasan Allak, A.; Yi, J.; Al-Sabbagh, H.M.; Chen, L. Siamese Neural Networks in Unmanned Aerial Vehicle Target Tracking Process. IEEE Access 2025, 13, 24309–24322. [Google Scholar] [CrossRef]
  18. Lim, S.C.; Huh, J.H.; Kim, J.C. Siamese Trackers Based on Deep Features for Visual Tracking. Electronics 2023, 12, 4140. [Google Scholar] [CrossRef]
  19. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  20. Ni, X.; Yuan, L.; Lv, K. Efficient Single-Object Tracker Based on Local-Global Feature Fusion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1114–1122. [Google Scholar] [CrossRef]
  21. Zheng, J.; Liang, M.; Huang, S.; Ning, J. Exploring the Feature Extraction and Relation Modeling for Light-Weight Transformer Tracking. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 110–126. [Google Scholar] [CrossRef]
  22. Chen, X.; Yan, B.; Zhu, J.; Lu, H.; Ruan, X.; Wang, D. High-Performance Transformer Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8507–8523. [Google Scholar] [CrossRef]
  23. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. SwinTrack: A Simple and Strong Baseline for Transformer Tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754. [Google Scholar]
  24. Kugarajeevan, J.; Kokul, T.; Ramanan, A.; Fernando, S. Transformers in Single Object Tracking: An Experimental Survey. IEEE Access 2023, 11, 80297–80326. [Google Scholar] [CrossRef]
  25. Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A Survey on Visual Mamba. Appl. Sci. 2024, 14, 5683. [Google Scholar] [CrossRef]
  26. Xu, R.; Yang, S.; Wang, Y.; Cai, Y.; Du, B.; Chen, H. Visual Mamba: A Survey and New Outlooks. arXiv 2024, arXiv:2404.18861. [Google Scholar] [CrossRef]
  27. Li, X.; Zhong, B.; Liang, Q.; Li, G.; Mo, Z.; Song, S. Mambalct: Boosting Tracking via Long-Term Context State Space Model. Proc. AAAI Conf. Artif. Intell. 2025, 39, 4986–4994. [Google Scholar] [CrossRef]
  28. Liu, B.; Chen, C.; Li, J.; Yu, G.; Song, H.; Liu, X.; Cui, J.; Zhang, H. TrackingMiM: Efficient Mamba-in-Mamba Serialization for Real-Time UAV Object Tracking. arXiv 2025, arXiv:2507.01535. [Google Scholar] [CrossRef]
  29. Xie, J.; Zhong, B.; Liang, Q.; Li, N.; Mo, Z.; Song, S. Robust Tracking via Mamba-Based Context-Aware Token Learning. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8727–8735. [Google Scholar] [CrossRef]
  30. Yao, M.; Peng, J.; He, Q.; Peng, B.; Chen, H.; Chi, M.; Liu, C.; Benediktsson, J.A. MM-Tracker: Motion Mamba for UAV-Platform Multiple Object Tracking. Proc. AAAI Conf. Artif. Intell. 2025, 39, 9409–9417. [Google Scholar] [CrossRef]
  31. Wang, Q.; Zhou, L.; Jin, P.; Qu, X.; Zhong, H.; Song, H.; Shen, T. TrackingMamba: Visual State Space Model for Object Tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16744–16754. [Google Scholar] [CrossRef]
  32. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar] [CrossRef]
  33. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  34. Zhu, X.F.; Wu, X.J.; Xu, T.; Feng, Z.H.; Kittler, J. Complementary Discriminative Correlation Filters Based on Collaborative Representation for Visual Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 557–568. [Google Scholar] [CrossRef]
  35. Zhang, Z.; Peng, H. Deeper and Wider Siamese Networks for Real-Time Visual Tracking. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600. [Google Scholar]
  36. Liu, Z.; Zhong, Y.; Ma, G.; Wang, X.; Zhang, L. A Deep Temporal-Spectral-Spatial Anchor-Free Siamese Tracking Network for Hyperspectral Video Object Tracking. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5539216. [Google Scholar] [CrossRef]
  37. Cen, M.; Jung, C. Fully Convolutional Siamese Fusion Networks for Object Tracking. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3718–3722. [Google Scholar] [CrossRef]
  38. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-Convolutional Siamese Networks for Object Tracking. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 October 2016; Hua, G., Jégou, H., Eds.; Springer: Cham, Switzerland, 2016; pp. 850–865. [Google Scholar] [CrossRef]
  39. Yuan, D.; Liao, D.; Huang, F.; Qiu, Z.; Shu, X.; Tian, C.; Liu, Q. Hierarchical Attention Siamese Network for Thermal Infrared Target Tracking. IEEE Trans. Instrum. Meas. 2024, 73, 5032411. [Google Scholar] [CrossRef]
  40. Yang, P.; Wang, Q.; Dou, J.; Dou, L. Learning Saliency-Awareness Siamese Network for Visual Object Tracking. J. Visual Commun. Image Represent. 2024, 103, 104237. [Google Scholar] [CrossRef]
  41. Yang, X.; Huang, J.; Liao, Y.; Song, Y.; Zhou, Y.; Yang, J. Light Siamese Network for Long-Term Onboard Aerial Tracking. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5623415. [Google Scholar] [CrossRef]
  42. Sun, L.; Zhang, J.; Yang, Z.; Gao, D.; Fan, B. Long-Term Object Tracking Based on Joint Tracking and Detection Strategy with Siamese Network. Multimedia Syst. 2024, 30, 162. [Google Scholar] [CrossRef]
  43. Li, Y.; Zhang, X.; Chen, D. SiamVGG: Visual Tracking Using Deeper Siamese Networks. arXiv 2022, arXiv:1902.02804. [Google Scholar] [CrossRef]
  44. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R.; Tang, Z.; Li, X. SiamBAN: Target-Aware Tracking with Siamese Box Adaptive Network. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5158–5173. [Google Scholar] [CrossRef]
  45. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
  46. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12549–12556. [Google Scholar] [CrossRef]
  47. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  48. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
  49. Zhang, W.; Xu, T.; Xie, F.; Wu, J.; Yang, W. TATrack: Target-Oriented Adaptive Vision Transformer for UAV Tracking. Neural Netw. 2026, 193, 108067. [Google Scholar] [CrossRef] [PubMed]
  50. Lee, D.; Choi, W.; Lee, S.; Yoo, B.; Yang, E.; Hwang, S. BackTrack: Robust Template Update via Backward Tracking of Candidate Template. arXiv 2023, arXiv:2308.10604. [Google Scholar] [CrossRef]
  51. Liu, S.; Liu, D.; Muhammad, K.; Ding, W. Effective Template Update Mechanism in Visual Tracking with Background Clutter. Neurocomputing 2021, 458, 615–625. [Google Scholar] [CrossRef]
  52. Xie, Q.; Liu, K.; Zhiyong, A.; Wang, L.; Li, Y.; Xiang, Z. A Novel Incremental Multi-Template Update Strategy for Robust Object Tracking. IEEE Access 2020, 8, 162668–162682. [Google Scholar] [CrossRef]
  53. Zhang, L.; Gonzalez-Garcia, A.; Weijer, J.V.D.; Danelljan, M.; Khan, F.S. Learning the Model Update for Siamese Trackers. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4010–4019. [Google Scholar]
  54. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  55. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. Seqtrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14572–14581. [Google Scholar]
  56. Hosseini, S.M.H.; Hassanpour, M.; Masoudnia, S.; Iraji, S.; Raminfard, S.; Nazem-Zadeh, M. Cttrack: A Cnn+ Transformer-Based Framework for Fiber Orientation Estimation & Tractography. Neurosci. Inf. 2022, 2, 100099. [Google Scholar] [CrossRef]
  57. Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning Target Candidate Association to Keep Track of What Not to Track. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13444–13454. [Google Scholar]
  58. Shi, L.; Zhong, B.; Liang, Q.; Li, N.; Zhang, S.; Li, X. Explicit Visual Prompts for Visual Object Tracking. Proc. AAAI Conf. Artif. Intell. 2024, 38, 4838–4846. [Google Scholar] [CrossRef]
  59. Zheng, Y.; Zhong, B.; Liang, Q.; Mo, Z.; Zhang, S.; Li, X. Odtrack: Online Dense Temporal Token Learning for Visual Tracking. Proc. AAAI Conf. Artif. Intell. 2024, 38, 7588–7596. [Google Scholar] [CrossRef]
  60. Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19300–19309. [Google Scholar]
  61. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Onboard Real-Time Aerial Tracking with Efficient Siamese Anchor Proposal Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5606913. [Google Scholar] [CrossRef]
  62. Lin, Y.; Yin, M.; Zhang, Y.; Guo, X. Robust UAV Tracking via Information Synergy Fusion and Multi-Dimensional Spatial Perception. IEEE Access 2025, 13, 39886–39900. [Google Scholar] [CrossRef]
  63. Li, L.; Chen, C.; Yu, X.; Pang, S.; Qin, H. SiamTADT: A Task-Aware Drone Tracker for Aerial Autonomous Vehicles. IEEE Trans. Veh. Technol. 2025, 74, 3708–3722. [Google Scholar] [CrossRef]
  64. Xue, C.; Zhong, B.; Liang, Q.; Zheng, Y.; Li, N.; Xue, Y.; Song, S. Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking. arXiv 2025, arXiv:2503.06625. [Google Scholar] [CrossRef]
  65. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 341–357. [Google Scholar] [CrossRef]
  66. Wang, S.; Wang, Z.; Sun, Q.; Cheng, G.; Ning, J. Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking. IEEE Trans. Image Process. 2024, 33, 5073–5085. [Google Scholar] [CrossRef] [PubMed]
  67. Du, G.; Zhou, P.; Yadikar, N.; Aysa, A.; Ubul, K. Mamba Meets Tracker: Exploiting Token Aggregation and Diffusion for Robust Unmanned Aerial Vehicles Tracking. Complex Intell. Syst. 2025, 11, 204. [Google Scholar] [CrossRef]
Figure 1. Schematic architecture of the Mamba-based Spatio-Temporal Fusion Tracker (MSTFT) with core modules.
Figure 2. Schematic of the bidirectional scanning mechanism in the Bi-STM module.
Figure 3. Mamba Layer schematic with In/Out Attention, FFN modules for multi-frame feature enhancement and hidden state computation.
Figure 4. Comparison of Precision and Success Curves of Different UAV Tracking Algorithms under OPE Protocol.
Figure 5. Precision plots of OPE under diverse UAV tracking challenge scenarios.
Figure 6. Success plots of OPE under diverse UAV tracking challenge scenarios.
Figure 7. Visual comparison of object tracking algorithms for pedestrian targets in aerial low-resolution scenes.
Figure 8. Visual comparison of object tracking algorithms for vessels in aerial fast-motion scenes.
Figure 9. Visual comparison of object tracking algorithms for pedestrians in aerial long-sequence occlusion scenes.
Table 1. Performance comparison of different types of unmanned aerial vehicle (UAV) tracking algorithms on UAV123, UAV123@10fps, and UAV20L datasets.

| Algorithm Type | Algorithm Name | Year | UAV123 (AUC/Precision) | UAV123@10fps (AUC/Precision) | UAV20L (AUC/Precision) |
|---|---|---|---|---|---|
| Siamese | SiamFC [38] | 2016 | 48.5/64.8 | 47.2/67.8 | 40.2/59.9 |
| Siamese | SiamRPN++ [45] | 2019 | 73.2/79.5 | 55.5/69.0 | 52.8/69.6 |
| Siamese | SiamBAN [44] | 2020 | 68.0/77.6 | 59.8/78.4 | 58.3/75.9 |
| Siamese | SiamAPN [61] | 2022 | 57.5/76.6 | 56.6/75.2 | 53.9/72.1 |
| Siamese | SiamFM [62] | 2025 | 62.8/82.0 | -/- | 58.5/76.8 |
| Siamese | SiamTADT [63] | 2025 | 65.1/84.2 | 64.0/83.4 | 64.3/83.1 |
| Transformer-based | TransT [47] | 2021 | 62.9/81.8 | 61.2/79.2 | 53.7/69.6 |
| Transformer-based | STARK [48] | 2022 | 74.4/80.4 | 63.8/82.5 | 60.1/77.3 |
| Transformer-based | OSTrack [65] | 2023 | 76.3/80.4 | 63.7/79.7 | 63.0/78.2 |
| Transformer-based | MCTrack [66] | 2024 | 65.2/79.1 | 64.1/80.1 | 65.2/79.5 |
| Transformer-based | SGLATrack [64] | 2025 | 66.9/84.2 | 64.3/81.7 | 64.0/79.2 |
| Mamba-based | MambaLCT [27] | 2025 | 76.7/81.5 | 68.5/80.3 | 65.2/81.5 |
| Mamba-based | TrackingMiM [28] | 2025 | 70.8/83.5 | 69.1/82.6 | 66.5/83.2 |
| Mamba-based | TADMT [67] | 2025 | 68.2/83.8 | 69.1/83.4 | 68.6/81.9 |
| Mamba-based | TemTrack [29] | 2025 | 77.5/83.1 | 69.3/83.7 | 69.7/82.3 |
| Mamba-based | MSTFT | - | **79.4/84.5** | **76.5/84.1** | **75.8/83.6** |
Note: AUC refers to the Area Under the Success Curve, and Precision denotes the tracking precision under the 20-pixel threshold. Best performance is highlighted in bold.
Table 2. Performance of representative algorithms on different challenge attributes.

| Challenge Attribute | Sample Sequences | SiamRPN++ | OSTrack | TrackingMiM | MSTFT | Core Contribution Module |
|---|---|---|---|---|---|---|
| Small Target | 42 | 52.1 | 58.4 | 62.3 | 70.5 | CAPH |
| Fast Motion | 28 | 64.8 | 69.7 | 65.7 | 73.3 | Bi-STM |
| Low Resolution | 35 | 46.8 | 57.8 | 59.2 | 61.1 | CAPH+Bi-STM |
| Partial Occlusion | 73 | 61.7 | 64.3 | 66.1 | 69.2 | DTF-AA |
| Full Occlusion | 33 | 43.5 | 45.7 | 47.9 | 51.2 | DTF-AA |
| Background Clutter | 21 | 48.8 | 55.9 | 58.3 | 60.9 | Bi-STM+DTF-AA |
All numerical values represent tracking success rate (AUC, %). CAPH, Bi-STM, and DTF-AA are core modules designed for specific challenges.
Table 3. Performance comparison under different experimental configurations.

| Exp. Config. | AUC (%) | Prec (%) | Small Target AUC (%) | Fast Motion AUC (%) | Full Occlusion AUC (%) | Params (M) | FLOPs (G) | Speed (FPS) | GPU Mem (MB) |
|---|---|---|---|---|---|---|---|---|---|
| A1 (Baseline) | 65.8 | 76.3 | 58.2 | 58.2 | 45.0 | 70 | 22.3 | 52 | 680 |
| A2 (+Bi-STM) | 70.3 | 80.5 | 62.7 | 66.5 | 49.7 | 72 | 24.5 | 48 | 710 |
| A3 (+DTF-AA) | 72.9 | 83.2 | 65.3 | 68.9 | 54.4 | 73 | 25.7 | 46 | 750 |
| A4 (MSTFT) | 75.2 | 84.5 | 71.6 | 73.3 | 58.7 | 74 | 26.8 | 45 | 780 |
Note: AUC = Area Under the Curve; Prec = Precision; Params = Parameter Count (millions); FLOPs = Floating-Point Operations (gigaflops); Speed = Inference Speed (tested on a single GPU); GPU Mem = Maximum GPU memory usage during inference.
Table 4. Detailed memory consumption analysis of bidirectional scanning and core components (measured on NVIDIA Tesla A100).

| Configuration | Scanning Strategy | Components Included | Peak Memory (MB) |
|---|---|---|---|
| Baseline (A1) | No scanning (backbone only) | Basic Prediction Head | 680 |
| Spatial Only | Unidirectional Horizontal | Bi-STM (Spatial: H-only) | 695 |
| Spatial Only | Unidirectional Vertical | Bi-STM (Spatial: V-only) | 698 |
| Spatial Only | Bidirectional (H + V) | Bi-STM (Spatial: Full) | 710 |
| Temporal Only | Unidirectional Forward | Bi-STM (Temporal: Forward) | 705 |
| Temporal Only | Unidirectional Backward | Bi-STM (Temporal: Backward) | 707 |
| Temporal Only | Bidirectional (F + B) | Bi-STM (Temporal: Full) | 725 |
| Full Bi-STM | Full Bidirectional | Complete Bi-STM | |
| A3 (+DTF-AA) | Full Bidirectional | Bi-STM + DTF-AA | 770 |
| Full MSTFT (A4) | Full Bidirectional | All Components | 780 |
Note: Best performance (highest peak memory) and key configurations are highlighted in bold.
Table 5. Ablation results of the Triple Safety Verification Mechanism (TSVM).

| Model Variant | Additional Params (K) | Additional FLOPs (G/frame) | Speed Loss (FPS) | Full Occlusion AUC (%) | Robustness Gain (AUC ↑ %) |
|---|---|---|---|---|---|
| Variant A (w/o TSVM) | 0 | 0.00 | 0.0 | 49.9 | 0.0 |
| Variant B (w/TSVM) | 34.2 | 1.98 | 3.6 | 58.6 | 8.7 |
Note: 1. Model Variant: Indicates the model configuration. “w/o TSVM” denotes the model without the Triple Safety Verification Mechanism, while “w/TSVM” denotes the model with the Triple Safety Verification Mechanism. 2. Inference speed of Variant A is 48.6 FPS; Variant B maintains 45.0 FPS (well above 30 FPS real-time threshold). 3. ↑ denotes performance improvement.
Table 6. Ablation results of individual loss function modifications.

| Model Variant | Small-Target AUC (%) | Average Localization Error (Pixels) | Average IoU (%) |
|---|---|---|---|
| Variant 1 (Baseline Loss) | 65.3 | 12.8 | 58.2 |
| Variant 2 (Only Improved Focal Loss) | 68.9 | 11.5 | 61.7 |
| Variant 3 (Only Optimized Regression Loss) | 69.2 | 10.3 | 62.3 |
| Variant 4 (Full Proposed Loss) | 71.6 | 8.7 | 65.1 |
Table 7. Complexity comparison of different tracking algorithms.

| Algorithm Name | Params (M) | FLOPs (G) | Speed (FPS) | GPU Mem (MB) | Long-seq Inference Time (s/8000 frames) | Complexity Type |
|---|---|---|---|---|---|---|
| SiamRPN++ [45] | 188 | 35.8 | 32 | 2236 | 250 | Linear (O(HW)) |
| OSTrack [65] | 113 | 45.2 | 18 | 2943 | 444 | Quadratic (O((HW)²)) |
| MambaLCT [27] | 85 | 22.4 | 38 | 1624 | 210 | Linear (O(HW)) |
| TrackingMiM [28] | 97 | 31.2 | 35 | 1952 | 228 | Linear (O(HW)) |
| TADMT [67] | 102 | 33.5 | 30 | 2015 | 266 | Linear (O(HW)) |
| MSTFT | 70 | 26.8 | 45 | 780 | 178 | Linear (O(HW)) |
Note: Params = Parameter Count (millions); FLOPs = Floating-Point Operations (gigaflops); Speed = Inference Speed (frames per second); GPU Mem = Maximum GPU memory usage; Long-seq Inference Time = Inference latency for 8000 consecutive frames; HW = Height × Width of feature map.
Table 8. Tracking success rate (AUC, %) comparison on different target size subsets of UAV123.

| Algorithm | Small Targets | Medium Targets | Large Targets |
|---|---|---|---|
| SiamRPN++ [45] | 52.1 | 71.8 | 78.5 |
| OSTrack [65] | 58.4 | 73.2 | 79.1 |
| MambaLCT [27] | 62.3 | 74.9 | 80.4 |
| MSTFT (Ours) | **70.5** | **76.8** | **81.2** |
Note: Best performance is highlighted in bold.
Table 9. Power and energy performance of MSTFT on UAV-edge hardware platforms.

| Hardware Platform | P_avg (W) | P_peak (W) | E_frame (J/Frame) | Endurance (200 Wh, h) |
|---|---|---|---|---|
| NVIDIA Jetson AGX Orin | 28.7 | 41.3 | 0.64 | ≈6.97 |
| NVIDIA Jetson Xavier NX | 14.3 | 22.8 | 0.33 | ≈13.99 |
Note: P_avg = Average power consumption; P_peak = Peak power consumption; E_frame = Energy per frame; Endurance estimated for a 200 Wh battery.
Table 10. Comparison of Core Characteristics Among CNN-based, Transformer-based, and Mamba-based Trackers.

| Characteristic | CNN-Based | Transformer-Based | Mamba-Based (Ours) |
|---|---|---|---|
| Global Context | Local window | Self-attention | Selective SSM |
| Theoretical Complexity | O(N) | O(N²) | O(N) |
| Memory Efficiency | High | Large attention maps | Compact states |
| Suitability for Small Targets | Needs multi-scale tricks | Global view | Dynamic focus |
| Real-time Edge Feasibility | High | Challenging | High |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
