Article

LF-SSM: Lightweight HiPPO-Free State Space Model for Real-Time UAV Tracking

1
National Key Laboratory of Electromagnetic Energy, Wuhan 430030, China
2
East Lake Laboratory, Wuhan 430202, China
*
Author to whom correspondence should be addressed.
Drones 2026, 10(2), 102; https://doi.org/10.3390/drones10020102
Submission received: 28 December 2025 / Revised: 22 January 2026 / Accepted: 24 January 2026 / Published: 31 January 2026
(This article belongs to the Section Artificial Intelligence in Drones (AID))

Highlights

What are the main findings?
  • We propose LF-SSM, a lightweight HiPPO-free state space model that reformulates state evolution on Riemannian manifolds through geodesic dynamics, eliminating complex discretization procedures and specialized hardware kernels required by existing SSM methods.
  • The proposed Geodesic State Module (GSM) performs state updates through tangent space projection and exponential mapping on the unit sphere, providing adaptive local coordinate systems that preserve geometric structure of tracking features.
What are the implications of the main findings?
  • LF-SSM achieves state-of-the-art performance on multiple UAV tracking benchmarks while running at 69 FPS with only 18.5M parameters on edge platforms (Jetson Orin Nano), enabling practical real-time deployment on resource-constrained UAV systems.
  • The manifold-based formulation demonstrates that geometric approaches can effectively replace HiPPO-derived state transitions in visual tracking, opening new research directions for efficient sequence modeling without relying on fixed polynomial bases.

Abstract

Visual object tracking from unmanned aerial vehicles (UAVs) demands both high accuracy and computational efficiency for real-time deployment on resource-constrained platforms. While state space models (SSMs) offer linear computational complexity, existing methods face critical deployment challenges. They rely on the HiPPO framework with complex discretization procedures and employ hardware-aware algorithms optimized for high-performance GPUs, which introduce deployment overhead and are difficult to transfer to edge platforms. Additionally, their fixed polynomial bases may cause information loss for tracking features with complex geometric structures. We propose LF-SSM, a lightweight HiPPO (High-order Polynomial Projection Operators)-free state space model that reformulates state evolution on Riemannian manifolds. The core contribution is the Geodesic State Module (GSM), which performs state updates through tangent space projection and exponential mapping on the unit sphere. This design eliminates complex discretization and specialized hardware kernels while providing adaptive local coordinate systems. Extensive experiments on UAV benchmarks demonstrate that LF-SSM achieves state-of-the-art performance while running at 69 frames per second (FPS) with only 18.5 M parameters, demonstrating superior efficiency for real-time edge deployment.

1. Introduction

Visual object tracking from unmanned aerial vehicles (UAVs) has become increasingly important for autonomous aerial perception. Applications such as aerial surveillance [1,2], search and rescue operations [3,4], precision agriculture [5], and autonomous navigation [6] all rely on accurate tracking from aerial platforms. However, UAV platforms impose strict limitations on computational resources, power budgets, and payload capacity. These constraints require tracking algorithms to achieve both high accuracy and real-time performance under resource-constrained conditions [7,8,9,10].
Transformer-based trackers [11,12,13,14,15] achieve remarkable accuracy through powerful global modeling capabilities, but their quadratic computational complexity with respect to sequence length limits their deployment on edge platforms. State space models (SSMs) [16] offer an alternative with linear computational complexity, and recent SSM-based tracking methods [17,18] have shown promising results. However, existing SSM architectures face deployment challenges on resource-constrained platforms. Mamba [19] employs hardware-aware algorithms specifically optimized for high-performance GPUs, which are difficult to efficiently deploy on edge platforms with limited computational resources.
Beyond deployment efficiency, modern SSM architectures present a fundamental limitation in feature representation. Recent SSM-based methods including Mamba [19] rely on the HiPPO (High-order Polynomial Projection Operators) framework [20] to construct their state transition matrices. HiPPO projects historical signals onto a fixed Legendre polynomial basis to achieve long-range memory. While this approach enables efficient sequence modeling, the fixed polynomial basis represents a universal design that does not adapt to task-specific data distributions. According to the manifold hypothesis [21,22], high-dimensional visual features typically reside on low-dimensional manifolds embedded in the ambient space. This geometric perspective has been validated in visual understanding tasks, where hyperspherical representations improve feature discrimination by constraining embeddings to unit spheres [23,24]. The success of normalized feature spaces in Siamese tracking networks further supports this view, as cosine similarity-based matching inherently operates on angular relationships between feature vectors on hyperspheres [25]. In UAV tracking scenarios, target appearance features exhibit complex intrinsic geometric structures due to deformation, scale variation, and viewpoint change [26]. These transformations define smooth trajectories in feature space rather than arbitrary jumps, which is precisely the characteristic of manifold-structured data. These structures fundamentally differ from smooth polynomial curves. Projecting such manifold-structured features onto a fixed Euclidean basis causes information loss during the state compression process, leading to suboptimal tracking performance in challenging scenarios.
Based on this analysis, we identify two critical limitations of existing SSM-based methods for UAV tracking. First, the reliance on hardware-aware algorithms makes them difficult to deploy on resource-constrained edge platforms. Second, the fixed polynomial basis causes information loss when representing tracking features with complex geometric structures. As illustrated in Figure 1a, HiPPO-based SSMs project manifold-structured hidden states onto a fixed polynomial coefficient space, which may lose geometric information during compression.
To address these limitations, we propose LF-SSM, a lightweight HiPPO-free state space model for UAV tracking. Our key insight is that if tracking features naturally reside on manifolds, the state space itself should be defined on a manifold rather than in Euclidean space. As shown in Figure 1b, we reformulate state evolution as geodesic motion on Riemannian manifolds, where the tangent space at each state provides an adaptive local coordinate system. This formulation eliminates the need for HiPPO-derived state transition matrices and specialized hardware kernels. The adaptive tangent space naturally changes as the state evolves, avoiding the information loss caused by fixed basis compression. This design enables efficient deployment on edge platforms while preserving the geometric structure of tracking features.
Our contributions are summarized as follows:
  • We propose LF-SSM, a lightweight HiPPO-free state space model that fundamentally reformulates state evolution on Riemannian manifolds for visual tracking. Unlike existing SSM methods that project features onto fixed Legendre polynomial bases, our geodesic dynamics naturally preserve the geometric structure of tracking features through adaptive local coordinate systems. This eliminates both the computational overhead of HiPPO discretization and the information loss from fixed basis compression, while maintaining linear complexity $O(TEN)$ with respect to sequence length.
  • We design a Geodesic State Module (GSM) that performs state updates through tangent space projection and exponential mapping on the unit sphere $S^{N-1}$. The core innovation lies in replacing the $N \times N$ HiPPO-derived state transition matrix with geometric operations involving only $O(N)$ vector computations. This design eliminates the need for specialized hardware kernels (such as Mamba's selective scan) while providing adaptive tangent spaces $T_{h_t} S^{N-1}$ that change with the evolving state. The input-dependent step size $\Delta_t = \sigma(\mathrm{Linear}_{\Delta}(\tilde{x}_t))$ enables content-aware updates, and the prior velocity mechanism $v_{t+1}^{\mathrm{prior}} = \mathrm{Proj}_{h_t}(v_t)$ implements geometric forgetting through parallel transport approximation.
  • Extensive experiments on four UAV benchmarks (UAV123, VisDrone, ARDMAV, LaSOT) demonstrate that LF-SSM achieves state-of-the-art performance while enabling practical real-time deployment on resource-constrained platforms. Specifically, LF-SSM-L achieves 73.2% AUC on UAV123, outperforming all SSM-based baselines by significant margins.
The remainder of this article is organized as follows. Section 2 reviews the related work in UAV tracking and state space models. Section 3 introduces the preliminary background. In Section 4, we elaborate on the proposed LF-SSM. Section 5 analyzes the computational efficiency. Section 6 presents our experimental results. Section 7 provides discussion on the theoretical analysis and limitations. Finally, conclusions are drawn in Section 8.

2. Related Work

2.1. UAV Visual Tracking

Siamese network-based methods have dominated UAV tracking due to their balance between accuracy and efficiency. SiamFC [25] pioneers the fully convolutional matching framework for visual tracking. Subsequent works extend this paradigm through region proposal networks [27,28,29] and anchor-free prediction [30,31,32,33]. For UAV-specific challenges, HiFT [34] develops hierarchical feature fusion to handle scale variations, while TCTrack [35] and TCTrack++ [36] address temporal contexts through online template updating. UDAT [37] proposes unsupervised domain adaptation for cross-domain UAV tracking. SGDViT [38] introduces saliency-guided dynamic vision transformers. AVTrack [39] explores adaptive approaches for aerial scenarios. While these methods achieve real-time performance, they are limited by the local receptive fields of convolutional operations.
Transformer-based methods have advanced tracking accuracy through global attention mechanisms. TransT [15] first introduces cross-attention for template-search feature fusion. STARK [14] and MixFormer [40,41] further develop spatio-temporal transformers with learned queries and mixed attention modules. OSTrack [11] proposes a one-stream framework that unifies feature extraction and relation modeling, while SeqTrack [13] reformulates tracking as a sequence generation task. AiATrack [42] introduces attention in attention modules for enhanced feature interaction. SimTrack [43] and MT-Track [44] explore simplified architectures and regularization strategies. ROMTrack [45] develops robust online tracking with memory. ARTrack [46] and ODTrack [47] propose autoregressive and online dense frameworks. VideoTrack [48] explores video-level temporal modeling for robust tracking. EVPTrack [49] introduces explicit visual prompting for adaptive tracking. SPMTrack [50] proposes sparse-to-dense pyramid matching for efficient tracking. HIPTrack [51] develops hierarchical interaction propagation for long-term tracking. These approaches achieve remarkable accuracy on various benchmarks [52,53,54], yet their quadratic computational complexity poses challenges for UAV deployment.
Lightweight tracking methods have been developed to address deployment constraints on edge platforms [55]. TSTrack [56] employs neural architecture search to discover efficient network structures. FEAR [57] and E.T.Track [58] leverage knowledge distillation and efficient training strategies for model compression. HCAT [59] proposes hierarchical cross-attention for efficient tracking. Aba-ViTrack [60] introduces re-parameterization and adaptive techniques. Recent works explore efficient attention mechanisms [61] for further acceleration. Despite these efforts, the trade-off between accuracy and efficiency on UAV platforms remains an open challenge [62,63].

2.2. State Space Models

State space models [64,65] have emerged as an efficient alternative for sequence modeling with linear computational complexity. The theoretical foundation of modern SSMs originates from the HiPPO framework [20], which projects continuous signals onto Legendre polynomial bases for long-range memory. S4 [66] introduces structured parameterization that enables practical deep SSM training, while S4D [67] and S5 [68] further simplify the formulation through diagonal approximation and parallel scan algorithms. DSS [69] proposes alternative parameterizations for improved efficiency. LSSL [70] develops linear state space layers for sequence modeling. H3 [71] and Hyena [72] extend the framework with gating mechanisms and implicit convolutions. These methods demonstrate strong performance on long-range sequence benchmarks [73] but rely on fixed polynomial bases that may not adapt to task-specific data distributions.
Selective state space models introduce input-dependent mechanisms for content-aware modeling. Mamba [19] proposes the selective scan algorithm that dynamically adjusts state transitions based on input features, achieving competitive performance with Transformers. Mamba2 [74] refines the architecture with improved stability through structured state space duality. Jamba [75] combines Mamba with attention mechanisms for hybrid architectures. RWKV [76] explores alternative recurrent designs. However, these methods employ hardware-aware algorithms optimized for high-performance GPUs [77], limiting their effectiveness on edge platforms.
Recent works have extended SSMs to visual domains. Vision Mamba [78] and VMamba [79] adapt the architecture for image understanding through bidirectional and cross-scan strategies. LocalMamba [80] and EfficientVMamba [81] improve local feature extraction and computational efficiency. PlainMamba [82] simplifies the architecture without hierarchical designs. For dense prediction tasks, Mamba-UNet [83], VM-UNet [84], and Swin-UMamba [85] apply SSMs to medical image segmentation. U-Mamba [86] and SegMamba [87] further explore encoder-decoder structures. For visual tracking, MambaVision [88] and MambaVOT [89] explore SSM-based temporal modeling for appearance evolution. Mamba-FETrack [90] proposes frame-event fusion with Mamba. MambaVT [91] introduces vision transformers with Mamba for robust tracking. VideoMamba [92] extends Mamba to video understanding tasks. While these approaches achieve linear complexity, they inherit the HiPPO framework with its fixed basis representation, which may cause information loss for tracking features that exhibit complex geometric structures.
In addition to polynomial-based SSMs, geometric approaches have been explored for sequence modeling. Orthogonal and unitary RNNs [93,94] enforce norm-preserving state transitions to mitigate vanishing gradients in long sequences. Hyperspherical representations [23,24] constrain neural network embeddings to unit spheres, demonstrating improved generalization and robustness. Riemannian optimization techniques [95] enable gradient-based learning on manifold-valued parameters. Our approach differs fundamentally from these works by formulating the entire state evolution process as geodesic motion on Riemannian manifolds. Rather than merely constraining parameters or outputs to manifolds, we replace the HiPPO-derived state transition matrix with geometric operations that provide adaptive local coordinates through tangent space projection.
The connection between geometric constraints and sequence modeling in tracking deserves particular attention. Traditional SSMs process features in Euclidean space without considering the intrinsic structure of visual embeddings. However, tracking features naturally exhibit directional characteristics: the angular relationships between template and search features determine matching quality, while feature magnitudes primarily encode confidence or saliency. This observation motivates our reformulation of state evolution on Riemannian manifolds, where the unit sphere provides a natural geometric structure for directional feature dynamics. Unlike polynomial bases that compress all features onto a fixed coordinate system, the tangent space at each spherical state provides an adaptive local coordinate frame that evolves with the feature trajectory, preserving geometric relationships crucial for robust tracking.

3. Preliminary

3.1. State Space Model

State space models [19,66] map an input sequence $x(t) \in \mathbb{R}^{L}$ to an output sequence $y(t) \in \mathbb{R}^{L}$ through latent states $h(t) \in \mathbb{R}^{N}$ using three key matrices: state matrix $A \in \mathbb{R}^{N \times N}$, input projection $B \in \mathbb{R}^{N \times 1}$, and output projection $C \in \mathbb{R}^{1 \times N}$.
The HiPPO framework [20] provides the theoretical foundation for constructing $A$ by projecting the input history onto Legendre polynomial bases. Using zero-order hold discretization with step size $\Delta$, the continuous dynamics become:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t \tag{1}$$
where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1} (\bar{A} - I) \cdot \Delta B$.
While this formulation achieves linear complexity, it introduces two limitations for UAV tracking deployment. First, the discretization procedure and hardware-aware algorithms require specialized computational kernels optimized for high-performance GPUs. Second, the fixed Legendre polynomial basis compresses all inputs onto the same coordinate system, which may cause information loss when features exhibit task-specific geometric structures [19].
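To make the recurrence in this subsection concrete, the following minimal sketch discretizes a diagonal state matrix with zero-order hold and runs the resulting linear scan. The diagonal $A$, the function names, and the toy values are assumptions chosen for illustration; this is not the HiPPO matrix and not the authors' code.

```python
import math

def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal A with nonzero entries.

    A_bar = exp(delta * a), B_bar = (A_bar - 1) / a * b  (element-wise).
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    b_bar = [(ab - 1.0) / a * bi for ab, a, bi in zip(a_bar, a_diag, b)]
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, xs):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a scalar sequence."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

# Hypothetical toy system: two stable modes driven by a unit impulse.
a_diag = [-1.0, -2.0]
b = [1.0, 1.0]
c = [0.5, 0.5]
a_bar, b_bar = zoh_discretize(a_diag, b, delta=0.1)
ys = ssm_scan(a_bar, b_bar, c, [1.0, 0.0, 0.0])
```

For a diagonal $A$, the matrix exponential and inverse reduce to element-wise operations, which is why no linear-algebra library is needed here. A dense $A$ would require a proper matrix exponential.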

3.2. Riemannian Geometry

A Riemannian manifold $(\mathcal{M}, g)$ is a smooth manifold equipped with a metric $g$ that defines inner products on tangent spaces. For each point $h \in \mathcal{M}$, the tangent space $T_h \mathcal{M}$ forms a local Euclidean approximation of the manifold.
The exponential map $\mathrm{Exp}_h : T_h \mathcal{M} \to \mathcal{M}$ moves a point along the geodesic with initial velocity $v \in T_h \mathcal{M}$. For the unit sphere $S^{N-1} = \{ h \in \mathbb{R}^N : \|h\| = 1 \}$, the exponential map has a closed-form expression:
$$\mathrm{Exp}_h(v) = \cos(\|v\|) \cdot h + \sin(\|v\|) \cdot \frac{v}{\|v\|} \tag{2}$$
The projection of an arbitrary vector $u \in \mathbb{R}^N$ onto the tangent space $T_h S^{N-1}$ is given by $\mathrm{Proj}_h(u) = u - \langle u, h \rangle h$. These operations have been widely adopted in deep learning for optimization on manifolds and hyperspherical representations.
Unlike the fixed Legendre basis in HiPPO, the tangent space $T_h \mathcal{M}$ provides an adaptive local coordinate system that changes with the current state $h$. This property motivates our approach to reformulate state space models on Riemannian manifolds.
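The two sphere operations described above can be sketched in a few lines. The helper names (`project_to_tangent`, `exp_map`) and the example vectors are illustrative assumptions; plain Python lists are used so the snippet runs without dependencies.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def project_to_tangent(h, u):
    """Proj_h(u) = u - <u, h> h, assuming h is a unit vector."""
    s = dot(u, h)
    return [ui - s * hi for ui, hi in zip(u, h)]

def exp_map(h, v):
    """Exp_h(v) = cos(|v|) h + sin(|v|) v / |v| on the unit sphere."""
    n = norm(v)
    if n == 0.0:
        return list(h)  # zero velocity: the point does not move
    return [math.cos(n) * hi + math.sin(n) * vi / n for hi, vi in zip(h, v)]

h = [1.0, 0.0, 0.0]             # a point on the sphere
u = [0.3, 0.4, 0.0]             # an arbitrary ambient vector
v = project_to_tangent(h, u)    # tangent component, orthogonal to h
h_new = exp_map(h, v)           # move along the geodesic; stays on the sphere
```

The projection removes the component of `u` along `h`, and the exponential map then rotates `h` toward `v` by an angle equal to the norm of `v`, so `h_new` has unit norm.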

4. Methodology

4.1. Overall Architecture

In this section, we present LF-SSM, a lightweight state space model for UAV tracking that addresses the computational overhead and information loss issues in existing SSM-based approaches.
We propose the Geodesic State Module (GSM) that reformulates state evolution on Riemannian manifolds for efficient token sequence modeling. GSM performs state updates through geodesic motion on the unit sphere, eliminating the need for HiPPO-derived N × N state transition matrices and complex discretization procedures. The tangent space at each state provides an adaptive local coordinate system, which avoids the information loss caused by fixed basis compression.
The overall architecture is illustrated in Figure 2. The template and search images first undergo patch embedding to generate token sequences. These tokens are concatenated to form a unified spatial sequence and processed through stacked GSM Blocks, which model long-range dependencies among all tokens. Finally, the enhanced features are fed into a prediction head to generate the target bounding box.

4.2. Geodesic State Evolution

Traditional state space models update the hidden state through linear recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, where $\bar{A} \in \mathbb{R}^{N \times N}$ is derived from the HiPPO framework with a fixed Legendre polynomial basis. We propose to reformulate state evolution on Riemannian manifolds, where the state space is defined on the unit sphere:
$$S^{N-1} = \{ h \in \mathbb{R}^N : \|h\| = 1 \} \tag{3}$$
where $N$ denotes the state dimension and the superscript $N-1$ indicates that the unit sphere is an $(N-1)$-dimensional manifold embedded in $\mathbb{R}^N$.
Given the current state $h_{t-1} \in S^{N-1}$ and input token $x_t$, the input token is first projected into the state space to obtain $u_t \in \mathbb{R}^N$. The projected input is then mapped onto the tangent space $T_{h_{t-1}} S^{N-1}$ through orthogonal projection:
$$v_t = \mathrm{Proj}_{h_{t-1}}(u_t) = u_t - \langle u_t, h_{t-1} \rangle h_{t-1} \tag{4}$$
where $\mathrm{Proj}_h(\cdot)$ denotes the projection operator onto the tangent space at point $h$, the angle brackets $\langle \cdot, \cdot \rangle$ represent the Euclidean inner product, and the resulting tangent vector $v_t$ satisfies the orthogonality condition $\langle v_t, h_{t-1} \rangle = 0$. The input-dependent step size $\Delta_t \in \mathbb{R}^{+}$ controls the magnitude of state update. The new state is obtained by moving along the geodesic on $S^{N-1}$:
$$h_t = \cos(\Delta_t \|v_t\|) \cdot h_{t-1} + \sin(\Delta_t \|v_t\|) \cdot \frac{v_t}{\|v_t\| + \epsilon} \tag{5}$$
where $\| \cdot \|$ denotes the Euclidean norm, the product $\Delta_t \|v_t\|$ represents the arc length traveled along the geodesic, and $\epsilon = 10^{-6}$ is a small constant for numerical stability when $\|v_t\| \to 0$. The output $y_t = C h_t$ is computed through the output projection matrix $C \in \mathbb{R}^{D \times N}$ that maps the $N$-dimensional state to the $D$-dimensional output space.
The above geodesic formulation constrains the state vector to evolve on the unit sphere, which maintains $\|h_t\| = 1$ throughout the sequence. This design choice reflects a deliberate factorization of the feature representation into directional and magnitude components based on polar decomposition. In Euclidean space, any non-zero vector $x \in \mathbb{R}^N$ can be uniquely written as $x = r \cdot \hat{x}$, where the scalar $r = \|x\|$ captures the magnitude and the unit vector $\hat{x} = x / \|x\|$ on the sphere $S^{N-1}$ captures the direction. In visual tracking, these two components encode fundamentally different types of information. The directional component $\hat{x}$ represents the geometric structure and relative spatial relationships of target appearance features, while the magnitude component $r$ reflects confidence, salience, or attention weights. The geodesic state evolution operates on the directional component through the spherical state $h_t \in S^{N-1}$, which encodes the geometric structure of appearance features as they evolve over the token sequence.
The spherical constraint provides important stability properties that help address common issues in unconstrained Euclidean formulations. In traditional state space models, the discrete dynamics $h_t = A h_{t-1} + B x_t$ involve a transition matrix $A \in \mathbb{R}^{N \times N}$ whose spectral properties determine long-term behavior. When the largest eigenvalue satisfies $|\lambda_{\max}(A)| > 1$, the state magnitude grows exponentially over time, which eventually causes numerical overflow and gradient explosion during backpropagation. Conversely, when $|\lambda_{\max}(A)| < 1$, states decay exponentially toward zero, which leads to vanishing gradients and loss of long-range information. The ideal regime $|\lambda_{\max}(A)| = 1$ provides stable dynamics but remains structurally fragile. Small perturbations to the matrix entries during training can push eigenvalues away from the unit circle and trigger instability.
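The three spectral regimes can be seen in the simplest possible case, a scalar recurrence with zero input. The function name and the values 1.1, 0.9, and 1.0 are illustrative assumptions only.

```python
def iterate(a, steps, h0=1.0):
    """Apply the scalar recurrence h_t = a * h_{t-1} (zero input) for `steps` steps."""
    h = h0
    for _ in range(steps):
        h = a * h
    return h

grow = iterate(1.1, 200)    # |a| > 1: the state magnitude explodes
decay = iterate(0.9, 200)   # |a| < 1: the state magnitude vanishes
hold = iterate(1.0, 200)    # |a| = 1: the state magnitude is preserved
```

After 200 steps a 10% deviation of the eigenvalue from 1 in either direction changes the state magnitude by many orders of magnitude, which is the fragility the paragraph above describes.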
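Our geodesic formulation significantly mitigates this spectral sensitivity. The norm preservation property $\|h_t\| = 1$ holds by geometric construction, without any constraints on learned parameters. To verify this, observe that the tangent vector $v_t$ produced by projection in Equation (4) satisfies the orthogonality condition $\langle v_t, h_{t-1} \rangle = 0$ by definition of the tangent space. Consequently, the geodesic update in Equation (5), taken with $\epsilon = 0$, yields
$$\begin{aligned} \|h_t\|^2 &= \left\| \cos(\Delta_t \|v_t\|) \cdot h_{t-1} + \sin(\Delta_t \|v_t\|) \cdot \frac{v_t}{\|v_t\|} \right\|^2 \\ &= \cos^2(\Delta_t \|v_t\|) \, \|h_{t-1}\|^2 + \sin^2(\Delta_t \|v_t\|) \left\| \frac{v_t}{\|v_t\|} \right\|^2 + 2 \cos(\Delta_t \|v_t\|) \sin(\Delta_t \|v_t\|) \left\langle h_{t-1}, \frac{v_t}{\|v_t\|} \right\rangle \\ &= \cos^2(\Delta_t \|v_t\|) + \sin^2(\Delta_t \|v_t\|) + 0 = 1 \end{aligned} \tag{6}$$
where the first line substitutes the geodesic update formula, the second line expands the squared norm using the identity $\|a + b\|^2 = \|a\|^2 + \|b\|^2 + 2 \langle a, b \rangle$, and the third line uses $\|h_{t-1}\| = 1$ from the inductive hypothesis, $\| v_t / \|v_t\| \| = 1$ by normalization, and the cross term vanishes because $\langle h_{t-1}, v_t \rangle = 0$ from Equation (4). With $\epsilon = 0$, this equality holds to machine precision for any step size $\Delta_t > 0$ and any nonzero tangent vector $v_t$, which provides robust stability without the need for eigenvalue tuning or spectral normalization. With the regularized denominator of Equation (5), the same expansion gives $\|h_t\|^2 = 1 - \sin^2(\Delta_t \|v_t\|) \left[ 1 - \left( \|v_t\| / (\|v_t\| + \epsilon) \right)^2 \right]$, so the norm is preserved up to a small $\epsilon$-dependent deviation.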
This formulation eliminates the HiPPO-derived $N \times N$ state transition matrix $A$ and the associated discretization procedure. The tangent space $T_{h_{t-1}} S^{N-1}$ provides an adaptive local coordinate system that changes with the current state, avoiding information loss from fixed basis compression. The geodesic state evolution is summarized in Algorithm 1.
The prior velocity mechanism provides temporal continuity across state updates and implements geometry-aware forgetting of outdated information. At each time step, the prior velocity $v_t^{\mathrm{prior}} \in T_{h_{t-1}} S^{N-1}$ carries directional momentum from previous updates. When the state moves from $h_{t-1}$ to $h_t$, this prior velocity is transported to the tangent space at the new location through the projection
$$v_{t+1}^{\mathrm{prior}} = \mathrm{Proj}_{h_t}(v_t) = v_t - \langle v_t, h_t \rangle h_t \tag{7}$$
where $\mathrm{Proj}_{h_t}(\cdot)$ denotes orthogonal projection onto the tangent space $T_{h_t} S^{N-1}$, and the inner product $\langle v_t, h_t \rangle$ measures the alignment between the velocity and the new state direction. This operation removes from the old velocity its component along the new state direction while preserving the orthogonal remainder. When the tracking target undergoes rapid appearance changes, the state trajectory curves sharply and the projection removes a substantial component of the prior velocity. This effectively forgets outdated directional information that is no longer relevant. Conversely, when the target appearance evolves smoothly, the trajectory follows a nearly geodesic path with small curvature, and most of the prior velocity survives projection to maintain temporal consistency.
Algorithm 1 Geodesic state evolution
Require: Input sequence $\{ x_1, \ldots, x_T \}$, initial state $h_0 \in S^{N-1}$, stability constant $\epsilon$, confidence weight $\alpha \in [0, 1]$
Ensure: Output sequence $\{ y_1, \ldots, y_T \}$
   $v_1^{\mathrm{prior}} \leftarrow 0$ {Initialize prior velocity as zero vector}
   for $t = 1, \ldots, T$ do
       $u_t \leftarrow B x_t$ {Project input to state space}
       $w_t \leftarrow u_t + \alpha \, v_t^{\mathrm{prior}}$ {Combine input with prior momentum}
       $v_t \leftarrow w_t - \langle w_t, h_{t-1} \rangle h_{t-1}$ {Project to tangent space}
       $\Delta_t \leftarrow \sigma(W_{\Delta} x_t)$ {Compute input-dependent step size}
       $h_t \leftarrow \cos(\Delta_t \|v_t\|) \cdot h_{t-1} + \sin(\Delta_t \|v_t\|) \cdot v_t / (\|v_t\| + \epsilon)$ {Geodesic update}
       $v_{t+1}^{\mathrm{prior}} \leftarrow v_t - \langle v_t, h_t \rangle h_t$ {Transport prior to new tangent space}
       $y_t \leftarrow C h_t$ {Output projection}
   end for
   return $\{ y_1, \ldots, y_T \}$
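The steps of Algorithm 1 translate almost line by line into a sequential loop. The sketch below is a dependency-free illustration under stated assumptions: the function name `geodesic_scan`, the toy matrices, and the default `alpha=0.5` are hypothetical, the step size uses softplus as the activation $\sigma$ (as stated for the step size in Section 4.3), and no claim is made that this matches the authors' implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, x):
    return [dot(row, x) for row in m]

def geodesic_scan(xs, B, C, w_delta, h0, alpha=0.5, eps=1e-6):
    """Sequential geodesic state evolution following the steps of Algorithm 1."""
    h = list(h0)
    v_prior = [0.0] * len(h)                   # prior velocity starts at zero
    ys, hs = [], []
    for x in xs:
        u = matvec(B, x)                       # project input to state space
        w = [ui + alpha * p for ui, p in zip(u, v_prior)]   # add prior momentum
        s = dot(w, h)
        v = [wi - s * hi for wi, hi in zip(w, h)]           # tangent projection
        delta = math.log1p(math.exp(dot(w_delta, x)))       # softplus step size
        n = math.sqrt(dot(v, v))
        theta = delta * n                      # arc length of this step
        h = [math.cos(theta) * hi + math.sin(theta) * vi / (n + eps)
             for hi, vi in zip(h, v)]          # geodesic update
        s = dot(v, h)
        v_prior = [vi - s * hi for vi, hi in zip(v, h)]     # transport prior
        hs.append(h)
        ys.append(matvec(C, h))                # output projection
    return ys, hs

# Hypothetical toy setup: 3-dim inputs, state dimension N = 4, 2-dim outputs.
B = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1], [0.1, 0.0, -0.2], [-0.3, 0.1, 0.2]]
C = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_delta = [0.1, 0.1, 0.1]
xs = [[1.0, 0.5, -0.5], [0.0, 1.0, 1.0], [-1.0, 0.2, 0.3]]
ys, hs = geodesic_scan(xs, B, C, w_delta, h0=[1.0, 0.0, 0.0, 0.0])
```

Every operation inside the loop is an inner product, a scalar multiple, or a trigonometric function of a scalar, so the per-token cost is linear in the state dimension once the input projection is computed. The states in `hs` stay on the unit sphere up to the small deviation introduced by `eps`.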
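The forgetting rate can be quantified in terms of the geodesic step size. Let $\theta_t = \Delta_t \|v_t\|$ denote the arc length traveled in one update step, which equals the geodesic distance between successive states $h_{t-1}$ and $h_t$ on the unit sphere. Substituting the geodesic update (neglecting $\epsilon$) and using $\langle v_t, h_{t-1} \rangle = 0$ gives $\langle v_t, h_t \rangle = \|v_t\| \sin(\theta_t)$, so the component removed by the transport has magnitude $\|v_t\| \, |\sin(\theta_t)|$. The magnitude of the transported prior then follows from
$$\| v_{t+1}^{\mathrm{prior}} \|^2 = \| v_t - \langle v_t, h_t \rangle h_t \|^2 = \|v_t\|^2 - \langle v_t, h_t \rangle^2 \tag{8}$$
where we use $\|h_t\| = 1$ and the expansion $\| a - \langle a, b \rangle b \|^2 = \|a\|^2 - \langle a, b \rangle^2$ for unit vector $b$, which yields $\| v_{t+1}^{\mathrm{prior}} \| = \|v_t\| \, |\cos(\theta_t)|$. For small steps where $\theta_t \ll 1$, the Taylor expansion $\sin(\theta_t) \approx \theta_t$ shows that the removed component is approximately $\|v_t\| \theta_t$, which indicates that the amount of prior momentum discarded grows in proportion to the step size while most of the prior survives. Over multiple steps, this produces gradual attenuation of historical momentum and implements smooth forgetting that avoids the abrupt information loss associated with aggressive eigenvalue decay in traditional SSM formulations.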
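The norm identity used above can be checked numerically for a single step. In this sketch the state, tangent vector, and step size are hypothetical values, and $\epsilon$ is omitted so that the identities hold exactly. It confirms that the component of $v_t$ along the new state equals $\|v_t\| \sin(\theta_t)$ and that the squared norm of the transported prior equals $\|v_t\|^2 - \langle v_t, h_t \rangle^2$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

h_prev = [1.0, 0.0, 0.0]     # current state on the sphere
v = [0.0, 0.6, 0.8]          # tangent vector at h_prev (orthogonal to it)
delta = 0.3                  # hypothetical step size
n = math.sqrt(dot(v, v))
theta = delta * n            # arc length of the step

# Geodesic update with epsilon omitted so the identities are exact.
h = [math.cos(theta) * hp + math.sin(theta) * vi / n
     for hp, vi in zip(h_prev, v)]

removed = dot(v, h)          # component of v along the new state h
prior = [vi - removed * hi for vi, hi in zip(v, h)]
prior_norm = math.sqrt(dot(prior, prior))
```

Here `removed` grows with the step angle `theta` while `prior_norm` shrinks, so larger steps discard more of the previous velocity.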

4.3. Geodesic State Module

Based on the geodesic state evolution theory, we design the Geodesic State Module (GSM) that serves as the core component of LF-SSM. The GSM replaces the selective scan mechanism in Mamba with geodesic dynamics on the unit sphere, where state updates are performed through tangent space projection and exponential mapping using only standard vector operations.
The GSM processes an input feature sequence $X \in \mathbb{R}^{T \times D}$, where $T$ is the number of tokens and $D$ denotes the feature dimension. As shown in Figure 3, the module first applies a linear layer to expand the input dimension to $E$, followed by a 1D convolution for local feature extraction. The expanded features $\tilde{X} \in \mathbb{R}^{T \times E}$ are then processed through two parallel branches.
The state evolution branch computes geodesic updates on the unit sphere following the formulation in Section 4.2. For each token $\tilde{x}_t \in \mathbb{R}^E$, we generate input-dependent projection parameters to enable content-aware state updates. The projected input $u_t$, step size $\Delta_t$, and initial state $h_0$ are computed as
$$u_t = \mathrm{Linear}_B(\tilde{x}_t), \qquad \Delta_t = \sigma(\mathrm{Linear}_{\Delta}(\tilde{x}_t)), \qquad h_0 = \frac{\hat{h}_0}{\|\hat{h}_0\| + \epsilon} \tag{9}$$
where $\mathrm{Linear}_B : \mathbb{R}^E \to \mathbb{R}^N$ projects the $E$-dimensional expanded feature into the $N$-dimensional state space, $\mathrm{Linear}_{\Delta} : \mathbb{R}^E \to \mathbb{R}^1$ computes a scalar step size, $\sigma(\cdot)$ denotes the softplus activation that ensures positivity, $\hat{h}_0 \in \mathbb{R}^N$ is a learnable parameter, and the division by $\|\hat{h}_0\| + \epsilon$ normalizes it onto the unit sphere. The geodesic state evolution then proceeds sequentially according to Algorithm 1, which involves only vector operations including inner product, scalar multiplication, and trigonometric functions.
The gating branch implements the magnitude component of the factorized representation introduced in Section 4.2. While the state evolution branch processes directional information through the spherical state $h_t \in S^{N-1}$, the gating branch handles magnitude-related attributes through a separate pathway. This branch operates on the expanded features $\tilde{X}$, which retain complete magnitude information from the input tokens. The gating signal is computed as
$$G = \mathrm{SiLU}(\mathrm{Linear}_G(\tilde{X})) \tag{10}$$
where $\mathrm{Linear}_G : \mathbb{R}^E \to \mathbb{R}^E$ denotes a learnable linear transformation applied to each token independently, and $\mathrm{SiLU}(\cdot)$ is the Sigmoid Linear Unit activation defined as $\mathrm{SiLU}(x) = x \cdot \sigma(x)$, with $\sigma(\cdot)$ here being the sigmoid function rather than the softplus used for the step size. This activation allows both amplification when $x > 0$ and suppression when $x < 0$ of feature magnitudes depending on the learned weights.
The final output recombines the directional and magnitude components through element-wise multiplication
$$Y = \mathrm{Linear}_O(G \odot (C H)) \tag{11}$$
where $H = [h_1, h_2, \ldots, h_T] \in \mathbb{R}^{T \times N}$ collects the spherical states across all $T$ time steps with each row $h_t \in S^{N-1}$ being an $N$-dimensional unit vector, the matrix $C \in \mathbb{R}^{E \times N}$ projects each $N$-dimensional directional state to the $E$-dimensional space (that is, $C$ is applied to each row $h_t$), the symbol $\odot$ denotes element-wise multiplication that modulates the directional features $C H$ by the learned magnitude weights $G$, and $\mathrm{Linear}_O : \mathbb{R}^E \to \mathbb{R}^D$ produces the final $D$-dimensional output. This design follows the polar decomposition principle where directional and magnitude information are processed separately and recombined in the output. Since the gating branch can learn arbitrary scaling functions through the parameters in $\mathrm{Linear}_G$, this factorized architecture achieves equivalent representational capacity to unconstrained state-output mappings while preserving the stability benefits of bounded state dynamics on the unit sphere.
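The recombination of the two branches can be sketched per token as follows. This is a minimal illustration, assuming the spherical states have already been computed by the state evolution branch. The function name `gsm_output`, the weight names `W_G` and `W_O` (standing in for the linear layers, without biases), and the toy dimensions are hypothetical.

```python
import math

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def silu(z):
    """SiLU(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def gsm_output(x_exp, h_states, W_G, C, W_O):
    """Per token: y_t = W_O (SiLU(W_G x_t) * (C h_t)), with an element-wise product."""
    ys = []
    for x_t, h_t in zip(x_exp, h_states):
        gate = [silu(z) for z in matvec(W_G, x_t)]   # magnitude branch, in R^E
        direction = matvec(C, h_t)                   # directional branch, in R^E
        mixed = [g * d for g, d in zip(gate, direction)]
        ys.append(matvec(W_O, mixed))                # project back to R^D
    return ys

# Hypothetical toy sizes: E = 2, N = 3, D = 1, one token.
x_exp = [[1.0, -1.0]]
h_states = [[1.0, 0.0, 0.0]]
W_G = [[1.0, 0.0], [0.0, 1.0]]
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_O = [[1.0, 1.0]]
ys = gsm_output(x_exp, h_states, W_G, C, W_O)
```

Because the state `h_t` has unit norm, the scale of each output channel is set by the gate alone, which is the separation of direction and magnitude described above.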

4.4. GSM Block

The GSM Block integrates the Geodesic State Module, normalization, and residual connections to form a complete building block for the tracking network. The structure is illustrated in Figure 2.
Given input tokens $Z \in \mathbb{R}^{T \times D}$, the GSM Block first applies layer normalization, then processes features through bidirectional GSM and combines the result with the input via residual connection:
$$Z' = \mathrm{LayerNorm}(Z), \quad Z_{out} = Z + \mathrm{Linear}([\mathrm{GSM}_{fwd}(Z'); \mathrm{GSM}_{bwd}(Z')])$$
where $[\cdot\,;\cdot]$ denotes concatenation along the feature dimension. The bidirectional design applies GSM in both forward and backward directions along the token sequence, enabling each token to aggregate information from both preceding and subsequent tokens. This facilitates comprehensive feature interaction between template and search regions within the unified spatial sequence.
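The wiring of the block (pre-normalization, forward and backward scans, concatenation, fusion, residual) can be sketched as below. This is only a structural illustration: `running_mean_scan` is a placeholder standing in for a GSM, `average_halves` stands in for the learned Linear layer, the layer normalization has no learnable affine parameters, and realizing the backward direction by reversing the sequence before and after the scan is an assumption.

```python
import math

def layer_norm(z, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no affine parameters)."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    return [(v - mean) / math.sqrt(var + eps) for v in z]

def running_mean_scan(seq):
    """Placeholder for a GSM: causal running mean over the token sequence."""
    out, acc = [], [0.0] * len(seq[0])
    for t, z in enumerate(seq, start=1):
        acc = [a + v for a, v in zip(acc, z)]
        out.append([a / t for a in acc])
    return out

def average_halves(v):
    """Placeholder for Linear: R^{2D} -> R^D, averaging the two halves."""
    d = len(v) // 2
    return [0.5 * (v[i] + v[i + d]) for i in range(d)]

def gsm_block(tokens, scan_fwd, scan_bwd, fuse):
    normed = [layer_norm(z) for z in tokens]
    fwd = scan_fwd(normed)                           # left-to-right pass
    bwd = scan_bwd(normed[::-1])[::-1]               # right-to-left pass, re-aligned
    fused = [fuse(f + b) for f, b in zip(fwd, bwd)]  # list concat = feature concat
    return [[zi + ui for zi, ui in zip(z, u)] for z, u in zip(tokens, fused)]

tokens = [[1.0, 2.0, 4.0], [0.0, -1.0, 3.0], [2.0, 2.0, -1.0], [5.0, 0.0, 1.0]]
out = gsm_block(tokens, running_mean_scan, running_mean_scan, average_halves)
print(len(out), len(out[0]))
```

Because the residual path adds the unnormalized input, a fusion layer that outputs zeros makes the block an identity map, which is the usual motivation for this pre-norm arrangement.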
Multiple GSM Blocks are stacked to form the backbone of LF-SSM. The number of blocks L is a hyperparameter that balances model capacity and computational efficiency.

4.5. Training and Inference

4.5.1. Training

During training, the template and search images are processed through patch embedding to generate token sequences. The template tokens $Z_t \in \mathbb{R}^{T_t \times D}$ and search tokens $Z_s \in \mathbb{R}^{T_s \times D}$ are concatenated to form a unified spatial sequence $Z = [Z_t; Z_s]$ and fed into the stacked GSM Blocks.
The enhanced search tokens are extracted from the output and fed into the prediction head, which consists of a classification branch for foreground-background prediction and a regression branch for bounding box estimation. The training loss combines classification loss and regression loss:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{iou}\mathcal{L}_{iou} + \lambda_{L1}\mathcal{L}_{L1}$$
where $\mathcal{L}_{cls}$ is the focal loss for classification, $\mathcal{L}_{iou}$ is the GIoU loss for bounding box regression, and $\mathcal{L}_{L1}$ is the L1 loss for coordinate regression. The hyperparameters $\lambda_{iou}$ and $\lambda_{L1}$ balance the loss terms. The complete training procedure is summarized in Algorithm 2.
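The combined objective can be illustrated for a single prediction as follows. This is a hedged sketch: the binary focal loss form with $\gamma = 2$ and $\alpha = 0.25$, the corner box format `(x1, y1, x2, y2)`, the averaging of the L1 term over four coordinates, and the weights `lam_iou=2.0`, `lam_l1=5.0` are all assumptions chosen for illustration, since the text does not specify them.

```python
import math

def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2) with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted foreground probability p."""
    p_t = p if target == 1 else 1.0 - p
    a_t = alpha if target == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def tracking_loss(p, target, box_pred, box_gt, lam_iou=2.0, lam_l1=5.0):
    """L = L_cls + lam_iou * L_iou + lam_l1 * L_L1 (weights are placeholders)."""
    l_cls = focal_loss(p, target)
    l_iou = 1.0 - giou(box_pred, box_gt)
    l_l1 = sum(abs(u - v) for u, v in zip(box_pred, box_gt)) / 4.0
    return l_cls + lam_iou * l_iou + lam_l1 * l_l1

gt = (0.2, 0.2, 0.6, 0.6)
pred = (0.25, 0.2, 0.65, 0.6)
print(tracking_loss(0.8, 1, pred, gt))
```

Unlike plain IoU, the GIoU term stays informative for non-overlapping boxes because the enclosing-box penalty still varies with their separation.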

4.5.2. Inference

During inference, the template is encoded once and cached for subsequent frames. For each search frame, the search tokens are concatenated with the cached template tokens and processed through GSM Blocks. The prediction head outputs the target bounding box directly without post-processing such as window penalty or cosine window, which further improves inference efficiency. The complete inference process is summarized in Algorithm 3.
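Algorithm 2 LF-SSM Training
Require: Training set $\mathcal{D}$, learning rate $\eta$, epochs $E$, loss weights $\lambda_{iou}$, $\lambda_{L1}$
Ensure: Trained model parameters $\Theta$
   Initialize model parameters $\Theta$
   for $e = 1, \ldots, E$ do
      for each batch $(I_t, I_s, B_{gt})$ in $\mathcal{D}$ do
         $Z_t \leftarrow \mathrm{PatchEmbed}(I_t)$, $Z_s \leftarrow \mathrm{PatchEmbed}(I_s)$
         $Z_{out} \leftarrow \mathrm{GSM\text{-}Blocks}([Z_t; Z_s])$
         $B_{pred}, C_{pred} \leftarrow \mathrm{PredHead}(Z_{out})$
         $\mathcal{L} \leftarrow \mathcal{L}_{cls} + \lambda_{iou}\mathcal{L}_{iou} + \lambda_{L1}\mathcal{L}_{L1}$
         $\Theta \leftarrow \Theta - \eta \nabla_{\Theta}\mathcal{L}$
      end for
   end for
   return $\Theta$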
Algorithm 3 LF-SSM Inference
Require: Template image $I_t$, search images $\{I_s^1, \ldots, I_s^K\}$
Ensure: Bounding boxes $\{B_1, \ldots, B_K\}$
   $Z_t \leftarrow \mathrm{PatchEmbed}(I_t)$ {Encode once and cache}
   for $k = 1, \ldots, K$ do
      $Z_s \leftarrow \mathrm{PatchEmbed}(I_s^k)$
      $Z_{out} \leftarrow \mathrm{GSM\text{-}Blocks}([Z_t; Z_s])$
      $B_k \leftarrow \mathrm{PredHead}(Z_{out})$
   end for
   return $\{B_1, \ldots, B_K\}$
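The control flow of Algorithm 3 can be sketched with stand-in components. The functions `toy_patch_embed`, `toy_blocks`, and `toy_head` are hypothetical placeholders for the real modules, and slicing off the trailing search tokens before the prediction head is an assumption based on the description of the training pipeline.

```python
def track_sequence(template_image, search_images, patch_embed, gsm_blocks, pred_head):
    """Template tokens are computed once and reused for every search frame."""
    z_t = patch_embed(template_image)
    boxes = []
    for image in search_images:
        z_s = patch_embed(image)
        z_out = gsm_blocks(z_t + z_s)               # concatenate along the token axis
        boxes.append(pred_head(z_out[len(z_t):]))   # search tokens only (assumption)
    return boxes

calls = {"patch_embed": 0}

def toy_patch_embed(image):
    """Stand-in embedding: one 1-D token per 'pixel', with a call counter."""
    calls["patch_embed"] += 1
    return [[float(px)] for px in image]

def toy_blocks(tokens):
    """Stand-in backbone: identity."""
    return tokens

def toy_head(tokens):
    """Stand-in head: a 2 x 2 box centered on the mean token value."""
    c = sum(t[0] for t in tokens) / len(tokens)
    return (c - 1.0, c - 1.0, c + 1.0, c + 1.0)

boxes = track_sequence([9, 9], [[1, 3], [5, 7], [2, 2]],
                       toy_patch_embed, toy_blocks, toy_head)
print(boxes, calls["patch_embed"])
```

For $K$ search frames the embedding runs $K + 1$ times rather than $2K$ times, which is the saving provided by caching the template tokens.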

5. Efficiency Analysis

In this section, we analyze the computational efficiency of LF-SSM compared to existing SSM-based methods.

5.1. Time Complexity

For a sequence of $T$ tokens with expanded dimension $E$ and state dimension $N$, we analyze the time complexity of geodesic state evolution. At each token position $t$, the computational cost consists of input projection $u_t = \mathrm{Linear}_B(\tilde{x}_t)$ with $O(EN)$, tangent space projection with $O(N)$, exponential map computation with $O(N)$, and output projection with $O(EN)$. The total time complexity per GSM is:
$$T_{GSM} = O(TEN + TN) = O(TEN)$$
This achieves linear complexity with respect to sequence length $T$, consistent with modern SSM methods. In contrast, Transformer-based methods require $O(T^2 D)$ for self-attention computation.
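The difference in scaling can be made concrete with leading-order operation counts. The sketch below is illustrative only: the constant factors, the token count $T = 320$ (which assumes a patch size of 16 for the 128 and 256 pixel crops), and the widths $D = 256$ and $E = 512$ are assumptions; only $N = 64$ is taken from the experimental setup.

```python
def gsm_cost(T, E, N):
    """Rough multiply-add count for one GSM pass.

    Two E x N projections per token plus a few O(N) sphere operations;
    the constants 2 and 4 are illustrative assumptions.
    """
    return T * (2 * E * N + 4 * N)

def attention_cost(T, D):
    """Rough multiply-add count for QK^T and the attention-weighted sum of V."""
    return 2 * T * T * D

T, D, E, N = 320, 256, 512, 64
for scale in (1, 2, 4):
    tokens = T * scale
    print(tokens, gsm_cost(tokens, E, N), attention_cost(tokens, D))
```

Doubling the number of tokens doubles the GSM count but quadruples the attention count, so the gap widens with the search region size.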

5.2. Space Complexity

The space complexity of LF-SSM consists of linear projection parameters with $O(EN)$ and state vector $h_t \in S^{N-1}$ with $O(N)$. The total space complexity is:
$$S_{total} = O(EN)$$
This is comparable to modern SSM methods such as S4D and Mamba that use diagonal state matrices with $O(N)$ complexity. Compared to Transformer-based methods, we avoid the $O(T^2)$ memory requirement for attention matrices.

5.3. Computational Operations

A key efficiency advantage of LF-SSM is the elimination of HiPPO-related computational overhead. Although modern SSMs such as S4D and Mamba have simplified the state transition matrix to diagonal form, they still require computing the discretized matrix $\bar{A} = \exp(\Delta A)$ and the associated input transformation $\bar{B}$. Mamba further requires selective scan kernels optimized for GPU memory hierarchies, which are difficult to deploy efficiently on edge platforms due to their hardware-specific memory access patterns.
In contrast, our geodesic state evolution bypasses the discretization procedure entirely:
$$h_t = \cos(\Delta_t \|v_t\|) \cdot h_{t-1} + \sin(\Delta_t \|v_t\|) \cdot \frac{v_t}{\|v_t\| + \epsilon}$$
This formulation involves only the inner product $\langle u_t, h_{t-1} \rangle$ with $O(N)$, vector subtraction and norm computation with $O(N)$, scalar trigonometric functions with $O(1)$, and scalar-vector multiplication with $O(N)$. These are standard operations efficiently supported across diverse hardware platforms including edge devices, without requiring specialized computational kernels or hardware-specific optimizations. The elimination of discretization procedures and selective scan kernels enables efficient deployment on resource-constrained UAV platforms.
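A single geodesic update built from exactly these operations can be sketched as follows. The sketch assumes the tangent velocity is obtained by projecting the input $u_t$ onto the tangent space at $h_{t-1}$ and omits the prior velocity term and the input-dependent step size; the function name `geodesic_step` and the sample inputs are hypothetical.

```python
import math

def geodesic_step(h_prev, u, delta, eps=1e-6):
    """One state update on the unit sphere using only O(N) vector operations."""
    inner = sum(a * b for a, b in zip(u, h_prev))   # <u_t, h_{t-1}>
    v = [a - inner * b for a, b in zip(u, h_prev)]  # tangent space projection
    v_norm = math.sqrt(sum(a * a for a in v))
    theta = delta * v_norm                          # arc length of the step
    c, s = math.cos(theta), math.sin(theta)
    return [c * b + s * a / (v_norm + eps) for a, b in zip(v, h_prev)]

h = [1.0, 0.0, 0.0]
inputs = [[0.3, 1.0, -0.5], [0.0, 0.2, 0.9], [-1.0, 0.4, 0.1]]
for u in inputs:
    h = geodesic_step(h, u, delta=0.1)
norm = math.sqrt(sum(a * a for a in h))
print(norm)
```

With $\epsilon > 0$ the norm is preserved up to a correction of order $\epsilon \sin^2(\theta)/\|v_t\|$, which is negligible for $\epsilon = 10^{-6}$ and guards against division by zero when the input is parallel to the current state.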
While the geodesic formulation and diagonal SSMs both achieve $O(TEN)$ asymptotic complexity, the constant factors differ due to specific operations involved. The geodesic update requires approximately twice the arithmetic operations compared to diagonal SSMs when the expanded dimension $E$ is comparable to the input dimension $D$, primarily due to the additional tangent space projection and exponential map computations. However, this arithmetic overhead does not directly translate to proportional wall-clock time increase due to two factors. First, the trigonometric functions $\cos(\cdot)$ and $\sin(\cdot)$ are computed efficiently through hardware-accelerated Special Function Units (SFUs) on modern GPUs, executing in approximately 4–8 clock cycles comparable to floating-point multiplications. Second, the geodesic formulation exhibits favorable memory access patterns where all operations are element-wise or involve matrix-vector products that can be efficiently parallelized, whereas the selective scan in Mamba requires sequential dependencies that limit parallelization and necessitate sophisticated kernel implementations.
The practical efficiency boundary depends on the availability of optimized computational kernels for specific hardware platforms. On edge platforms where highly tuned selective scan implementations are unavailable, the geodesic formulation achieves higher throughput despite the additional arithmetic operations, because it relies entirely on standard operations that are well-supported across all hardware architectures. On high-end GPUs with optimized selective scan kernels, the efficiency advantage of geodesic formulation diminishes for larger state dimensions where the constant factor overhead becomes more pronounced. For moderate state dimensions $N \le 128$ that are typical in visual tracking applications, the geodesic formulation maintains competitive or superior practical efficiency while providing the geometric benefits of adaptive tangent space representation.

6. Experiments and Analysis

In this section, we conduct extensive experiments to assess the performance, efficiency, and robustness of LF-SSM compared to SOTA methods. We first describe the experimental setup in Section 6.1, then present quantitative comparisons with state-of-the-art trackers in Section 6.2. Qualitative evaluation and ablation studies are provided in Section 6.3 and Section 6.4, respectively.

6.1. Experimental Setup

6.1.1. Datasets

We conduct experiments on four widely-used tracking benchmarks, including UAV123 [7], VisDrone [96], ARDMAV [97], and LaSOT [52]. These datasets serve dual purposes: their training splits are used for model training, while their test splits are used for evaluation. UAV123 is a large-scale aerial tracking dataset consisting of 123 video sequences with over 112K frames, covering diverse scenarios including urban areas, roads, and natural environments. VisDrone comprises 96 sequences captured by various drone platforms under different weather and lighting conditions, presenting challenges in scale variation and occlusion. ARDMAV contains 60 sequences focusing on aggressive drone racing and maneuvering scenarios with rapid motion and viewpoint changes. LaSOT is a large-scale long-term tracking benchmark with 1120 training sequences and 280 test sequences averaging over 2500 frames per video, providing comprehensive evaluation of long-term tracking capability. This training configuration ensures that the model learns from UAV-specific visual characteristics including aerial viewpoints, small target appearances, and platform-induced motion patterns.

6.1.2. Evaluation Metrics

We adopt standard metrics following the tracking evaluation protocols. For UAV123, VisDrone, and ARDMAV, we report precision (P) and success rate (AUC). Precision measures the percentage of frames where the center location error is within 20 pixels. Success rate evaluates the overlap ratio using intersection over union (IoU), and the area under curve (AUC) of the success plot is reported. For LaSOT, we report AUC, normalized precision ($P_{Norm}$), and precision (P). Additionally, model parameter volume (M), computational complexity (GFLOPs), and frames per second (FPS) are used to evaluate efficiency.
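The two accuracy metrics can be computed as in the following sketch. It assumes boxes in `(x, y, w, h)` format, 21 evenly spaced overlap thresholds in $[0, 1]$, and a strict inequality for the success test, which follows common one-pass-evaluation toolkits rather than anything stated in this section.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    """Euclidean distance between box centers, in pixels."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

def success_auc(preds, gts, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds (area under the success plot)."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    rates = []
    for i in range(num_thresholds):
        th = i / (num_thresholds - 1)
        rates.append(sum(o > th for o in overlaps) / len(overlaps))
    return sum(rates) / len(rates)

gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (100, 100, 10, 10)]   # one perfect frame, one lost frame
print(precision(preds, gts), success_auc(preds, gts))
```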

6.1.3. Implementation Details

Experiments are conducted on an Intel i9-13900K CPU, NVIDIA GeForce RTX 4090 GPU (24 GB), and 64 GB memory. LF-SSM is implemented in PyTorch 2.3 with CUDA 12.1.
Images are preprocessed by cropping search regions of 256 × 256 pixels centered on the target, with template regions of 128 × 128 pixels. For each training iteration, we sample template-search pairs from a temporal window of $\pm 100$ frames centered on the template frame with uniform random sampling. We use AdamW optimizer with initial learning rate $1 \times 10^{-4}$ and weight decay $1 \times 10^{-4}$. The model is trained for 300 epochs with batch size 32. The state dimension $N$ is set to 64, and the numerical stability constant $\epsilon$ is set to $10^{-6}$ for tangent space projection. During inference, we employ a static template strategy where the template is extracted from the first frame and cached throughout the sequence without online updating.
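The pair sampling can be sketched as follows. The uniform choice of the template frame, the clamping of the window at sequence boundaries, and the possibility that the search frame coincides with the template frame are assumptions; the function name `sample_pair` is hypothetical.

```python
import random

def sample_pair(num_frames, window=100, rng=random):
    """Pick a template frame, then a search frame uniformly within +/- window of it."""
    t = rng.randrange(num_frames)                 # template frame index
    lo = max(0, t - window)                       # clamp window to the sequence
    hi = min(num_frames - 1, t + window)
    s = rng.randint(lo, hi)                       # inclusive on both ends
    return t, s

rng = random.Random(0)
pairs = [sample_pair(500, rng=rng) for _ in range(1000)]
print(max(abs(t - s) for t, s in pairs))
```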
For edge platform deployment, we evaluate on NVIDIA Jetson Orin Nano (NVIDIA Corporation, Santa Clara, CA, USA) (8 GB RAM, 1024-core Ampere GPU, 15 W TDP), which represents typical computational constraints in real-world UAV systems where power consumption and weight limitations necessitate lightweight yet accurate tracking solutions. Models are converted using TensorRT 8.5 with FP16 precision. For fair comparison, all baseline methods are evaluated using the same deployment pipeline with TensorRT FP16 optimization on both NVIDIA GeForce RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA) and Jetson Orin Nano platforms. We implement three model variants with different numbers of GSM Blocks: LF-SSM-S with L = 6 blocks, LF-SSM-M with L = 12 blocks, and LF-SSM-L with L = 18 blocks. This configuration enables systematic comparison while maintaining architectural consistency across different model scales.

6.1.4. Baselines

We conduct extensive comparative experiments with 13 state-of-the-art tracking methods across two categories. Transformer-based methods include SeqTrack [13], VideoTrack [48], ARTrackV2 [98], ODTrack [47], AQATrack [99], EVPTrack [49], MixFormer [40], SPMTrack [50], and HIPTrack [51]. SSM-based methods include Mamba-FETrack [90], MambaVT [91], MambaVision [88], and VideoMamba [92], which serve as the primary baselines for evaluating our HiPPO-free design.

6.2. Comparison with SOTA Methods

We conduct comprehensive comparisons with state-of-the-art methods across four tracking benchmarks. The results are presented in Table 1 and Table 2.
On UAV123, while SPMTrack [50] achieves AUC of 72.8%, our LF-SSM-L obtains the highest AUC of 73.2% and precision of 92.5%, surpassing SPMTrack by 0.4% in AUC. Compared to SSM-based methods, LF-SSM-L outperforms VideoMamba [92] by 4.4% and Mamba-FETrack [90] by 5.0% in AUC. For VisDrone, LF-SSM-L achieves AUC of 69.2% and precision of 88.5%, outperforming SPMTrack with AUC of 68.5% and all SSM-based methods by significant margins. On ARDMAV which features aggressive drone motion and rapid viewpoint changes, our method leads with AUC of 67.8% and precision of 86.2%, exceeding SPMTrack by 1.3% in AUC.
For long-term tracking on LaSOT, LF-SSM-L achieves competitive results with AUC of 75.8%, $P_{Norm}$ of 85.5%, and P of 84.2%. While SPMTrack achieves the highest AUC of 77.4%, our method outperforms all SSM-based baselines including VideoMamba by 4.0% and Mamba-FETrack by 4.3% in AUC, demonstrating the effectiveness of our HiPPO-free geodesic state evolution.
The consistent performance gains across all four benchmarks demonstrate LF-SSM’s capability in modeling long-range temporal dependencies while preserving fine-grained spatial details. Notably, our method achieves superior precision scores on UAV-specific datasets, indicating excellent boundary localization ability crucial for aerial tracking scenarios. The improvement is more pronounced on ARDMAV which features rapid motion variations, validating that geodesic state evolution effectively handles appearance changes through adaptive tangent space representation.
Beyond accuracy improvements, LF-SSM exhibits remarkable computational efficiency. As shown in Table 2, our LF-SSM-S achieves 69 FPS on Jetson Orin Nano with only 18.5 M parameters and 12.8 GFLOPs, significantly outperforming all compared methods. Compared to Mamba-FETrack which operates at 32 FPS with 78 M parameters, LF-SSM-S achieves 2.2× speedup with 76.3% parameter reduction. To ensure fair comparison, we also evaluate parameter-matched SSM variants scaled to approximately 18–19 M parameters. Even at matched parameter counts, LF-SSM-S maintains substantial advantages: it achieves 69 FPS compared to 55 FPS for S4D-S and 42–48 FPS for Mamba-based variants, representing 1.25–1.64× speedup. This efficiency gain stems from the elimination of discretization overhead and selective scan dependencies in our geodesic formulation, rather than merely reduced model capacity. While Transformer-based methods like MixFormer and SeqTrack operate at merely 8–12 FPS on Jetson Orin Nano due to quadratic attention complexity, our HiPPO-free design enables efficient edge deployment. LF-SSM effectively bridges this efficiency-accuracy gap, achieving state-of-the-art tracking performance while maintaining real-time capability on edge platforms.

6.3. Qualitative Evaluation

Figure 4 presents visual comparisons between LF-SSM and five representative SOTA methods on challenging cases from four datasets. The examples demonstrate our method’s superior performance in diverse UAV tracking scenarios, including small target tracking where targets occupy less than 20 pixels against cluttered backgrounds in rows 1, 5, and 7. In these cases, LF-SSM excels at maintaining precise bounding box alignment, while other methods often produce drifted or incorrectly scaled predictions.
As shown in rows 3 and 4, LF-SSM effectively handles dense urban traffic scenes with multiple similar distractors. Additionally, rows 5 and 6 demonstrate our method’s capability in MAV tracking scenarios across varying environmental contexts, ranging from hillside vegetation in row 5 to urban infrastructure with roads and buildings in row 6. These examples highlight our model’s capability to maintain robust localization under extreme scale variations, a strength attributed to our geodesic state evolution with manifold-based updates. This advantage becomes particularly evident when compared to other SSM-based methods like VideoMamba and Mamba-FETrack, which struggle with target drift in similarly challenging environments.
Row 8 presents nighttime tracking of an LED-illuminated drone against a dark background. The results demonstrate that conventional SSM-based methods exhibit significant limitations in low-light conditions due to HiPPO’s fixed polynomial basis compressing appearance information. In contrast, our proposed method successfully maintains target identity through geodesic state transitions that preserve discriminative features across illumination changes.

6.4. Ablation Study

To evaluate the effectiveness of our proposed components, we conduct comprehensive experiments on different model configurations. The results are presented in Table 3 and Table 4.

6.4.1. Component Analysis

We investigate the contribution of each component in the Geodesic State Module. TSP, ISS, and Gate denote Tangent Space Projection, Input-dependent Step Size, and Gating Mechanism, respectively.
As shown in Table 3, #1 serves as the baseline with only geodesic state evolution on the unit sphere, achieving AUC of 68.5% on UAV123. Among individual components, Tangent Space Projection (#2) contributes the most significant improvement of 1.7% AUC, validating that adaptive local coordinate systems effectively preserve the geometric structure of tracking features. Input-dependent Step Size (#3) brings 1.0% gain by enabling content-aware state transitions, while Gating Mechanism (#4) contributes 1.3% through selective information flow. Combining TSP and ISS (#5) achieves 71.5% AUC, outperforming other two-component combinations (#6 and #7). The full model (#Ours) with all components obtains the best performance of 73.2% AUC on UAV123, demonstrating that these components are complementary and jointly contribute to the final performance. Consistent improvements are observed across all four benchmarks.

6.4.2. Comparison with SSM Methods

We compare our geodesic state evolution with different SSM formulations to validate the advantage of eliminating the HiPPO framework. All variants are implemented with the same backbone architecture for fair comparison.
As shown in Table 4, S4 with HiPPO-LegS initialization achieves AUC of 67.5% with 25 FPS on Jetson Orin Nano, while diagonal variants S4D and S5 obtain slightly better results through simplified parameterization. Mamba with selective scan mechanism achieves 69.5% AUC at 38 FPS, but its hardware-specific kernels designed for high-performance GPUs cannot be fully utilized on edge platforms. To isolate the contribution of geodesic formulation, we also implement a Euclidean baseline that replaces geodesic update with $h_t = (h_{t-1} + \Delta_t u_t) / \|h_{t-1} + \Delta_t u_t\|$ using the same gating and convolution layers. This baseline achieves 70.2% AUC, which is 3.0% lower than our full model, demonstrating that the geodesic formulation with trigonometric functions provides superior state evolution compared to simple normalization. This performance gap arises because the Euclidean baseline operates in flat space before projecting back to the sphere, which introduces systematic distance distortions that accumulate over long sequences. In contrast, our geodesic update through the exponential map $\mathrm{Exp}_{h_t}(\Delta v_t) = \cos(\Delta \|v_t\|)\, h_t + \sin(\Delta \|v_t\|)\, v_t / \|v_t\|$ respects the intrinsic spherical geometry and preserves geodesic distances exactly. Moreover, the tangent space $T_{h_t} S^{N-1}$ provides an adaptive local coordinate system that rotates with the evolving state, allowing new inputs to be represented in coordinates aligned with the current geometric context rather than a fixed polynomial basis. The prior velocity mechanism further enables geometry-aware temporal modeling, where the projection $v_{t+1}^{\mathrm{prior}} = v_t - \langle v_t, h_{t+1} \rangle h_{t+1}$ naturally discards outdated information when the trajectory curves sharply during rapid appearance changes, while preserving consistency during smooth evolution.
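The distance distortion of the normalization-based update can be illustrated numerically. The sketch below assumes a tangent input (orthogonal to the current state) and omits the stability constant $\epsilon$; it compares the arc length actually traveled by each update with the intended step $\Delta \|v\|$. Function names are hypothetical.

```python
import math

def norm(x):
    return math.sqrt(sum(a * a for a in x))

def arc(x, y):
    """Geodesic distance on the unit sphere, with the dot product clamped to [-1, 1]."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, y))))
    return math.acos(dot)

def exp_map(h, v, delta):
    """Geodesic update: move along the great circle in direction v."""
    n = norm(v)
    theta = delta * n
    return [math.cos(theta) * a + math.sin(theta) * b / n for a, b in zip(h, v)]

def normalize_step(h, u, delta):
    """Euclidean baseline: step in flat space, then renormalize."""
    w = [a + delta * b for a, b in zip(h, u)]
    n = norm(w)
    return [a / n for a in w]

h = [1.0, 0.0, 0.0]
v = [0.0, 3.0, 4.0]                       # tangent at h, with norm 5
for delta in (0.02, 0.1, 0.2):
    target = delta * norm(v)
    geo = arc(h, exp_map(h, v, delta))
    euc = arc(h, normalize_step(h, v, delta))
    print(target, geo, euc)
```

In this setting the exponential map travels exactly the intended arc length, whereas the normalized Euclidean step travels $\arctan(\Delta \|v\|)$, so the shortfall grows with the step size.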
To further validate the benefits of the Riemannian formulation against traditional sequence modeling approaches, we include two additional baselines with matched parameter counts. Linear-RNN employs the recurrence $h_t = A h_{t-1} + B x_t$ with orthogonal state transition matrix $A$ initialized via Cayley parameterization $A = (I - S)(I + S)^{-1}$, where $S$ is skew-symmetric, ensuring $|\lambda_i(A)| = 1$ for gradient stability. GRU implements standard gating with update gate $z_t$, reset gate $r_t$, and candidate state $\tilde{h}_t$. LF-SSM outperforms Linear-RNN by 4.7% AUC and GRU by 3.0% AUC. The improvement over GRU is noteworthy since it demonstrates that the geometric inductive bias of spherical state evolution is more effective than learned gating mechanisms for visual tracking. Additionally, LF-SSM achieves 69 FPS compared to 32 FPS for Linear-RNN and 28 FPS for GRU, as the geodesic formulation avoids the $N \times N$ matrix multiplications required by these recurrent architectures.
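For illustration, the Cayley parameterization used by the Linear-RNN baseline can be worked out by hand in the $2 \times 2$ case, where a skew-symmetric $S$ has a single free parameter $s$ (a worked example under this low-dimensional assumption, not the configuration used in the experiments):

```latex
\begin{aligned}
S &= \begin{pmatrix} 0 & s \\ -s & 0 \end{pmatrix}, \qquad
(I + S)^{-1} = \frac{1}{1+s^2}\begin{pmatrix} 1 & -s \\ s & 1 \end{pmatrix}
             = \frac{1}{1+s^2}\,(I - S), \\[4pt]
A &= (I - S)(I + S)^{-1}
   = \frac{1}{1+s^2}\begin{pmatrix} 1 - s^2 & -2s \\ 2s & 1 - s^2 \end{pmatrix}
   = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}, \\[4pt]
\cos\varphi &= \frac{1-s^2}{1+s^2}, \qquad \sin\varphi = \frac{2s}{1+s^2}, \qquad
\lambda_{1,2}(A) = e^{\pm i\varphi}, \quad |\lambda_{1,2}(A)| = 1 .
\end{aligned}
```

The resulting $A$ is a plane rotation, so repeated application neither shrinks nor inflates the state norm, which is the gradient-stability property the baseline relies on.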
Our LF-SSM achieves the highest AUC of 73.2% with 69 FPS on Jetson Orin Nano, outperforming Mamba by 3.7% in accuracy and 1.8× in speed.
To rigorously isolate the contribution of geodesic state evolution from other architectural components such as gating, convolution, and bidirectional processing, we conducted controlled experiments where all components remain identical except for the state update mechanism. Table 5 presents this controlled comparison on UAV123.
The controlled experiments reveal that geodesic state evolution provides 4.7% AUC improvement over Euclidean Linear baseline, 4.0% over Diagonal SSM, and 3.0% over Euclidean-plus-Normalization. Since all architectural components are held constant, these improvements are directly attributable to the Riemannian state update mechanism. We further verified that the other components provide comparable benefits to both Euclidean and Riemannian formulations: the gating mechanism improves Diagonal SSM by 1.4% and Geodesic by 1.2%; the convolution layer improves both by approximately 1.7%; the bidirectional processing improves Diagonal SSM by 2.1% and Geodesic by 2.3%. These similar magnitudes confirm that the architectural components provide additive benefits independent of the state update mechanism, and the 3.0% gap between Euclidean-plus-Normalization and Geodesic formulation is attributable to the intrinsic geometric advantages of manifold-based state evolution.
To further analyze the source of efficiency differences, we provide a detailed per-operation latency breakdown in Table 6. The measurements reveal that trigonometric operations contribute only 1.7% of total GSM computation time, while the selective scan in Mamba consumes 49% of its total time due to sequential dependencies. This confirms that the transcendental operations are not a bottleneck and that the efficiency advantage stems from avoiding sequential dependencies.

6.4.3. Quantitative Feature Analysis

To validate that the feature sharpness observed in Figure 5 arises from the Riemannian geometric structure rather than model capacity effects, we conducted controlled experiments with matched parameter counts (52.8 M) across all methods. We introduce three metrics: Target Response Ratio (TRR) measuring target-background discrimination, Feature Entropy quantifying spatial concentration, and Boundary Gradient Magnitude measuring edge sharpness.
As shown in Table 7, LF-SSM achieves 53% higher TRR, 30% lower entropy, and 71% higher boundary gradient compared to Mamba at matched parameters. Frequency analysis confirms that LF-SSM maintains balanced spectral distribution without overfitting artifacts. These improvements originate from geodesic state evolution properties: exact norm preservation prevents magnitude explosion, adaptive tangent spaces provide location-dependent coordinates, and angular relationship preservation maintains feature discriminability.

6.5. Parameter Sensitivity Analysis

We analyze the sensitivity of LF-SSM to the state dimension N, which controls the capacity of manifold-based state representation. Experiments are conducted on UAV123 with LF-SSM-L configuration, and FPS is measured on RTX 4090.
As shown in Table 8, increasing the state dimension $N$ from 16 to 64 brings consistent performance improvements, with AUC rising from 70.2% to 73.2%. Further increasing $N$ to 128 and 256 yields marginal gains of 0.3% and 0.4%, respectively, while significantly increasing parameters and reducing inference speed. This indicates that $N = 64$ achieves an optimal balance between representation capacity and computational efficiency. The unit sphere $S^{63}$ provides sufficient degrees of freedom for capturing the geometric structure of tracking features without redundant parameterization.
We further analyze the sensitivity to step size configuration. Table 9 compares fixed step sizes at different scales against our adaptive step size mechanism on UAV123.
The results reveal several important observations about step size sensitivity. Among fixed step sizes, moderate values around $\Delta = 0.10$ achieve the best performance with AUC of 72.8%, while very small step sizes ($\Delta = 0.01$) limit the model’s ability to adapt quickly to appearance changes, and large step sizes ($\Delta = 0.50$) cause overshooting that degrades localization precision. The adaptive step size mechanism outperforms all fixed configurations by 0.4% in AUC, demonstrating that content-aware step size selection enables the network to balance stability and responsiveness according to input characteristics. Empirical analysis of the learned step size distribution shows that the adaptive mechanism produces values concentrated between 0.05 and 0.15, with larger values appearing near object boundaries where rapid feature changes occur.

7. Discussion

The preceding sections have presented the technical formulation and experimental validation of LF-SSM. In this section, we provide deeper theoretical analysis of the key design choices. We examine the stability properties of geodesic dynamics, the mechanism by which temporal information is selectively forgotten, and the relationship between our approach and traditional HiPPO-based state space models.

7.1. Perturbation Stability of Geodesic Dynamics

The exact norm preservation established in Section 4.2 guarantees that states remain bounded on the unit sphere, but does not by itself characterize how the dynamics respond to perturbations in the input or initial conditions. Understanding this sensitivity is essential for the assessment of robustness in practical tracking scenarios where sensor noise and feature extraction errors are inevitable.
Consider two state trajectories generated by the geodesic dynamics. The unperturbed trajectory $\{h_t\}_{t=0}^{T}$ and the perturbed trajectory $\{\tilde{h}_t\}_{t=0}^{T}$ start from the same initial condition $h_0 = \tilde{h}_0$, but the latter is subject to additive noise $\eta_t$ in the tangent velocity at each step. The perturbed update takes the form $\tilde{h}_{t+1} = \mathrm{Exp}_{\tilde{h}_t}(\Delta (v_t + \eta_t))$, where $\eta_t \in T_{\tilde{h}_t} S^{N-1}$ represents the perturbation projected onto the tangent space and we assume uniform boundedness $\|\eta_t\| \le \varepsilon$ for some small $\varepsilon > 0$. The quantity of interest is the geodesic distance $d_t = d_S(h_t, \tilde{h}_t)$ between the two trajectories, where $d_S(x, y) = \arccos(\langle x, y \rangle)$ measures arc length on the unit sphere.
The evolution of this distance is governed by the Lipschitz properties of the exponential map on curved manifolds. On the unit sphere with constant positive sectional curvature $\kappa = 1$, the exponential map exhibits local Lipschitz continuity in its tangent argument. Specifically, for two tangent vectors $v, v' \in T_h S^{N-1}$ at the same basepoint with magnitudes that satisfy $\Delta \|v\|, \Delta \|v'\| < \delta$ for some $\delta < \pi/2$, the geodesic distance between their images satisfies $d_S(\mathrm{Exp}_h(\Delta v), \mathrm{Exp}_h(\Delta v')) \le C \Delta \|v - v'\|$, where the Lipschitz constant $C = 1 + \delta^2/2$ depends only on the radius bound $\delta$. This estimate follows from Taylor expansion of the spherical law of cosines, which relates the geodesic distance to the angle between tangent directions and the magnitudes of the tangent vectors.
The combination of this tangent-direction estimate with analogous bounds for basepoint variation and iteration over the sequence length yields a Grönwall-type inequality for the trajectory deviation. Under the assumptions that the step size satisfies $\Delta < \Delta_{\max}$ for some bound that ensures $\Delta \|v_t\| < \delta$ throughout the trajectory, that perturbations are uniformly bounded by $\|\eta_t\| \le \varepsilon$, and that the trajectories remain within a geodesic ball of radius $\delta_0 < \pi/2$ of each other, the deviation at time $t$ satisfies
$$d_t \le \frac{C \varepsilon}{L}\left(e^{L t \Delta} - 1\right),$$
where $L > 0$ is a constant that depends on the curvature and the neighborhood radius. For short time horizons where the product $L t \Delta$ remains small, this bound simplifies to approximately linear growth $d_t \approx C \varepsilon t \Delta$, which indicates that perturbations accumulate at most linearly with sequence length. The transition from linear to exponential growth occurs at a critical time $t_{\mathrm{crit}} \approx 1/(L \Delta)$, which for typical tracking parameters with step size $\Delta = 0.1$ and empirically estimated $L \approx 0.015$ occurs around $t \approx 600$ steps. Since most tracking sequences in standard benchmarks contain fewer than a few hundred frames, the geodesic dynamics operate predominantly in the linear growth regime and ensure robust and predictable behavior throughout practical applications.
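The two regimes of the bound can be inspected numerically. In the sketch below, $\Delta = 0.1$ and $L = 0.015$ are the values quoted above, while the perturbation level `eps = 1e-3` and the radius bound used for $C$ are hypothetical choices; for the quoted values the expression $1/(L\Delta)$ evaluates to roughly 667 steps, the same order of magnitude as the estimate in the text.

```python
import math

def deviation_bound(t, eps, delta, L, C):
    """Gronwall-type bound: (C * eps / L) * (exp(L * t * delta) - 1)."""
    return C * eps / L * math.expm1(L * t * delta)

def linear_bound(t, eps, delta, C):
    """Short-horizon approximation: C * eps * t * delta."""
    return C * eps * t * delta

delta, L = 0.1, 0.015          # values quoted in the text
C = 1.0 + 0.5 ** 2 / 2.0       # C = 1 + radius^2 / 2 with a hypothetical radius of 0.5
eps = 1e-3                     # hypothetical perturbation level
t_crit = 1.0 / (L * delta)

for t in (50, 200, t_crit, 2000):
    ratio = deviation_bound(t, eps, delta, L, C) / linear_bound(t, eps, delta, C)
    print(round(t), ratio)
```

The ratio between the exponential bound and its linear approximation depends only on $L t \Delta$; it stays close to 1 for a few hundred steps and reaches $e - 1$ at $t_{\mathrm{crit}}$.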
We validated these theoretical predictions on the LaSOT benchmark with sequences averaging 2500 frames. The state norm deviation $|\|h_t\| - 1|$ remained below $10^{-14}$ throughout tracking with FP32 arithmetic, confirming exact norm preservation to machine precision over extended time horizons. The tracking performance (AUC 75.8%) showed no degradation compared to shorter sequences, demonstrating that the geodesic formulation maintains numerical stability even on long sequences where traditional SSMs may suffer from accumulated drift or spectral instability due to eigenvalue sensitivity in the state transition matrix.

7.2. Temporal Forgetting Through Geometric Transport

In traditional state space models, the forgetting of outdated information is controlled by the eigenvalue spectrum of the transition matrix. Eigenvalues with magnitude less than unity cause exponential decay of the corresponding state components, with smaller eigenvalues that produce faster forgetting. This mechanism is uniform and input-independent. All historical information decays at fixed rates determined solely by the matrix spectrum, regardless of the relevance or importance of that information to the current tracking context.
Our geodesic framework implements a fundamentally different forgetting mechanism that arises naturally from the geometry of the sphere rather than from explicit spectral design. The key element is the prior velocity $v_t^{\text{prior}}$, which carries directional momentum from previous updates and provides temporal continuity across the state sequence. When the state moves from $h_t$ to $h_{t+1}$, this prior velocity must be transported to the tangent space at the new location. We employ the projection approximation $v_{t+1}^{\text{prior}} = \mathrm{Proj}_{h_{t+1}}(v_t) = v_t - \langle v_t, h_{t+1} \rangle h_{t+1}$, which removes from the old velocity its component along the new state direction while preserving the orthogonal remainder.
This projection implements selective, geometry-aware forgetting. The amount of information discarded depends on the alignment between the prior velocity and the new state direction, which in turn reflects the curvature of the trajectory through state space. When the tracking target undergoes rapid appearance changes, the state trajectory curves sharply and produces large angles between successive states. In this regime, the projection removes a substantial component of the prior velocity and effectively forgets outdated directional information that is no longer relevant to the current appearance. Conversely, when the target appearance evolves smoothly, the trajectory follows a nearly geodesic path with small curvature, and most of the prior velocity survives projection to maintain temporal consistency. This adaptive behavior emerges automatically from the geometry without explicit parameterization or tuning.
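The dependence of forgetting on trajectory curvature can be seen in a small numerical sketch. It takes one geodesic step of a chosen arc length on a three-dimensional unit sphere, transports the velocity by projection, and reports how much of its magnitude survives. The helper names and the specific arc lengths are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def project_tangent(h, u):
    """Proj_h(u) = u - <u, h> h for a unit-norm base point h."""
    c = dot(u, h)
    return [ui - c * hi for ui, hi in zip(u, h)]

def exp_map(h, v, step):
    """Geodesic step on the unit sphere from h along the tangent vector v."""
    speed = norm(v)
    theta = step * speed
    return [math.cos(theta) * hi + math.sin(theta) * vi / speed
            for hi, vi in zip(h, v)]

def retained_fraction(step):
    """Fraction of the velocity magnitude that survives projection transport."""
    h = [1.0, 0.0, 0.0]
    v = [0.0, 1.0, 0.0]                    # unit tangent velocity at h
    h_next = exp_map(h, v, step)
    v_prior = project_tangent(h_next, v)   # transported prior velocity
    assert abs(dot(v_prior, h_next)) < 1e-12   # tangent at the new state
    return norm(v_prior) / norm(v)

for step in (0.05, 0.3, 1.0):
    print(f"arc length {step:.2f} rad -> retained fraction {retained_fraction(step):.3f}")
```

Short arcs leave the prior almost untouched, while long arcs between successive states remove a large share of it, matching the smooth and rapid appearance-change regimes described above.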
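The forgetting rate can be quantified in terms of the geodesic step size. Let $\theta_t = \Delta_t \|v_t\|$ denote the arc length traveled in one update step, which equals the geodesic distance $d_{\mathbb{S}}(h_t, h_{t+1})$ between successive states. Because $v_t$ is tangent at $h_t$ and $h_{t+1} = \cos(\theta_t)\, h_t + \sin(\theta_t)\, v_t / \|v_t\|$, the inner product with the new state is $\langle v_t, h_{t+1} \rangle = \|v_t\| \sin(\theta_t)$. The magnitude of the transported prior therefore satisfies $\|v_{t+1}^{\text{prior}}\|^2 = \|v_t\|^2 - \langle v_t, h_{t+1} \rangle^2 = \|v_t\|^2 (1 - \sin^2(\theta_t)) = \|v_t\|^2 \cos^2(\theta_t)$, which yields $\|v_{t+1}^{\text{prior}}\| = \|v_t\| \cos(\theta_t)$ for $\theta_t < \pi/2$. For small steps where $\theta_t \ll 1$, this gives approximately $\|v_{t+1}^{\text{prior}}\| \approx \|v_t\| (1 - \theta_t^2 / 2)$, which indicates that the fraction of the prior discarded per step grows quadratically with the step size. Over multiple steps, this produces gradual attenuation of historical momentum and implements smooth forgetting that avoids the abrupt information loss associated with aggressive eigenvalue decay in traditional formulations.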

7.3. Relationship to HiPPO-Based State Transitions

The HiPPO framework provides the mathematical foundation for constructing state transition matrices in modern structured state space models. The core idea is to maintain a compressed representation of input history by projecting the continuous input signal onto a basis of orthogonal polynomials, typically the Legendre polynomials. The resulting transition matrix $A$ is derived analytically to ensure that the state vector approximates the coefficients of the optimal polynomial projection at each time step. While this construction achieves efficient compression of temporal history, it imposes a fixed representational basis that does not adapt to the specific structure of the input data.
Our geodesic formulation can be understood as a replacement of this fixed global basis with adaptive local coordinates that evolve with the state. At each point $h_t$ on the unit sphere, the tangent space $T_{h_t} \mathbb{S}^{N-1}$ provides a local coordinate system centered at the current state. Input features are projected onto this tangent space before they are used to update the state, which effectively represents new information in coordinates adapted to the current representation. As the state evolves along its geodesic trajectory, the tangent space rotates accordingly and continuously adapts the local coordinate frame to track the evolving feature geometry. This can be viewed as a moving frame that follows the state along the manifold, in contrast to the fixed polynomial frame used in HiPPO-based methods.
The relationship between geodesic updates and traditional normalized linear updates provides additional insight into this comparison. Consider the normalized linear update $h_{t+1}^{\text{lin}} = (h_t + \Delta v_t) / \|h_t + \Delta v_t\|$, which applies a linear displacement followed by projection back onto the sphere. Under the scaling assumption that the tangent velocity satisfies $\|v_t\| = O(1)$ with step size $\Delta \ll 1$, both the geodesic update and the normalized linear update admit Taylor expansions. For the normalized linear update, using the binomial expansion $(1 + x)^{-1/2} = 1 - x/2 + O(x^2)$ with $x = \Delta^2 \|v_t\|^2$, we obtain
$$h_{t+1}^{\text{lin}} = h_t + \Delta v_t - \frac{\Delta^2 \|v_t\|^2}{2} h_t + O(\Delta^3).$$
For the geodesic update, expanding the trigonometric functions $\cos(\Delta \|v_t\|) = 1 - \Delta^2 \|v_t\|^2 / 2 + O(\Delta^4)$ and $\sin(\Delta \|v_t\|) = \Delta \|v_t\| - \Delta^3 \|v_t\|^3 / 6 + O(\Delta^5)$ yields
$$h_{t+1}^{\text{geo}} = h_t + \Delta v_t - \frac{\Delta^2 \|v_t\|^2}{2} h_t + O(\Delta^3).$$
The leading terms through second order are identical, which confirms that geodesic evolution generalizes normalized linear dynamics to second order in the step size. The difference between the two methods first appears at third order, establishing that $\|h_{t+1}^{\text{geo}} - h_{t+1}^{\text{lin}}\| = O(\Delta^3)$.
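The third-order scaling can be checked numerically. The sketch below applies both updates from the same state and unit-speed tangent velocity, then divides the gap by the cube of the step size. The function names and the test vectors are assumptions made for the example. For a unit-speed velocity, the two expansions differ in their third-order coefficients by $1/2 - 1/6$, so the ratio is expected to settle near $1/3$.

```python
import math

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def geodesic_update(h, v, step):
    """Exponential map on the unit sphere: cos(theta) h + sin(theta) v / ||v||."""
    speed = norm(v)
    theta = step * speed
    return [math.cos(theta) * hi + math.sin(theta) * vi / speed
            for hi, vi in zip(h, v)]

def normalized_linear_update(h, v, step):
    """Euclidean step followed by renormalization onto the sphere."""
    y = [hi + step * vi for hi, vi in zip(h, v)]
    n = norm(y)
    return [yi / n for yi in y]

def update_gap(step, h=(1.0, 0.0, 0.0), v=(0.0, 0.6, 0.8)):
    """Distance between the two updates; v is tangent at h with unit speed."""
    g = geodesic_update(h, v, step)
    l = normalized_linear_update(h, v, step)
    return norm([a - b for a, b in zip(g, l)])

for step in (0.2, 0.1, 0.05, 0.025):
    gap = update_gap(step)
    print(f"step={step:.3f}  gap={gap:.3e}  gap/step^3={gap / step ** 3:.4f}")
```

Halving the step size should shrink the gap by a factor close to eight, which is the signature of a third-order difference.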
Although the single-step difference is small, it accumulates over long sequences to produce measurable performance gaps. For a sequence of length $T$ with step size $\Delta$, the accumulated error is of order $T \cdot \Delta^3$. With typical tracking parameters $\Delta \approx 0.1$ and $T \approx 500$, this corresponds to approximately $500 \times 0.001 = 0.5$ radians of accumulated angular deviation. The geodesic formulation produces smaller third-order errors because its Taylor coefficients arise from the exact spherical geometry, whereas the normalized linear update introduces additional error through the normalization operation. This accumulated difference explains the 3% AUC gap observed in Table 4 between our geodesic method and the Euclidean-plus-normalization baseline.
The geodesic formulation provides two additional advantages beyond this asymptotic equivalence. First, it remains well-defined and numerically stable even when the linear update $h_t + \Delta v_t$ approaches zero magnitude, a regime where normalization-based approaches suffer from division by near-zero quantities. Second, it provides access to the full toolkit of Riemannian geometry, which includes parallel transport for velocity propagation and geodesic distance for the measurement of state similarity, enabling the geometry-aware temporal modeling described in the previous subsection.

7.4. Limitations and Future Directions

While LF-SSM demonstrates strong performance across multiple benchmarks, our analysis reveals certain failure modes that warrant discussion. As illustrated in Figure 6, we observe two representative failure cases on LaSOT and UAV123 datasets. The first involves rapid and extreme scale variations where the target undergoes significant size changes exceeding $4\times$ within a short temporal window. In such scenarios, the geodesic state evolution may fail to adapt quickly enough because the tangent space projection operates on normalized directional features through $v_t = u_t - \langle u_t, h_{t-1} \rangle h_{t-1}$, which can lose scale-sensitive information during dramatic size transitions. As shown in Figure 6 (left), when a drone target in a LaSOT sequence rapidly approaches the camera, the predicted bounding box fails to expand proportionally with the ground truth, resulting in significant scale mismatch. The second failure mode involves significant bounding box drift in cluttered scenes with similar background regions. As shown in Figure 6 (right), in a UAV123 aerial traffic scene, the predicted bounding box exhibits noticeable spatial drift from the ground truth location. This drift occurs because the geodesic state evolution accumulates small errors over time when the target appearance is similar to surrounding background regions. The normalized state update $h_t = \cos(\|v_t\|)\, h_{t-1} + \sin(\|v_t\|)\, \frac{v_t}{\|v_t\|}$ can gradually shift when background features consistently influence the input projection $u_t = \mathrm{Linear}_B(\tilde{x}_t)$, causing the state to evolve toward a slightly offset position on the manifold. These failure modes suggest promising research directions including adaptive scale modeling that incorporates magnitude information into the state representation, and robust state update mechanisms that mitigate drift accumulation in challenging scenarios.
Beyond addressing these failure modes, several theoretical aspects of our approach warrant further investigation. The choice of the unit sphere $\mathbb{S}^{N-1}$ as the state manifold was motivated primarily by computational convenience, as this manifold admits closed-form expressions for the exponential map, logarithmic map, and parallel transport. Other Riemannian manifolds may offer advantages for specific tracking scenarios. Stiefel manifolds that consist of orthonormal frames could naturally represent pose or orientation dynamics, while hyperbolic spaces with their exponentially expanding geometry might better capture hierarchical feature relationships. The extension of the geodesic state evolution framework to these alternative manifolds represents a promising direction for future work.
The stability analysis presented above assumes bounded step sizes, whereas our implementation employs input-dependent step sizes $\Delta_t = \sigma(\mathrm{Linear}_\Delta(\tilde{x}_t))$ computed from the input features. The softplus activation $\sigma$ ensures positivity and empirically produces well-behaved step sizes, but formal stability guarantees for fully adaptive stepping require additional theoretical development. The establishment of conditions on the step size network that ensure long-term stability would strengthen the theoretical foundations of our approach.
Several architectural properties provide implicit guarantees against degenerate dynamics even with adaptive step sizes. The geodesic formulation maintains exact norm preservation $\|h_t\| = 1$ by construction, which eliminates the possibility of state disappearance regardless of the step size values. The input-dependent tangent velocity $v_t = \mathrm{Proj}_{h_t}(B \tilde{x}_t + \alpha v_t^{\text{prior}})$ changes direction at each time step as different spatial tokens are processed, which prevents the state from following a single great circle and thereby avoids periodic motion. The combination of bounded input features from normalized images, weight decay regularization on the step size network parameters, and the smooth softplus activation ensures that the product $\Delta_t \|v_t\|$ remains within the domain where the local Lipschitz bounds apply. Empirical measurements confirm that this product stays below 0.3 radians throughout training and inference, well within the stability regime established by the perturbation analysis.
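The interaction between the softplus step size, the tangent projection, and the exponential map can be sketched as a single update function. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights `B`, `w_delta`, `b_delta`, the mixing coefficient `alpha`, the dimensions, and the random tokens are all hypothetical, and the measured arc lengths say nothing about the trained model.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def project_tangent(h, u):
    """Proj_h(u) = u - <u, h> h for a unit-norm base point h."""
    c = dot(u, h)
    return [ui - c * hi for ui, hi in zip(u, h)]

def softplus(z):
    """Numerically stable log(1 + exp(z)); always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def gsm_step(h, x, v_prior, B, w_delta, b_delta, alpha):
    """One hypothetical geodesic state update with an input-dependent step."""
    u = [dot(row, x) + alpha * vp for row, vp in zip(B, v_prior)]
    v = project_tangent(h, u)                    # tangent velocity at h
    step = softplus(dot(w_delta, x) + b_delta)   # Delta_t > 0
    speed = norm(v)
    if speed < 1e-12:                            # no motion: keep the state
        return list(h), list(v), 0.0
    theta = step * speed                         # arc length Delta_t * ||v_t||
    h_next = [math.cos(theta) * hi + math.sin(theta) * vi / speed
              for hi, vi in zip(h, v)]
    v_prior_next = project_tangent(h_next, v)    # projection transport
    return h_next, v_prior_next, theta

random.seed(0)
N, D, T = 8, 4, 64                               # state dim, token dim, tokens
B = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(N)]
w_delta = [random.gauss(0, 0.1) for _ in range(D)]
b_delta, alpha = -2.0, 0.5                       # illustrative values only

h = [1.0] + [0.0] * (N - 1)
v_prior = [0.0] * N
max_theta = 0.0
for _ in range(T):
    x = [random.gauss(0, 1) for _ in range(D)]   # stand-in for a normalized token
    h, v_prior, theta = gsm_step(h, x, v_prior, B, w_delta, b_delta, alpha)
    max_theta = max(max_theta, theta)

print(f"||h_T|| = {norm(h):.15f}, max arc length = {max_theta:.4f} rad")
```

Tracking `max_theta` during training or inference is one way to monitor whether the product $\Delta_t \|v_t\|$ stays inside the range assumed by the perturbation analysis.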
Finally, the bidirectional processing employed in GSM Blocks operates on spatial token sequences within individual frames rather than temporal sequences across frames. This design choice prioritizes comprehensive spatial feature interaction between template and search regions, with temporal consistency handled through the concatenation mechanism rather than explicit temporal state propagation. The extension of the geodesic dynamics to model temporal evolution across video frames, potentially through hierarchical state spaces that operate at different temporal scales, could further enhance long-term tracking performance.

8. Conclusions

In this paper, we propose LF-SSM, a lightweight HiPPO-free state space model for real-time UAV tracking. Unlike existing SSM-based methods that rely on a fixed Legendre polynomial basis and complex discretization procedures, our approach reformulates state evolution on Riemannian manifolds where geodesic dynamics naturally preserve the geometric structure of tracking features. The core contribution is the Geodesic State Module, which performs state updates through tangent space projection and exponential mapping on the unit sphere, eliminating the need for HiPPO discretization and specialized scan kernels while providing adaptive local coordinate systems for feature representation. Extensive experiments on four tracking benchmarks demonstrate that LF-SSM achieves performance competitive with state-of-the-art methods while providing significant efficiency improvements, enabling real-time deployment on resource-constrained UAV platforms at 69 FPS on Jetson Orin Nano with only 18.5 M parameters. In future work, we plan to explore Riemannian manifolds with adaptive geometry and integrate online template updating mechanisms to further enhance long-term tracking capability.

Author Contributions

Conceptualization, T.W.; methodology, T.W.; software, T.W.; validation, T.W.; formal analysis, T.W.; investigation, T.W.; resources, X.X.; data curation, T.W.; writing—original draft preparation, T.W.; writing—review and editing, X.X. and S.Q.; visualization, T.W.; supervision, X.X., S.Q., C.S., D.W., H.T. and J.Y.; project administration, X.X.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62301587, 62502533, 62506379, and 62406337, in part by the Hubei Provincial Natural Science Foundation of China under Grant 2022CFB989, and in part by the Postdoctoral Fellowship Program of CPSF under Grant GZC20233552.

Data Availability Statement

Data derived from public domain resources.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, X.; Bai, Y.; Ma, J.; Li, Y.; Shang, C.; Shen, Q. NightTrack: Joint Night-Time Image Enhancement and Object Tracking for UAVs. Drones 2025, 9, 824. [Google Scholar] [CrossRef]
  2. Hossein Motlagh, N.; Kortoçi, P.; Su, X.; Lovén, L.; Hoel, H.K.; Bjerkestrand Haugsvær, S.; Srivastava, V.; Gulbrandsen, C.F.; Nurmi, P.; Tarkoma, S. Unmanned Aerial Vehicles for Air Pollution Monitoring: A Survey. IEEE Internet Things J. 2023, 10, 21687–21704. [Google Scholar] [CrossRef]
  3. Liu, Z.; An, P.; Yang, Y.; Qiu, S.; Liu, Q.; Xu, X. Vision-Based Drone Detection in Complex Environments: A Survey. Drones 2024, 8, 643. [Google Scholar] [CrossRef]
  4. Wang, Y.; Su, Z.; Xu, Q.; Li, R.; Luan, T.H.; Wang, P. A Secure and Intelligent Data Sharing Scheme for UAV-Assisted Disaster Rescue. IEEE/ACM Trans. Netw. 2023, 31, 2422–2438. [Google Scholar] [CrossRef]
  5. Khaki, S.; Safaei, N.; Pham, H.; Wang, L. WheatNet: A Lightweight Convolutional Neural Network for High-Throughput Image-Based Wheat Head Detection and Counting. Neurocomputing 2022, 489, 78–89. [Google Scholar] [CrossRef]
  6. Zheng, L.; Zeng, J.; Qin, L.; Ju, R. Multi-Attention Meets Pareto Optimization: A Reinforcement Learning Method for Adaptive UAV Formation Control. Drones 2025, 9, 845. [Google Scholar] [CrossRef]
  7. Mueller, M.; Smith, N.; Ghanem, B. A Benchmark and Simulator for UAV Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 445–461. [Google Scholar]
  8. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline. Int. J. Comput. Vis. 2020, 128, 1141–1159. [Google Scholar]
  9. Li, S.; Yeung, D.-Y. Visual Object Tracking for Unmanned Aerial Vehicles: A Benchmark and New Motion Models. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  10. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Wu, H.; Nie, Q.; Cheng, H.; Liu, C.; et al. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
  11. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  12. Lin, L.; Fan, H.; Xu, Y.; Ling, H. SwinTrack: A Simple and Strong Baseline for Transformer Tracking. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  13. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14572–14581. [Google Scholar]
  14. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457. [Google Scholar]
  15. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  16. Hamilton, J.D. State-Space Models. In Handbook of Econometrics; Elsevier: Amsterdam, The Netherlands, 1994; Volume 4, pp. 3039–3080. [Google Scholar]
  17. Yao, M.; Peng, J.; He, Q.; Peng, B.; Chen, H.; Chi, M.; Liu, C.; Benediktsson, J.A. MM-Tracker: Motion Mamba for UAV-platform Multiple Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, pp. 9409–9417. [Google Scholar]
  18. Xie, J.; Zhong, B.; Liang, Q.; Li, N.; Mo, Z.; Song, S. Robust Tracking via Mamba-Based Context-Aware Token Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, pp. 8727–8735. [Google Scholar]
  19. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  20. Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. Hippo: Recurrent Memory with Optimal Polynomial Projections. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; Volume 33, pp. 1474–1487. [Google Scholar]
  21. Netto, C.F.D.; Wang, Z.; Ruiz, L. Improved image classification with manifold neural networks. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  22. Narayanan, H.; Mitter, S. Sample complexity of testing the manifold hypothesis. Adv. Neural Inf. Process. Syst. 2010, 23, 1786–1794. [Google Scholar]
  23. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [Google Scholar]
  24. Mettes, P.; Van der Pol, E.; Snoek, C. Hyperspherical Prototype Networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  25. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-Convolutional Siamese Networks for Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 850–865. [Google Scholar]
  26. Hong, M.; Li, S.; Yang, Y.; Zhu, F.; Zhao, Q.; Lu, L. SSPNet: Scale Selection Pyramid Network for Tiny Person Detection From UAV Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  27. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  28. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
  29. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware Siamese Networks for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  30. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Siamese Anchor Proposal Network for High-Speed Aerial Tracking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 510–516. [Google Scholar]
  31. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556. [Google Scholar]
  32. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277. [Google Scholar]
  33. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese Box Adaptive Network for Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar]
  34. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. HiFT: Hierarchical Feature Transformer for Aerial Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15457–15466. [Google Scholar]
  35. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. TCTrack: Temporal Contexts for Aerial Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808. [Google Scholar]
  36. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. Towards Real-World Visual Tracking with Temporal Contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15834–15849. [Google Scholar] [CrossRef]
  37. Ye, J.; Fu, C.; Zheng, G.; Paudel, D.P.; Chen, G. Unsupervised Domain Adaptation for Nighttime Aerial Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8896–8905. [Google Scholar]
  38. Yao, L.; Fu, C.; Li, S.; Zheng, G.; Ye, J. SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 3353–3359. [Google Scholar]
  39. Wu, Y.; Li, Y.; Liu, M.; Wang, X.; Yang, X.; Ye, H.; Zeng, D.; Zhao, Q.; Li, S. Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking. IEEE Trans. Circuits Syst. Video Technol. Early Access. 2025. [Google Scholar] [CrossRef]
  40. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  41. Cui, Y.; Song, T.; Wu, G.; Wang, L. MixFormerV2: Efficient Fully Transformer Tracking. Adv. Neural Inf. Process. Syst. 2023, 36, 58736–58751. [Google Scholar]
  42. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. AiATrack: Attention in Attention for Transformer Visual Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 146–164. [Google Scholar]
  43. Pauwels, K.; Kragic, D. SimTrack: A Simulation-Based Framework for Scalable Real-Time Object Pose Detection and Tracking. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 1300–1307. [Google Scholar]
  44. Yuan, X.; Xu, T.; Liu, X.; Wang, Y.; Qin, H.; Fang, Y.; Li, J. Multi-Step Temporal Modeling for UAV Tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7216–7230. [Google Scholar] [CrossRef]
  45. Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust Object Modeling for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9589–9600. [Google Scholar]
  46. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706. [Google Scholar]
  47. Zheng, Y.; Zhong, B.; Liang, Q.; Mo, Z.; Zhang, S.; Li, X. ODTrack: Online Dense Temporal Token Learning for Visual Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7588–7596. [Google Scholar]
  48. Xie, F.; Chu, L.; Li, J.; Lu, Y.; Ma, C. VideoTrack: Learning to Track Objects via Video Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22826–22835. [Google Scholar]
  49. Shi, L.; Zhong, B.; Liang, Q.; Li, N.; Zhang, S.; Li, X. Explicit Visual Prompts for Visual Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4838–4846. [Google Scholar]
  50. Cai, W.; Liu, Q.; Wang, Y. SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 16871–16881. [Google Scholar]
  51. Cai, W.; Liu, Q.; Wang, Y. HIPTrack: Visual Tracking with Historical Prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 19258–19267. [Google Scholar]
  52. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
  53. Huang, L.; Zhao, X.; Huang, K. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  54. Müller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
  55. Li, H.; Xu, K. Innovative Adaptive Edge Detection for Noisy Images Using Wavelet and Gaussian Method. Sci. Rep. 2025, 15, 5838. [Google Scholar] [CrossRef]
  56. Li, J.; Sun, S.; Wang, Y.; Zhang, J.; Zhuo, L. TSTrack: A Light-weight Transformer-based Spatiotemporal Feature Refinement Tracking Algorithm. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4707416. [Google Scholar] [CrossRef]
  57. Borsuk, V.; Vei, R.; Kupyn, O.; Martyniuk, T.; Krashenyi, I.; Matas, J. FEAR: Fast, Efficient, Accurate and Robust Visual Tracker. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 644–663. [Google Scholar]
  58. Blatter, P.; Kanakis, M.; Danelljan, M.; Van Gool, L. Efficient Visual Tracking with Exemplar Transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 1571–1581. [Google Scholar]
  59. Chen, X.; Kang, B.; Wang, D.; Li, D.; Lu, H. Efficient Visual Tracking via Hierarchical Cross-Attention Transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 461–477. [Google Scholar]
  60. Li, S.; Yang, Y.; Zeng, D.; Wang, X. Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13989–14000. [Google Scholar]
  61. Zhang, W.; Zhang, Y.; Liu, Y.; Wu, L.; Hu, X. REATA: An Efficient Vision Transformer Accelerator Featuring a Resource-Optimized Attention Design on Versal ACAP. ACM Trans. Reconfig. Technol. Syst. 2025. [Google Scholar] [CrossRef]
  62. Wang, J.; Li, X.; Chen, J.; Zhou, L.; Guo, L.; He, Z.; Zhou, H.; Zhang, Z. DPH-YOLOv8: Improved YOLOv8 Based on Double Prediction Heads for the UAV Image Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar]
  63. Huang, S.; Lin, C.; Jiang, X.; Qu, Z. BRSTD: Bio-Inspired Remote Sensing Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  64. Liu, X.; Zhang, C.; Huang, F.; Xia, S.; Wang, G.; Zhang, L. Vision Mamba: A Comprehensive Survey and Taxonomy. IEEE Trans. Neural Netw. Learn. Syst. Early Access. 2025. [Google Scholar] [CrossRef]
  65. Patro, B.N.; Agneeswaran, V.S. Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges. Eng. Appl. Artif. Intell. 2025, 159, 111279. [Google Scholar] [CrossRef]
  66. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2022. [Google Scholar]
  67. Gu, A.; Goel, K.; Gupta, A.; Ré, C. On the Parameterization and Initialization of Diagonal State Space Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 35971–35983. [Google Scholar]
  68. Smith, J.T.H.; Warrington, A.; Linderman, S. Simplified State Space Layers for Sequence Modeling. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  69. Gupta, A.; Gu, A.; Berant, J. Diagonal State Spaces Are as Effective as Structured State Spaces. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 22982–22994. [Google Scholar]
  70. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); 6-14 December 2021; Online 34, pp. 572–585. [Google Scholar]
  71. Fu, D.Y.; Dao, T.; Saab, K.K.; Thomas, A.W.; Rudra, A.; Ré, C. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  72. Poli, M.; Massaroli, S.; Nguyen, E.; Fu, D.Y.; Dao, T.; Baccus, S.; Bengio, Y.; Ermon, S.; Ré, C. Hyena Hierarchy: Towards Larger Convolutional Language Models. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 28043–28078. [Google Scholar]
  73. Tay, Y.; Dehghani, M.; Abnar, S.; Shen, Y.; Bahri, D.; Pham, P.; Rao, J.; Yang, L.; Ruder, S.; Metzler, D. Long Range Arena: A Benchmark for Efficient Transformers. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  74. Dao, T.; Gu, A. Transformers Are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  75. Lenz, B.; Lieber, O.; Arazi, A.; Bergman, A.; Manevich, A.; Peleg, B.; Aviram, B.; Almagor, C.; Fridman, C.; Padnos, D.; et al. Jamba: Hybrid Transformer-Mamba Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar]
  76. Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Biderman, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; et al. RWKV: Reinventing RNNs for the Transformer Era. arXiv 2023, arXiv:2305.13048. [Google Scholar] [CrossRef]
  77. Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify Mamba in Vision: A Linear Attention Perspective. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024; Volume 37, pp. 127181–127203. [Google Scholar]
  78. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  79. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024; Volume 37, pp. 103031–103063. [Google Scholar]
  80. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 12–22. [Google Scholar]
  81. Pei, X.; Huang, T.; Xu, C. EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, pp. 6443–6451. [Google Scholar]
  82. Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition. In Proceedings of the British Machine Vision Conference (BMVC), Glasgow, UK, 25–28 November 2024. [Google Scholar]
  83. Ma, C.; Wang, Z. Semi-Mamba-UNet: Pixel-Level Contrastive and Cross-Supervised Visual Mamba-Based UNet for Semi-Supervised Medical Image Segmentation. Knowl.-Based Syst. 2024, 300, 112203. [Google Scholar] [CrossRef]
  84. Ruan, J.; Li, J.; Xiang, S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. ACM Trans. Multimed. Comput. Commun. Appl. 2024. [Google Scholar] [CrossRef]
  85. Liu, J.; Yang, H.; Zhou, H.-Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-UMamba: Mamba-Based UNet with ImageNet-Based Pretraining. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Marrakesh, Morocco, 6–10 October 2024; pp. 615–625. [Google Scholar]
  86. Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing Long-Range Dependency for Biomedical Image Segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  87. Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. SegMamba: Long-Range Sequential Modeling Mamba for 3D Medical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Marrakesh, Morocco, 6–10 October 2024. [Google Scholar]
  88. Hatamizadeh, A.; Kautz, J. MambaVision: A Hybrid Mamba-Transformer Vision Backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 25261–25270. [Google Scholar]
  89. Long, J.; Zhang, Y.; Hu, S. A Visual Object Tracking Method Based on Historical Prompts of Mamba. Knowl.-Based Syst. 2025, 330, 114741. [Google Scholar] [CrossRef]
  90. Huang, J.; Wang, S.; Wang, S.; Wu, Z.; Wang, X.; Jiang, B. Mamba-FETrack: Frame-Event Tracking via State Space Model. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Urumqi, China, 18–20 October 2024; pp. 3–18. [Google Scholar]
  91. Lai, S.; Liu, C.; Zhu, J.; Kang, B.; Liu, Y.; Wang, D.; Lu, H. MambaVT: Spatio-Temporal Contextual Modeling for Robust RGB-T Tracking. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9312–9323. [Google Scholar] [CrossRef]
  92. Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. VideoMamba: State Space Model for Efficient Video Understanding. arXiv 2024, arXiv:2403.06977. [Google Scholar] [CrossRef]
  93. Arjovsky, M.; Shah, A.; Bengio, Y. Unitary Evolution Recurrent Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1120–1128. [Google Scholar]
  94. Wisdom, S.; Powers, T.; Hershey, J.; Le Roux, J.; Atlas, L. Full-Capacity Unitary Recurrent Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  95. Bonnabel, S. Stochastic Gradient Descent on Riemannian Manifolds. IEEE Trans. Autom. Control 2013, 58, 2217–2229. [Google Scholar] [CrossRef]
  96. Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van Gool, L.; Han, J.; et al. VisDrone-DET2021: The Vision Meets Drone Object Detection Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2847–2854. [Google Scholar]
  97. Guo, H.; Zheng, Y.; Zhang, Y.; Gao, Z.; Zhao, S. Global-Local MAV Detection under Challenging Conditions Based on Appearance and Motion. IEEE Trans. Intell. Transp. Syst. 2024, 25, 12005–12017. [Google Scholar] [CrossRef]
  98. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 18911–18920. [Google Scholar]
  99. Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 19300–19309. [Google Scholar]
Figure 1. Geometric interpretation of state space models. (a) HiPPO-based SSM: hidden states $h_t$ on the manifold are projected to the polynomial coefficient space $\mathbb{R}^N$ via HiPPO basis functions, then mapped to $h_{t+1}$. (b) LF-SSM (Ours): the tangent space $T_{h_t}\mathcal{M}$ at $h_t$ contains the velocity vector $v_t$, which is mapped to $h_{t+1}$ via the exponential map $\mathrm{Exp}_{h_t}(v_t)$.
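The two operations named in panel (b) have standard closed forms on the unit sphere, written out here as a worked equation. This is the textbook formulation and not necessarily the exact parameterization used in LF-SSM; the symbol $u_t$ (an unconstrained, input-driven update direction) is an assumed name introduced for illustration.

```latex
% Tangent space projection at h_t (with \|h_t\| = 1): remove the component of u_t along h_t
v_t = \left(I - h_t h_t^{\top}\right) u_t , \qquad \langle h_t, v_t \rangle = 0

% Exponential map on the unit sphere: move along the great circle through h_t in direction v_t
h_{t+1} = \mathrm{Exp}_{h_t}(v_t)
        = \cos\!\left(\|v_t\|\right) h_t + \sin\!\left(\|v_t\|\right) \frac{v_t}{\|v_t\|}

% The state stays on the sphere because h_t \perp v_t:
\|h_{t+1}\|^2 = \cos^2\!\left(\|v_t\|\right) + \sin^2\!\left(\|v_t\|\right) = 1
```

For $v_t = 0$ the map reduces to $h_{t+1} = h_t$, so an implementation would typically guard the division by $\|v_t\|$.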
Figure 2. Overall architecture of LF-SSM. Our model comprises patch embedding layers, stacked GSM Blocks for geodesic state evolution, and a prediction head for bounding box regression.
Figure 3. Architecture of the Geodesic State Module (GSM). The input features are expanded and processed through two branches: (1) state evolution branch that computes geodesic updates on the unit sphere, and (2) gating branch that modulates the output.
Figure 4. Qualitative comparison on four benchmarks. UAV123: uav3 (row 1) and truck4 (row 2) sequences. VisDrone: urban traffic scenes (rows 3–4). ARDMAV: phantom47 (row 5) and phantom79 (row 6) sequences. LaSOT: drone-1 (row 7) and drone-13 (row 8) sequences.
Figure 5. Feature visualization comparison. Top: class activation mapping. Bottom: feature response maps. LF-SSM produces more focused responses on the target.
Figure 6. Representative failure cases of LF-SSM. Left: During rapid scale variation in a LaSOT sequence, the predicted bounding box (red) fails to match the scale of the ground truth (green) due to loss of scale-sensitive information in normalized directional features. Right: In a UAV123 traffic scene with similar vehicles, the predicted bounding box exhibits noticeable spatial drift from the ground truth location due to accumulated errors when the target appearance is similar to surrounding background regions.
Table 1. Performance comparison on four tracking benchmarks. AUC, P, and $P_{Norm}$ denote success rate, precision, and normalized precision, respectively. The best and second-best results are in bold and underline.

| Method | UAV123 AUC | UAV123 P | VisDrone AUC | VisDrone P | ARDMAV AUC | ARDMAV P | LaSOT AUC | LaSOT $P_{Norm}$ | LaSOT P |
|---|---|---|---|---|---|---|---|---|---|
| SeqTrack [13] | 68.5 | 88.2 | 64.2 | 83.5 | 61.5 | 79.8 | 72.5 | 81.5 | 79.3 |
| VideoTrack [48] | 67.2 | 86.8 | 62.8 | 81.5 | 59.8 | 77.5 | 70.2 | 81.9 | 76.4 |
| ARTrackV2 [98] | 69.8 | 89.5 | 65.8 | 85.2 | 63.5 | 81.8 | 73.6 | 82.8 | 81.1 |
| ODTrack [47] | 70.2 | 89.8 | 66.2 | 85.5 | 64.2 | 82.5 | 74.0 | 84.2 | 82.3 |
| AQATrack [99] | 68.2 | 87.5 | 63.5 | 82.8 | 60.8 | 78.5 | 72.7 | 82.9 | 80.2 |
| EVPTrack [49] | 68.8 | 88.2 | 64.5 | 83.8 | 61.8 | 79.8 | 72.7 | 82.9 | 80.3 |
| MixFormer [40] | 69.5 | 88.9 | 65.5 | 84.5 | 63.2 | 81.2 | 72.4 | 82.2 | 80.1 |
| SPMTrack [50] | 72.8 | 91.8 | 68.5 | 87.8 | 66.5 | 84.8 | 77.4 | 86.6 | 85.0 |
| HIPTrack [51] | 68.5 | 87.8 | 64.2 | 83.2 | 61.2 | 79.2 | 72.7 | 82.9 | 79.5 |
| Mamba-FETrack [90] | 68.2 | 88.5 | 64.5 | 83.2 | 61.8 | 80.2 | 71.5 | 80.8 | 77.5 |
| MambaVT [91] | 67.8 | 87.5 | 63.8 | 82.5 | 60.5 | 78.8 | 70.8 | 80.2 | 76.8 |
| MambaVision [88] | 68.5 | 88.2 | 64.2 | 83.5 | 61.5 | 79.5 | 71.2 | 80.5 | 77.2 |
| VideoMamba [92] | 68.8 | 88.5 | 64.8 | 83.8 | 62.2 | 80.5 | 71.8 | 81.2 | 78.2 |
| LF-SSM-S | 69.5 | 89.2 | 65.2 | 84.5 | 62.8 | 81.2 | 72.8 | 82.5 | 79.8 |
| LF-SSM-M | 71.2 | 90.8 | 67.5 | 86.8 | 65.5 | 84.2 | 74.5 | 84.2 | 82.5 |
| LF-SSM-L | 73.2 | 92.5 | 69.2 | 88.5 | 67.8 | 86.2 | 75.8 | 85.5 | 84.2 |
Table 2. Efficiency comparison. FPS is measured on RTX 4090 and Jetson Orin Nano platforms. Parameter-matched SSM variants (-S) are configured as follows: Mamba-FETrack-S uses 8 blocks with hidden dimension 256; MambaVT-S uses 6 blocks with hidden dimension 288; VideoMamba-S uses 8 blocks with hidden dimension 256; S4D-S uses 6 blocks with hidden dimension 256. The best and second-best results are in bold and underline.

| Method | Params (M) | GFLOPs | FPS (RTX 4090) | FPS (Orin Nano) |
|---|---|---|---|---|
| SeqTrack [13] | 89 | 148 | 35 | 12 |
| VideoTrack [48] | 96 | 165 | 32 | 10 |
| ARTrackV2 [98] | 93 | 52 | 180 | 45 |
| ODTrack [47] | 92 | 73 | 95 | 28 |
| AQATrack [99] | 97 | 86 | 85 | 24 |
| EVPTrack [49] | 94 | 72 | 90 | 26 |
| MixFormer [40] | 196 | 245 | 25 | 8 |
| SPMTrack [50] | 105 | 78 | 155 | 38 |
| HIPTrack [51] | 98 | 92 | 105 | 28 |
| Mamba-FETrack [90] | 78 | 62 | 120 | 32 |
| MambaVT [91] | 82 | 58 | 135 | 35 |
| MambaVision [88] | 75 | 55 | 145 | 38 |
| VideoMamba [92] | 85 | 65 | 125 | 33 |
| Mamba-FETrack-S | 18.8 | 14.2 | 185 | 48 |
| MambaVT-S | 19.2 | 15.5 | 172 | 45 |
| VideoMamba-S | 18.5 | 13.8 | 168 | 42 |
| S4D-S | 18.6 | 13.5 | 205 | 55 |
| LF-SSM-S | 18.5 | 12.8 | 320 | 69 |
| LF-SSM-M | 35.2 | 24.5 | 265 | 52 |
| LF-SSM-L | 52.8 | 38.2 | 210 | 41 |
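Table 3. Ablation study of GSM components on four benchmarks. Bold values indicate the best results.

| No. | TSP | ISS | Gate | UAV123 AUC | UAV123 P | VisDrone AUC | VisDrone P | ARDMAV AUC | ARDMAV P | LaSOT AUC | LaSOT $P_{Norm}$ | LaSOT P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | | | | 68.5 | 87.8 | 64.5 | 83.2 | 62.2 | 80.5 | 71.2 | 80.5 | 77.2 |
| #2 | | | | 70.2 | 89.2 | 65.8 | 84.5 | 63.8 | 82.2 | 72.5 | 82.2 | 79.5 |
| #3 | | | | 69.5 | 88.5 | 65.2 | 84.2 | 63.2 | 81.5 | 72.2 | 81.5 | 78.8 |
| #4 | | | | 69.8 | 88.8 | 65.5 | 84.5 | 63.5 | 81.8 | 72.5 | 81.8 | 79.2 |
| #5 | | | | 71.5 | 90.5 | 67.2 | 86.2 | 65.5 | 84.2 | 74.2 | 83.8 | 81.8 |
| #6 | | | | 71.2 | 90.2 | 66.8 | 85.8 | 65.2 | 83.8 | 73.8 | 83.5 | 81.2 |
| #7 | | | | 70.8 | 89.8 | 66.5 | 85.5 | 64.8 | 83.5 | 73.5 | 83.2 | 80.8 |
| Ours | | | | 73.2 | 92.5 | 69.2 | 88.5 | 67.8 | 86.2 | 75.8 | 85.5 | 84.2 |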
Table 4. Comparison with different SSM variants on UAV123. FPS is measured on Jetson Orin Nano platform. Bold values indicate the best results.

| SSM Variant | AUC | P | Params (M) | FPS |
|---|---|---|---|---|
| S4 [66] | 67.5 | 87.2 | 45.2 | 25 |
| S4D [67] | 68.2 | 87.8 | 42.8 | 32 |
| S5 [68] | 68.8 | 88.2 | 48.5 | 22 |
| Mamba [19] | 69.5 | 89.2 | 52.8 | 38 |
| Euclidean + Norm | 70.2 | 89.8 | 52.8 | 68 |
| Linear-RNN (Orthogonal) | 68.5 | 87.8 | 52.5 | 32 |
| GRU | 70.2 | 89.5 | 53.2 | 28 |
| LF-SSM (Ours) | 73.2 | 92.5 | 52.8 | 69 |
Table 5. Controlled ablation isolating geodesic state evolution contribution. All variants share identical gating, convolution, bidirectional processing, and residual connections. Bold values indicate the best results.

| State Update Mechanism | AUC | P | Δ AUC |
|---|---|---|---|
| Euclidean Linear ($h_t = A h_{t-1} + B u_t$) | 68.5 | 87.8 | |
| Diagonal SSM (S4D-style) | 69.2 | 88.5 | +0.7 |
| Euclidean + Normalization | 70.2 | 89.8 | +1.7 |
| Geodesic (Ours) | 73.2 | 92.5 | +4.7 |
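The difference between the last two rows can be illustrated with a small numerical sketch. It assumes that "Euclidean + Normalization" means a Euclidean step followed by projection back onto the unit sphere, and that the geodesic update is the standard exponential map; neither reading is confirmed by the table, and the function names and sample vectors are hypothetical.

```python
import math

def normalize(x):
    """Scale x to unit Euclidean length."""
    n = math.sqrt(sum(c * c for c in x))
    return [c / n for c in x]

def retract_by_normalization(h, v):
    """Euclidean step then renormalization (assumed reading of 'Euclidean + Normalization')."""
    return normalize([hi + vi for hi, vi in zip(h, v)])

def geodesic_step(h, v):
    """Exponential map on the unit sphere; assumes v is tangent at h and nonzero."""
    s = math.sqrt(sum(c * c for c in v))
    return [math.cos(s) * hi + math.sin(s) * vi / s for hi, vi in zip(h, v)]

def angle_between(a, b):
    """Angle in radians between two unit vectors."""
    d = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, d)))

h = [1.0, 0.0]   # current state on the unit circle
v = [0.0, 1.0]   # tangent vector at h with length 1 (hypothetical values)

angle_retract = angle_between(h, retract_by_normalization(h, v))  # atan(|v|), about 0.785 rad
angle_geodesic = angle_between(h, geodesic_step(h, v))            # |v| = 1.0 rad
print(angle_retract, angle_geodesic)
```

Both updates keep the state on the sphere, but under these assumptions the normalized Euclidean step rotates the state by atan(‖v‖) while the geodesic step rotates it by exactly ‖v‖, so only the latter preserves the length of the tangent update as arc length.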
Table 6. Per-layer latency breakdown on Jetson Orin Nano. State dimension $N = 64$, sequence length $T = 320$ tokens. Bold values indicate the best results.

| Operation | GSM (ms) | Mamba (ms) | S4D (ms) |
|---|---|---|---|
| Input Projection $B\tilde{x}_t$ | 0.35 | 0.38 | 0.36 |
| Tangent Proj./Discretization | 0.08 | 0.42 | 0.35 |
| State Update/Selective Scan | 0.22 | 1.38 | 0.55 |
| Output Projection $C h_t$ | 0.53 | 0.58 | 0.62 |
| Trigonometric/Exponential Ops | 0.02 | 0.04 | 0.02 |
| Total | 1.20 | 2.80 | 1.90 |
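The totals in Table 6 can be cross-checked with a few lines of arithmetic. The sketch below copies the per-operation latencies from the table and derives the per-layer ratio relative to GSM; the variable names are illustrative only.

```python
# Per-layer latencies in milliseconds, copied from Table 6 (rows in table order).
latency_ms = {
    "GSM":   [0.35, 0.08, 0.22, 0.53, 0.02],
    "Mamba": [0.38, 0.42, 1.38, 0.58, 0.04],
    "S4D":   [0.36, 0.35, 0.55, 0.62, 0.02],
}

# Sum each column and round to the table's two-decimal precision.
totals = {name: round(sum(parts), 2) for name, parts in latency_ms.items()}

# Ratio of each method's per-layer latency to the GSM latency.
ratio_vs_gsm = {name: round(total / totals["GSM"], 2) for name, total in totals.items()}

print(totals)        # {'GSM': 1.2, 'Mamba': 2.8, 'S4D': 1.9}
print(ratio_vs_gsm)  # {'GSM': 1.0, 'Mamba': 2.33, 'S4D': 1.58}
```

The column sums reproduce the reported totals. Most of the gap to Mamba comes from the State Update/Selective Scan row (1.38 ms versus 0.22 ms).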
Table 7. Quantitative feature analysis on UAV123 with matched parameters (52.8 M). ↑ indicates higher is better, ↓ indicates lower is better. Bold values indicate the best results.

| Method | TRR ↑ | Entropy ↓ | Gradient ↑ |
|---|---|---|---|
| Linear-RNN (Orthogonal) | 2.92 | 5.05 | 0.45 |
| S4D | 2.89 | 4.82 | 0.48 |
| GRU | 3.05 | 4.42 | 0.55 |
| Mamba | 3.15 | 4.58 | 0.52 |
| LF-SSM (Ours) | 4.82 | 3.21 | 0.89 |
Table 8. Sensitivity analysis of state dimension $N$ on UAV123. FPS is measured on RTX 4090. Bold values indicate the best results.

| State Dimension $N$ | AUC | P | Params (M) | FPS |
|---|---|---|---|---|
| $N = 16$ | 70.2 | 89.5 | 48.5 | 265 |
| $N = 32$ | 71.8 | 91.2 | 50.2 | 248 |
| $N = 64$ | 73.2 | 92.5 | 52.8 | 210 |
| $N = 128$ | 73.5 | 92.8 | 58.5 | 175 |
| $N = 256$ | 73.6 | 92.9 | 72.2 | 138 |
Table 9. Step size sensitivity analysis on UAV123. FPS is measured on RTX 4090. Bold values indicate the best results.

| Step Size Configuration | AUC | P | Params (M) | FPS |
|---|---|---|---|---|
| Fixed $\Delta = 0.01$ | 71.5 | 90.8 | 52.8 | 215 |
| Fixed $\Delta = 0.05$ | 72.3 | 91.5 | 52.8 | 212 |
| Fixed $\Delta = 0.10$ | 72.8 | 91.8 | 52.8 | 210 |
| Fixed $\Delta = 0.20$ | 71.8 | 90.5 | 52.8 | 208 |
| Fixed $\Delta = 0.50$ | 69.2 | 88.2 | 52.8 | 205 |
| Adaptive (Ours) | 73.2 | 92.5 | 52.8 | 210 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, T.; Xu, X.; Qiu, S.; Sheng, C.; Wang, D.; Tian, H.; Yu, J. LF-SSM: Lightweight HiPPO-Free State Space Model for Real-Time UAV Tracking. Drones 2026, 10, 102. https://doi.org/10.3390/drones10020102

AMA Style

Wang T, Xu X, Qiu S, Sheng C, Wang D, Tian H, Yu J. LF-SSM: Lightweight HiPPO-Free State Space Model for Real-Time UAV Tracking. Drones. 2026; 10(2):102. https://doi.org/10.3390/drones10020102

Chicago/Turabian Style

Wang, Tianyu, Xinghua Xu, Shaohua Qiu, Changchong Sheng, Di Wang, Hui Tian, and Jiawei Yu. 2026. "LF-SSM: Lightweight HiPPO-Free State Space Model for Real-Time UAV Tracking" Drones 10, no. 2: 102. https://doi.org/10.3390/drones10020102

APA Style

Wang, T., Xu, X., Qiu, S., Sheng, C., Wang, D., Tian, H., & Yu, J. (2026). LF-SSM: Lightweight HiPPO-Free State Space Model for Real-Time UAV Tracking. Drones, 10(2), 102. https://doi.org/10.3390/drones10020102
