SAMViTrack: A Search-Region Adaptive Mamba-ViT Tracker for Real-Time UAV Tracking

Xiaoyu Guo; Yian Li; Hao Zhang; Xucheng Wang; Dan Zeng; Feixiang He; Shuiwang Li

doi:10.3390/s25247454

,

and

¹

College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China

²

Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin 541004, China

³

School of Computer Science, Fudan University, Shanghai 200082, China

⁴

School of Artificial Intelligence, Sun Yat-sen University, Zhuhai 519000, China

Sensors2025, 25(24), 7454;https://doi.org/10.3390/s25247454

This article belongs to the Section Navigation and Positioning

Version Notes

Order Reprints

Abstract

Achieving fast and robust object tracking is critical for real-time Unmanned Aerial Vehicle (UAV) applications, where targets often move unpredictably and environmental conditions can rapidly change. In this paper, we propose the Search-Region Adaptive Mamba-ViT Tracker (SAMViTrack), a novel framework that combines the efficiency of Mamba attention with the powerful feature extraction capabilities of Vision Transformer (ViT). Our tracker dynamically adjusts the search region based on the target’s motion and environmental context, ensuring precise tracking even under challenging conditions such as occlusions, fast motion, and scale variations. By integrating an adaptive search mechanism, our SAMViTrack significantly reduces computational overhead without compromising accuracy, making it suitable for real-time deployment on UAVs with limited onboard resources. Extensive experiments on benchmark datasets demonstrate that our method outperforms both traditional and modern trackers, achieving superior accuracy and robustness with improved efficiency. The proposed tracker sets a new baseline, especially by combining Mamba and ViT, for UAV tracking by offering a balance between speed, accuracy, and adaptability in dynamic environments.

Keywords:

Mamba attention; Vision Transformer; object tracking; UAV applications; adaptive search region

1. Introduction

The code for SAMViTrack is publicly available at https://github.com/open-at-25/SAMViTrack (accessed on 4 December 2025).

Unmanned Aerial Vehicles (UAVs) have become essential tools in a variety of applications, including surveillance [1], search and rescue [2], environmental monitoring [3], and package delivery [4]. A core challenge in these applications is achieving fast and reliable object tracking in real-time, as UAVs often operate in dynamic and unpredictable environments. Robust UAV tracking systems must handle rapid target movement, scale variations, occlusions, and environmental changes while maintaining high computational efficiency to meet real-time requirements [5,6,7,8].

Traditional tracking algorithms commonly use fixed search regions to locate a target in successive video frames. In these methods, a predefined area around the previously detected target is used as the search region to predict the target’s position in the next frame. While this approach is computationally efficient and straightforward to implement, it suffers from significant limitations in dynamic environments. One of the key issues with fixed search regions is their lack of adaptability to changes in the target’s movement or appearance. When the target moves unpredictably—such as sudden accelerations, sharp turns, or changes in direction—a static search region may fail to encompass the target’s new location [9,10]. This often results in tracking drift, where the tracker gradually loses sight of the target and begins to follow background objects instead [11,12,13]. Additionally, fixed search regions are vulnerable to environmental changes, such as occlusions, lighting variations, or cluttered backgrounds [14,15,16]. In real-world UAV tracking scenarios, targets can be partially or fully obscured by obstacles, making it difficult for a fixed search region to capture the target’s exact position. Furthermore, targets may change in size or shape due to variations in camera angles or distance, which a static search region cannot accommodate effectively. Recent advancements in deep learning-based trackers have improved tracking accuracy by leveraging powerful feature extraction models such as convolutional neural networks (CNNs) and transformers [5,17,18]. Among these, the Vision Transformer (ViT) [19] has shown significant promise due to its ability to capture both local and global context. However, the computational overhead of ViT due to the quadratic complexity of its attention mechanism makes it challenging for real-time UAV applications with limited onboard resources.

To address these challenges, we propose the Search-Region Adaptive Mamba-ViT Tracker (SAMViTrack), a novel framework designed to achieve fast and robust UAV tracking in real-world scenarios. The proposed tracker combines the efficiency of Mamba [20], a lightweight attention mechanism, with the powerful feature extraction capabilities of ViT to deliver both accuracy and speed. Combining Mamba and ViT leverages their complementary strengths to create an efficient and powerful tracking system. Mamba provides a lightweight attention mechanism that reduces computational costs, ensuring fast processing for real-time applications without sacrificing accuracy [21,22,23,24]. On the other hand, ViT excels at extracting rich, high-level features by capturing global dependencies across input data, which is essential for accurate tracking in complex scenarios [25,26]. The synergy between Mamba’s efficiency and ViT’s robust feature extraction strikes a balance between speed and accuracy, enhancing the tracker’s performance in dynamic environments. Together, they enable a system that is both fast and accurate, making it ideal for real-time tracking tasks. Another key innovation in our method is the adaptive search region mechanism, which dynamically adjusts the search area based on the target’s motion and environmental context. This adaptive mechanism allows the tracker to focus on the most relevant areas, reducing unnecessary computations and improving tracking performance in challenging conditions.

The proposed SAMViTrack framework addresses two critical needs in UAV tracking: (1) maintaining high tracking accuracy under diverse environmental conditions and (2) reducing computational complexity for real-time deployment. By integrating Mamba’s fast attention mechanism, SAMViTrack optimizes the traditional multi-head self-attention used in ViT, making it more efficient for resource-constrained UAV systems. Additionally, the adaptive search region mechanism enhances the tracker’s robustness by preventing target drift and improving recovery from occlusions and fast target movement. We evaluate our method on several benchmark datasets across various UAV tracking scenarios. The results demonstrate that SAMViTrack achieves superior performance compared to both traditional trackers and state-of-the-art deep learning models, with significant improvements in tracking accuracy and computational efficiency. As shown in Figure 1, our tracker achieves a superior balance between accuracy and efficiency, outperforming the leading tracker Aba-ViTrack by over 150 frames per second (FPS) while maintaining comparable precision. This makes it well-suited for deployment on UAV platforms with constrained computational resources. The main contributions of this paper are summarized as follows:

Figure 1. Compared with state-of-the-art UAV tracking algorithms on VisDrone2018, our SAMViTrack performs with 0.84 precision and still runs efficiently at approximately 340 FPS.

A lightweight hybrid backbone integrating Mamba and ViT. We design a compact Mamba–ViT hybrid backbone that leverages Mamba’s efficient sequence modeling and ViT’s strong spatial representation. Unlike pure-Mamba or pure-VT variants, our hybrid design achieves a balanced trade-off between accuracy and computational overhead, making it suitable for UAV platforms with limited resources.
A plug-and-play adaptive search-region mechanism. We introduce a search-region adaptive (SA) module that adjusts the size of the search area based solely on the relative velocity of the target and is activated only during inference. Unlike existing adaptive search mechanisms, such as Aba-ViTrack, which relies on auxiliary appearance cues and online regression, our method does not require additional network branches, online optimization, or extra model parameters. This makes the SA module architecture-agnostic, easily attachable to various trackers, and free of training-time cost.
Extensive validation demonstrating strong generalization. Although designed for UAV tracking, the proposed SA module provides consistent improvements when integrated into different state-of-the-art trackers (e.g., OSTrack, AQATrack, HIPTrack) and evaluated on both UAV and general tracking benchmarks. These results confirm that the mechanism is not restricted to UAV viewpoints but serves as a broadly applicable dynamic search strategy.

3. Method

In this section, we introduce the proposed end-to-end tracking framework, SAMViTrack. This framework is designed to achieve efficient and accurate object tracking by combining the strengths of the Mamba model and ViT for feature representation, along with a mechanism that dynamically adjusts the search region size. The framework consists of three main components: the Search-Region Adaptive Network, the Feature Extraction Network, and the Head Network. An overview of the model is illustrated in Figure 2.

Figure 2. Overview of the proposed SAMViTrack framework. It consists of a Search-Region Adaptive Network, a Feature Extraction Network, and a Head Network. Note that the Search-Region Adaptive module is exclusively active during inference.

3.1. Overview

As illustrated in Figure 2, the proposed SAMViTrack framework adopts a single-stream architecture, which consists of a Search-Region Adaptive network, a backbone network, dubbed Mamba-ViT, based on Vision Mamba and Vision Transformer (ViT), and a tracking head. The framework takes a pair of images as input: the template image

Z \in R^{3 \times H_{z} \times W_{z}}

and the search image

X \in R^{3 \times H_{x} \times W_{x}}

. These images are divided and flattened into sequences of patches with resolution

P \times P

, resulting in

P_{z} = H_{z} \times W_{z} / P^{2}

patches for Z and

P_{x} = H_{x} \times W_{x} / P^{2}

patches for X. The features extracted from the Mamba-ViT backbone are then fed into the tracking head to generate the final tracking results. The Search-Region Adaptive Module is designed to dynamically adjust the size of the search region based on the target’s current relative speed. After the input template and search images undergo Patch Embedding, the resulting tokens are fed into the Feature Extraction Network, which consists of multiple Mamba and ViT layers. The integration of Mamba and ViT layers will be thoroughly examined in the Section 4.

3.2. Hybrid Mamba-ViT for Feature Representation

The proposed hybrid architecture leverages Mamba’s adaptive attention for efficiency and ViT’s global feature modeling to enhance representation power. This balance between efficiency and richness is essential for real-time UAV tracking. Our experiments show that placing ViT blocks after Mamba blocks yields better performance. More intricate combinations, such as alternating the stacking order, are beyond the scope of this work and will be explored in future studies. The details of the Mamba and ViT layers are provided below.

Vision Mamba for Tracking: In our proposed framework, the Vision Mamba module plays a crucial role in feature extraction for tracking tasks. Given the template image Z and the search image X, we first transform them into a sequence of one-dimensional tokens through a patch embedding process facilitated by a trainable linear projection layer. This results in

K

tokens, mathematically represented as:

t_{1 : K}^{0} = E (Z, X) \in R^{K \times E},

(1)

where E denotes the embedding dimension of each token. These input tokens

t_{1 : K}^{0}

are then fed into the encoding layer, where they undergo processing through

L_{m}

stacked layers of bidirectional Vision Mamba (Vim) encoders [51]. Let

m^{l}

represent the Mamba block at layer

l ⩽ L_{m}

, which processes all tokens from layer

(l - 1)

via

t_{1 : K}^{l} = m^{l} (t_{1 : K}^{l - 1}) + t_{1 : K}^{l - 1}

. Specifically, the input

t_{1 : K}^{l - 1}

is initially normalized and subsequently processed through two distinct linear projection layers to yield intermediate features V and Q:

\begin{matrix} V = {Linear}^{v} (Norm (t_{1 : K}^{l - 1})), \\ Q = {Linear}^{q} (Norm (t_{1 : K}^{l - 1})) . \end{matrix}

(2)

Although both are derived from the same input, they are generated via distinct linear projections (

{Linear}^{v}

and

{Linear}^{q}

), ensuring different parameterizations. In the Mamba block,

Q

acts as the input to the selection mechanism (determining content-dependent state transitions), while

V

provides the content that is processed by the SSM.

Next, V is processed in both forward and backward directions. The State Space Model (SSM) [20] captures the dynamic variations within a sequence by mapping the input sequence onto a latent state space. In each direction, a 1D convolution followed by a Sigmoid Linear Unit (SiLU) activation function [52] is applied to produce

V^{'}

:

\begin{matrix} V_{o} = SSM (SiLU (Conv 1 d (V))), \\ V_{o}^{'} = V_{o} ⊙ SiLU (V), \\ Y = Linear (V_{forward}^{'}) + Linear (V_{backward}^{'}), \end{matrix}

(3)

where the subscript o indicates the two scan orientations. Bidirectional scanning facilitates interactions among all elements within the sequence, establishing a global and unconstrained receptive field. The output of the last layer

m^{L_{m}}

is fed into the ViT encoder for further processing. Here, ⊙ denotes element-wise multiplication between matrices of the same dimension.

Vision Transformer for Tracking: The Vision Transformer (ViT) layer also serves as a key component for feature extraction and relationship modeling in our framework. The output tokens

t_{1 : K}^{L_{m}}

from the Mamba encoder are subsequently passed into the ViT encoder, which is composed of

L_{v}

ViT layers. Specifically, each ViT layer consists of two main modules: Multi-Head Self-Attention and FFN. The Multi-Head Self-Attention mechanism is implemented as follows:

Attention (Q, K, V) = softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}}}) V .

(4)

Here, Q, K, and V represent the Query, Key, and Value matrices obtained through linear transformations, respectively, and

d_{k}

denotes the feature dimension of each attention head. The Multi-Head mechanism enhances the model’s representational capacity by partitioning the feature space into multiple subspaces:

MultiHead (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{M}) W^{O},

(5)

where M represents the number of attention heads, and

W^{O}

is a learned projection matrix that combines the outputs of all attention heads into a unified representation. This mechanism allows the model to capture diverse relationships and dependencies within the input data, significantly improving its ability to handle complex patterns. The Feed-Forward Network (FFN) further processes the features, defined as:

FFN (x) = max (0, W_{1} x + b_{1}) W_{2} + b_{2},

(6)

where x represents the input feature vector,

W_{1}

and

W_{2}

are weight matrices,

b_{1}

and

b_{2}

are bias terms, and

\max (\cdot, \cdot)

is the ReLU activation function that introduces non-linearity. The FFN first transforms the input features into a higher-dimensional space and then projects them back, enabling the model to learn and capture complex patterns effectively. The output of each sublayer (self-attention and FFN) is normalized using Layer Normalization and combined with the input via residual connections:

LayerNorm (x + Sublayer (x))

. By stacking multiple ViT layers, the model progressively optimizes the extraction of target features and suppression of background noise, thereby achieving robust object tracking in complex scenarios.

3.3. Search-Region Adaptive Module

The proposed Search-Region Adaptive Module dynamically adjusts the size of the search region based on the target’s motion to improve tracking accuracy and optimize computational efficiency. The workflow of the module is illustrated in Figure 3. Let the estimated location of the target at t frame be denoted by

{\hat{p}}_{t}

. Then the relative speed of the target at

t - 1

can be estimated by

v_{rel} (t - 1) = \frac{| {\hat{p}}_{t - 1} - {\hat{p}}_{t - 2} |}{\sqrt{H_{t - 1} \times W_{t - 1}}},

(7)

where

(H_{t - 1}, W_{t - 1})

denotes the predicted target size at

t - 1

. Larger objects might move slower relative to their size, while smaller objects might have a higher relative speed despite moving at the same pixel speed. This calculation provides a measure of the target’s speed relative to its size, offering insights into how fast the object is moving in relation to the area it occupies in the frame. Intuitively, the search region should be smaller when the target moves slower and larger when the target moves faster. However, traditional trackers typically use a fixed scale factor, denoted by

s_{0}

, to determine the search region size without tacking this into account. In this paper, we extend this approach by introducing an adaptive scale factor,

s (t)

, which adjusts based on the relative speed of the target, and is formulated by

s (t) = s_{0} + Δ s (t)

, where

Δ s (t) = b \times \tanh (a (v_{rel} (t - 1) - c)),

(8)

where

tanh (\cdot)

denotes the hyperbolic tangent function, and a, b, and c are predefined constants that regulate the sensitivity and magnitude of the scale adjustment. Specifically, the parameter a controls how strongly the scale variation responds to changes in relative velocity, with larger values producing more pronounced adjustments. The parameter b defines the maximum allowable change by constraining

Δ s

to the bounded interval

[- b, b]

, thereby preventing excessively large or small search regions. The parameter c serves as an offset that shifts the activation threshold of the adjustment, allowing the mechanism to better adapt to the motion characteristics of different targets. While this formulation models motion dynamics using a simplified relative-speed descriptor, such a design allows the adaptive mechanism to remain lightweight, inference-only, and fully plug-and-play, avoiding the additional computational overhead that more complex motion predictors typically introduce.

Figure 3. Illustration of the Search-Region Adaptive Module. The module takes two consecutive frames as input. Using relative velocity, the module computes the adaptive scaling factor (s) which is used to generate the adaptive search region.

We emphasize that the efficiency gains of our method primarily arise from reducing the computational overhead of image resizing, rather than accelerating the model’s inference process. Since resizing is performed on every frame, its cumulative cost becomes non-negligible. The computational complexity of image resizing depends on both the interpolation method and the dimensions of the output image, and is typically linear in the number of output pixels. For instance, the bilinear interpolation used in this work has a complexity of

O (H^{'} \times W^{'})

, where

H^{'}

and

W^{'}

are the height and width of the resized image. However, because

H^{'}

and

W^{'}

are fixed input dimensions, the runtime efficiency is largely influenced by the overhead of data access, which in turn depends on the size of the search image prior to resizing. In the following, we provide a rationale for how our proposed method improves efficiency by adaptively adjusting this size. Denote the size of the search image prior to resizing as a two-dimensional discrete random variable

H \times W

. Assuming the data access overhead is proportional to this size, the associated expected computational complexity of resizing the search image in our tracker can be expressed as:

\begin{matrix} E [O (H \times W)] = O (E (H \times W)) \\ = O (E (H) \times E (W) + C o v (H, W)), \end{matrix}

(9)

where

E

is the expected value operator,

C o v

means covariance. It is evident that reducing

E (H)

,

E (W)

, or

C o v (H, W)

, leads to a decrease in the expected computational complexity of the resizing operation. Existing methods typically use a fixed scale factor, commonly set to 4, which represents the ratio between the search area and the target area, to determine the size of the search image. In contrast, we adapt this factor by generalizing it into a function of the target’s relative speed,

v_{r e l}

. When

v_{r e l}

falls below a predefined threshold c (as defined in Equation (8)), the scale factor is reduced, thereby decreasing

E (H)

,

E (W)

, which leads to improved computational efficiency. Conversely, when

v_{r e l}

exceeds c, the scale factor increases, expanding the search region and enhancing robustness against occlusions and rapid target motion—thus improving tracking accuracy. Experimental results demonstrate that, by appropriately setting the parameters of the scale factor function, our method is able to achieve simultaneous improvements in both efficiency and accuracy.

3.4. Tracking Head and Loss Function

In alignment with OSTrack [53], we implement a center-based head composed of multiple Conv-BN-ReLU layers to directly estimate the target’s bounding box. The head outputs local offsets to correct for discretization errors caused by resolution reduction, normalized bounding box sizes, and an object classification score map. The position with the highest classification score is selected as the object’s location, resulting in the final bounding box for the object. During training, we adopt the weighted focal loss for classification, a combination of

L_{1}

loss and Generalized Intersection over Union (GIoU) loss for bounding box regression. The total loss function is defined as follows:

L_{total} = L_{cls} + λ_{iou} L_{iou} + λ_{L_{1}} L_{L_{1}} .

(10)

where the trade-off parameters are set as

λ_{iou} = 2

and

λ_{L_{1}}

= 5 in our experiments.

Although the parameters a, b, and c are set empirically, their roles are simple and intuitive, and they follow common practice in lightweight adaptive mechanisms. The parameters mainly serve to keep the scale updates stable and predictable, preventing the search region from changing too quickly or excessively. Since the search-region adjustment is applied only during inference, learning these parameters through training would introduce additional complexity without providing clear benefits. Using fixed values therefore matches the design goal of the SA module, which is intended to be lightweight, plug-and-play, and independent of model retraining.

3.5. Parameter Sensitivity Observation

To further address concerns about the manually set parameters in the adaptive scaling function, we conducted a preliminary sensitivity study on the three parameters

(a, b, c)

across six evaluation settings (AQATrack–GOT10k, AQATrack–OTB, HIPTrack–GOT10k, HIPTrack–OTB, OSTrack–GOT10k, and OSTrack–OTB). All performance scores were normalized via min–max normalization to allow consistent comparison.

Overall, the tracker demonstrates stable behavior across a broad parameter range, and the performance curves vary smoothly without abrupt degradation. Importantly, all six model–dataset combinations exhibit consistent trends.

Parameter a. It controls the responsiveness of the hyperbolic tangent in Equation (8). Extremely large values (e.g.,

a > 40

) cause over-reactive scale changes and reduce stability. Across all six evaluation settings, the optimal region consistently lies within

a \in [10, 25]

, where the adaptive behavior remains both smooth and effective.

Parameter b. This parameter determines the maximum deviation from the base scale. The sensitivity curves show that nearly all models reach their best performance within a narrow and stable range of

b \in [0.5, 0.8]

. This robustness suggests that b can be treated as a fixed constant in practice, without requiring per-model tuning, which greatly reduces hyperparameter dependency.

Parameter c. It functions as the relative-motion threshold. All six evaluation settings reveal an optimal region of

c \in [0.9, 1.2]

. When c becomes too large (e.g.,

c \geq 1.8

), the scale adjustment mechanism becomes overly conservative, leading to noticeable performance degradation.

These observations indicate that the proposed formulation is sufficiently robust for inference-only deployment, as performance varies smoothly and remains stable throughout practical parameter ranges. Nevertheless, obtaining a more comprehensive and theoretically unified understanding would require a substantially larger experimental grid across diverse target categories, motion patterns, and UAV environments—an exploration that falls beyond the scope of this work. Future research may incorporate a lightweight learnable module to automatically optimize

(a, b, c)

during training, providing a more principled and data-driven formulation.

3.6. Summary and Motivation for Adaptive Search

The adaptive search mechanism is motivated by the observation that the computational cost of transformer-based trackers is strongly influenced by the spatial size of the search region. A larger search area increases the number of tokens entering the backbone, thereby raising FLOPs, memory usage, and inference latency, while an overly small search region increases the risk of target drift. Our design aims to balance these two factors by dynamically adjusting the search region based on the estimated relative motion. This mechanism requires no retraining and incurs negligible overhead, making it suitable for real-time UAV tracking. The following section evaluates how this adaptive strategy affects tracking accuracy, robustness, and efficiency across several benchmarks.

4. Experiments

Before presenting the experimental results, we briefly outline how the evaluations relate to the design considerations introduced in Section 3. Since the proposed approach incorporates a lightweight hybrid backbone together with an adaptive search-region mechanism applied only during inference, the experiments are designed to examine the effectiveness of the backbone design, the influence of the adaptive mechanism on both accuracy and efficiency, and its ability to generalize across different trackers and datasets.

In this section, we conduct extensive experiments on four UAV tracking benchmarks: DTB70 [54], UAV123 [55], VisDrone2018 [56], UAVDT [57], and WebUAV3M [58]. All evaluation experiments are conducted on a PC equipped with i9-10850K processor (3.6 GHz), 16GB RAM and an NVIDIA TitanX GPU. To validate our approach, we compare it with state-of-the-art trackers, which are categorized into lightweight and deep learning-based models based on their architecture and computational needs.

4.1. Implementation Details

Model. We adopt the proposed Mamba-ViT as the backbone, which consists of a three-layer Mamba encoder and a three-layer ViT encoder for consideration of efficiency. The search-region adaptive module is applied only during inference, where it dynamically adjusts the search area based on the target’s relative velocity. During training, however, no dynamic resizing is used; a fixed search region is employed to maintain stable optimization and ensure compatibility with the OSTrack loss design. The head of our tracker consists of a stack of four Conv-BN-ReLU layers. The sizes of the template and search region are set to 128 × 128 and 256 × 256.

Training. We train the model using splits from the GOT-10k [59], LaSOT [60], COCO [61], and TrackingNet [62] datasets. The batch size is set to 32. For optimization, we use the AdamW optimizer [63] with a weight decay of

10^{- 4}

and an initial learning rate of

4 \times 10^{- 5}

for the backbone. The model is trained for 300 epochs, processing 60,000 image pairs per epoch, and the learning rate is reduced by a factor of 10 after 240 epochs. The hidden dimension of the model is set to 192, and the number of attention heads is set to 3.

Inference. During inference, we apply a Hanning window penalty to incorporate positional priors, as is common in tracking practices [64]. Specifically, the classification map is multiplied by a Hanning window of the same size, and the bounding box with the highest score after multiplication is selected as the final tracking result.

4.2. Comparison with Lightweight Trackers

The evaluation results of our trackers and the competing lightweight trackers are presented in Table 1. As shown, our SAMViTrack demonstrate superior performance among all these trackers in terms of average (Avg.) precision (Prec.), success rate (Succ.) and speeds. On average, RACF achieves the highest precision (72.3%) and success rate (49.2%) among DCF-based trackers. DRCI attains the highest precision at 77.5%, while UDAT leads with the highest success rate of 58.1% among CNN-based trackers. Our SAMViTrack achieved the highest speed, surpassing the second-fastest DRCI by 2.8% in Prec. and 3.2% in Succ. AVTrack-SA and Aba-ViTrack-SA, integrated with the Search-Region Adaptive module (SA), enhance Prec. by 1.2% and 1.0%, respectively, while improving efficiency. Remarkably, Aba-ViTrack-SA achieves the highest Prec. (82.7%) and Succ. (62.8%), while AVTrack-SA ranks second in Prec. and third in Succ. This improvement in both tracking accuracy and efficiency can be attributed to the proposed Search-Region Adaptive module, which effectively narrows the search area when the target moves slowly, minimizing background interference and reducing computational overhead. It is worth noting that SAMViTrack achieves a GPU speed 1.7 times faster and a CPU speed 1.3 times faster than Aba-ViTrack-SA, showcasing its outstanding performance. This demonstrates an excellent balance between tracking accuracy and efficiency, highlighting the advantages of our method and confirming its effectiveness in delivering SOTA performance in UAV tracking.

Table 1. Comparison of precision (Prec.), success rate (Succ.), speed (FPS), FLOPs and Param between SAMViTrack and lightweight trackers on DTB70, UAVDT, VisDrone2018, UAV123 and WebUAV-3M. Red, blue, and green signify the first, second, and third places. Please note that the percent symbol (%) is excluded for Prec. and Succ. values.

4.3. Comparison with Deep Trackers

The proposed SAMViTrack is also compared with 14 deep trackers in Table 2, showing the Precision (Prec.), Success (Succ.), and GPU speed on the UAVDT dataset. SAMViTrack stands out by achieving the highest Prec. and the fastest GPU speed, demonstrating its competitiveness in both accuracy and speed. Notably, our method outperforms LiteTrack (second in Prec.) by 0.4% and SMAT (third in Prec.) by 1.2%, while being more than twice as fast in GPU speed. Although HIPTrack achieves the highest Succ., SAMViTrack is nearly ten times faster. Furthermore, while MixFormerV2 ranks second in speed, our method surpasses it by 24.2% in Prec. and 15.9% in Succ., further highlighting its superior balance of accuracy and efficiency.

Table 2. Precision (Prec.), success (Succ.), and GPU speed comparison between SAMViTrack and DL-based tracker on UAVDT. Red, blue, and green signify the first, second, and third places.

4.4. Attribute-Based Evaluation

To evaluate the effectiveness of the adaptive search region mechanism in addressing target fast motion, we conducted a comparative analysis of Aba-ViTrack-SA, AVTrack-SA, and SAMViTrack against 20 state-of-the-art (SOTA) trackers using the fast motion subset of the UAV123 dataset. Note that we also assessed SAMViTrack without applying the proposed method for the adaptive search region mechanism, denoted as MViTrack, for reference. The precision plot is illustrated in Figure 4. The results reveal that Aba-ViTrack-SA achieves the highest precision of 85.8%, ranking first among all evaluated trackers. Notably, the integration of the Search-Region Adaptive (SA) module improves precision by 2.4% compared to the original Aba-ViTrack. Similarly, AVTrack-SA and SAMViTrack demonstrate precision improvements of 1.6% and 0.6%, respectively, over their baseline versions. These findings highlight the crucial role of the adaptive search region mechanism in improving tracking performance by preventing target drift and enhancing recovery from occlusions and rapid target movement, especially in fast-motion scenarios.

Figure 4. Attribute -based comparison on the fast motion subset of VisDrone2018. Note that MViTrack refers to SAMViTrack without utilizing the proposed adaptive search region mechanism.

4.5. Ablation Study

Impact of Mamba-ViT configurations and the Search-Region Adaptive module (SA). We conduct a thorough evaluation of both the backbone architecture configurations (i.e., the number of Mamba and ViT layers) and the effectiveness of our proposed Search-Region Adaptive (SA) mechanism on the UAVDT. Since there is no consensus on what constitutes the optimal balance between accuracy and efficiency, we define a measure, dubb

I_{b}

, to quantify the trade-off between these factors, which is based on the Efficacy Coefficient Method [88] and defined as follows:

I_{b} (P, S) = ρ Ecoef (P, P_{m i n}) + (1 - ρ) Ecoef (S, S_{m i n}),

(11)

where P and S denote the Prec. and the GPU speed, respectively,

P_{m a x}

and

P_{m i n}

are the maximum and minimum values of the data for normalization,

Ecoef (P, P_{m i n}) = 0.1 + \frac{0.9 (P - P_{m i n})}{P_{m a x} - P_{m i n}}

defines the Efficacy,

ρ

is the weight to trade-off precision and efficiency, which is set to 0.7 empirically. The experimental results are presented in Table 3. As can be seen, the Mamba-3/ViT-3 setting achieves an optimal balance between accuracy and efficiency, reaching the highest

I_{b}

of 0.89, with a precision of 82.1%, a success rate of 58.1%, and a real-time inference speed of 346.2 FPS. Remarkably, the effectiveness of our proposed SA module is evidenced by consistent performance improvements across all configurations (denoted by ✓ in the SA column), with an average increase of 3.7% in precision and 1.9% in success rate while maintaining real-time processing capabilities. And the integration of SA with the Mamba-3/ViT-3 configuration also demonstrates superior performance, with gains of 2.8%, 1.3%, and 2.7% in Prec., Succ., and speed, respectively. These findings support our design for the SAMViTrack framework and highlight the effectiveness of our Search-Region Adaptive mechanism.

Table 3. Impact of Mamba-ViT configurations and Search-Region Adaptive (SA) on the tracking performance of SAMViTrack on the UAVDT benchmark. Here, ✓ indicates the addition of the SA module, and ↑ shows the improvement; for FPS, it represents the percentage increase relative to the original value.

In addition, to empirically validate that our SA effectively reduces the average size of the search image before resizing, we record the search dimensions at each frame of the DTB70 dataset, both with and without the application of SA. We then compare the changes using two-dimensional histograms of the original and adjusted search regions. As illustrated in Figure 5, the left panel shows the distribution of width and height for the original search regions, while the right panel displays the distribution after applying our SA. As observed, the adjusted search area exhibits a more concentrated distribution of width and height, indicating reduced covariance between the two dimensions. In fact, our method decreases both the average width and height. These findings qualitatively demonstrate that SA effectively reduces the average search image size before resizing, thereby lowering resizing overhead and validating its efficiency.

Figure 5. Comparison of two-dimensional histograms between the original and the adjusted search image.

Evaluation of Mamba-ViT Configurations. In this experiment, we assessed the impact of various combinations of Mamba and ViT versions on the performance of deep learning models. The results in Table 4 indicate the configuration with the first three layers being Mamba and the last three layers being ViT demonstrated superior performance, achieving an accuracy of 82.1% and a success rate of 58.1%, along with the highest

I_{b}

of 1.00. This configuration’s superior performance may stem from the Mamba layer’s effectiveness in capturing local features and the ViT layer’s proficiency in learning global features. The sequential processing of features from local to global may facilitate the ViT layer’s more effective integration of local features for global feature learning, thereby potentially enhancing the model’s overall representational capacity and generalization performance.

Table 4. Comparison of different Mamba and ViT combinations on the UAVDT benchmark. It presents Prec., Succ., FPS, and

I_{b}

of the model under various combinations of Mamba and ViT, with and without SA enabled. Here, ✓ indicates the addition of the SA module.

Furthermore, we added two pure baselines—6-layer Mamba and 6-layer ViT—for direct comparison. As shown in Table 4, both pure configurations underperform the hybrid models: pure Mamba lacks global modeling capacity, while pure ViT is less robust and less efficient. These results further confirm the complementary roles of Mamba and ViT and demonstrate that the hybrid design provides a more balanced and effective representation than using either architecture alone.

Application to SOTA Trackers. We evaluate the generalization ability of Search-Region Adaptative module (SA) by applying it to three state-of-the-art (SOTA) trackers: AQATrack [89], OSTrack-384 [53], and HIPTrack [79], on the GOT-10k [59], and OTB [90] datasets. As shown in Table 5, the SA module consistently improves tracking performance across all trackers. Notably, OSTrack experiences a significant boost with a 1.8% increase in SR0.75 on GOT-10k, while HIPTrack achieves a substantial gain of 1.3% in SR0.75. On the OTB dataset, all trackers show improvements of over 1.0% in Succ. and 0.8% in Prec. Additionally, the integration of SA consistently enhances GPU speeds. These results highlight the generalizability of the proposed SA module. While the module was initially developed for UAV tracking, its consistent improvements on GOT-10k and OTB show that it is not restricted to UAV-specific viewpoints and can serve as a broadly applicable dynamic search strategy across different tracking paradigms. Although the SA module is initially designed for UAV tracking tasks, our experimental results demonstrate its effectiveness in other tracking scenarios as well. However, due to limitations in time and resources, this study focuses solely on the application of UAV tracking for the time being.

Table 5. Evaluation fany of the generalizability of our SA by applying them to three SOTA trackers. The evaluation is performed on GOT-10k and OTB. Here, ✓ indicates the addition of the SA module.

Discussion on the 3 + 3 Configuration. As shown in Table 3 and Table 4, the configuration with the first three layers as Mamba and the last three layers as ViT achieves the most balanced performance. A plausible explanation is that the Mamba layers, which are efficient in modeling local dependencies, provide compact early representations, while the subsequent ViT layers benefit from these refined features to capture broader global context. This sequential progression from local to global information appears to form a stable and efficient feature hierarchy.

We note that this observation is based on the currently evaluated combinations. A full exploration of all possible layer permutations would require a much larger search space and extensive computational cost, which is beyond the scope of this work. Nevertheless, the existing results already indicate that moderately interleaving the two modules leads to stable performance without showing strong contradictory patterns. More systematic exploration of alternative hybridization strategies will be investigated in future work.

4.6. Computational Complexity

We provide a brief analysis of the computational cost of the proposed components. The hybrid Mamba–ViT backbone consists of three Mamba layers and three ViT layers. Mamba layers operate with linear complexity

O (L d)

, while ViT layers follow the standard quadratic self-attention complexity

O (L^{2} d)

, where L denotes the number of tokens and d is the embedding dimension. Replacing part of the transformer blocks with linear Mamba layers reduces the overall computation compared with a pure ViT backbone of comparable depth.

Impact on Feature Extraction Load. Although the adaptive search mechanism modifies the spatial crop size before feature extraction, it does not introduce additional branches or parameters. After cropping, the search region is resized to a fixed resolution and therefore fed into the backbone with an identical spatial size during inference. As a result, the FLOPs of the feature extraction network remain unchanged regardless of the expansion or contraction of the search region. The computational savings arise primarily from reducing unnecessary scaling operations when the target exhibits slow or smooth motion, rather than from altering the backbone complexity. The Search-Region Adaptive (SA) module introduces negligible overhead, as it updates the search size using only a scalar function defined in Equation (8), which has constant-time complexity

O (1)

and does not require additional feature extraction or attention computation. Overall, the tracker maintains a favorable computational profile, fully aligning with the empirical runtime improvements observed in Section 4.

4.7. Qualitative Results

Figure 6 shows two example sequences from DTB70 [54] (Gull2 and RcCar4) using adaptive search regions based on the target’s relative speed with the proposed SAMViTrack method. The red boxes highlight the target location, while the yellow and cyan boxes indicate the traditional fixed search region and our adaptively adjusted search region, respectively. The darkblue and tangerine curves represent the relative speed of the target and the corresponding adaptive search factor s produced by our method, where the search factor directly controls the degree to which the search region expands or contracts according to the target’s motion. The frames are marked with ellipses in the curves to indicate their time stamps. As shown, our method effectively expands the search area during fast target motion and contracts it during slower movements, successfully tracking both objects. This further supports the effectiveness of our Search-Region Adaptive (SA) mechanism in UAV tracking.

Figure 6. Illustration of the Search-Region Adaptive (SA) mechanism. The x-axis represents the frame index of a video sequence. The green-circled points correspond to the specific frames from which the visual examples (shown above) are taken. The red boxes indicate the true target location, the cyan boxes denote the fixed search region used in conventional methods, and the yellow boxes represent the adaptively adjusted search region.

Figure 7 compares the tracking results of SAMViTrack with seven state-of-the-art trackers in three challenging scenarios. Our approach demonstrates remarkable robustness by maintaining accurate tracking across all test cases, while other methods encounter difficulties. Specifically, SAMViTrack excels in handling complex situations such as background clutter (e.g., Horse2 and So302), full occlusion (e.g., uav0000353_01127_s), and significant viewpoint changes. The consistent performance of SAMViTrack in these demanding conditions further highlights the effectiveness of our proposed UAV tracking framework.

Figure 7. Qualitative comparison on three video sequences, respectively, from the datasets DTB70, VisDrone2018, and UAVDT (i.e., Horse2, uav0000353_01127_s, and So302). Each number indicates which frame of the video is displayed.

5. Discussion

5.1. Limitations and Design Trade-Offs

Although SAMViTrack demonstrates strong performance across multiple UAV benchmarks, several limitations remain that merit attention.

Handling Abrupt Motion and Extreme Conditions. The proposed adaptive search mechanism is designed as a plug-and-play module that can be integrated into existing trackers without retraining. It adjusts the search region size based on the estimated relative motion from the previous frame,

v_{rel} (t - 1)

, enabling the region to contract during slow motion and expand when the target moves rapidly. However, because the update relies on motion cues from the preceding frame, this design introduces an inherent one-frame response delay. While such delay is negligible under smooth motion, it may affect responsiveness when the target undergoes abrupt acceleration or abrupt direction changes. In these cases, the contracted search region may not expand quickly enough, potentially causing temporary tracking failure if the target jumps outside the reduced region. Although the bounded tanh-based scaling stabilizes the update by constraining

Δ s

within

[- b, b]

, this also limits the maximum expansion rate and therefore the ability to handle extreme motion variations or momentary disappearances. These failure risks become more pronounced under severe motion blur or full occlusion, where short-term motion predictions become unreliable.

Multi-Target Scenarios. SAMViTrack follows the Single-Object Tracking (SOT) paradigm, where the tracker maintains a single adaptive search region for the target being tracked. As a result, the current formulation does not incorporate inter-target reasoning mechanisms, such as occlusion handling or cross-target interaction modeling. While a straightforward multi-target application (processing instances independently) is computationally feasible, the lack of these reasoning components limits the tracker’s ability to handle the complex, interacting behaviors and occlusions commonly observed in dense multi-UAV environments.

Complex Motion Modeling. One limitation of our current design is the coarse binary categorization of target motion (fast vs. slow). This simplification, intended to preserve the lightweight and plug-and-play characteristics of the adaptive search mechanism, sacrifices long-horizon, intent-aware prediction for computational efficiency.

Our approach is inherently reactive, using only the single-frame relative velocity

v_{rel} (t - 1)

as the dynamic cue, resulting in a constant-time

O (1)

overhead for search adjustments. Unlike Trajectron++ [91] or Social-LSTM [92], which explicitly model multi-dimensional latent states and social interaction cues, our mechanism avoids recurrent inference or multimodal trajectory prediction. Similarly, unlike interactive decision frameworks [48,49,50], which estimate continuous aggressiveness parameters, asymmetric preferences, and uncertainty in social intent, our formulation reduces dynamics to a single reactive scalar. This deliberate trade-off limits predictive expressiveness but ensures real-time efficiency and plug-and-play usability on resource-constrained UAV platforms, without requiring additional training, deep network inference, or complex state-estimation solvers.

5.2. Future Directions

Integrating Anticipatory Dynamics. The limitations of the current reactive mechanism suggest a promising future direction: incorporating anticipatory dynamics inspired by specific socially interactive and game-theoretic models. For instance, integrating the asymmetric agent preference modeling approach from Socially Game-Theoretic Lane-Change [49] could enable intent-aware acceleration scaling, allowing the search region to expand preemptively in response to inferred adversarial maneuvers. Likewise, leveraging uncertainty quantification techniques from probabilistic forecasting frameworks such as Trajectron++ [91] or models for reducing social preference uncertainty [50] could allow the search region to adapt based on predicted outcome variance, making the mechanism risk-aware. Integrating these concrete strategies into the adaptive scaling function may enhance robustness under complex, interactive UAV scenarios while aiming to retain computational efficiency.

Multi-Target Extension. Extending SAMViTrack to handle multiple targets represents a promising direction for future work. Such an extension could explore strategies for maintaining independent search regions and adaptive scaling for each target, while leveraging the Mamba-ViT backbone for robust feature association and identity maintenance. Incorporating inter-target reasoning mechanisms, such as occlusion handling and cross-target interactions, may further enhance tracking robustness in complex multi-UAV environments.

Training with UAV-Specific Data. While SAMViTrack is trained on general large-scale tracking datasets (e.g., GOT-10k, LaSOT), these datasets exhibit viewpoint distributions that differ from typical UAV scenarios. UAV-specific datasets such as WebUAV-3M provide richer aerial-view variations and could potentially further improve viewpoint robustness for both the backbone network and the adaptive search module. Conducting such full-scale training or fine-tuning requires substantial computational effort, and is therefore left as an important direction for future work.

6. Conclusions

In this paper, we presented SAMViTrack, a novel framework that successfully combines Vision Mamba with Vision Transformers for real-time UAV tracking for the first time. By introducing an adaptive search region mechanism and integrating Mamba’s efficient attention computation and ViT’s strong feature extraction capabilities, we effectively addressed the dual challenges of maintaining high tracking accuracy while reducing computational overhead for real-time UAV applications. The dynamic search region adjustment enables robust tracking under challenging conditions such as occlusions, fast motion, and varying scales. Moreover, the modular nature of our approach allows for easy integration with existing tracking systems. Extensive experiments on multiple benchmark datasets show that our method delivers state-of-the-art performance while maintaining real-time processing capabilities on resource-limited UAV platforms. We believe SAMViTrack represents a significant step forward in real-time UAV tracking where computational efficiency is crucial and provides a solid foundation for future research in this domain.

Author Contributions

Conceptualization, X.G., Y.L., H.Z., D.Z., F.H. and S.L.; Data curation, X.G., Y.L. and X.W.; Formal analysis, D.Z., F.H. and S.L.; Funding acquisition, S.L.; Methodology, X.G., Y.L., H.Z., X.W. and S.L.; Visualization, X.G. and H.Z.; Validation, X.G., Y.L., H.Z. and X.W.; Writing—original draft, X.G., H.Z., X.W. and S.L.; Writing—review & editing, X.G., D.Z., F.H. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62466013), Key Technologies R&D Program (Grant No. Guike AB23026004), and the Guangxi Natural Science Foundation (Grant No. 2024GXNSFAA010484).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fang, Z.; Savkin, A.V. Strategies for Optimized UAV Surveillance in Various Tasks and Scenarios: A Review. Drones 2024, 8, 193. [Google Scholar] [CrossRef]
Zhang, C.; Zhou, W.; Qin, W.; Tang, W. A novel UAV path planning approach: Heuristic crossing search and rescue optimization algorithm. Expert Syst. Appl. 2023, 215, 119243. [Google Scholar] [CrossRef]
Yang, Z.; Yu, X.; Dedman, S.; Rosso, M.; Zhu, J.; Yang, J.; Xia, Y.; Tian, Y.; Zhang, G.; Wang, J. UAV remote sensing applications in marine monitoring: Knowledge visualization and review. Sci. Total Environ. 2022, 838, 155939. [Google Scholar] [CrossRef] [PubMed]
Saponi, M.; Borboni, A.; Adamini, R.; Faglia, R.; Amici, C. Embedded payload solutions in UAVs for medium and small package delivery. Machines 2022, 10, 737. [Google Scholar] [CrossRef]
Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. TCTrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808. [Google Scholar]
Wang, X.; Zeng, D.; Zhao, Q.; Li, S. Rank-based filter pruning for real-time uav tracking. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
Li, S.; Yang, Y.; Zeng, D.; Wang, X. Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13943–13954. [Google Scholar]
Liu, M.; Wang, Y.; Sun, Q.; Li, S. Global Filter Pruning with Self-Attention for Real-Time UAV Tracking. In Proceedings of the British Machine Vision Conference, London, UK, 21–24 November 2022. [Google Scholar]
Howard, W.W.; Martone, A.F.; Buehrer, R.M. Timely target tracking: Distributed updating in cognitive radar networks. IEEE Trans. Radar Syst. 2024, 2, 318–332. [Google Scholar] [CrossRef]
Brenner, E.; de la Malla, C.; Smeets, J.B. Tapping on a target: Dealing with uncertainty about its position and motion. Exp. Brain Res. 2023, 241, 81–104. [Google Scholar] [CrossRef]
Zeng, K.; You, Y.; Shen, T.; Wang, Q.; Tao, Z.; Wang, Z.; Liu, Q. NCT: Noise-control multi-object tracking. Complex Intell. Syst. 2023, 9, 4331–4347. [Google Scholar] [CrossRef]
Sun, L.; Chang, J.; Zhang, J.; Fan, B.; He, Z. Adaptive image dehazing and object tracking in UAV videos based on the template updating Siamese network. IEEE Sens. J. 2023, 23, 12320–12333. [Google Scholar] [CrossRef]
Ma, Y.; He, J.; Yang, D.; Zhang, T.; Wu, F. Adaptive part mining for robust visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11443–11457. [Google Scholar] [CrossRef] [PubMed]
Gao, X.; Lin, X.; Lin, F.; Huang, H. Segmentation Point Simultaneous Localization and Mapping: A Stereo Vision Simultaneous Localization and Mapping Method for Unmanned Surface Vehicles in Nearshore Environments. Electronics 2024, 13, 3106. [Google Scholar] [CrossRef]
Chen, K.; Wang, L.; Wu, H.; Wu, C.; Liao, Y.; Chen, Y.; Wang, H.; Yan, J.; Lin, J.; He, J. Background-Aware Correlation Filter for Object Tracking with Deep CNN Features. Eng. Lett. 2024, 32, 1353–1363. [Google Scholar]
Jiang, S.; Cui, R.; Wei, R.; Fu, Z.; Hong, Z.; Feng, G. Tracking by segmentation with future motion estimation applied to person-following robots. Front. Neurorobot. 2023, 17, 1255085. [Google Scholar] [CrossRef]
Kumie, G.A.; Habtie, M.A.; Ayall, T.A.; Zhou, C.; Liu, H.; Seid, A.M.; Erbad, A. Dual-attention network for view-invariant action recognition. Complex Intell. Syst. 2024, 10, 305–321. [Google Scholar] [CrossRef]
Gao, L.; Ji, Y.; Gedamu, K.; Zhu, X.; Xu, X.; Shen, H.T. View-invariant human action recognition via view transformation network (VTN). IEEE Trans. Multimed. 2021, 24, 4493–4503. [Google Scholar] [CrossRef]
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Rahman, M.M.; Tutul, A.A.; Nath, A.; Laishram, L.; Jung, S.K.; Hammond, T. Mamba in vision: A comprehensive survey of techniques and applications. arXiv 2024, arXiv:2410.03105. [Google Scholar] [CrossRef]
Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A survey on visual mamba. Appl. Sci. 2024, 14, 5683. [Google Scholar] [CrossRef]
Heidari, M.; Kolahi, S.G.; Karimijafarbigloo, S.; Azad, B.; Bozorgpour, A.; Hatami, S.; Azad, R.; Diba, A.; Bagci, U.; Merhof, D.; et al. Computation-Efficient Era: A Comprehensive Survey of State Space Models in Medical Image Analysis. arXiv 2024, arXiv:2406.03430. [Google Scholar] [CrossRef]
Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
Amir, S.; Gandelsman, Y.; Bagon, S.; Dekel, T. Deep vit features as dense visual descriptors. arXiv 2021, arXiv:2112.05814. [Google Scholar]
Amir, S.; Gandelsman, Y.; Bagon, S.; Dekel, T. On the effectiveness of vit features as local semantic descriptors. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 39–55. [Google Scholar]
Li, Y.; Fu, C.; Ding, F.; Huang, Z.; Lu, G. AutoTrack: Towards High-Performance Visual Tracking for UAV With Automatic Spatio-Temporal Regularization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11920–11929. [Google Scholar]
Li, S.; Liu, Y.; Zhao, Q.; Feng, Z. Learning residue-aware correlation filters and refining scale for real-time UAV tracking. Pattern Recognit. 2022, 127, 108614. [Google Scholar] [CrossRef]
Huang, Z.; Fu, C.; Li, Y.; Lin, F.; Lu, P. Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2891–2900. [Google Scholar]
Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. HiFT: Hierarchical Feature Transformer for Aerial Tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15457–15466. [Google Scholar]
Wu, W.; Zhong, P.; Li, S. Fisher Pruning for Real-Time UAV Tracking. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–7. [Google Scholar]
Xie, F.; Wang, C.; Wang, G.; Yang, W.; Zeng, W. Learning Tracking Representations via Dual-Branch Fully Transformer Networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual, 11–17 October 2021; pp. 2688–2697. [Google Scholar]
Cui, Y.; Song, T.; Wu, G.; Wang, L. Mixformerv2: Efficient fully transformer tracking. Adv. Neural Inf. Process. Syst. 2024, 36, 58736–58751. [Google Scholar]
Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXII; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357. [Google Scholar]
Xie, F.; Wang, C.; Wang, G.; Cao, Y.; Yang, W.; Zeng, W. Correlation-Aware Deep Tracking. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8741–8750. [Google Scholar]
Wu, Y.; Li, Y.; Liu, M.; Wang, X.; Yang, X.; Ye, H.; Zeng, D.; Zhao, Q.; Li, S. Learning Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking. In Proceedings of the Forty-first International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
Wu, Y.; Wang, X.; Zeng, D.; Ye, H.; Xie, X.; Zhao, Q.; Li, S. Learning motion blur robust vision transformers with dynamic early exit for real-time UAV tracking. arXiv 2024, arXiv:2407.05383. [Google Scholar] [CrossRef]
Yang, X.; Zeng, D.; Wang, X.; Wu, Y.; Ye, H.; Zhao, Q.; Li, S. Adaptively bypassing vision transformer blocks for efficient visual tracking. Pattern Recognit. 2025, 161, 111278. [Google Scholar] [CrossRef]
Aminifar, F.; Rahmatian, F. Unmanned aerial vehicles in modern power systems: Technologies, use cases, outlooks, and challenges. IEEE Electrif. Mag. 2020, 8, 107–116. [Google Scholar] [CrossRef]
Chai, R.; Guo, Y.; Zuo, Z.; Chen, K.; Shin, H.S.; Tsourdos, A. Cooperative motion planning and control for aerial-ground autonomous systems: Methods and applications. Prog. Aerosp. Sci. 2024, 146, 101005. [Google Scholar] [CrossRef]
Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147. [Google Scholar] [CrossRef]
Zhu, J.; Chen, X.; Diao, H.; Li, S.; He, J.Y.; Li, C.; Luo, B.; Wang, D.; Lu, H. Exploring Dynamic Transformer for Efficient Object Tracking. arXiv 2024, arXiv:2403.17651. [Google Scholar] [CrossRef]
Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
Yeom, S.K.; Kim, T.H. UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices. arXiv 2024, arXiv:2412.02344. [Google Scholar] [CrossRef]
Wu, Y.; Wang, Z.; Lu, W.D. PIM GPT a hybrid process in memory accelerator for autoregressive transformers. npj Unconv. Comput. 2024, 1, 4. [Google Scholar] [CrossRef]
Ruan, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-umamba: Mamba-based unet with imagenet-based pretraining. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 7–11 October 2024; Springer: Cham, Switzerland, 2024; pp. 615–625. [Google Scholar]
Crosato, L.; Tian, K.; Shum, H.P.; Ho, E.S.; Wang, Y.; Wei, C. Social Interaction-Aware Dynamical Models and Decision-Making for Autonomous Vehicles. Adv. Intell. Syst. 2024, 6, 2300575. [Google Scholar] [CrossRef]
Hu, W.; Deng, Z.; Yang, Y.; Zhang, P.; Cao, K.; Chu, D.; Zhang, B.; Cao, D. Socially Game-Theoretic Lane-Change for Autonomous Heavy Vehicle based on Asymmetric Driving Aggressiveness. IEEE Trans. Veh. Technol. 2025, 74, 17005–17018. [Google Scholar] [CrossRef]
Deng, Z.; Hu, W.; Sun, C.; Chu, D.; Huang, T.; Li, W.; Yu, C.; Pirani, M.; Cao, D.; Khajepour, A. Eliminating uncertainty of driver’s social preferences for lane change decision-making in realistic simulation environment. IEEE Trans. Intell. Transp. Syst. 2024, 26, 1583–1597. [Google Scholar] [CrossRef]
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef] [PubMed]
Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
Li, S.; Yeung, D.Y. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Wu, H.; Nie, Q.; Cheng, H.; Liu, C.; et al. VisDrone-VDT2018: The Vision Meets Drone Video Detection and Tracking Challenge Results. In Proceedings of the ECCV Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
Du, D.; Qi, Y.; Yu, H.; Yang, Y.F.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 375–391. [Google Scholar]
Zhang, C.; Huang, G.; Liu, L.; Huang, S.; Yang, Y.; Wan, X.; Ge, S.; Tao, D. WebUAV-3M: A benchmark for unveiling the power of million-scale deep UAV tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 9186–9205. [Google Scholar] [CrossRef] [PubMed]
Huang, L.; Zhao, X.; Huang, K. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
Danelljan, M.; Hager, G.; Khan, F.S.; Felsberg, M. Discriminative Scale Space Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1561–1575. [Google Scholar] [CrossRef]
Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4844–4853. [Google Scholar]
Ye, J.; Fu, C.; Zheng, G.; Paudel, D.P.; Chen, G. Unsupervised domain adaptation for nighttime aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8896–8905. [Google Scholar]
Zeng, D.; Zou, M.; Wang, X.; Li, S. Towards Discriminative Representations with Contrastive Instances for Real-Time UAV Tracking. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1349–1354. [Google Scholar]
Yao, L.; Fu, C.; Li, S.; Zheng, G.; Ye, J. SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking. arXiv 2023, arXiv:2303.04378. [Google Scholar]
Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; Wang, Z. Abd-net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8351–8361. [Google Scholar]
Wei, Q.; Zeng, B.; Liu, J.; He, L.; Zeng, G. LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking. arXiv 2023, arXiv:2309.09249. [Google Scholar] [CrossRef]
Gopal, G.Y.; Amer, M.A. Separable self and mixed attention transformers for efficient object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6708–6717. [Google Scholar]
Zhao, H.; Wang, D.; Lu, H. Representation Learning for Visual Object Tracking by Masked Appearance Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18696–18705. [Google Scholar]
Zhu, J.; Tang, H.; Cheng, Z.Q.; He, J.Y.; Luo, B.; Qiu, S.; Li, S.; Lu, H. Dcpt: Darkness clue-prompted tracking in nighttime uavs. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 7381–7388. [Google Scholar]
Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706. [Google Scholar]
Song, Z.; Yu, J.; Chen, Y.P.P.; Yang, W. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8791–8800. [Google Scholar]
Cai, W.; Liu, Q.; Wang, Y. HIPTrack: Visual Tracking with Historical Prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
Wu, Q.; Yang, T.; Liu, Z.; Wu, B.; Shan, Y.; Chan, A.B. DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14561–14571. [Google Scholar]
Chen, B.; Li, P.; Bai, L.; Qiao, L.; Shen, Q.; Li, B.; Gan, W.; Wu, W.; Ouyang, W. Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
Shi, L.; Zhong, B.; Liang, Q.; Li, N.; Zhang, S.; Li, X. Explicit Visual Prompts for Visual Object Tracking. In Proceedings of the Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10428–10437. [Google Scholar]
Kou, Y.; Gao, J.; Li, B.; Wang, G.; Hu, W.; Wang, Y.; Li, L. ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking. Adv. Neural Inf. Process. Syst. 2023, 36, 50959–50977. [Google Scholar]
Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552. [Google Scholar]
Liu, Y.; Qin, P.; Fu, L. Research on the Application of Entropy Method and Efficiency Coefficient Method in Financial Risk Early Warning of Enterprises: Taking Shandong Longda Meishi Co., Ltd. as an Example. Front. Bus. Econ. Manag. 2024, 14, 254–259. [Google Scholar] [CrossRef]
Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive Queries for Adaptive Tracking with Spatio-TemporalTransformers. arXiv 2024, arXiv:2403.10574. [Google Scholar] [CrossRef]
Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 683–700. [Google Scholar]
Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Li, F.-F.; Savarese, S. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971. [Google Scholar]

Figure 1. Compared with state-of-the-art UAV tracking algorithms on VisDrone2018, our SAMViTrack performs with 0.84 precision and still runs efficiently at approximately 340 FPS.

Figure 2. Overview of the proposed SAMViTrack framework. It consists of a Search-Region Adaptive Network, a Feature Extraction Network, and a Head Network. Note that the Search-Region Adaptive module is exclusively active during inference.

Figure 3. Illustration of the Search-Region Adaptive Module. The module takes two consecutive frames as input. Using relative velocity, the module computes the adaptive scaling factor (s) which is used to generate the adaptive search region.

Figure 4. Attribute -based comparison on the fast motion subset of VisDrone2018. Note that MViTrack refers to SAMViTrack without utilizing the proposed adaptive search region mechanism.

Figure 5. Comparison of two-dimensional histograms between the original and the adjusted search image.

Figure 6. Illustration of the Search-Region Adaptive (SA) mechanism. The x-axis represents the frame index of a video sequence. The green-circled points correspond to the specific frames from which the visual examples (shown above) are taken. The red boxes indicate the true target location, the cyan boxes denote the fixed search region used in conventional methods, and the yellow boxes represent the adaptively adjusted search region.

Figure 7. Qualitative comparison on three video sequences, respectively, from the datasets DTB70, VisDrone2018, and UAVDT (i.e., Horse2, uav0000353_01127_s, and So302). Each number indicates which frame of the video is displayed.

Table 1. Comparison of precision (Prec.), success rate (Succ.), speed (FPS), FLOPs and Param between SAMViTrack and lightweight trackers on DTB70, UAVDT, VisDrone2018, UAV123 and WebUAV-3M. Red, blue, and green signify the first, second, and third places. Please note that the percent symbol (%) is excluded for Prec. and Succ. values.

Tracker		Source	DTB70		UAVDT		VisDrone2018		UAV123		WebUAV-3M		Avg.		Avg. FPS		FLOPs	Param
Tracker		Source	Prec.	Succ.	Prec.	Succ.	Prec.	Succ.	Prec.	Succ.	Prec.	Succ.	Prec.	Succ.	GPU	CPU	(GMac)	(M)
DCF-based	KCF [65]	TPAMI 15	46.8	28.0	57.1	29.0	68.5	41.3	52.3	33.1	39.8	21.6	52.9	30.6	-	468.5	-	-
	fDSST [66]	TPAMI 17	53.4	35.7	66.6	38.3	69.8	51.0	58.3	40.5	43.5	28.5	58.3	38.8	-	132.4	-	-
	ECO_HC [67]	CVPR 17	63.5	44.8	69.4	41.6	80.8	58.1	71.0	49.6	61.3	40.2	69.2	46.9	-	66.9	-	-
	MCCT_H [68]	CVPR 18	60.4	40.5	66.8	40.2	80.3	56.7	65.9	45.7	52.1	34.3	65.1	43.5	-	55.3	-	-
	AutoTrack [27]	CVPR 20	71.6	47.8	71.8	45.0	78.8	57.3	68.9	47.2	56.5	35.6	69.5	46.6	-	50.7	-	-
	RACF [28]	PR 20	72.6	50.5	77.3	49.4	83.4	60.0	70.2	47.7	58.0	38.4	72.3	49.2	-	33.2	-	-
CNN-based	HiFT [30]	ICCV 21	80.2	59.4	65.2	47.5	71.9	52.6	78.7	59.0	60.9	45.8	71.4	52.9	157.2	-	7.2	9.9
	UDAT [69]	CVPR 22	80.6	61.8	80.1	59.2	81.6	61.9	76.1	59.0	64.8	48.7	76.6	58.1	31.2	-	-	-
	TCTrack [5]	CVPR 22	81.2	62.2	72.5	53.0	79.9	59.4	80.0	60.5	61.9	45.7	75.1	56.2	132.7	-	8.8	9.7
	DRCI [70]	ICME 23	81.4	61.8	84.0	59.0	83.4	60.0	76.7	59.7	62.0	47.6	77.5	57.6	268.5	61.0	3.6	8.8
	SGDViT [71]	ICRA 23	78.5	60.4	65.7	48.0	72.1	52.1	75.4	57.5	61.3	45.7	70.6	52.7	107.3	-	11.3	23.3
	ABDNet [72]	RAL 23	76.8	59.6	75.5	55.3	75.0	57.2	79.3	60.7	63.9	48.7	74.1	56.3	119.2	-	-	-
Vit-based	Aba-ViTrack [7]	ICCV 23	85.9	66.4	82.9	59.1	86.1	65.3	86.4	66.4	67.4	53.7	81.7	62.2	172.1	46.3	2.4	8.0
	AVTrack [36]	ICML 24	84.3	65.0	82.1	58.7	86.0	65.3	84.8	66.8	69.0	54.0	81.2	61.9	252.8	53.8	0.97–1.9	3.5–7.9
	LiteTrack [73]	ICRA 24	82.5	63.9	81.6	59.3	79.7	61.4	84.2	65.9	69.4	54.1	79.5	60.9	132.6	-	14.1	54.92
	SMAT [74]	WACV 24	81.9	63.8	80.8	58.7	82.5	63.4	81.8	64.6	68.9	53.9	79.2	60.9	118.5	-	3.2	8.6
	Aba-ViTrack-SA		86.1	65.2	83.1	59.4	87.7	65.7	86.0	67.9	70.7	55.5	82.7	62.8	188.4	47.9	2.4	8.0
	AVTrack-SA		85.5	64.8	82.8	58.3	86.6	65.4	85.8	67.2	71.2	55.3	82.4	62.2	268.0	55.3	0.97–1.9	3.5–7.9
	SAMViTrack	Ours	83.3	63.8	82.1	58.1	84.1	63.6	81.9	64.1	70.2	54.6	80.3	60.8	314.1	61.6	1.1	4.0

Table 2. Precision (Prec.), success (Succ.), and GPU speed comparison between SAMViTrack and DL-based tracker on UAVDT. Red, blue, and green signify the first, second, and third places.

Tracker	Prec.	Succ.	FPS	Tracker	Prec.	Succ.	FPS	Tracker	Prec.	Succ.	FPS
SAMViTrack (Ours)	82.1	58.1	346.2	MixFormerV2 [33]	57.8	42.1	248.0	MAT [75]	72.9	54.6	71.2
DCPT [76]	76.8	56.8	41.2	ARTrack [77]	74.7	52.8	84.9	CSWinTT [78]	67.3	51.2	32.6
HIPTrack [79]	79.6	60.9	33.0	DropTrack [80]	78.8	57.0	142.8	SimTrack [81]	76.5	57.2	76.0
EVPTrack [82]	80.1	60.3	68.3	SeqTrack [83]	79.0	60.0	12.1	STARK [84]	72.0	53.6	53.3
ZoomTrack [85]	77.1	58.0	62.3	ROMTrack [86]	81.9	61.6	55.8	SiamGAT [87]	76.4	58.9	86.5

Table 3. Impact of Mamba-ViT configurations and Search-Region Adaptive (SA) on the tracking performance of SAMViTrack on the UAVDT benchmark. Here, ✓ indicates the addition of the SA module, and ↑ shows the improvement; for FPS, it represents the percentage increase relative to the original value.

Mamba Layers	ViT Layers	SA	Prec.	Succ.	FPS	I_b
6	0		75.28	54.66	366.72	0.49
6	0	✓	${80.24}_{↑ 4.96}$	${56.95}_{↑ 2.29}$	${377.7}_{↑ 3.0 %}$	0.87
5	1		72.9	52.7	357.9	0.30
5	1	✓	${79.7}_{↑ 6.8}$	${57.1}_{↑ 4.4}$	${367.9}_{↑ 2.8 %}$	0.80
4	2		78.8	56.1	341.0	0.64
4	2	✓	${80.5}_{↑ 1.7}$	${58.9}_{↑ 2.8}$	${351.4}_{↑ 1.3 %}$	0.79
3	3		79.3	56.8	337.3	0.66
3	3	✓	${82.1}_{↑ 2.8}$	${58.1}_{↑ 1.3}$	${346.2}_{↑ 2.7 %}$	0.89
2	4		78.2	56.3	319.9	0.53
2	4	✓	${79.4}_{↑ 1.2}$	${56.6}_{↑ 0.3}$	${326.2}_{↑ 2.0 %}$	0.63
1	5		76.8	56.1	315.1	0.41
1	5	✓	${80.3}_{↑ 3.5}$	${57.0}_{↑ 0.9}$	${324.8}_{↑ 3.1 %}$	0.69
0	6		75.68	54.57	302.7	0.30
0	6	✓	${80.63}_{↑ 4.95}$	${55.83}_{↑ 1.26}$	${313.9}_{↑ 3.70 %}$	0.67

Table 4. Comparison of different Mamba and ViT combinations on the UAVDT benchmark. It presents Prec., Succ., FPS, and

I_{b}

of the model under various combinations of Mamba and ViT, with and without SA enabled. Here, ✓ indicates the addition of the SA module.

Table 4. Comparison of different Mamba and ViT combinations on the UAVDT benchmark. It presents Prec., Succ., FPS, and

I_{b}

of the model under various combinations of Mamba and ViT, with and without SA enabled. Here, ✓ indicates the addition of the SA module.

Mamba	ViT	SA	Prec.	Succ.	FPS	I_b
1, 3, 5	2, 4, 6		65.4	46.5	285.1	0.14
1, 3, 5	2, 4, 6	✓	71.9	48.7	292.9	0.41
2, 4, 6	1, 3, 5		67.0	48.4	273.9	0.16
2, 4, 6	1, 3, 5	✓	74.2	51.4	293.0	0.50
4, 5, 6	1, 2, 3		79.6	55.0	335.9	0.87
4, 5, 6	1, 2, 3	✓	80.7	57.4	344.7	0.94
1, 2, 3	4, 5, 6		79.3	56.8	337.3	0.86
1, 2, 3	4, 5, 6	✓	82.1	58.1	346.2	1.00

Table 5. Evaluation fany of the generalizability of our SA by applying them to three SOTA trackers. The evaluation is performed on GOT-10k and OTB. Here, ✓ indicates the addition of the SA module.

Tracker	SA	GOT-10k		OTB		Avg. FPS
Tracker	SA	AO	SR0.75	Succ.	Prec.	Avg. FPS
AQATrack [89]		76.0	74.9	68.8	89.8	22.6
AQATrack [89]	✓	76.7	75.7	69.9	91.7	24.3
OSTrack [34]		73.7	70.8	68.2	88.7	33.5
OSTrack [34]	✓	74.7	72.6	69.4	90.5	35.2
HIPTrack [79]		77.4	74.5	70.9	93.1	28.7
HIPTrack [79]	✓	78.4	75.8	71.7	93.9	30.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

SAMViTrack: A Search-Region Adaptive Mamba-ViT Tracker for Real-Time UAV Tracking

Abstract

1. Introduction

3. Method

3.1. Overview

3.2. Hybrid Mamba-ViT for Feature Representation

3.3. Search-Region Adaptive Module

3.4. Tracking Head and Loss Function

3.5. Parameter Sensitivity Observation

3.6. Summary and Motivation for Adaptive Search

4. Experiments

4.1. Implementation Details

4.2. Comparison with Lightweight Trackers

4.3. Comparison with Deep Trackers

4.4. Attribute-Based Evaluation

4.5. Ablation Study

4.6. Computational Complexity

4.7. Qualitative Results

5. Discussion

5.1. Limitations and Design Trade-Offs

5.2. Future Directions

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

SAMViTrack: A Search-Region Adaptive Mamba-ViT Tracker for Real-Time UAV Tracking

Abstract

1. Introduction

2. Related Works

3. Method

3.1. Overview

3.2. Hybrid Mamba-ViT for Feature Representation

3.3. Search-Region Adaptive Module

3.4. Tracking Head and Loss Function

3.5. Parameter Sensitivity Observation

3.6. Summary and Motivation for Adaptive Search

4. Experiments

4.1. Implementation Details

4.2. Comparison with Lightweight Trackers

4.3. Comparison with Deep Trackers

4.4. Attribute-Based Evaluation

4.5. Ablation Study

4.6. Computational Complexity

4.7. Qualitative Results

5. Discussion

5.1. Limitations and Design Trade-Offs

5.2. Future Directions

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics