Article

MIFMNet: A Multimodal Interactions and Fusion Mamba for RGBT Tracking with UAV Platforms

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
Remote Sens. 2026, 18(7), 1026; https://doi.org/10.3390/rs18071026
Submission received: 10 February 2026 / Revised: 4 March 2026 / Accepted: 27 March 2026 / Published: 29 March 2026
(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?
  • A novel multimodal interaction and fusion Mamba network (MIFMNet) is proposed for UAV RGBT tracking, featuring two core modules, scale differential enhanced Mamba (SDEM) and flow-guided multilayer interaction Mamba (FMIM), that address the trade-off between interaction capability and computational efficiency in existing CNN/Transformer frameworks.
  • MIFMNet achieves state-of-the-art performance on four mainstream RGBT benchmarks (LasHeR, RGBT210, RGBT234, VTUAV), with an inference speed of 35.3 FPS and superior robustness in UAV-specific challenges (scale variation, rapid motion, occlusion).
What are the implications of the main findings?
  • The scale differential enhancement and flow-guided motion-aware interaction mechanisms of MIFMNet provide an efficient solution for multimodal fusion in dynamic remote sensing observation scenarios with resource constraints.
  • Extending Mamba to RGBT tracking verifies its potential for linear-complexity long-range modeling in multimodal vision tasks, offering a new architectural alternative to CNNs and Transformers for UAV computer vision applications.

Abstract

RGBT tracking holds irreplaceable value in unmanned aerial vehicle (UAV) ground observation missions, effectively supporting scenarios such as nighttime monitoring and low-altitude reconnaissance. However, existing frameworks based on CNNs or Transformers face inherent trade-offs between interaction capabilities and computational efficiency. Furthermore, current methods perform poorly in challenging scenarios involving target scale variations and rapid motion from UAV perspectives. To address these issues, this paper proposes a novel multimodal interaction and fusion Mamba network (MIFMNet), which achieves fundamental innovations relative to existing RGB-T fusion trackers and recent Mamba-based tracking methods. Different from existing RGB-T trackers that rely on CNN’s local convolution or Transformer’s quadratic-complexity self-attention for cross-modal fusion, MIFMNet departs from these architectures and designs modality-adaptive interaction mechanisms based on Mamba, fully leveraging the complementary information while resolving the efficiency-accuracy trade-off. Specifically, this paper designs the scale differential enhanced Mamba (SDEM), which expands the receptive field through multiscale parallel convolutions while amplifying complementary information via differential strategies to enhance feature responses to scale-varying objects. Furthermore, we propose flow-guided multilayer interaction Mamba (FMIM), which integrates inter-frame motion information into scanning prediction. This enables the network to adaptively adjust interaction priorities between shallow texture and high-level semantic features based on motion intensity, mitigating early information forgetting and enhancing robustness in dynamic scenes. Extensive experiments on four major benchmarks demonstrate that MIFMNet achieves state-of-the-art performance on precision and success rate, particularly excelling in UAV scenarios involving occlusion, scale variations, and rapid motion. 
Simultaneously, it achieves an inference speed of 35.3 FPS, enabling efficient deployment on resource-constrained platforms, thereby providing robust support for UAV applications of RGBT tracking.

1. Introduction

Traditional object tracking methods primarily rely on video sequences captured in the RGB spectrum and have achieved remarkable success after years of development [1,2,3]. However, in remote sensing scenarios such as unmanned aerial vehicle (UAV) ground observation, RGB images exhibit significant limitations: poor illumination, high inter-object similarity, and ambiguous background boundaries can all severely degrade tracking performance. To overcome these constraints, researchers have begun exploring RGBT tracking [4,5], which leverages the complementary information from RGB and thermal infrared (TIR) modalities to enable continuous prediction and positioning. Specifically, TIR images are formed by capturing the thermal radiation emitted or reflected by objects. This allows tracking algorithms to locate targets based on thermal differences between objects and the background, thereby effectively supplementing RGB images under low-light or no-light conditions. Nevertheless, TIR images often suffer from inherently low spatial resolution and weak textural details due to limitations in airborne infrared imaging technology, longer wavelengths, and thermal noise. Because RGB and TIR differ in their imaging mechanisms and information representation yet complement each other, RGBT tracking can overcome the limitations of single-modality approaches, playing a pivotal role in remote sensing tasks such as Earth observation and UAV surveying [6].
As a representative multimodal vision task, RGBT tracking has so far focused on designing effective feature fusion modules. Existing methods can be broadly categorized into two groups. The first group uses Siamese-based networks, which match template and search region features across modalities. For instance, Zhang et al. [7] introduced a complementary-aware multimodal fusion module that enhances original features via a learnable weighting network. Cheng et al. [8] further integrated information from deeper levels after extracting multimodal features through a multiscale feature enhancement module. However, because Siamese-based networks cannot facilitate information interaction between the template and the search region, contextual modeling remains insufficient. The second group achieves feature fusion through Transformers, which offer strong representational capacity [9]. Fan et al. [10], for example, stacked multiple Swin-Transformer blocks [11] to enable cross-modal interaction and fusion. Li et al. [12] developed a multimodal interactive tracker based on Vision Transformer (ViT) that fuses features from different branches, promoting multidimensional information exchange. However, due to excessive computational overhead, Transformer-based methods can only achieve interaction at a limited number of levels when confronted with challenging conditions such as drones flying at high speeds or sudden changes in target scale. This prevents them from fully leveraging the complementary features. Current approaches thus face an intrinsic trade-off between interaction capability and computational efficiency: complex interaction modules are computationally expensive and difficult to deploy, while lightweight designs lack sufficient representational power.
Moreover, the complementarity between RGB and TIR modalities manifests differently across feature hierarchies [13]. At shallow layers, fine-grained RGB textures complement the coarse thermal contours in TIR. At deeper layers, high-level semantic cues from RGB integrate with thermal signatures [14]. Yet prevailing methods based on either convolutional neural networks (CNNs) or Transformers are fundamentally constrained: CNNs possess limited receptive fields that hinder global modeling, while Transformers, despite their long-range modeling strength, are bottlenecked by memory consumption, making multimodal interaction impractical. These architectural limitations prevent existing RGBT tracking from achieving comprehensive interaction and fusion, thereby limiting their applicability in UAV remote sensing scenarios [15]. Figure 1 illustrates comparisons of the accuracy and efficiency between MIFMNet and several state-of-the-art (SOTA) RGBT trackers on the dataset.
To address the limitations of existing RGBT tracking methods, which struggle to balance accuracy and efficiency while performing poorly in UAV scenarios, we propose a novel multimodal interaction and fusion Mamba network named MIFMNet. The network transcends CNN or Transformer architectures, fully leveraging the similarities and complementary information between two modalities to meet the core requirements of UAV dynamic observation and resource constraints. Specifically, we design the scale differential enhancement Mamba (SDEM), which models inter-modal differences with linear complexity and employs parallel convolutions to expand the effective receptive field. This significantly boosts feature responses for objects across varying scales while simultaneously enhancing modality-specific representations. Although Mamba-based models exhibit exceptional efficiency, their causal nature makes them prone to forgetting early feature information during sequence processing. Furthermore, standard scanning strategies lack the ability to perceive motions of objects. To mitigate these issues, we further propose flow-guided multilayer interaction Mamba (FMIM). FMIM incorporates inter-frame motion cues into Mamba’s scanning order, enabling adaptive reordering across layers. During high-intensity motion, the module prioritizes shallow texture features to ensure positioning accuracy. During smooth motion, it emphasizes high-level semantic features to enhance interference resistance. FMIM mitigates early information forgetting while achieving adaptive feature interaction in dynamic scenes. Extensive experiments on four RGBT tracking benchmarks demonstrate that MIFMNet has outstanding performance in both accuracy and efficiency, with particularly significant advantages in UAV scenarios involving scale variations and dynamic monitoring. Our main contributions are summarized as follows:
  • We propose a novel multimodal interaction and fusion network for RGBT tracking with UAV platforms. Through multi-level Mamba, this network not only achieves efficient multimodal interaction but also specifically addresses the issues of weak response in multiscale targets and insufficient adaptability to UAV scenarios.
  • We design a scale differential enhancement Mamba to model modal differences with linear computational complexity and expand the receptive field through parallel convolutions, enabling cross-modal enhancement and fusion while efficiently adapting to multiscale targets.
  • We introduce flow-guided multilayer interaction Mamba, which integrates optical flow-derived motion information into the scanning order prediction, allowing the network to dynamically prioritize shallow-texture or deep-semantic features based on motion intensity. This mitigates information forgetting and enhances robustness in UAV dynamic scenarios.
  • Extensive experiments on four RGBT benchmarks show that MIFMNet achieves SOTA results while maintaining a manageable computational load, with exceptional performance in UAV scenarios such as scale changes and rapid motion.

2. Related Works

2.1. RGBT Tracking

RGBT tracking has emerged as a prominent research direction in computer vision, leveraging the rich textural details of RGB images and the illumination-invariant properties of TIR images to achieve robust tracking in complex environments [16]. Research in this field primarily draws upon established frameworks for RGB tracking, focusing on two core directions: feature fusion and attribute adaptation. In terms of feature fusion, existing methods achieve cross-modal interaction through Siamese-based and Transformer-based architectures. Siamese networks typically employ parallel backbone networks to extract dual-modal features, followed by dedicated modules such as complementary perception or multiscale enhancement to facilitate cross-modal interaction [4]. Representative works include mfDiMP [17] and SiamCDA [7]. Despite significant progress in Siamese networks, these methods restrict cross-modal interaction to local regions, resulting in insufficient contextual modeling. In contrast, Transformer-based trackers overcome this locality limitation by establishing global correspondences between multimodal features from the template and search regions, thereby enabling comprehensive context modeling and enhancement. Methods such as TBSI [4] and MHTNet [18] introduce bridging interaction mechanisms and multi-head self-attention modules to infer contextual relationships between the target and its surroundings, achieving high-precision tracking. However, the quadratic computational complexity inherent to self-attention imposes a significant burden when processing long input sequences, limiting scalability.
To address specific challenges in UAV scenarios such as occlusion and lighting variations, task-aware strategies have been proposed. These approaches adopt a divide-and-conquer philosophy, using attribute-specific modules to enhance adaptability. For instance, Li et al. [19] designed separate models for illumination change and thermal crossover, achieving adaptive feature representation through guided fusion. Xiao et al. [20] constructed five parallel branches with identical architecture to tackle five distinct challenges: illumination variation, thermal crossover, fast motion, scale variation, and occlusion. Encoders and decoders then let the branch features interact deeply and enhance one another. Furthermore, Zhang et al. [21] jointly exploited appearance and motion cues, employing a shared-weight global network with a selection mechanism to flexibly switch between appearance and motion models, thereby improving robustness. Nevertheless, existing fusion strategies remain constrained by computational costs, often permitting interaction only at specific layers, and thus fail to fully exploit multilayer multimodal features. In this work, we depart from conventional CNN and Transformer-based architectures and instead adopt the Mamba framework. As illustrated in Figure 2, Mamba enables efficient, full-hierarchy multimodal interaction with linear computational complexity, offering a compelling solution to the longstanding trade-off between modeling capacity and efficiency. The core advantage of Mamba in multimodal feature fusion lies in its adaptive modeling of the intrinsic feature distribution characteristics of RGB and TIR modalities. 
In sharp contrast, the CNN-based modal interaction relies on local convolution kernels and stacking of receptive fields, which can only capture the local spatial correlation of RGB-T features and cannot model the long-range global semantic association of TIR modalities, leading to the loss of thermal radiation feature information in large-scale UAV scenes. The Transformer-based cross-modal interaction is based on full self-attention with quadratic complexity, which equally calculates the correlation between all RGB-T feature tokens, leading to redundant modeling of the highly similar background regions of RGB and TIR, and the over-consumption of computational resources in the modal interaction process, which is difficult to adapt to the real-time requirements of UAV platforms.
The SSM of Mamba decouples the spatial–temporal sequence modeling of features into a continuous state transition process and a selective information scanning mechanism: on the one hand, the continuous state transition of SSM can model the slow-varying global thermal semantic features of TIR images with stable state propagation, and adapt to the discontinuous local texture feature distribution of RGB images through input-dependent dynamic step size adjustment (Formula (9)); on the other hand, the selective scanning of Mamba can dynamically allocate attention weights to the complementary feature regions of RGB and TIR—for texture-deficient TIR features, SSM strengthens the propagation of global semantic state information, and for detail-rich RGB features, it enhances the capture of local texture state variation.

2.2. Visual State Space Models

State space models (SSMs) have gained widespread adoption in vision tasks due to their strong capability in long-range sequence modeling, computational efficiency, and linear complexity with respect to sequence length. SSMs map an input sequence to a latent state that encodes historical context, thereby enabling sequence prediction based on hidden states. However, conventional SSMs are time-invariant, which limits their performance in context reasoning tasks such as tracking. To address this limitation, Mamba [22] introduces a dynamic time-aware mechanism by making the step size parameter input-dependent, thus enabling context-aware information propagation. Subsequently, [23] extends Mamba to two-dimensional visual tasks through bidirectional selective scanning. Kang et al. [24] designed an information fusion module combining Mamba with cross-attention to capture and propagate richer implicit contextual cues, achieving accurate localization in video sequences. In [25], a fully Mamba-based framework was proposed that integrates feature learning and template search, delivering competitive tracking performance at low computational cost. Moreover, Yao et al. [26] proposed a Motion Mamba module that leverages local correlations and bidirectional scanning to model motion-blurred targets under UAV viewpoints. Although existing Mamba-based vision models have shown excellent performance in single-modality sequence modeling, their superiority in multimodal feature interaction (e.g., RGB-T) has not been theoretically and quantitatively demonstrated. Different from Transformer’s self-attention mechanism that relies on global token pair correlation calculation, Mamba’s SSM structure realizes modality-adaptive information propagation through dynamic state transition and selective scanning, which is more in line with the heterogeneous feature distribution characteristics of RGB and TIR modalities (RGB with local texture sparsity, TIR with global semantic compactness). 
More importantly, the linear computational complexity of Mamba fundamentally solves the efficiency bottleneck of Transformer in dense multimodal interaction, making it more suitable for resource-constrained UAV RGBT tracking tasks. The following section will further derive the computational complexity of Mamba and Transformer in RGB-T interaction and theoretically analyze the inherent advantages of Mamba in RGBT tracking.

3. Methodology

3.1. Preliminaries

SSMs are inspired by continuous-time linear systems. In such systems, a one-dimensional input function or sequence $x(t) \in \mathbb{R}$ is mapped to an output $y(t) \in \mathbb{R}$ through an intermediate hidden state $h(t) \in \mathbb{R}^{N}$: the hidden state evolves from its previous value and the current input, and the output is computed from the input and the current hidden state. This process can be expressed as:
$$h'(t) = A h(t) + B x(t)$$
$$y(t) = C h(t) + D x(t)$$
where $A \in \mathbb{R}^{N \times N}$ is the learnable state matrix, and $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{N \times 1}$, and $D \in \mathbb{R}^{N \times 1}$ are learnable projection parameters. $D x(t)$ is sometimes interpreted as a residual connection and may be omitted. To convert the continuous-time formulation into a discrete-time model suitable for sequences, SSMs introduce a time-step discretization $\Delta$. A widely used discretization method is the zero-order hold (ZOH) rule, defined as:
$$\bar{A} = \exp(\Delta A)$$
$$\bar{B} = (\Delta A)^{-1} \left( \exp(\Delta A) - I \right) \cdot \Delta B$$
$$\bar{C} = C$$
where $\bar{A} \in \mathbb{R}^{N \times N}$, $\bar{B} \in \mathbb{R}^{D \times N}$, and $\bar{C} \in \mathbb{R}^{D \times N}$. After discretization, (1) and (2) can be expressed as:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t$$
$$y_t = \bar{C} h_t$$
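The ZOH rule and the discrete recurrence above can be illustrated with a minimal NumPy sketch. It assumes a diagonal state matrix (so the matrix exponential and inverse reduce to element-wise operations) and a single scalar input channel; the function names `zoh_discretize` and `ssm_recurrence` and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization for a diagonal state matrix.

    a: (N,) diagonal entries of A (assumed nonzero), b: (N,) entries of B,
    delta: scalar step size. Returns the diagonals of A_bar and B_bar.
    """
    a_bar = np.exp(delta * a)                      # A_bar = exp(delta * A)
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)  # (dA)^-1 (exp(dA) - I) dB
    return a_bar, b_bar

def ssm_recurrence(a_bar, b_bar, c, x):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a 1-D sequence x."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        ys.append(float(c @ h))
    return np.array(ys)

# Toy example: two stable modes driven by a unit impulse.
a = np.array([-1.0, -2.0])
b = np.array([1.0, 1.0])
c = np.array([0.5, 0.5])
x = np.array([1.0, 0.0, 0.0, 0.0])
a_bar, b_bar = zoh_discretize(a, b, delta=0.1)
y = ssm_recurrence(a_bar, b_bar, c, x)
print(y)
```

Driving the system with a unit impulse makes the output at step $k$ equal to $C \bar{A}^{k} \bar{B}$, which is a convenient sanity check on the discretization.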
Finally, the output sequence is obtained via a global convolution over the input, implemented through a structured convolution kernel $\bar{K}$:
$$y = x * \bar{K}$$
$$\bar{K} = \left( C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B} \right)$$
Here, $*$ denotes the convolution operator, and $L$ is the length of the input sequence $x$. In traditional SSMs, all learnable parameters remain fixed across different inputs, resulting in a linear time-invariant (LTI) system. This static nature constitutes a fundamental limitation when applying SSMs to vision tasks that demand context-sensitive processing. To overcome this, Mamba introduces a selective scanning mechanism, which enables the model to dynamically emphasize important information within long sequences. Specifically, the input $x \in \mathbb{R}^{B \times L \times N}$ is passed through a learnable multilayer perceptron (MLP) to generate $B \in \mathbb{R}^{B \times L \times N}$, $C \in \mathbb{R}^{B \times L \times N}$, and $\Delta \in \mathbb{R}^{B \times L \times N}$. The parameter values can therefore be adjusted dynamically based on the input. This process is represented as follows:
$$B, C, \Delta = \mathrm{Linear}(x)$$
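To make the input-dependent parameters concrete, the following is a minimal sequential sketch of a selective scan for a single sequence (no batch axis). Several details are assumptions rather than statements from the paper: the tensor shapes follow the common Mamba convention (a per-channel diagonal state of size `N`), the step size is kept positive with a softplus, and the weight matrices `W_B`, `W_C`, and `W_delta` are hypothetical stand-ins for the linear projection.

```python
import numpy as np

def softplus(z):
    # Keeps the step size positive (an assumption borrowed from common Mamba code).
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, W_delta):
    """Sequential selective scan.

    x: (L, D) token sequence; A: (D, N) negative diagonal state entries;
    W_B, W_C: (D, N); W_delta: (D, D). Returns y with shape (L, D).
    """
    L, D = x.shape
    N = A.shape[1]
    B = x @ W_B                      # (L, N), input-dependent
    C = x @ W_C                      # (L, N), input-dependent
    delta = softplus(x @ W_delta)    # (L, D), input-dependent step size
    h = np.zeros((D, N))
    y = np.empty((L, D))
    for t in range(L):
        A_bar = np.exp(delta[t][:, None] * A)        # exp(delta * A)
        # ZOH: (delta*A)^-1 (exp(delta*A) - 1) * delta*B simplifies to (A_bar - 1)/A * B
        B_bar = (A_bar - 1.0) / A * B[t][None, :]
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C[t]
    return y

rng = np.random.default_rng(0)
L, D, N = 6, 4, 3
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))   # strictly negative for stability
W_B = rng.standard_normal((D, N)) * 0.1
W_C = rng.standard_normal((D, N)) * 0.1
W_delta = rng.standard_normal((D, D)) * 0.1
y = selective_scan(x, A, W_B, W_C, W_delta)
print(y.shape)
```

Because each step only touches the current token and a fixed-size state, the loop costs time linear in the sequence length. It is also causal: changing a later token never alters earlier outputs.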
Unlike conventional approaches that stack separate linear attentions and MLPs, Mamba integrates these components into a single block. It replaces multiplicative gating with an activation function and embeds the SSM transformation directly into the core computational path. This design preserves linear computational complexity with respect to sequence length. Assume the input RGB/TIR feature map after patch embedding is a token sequence with length $N$ and dimension $D$. For RGB-T bimodal interaction, the total input token sequence length is $2N$. The computational complexity of a Transformer is dominated by the correlation matrix calculation between all tokens. For RGB-T fusion, self-attention needs to calculate the correlation between RGB tokens, TIR tokens, and cross-modal tokens, giving a total computational complexity of:
$$O(\mathrm{Trans}) = O\left( (2N)^{2} D \right) = 4\, O\left( N^{2} D \right)$$
For RGB-T interaction, Mamba processes the heterogeneous features of RGB and TIR through modality-specific state transition, giving a total computational complexity of:
$$O(\mathrm{Mamba}) = 2\, O\left( N D^{2} + N D K \right) = O\left( N D^{2} \right)$$
The complexity of Mamba is linear with respect to the token sequence length N, and the quadratic term is only related to the feature dimension D. Even for high-resolution input in UAV scenarios, the computational overhead of Mamba increases slowly, which is the fundamental reason for its high efficiency in dense RGB-T multimodal interaction. Furthermore, Mamba decouples the spatial–temporal sequence modeling of features into a continuous state transition process and a selective information scanning mechanism. On the one hand, the continuous state transition of SSM can model the slow-varying global thermal semantic features of TIR images with stable state propagation and adapt to the discontinuous local texture feature distribution of RGB images through input-dependent dynamic step size adjustment. On the other hand, the selective scanning of Mamba can dynamically allocate attention weights to the complementary feature regions of RGB and TIR. For texture-deficient TIR features, SSM strengthens the propagation of global semantic state information, and for detail-rich RGB features, it enhances the capture of local texture state variation. In this work, we harness the strong representational capacity of Mamba and extend it to support multimodal interaction and fusion.
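The two complexity expressions can be compared numerically with a small constant-free operation count. This is only a sketch: the width `D = 768` and the value `K = 16` are illustrative assumptions (the excerpt does not define $K$ or give these sizes), and the helper names are hypothetical.

```python
def attention_cost(n_tokens, dim):
    """Joint RGB-T self-attention: (2N)^2 * D pairwise-correlation operations."""
    return (2 * n_tokens) ** 2 * dim

def mamba_cost(n_tokens, dim, k):
    """Two modality-specific scans: 2 * (N * D^2 + N * D * K) operations."""
    return 2 * (n_tokens * dim ** 2 + n_tokens * dim * k)

D, K = 768, 16  # illustrative values, not taken from the paper
for n in (256, 1024, 4096):
    print(n, attention_cost(n, D), mamba_cost(n, D, K))
```

Under these counts, doubling $N$ quadruples the attention cost but only doubles the Mamba cost. Setting $4N^{2}D > 2ND(D + K)$ shows that attention becomes the more expensive option once $N > (D + K)/2$, so the advantage grows with input resolution rather than holding at every sequence length.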

3.2. Overall Framework

The Vision Transformer (ViT) departs from the conventional dual-stream architecture by directly concatenating the template and search regions and feeding them into a Transformer, which enables joint optimization of feature extraction and relational modeling. Inspired by this paradigm, our work builds upon OSTrack [27] and extends its backbone with Mamba-based modules to strike an effective balance between inference speed and tracking accuracy.
As illustrated in Figure 3, our proposed MIFMNet adopts a dual-stream encoder structure in which the RGB and TIR streams share identical parameters. The network comprises three key components: SDEM, FMIM, and a prediction head. Specifically, the search and template frames from both RGB and TIR modalities are first processed through patch embedding and positional encoding layers to obtain initial tokens. Subsequently, for each modality, the corresponding search and template tokens are concatenated along a designated dimension to form unified RGB tokens $x_t^r$ and TIR tokens $x_t^t$, which are then fed into the Transformer backbone for joint feature extraction and relational modeling. Following this, the resulting modality-specific features are passed through SDEM for differencing and enhancement. To enable comprehensive cross-modal interaction at every layer, we integrate SDEM into each layer of the backbone. For the $i$-th layer, the enhanced features are formulated as follows:
$$\hat{x}_i^r,\; \hat{x}_i^t,\; x_i^{sdem} = M_i^{SDEM}\left( x_i^r, x_i^t \right), \quad i \in [1, N].$$
$N$ denotes the number of layers. $\hat{x}_i^r$ and $\hat{x}_i^t$ represent the RGB and TIR modal features after differential enhancement, respectively. $M_i^{SDEM}$ and $x_i^{sdem}$ denote SDEM and the fused features output at the $i$-th layer, respectively. After all layers have performed differential enhancement, the resulting features from each layer are concatenated along the token dimension and passed into FMIM. By leveraging optical flow, FMIM dynamically determines the scanning order of sequences, thus achieving adaptive, motion-aware fusion across hierarchical levels. Finally, the fused representation produced by FMIM is forwarded to the prediction head to generate the final tracking output.

3.3. SDEM

Visible and infrared modalities, grounded in distinct physical imaging principles, exhibit markedly different characteristics. Specifically, RGB images rely on the reflection of objects in visible spectrum to capture rich appearance cues such as texture, color, and geometric details. In contrast, TIR images depend on the intrinsic infrared radiation emitted by objects, directly reflecting their temperature distribution and thermodynamic properties. The discrepancy between two modalities not only represents physical attributes but also encodes critical complementary information. For instance, in UAV nighttime scenarios, RGB images often suffer from severe loss of textural details due to insufficient illumination, whereas TIR images can still clearly delineate the contours of heat-emitting targets such as pedestrians or vehicles. Conversely, under strong reflective conditions, such as those involving water surfaces or glass facades, RGB images are prone to highlight artifacts, while TIR images stably preserve the true thermal radiation patterns. Existing SOTA methods typically employ Transformers with long-range modeling capabilities to achieve interaction between modal features. However, this strategy suffers from two major limitations:
  • It fails to exploit modality-specific information.
  • Its quadratic computational overhead impedes dense multimodal feature interaction across multiple layers.
Moreover, from UAV perspectives, objects frequently undergo significant scale variations, and their feature representations are easily overwhelmed by background noise. To address these challenges, we propose SDEM, illustrated in Figure 3. SDEM explicitly models and enhances complementary differences between modalities at multiple levels, thereby improving the robustness of modality-specific representations.
The core of SDEM lies in its multiscale adaptability and differential enhancement mechanism for complementary features. On one hand, it employs parallel multiscale convolutions to enlarge the receptive field, strengthening feature capture for objects of varying scales and alleviating the issue of insufficient target response. On the other hand, inspired by differential amplification circuits in electronics, SDEM suppresses common-mode signals while amplifying differential signals to fully exploit inter-modal complementarity. As shown in Figure 3, we first compute the difference between RGB and TIR features at the same layer to obtain the modality differential features $x_i^d$, where $x_i^d = x_i^r - x_i^t$. Although differential features encode valuable complementary information, they are also contaminated by noise. To adapt to objects of diverse scales, we feed $x_i^d$ into a spatial pyramid pooling (SPP) module prior to the Mamba block. The SPP module comprises three parallel convolutional branches with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, respectively. The 3 × 3 branch preserves fine-grained target details and emphasizes local texture. The 5 × 5 branch balances local and global context, making it suitable for medium-scale targets. The 7 × 7 branch expands the receptive field to enhance feature representation for small targets, thereby boosting their response strength against complex backgrounds. The outputs from these three branches are concatenated along the channel dimension.
$$x_i^{d\text{-}spp} = \mathrm{Concat}\left( \mathrm{Conv}_{3 \times 3}(x_i^d),\; \mathrm{Conv}_{5 \times 5}(x_i^d),\; \mathrm{Conv}_{7 \times 7}(x_i^d) \right).$$
Here, $\mathrm{Conv}_{k \times k}$ denotes the $k \times k$ convolution operation and $\mathrm{Concat}$ denotes concatenation along the channel dimension. Subsequently, the enhanced differential feature is processed by Mamba, denoted $M$, to suppress noise and further amplify useful signals. The resulting feature is then passed through an activation function $\tau$ (Tanh) and element-wise multiplied with the original RGB features $x_i^r$ and TIR features $x_i^t$, respectively, to generate complementary features. These features are added back to the original inputs to produce the enhanced RGB features $\hat{x}_i^r$ and TIR features $\hat{x}_i^t$.
$$\hat{x}_i^r = x_i^r + x_i^t \odot \tau\left( M\left( x_i^{d\text{-}spp} \right) \right),$$
$$\hat{x}_i^t = x_i^t + x_i^r \odot \tau\left( M\left( x_i^{d\text{-}spp} \right) \right).$$
Finally, the enhanced modality features undergo layer normalization (LN) and a linear projection for channel reduction, yielding the fused feature $x_i^{sdem}$ for the current layer.
$$x_i^{sdem} = \tau\left( \mathrm{LN}\left( \hat{x}_i^r, \hat{x}_i^t \right) W_i \right).$$
Here, $W_i$ denotes a linear layer with dimension reduction and $\mathrm{LN}$ denotes layer normalization. Through its multiscale design and differential enhancement mechanism, SDEM effectively enlarges the receptive field, strengthens multimodal complementary features, and significantly improves feature responsiveness for multiscale objects.
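The data flow of SDEM described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the learned $k \times k$ convolutions are replaced by fixed mean filters (`box_conv`), the Mamba block $M$ is replaced by a per-position linear map `W_mix` that brings the concatenated channels back to the input width, and the two enhanced features are concatenated along channels before layer normalization. All function and parameter names are hypothetical.

```python
import numpy as np

def box_conv(x, k):
    """k x k mean filter with zero padding; x: (C, H, W).
    Stand-in for a learned k x k convolution."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    C, H, W = x.shape
    out = np.zeros(x.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + H, dx:dx + W]
    return out / (k * k)

def layer_norm(x, eps=1e-5):
    """Normalize over the last (channel) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sdem(x_r, x_t, W_mix, W_out):
    """x_r, x_t: (C, H, W). W_mix: (3C, C), stand-in for the Mamba block M.
    W_out: (2C, C_out), the channel-reducing linear layer.
    Returns (x_r_hat, x_t_hat, x_sdem) with x_sdem of shape (H*W, C_out)."""
    C, H, W = x_r.shape
    x_d = x_r - x_t                                   # modality difference
    x_spp = np.concatenate([box_conv(x_d, k) for k in (3, 5, 7)], axis=0)
    m = np.einsum('chw,cd->dhw', x_spp, W_mix)        # placeholder for M(.)
    gate = np.tanh(m)                                 # tau
    x_r_hat = x_r + x_t * gate                        # cross-modal enhancement
    x_t_hat = x_t + x_r * gate
    tokens = np.concatenate([x_r_hat, x_t_hat], axis=0).reshape(2 * C, H * W).T
    x_sdem = np.tanh(layer_norm(tokens) @ W_out)      # fused feature
    return x_r_hat, x_t_hat, x_sdem

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x_r = rng.standard_normal((C, H, W))
x_t = rng.standard_normal((C, H, W))
W_mix = rng.standard_normal((3 * C, C)) * 0.1
W_out = rng.standard_normal((2 * C, 2)) * 0.1
x_r_hat, x_t_hat, x_sdem = sdem(x_r, x_t, W_mix, W_out)
print(x_r_hat.shape, x_t_hat.shape, x_sdem.shape)
```

One property of the differential design is visible directly in the sketch: when both modalities carry identical features, the difference is zero, the gate vanishes, and each stream passes through unchanged. Only content that differs between modalities triggers enhancement.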

3.4. FMIM

In vision tasks, shallow features capture fine-grained texture details, while deep features emphasize semantic information. The strong complementarity across different levels has proven effective [28,29]. However, most existing approaches adopt Transformer-based architectures, which not only suffer from limited computational efficiency but also struggle to facilitate interactions across hierarchical feature layers. Although a few recent methods explore Mamba, their scanning mechanisms lack awareness of object motion dynamics when applied to 2D images. This inevitably disrupts spatial relationships and leads to tracking drift during rapid UAV movements. To address this limitation, we propose FMIM, which dynamically adjusts the scanning direction based on inter-frame optical flow statistics, thereby enabling efficient cross-layer feature interaction.
The core idea of FMIM is to integrate inter-frame motion cues into the selective scanning mechanism. Specifically, we compute the mean optical flow value between consecutive frames to quantify target motion intensity. When motion intensity is high (e.g., rapid movement or abrupt displacement), the model prioritizes scanning shallow features rich in texture details to leverage their precise localization capability and enhance tracking stability. Conversely, when motion is mild or the target is nearly static, the model prioritizes scanning deep semantic features, leveraging their superior resistance to interference. Concretely, we first concatenate the output features from all SDEM modules along the token dimension to form a long token sequence X a l l that spans from shallow to deep representations. We then perform both forward and backward scans over this sequence.
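$$X_{\text{all}} = \mathrm{Concat}\left(x_1^{\text{sdem}}, x_2^{\text{sdem}}, \ldots, x_N^{\text{sdem}}\right),$$
$$F_{\text{all}}^{\text{forward}} = S\left(X_{\text{all}}, V_1\right),$$
$$F_{\text{all}}^{\text{backward}} = S\left(X_{\text{all}}, V_1\right),$$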
where $V_1$ denotes a 1D convolutional layer and $S$ represents the scanning operation. However, relying solely on unidirectional scanning may overlook critical tokens at sequence boundaries. To overcome this, we introduce an optical flow-guided scanning strategy. To quantify the inter-frame motion intensity of the target, we adopt the dense inverse search (DIS) optical flow algorithm to compute the dense optical flow field between consecutive RGB frames in the tracking sequence; the implementation is based on the PyTorch 1.8 framework and adapted to UAV scenarios. Specifically, we first resize the paired consecutive RGB frames to the model input resolution (256 × 256) to ensure spatial consistency with the feature extraction process, and set the DIS search window size to 9 × 9 to balance motion capture precision and computational efficiency. We then compute the horizontal ($u$) and vertical ($v$) flow components at each pixel of the RGB frame pair and apply spatial average pooling with a 16 × 16 kernel to the dense flow field, obtaining a three-dimensional mean optical flow vector $F = [\bar{u}, \bar{v}, \lVert F \rVert]$. Here, $\lVert F \rVert = \sqrt{\bar{u}^2 + \bar{v}^2}$ represents the overall motion magnitude of the target in the image plane. The RGB modality is used for optical flow computation because it contains richer texture details than the TIR modality, which reduces the ambiguity of flow estimation in low-texture thermal regions. The mean optical flow vector is normalized to the range [−1, 1] to eliminate the impact of absolute pixel displacement on the subsequent motion intensity judgment. Next, the long token sequence $X_{\text{all}}$ is downsampled to a fixed dimension to yield a compact feature representation $X_{\text{all-d}}$, which is concatenated channel-wise with $F$ and fed into a multilayer perceptron (MLP) followed by a fully connected (FC) layer to predict a scanning priority rank for each layer, as formulated in (18).
This rank determines the adaptive scanning order that reflects the relative importance of each feature layer under the current motion context. Subsequently, FMIM reorders the original long token sequence $X_{\text{all}}$ according to the predicted priorities $\text{rank}$, yielding an ordered feature sequence $X_{\text{all}}^{\text{rank}}$ that enables motion-aware scanning.
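$$\text{rank} = \mathrm{FC}\left(\mathrm{MLP}\left(\mathrm{Concat}\left(X_{\text{all-d}}, F\right)\right)\right),$$
$$X_{\text{all}}^{\text{rank}} = \left(X_{\text{all}}, \text{rank}\right),$$
$$F_{\text{all}}^{\text{flow}} = S\left(X_{\text{all}}^{\text{rank}}, V_1\right).$$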
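The reordering step can be illustrated with a minimal NumPy sketch. It rests on assumptions that the text does not confirm: reordering is done at the granularity of whole layers, a higher predicted score means the layer is scanned earlier, and the function name `reorder_layers_by_rank` is hypothetical.

```python
import numpy as np


def reorder_layers_by_rank(layer_tokens, rank_scores):
    """Concatenate per-layer token blocks in descending priority order.

    layer_tokens: list of N arrays of shape (T_i, C), one per SDEM output.
    rank_scores:  N predicted priorities (assumed: higher = scanned first).
    """
    # Stable sort keeps the shallow-to-deep order among equal scores.
    order = np.argsort(-np.asarray(rank_scores, dtype=np.float64), kind="stable")
    x_all_rank = np.concatenate([layer_tokens[i] for i in order], axis=0)
    return x_all_rank, order.tolist()


# Three toy layers with 2 tokens each; layer i is filled with the value i.
layers = [np.full((2, 4), float(i)) for i in range(3)]
x_rank, order = reorder_layers_by_rank(layers, [0.1, 0.7, 0.2])
```

In this toy case the second layer receives the highest score, so its tokens lead the reordered sequence that is passed to the flow-guided scan.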
Finally, the results of the tri-directional scanning (forward, backward, and optical flow-guided) are fused through a simple gating strategy in Mamba. The fused representation is aggregated via element-wise summation and passed to the head for localization. By incorporating optical flow statistics, FMIM adaptively modulates the scanning order based on motion intensity, achieving dynamic prioritization between shallow texture cues and deep semantic cues. Crucially, this is accomplished while preserving linear computational complexity, ensuring both comprehensive cross-layer interaction and high computational efficiency. Optical flow estimation inevitably produces errors in complex UAV RGBT tracking scenarios. The two main error types are motion ambiguity error (caused by low-texture RGB regions, e.g., uniform sky or ground) and outlier error (caused by sudden camera jitter or background clutter). The former leads to under-estimation of motion intensity, so the model incorrectly prioritizes deep semantic features in high-speed motion scenarios, causing tracking drift. The latter leads to over-estimation of motion intensity, so the model over-relies on shallow texture features in static or slow-motion scenarios, reducing its anti-interference ability.
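The optical flow statistics that drive FMIM can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the dense flow field is taken as given (for example, from a DIS implementation), the 16 × 16 pooled map is averaged globally, normalization divides by the larger image side before clipping to [−1, 1], and the magnitude is computed on the normalized components. The name `mean_flow_vector` and the parameter `max_disp` are hypothetical.

```python
import numpy as np


def mean_flow_vector(flow, kernel=16, max_disp=None):
    """Reduce a dense flow field (H, W, 2) to [u_mean, v_mean, magnitude].

    Assumptions (not specified in the text): the pooled map is averaged
    globally, and normalization divides by `max_disp` (default: the larger
    image side) before clipping to [-1, 1].
    """
    flow = np.asarray(flow, dtype=np.float64)
    h, w, _ = flow.shape
    hk, wk = h // kernel, w // kernel
    # Non-overlapping kernel x kernel average pooling; any remainder is cropped.
    pooled = (
        flow[: hk * kernel, : wk * kernel]
        .reshape(hk, kernel, wk, kernel, 2)
        .mean(axis=(1, 3))
    )
    u_bar, v_bar = pooled.reshape(-1, 2).mean(axis=0)
    if max_disp is None:
        max_disp = float(max(h, w))
    u_bar = float(np.clip(u_bar / max_disp, -1.0, 1.0))
    v_bar = float(np.clip(v_bar / max_disp, -1.0, 1.0))
    magnitude = float(np.hypot(u_bar, v_bar))
    return np.array([u_bar, v_bar, magnitude])


# Synthetic field at the 256 x 256 input size: 32 px right, 24 px up per pixel.
flow = np.zeros((256, 256, 2))
flow[..., 0] = 32.0
flow[..., 1] = -24.0
stats = mean_flow_vector(flow)
```

Because the statistic is a global mean, a few outlier vectors from camera jitter can shift it, which is consistent with the over-estimation error discussed above.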

4. Experiment

4.1. Experimental Setup

We implement MIFMNet using the PyTorch framework and train it on a system equipped with two NVIDIA RTX 4090 GPUs. The backbone network is ViT-Base with a patch size of 16 × 16 and a stride of 16. Our modality interaction Mamba spans all 12 layers, and the classification head adopts a center-localization architecture with 256 channels. The network is initialized with the K-700 pre-trained weights provided by DropTrack [30], which significantly enhances the capacity to capture motion-related features. To optimize memory efficiency, we employ mixed-precision training (AMP) with a batch size of 16. The Adam optimizer is used with an initial learning rate of $1 \times 10^{-4}$ and a weight decay coefficient of 0.0001. The learning rate follows a stepwise decay strategy and is reduced by a factor of 10 after the 10th epoch. Gradient clipping is set to a norm of 0.1 to prevent gradient explosion. The model is trained for 30 epochs in total, with each epoch comprising $6 \times 10^{4}$ training sample pairs.
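The schedule described above can be written out as a short sketch. The base learning rate of 1e-4 is the value stated for the ablation setup in Section 4.4.1. The 1-indexed epoch convention, the function name `learning_rate`, and the reading of the batch size of 16 as the global batch size are assumptions.

```python
def learning_rate(epoch, base_lr=1e-4, decay_epoch=10, gamma=0.1):
    """Step decay with 1-indexed epochs: the rate is multiplied by `gamma`
    for every epoch after `decay_epoch`."""
    return base_lr * (gamma if epoch > decay_epoch else 1.0)


EPOCHS = 30
PAIRS_PER_EPOCH = 60_000   # 6 x 10^4 training sample pairs per epoch
BATCH_SIZE = 16            # assumed to be the global batch size

iters_per_epoch = PAIRS_PER_EPOCH // BATCH_SIZE
total_iters = EPOCHS * iters_per_epoch
schedule = [learning_rate(e) for e in range(1, EPOCHS + 1)]
```

Under these assumptions, training runs for 3750 iterations per epoch, with the first 10 epochs at the base rate and the remaining 20 at one tenth of it.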
To improve robustness, we apply multiple data augmentation techniques to each sequence in the training set, including spatial transformations (center jittering and scale jittering) and color normalization. Considering the multimodal nature, we activate the cross-entropy loss during the early training phase (epochs: 1–4) and enforce a 70% token retention rate in the interaction layers (layers: 3, 6, and 9) to strengthen inter-modal alignment. We conduct experiments on four public remote sensing datasets, LasHeR [31], RGBT210 [32], RGBT234 [33], and VTUAV [34], specifically validating our MIFMNet on challenging scenarios captured from UAV perspectives.

4.2. Quantitative Comparison

To comprehensively evaluate the overall performance of MIFMNet, we conduct extensive experiments on four mainstream RGBT tracking benchmarks: RGBT210, RGBT234, LasHeR, and VTUAV. Our method is compared against 20 SOTA trackers, including CNN-based and Transformer-based approaches. To ensure fair and rigorous comparison with SOTA RGBT trackers, we uniformly confirm and align the training settings, pre-training strategies and core hyperparameters of all comparative methods based on their original publications and official open-source implementations. For the consistency of experimental conditions, all comparative methods and our MIFMNet are trained and tested on the same hardware platform (two NVIDIA RTX 4090 GPUs) with the PyTorch framework, and the input resolution of all models is unified to 256 × 256. Following the standard one-pass evaluation (OPE) protocol, we adopt three widely accepted quantitative metrics: Precision Rate (PR), Success Rate (SR), and Normalized Precision Rate (NPR), all of which are established standards for assessing RGBT tracking algorithms. A detailed summary of the comparative results is presented in Table 1, clearly demonstrating the superior performance of MIFMNet.
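For readers unfamiliar with these metrics, the sketch below shows one common way PR and SR are computed under OPE. It follows the widely used OTB-style definitions (center location error at a 20-pixel threshold, and the mean success ratio over 21 overlap thresholds); the exact thresholds and conventions of each benchmark toolkit may differ, and NPR, which normalizes the center error by the ground-truth box size, is omitted for brevity. Boxes are assumed to be in (x, y, w, h) format.

```python
import numpy as np


def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)


def iou(pred, gt):
    """Per-frame intersection over union for (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)


def precision_rate(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    return float(np.mean(center_error(pred, gt) <= threshold))


def success_rate(pred, gt, num_thresholds=21):
    """Mean success ratio over overlap thresholds in [0, 1] (AUC-style)."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))


# Two frames: one perfect prediction, one that misses the target entirely.
gt = np.array([[0, 0, 10, 10], [0, 0, 10, 10]], dtype=np.float64)
pred = np.array([[0, 0, 10, 10], [30, 0, 10, 10]], dtype=np.float64)
pr = precision_rate(pred, gt)
sr = success_rate(pred, gt)
```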

4.2.1. Evaluation on RGBT210

The RGBT210 dataset is renowned for its high level of difficulty, comprising 210 paired RGBT video sequences with approximately 210 K frames in total. It provides fine-grained annotations across 12 distinct challenge attributes. As shown in Table 1, MIFMNet achieves the highest SR of 64.3%, significantly outperforming all CNN-based RGBT trackers and surpassing AFTER and DMD by 0.8% and 0.6%, respectively. Although its PR (86.7%) is slightly lower than that of AFTER (87.6%), MIFMNet operates at a real-time speed of 35.3 FPS, substantially faster than AFTER (20.9 FPS) and DMD (17 FPS), thereby achieving a favorable trade-off between accuracy and efficiency.

4.2.2. Evaluation on RGBT234

The RGBT234 dataset is one of the most influential and widely adopted benchmarks in RGBT tracking, containing 234 precisely aligned RGB-T sequences totaling approximately 233.4 K frames. As reported in Table 1, MIFMNet achieves the best SR of 66.9%, outperforming AFTER and DMD by 0.2%. While its PR (63.1%) is marginally lower than those of AFTER and DMD by 1.4% and 0.6%, respectively, it still significantly exceeds all other RGBT trackers. This result further validates the algorithm’s consistently high accuracy on this canonical benchmark.

4.2.3. Evaluation on VTUAV

The VTUAV dataset encompasses a wide range of UAV Earth observation scenarios, covering vegetation coverage, mountainous terrain, and nighttime observation. We focus on its short-term tracking subset. As shown in Table 1, MIFMNet leads by a clear margin: it achieves SR/PR scores of 74.8%/86.5%, outperforming the second-best method, AFTER (72.5%/84.9%), by 2.3%/1.6%. Given that this dataset features abundant challenges such as fast motion, extreme scale changes, and motion blur, this lead supports the effectiveness of MIFMNet's dynamic fusion and motion-aware modeling strategy. We also validate the algorithm on typical deformation scenarios, where the proposed method achieves the best performance. This underscores its strong potential for deployment in high-speed, dynamic applications such as UAV-based tracking.

4.2.4. Evaluation on LasHeR

The LasHeR dataset is currently the largest and most challenging RGBT tracking dataset, featuring 1224 aligned sequences and 734.8 K frames. It includes annotations for 19 fine-grained challenge attributes, imposing stringent demands on tracker robustness and generalization. As summarized in Table 1, MIFMNet is evaluated across the three metrics (SR, NPR, PR) against multiple advanced trackers. Compared to the strongest CNN-based method (QAT), MIFMNet improves SR, NPR, and PR by 7.7%, 9.5%, and 9.0%, respectively. Against the leading Transformer-based methods, TBSI, AFTER, and DMD, MIFMNet consistently achieves the best results, with gains of 2.2%/3.4%/4.0%, 2.7%/3.3%/2.9%, and 0.2%/0.5%/0.6% in SR/NPR/PR, respectively. These results confirm that our proposed Mamba-based fusion framework exhibits exceptional robustness and generalization in complex, long-duration, and diverse real-world scenarios.
To provide intuitive visual evidence, we plot the PR and SR curves of MIFMNet against top-performing trackers. As shown in Figure 4a, MIFMNet consistently ranks highest across all position error thresholds. At the standard threshold of 20 pixels, MIFMNet achieves a precision of 0.732, notably higher than the second-best method (MCTrack), indicating its highly stable localization. Moreover, the curve rises sharply in the low-threshold region (0–10 pixels), demonstrating strong performance even under strict precision requirements and high resistance to interference. As illustrated in Figure 4b, the high area under the curve (AUC) and slow decay of the SR curve confirm MIFMNet’s superior robustness under challenging conditions such as occlusion and scale variation.
We further compare MIFMNet against representative CNN- and Transformer-based methods, including APFNet, mfDiMP, ViPT [49], and TBSI across all 19 challenge subsets of the LasHeR dataset (in Table 2). These challenges include: no occlusion (NO), partial occlusion (PO), total occlusion (TO), transparent occlusion (HO), motion blur (MB), low illumination (LI), high illumination (HI), abrupt illumination variation (AIV), low resolution (LR), deformation (DEF), background clutter (BC), similar appearance (SA), camera motion (CM), thermal crossover (TC), frame loss (FL), out of view (OV), fast motion (FM), scale variation (SV), and aspect ratio change (ARC). MIFMNet demonstrates either the best or second-best performance across complex scenarios, highlighting its exceptional robustness. Notably, under severe occlusion and deformation (e.g., HO and DEF), MIFMNet achieves the best SR/PR scores of 55.1%/63.5% and 61.0%/75.1%, respectively, which indicates its strong capability to handle heavy occlusion and non-rigid object deformations. In dynamic and interference-prone scenarios (FM, SV, BC), MIFMNet attains scores of 57.0%/71.4%, 58.2%/73.4%, and 56.3%/70.3%, confirming that its multimodal interaction and fusion Mamba effectively cope with UAV rapid motion and drastic scale changes. Additionally, MIFMNet demonstrates significant advantages in scenarios with poor imaging quality. Under LI and TC, the scores rank first with 51.5%/64.3% and 51.6%/65.3%, respectively. MIFMNet maintains stable tracking by leveraging complementary information when RGB or TIR images are degraded. Figure 5 presents the PR/SR scores across all challenge attributes, visually demonstrating MIFMNet’s superiority. Except for a few attributes like NO and OV, MIFMNet leads comprehensively, with no apparent performance bottlenecks. 
Additionally, to validate the effectiveness of the data augmentation strategy proposed in this paper, we compare the algorithm’s performance on the LasHeR dataset before and after applying the strategy. The results demonstrate that after data augmentation, PR and SR scores improve by 1.5% and 1.3%, respectively.

4.3. Qualitative Comparison

To facilitate qualitative analysis, we visually compare the tracking results of MIFMNet against four SOTA trackers under challenging scenarios involving occlusion, fast motion, intense illumination, and object deformation. As shown in Figure 6, MIFMNet resists drift in both occlusion and motion sequences. In the yellow skirt sequence, when the target is partially occluded by pedestrians crossing the street, MIFMNet maintains a bounding box that closely aligns with the ground truth. Even at the most severe occlusion frames (#036 and #088), there is no noticeable drift. In contrast, methods such as CAT and CAT++ exhibit significant bounding box deviations immediately after occlusions occur. In the yellow girl sequence, the target becomes fully occluded by trees, causing severe information loss in the visible modality, while the thermal image only provides a faint contour of the target. Despite this, MIFMNet consistently localizes the target accurately. Conversely, CAT++ completely loses track of the target, and both AFTER and QAT suffer from substantial drift.
As further illustrated in Figure 7, MIFMNet also excels in scenarios involving rapid motion and scale variation. In the nighttime white riding bike sequence, the target undergoes fast motion under heavy shadow interference, and the thermal modality is corrupted by considerable noise. The tracking box of MIFMNet consistently maintains high alignment with the ground truth during both rapid target motion (#087, #119) and scale changes (#145). By comparison, CAT++ drifts during motion, while QAT and AFTER mismatch the bounding box during scale transitions. In the umbrella sequence, the scale of the umbrella continuously varies as the pedestrian moves. Among the competing methods, CAT and CAT++ show clear inaccuracies under scale changes, and AFTER drifts during rapid motion. In contrast, our tracker maintains precise localization and accurate box dimensions during target scale changes (#013, #100) and rapid motion (#159). In summary, MIFMNet leverages the complementary characteristics of RGB and TIR modalities to sustain stable tracking even when one modality suffers from severe degradation or high noise. Moreover, it rapidly adapts to scale variations and effectively prevents bounding box drift under conditions of heavy occlusion or abrupt motion, thereby achieving robust tracking performance in real-world UAV scenarios.

4.4. Ablation Studies

4.4.1. Component Analysis

To evaluate the contribution of each proposed module in our MIFMNet to RGBT tracking, we conduct extensive ablation studies on the LasHeR dataset, with results summarized in Table 3. We adopt OSTrack as the baseline model, and all ablation experiments share identical training hyperparameters to isolate the independent impact of each component: the training process is set to 30 epochs with $6 \times 10^{4}$ training sample pairs per epoch, the initial learning rate of the Adam optimizer is set to $1 \times 10^{-4}$ with a weight decay coefficient of 0.0001, the learning rate is decayed by a factor of 10 after the 10th epoch, gradient clipping is set to a norm of 0.1, and mixed-precision training (AMP) with a batch size of 16 is employed for all ablation models. In Table 3, Symbol A corresponds to the scale differential enhanced Mamba (SDEM) module, and Symbol B corresponds to the flow-guided multilayer interaction Mamba (FMIM) module; the marks √ and × represent adding or not adding the corresponding module to the baseline network, respectively, and √* denotes the FMIM module with its optical flow guidance mechanism removed (only the basic multilayer interaction function is retained).
Specifically, we first ablate the SDEM module independently: in the baseline model, cross-modal feature fusion is simply implemented via element-wise addition without any differential enhancement and multiscale modeling. After incorporating the SDEM module, the model achieves SR/NPR/PR improvements of 7.5%/7.8%/6.5% over the baseline (from 47.8/55.8/60.4 to 55.3/63.6/66.9), which demonstrates that the multiscale parallel convolution and differential feature enhancement mechanism of SDEM can effectively amplify the complementary information between RGB and TIR modalities and boost feature responses for scale-varying targets, thus significantly benefiting RGBT tracking performance. Next, we independently assess the impact of the FMIM module: the baseline model originally fuses hierarchical features from different layers through direct token concatenation without motion-aware adaptive scanning. Integrating the complete FMIM module yields gains of 6.8%/7.3%/6.3% in SR/NPR/PR (from 47.8%/55.8%/60.4% to 54.6%/63.1%/66.7%) compared with the baseline, which confirms that the flow-guided scanning strategy in FMIM facilitates efficient cross-layer feature interaction under dynamic motion scenarios and mitigates early information forgetting in Mamba’s causal modeling process.
When both SDEM and FMIM are jointly integrated into the baseline model, the model achieves absolute improvements of 10.0%/13.3%/12.8% in SR/NPR/PR (reaching 57.8/69.1/73.2) over the baseline, which validates the synergistic effect of our multimodal interaction and fusion strategy. SDEM provides enhanced multi-scale cross-modal features for FMIM, and FMIM further realizes motion-aware adaptive interaction of these enhanced features across layers, forming a closed loop of feature enhancement and hierarchical interaction. Furthermore, we ablate the optical flow guidance mechanism in FMIM (denoted as √*) and observe a performance drop (SR/NPR/PR from 54.6/63.1/66.7 to 54.1/62.3/65.2), which underscores the value of optical flow-guided scanning for adapting to dynamic UAV scenarios and improving the model's motion robustness. We further verify the impact of input resolution on the full model (SDEM+FMIM) by increasing the input resolution from 256 × 256 to 384 × 384; the model achieves additional SR/NPR/PR gains of 0.5%/0.4%/0.6% (58.3/69.5/73.8), which indicates the good scalability of our MIFMNet architecture.
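The gains quoted in this subsection can be re-derived from the reported SR/NPR/PR scores with a few lines of Python. The scores are those stated in the text above; the dictionary keys and the helper name `gains` are hypothetical labels used only for this check.

```python
# SR / NPR / PR scores on LasHeR as quoted in the ablation discussion.
baseline = (47.8, 55.8, 60.4)
variants = {
    "SDEM only": (55.3, 63.6, 66.9),
    "FMIM only": (54.6, 63.1, 66.7),
    "FMIM without flow guidance": (54.1, 62.3, 65.2),
    "SDEM + FMIM": (57.8, 69.1, 73.2),
    "SDEM + FMIM @ 384": (58.3, 69.5, 73.8),
}


def gains(scores, reference):
    """Absolute differences in percentage points, rounded to one decimal."""
    return tuple(round(s - r, 1) for s, r in zip(scores, reference))


over_baseline = {name: gains(s, baseline) for name, s in variants.items()}
# Contribution of optical flow guidance inside FMIM.
flow_guidance = gains(variants["FMIM only"], variants["FMIM without flow guidance"])
# Contribution of the higher input resolution on the full model.
resolution = gains(variants["SDEM + FMIM @ 384"], variants["SDEM + FMIM"])
```

The differences reproduce the reported gains. They also show that the combined model's NPR and PR gains are close to the sum of the two single-module gains, while the SR gain is noticeably smaller than that sum.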

4.4.2. Impact of Input Resolution

High-resolution inputs provide richer texture and edge details, yielding more discriminative appearance cues for tracking. While increasing resolution is a common strategy to boost performance in vision tasks, it typically incurs quadratic growth in computational cost, especially prohibitive in RGBT tracking due to dual-modality processing. Consequently, most existing methods resort to low-resolution inputs (e.g., 256 × 256). Leveraging the linear computational complexity of Mamba, we explore a higher input resolution of 384 × 384 to validate the scalability of our approach. As shown in Table 3, this directly leads to consistent performance gains across all metrics, achieving SOTA results. Additionally, to verify the impact of resolution scaling on memory usage, comparative experiments are conducted on a single NVIDIA RTX 4090 GPU, as shown in Figure 8. When replacing our SDEM and FMIM modules with standard Transformer blocks, GPU memory consumption grows quadratically with resolution. In contrast, MIFMNet exhibits only linear growth, enabling high-accuracy tracking at elevated resolutions while remaining deployment-friendly.
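The scaling argument can be made concrete with a back-of-the-envelope calculation. The sketch assumes the 16 × 16 patch size of the ViT-Base backbone and counts tokens for a single image only; template tokens, the second modality, and constant factors are ignored, so the ratios indicate growth trends rather than measured costs.

```python
def num_tokens(resolution, patch=16):
    """Number of non-overlapping patch tokens for a square input."""
    return (resolution // patch) ** 2


low, high = num_tokens(256), num_tokens(384)

# A linear-complexity scan grows with the token count.
linear_growth = high / low
# Self-attention builds a token-by-token matrix, so it grows with the square.
quadratic_growth = (high ** 2) / (low ** 2)
```

Moving from 256 × 256 to 384 × 384 raises the token count from 256 to 576, a 2.25× increase for a linear-complexity scan versus roughly 5.06× for the pairwise attention matrix.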

4.4.3. Impact of Network Depth

In CNN-based trackers, deeper layers enhance representational capacity but drastically increase parameters and FLOPs, hindering deployment in resource-constrained scenarios. Early RGBT methods (e.g., those built upon VGG-M) suffered from limited depth and suboptimal performance. Transformer-based models, while capable of global modeling, face severe memory bottlenecks. Self-attention complexity scales quadratically with sequence length, preventing deep stacking and full hierarchical interaction. Indeed, when we replace Mamba with Transformer in our framework, the model runs out of memory when attempting to fuse features from just five layers. In stark contrast, MIFMNet successfully supports interaction across all 12 layers. We thus perform ablation studies with varying numbers of fused layers. Results in Table 4 show that tracking performance consistently improves as more layers are utilized. Compared to using only the first layer, fusing all 12 layers yields gains of 2.5%/2.4%/2.9% in SR/NPR/PR on the LasHeR dataset. In summary, the efficient multimodal interaction and fusion architecture constructed by MIFMNet significantly improves tracking accuracy while maintaining controllable resource consumption, thereby overcoming the efficiency bottlenecks faced by traditional architectures in RGBT tasks.

4.4.4. Efficiency Analysis

To evaluate the computational efficiency of the proposed method, we compare it with SOTA approaches. As shown in Table 5, the comparison metrics include parameter count, computational load (measured in FLOPs), and inference speed (FPS). MIFMNet has only 17.2 M parameters, significantly fewer than GMMT (962.2 M), TBSI (99.3 M), and DMD (121.6 M). Furthermore, the FLOPs of MIFMNet are only 5.6 G, substantially lower than competing methods, indicating a reduced computational burden during inference. The gap between MIFMNet and Transformer-based trackers (e.g., GMMT) stems from architectural differences in the interaction modules. Despite sharing the same ViT-B backbone, the core multimodal interaction mechanism (Mamba vs. self-attention) fundamentally determines the parameter and FLOP gap. GMMT adopts a standard Transformer with multi-head self-attention (MHA) for cross-modal interaction, which requires additional learnable parameters for attention projection matrices (query/key/value) and layer normalization in each Transformer block. In contrast, MIFMNet replaces the MHA layers with lightweight Mamba-based modules (SDEM and FMIM). The SSM structure of Mamba reuses the linear projection layers for state transition and selective scanning, avoiding redundant parameters in cross-modal interaction. Meanwhile, the differential enhancement and motion-guided scanning mechanisms of SDEM/FMIM eliminate the need for complex attention weight calculation modules, which accounts for the low total parameter count relative to GMMT (dominated by MHA-related parameters) and TBSI. Finally, although MIFMNet's FPS (35.3) is marginally lower than that of TBSI (36.2), it remains significantly higher than those of GMMT (22.4) and DMD (16.8).
In summary, MIFMNet achieves substantial reductions in model parameters and computational complexity while maintaining high tracking accuracy, making it suitable for UAV embedded systems. Furthermore, its current inference speed has yet to fully leverage its computational advantages, primarily due to Mamba not being fully optimized within existing hardware acceleration libraries, which impacts the frame rate during end-to-end inference. With the development of underlying computing architectures, its inference efficiency is expected to increase further.

5. Conclusions

This paper addresses the performance bottlenecks of existing architectures in RGBT tracking by proposing MIFMNet. Through the design of two core modules, SDEM and FMIM, MIFMNet effectively resolves the trade-off between interaction capability and computational efficiency, significantly enhancing tracking robustness in challenging scenes. Specifically, SDEM enhances complementary differences between modalities via parallel convolution and differential approaches. FMIM achieves adaptive interaction in UAV dynamic scenes by incorporating inter-frame motion information into scanning sequence prediction. Extensive experiments demonstrate that MIFMNet provides an efficient multimodal tracking solution for Earth observation. Its scale-differential enhancement combined with a motion-guided dynamic scanning mechanism is more practical than CNNs, which are limited to local modeling, and Transformers, which incur high computational overhead. Future research will focus on two directions. First, while MIFMNet employs Transformers for feature extraction, we plan to explore tracking pre-training backbones based on Mamba to further enhance efficiency. Second, we aim to optimize Mamba's adaptation to 2D visual data, thereby improving tracking accuracy in remote sensing scenarios involving small targets and extreme deformations.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/rs18071026/s1.

Author Contributions

Conceptualization, R.G. and X.S.; methodology, R.G.; validation, R.G., X.S. and S.S.; formal analysis, R.G.; investigation, Z.D.; resources, H.Q.; data curation, B.S.; writing—original draft preparation, B.S.; writing—review and editing, F.L.; visualization, P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank Changguang Yuchen Information Technology and Equipment (Qingdao) Co., Ltd. for providing the multispectral camera.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, X.; Sun, H.; Liu, B.; Jiang, S.; Wang, J.; Li, D. Target-aware bidirectional fusion transformer for aerial object tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 29071–29082.
  2. He, B.; Zhao, X.; Chen, Y.; Liu, C.; Pang, X. Application of feature tracking using k-nearest-neighbor vector field consensus in sea ice tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4326–4336.
  3. Gao, W.; Niu, W.; Lu, W.; Wang, P.; Qi, Z.; Peng, X.; Yang, Z. Dim small target detection and tracking: A novel method based on temporal energy selective scaling and trajectory association. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17239–17262.
  4. Hui, T.; Xun, Z.; Peng, F.; Huang, J.; Wei, X.; Wei, X.; Dai, J.; Han, J.; Liu, S. Bridging search region interaction with template for RGB-T tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13630–13639.
  5. Wang, W.; Li, C.; Zhang, D.; Zhou, H.; Xie, M.; Zhou, H.; Fu, K. FcFNet: A challenge-based feature complementary fusion network for RGB-T tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 2239–2251.
  6. Zhang, H.; Yuan, D.; Shu, X.; Li, Z.; Liu, Q.; Chang, X.; He, Z.; Shi, G. A comprehensive review of RGB-T tracking. IEEE Trans. Instrum. Meas. 2024, 73, 5027223.
  7. Zhang, T.; Liu, X.; Zhang, Q.; Han, J. SiamCDA: Complementarity- and distractor-aware RGB-T tracking based on siamese network. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1403–1417.
  8. Cheng, Z.; Fan, H.; Tang, Y.; Wang, Q. RGB-T object tracking network based on multi-scale modality fusion. J. Shandong Univ. Sci. Technol. 2024, 43, 89–99.
  9. Sun, D.; Pan, Y.; Lu, A.; Li, C.; Luo, B. Transformer RGB-T tracking with spatio-temporal multimodal tokens. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12059–12072.
  10. Fan, H.; Yu, Z.; Wang, Q.; Fan, B.; Tang, Y. QueryTrack: Joint-modality query fusion network for RGB-T tracking. IEEE Trans. Image Process. 2024, 33, 3187–3199.
  11. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 26551–26561.
  12. Li, M.; Zhang, P.; Yan, M.; Chen, H.; Wu, C. Dynamic feature-memory transformer network for RGB-T tracking. IEEE Sens. J. 2023, 23, 19692–19703.
  13. Wu, Y.; Guan, X.; Zhao, B.; Huang, M. Vehicle detection based on adaptive multi-modal feature fusion and cross-modal vehicle index using RGB-T images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8166–8177.
  14. Guo, R.; Guo, X.; Sun, X.; Zhou, P.; Sun, B.; Su, S. Background-aware cross-attention multiscale fusion for multispectral object detection. Remote Sens. 2024, 16, 4034.
  15. Xu, C.; Gao, L.; Liu, Y.; Zhang, Q.; Su, N.; Zhang, S.; Li, T.; Zheng, X. CMShipReID: A cross-modality ship dataset for the reidentification task. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10503–10513.
  16. Cawse-Nicholson, K.; Hook, S.J.; Miller, C.E.; Thompson, D.R. Intrinsic dimensionality in combined visible to thermal infrared imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 12, 4977–4984.
  17. Zhang, L.; Danelljan, M.; Gonzalez-Garcia, A.; van de Weijer, J.; Khan, F.S. Multi-modal fusion for end-to-end RGB-T tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 2252–2261.
  18. Wu, B.; Zhang, R.; Liu, Y. Research on RGB-T multimodal interaction tracking algorithm with improved ViT. Comput. Eng. Appl. 2025, 61, 267–277.
  19. Li, C.; Liu, L.; Lu, A.; Ji, Q.; Tang, J. Challenge-aware RGB-T tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 222–237.
  20. Xiao, Y.; Yang, M.; Li, C.; Liu, L.; Tang, J. Attribute-based progressive fusion network for RGB-T tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; pp. 2831–2838.
  21. Zhang, P.; Zhao, J.; Bo, C.; Wang, D.; Lu, H.; Yang, X. Jointly modeling motion and appearance cues for robust RGB-T tracking. IEEE Trans. Image Process. 2021, 30, 3335–3347.
  22. Zhang, Z.; Liu, A.; Reid, I.; Hartley, R.; Zhuang, B.; Tang, H. Motion Mamba: Efficient and long sequence motion generation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024.
  23. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 23–25 July 2024; pp. 62429–62442.
  24. Kang, B.; Chen, X.; Lai, S.; Liu, Y.; Liu, Y.; Wang, D. Exploring enhanced contextual information for video-level object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025.
  25. Wu, Y.; Yang, X.; Wang, X.; Ye, H.; Zeng, D.; Li, S. MambaNUT: Nighttime UAV tracking via Mamba-based adaptive curriculum learning. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hangzhou, China, 19–25 October 2025.
  26. Yao, M.; Peng, J.; He, Q.; Peng, B.; Chen, H.; Chi, M.; Liu, C.; Benediktsson, J.A. MM-Tracker: Motion Mamba for UAV-platform multiple object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025.
  27. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 341–357.
  28. Hou, J.; Chen, X.; Wu, C.; Zhou, M.; Li, J.; Hong, D. Bilateral adaptive evolution transformer for multispectral image fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5400612.
  29. Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5515620.
  30. Wu, Q.; Yang, T.; Liu, Z.; Wu, B.; Shan, Y.; Chan, A.B. DropMAE: Masked autoencoders with spatial-attention dropout for tracking tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14561–14571.
  31. Li, C.; Xue, W.; Jia, Y.; Qu, Z.; Luo, B.; Tang, J.; Sun, D. LasHeR: A large-scale high-diversity benchmark for RGB-T tracking. IEEE Trans. Image Process. 2022, 31, 392–404.
  32. Li, C.; Zhao, N.; Lu, Y.; Zhu, C.; Tang, J. Weighted sparse representation regularized graph learning for RGB-T object tracking. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1856–1864.
  33. Li, C.; Liang, X.; Lu, Y.; Zhao, N.; Tang, J. RGB-T object tracking: Benchmark and baseline. Pattern Recognit. 2019, 96, 106977. [Google Scholar] [CrossRef]
  34. Zhang, P.; Zhao, J.; Wang, D.; Lu, H.; Ruan, X. Visible-thermal UAV tracking: A large-scale benchmark and new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 8886–8895. [Google Scholar]
  35. Zhang, P.; Wang, D.; Liu, H.; Yang, X. Learning adaptive attribute-driven representation for real-time RGB-T tracking. Int. J. Comput. Vis. 2021, 129, 2714–2729. [Google Scholar] [CrossRef]
  36. Lu, A.; Qian, C.; Li, C.; Tang, J.; Wang, L. Duality-gated mutual condition network for RGB-T tracking. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4118–4131. [Google Scholar] [CrossRef] [PubMed]
  37. Zhang, T.; Guo, H.; Jiao, Q.; Zhang, Q.; Han, J. Efficient RGB-T tracking via cross-modality distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5404–5413. [Google Scholar]
  38. Zhu, Y.; Li, C.L.; Zhao, N.; Tang, J.; Lu, H. Quality-aware feature aggregation network for robust RGB-T tracking. IEEE Trans. Intell. Veh. 2021, 6, 121–130. [Google Scholar] [CrossRef]
  39. Liu, L.; Li, C.L.; Xiao, Y.; Tang, J. Quality-aware RGB-T tracking via supervised reliability learning and weighted residual guidance. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3129–3137. [Google Scholar]
  40. Liu, L.; Li, C.L.; Xiao, Y.; Ruan, R.; Fan, M. RGBT tracking via challenge-based appearance disentanglement and interaction. IEEE Trans. Image Process. 2024, 33, 1753–1767. [Google Scholar] [CrossRef]
  41. Mei, J.; Zhou, D.; Cao, J.; Nie, R.; He, K. Differential reinforcement and global collaboration network for RGB-T tracking. IEEE Sens. J. 2023, 23, 7301–7311. [Google Scholar] [CrossRef]
  42. Qin, Y.; Zhang, J.; Fan, S.; Liu, Z.; Wang, J. MCIT: Multi-level cross-modal interactive transformer for RGB-T tracking. Neurocomputing 2025, 649, 130758. [Google Scholar] [CrossRef]
  43. Cao, B.; Guo, J.; Zhu, P.; Hu, Q. Bi-directional adapter for multimodal tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 927–935. [Google Scholar]
  44. Tang, Z.; Xu, T.; Wu, X.; Zhu, X.-F.; Kittler, J. Generative-based fusion mechanism for multimodal tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 5189–5197. [Google Scholar]
  45. Hu, X.; Zhong, B.; Liang, Q.; Zhang, S.; Li, N.; Li, X. Toward modalities correlation for RGB-T tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9102–9111. [Google Scholar] [CrossRef]
  46. Zhang, J.; Qin, Y.; Fan, S.; Xiao, Z.; Zhang, J. SiamTFA: Siamese triple-stream feature aggregation network for efficient RGB-T tracking. IEEE Trans. Intell. Transp. Syst. 2024, 26, 1900–1913. [Google Scholar] [CrossRef]
  47. Lu, A.; Wang, W.; Li, C.L.; Tang, J.; Luo, B. AFTER: Attention-based fusion router for RGB-T tracking. IEEE Trans. Image Process. 2025, 34, 4386–4401. [Google Scholar] [CrossRef]
  48. Hu, Y.; Shao, Z.; Fan, B.; Liu, H. Dual-level modality de-biasing for RGB-T tracking. IEEE Trans. Image Process. 2025, 34, 2667–2679. [Google Scholar] [CrossRef] [PubMed]
  49. Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9516–9526. [Google Scholar]
Figure 1. Comparison of precision and speed between MIFMNet and state-of-the-art trackers on the LasHeR dataset.
Figure 2. Three typical architectures in tracking networks. (a) CNN. (b) Transformer. (c) Mamba.
Figure 3. The overall architecture of MIFMNet. First, two-channel images are embedded through Transformers for joint feature extraction. Next, complementary modalities are enhanced via differential and parallel convolutional processing using SDEM. Subsequently, the enhanced features from each layer are input into FMIM for multilayer interaction guided by optical flow information. Finally, the output features are fed into the tracking head to achieve localization.
Figure 4. The evaluation on the LasHeR dataset compared with other RGBT trackers.
Figure 5. Precision for challenging attributes on the LasHeR dataset.
Figure 6. Qualitative comparison of MIFMNet with four advanced trackers across two sequences. (a) The yellow skirt: partially occluded target. (b) The yellow girl: severely occluded target. In the figure, # denotes the frame number in the sequence.
Figure 7. Qualitative comparison of MIFMNet with four advanced trackers across two sequences. (a) The white riding bike: rapid motion with strong light interference. (b) The umbrella: target scale variation. In the figure, # denotes the frame number in the sequence.
Figure 8. Memory usage comparison at different resolutions.
Table 1. PR, NPR and SR scores (%) of MIFMNet against other RGBT trackers on four public benchmarks, with the best, second-best, and third-best results labeled in red, blue, and green, respectively.

| Methods | Backbone | LasHeR SR | LasHeR NPR | LasHeR PR | RGBT210 SR | RGBT210 PR | RGBT234 SR | RGBT234 PR | VTUAV SR | VTUAV PR | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mfDiMP [17] | ResNet50 | 34.3 | 39.5 | 44.7 | 55.5 | 78.6 | 42.8 | 64.6 | 55.4 | 67.3 | 10.3 |
| ADRNet [35] | VGG-M | 30.8 | 39.5 | 44.4 | 53.4 | 77.8 | 57.1 | 80.9 | 46.6 | 62.2 | 25.0 |
| CAT [19] | VGG-M | 31.4 | 39.5 | 45.0 | 53.3 | 79.2 | 56.1 | 80.4 | - | - | 20 |
| HMFT [34] | ResNet50 | 32.6 | 41.3 | 46.0 | 53.5 | 78.6 | 56.8 | 78.8 | 62.7 | 75.8 | 30.2 |
| DMCNet [36] | VGG-M | 35.5 | 43.1 | 49.0 | 55.5 | 79.7 | 59.3 | 83.9 | - | - | 2.3 |
| CMD [37] | ResNet50 | 46.4 | 54.6 | 59.0 | - | - | 58.4 | 82.4 | - | - | 30 |
| FANet [38] | VGG-M | 30.9 | 38.4 | 44.1 | - | - | 55.3 | 78.7 | - | - | 19 |
| QAT [39] | ResNet50 | 50.1 | 59.6 | 64.2 | 61.9 | 86.7 | 64.4 | 88.4 | 66.7 | 80.1 | 22 |
| CAT++ [40] | VGG-M | 35.6 | 44.4 | 50.9 | 56.1 | 82.2 | 59.2 | 84.0 | - | - | - |
| DRGCNet [41] | VGG-M | 33.8 | 42.3 | 48.3 | - | - | 58.1 | 82.5 | - | - | 4.9 |
| APFNet [20] | VGG-M | 36.2 | 43.9 | 50.0 | - | - | 57.9 | 82.7 | - | - | 1.3 |
| MCIT [42] | ViT-B | 50.5 | - | 64.5 | 56.5 | 80.8 | 59.5 | 83.1 | 70.6 | 85.3 | - |
| TBSI [4] | ViT-B | 55.6 | 65.7 | 69.2 | 62.5 | 85.3 | 63.7 | 87.1 | - | - | 36.2 |
| BAT [43] | ViT-B | 56.3 | - | 70.2 | - | - | 64.1 | 86.8 | - | - | - |
| GMMT [44] | ViT-B | 56.6 | 67.0 | 70.7 | - | - | 64.7 | 87.9 | - | - | - |
| MCTrack [45] | ViT-B | 57.1 | 67.6 | 71.6 | - | - | 65.6 | 87.5 | - | - | - |
| SiamTFA [46] | ViT-B | 48.1 | - | 62.5 | 56.3 | 79.7 | 59.2 | 82.2 | 67.9 | 82.1 | 37.0 |
| AFTER [47] | ViT-B | 55.1 | 65.8 | 70.3 | 63.5 | 87.6 | 66.7 | 90.1 | 72.5 | 84.9 | 20.9 |
| DMD [48] | ViT-B | 57.6 | 68.6 | 72.6 | 63.7 | 87.0 | 66.7 | 89.3 | - | - | 17 |
| STMT [9] | ViT-B | 53.7 | 63.4 | 67.4 | 59.5 | 83.0 | 63.4 | 67.4 | - | - | 39.1 |
| Ours | ViT-B | 57.8 | 69.1 | 73.2 | 64.3 | 86.7 | 66.9 | 88.7 | 74.8 | 86.5 | 35.3 |
Table 2. SR/PR scores (%) of MIFMNet against four RGBT trackers across 19 attributes on the LasHeR dataset. The best and second-best results are in red and blue, respectively.

| Attribute | APFNet | mfDiMP | ViPT | TBSI | Ours |
|---|---|---|---|---|---|
| NO | 46.7/66.7 | 57.5/76.5 | 68.4/84.1 | 74.1/91.4 | 72.4/88.4 |
| PO | 34.5/47.3 | 30.8/39.7 | 50.3/62.4 | 54.0/67.8 | 55.6/70.7 |
| TO | 31.4/41.7 | 25.0/32.2 | 46.1/57.6 | 51.0/64.3 | 51.9/66.0 |
| HO | 27.7/27.1 | 19.8/23.8 | 43.8/43.7 | 53.4/60.6 | 55.1/63.5 |
| MB | 32.8/45.9 | 28.7/37.6 | 45.9/57.3 | 49.5/63.1 | 51.9/67.0 |
| LI | 30.8/41.8 | 23.8/29.6 | 41.2/49.8 | 49.3/61.3 | 51.5/64.3 |
| HI | 41.2/60.4 | 35.1/46.7 | 54.2/67.8 | 58.2/73.8 | 62.2/79.5 |
| AIV | 26.2/32.1 | 16.6/16.4 | 34.2/36.3 | 49.8/58.2 | 50.2/59.1 |
| LR | 29.4/46.1 | 25.6/40.2 | 41.6/56.4 | 47.3/63.9 | 46.0/63.1 |
| DER | 36.8/45.8 | 34.2/40.3 | 55.7/67.4 | 58.7/71.6 | 61.0/75.1 |
| BC | 33.7/44.9 | 27.0/34.9 | 51.8/64.9 | 55.7/69.9 | 56.3/70.3 |
| SA | 31.7/42.8 | 29.5/37.2 | 46.5/57.3 | 50.2/62.2 | 50.7/64.2 |
| CM | 47.7/35.1 | 30.6/40.8 | 50.0/62.1 | 55.0/69.5 | 55.7/70.5 |
| TC | 31.6/43.1 | 28.8/38.0 | 46.0/57.3 | 50.1/62.6 | 51.6/65.3 |
| FL | 27.9/37.6 | 25.7/32.3 | 46.5/59.1 | 47.5/60.9 | 52.1/66.3 |
| OV | 34.2/36.4 | 34.9/40.6 | 65.0/76.2 | 55.9/64.6 | 64.0/74.2 |
| FM | 33.9/45.1 | 32.4/41.3 | 51.4/63.1 | 55.7/69.4 | 57.0/71.4 |
| SV | 36.0/49.8 | 34.9/45.2 | 52.5/65.0 | 56.2/70.2 | 58.2/73.4 |
| ARC | 31.0/40.5 | 30.9/37.8 | 49.5/59.3 | 52.5/64.3 | 55.3/68.1 |
| ALL | 36.2/50.0 | 34.3/44.7 | 52.5/65.1 | 56.3/70.5 | 57.8/73.2 |
Table 3. Comparison of SR/NPR/PR scores (%) across different components on the LasHeR dataset. The best and second-best results are in red and blue, respectively. Notes: A = scale differential enhanced Mamba (SDEM) module; B = flow-guided multilayer interaction Mamba (FMIM) module; √ = module added; × = module not added; √* = FMIM module with the optical flow guidance mechanism removed.

| A | B | Resolution | SR | NPR | PR |
|---|---|---|---|---|---|
| × | × | 256 × 256 | 47.8 | 55.8 | 60.4 |
| √ | × | 256 × 256 | 55.3 | 63.6 | 66.9 |
| × | √ | 256 × 256 | 54.6 | 63.1 | 66.7 |
| × | √* | 256 × 256 | 54.1 | 62.3 | 65.2 |
| √ | √ | 256 × 256 | 57.8 | 69.1 | 73.2 |
| √ | √ | 384 × 384 | 58.3 | 69.5 | 73.8 |
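One way to read the ablation in Table 3 is as absolute gains in percentage points. The short sketch below computes the gain of the full model over the baseline without either module, and the gain from raising the input resolution from 256 × 256 to 384 × 384. The numbers are copied from the table; the variable and function names are illustrative and not from the article, and the last two rows are taken to be the full model because their 256 × 256 scores match the "Ours" row of Table 1.

```python
# (SR, NPR, PR) in percent, copied from Table 3 (LasHeR dataset).
# Names are illustrative; they do not come from the article.
baseline = (47.8, 55.8, 60.4)   # neither SDEM nor FMIM, 256 x 256 input
full_256 = (57.8, 69.1, 73.2)   # assumed full model, 256 x 256 input
full_384 = (58.3, 69.5, 73.8)   # assumed full model, 384 x 384 input


def gain(new, old):
    """Absolute gain in percentage points for each of SR, NPR and PR."""
    return tuple(round(n - o, 1) for n, o in zip(new, old))


module_gain = gain(full_256, baseline)
resolution_gain = gain(full_384, full_256)

for label, values in (("modules", module_gain), ("resolution", resolution_gain)):
    sr, npr, pr = values
    print(f"{label}: SR +{sr}, NPR +{npr}, PR +{pr}")
```

Read this way, adding both modules accounts for a much larger gain than the higher input resolution does.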
Table 4. Comparison of SR/NPR/PR scores (%) across different layers on the LasHeR dataset. The best results are highlighted in bold.

| Layers | SR | NPR | PR |
|---|---|---|---|
| 1 | 55.3 | 66.7 | 75.3 |
| 3 | 56.4 | 67.3 | 75.9 |
| 6 | 57.2 | 68.2 | 77.5 |
| 12 | 57.8 | 69.1 | 78.2 |
Table 5. Comparison with SOTA trackers on the LasHeR dataset, including SR/NPR scores (%), parameters, FLOPs, and FPS. The best results are highlighted in bold.

| Methods | SR | NPR | Parameters | FLOPs | FPS |
|---|---|---|---|---|---|
| GMMT | 56.6 | 70.7 | 962.2 M | 146.5 G | 22.4 |
| TBSI | 55.6 | 69.2 | 99.3 M | 38.5 G | 36.2 |
| DMD | 57.6 | 68.2 | 121.6 M | 102.2 G | 16.8 |
| MIFMNet | 57.8 | 69.1 | 17.2 M | 5.6 G | 35.3 |
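The efficiency figures in Table 5 are easier to compare as ratios. The sketch below, using the TBSI and DMD rows as examples, expresses each baseline's parameter count and FLOPs as a multiple of MIFMNet's. The dictionary layout and the function name `relative_cost` are assumptions made for illustration; only the numbers come from the table.

```python
# (parameters in millions, FLOPs in G, FPS), copied from Table 5.
# The data layout and names are illustrative, not from the article.
TRACKERS = {
    "TBSI": (99.3, 38.5, 36.2),
    "DMD": (121.6, 102.2, 16.8),
    "MIFMNet": (17.2, 5.6, 35.3),
}


def relative_cost(name, reference="MIFMNet"):
    """Return (parameter ratio, FLOPs ratio) of `name` relative to `reference`."""
    params, flops, _ = TRACKERS[name]
    ref_params, ref_flops, _ = TRACKERS[reference]
    return params / ref_params, flops / ref_flops


for name in ("TBSI", "DMD"):
    param_ratio, flops_ratio = relative_cost(name)
    print(f"{name}: {param_ratio:.1f}x parameters, {flops_ratio:.1f}x FLOPs")
```

On these figures, both baselines need several times the parameters and FLOPs of MIFMNet, while TBSI remains slightly faster in FPS (36.2 versus 35.3).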
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Guo, R.; Sun, X.; Sun, B.; Qian, H.; Dang, Z.; Zhou, P.; Liu, F.; Su, S. MIFMNet: A Multimodal Interactions and Fusion Mamba for RGBT Tracking with UAV Platforms. Remote Sens. 2026, 18, 1026. https://doi.org/10.3390/rs18071026