Unified Multi-Modal Object Tracking Through Spatial–Temporal Propagation and Modality Synergy

Jiajia Wu; Haorui Zuo; Yuxing Wei; Meihui Li; Jianlin Zhang

doi:10.3390/jimaging11120421

,

and

¹

State Key Laboratory of Optical Field Manipulation Science and Technology, Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China

²

National Laboratory on Adaptive Optics, Chengdu 610209, China

³

Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China

^*

Author to whom correspondence should be addressed.

J. Imaging2025, 11(12), 421;https://doi.org/10.3390/jimaging11120421
(registering DOI)

This article belongs to the Section Computer Vision and Pattern Recognition

Version Notes

Order Reprints

Abstract

Multi-modal object tracking (MMOT) has received widespread attention for the ability to overcome single-sensor perception limitations. However, existing methods encounter several critical challenges. Representation learning and generalization capabilities of models are constrained by the inherent heterogeneity of cross-task multi-modal data and inter-modal synergy imbalance. Particularly, in dynamically changing complex scenarios, the reliability and stability of data significantly degrade, further exacerbating the difficulty in multi-modal consistent perception and aggregation. To tackle the above issues, we propose SMUTrack, a unified framework with global shared parameters integrating three downstream MMOT tasks. SMUTrack implements a batch merging-and-splitting alternating strategy, coupled with multi-task joint training, to establish latent correlations across inter- and intra-task modalities, effectively avoiding over-reliance on certain modalities. Concurrently, we design a hierarchical modality synergy and reinforcement (HMSR) module, and a gated fusion and context awareness (GFCA) module to enable progressive multi-modal information exchange and integration, yielding the more discriminative and robust multi-modal representation. More importantly, we introduce a spatial–temporal information propagation (SIP) mechanism, which synchronously learns object trajectory cues and appearance variations to effectively build contextual relationships in long-term video tracking. Experimental results definitively validate the outstanding performance of SMUTrack on mainstream MMOT datasets, exhibiting its powerful adaptability to various MMOT tasks.

Keywords:

unified multi-modal tracking; spatial-temporal propagation; modality synergy; adaptive perception fusion

1. Introduction

Single object tracking (SOT) [,,,,], a fundamental task in computer vision, aims to continuously and accurately locate a specific object in a video sequence while capturing its bounding box and motion trajectory. Its breakthroughs are crucial for various applications such as autonomous driving [,], intelligent monitoring [,], and human–computer interaction [,]. While RGB-based object tracking technology has made significant progress, it still struggles with various interferences in complex open-world environments, including low illumination [], occlusion [], similar appearance [], motion blur [], etc. In these extreme conditions, only relying on RGB features is insufficient to comprehensively capture the appearance and motion state of objects, severely limiting tracking performance.

Multi-modal object tracking (MMOT) [,,,,] offers an effective solution to break through the performance bottleneck of single-modal object tracking (SMOT). The key distinction lies in that MMOT introduces a supplementary modality (X) such as depth (D) [], thermal (T) [], and event (E) [], and explores the complementary relationship between RGB and X modalities to strengthen the feature representation by deploying additional multi-modal feature interaction and fusion process, thereby mitigating reliance on single-modal data (typically RGB). Nevertheless, previous MMOT trackers generally present several critical challenges. Most works [,,] are devoted to designing a network architecture for a specific RGB-X tracking task (Figure 1a) without considering potential modality relationships across tasks. Although some studies [,] have proposed general models (Figure 1b), parameters need to be retrained for different RGB-X tasks. These general models typically prioritize RGB and regard X modality as auxiliary, which leads to the unique semantics of X modality to be underutilized. Overall, inter-task independence and intra-task synergy imbalance make it difficult for models to overcome the inherent heterogeneity of various modality data and learn effective global representations, thereby limiting the generalization ability. Moreover, tracking is a dynamic process that frequently encounters data quality degradation as real-world scenarios continuously change. This necessitates effectively identifying and suppressing hazardous information, placing higher demands on cross-modal synergy perception. Simultaneously, this process may involve object occlusion and deformation, resulting in incomplete or invalid perception. However, existing methods focus more on synchronous fusion of multi-modal features, while neglecting explicit modeling for long-term context and spatial–temporal dependencies in video sequences, which makes trackers struggle to handle such scenarios.

Figure 1. Comparison of multi-modal object tracking frameworks. (a) Task-specific multi-modal trackers primarily adopt symmetric dual-encoding architectures, tailored for a single RGB-X task. (b) Task-general multi-modal trackers typically treat X modality as auxiliary and introduce prompt learning to accommodate diverse multi-modal tracking tasks within a unified framework. (c) Our proposed SMUTrack is a powerful unified model that introduces batch operations and spatial–temporal information propagation in addition to enhancing inter-modal collaboration, hence diminishing cross-task and cross-modality barriers by one training session.

To solve the above issues, this paper proposes a unified multi-modal object tracking framework, named SMUTrack, which comprehensively excavates spatial–temporal information and inter-modal synergy relationships in video sequences to promote tracking robustness and generalization ability. Different from treating X modality as prompt cues, SMUTrack regards RGB and X modalities equally by alternately performing batch merging and batch splitting, allowing RGB and diverse X modalities to entirely share an encoder, as shown in Figure 1c. SMUTrack can achieve seamless adaptation to three downstream MMOT tasks (i.e., RGB-D, RGB-T, and RGB-E) with only one training session. Specifically, in terms of spatial–temporal modeling, we design a spatial–temporal information propagation (SIP) mechanism that integrates temporal prompt learning into multi-modal representation learning. This mechanism leverages historical temporal prompts to guide the temporal state generation and object feature enhancement in the current frame, thus facilitating the flow of temporal information across frames. To dynamically adapt to object appearance variations during tracking, the SIP mechanism further introduces a template update strategy with long-term and short-term memory. Regarding multi-modal feature exchange and integration, we first construct a hierarchical modality synergy and reinforcement (HMSR) module, which embeds mamba synergy prompt blocks (MSPBs) into three stages of the transformer encoder. This design enables RGB and X modality features to interact fully at different levels, progressively enhancing the feature representation capabilities of each modality. To achieve deep integration of multi-modal features, we then develop a gated fusion and context awareness (GFCA) module. In this module, the gated fusion unit (GFU) comprehensively evaluates the importance of each modality based on image tokens, adaptively selecting and fusing effective information. Meanwhile, the context awareness unit (CAU) utilizes mamba to adequately integrate multi-modal fusion features from different levels, obtaining representations with global contextual information.

The primary contributions in this article can be encapsulated as follows.

(1) A unified multi-modal tracking framework, SMUTrack, is proposed to fully explore spatial–temporal information and inter-modal synergy relationships. SMUTrack supports three MMOT tasks (e.g., RGB-D, RGB-T, and RGB-E) in parallel, and the four modalities share a same encoding backbone entirely.

(2) We propose a SIP mechanism, which incorporates temporal evolution and templates updating to explicitly model long-term contextual relationships in video sequences. This improves tracking robustness in challenging scenes, typically involving occlusions and changing appearance.

(3) We establish an HMSR module and a GFCA module to accomplish progressive interaction and adaptive perception fusion for cross-modal features, remarkably boosting the model’s ability in representational learning.

(4) Comprehensive experimental results revealed that SMUTrack attains cutting-edge performance on five mainstream MMOT datasets, exhibiting powerful adaptability to diverse tasks and environments.

The subsequent content is arranged as follows. Section 2 reviews and analyzes relevant research on multi-modal object tracking and spatial–temporal modeling. Section 3 elaborates the proposed SMUTrack. Section 4 delivers experimental results and discussion. Section 5 summarizes the entire work and prospects future research directions.

2. Related Works

2.1. Multi-Modal Object Tracking

As an extension of object tracking, multi-modal object tracking (MMOT) provides richer semantic clues for object representation in complex scenes by integrating complementary information from RGB and X modalities (e.g., depth, thermal, and event). It effectively copes with challenges such as occlusion, extreme weather, and drastic appearance changes. Existing MMOT methods can be categorized broadly into two types (Table 1): (1) Task-specific methods, which design independent network structures and functional modules tailored to specific RGB-X data. For instance, AMATrack [] proposes an asymmetric mixed attention module for RGB-D task to build intra-modal and inter-modal relations. AMNet [] achieves spatial alignment and fusion of RGB-T features through a mutual-interacted spatial alignment (MSA) module and an information matching fusion (IMF) module. GMMT [] employs a generative model to augment RGB-T information fusion. RT-MDNet [] introduces a cross-modality transformer for RGB-E tasks, enabling efficient fusion of visible and event data. (2) Task-general methods, which aim to make a single model compatible with multiple RGB-X tasks. For example, ViPT [] incorporates prompt learning to streamline auxiliary modalities into a few visual prompts for adapting various MMOT tasks. SDSTrack [] employs a lightweight adapter to transfer the feature extraction capability of the pre-trained model to other modalities. However, these methods still require separate training parameters for different MMOT tasks, resulting in high training costs and poor generalization capabilities. To address this issue, Un-Track [] proposes a tracker that learns a common latent space using low-rank factorization and reconstruction techniques, requiring only a single parameter set for any MMOT task. Nevertheless, treating X modalities as prompts tends to make the fusion feature dominated by RGB modality, which hinders information interaction and complementary fusion between modalities. To overcome the above issues regarding modality synergy imbalance and limited generalization ability, this paper proposes a straightforward yet effective unified MMOT framework, called SMUTrack. SMUTrack allows any modality (i.e., RGB, depth, thermal, and event) to share the same encoder. It embeds multiple mamba synergy prompt blocks (MSPBs) into the transformer encoder, achieving the relationship modeling and information flow of multi-modal features during the encoding process. Additionally, SMUTrack integrates gated fusion units (GFUs) and context awareness units (CAUs) to progressively extract fusion features with strong representational capability.

Table 1. Overview of related work on multi-modal object tracking.

2.2. Spatial–Temporal Modeling

Spatial–temporal information is crucial for object tracking. It can accurately capture the trajectory and appearance change trends of moving objects, and effectively deal with challenges such as object occlusion, temporary disappearance, and deformation. Therefore, related studies have extensively explored the application of spatial–temporal information in object tracking. These methods primarily focus on two aspects (Table 2): (1) They update appearance representations [,,,], which typically adopt template update strategies to capture real-time changes in object appearance. (2) They emphasize temporal correlation modeling [,,]. For instance, TCTrack [] proposes an online temporally adaptive convolution and an adaptive temporal transformer to enhance spatial features and refine similarity maps by leveraging temporal information. AQATrack [] introduces queries to learn spatial–temporal information, and proposes a spatial–temporal information fusion module for aggregating static appearance and instantaneous changes. ARTrack [] is a time-autoregressive framework for modeling the sequential evolution of trajectories, using previous tracking results as subsequent spatial–temporal cues. These studies confirm the significant role of spatial–temporal cues. However, most existing MMOT methods focus on exploring inter-modal complementarity and fail to fully utilize spatial–temporal information. Moreover, traditional sparse template update approaches struggle to establish stable spatial–temporal associations between search frames and template frames. Therefore, we propose a spatial–temporal information propagation (SIP) mechanism, and it introduces a set of learnable and autoregressive modal-specific temporal tokens, combined with a temporal guided attention mechanism (TGAM) and a template update strategy with long-term and short-term memory. This mechanism holistically associates spatial–temporal information into each stage of the MMOT framework.

Table 2. Overview of related work on spatial–temporal modeling.

3. Methods

In this paper, we elaborate a unified multi-modal tracker by leveraging spatial–temporal propagation and modality synergy, termed SMUTrack. The tracker can simultaneously adapt to three downstream MMOT tasks (i.e., RGB-D, RGB-T, and RGB-E) with one training parameter set. As illustrated in Figure 2, the core architecture of SMUTrack comprises a transformer encoder, a hierarchical modality synergy and reinforcement (HMSR) module, a gated fusion and context awareness (GFCA) module, and a spatial–temporal information propagation (SIP) mechanism. These components are designed to deeply exploit inter-task and intra-task modality correlations, as well as spatial–temporal prompts in video sequences, thereby achieving robust tracking performance.

Figure 2. Overall architecture of the proposed SMUTrack. Diverse modality images are first converted into token embeddings and fed into a transformer encoder alongside temporal tokens to extract multi-modal features and temporal prompts. Then, an HMSR module and a GFCA module progressively interact and integrate multi-modal spatial–temporal information. Finally, a SIP mechanism optimizes the fusion feature for prediction, and updates templates.

3.1. Feature Encoding

Different from most existing tracking methods that use X-modality (

X \in {D, T, E}

) as auxiliary input, our approach alternately performs batch merging and batch splitting for multi-modal feature extraction and interaction. This design endows RGB and X modalities with equal importance, effectively preventing dominance by any single modality and establishing intra-task modality relationships. Simultaneously, batch merging processing can significantly improve feature encoding efficiency. Specifically, given a pair of RGB-X search frames

{S_{R G B}^{i}, S_{X}^{i}} \in ℝ^{3 \times H_{S} \times W_{S}}

and M pairs of template frames

{T_{R G B}^{m}, T_{X}^{m}}_{m = 1}^{M} \in ℝ^{3 \times H_{T} \times W_{T}}

for frame

i

, all frames are first projected into

P \times P

patch sequences and flattened through embedding layers, obtaining search frame tokens

{{\hat{S}}_{R G B}^{i}, {\hat{S}}_{X}^{i}} \in ℝ^{N_{s} \times D}

and template frame tokens

{{\hat{T}}_{R G B}^{m}, {\hat{T}}_{X}^{m}} \in ℝ^{N_{T} \times D}

with position embeddings.

D = 3 P^{2}

is embedding dimension, and

N_{S} = H_{S} W_{S} / P^{2}

,

N_{T} = H_{T} W_{T} / P^{2}

are token numbers. Then, we concatenate the token embeddings from the template and search frames with the introduced modal-specific temporal tokens

τ_{R G B}^{i}

,

τ_{X}^{i} \in ℝ^{1 \times D}

(details in Section 3.2), so that

H_{R G B}^{i, 0} = [τ_{R G B}^{i}, {\hat{T}}_{R G B}^{1}, \dots, {\hat{T}}_{R G B}^{M}, {\hat{S}}_{R G B}^{i}] \in ℝ^{(1 + M N_{Z} + N_{S}) \times D},

(1)

H_{X}^{i, 0} = [τ_{X}^{i}, {\hat{T}}_{X}^{1}, \dots, {\hat{T}}_{X}^{M}, {\hat{S}}_{X}^{i}] \in ℝ^{(1 + M N_{Z} + N_{S}) \times D}

(2)

Then,

H_{R G B}^{i, 0}

and

H_{X}^{i, 0}

are merged along the batch dimension and form the final token embedding

H_{b m}^{i, 0} = [H_{R G B}^{i, 0}; H_{X}^{i, 0}] \in ℝ^{2 \times N \times D}

,

N = 1 + M N_{Z} + N_{S}

, which is fed into the transformer encoder to synchronously extract the features of RGB and X modalities, and the corresponding modality-specific temporal prompts.

Furthermore, we co-train multiple MMOT tasks in a unified process by standardizing the quantity of different multi-modal data per batch. This enables the feature encoder to more effectively learn inter-task modality correlations. Through joint optimization of inter-task and intra-task modality interactions, the task adaptability and robustness of the model are comprehensively improved.

3.2. Spatial–Temporal Information Propagation

Existing multi-modal object trackers predominantly emphasize cross-modal feature interaction while neglecting the inherent spatial–temporal properties in video tracking. When confronted with challenging scenarios involving incomplete or failed perception, such as occlusion or dramatic appearance variations, tracking performance significantly deteriorates. Therefore, we propose a spatial–temporal information propagation (SIP) mechanism that effectively captures inter-frame dependencies in long-sequence tracking. Our tracking pipeline can handle arbitrary-length RGB-X videos, as illustrated in Figure 3. Specifically, learnable empty temporal tokens

τ_{E R}^{1}

and

τ_{E X}^{1}

are first embedded into the multi-modal tokens of the initial frame (refer to Equations (1) and (2)). Then,

H_{b m}^{1, 0}

is obtained and processed at the visual encoder and interaction layer to establish intrinsic spatial–temporal relationships, thereby learning modality-specific temporal prompts, ultimately producing a learned fusion temporal prompt

τ_{F}^{1} \in ℝ^{1 \times D}

. These prompts carry clues about the object trajectory. The learned prompt

τ_{F}^{i}

from the current frame serves as historical context for the next frame, enabling temporal information to flow seamlessly and gradually across frames, and the process can be formally expressed as,

τ_{R G B}^{i + 1} = τ_{E R}^{i + 1} + τ_{F}^{i}, τ_{E R}^{i + 1} \in ℝ^{1 \times D},

(3)

τ_{X}^{i + 1} = τ_{E X}^{i + 1} + τ_{F}^{i}, τ_{E X}^{i + 1} \in ℝ^{1 \times D}

(4)

where

τ_{E R}^{i + 1}

and

τ_{E X}^{i + 1}

are the initialized learnable temporal tokens for RGB and X modalities at frame

i + 1

, respectively. We designed a temporal guided attention mechanism (TGAM) to further improve the temporal and trajectory perception of the final fusion feature. Specifically, TGAM calculates an attention weighting

W_{τ}^{i}

for the search region feature using the temporal prompt

τ_{F}^{i}

of the current frame,

W_{τ}^{i} = {\tilde{S}}_{F}^{i} \cdot {(τ_{F}^{i})}^{⊤}, W_{τ}^{i} \in ℝ^{N_{S} \times 1}

(5)

where

(\cdot)

is matrix multiplication.

{\tilde{S}}_{F}^{i} \in ℝ^{N_{S} \times D}

represents the fusion feature of the search region. See Section 3.4 for calculation

{\tilde{H}}_{F}^{i} = [τ_{F}^{i}, {\tilde{T}}_{F}^{i}, {\tilde{S}}_{F}^{i}]

. The weighting

W_{τ}^{i}

comprehensively analyzes the association between each token in the search region and the temporal prompt in a pixel-by-pixel manner. At the end, we introduce a residual connection to generate feature

{\hat{S}}_{F}^{i}

for the tracking prediction head,

{\hat{S}}_{F}^{i} = {\tilde{S}}_{F}^{i} + {\tilde{S}}_{F}^{i} \otimes W_{τ}^{i}

(6)

where

\otimes

is element-wise multiplication. This operation can effectively preserve the discriminative details about the original feature. Overall, the introduction of modality-specific temporal tokens enables spatial–temporal interactions throughout the entire tracking pipeline, establishing a global topological relationship among historical trajectories, template regions, and search regions. This remarkably augments the tracker’s capacity for continuous object tracking.

Figure 3. Spatial–temporal information propagation in SMUTrack. Our tracking pipeline introduces modality-specific temporal tokens and progressively learns a temporal prompt during feature extraction and interaction, propagating it to the next frame.

To cope with drastic changes in object appearance, we design a dynamic template update strategy featuring long-term and short-term memory. Specifically, we build a historical template library and use the confidence score map from the tracking prediction head to ascertain whether to add the present tracking frame into the library, with a threshold of 0.65. To ensure the reliability of tracking templates while accommodating long-term evolution in object status, we select an original, a middle, and a latest template frame from this library as new templates for subsequent frame tracking.

3.3. Hierarchical Modality Synergy and Reinforcement

Effective cross-modal feature interaction is pivotal for improving tracking performance. It enables trackers to better understand scene information across different modalities, extracting the essence while discarding the irrelevant. To this end, we propose a hierarchical modality synergy and reinforcement (HMSR) module. This module deeply excavates the intrinsic correlation between RGB and X modality features through the mamba synergy prompt block (MSPB), achieving comprehensive interaction and complementary reconstruction of cross-modal information. Specifically, we divide the transformer encoder into three feature extraction stages, low, medium, and high levels, and embed MSPB at the end of each stage. By leveraging the efficient sequence modeling capability of mamba and the cross-modal synergy prompts, MSPB achieves deep interaction and enhancement of multi-modal features, greatly boosting the representation learning ability of the encoder.

As shown in Figure 4, MSPB adopts two parallel mamba branches to enhance the RGB and X modality features at each encoding stage. The output features

H_{b m}^{i, j} = [H_{R G B}^{i, j}; H_{X}^{i, j}]

,

j \in [1, 2, 3]

from each encoding stage are separated into RGB and X modalities

H_{R G B}^{i, j}

,

H_{X}^{i, j}

via batch splitting and fed into MSPB. In MSPB, the two modal features are first concatenated along the embedding dimension. This is followed by a reduction–expansion bottleneck (REB) structure that eliminates redundant and unreliable information and extracts modality-specific prompts

p_{R G B}^{i, j}

,

p_{X}^{i, j} \in ℝ^{N \times D}

,

p_{R G B}^{i, j}, p_{X}^{i, j} = ϕ_{R E B}^{j} (C o n c a t (H_{R G B}^{i, j}, H_{X}^{i, j}))

(7)

where

ϕ_{R E B}^{j} (\cdot)

represents REB structure operation. Next,

H_{R G B}^{i, j} = [τ_{R G B}^{i, j}, T_{R G B}^{i, j}, S_{R G B}^{i, j}]

and

H_{X}^{i, j} = [τ_{X}^{i, j}, T_{X}^{i, j}, S_{X}^{i, j}]

enter the main branches with the state space model (SSM) to capture long-range dependencies from all token sequences, and then multiply with the corresponding modality-specific prompts. Meanwhile, residual connections are introduced to retain the key information of the original features. The above process can be formalized as

{\tilde{H}}_{R G B}^{i, j} = H_{R G B}^{i, j} + p_{R G B}^{i, j} \otimes M_{R G B}^{j} (H_{R G B}^{i, j}),

(8)

{\tilde{H}}_{X}^{i, j} = H_{X}^{i, j} + p_{X}^{i, j} \otimes M_{X}^{j} (H_{X}^{i, j})

(9)

where

M_{R G B}^{j}

and

M_{X}^{j}

represent the operations for the RGB and X modality branches, respectively, before introducing the prompt. Perform batch merging on enhanced modality features

{\tilde{H}}_{R G B}^{i, j} = [{\tilde{τ}}_{R G B}^{i, j}, {\tilde{T}}_{R G B}^{i, j}, {\tilde{S}}_{R G B}^{i, j}]

and

{\tilde{H}}_{X}^{i, j} = [{\tilde{τ}}_{X}^{i, j}, {\tilde{T}}_{X}^{i, j}, {\tilde{S}}_{X}^{i, j}]

as input for the next encoding stage. Obviously, MSPB ensures the intra- and inter-modality information exchange across spatial and temporal dimensions.

Figure 4. Details of the mamba synergy prompt block (MSPB). SSM, state space model.

σ

denotes SiLU activation function.

3.4. Gated Fusion and Context Awareness

To achieve deep fusion between enhanced RGB and X modality features, we introduce a gated fusion unit (GFU) and a context awareness unit (CAU). GFU jointly processes features from both modalities, learning multi-modal weight factors to perform adaptive fusion on each token sequence. This design allows the model to dynamically modulate modality contributions and adaptively select critical information, significantly improving the discriminability and robustness of multi-modal representation. As depicted in Figure 5, GFU first concatenates the input features of the two modalities along the embedding dimension. Then, it progressively reduces dimensionality through multiple linear layers, obtains weight factors

α^{i, j}

,

β^{i, j} \in ℝ^{N \times 1}

via the sigmoid activation function, and finally performs weighted fusion to generate multi-modal fusion feature

H_{F}^{i, j}

.

H_{F}^{i, j} = α^{i, j} \otimes {\tilde{H}}_{R G B}^{i, j} + β^{i, j} \otimes {\tilde{H}}_{X}^{i, j}

(10)

Figure 5. Schematic diagram of the gated fusion unit (GFU).

Furthermore, CAU progressively incorporates low-level fusion features rich in detailed information to improve the expressive capacity of high-level fusion features. As shown in Figure 6, in CAU, the low-level fusion feature

H_{F}^{i, j - 1}

is fed into a branch containing an SSM to model long-range dependencies for all fusion token sequences. Concurrently, the high-level feature

H_{F}^{i, j}

enters another branch containing a linear layer and an SiLU activation layer. Next, element-wise multiplication is performed on the outputs of these two branches, utilizing the high-level feature as guidance to enhance key information and suppress noise in the low-level feature. Additionally, a residual connection is introduced in CAU. The final fusion feature

{\tilde{H}}_{F}^{i} = [τ_{F}^{i}, {\tilde{T}}_{F}^{i}, {\tilde{S}}_{F}^{i}]

of the i-th frame is obtained after the last CAU, which is then input into the TGAM and tracking head for prediction. In SMUTrack, this hierarchical fusion strategy enables the fusion feature with both local details and global context, improving tracking performance.

Figure 6. Structure of the context awareness unit (CAU).

σ

denotes SiLU activation function.

4. Experiments and Results

In this section, we first detail the training and inference pipeline of the proposed SMUTrack, then compare it with state-of-the-art trackers on several mainstream MMOT datasets. Finally, we conduct an ablation study of its core modules, substantiating the effectiveness and superiority of SMUTrack.

4.1. Implementation Settings

We adopt the same tracking prediction head and loss function as OSTrack []. The transformer visual encoder employs ViT-Base [] architecture, and its parameters are initialized by MAE [] pre-training parameters. SMUTrack is built on the PyTorch 1.13.1 platform and trained on three types of multi-modal image pairs, including RGB-D, RGB-T, and RGB-E, with data sourced from DepthTrack [], LasHeR [], and VisEvent [] datasets, respectively. To avoid biasing the training model towards a specific modality type, the number of RGB-T, RGB-D, and RGB-E samples in each training batch is always consistent. For the input data, we use three 128 × 128 template image pairs and two 256 × 256 search image pairs. The sampling interval of the video sequence is set to 400, which better approximates the entire video content and captures the long-term motion changes in the tracked object. The training process is carried out on six NVIDIA GeForce RTX 3090 GPUs over a total of 60 epochs. Each epoch contains 60,000 image pairs, and the batch size is set to nine. The optimizer uses AdamW [] with a weight decay of 10⁻⁴. For the initial learning rate, the backbone network is set to 4 × 10⁻⁵, and other parameters are set to 4 × 10⁻⁴. The learning rate decays by a factor of 10 after 48 epochs.

The inference stage is consistent with the training setting, and three template frames are selected using a dynamic template update strategy. The proposed SMUTrack has 102.83 M parameters and 67.87 G FLOPs. On a GeForce RTX 4090 GPU, the average speed in LasHeR is approximately 41 FPS.

4.2. Comparison with State-of-the-Arts

We evaluate the proposed SMUTrack against state-of-the-art trackers on five benchmark datasets across three downstream MMOT tasks.

4.2.1. Comparison on RGB-D Datasets

DepthTrack [] is a large-scale long-term RGB-D tracking benchmark, which contains 150 training and 50 testing video sequences, featuring 15 per-frame attributes and an average sequence length of 1473. It uses precision (Pr), recall (Re), and F-score as evaluation metrics. Table 3 compares SMUTrack with 25 previous state-of-the-art trackers. SMUTrack exhibits the best performance, achieving 63.9% (F-score), 63.8% (recall), and 64.0% (precision).

Table 3. State-of-the-art comparisons on RGB-D datasets. Red and green indicate the best and second best performances.

VOT-RGBD2022 [] comprises 127 short-term RGB-D sequences. It adopts an anchor-based evaluation protocol [] that requires trackers to undergo multiple initialization starts from different points. As reported in Table 3, the evaluation metrics encompass accuracy (Acc.), robustness (Rob.), and expected average overlap (EAO). SMUTrack obtains the highest scores on these metrics, with values of 93%, 82.0%, and 76.9%, respectively; it has a 2.5% improvement over SeqTrackv2 in EAO.

4.2.2. Comparison on RGB-T Datasets

LasHeR [] comprises 1224 aligned RGB-T video sequences (978 for training, 244 for testing), totaling 730 K image pairs. This dataset covers 32 categories and 19 challenging attributes. Evaluation metrics include precision rate (PR), success rate (SR), and normalized precision rate (NPR). As illustrated in Table 4 and Figure 7, SMUTrack achieves the top performance in quantitative evaluation metrics and curves. We further analyze the capabilities of trackers in various challenging attributes, including illumination variation, occlusion, motion blur, etc. As shown in Figure 8, our SMUTrack performs extremely well in these extreme scenarios and demonstrates remarkable robustness. Quantitatively, SMUTrack outperforms the suboptimal GMMT by 3.4% under partial occlusion and 2.1% under total occlusion.

Table 4. State-of-the-art comparisons on RGB-T datasets. Red and green indicate the best and second best performances.

Figure 7. Success rate curves (left) and precision rate curves (right) on the LasHeR testing set.

Figure 8. Comprehensive comparison under 19 different attributes on the LasHeR testing set. NO, No Occlusion; PO, Partial Occlusion; TO, Total Occlusion; HO, Hyaline Occlusion; MB, Motion Blur; LI, Low Illumination, HI, High Illumination; AIV, Abrupt Illumination Variation; LR, Low Resolution; DEF, Deformation; BC, Blackground Clutter; SA, Similar Appearance; CM, Camera Moving; TC, Thermal Crossover; FL, Frame Lost; OV, Out of View; FM, Fast Motion; SV, Scale Variation; ARC, Aspect Ratio Change.

RGBT234 [] encompasses 234 video sequences carrying 12 annotated attributes. Maximum precision rate (MPR) and maximum success rate (MSR) are adopted as evaluation metrics, which consider the alignment error problem. In Table 4, our SMUTrack obtains the optimal performance with 66.7% MSR and 91% MPR, outperforming GMMT by 1.7% and 2.7%, respectively. Moreover, it adapts well to most challenging scenarios, as illustrated in Figure 9.

Figure 9. Performance comparison under 12 different attributes on the RGBT234 dataset. NO, No Occlusion; PO, Partial Occlusion; HO, Heavy Occlusion; LI, Low Illumination; LR, Low Resolution; TC, Thermal Crossover; DEF, Deformation; FM, Fast Motion; SV, Scale Variation; MB, Motion Blur; CM, Camera Moving; BC, Background Clutter; ALL, All Attributes.

4.2.3. Comparison on RGB-E Datasets

VisEvent [] comprises 820 sequences (500 used for training, 320 for testing). It utilizes conventional SR, PR, and NPR metrics to evaluate tracker performance. We compare SMUTrack with 18 state-of-the-art trackers. As illustrated in Table 5 and Figure 10, our SMUTrack has the best overall performance in quantitative evaluation metrics and curves. Moreover, it demonstrates strong capability in handling tracking tasks across diverse challenging scenarios (Figure 11). Notably, in partial occlusion and full occlusion scenarios, SMUTrack outperforms the suboptimal SeqTrackv2 by 0.8% and 2.1%, respectively.

Table 5. State-of-the-art comparisons on RGB-E datasets. ‘_E’ represents the extension of RGB trackers with event fusion. Red and green indicate the best and second best performances.

Figure 10. Overall performance on the VisEvent testing set.

Figure 11. Performance comparison under 17 different attributes on the VisEvent dataset. CM, Camera Motion; ROT, Rotation; DEF, Deformation; FOC, Full Occlusion; LI, Low Illumination; OV, Out of View; POC, Partial Occlusion; VC, Viewpoint Change; SV, Scale Variation; BC, Background Clutter; MB, Motion Blur; ARC, Aspect Ration Change; FM, Fast Motion; NMO, No Motion; IV, Illumination Variation; OE, Over Exposure; BOM, Background Object Motion.

4.3. Ablation Study

To validate the effectiveness of each module in SMUTrack, we perform an ablation study on DepthTrack, LasHeR, and VisEvent datasets, with detailed results presented in Table 6. The baseline model utilizes ViT as the visual encoder to extract RGB and X modality features. These features are fused through element-wise addition before being fed into the prediction head. To tackle the inadequate utilization of object spatial–temporal information, we designed the SIP module. This module effectively captures object appearance changes and facilitates temporal information flow across frames through multi-template updating and temporal prompt learning. Compared to the baseline, integrating the SIP module improves the average performance of the tracker by 3.3%. Considering the importance of RGB and X modality information interactions for the perception ability of the model, we further introduce the HMSR module. By embedding MSPB into the three encoding stages of ViT, we enhanced the representation learning capacity of the model, leading to an additional 1.3% improvement in tracking performance. Furthermore, to achieve deep fusion of RGB and X modality features, we designed a GFCA module; it regulates inter-modal information weights via a gated fusion unit (GFU), captures the correlations of cross-level fusion features in conjunction with a context awareness unit (CAU), and learns a multi-modal representation with global context. The GFCA module contributes to a further 0.2% enhancement in tracking accuracy.

Table 6. Quantitative results of ablation studies on DepthTrack, LasHeR, and VisEvent datasets. Δ denotes the average value of performance change compared to the benchmark. Red is the best performance.

We further validate the effectiveness of the two types of subcomponents in the GFCA module, with the results presented in Table 7. Specifically, when the multi-level fusion features obtained from GFUs are simply added together, tracking performance on the DepthTrack dataset deteriorates significantly (Experiment 2 in Table 7). The main reason is that low-level fusion features contain richer details but also come with certain noise. Especially, depth modality exhibits more inconsistent quality and blurred details compared to other modalities, making the model more sensitive to such noise. When multi-modal features from different levels are fused through direct addition and the contextual information is aggregated solely via CAUs, the overall model performance shows an obvious decline (Experiment 3 in Table 7). This result indicates that the additive approach introduces more redundant information and noise, negatively impacting the contextual aggregation process. This also reveals the importance of adaptive fusion within GFUs. GFUs and CAUs work synergistically to form an effective complementarity, enabling the model to achieve relatively superior performance (Experiment 4 in Table 7). Additionally, each GFU has 0.15 M parameters and 0.067 G FLOPs, while each CAU has 1.872 M parameters and 0.823 G FLOPs, with negligible impact on model complexity and computational efficiency.

Table 7. Ablation studies in the GFCA module. Δ denotes the average value of performance change compared to benchmark 1.

4.4. Visualization Results

In Figure 12, we present a qualitative comparison between SMUTrack and four of the latest unified multi-modal trackers including ViPT [], SDSTrack [], Un-Track [], and SeqTrackv2 []. These instances cover three combined modalities and diverse challenge attributes. In RGB-D video sequences, SMUTrack achieves stable and precise long-term object tracking even when encountering challenges such as partial occlusion and similar appearances. For RGB-T sequences, the tracked object is a composite of a human and a bicycle, with the main body severely occluded by dense pedestrians and vehicles as it rapidly moves from near to far. In this case, SMUTrack still achieves robust tracking by leveraging historical trajectory information. In RGB-E sequences, SMUTrack integrates RGB appearance features with event flow motion detail features. When confronting challenges such as low-resolution imaging, illumination variations, and complex backgrounds, it demonstrates excellent performance, enabling precise continuous tracking of an unmanned aerial vehicle. Overall, SMUTrack effectively exploits the complementary relationship between the RGB modality and arbitrary X modality and fully leverages the spatial–temporal information of video sequences, thereby significantly enhancing the perceptual and discriminative capacities of the tracking model for the object and its environmental context.

Figure 12. The visual comparison of our SMUTrack with the other four unified multi-modal trackers in three multi-modal tasks.

5. Conclusions

In this work, we propose SMUTrack, a unified MMOT framework featuring a unique network architecture and a unique parameter set. By leveraging the latent relationships among intra- and inter-task modalities, as well as the spatial–temporal intrinsic correlations in video sequences, SMUTrack effectively mitigates cross-modal inherent heterogeneity and inter-modal synergy imbalance, and its spatial–temporal perception and representation learning abilities are significantly improved. SMUTrack achieves excellent performance in three downstream MMOT tasks, exhibiting strong generalization capability. Extensive experiments validate the superiority and robustness of the proposed framework. Future work will focus on integrating the powerful representation capabilities of large models into the multi-modal tracking model while pursuing the lightweight design to promote precise understanding and efficient deployment in more challenging open-world scenarios.

Author Contributions

Conceptualization, J.W. and J.Z.; methodology, J.W. and J.Z.; validation, J.W.; formal analysis, J.W., M.L. and J.Z.; investigation, H.Z., Y.W. and J.Z.; writing—original draft preparation, J.W.; writing—review and editing, J.W. and J.Z.; visualization, J.W.; funding acquisition, M.L. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Frontier Research Fund of Institute of Optics and Electronics, China Academy of Sciences, under Grant C21K005, the National Natural Science Foundation of China, under Grant 62101529, and the Frontier Research Fund of Institute of Optics and Electronics, China Academy of Sciences, under Grant C24K003.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 341–357. [Google Scholar]
Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706. [Google Scholar]
Wang, J.; Wu, Z.; Chen, D.; Luo, C.; Dai, X.; Yuan, L.; Jiang, Y.G. OmniTracker: Unifying Visual Object Tracking by Tracking-with-Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3159–3174. [Google Scholar] [CrossRef]
Zhou, J.; Dong, Y.; Du, B. SiamTITP: Incorporating Temporal Information and Trajectory Prediction Siamese Network for Satellite Video Object Tracking. IEEE Trans. Image Process. 2025, 34, 4120–4133. [Google Scholar] [CrossRef] [PubMed]
Lin, J.; Chen, J.; Peng, K.; He, X.; Li, Z.; Stiefelhagen, R.; Yang, K. EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2024, 25, 18964–18977. [Google Scholar] [CrossRef]
Cho, M.; Kim, E. 3D LiDAR Multi-Object Tracking with Short-Term and Long-Term Multi-Level Associations. Remote Sens. 2023, 15, 5486. [Google Scholar] [CrossRef]
Zhang, Q.; Wang, L.; Patel, V.M.; Xie, X.; Lai, J. View-decoupled transformer for person re-identification under aerial-ground camera network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 22000–22009. [Google Scholar]
Meng, W.; Duan, S.; Ma, S.; Hu, B. Motion-Perception Multi-Object Tracking (MPMOT): Enhancing Multi-Object Tracking Performance via Motion-Aware Data Association and Trajectory Connection. J. Imaging 2025, 11, 144. [Google Scholar] [CrossRef]
Zhang, Y.; Wang, T.; Liu, K.; Zhang, B.; Chen, L. Recent advances of single-object tracking methods: A brief survey. Neurocomputing 2021, 455, 1–11. [Google Scholar] [CrossRef]
Black, D.; Salcudean, S. Robust object pose tracking for augmented reality guidance and teleoperation. IEEE Trans. Instrum. Meas. 2024, 73, 9509815. [Google Scholar] [CrossRef]
Tang, Z.; Xu, T.; Li, H.; Wu, X.-J.; Zhu, X.; Kittler, J. Exploring fusion strategies for accurate RGBT visual object tracking. Inf. Fusion 2023, 99, 101881. [Google Scholar] [CrossRef]
Gao, S.; Yang, J.; Li, Z.; Zheng, F.; Leonardis, A.; Song, J. Learning dual-fused modality-aware representations for RGBD tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 478–494. [Google Scholar]
Zhu, X.-F.; Xu, T.; Tang, Z.; Wu, Z.; Liu, H.; Yang, X.; Wu, X.-J.; Kittler, J. RGBD1K: A large-scale dataset and benchmark for RGB-D object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3870–3878. [Google Scholar]
Sun, C.; Zhang, J.; Wang, Y.; Ge, H.; Xia, Q.; Yin, B.; Yang, X. Exploring Historical Information for RGBE Visual Tracking with Mamba. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–17 June 2025; pp. 6500–6509. [Google Scholar]
Gao, L.; Ke, Y.; Zhao, W.; Zhang, Y.; Jiang, Y.; He, G.; Li, Y. RGB-D visual object tracking with transformer-based multi-modal feature fusion. Knowl.-Based Syst. 2025, 322, 113531. [Google Scholar] [CrossRef]
Zhang, T.; He, X.; Jiao, Q.; Zhang, Q.; Han, J. AMNet: Learning to Align Multi-Modality for RGB-T Tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7386–7400. [Google Scholar] [CrossRef]
Zhu, Z.; Hou, J.; Wu, D.O. Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22045–22055. [Google Scholar]
Yuan, D.; Zhang, H.; Shu, X.; Liu, Q.; Chang, X.; He, Z.; Shi, G. Thermal Infrared Target Tracking: A Comprehensive Review. IEEE Trans. Instrum. Meas. 2024, 73, 5000419. [Google Scholar] [CrossRef]
Xue, Y.; Zhang, J.; Lin, Z.; Li, C.; Huo, B.; Zhang, Y. SiamCAF: Complementary Attention Fusion-Based Siamese Network for RGBT Tracking. Remote Sens. 2023, 15, 3252. [Google Scholar] [CrossRef]
Yang, J.; Gao, S.; Li, Z.; Zheng, F.; Leonardis, A. Resource-efficient rgbd aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13374–13383. [Google Scholar]
Hu, X.; Zhong, B.; Liang, Q.; Zhang, S.; Li, N.; Li, X. Toward Modalities Correlation for RGB-T Tracking. IEEE Trans. Circuit Syst. Video Technol. 2024, 34, 9102–9111. [Google Scholar] [CrossRef]
Wang, X.; Li, J.; Zhu, L.; Zhang, Z.; Chen, Z.; Li, X.; Wang, Y.; Tian, Y.; Wu, F. Visevent: Reliable object tracking via collaboration of frame and event flows. IEEE Trans. Cybern. 2023, 54, 1997–2010. [Google Scholar] [CrossRef] [PubMed]
Cao, B.; Guo, J.; Zhu, P.; Hu, Q. Bi-directional adapter for multimodal tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 927–935. [Google Scholar]
Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9516–9526. [Google Scholar]
Hou, X.; Xing, J.; Qian, Y.; Guo, Y.; Xin, S.; Chen, J.; Tang, K.; Wang, M.; Jiang, Z.; Liu, L. Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26551–26561. [Google Scholar]
Ye, P.; Xiao, G.; Liu, J. AMATrack: A Unified Network with Asymmetric Multimodal Mixed Attention for RGBD Tracking. IEEE Trans. Instrum. Meas. 2024, 73, 2526011. [Google Scholar] [CrossRef]
Tang, Z.; Xu, T.; Wu, X.; Zhu, X.-F.; Kittler, J. Generative-based fusion mechanism for multi-modal tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 5189–5197. [Google Scholar]
Wu, Z.; Zheng, J.; Ren, X.; Vasluianu, F.-A.; Ma, C.; Paudel, D.P.; Van Gool, L.; Timofte, R. Single-model and any-modality for video object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19156–19166. [Google Scholar]
Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. Seqtrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14572–14581. [Google Scholar]
Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457. [Google Scholar]
Zhang, M.; Zhang, Q.; Song, W.; Huang, D.; He, Q. PromptVT: Prompting for Efficient and Accurate Visual Tracking. IEEE Trans. Circuit Syst. Video Technol. 2024, 34, 7373–7385. [Google Scholar] [CrossRef]
Ying, G.; Zhang, D.; Ou, Z.; Wang, X.; Zheng, Z. Temporal adaptive bidirectional bridging for RGB-D tracking. Pattern Recognit. 2025, 158, 111053. [Google Scholar] [CrossRef]
Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19300–19309. [Google Scholar]
Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. Towards real-world visual tracking with temporal contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15834–15849. [Google Scholar] [CrossRef]
Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. TCTrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
Yan, S.; Yang, J.; Käpylä, J.; Zheng, F.; Leonardis, A.; Kämäräinen, J.-K. Depthtrack: Unveiling the power of rgbd tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10725–10733. [Google Scholar]
Li, C.; Xue, W.; Jia, Y.; Qu, Z.; Luo, B.; Tang, J.; Sun, D. LasHeR: A large-scale high-diversity benchmark for RGBT tracking. IEEE Trans. Image Process. 2021, 31, 392–404. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Hong, L.; Yan, S.; Zhang, R.; Li, W.; Zhou, X.; Guo, P.; Jiang, K.; Chen, Y.; Li, J.; Chen, Z. Onetracker: Unifying visual object tracking with foundation models and efficient tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19079–19091. [Google Scholar]
Yang, J.; Li, Z.; Zheng, F.; Leonardis, A.; Song, J. Prompting for multi-modal tracking. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 3492–3500. [Google Scholar]
Xie, F.; Wang, C.; Wang, G.; Cao, Y.; Yang, W.; Zeng, W. Correlation-aware deep tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8751–8760. [Google Scholar]
Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 13444–13454. [Google Scholar]
Kristan, M.; Matas, J.; Leonardis, A.; Felsberg, M.; Pflugfelder, R.; Kämäräinen, J.-K.; Chang, H.J.; Danelljan, M.; Cehovin, L.; Lukežič, A. The ninth visual object tracking vot2021 challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2711–2738. [Google Scholar]
Qian, Y.; Yan, S.; Lukežič, A.; Kristan, M.; Kämäräinen, J.-K.; Matas, J. DAL: A deep depth-aware long-term tracker. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7825–7832. [Google Scholar]
Zhao, P.; Liu, Q.; Wang, W.; Guo, Q. Tsdm: Tracking by siamrpn++ with a depth-refiner and a mask-generator. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 670–676. [Google Scholar]
Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Kämäräinen, J.-K.; Danelljan, M.; Zajc, L.Č.; Lukežič, A.; Drbohlav, O. The eighth visual object tracking VOT2020 challenge results. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part V. pp. 547–601. [Google Scholar]
Dai, K.; Zhang, Y.; Wang, D.; Li, J.; Lu, H.; Yang, X. High-performance long-term tracking with meta-updater. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6298–6307. [Google Scholar]
Kristan, M.; Matas, J.; Leonardis, A.; Felsberg, M.; Pflugfelder, R.; Kämäräinen, J.-K.; Čehovin Zajc, L.; Drbohlav, O.; Lukezic, A.; Berg, A. The seventh visual object tracking VOT2019 challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2206–2241. [Google Scholar]
Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Kämäräinen, J.-K.; Chang, H.J.; Danelljan, M.; Zajc, L.Č.; Lukežič, A. The tenth visual object tracking vot2022 challenge results. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 431–460. [Google Scholar]
Wang, H.; Liu, X.; Li, Y.; Sun, M.; Yuan, D.; Liu, J. Temporal adaptive rgbt tracking with modality prompt. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 5436–5444. [Google Scholar]
Hui, T.; Xun, Z.; Peng, F.; Huang, J.; Wei, X.; Wei, X.; Dai, J.; Han, J.; Liu, S. Bridging search region interaction with template for rgb-t tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13630–13639. [Google Scholar]
Xiao, Y.; Yang, M.; Li, C.; Liu, L.; Tang, J. Attribute-based progressive fusion network for rgbt tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; pp. 2831–2838. [Google Scholar]
Zhang, P.; Zhao, J.; Wang, D.; Lu, H.; Ruan, X. Visible-thermal UAV tracking: A large-scale benchmark and new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8886–8895. [Google Scholar]
Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 10–17 October 2021; pp. 8126–8135. [Google Scholar]
Wang, C.; Xu, C.; Cui, Z.; Zhou, L.; Zhang, T.; Zhang, X.; Yang, J. Cross-modal pattern-propagation for RGB-T tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7064–7073. [Google Scholar]
Li, C.; Liu, L.; Lu, A.; Ji, Q.; Tang, J. Challenge-aware RGBT tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 222–237. [Google Scholar]
Zhu, Y.; Li, C.; Tang, J.; Luo, B. Quality-aware feature aggregation network for robust RGBT tracking. IEEE Trans. Intell. Veh. 2020, 6, 121–130. [Google Scholar] [CrossRef]
Zhang, L.; Danelljan, M.; Gonzalez-Garcia, A.; Van De Weijer, J.; Shahbaz Khan, F. Multi-modal fusion for end-to-end RGB-T tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2252–2261. [Google Scholar]
Zhu, Y.; Li, C.; Luo, B.; Tang, J.; Wang, X. Dense feature aggregation and pruning for RGBT tracking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 465–472. [Google Scholar]
Li, C.; Zhao, N.; Lu, Y.; Zhu, C.; Tang, J. Weighted sparse representation regularized graph learning for RGB-T object tracking. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1856–1864. [Google Scholar]
Li, C.; Liang, X.; Lu, Y.; Zhao, N.; Tang, J. RGB-T Object Tracking: Benchmark and Baseline. Pattern Recognit. 2019, 96, 106977. [Google Scholar] [CrossRef]
Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277. [Google Scholar]
Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6578–6588. [Google Scholar]
Danelljan, M.; Gool, L.V.; Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7183–7192. [Google Scholar]
Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar]
Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1328–1338. [Google Scholar]
Song, Y.; Ma, C.; Wu, X.; Gong, L.; Bao, L.; Zuo, W.; Shen, C.; Lau, R.W.; Yang, M.-H. Vital: Visual tracking via adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8990–8999. [Google Scholar]
Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302. [Google Scholar]

Figure 1. Comparison of multi-modal object tracking frameworks. (a) Task-specific multi-modal trackers primarily adopt symmetric dual-encoding architectures, tailored for a single RGB-X task. (b) Task-general multi-modal trackers typically treat X modality as auxiliary and introduce prompt learning to accommodate diverse multi-modal tracking tasks within a unified framework. (c) Our proposed SMUTrack is a powerful unified model that introduces batch operations and spatial–temporal information propagation in addition to enhancing inter-modal collaboration, hence diminishing cross-task and cross-modality barriers by one training session.

Figure 2. Overall architecture of the proposed SMUTrack. Diverse modality images are first converted into token embeddings and fed into a transformer encoder alongside temporal tokens to extract multi-modal features and temporal prompts. Then, an HMSR module and a GFCA module progressively interact and integrate multi-modal spatial–temporal information. Finally, a SIP mechanism optimizes the fusion feature for prediction, and updates templates.

Figure 3. Spatial–temporal information propagation in SMUTrack. Our tracking pipeline introduces modality-specific temporal tokens and progressively learns a temporal prompt during feature extraction and interaction, propagating it to the next frame.

Figure 4. Details of the mamba synergy prompt block (MSPB). SSM, state space model.

σ

denotes SiLU activation function.

Figure 5. Schematic diagram of the gated fusion unit (GFU).

Figure 6. Structure of the context awareness unit (CAU).

σ

denotes SiLU activation function.

Figure 7. Success rate curves (left) and precision rate curves (right) on the LasHeR testing set.

Figure 8. Comprehensive comparison under 19 different attributes on the LasHeR testing set. NO, No Occlusion; PO, Partial Occlusion; TO, Total Occlusion; HO, Hyaline Occlusion; MB, Motion Blur; LI, Low Illumination, HI, High Illumination; AIV, Abrupt Illumination Variation; LR, Low Resolution; DEF, Deformation; BC, Blackground Clutter; SA, Similar Appearance; CM, Camera Moving; TC, Thermal Crossover; FL, Frame Lost; OV, Out of View; FM, Fast Motion; SV, Scale Variation; ARC, Aspect Ratio Change.

Figure 9. Performance comparison under 12 different attributes on the RGBT234 dataset. NO, No Occlusion; PO, Partial Occlusion; HO, Heavy Occlusion; LI, Low Illumination; LR, Low Resolution; TC, Thermal Crossover; DEF, Deformation; FM, Fast Motion; SV, Scale Variation; MB, Motion Blur; CM, Camera Moving; BC, Background Clutter; ALL, All Attributes.

Figure 10. Overall performance on the VisEvent testing set.

Figure 11. Performance comparison under 17 different attributes on the VisEvent dataset. CM, Camera Motion; ROT, Rotation; DEF, Deformation; FOC, Full Occlusion; LI, Low Illumination; OV, Out of View; POC, Partial Occlusion; VC, Viewpoint Change; SV, Scale Variation; BC, Background Clutter; MB, Motion Blur; ARC, Aspect Ration Change; FM, Fast Motion; NMO, No Motion; IV, Illumination Variation; OE, Over Exposure; BOM, Background Object Motion.

Figure 12. The visual comparison of our SMUTrack with the other four unified multi-modal trackers in three multi-modal tasks.

Table 1. Overview of related work on multi-modal object tracking.

Category	Related Works	Data Types	Descriptions
Task-specific	AMATrack []	RGB-D	Designed for a specific RGB-X task, they may excel at one type of task but fail to consider potential modality relationships across tasks.
	AMNet []	RGB-T
	GMMT []	RGB-T
	RT-MDNet []	RGB-E
Task-general	ViPT []	RGB-D, RGB-T, RGB-E	Only a few prompts are needed for learning, but modality synergy and generalization capabilities still need improvement.
	SDSTrack []
	Un-Track []

Table 2. Overview of related work on spatial–temporal modeling.

Category	Related Works	Descriptions
Updating appearance representations	SeqTrack []	Online template update strategies are typically employed to capture real-time appearance changes, yet sparse template updates struggle to establish stable spatial–temporal correlations.
	STARK []
	PromptVT []
	TABBTrack []
Modeling temporal dependencies	ARTrack []	Learning a global temporal representation across video sequences enables effective perception of trajectory evolution processes.
	AQATrack []
	TCTrack []

Table 3. State-of-the-art comparisons on RGB-D datasets. Red and green indicate the best and second best performances.

Method	Source	VOT-RGBD22			DepthTrack
Method	Source	EAO	Acc.	Rob.	F-Score	Re	Pr
SMUTrack	Ours	76.9	82.0	93.0	63.9	63.8	64.0
SeqTrackv2 []	CVPR’24	74.4	81.5	91.0	63.2	63.4	62.9
OneTracker []	CVPR’24	72.7	81.9	87.2	60.9	60.4	60.7
Un-Track []	CVPR’24	71.8	82.0	86.4	61.0	61.0	61.0
SDSTrack []	CVPR’24	72.8	81.2	88.3	61.4	60.9	61.9
SPT []	AAAI’23	65.1	79.8	85.1	53.8	54.9	52.7
ViPT []	CVPR’23	72.1	81.5	87.1	59.4	59.6	59.2
ProTrack []	ACMMM’22	65.1	80.1	80.2	57.8	57.3	58.3
OSTrack []	ECCV’22	67.6	80.3	83.3	52.9	52.2	53.6
SBT-RGBD []	CVPR’22	70.8	80.9	86.4	-	-	-
DMTracker []	ECCV’22	65.8	75.8	85.1	-	-	-
DeT []	ICCV’21	65.7	76.0	84.5	53.2	50.6	56.0
STARK-RGBD []	ICCV’21	64.7	80.3	79.8	-	-	-
KeepTrack []	ICCV’21	60.6	75.3	79.7	-	-	-
DRefine []	ICCV’21	59.2	77.5	76.0	-	-	-
DAL []	ICPR’21	-	-	-	42.9	36.9	51.2
TSDM []	ICPR’21	-	-	-	38.4	37.6	39.3
DDiMP []	ECCV’20	-	-	-	48.5	46.9	50.3
ATCAIS []	ECCV’20	55.9	76.1	73.9	47.6	45.5	50.0
LTMU-B []	CVPR’20	-	-	-	46.0	41.7	51.2
GLGS-D []	ECCV’20	-	-	-	45.3	36.9	58.4
Siam-LTD []	ECCV’20	-	-	-	37.6	34.2	41.8
LTDSEd []	ICCVW’19	-	-	-	40.5	38.2	43.0
SiamM-Ds []	ICCVW’19	-	-	-	33.6	26.4	46.3
DiMP []	ICCV’19	54.3	70.3	73.1	-	-	-
ATOM []	CVPR’19	50.5	69.8	68.8	-	-	-

Table 4. State-of-the-art comparisons on RGB-T datasets. Red and green indicate the best and second best performances.

Method	Source	LasHeR			RGBT234
Method	Source	SR	PR	NPR	MSR	MPR
SMUTrack	Ours	59.7	74.8	71.3	66.7	91.0
GMMT []	AAAI’24	56.6	70.7	67.0	65.0	88.3
BAT []	AAAI’24	56.3	70.2	66.4	64.1	86.8
TATrack []	AAAI’24	56.1	70.2	66.7	64.4	87.2
SeqTrackv2 []	CVPR’24	55.8	70.4	67.2	64.7	88.0
OneTracker []	CVPR’24	53.8	67.2	-	64.2	85.7
Un-Track []	CVPR’24	51.3	63.7	60.1	62.5	84.2
SDSTrack []	CVPR’24	53.1	66.5	62.7	62.5	84.8
TBSI []	CVPR’23	55.6	69.2	65.7	63.7	87.1
ViPT []	CVPR’23	52.5	65.1	61.7	61.7	83.5
ProTrack []	ACMMM’22	42.0	53.8	49.8	59.9	79.5
OSTrack []	ECCV’22	41.2	51.5	48.2	54.9	72.9
APFNet []	AAAI’22	36.2	50.0	43.9	57.9	82.7
HMFT []	CVPR’22	31.3	43.6	38.1	-	-
TransT []	CVPR’21	39.4	52.4	48.0	-	-
CMPP []	CVPR’20	-	-	-	57.5	82.3
CAT []	ECCV’20	31.4	45.0	39.5	56.1	80.4
FANet []	TIV’20	30.9	44.1	38.4	55.3	78.7
mfDiMP []	ICCVW’19	34.3	44.7	39.5	42.8	64.6
DAPNet []	ACMMM’19	31.4	43.1	38.3	-	-
SGT []	ACMMM’17	25.1	36.5	30.6	47.2	72.0

Table 5. State-of-the-art comparisons on RGB-E datasets. ‘_E’ represents the extension of RGB trackers with event fusion. Red and green indicate the best and second best performances.

Method	Source	VisEvent
Method	Source	SR	PR	NPR
SMUTrack	Ours	62.3	79.4	75.4
SeqTrackv2 []	CVPR’24	61.2	78.2	73.9
OneTracker []	CVPR’24	60.8	76.7	-
Un-Track []	CVPR’24	58.9	75.5	71.0
SDSTrack []	CVPR’24	59.7	76.7	72.3
ViPT []	CVPR’23	59.2	75.8	71.5
ProTrack []	ACMMM’22	47.1	63.2	56.0
OSTrack []	ECCV’22	53.4	69.5	64.6
TransT_E []	CVPR’21	47.4	65.0	58.3
STARK_E []	ICCV’21	44.6	61.2	53.7
SiamCAR_E []	CVPR’20	42.0	59.9	52.8
SiamRCNN_E []	CVPR’20	49.9	65.9	60.6
LTMU_E []	CVPR’20	45.9	65.5	57.0
PrDiMP_E []	CVPR’20	45.3	64.4	55.5
SiamBAN_E []	CVPR’20	40.5	59.1	50.7
ATOM_E []	CVPR’19	41.2	60.8	50.8
SiamMask_E []	CVPR’19	36.9	56.2	47.1
VITAL_E []	CVPR’18	41.5	64.9	52.9
MDNet_E []	CVPR’16	42.6	66.1	55.7

Table 6. Quantitative results of ablation studies on DepthTrack, LasHeR, and VisEvent datasets. Δ denotes the average value of performance change compared to the benchmark. Red is the best performance.

Components				DepthTrack			LasHeR			VisEvent			Δ
Baseline	SIP	HMSR	GFCA	F-Score	Re	Pr	SR	PR	NPR	SR	PR	NPR	Δ
√				58.3	58.7	58.0	53.8	67.0	63.5	61.1	77.9	74.1	-
√	√			63.2	63.4	63.0	57.5	72.4	68.8	61.2	78.1	74.3	3.3
√	√	√		64.3	64.3	64.4	59.5	74.6	71.0	61.9	78.3	74.9	1.3
√	√	√	√	63.9	63.8	64.0	59.7	74.8	71.3	62.3	79.4	75.4	0.2

√ denotes the presence of the corresponding component.

Table 7. Ablation studies in the GFCA module. Δ denotes the average value of performance change compared to benchmark 1.

Number	GFU	CAU	DepthTrack			LasHeR			VisEvent			Δ
Number	GFU	CAU	F-Score	Re	Pr	SR	PR	NPR	SR	PR	NPR	Δ
1	×	×	64.3	64.3	64.4	59.5	74.6	71.0	61.9	78.3	74.9	-
2	√	×	61.9	62.0	61.9	59.5	74.8	71.2	62.2	79.4	75.3	−0.6
3	×	√	62.6	62.2	62.9	59.2	74.5	70.6	61.2	78.0	74.2	−0.9
4	√	√	63.9	63.8	64.0	59.7	74.8	71.3	62.3	79.4	75.4	0.2

√ and × denote the presence or absence of the corresponding component.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Article metric data becomes available approximately 24 hours after publication online.