Article

MPVT: An Efficient Multi-Modal Prompt Vision Tracker for Visual Target Tracking

Jianyu Xie, Yan Fu, Junlin Zhou, Tianxiang He, Xiaopeng Wang, Yuke Fang and Duanbing Chen

1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Chengdu Union Big Data Tech. Inc., Chengdu 610041, China
3 Zhuhai Yiyuan Technology Co., Ltd., Zhuhai 519040, China
4 Suining Institute of Digital Economy, Suining 629018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7967; https://doi.org/10.3390/app15147967
Submission received: 18 June 2025 / Revised: 11 July 2025 / Accepted: 14 July 2025 / Published: 17 July 2025
(This article belongs to the Special Issue Advanced Technologies Applied for Object Detection and Tracking)

Abstract

Visual target tracking is a fundamental task in computer vision. Combining multi-modal information with tracking exploits the complementarity between modalities, which improves the precision and robustness of trackers. Traditional multi-modal tracking methods typically employ a full fine-tuning scheme, i.e., fine-tuning pre-trained single-modal models on multi-modal tasks. However, this approach suffers from low transfer-learning efficiency, catastrophic forgetting, and high cross-task deployment costs. To address these issues, we propose an efficient model named the multi-modal prompt vision tracker (MPVT), based on an efficient prompt-tuning paradigm. The model comprises three key components: a decoupled input enhancement module, a dynamic adaptive prompt fusion module, and a fully connected head network module. The decoupled input enhancement module enriches input representations via positional and type embeddings. The dynamic adaptive prompt fusion module achieves efficient prompt tuning and multi-modal interaction using scaled convolution and low-rank cross-modal attention mechanisms. The fully connected head network module avoids the shortcomings of traditional convolutional head networks, such as inductive biases. Experimental results in RGB-T, RGB-D, and RGB-E scenarios show that MPVT outperforms state-of-the-art methods. Moreover, compared with a full-parameter fine-tuning model, MPVT reduces GPU memory usage by 43.8% and training time by 62.9%.

1. Introduction

Visual target tracking is widely used in various scenarios, such as video surveillance and autonomous driving. Target tracking based on RGB images [1,2,3] achieves high performance thanks to large-scale training data. However, RGB-only trackers often fail in challenging conditions such as darkness, cluttered backgrounds, and high-speed motion. With the development of multi-modal technology, auxiliary modalities can enhance the success rate, precision, and robustness of tracking by exploiting the complementarity between modalities. Classical scenarios include RGB-T tracking [4,5,6], RGB-D tracking [7,8], and RGB-E tracking [9,10,11].
For RGB-T tracking, some researchers have proposed fusion methods based on feature decoupling. Li et al. [12] constructed a multi-branch network to handle RGB and thermal infrared images separately. ADRNet [13] further utilizes adaptive universal branches for decoupling. MANet [14] jointly learns shared, modal-specific, and instance-aware features through universal, modal, and instance adapters. MANet++ [15] incorporates a hierarchical divergence loss to maximize the distribution difference between modal-specific and shared features. DMCNet [16] adopts a duality-gated mutual condition module to suppress noise through cross-modal feature guidance. As attention mechanisms have developed, some researchers have proposed dynamic fusion methods based on feature selection. For example, FANet [17] involves a hierarchical feature aggregation module to adaptively fuse multi-layer features. CMPP [18] includes a cross-modal propagation framework that achieves complementarity through spatio-temporal feature diffusion. ViPT [19] uses the infrared modality as a visual prompt for the visible modality and utilizes an attention mechanism to achieve cross-modal interaction. In addition, some methods exploit the characteristics of Transformer models to design multi-modal interaction modules for RGB-T target tracking. For example, MTNet [20] utilizes channel aggregation and distribution modules to eliminate redundancy. SiamCAF [21] uses complementary coupling modules to enhance feature similarity through cross-connections. Very recently, Wang et al. [22] proposed a lightweight adapter to boost fusion performance for RGB-T target tracking.
Traditional RGB-D target tracking methods mainly adopt manually designed features and shallow fusion strategies. For example, DS-KCF [23] achieves bimodal feature-level fusion through HOG feature extraction, while STC [24] improves tracking robustness through decision-level fusion. With the development of deep learning, researchers have begun to explore deep feature fusion. For example, Jiang et al. [25] proposed a dual-branch CNN architecture that extracts RGB and depth features separately and enhances discriminability through weighted fusion. OTR [26] introduces depth-based dynamic spatial constraints to optimize DCF (discriminative correlation filter) learning, significantly improving tracking stability in occluded scenarios. In recent years, researchers have focused more on cross-modal interaction modeling to fully utilize the complementary information of multiple modalities. DeT [7] incorporates a symmetrical dual-branch structure and achieves early fusion through feature concatenation. Recently, Gao et al. [27] proposed a new RGB-D approach to capture the interactive information between modalities.
The RGB-E tracking method [28] often uses a cross-modal attention mechanism for feature fusion. AFNet [10] utilizes the high temporal resolution of event data to design a cross-frame rate alignment module, achieving a tracking frequency of 1000 Hz. HRCEUTrack [11] introduces a Mask Autoencoder (MAE) to randomly mask RGB and event tokens, enhances cross-modal representation consistency through reconstruction tasks, and combines orthogonal high-rank regularization to suppress network fluctuations.
In recent years, the visual prompt learning paradigm has been introduced into target tracking; it adapts pre-trained models to specific downstream tasks through prompt modules, fully utilizing pre-trained knowledge and reducing dependence on downstream data. Yang et al. [29] proposed ProTrack, which applies prompt learning to multi-modal tracking; the model utilizes visual prompts to convert RGB-T and RGB-D data into a unified RGB space. With the diversification of tracking tasks, some researchers have focused on universal visual prompt frameworks. Hong et al. [30] proposed the OneTracker framework, which freezes the base tracking model and fine-tunes only the cross-modal prompt module, achieving unified processing of multiple tracking tasks and enabling the model to quickly adapt to diverse tracking scenarios. Since target tracking is a temporal task, some methods rely on prompt learning to fuse spatio-temporal information. Shi et al. [31] proposed EVPTrack, which conveys spatio-temporal information to the current frame by generating explicit visual prompts, avoiding frequent template updates, and enhances the model's adaptability to scale changes through a multi-scale prompt mechanism.
Although various methods have achieved great success in target tracking, they mainly adopt full-parameter fine-tuning strategies, which have some limitations: (1) Low transfer efficiency: The introduction of multi-modality further increases the complexity of a model, resulting in a significant increase in the resources required for fine-tuning. (2) Catastrophic forgetting of pre-trained knowledge: Unbalanced data sizes can easily lead to the over-fitting of downstream noise in a model. (3) High cost of cross-task deployment: Each downstream task model needs to be stored separately and cannot share common capabilities.
This paper focuses on the issues of multi-modal tracking, combining thermal infrared (T), depth (D), event (E), or other data from different sensors with RGB images, and makes full use of the complementary characteristics between modalities to enhance tracking performance. In this paper, a multi-modal efficient fine-tuning model based on prompt learning, MPVT, is proposed to address the problems of low transfer efficiency, catastrophic forgetting, and weak multi-modal correlation modeling in the traditional full-parameter fine-tuning transfer paradigm.
The remainder of this paper is organized as follows: In Section 2, the multi-modal prompt vision tracker (MPVT) is proposed based on an efficient prompt-tuning paradigm. Decoupled input enhancement, dynamic adaptive prompt fusion, and fully connected head network modules are included in the model. Furthermore, a joint loss function is introduced so as to train the model efficiently. The experimental settings, datasets, details of the experimental results, and analysis are presented in Section 3. Finally, conclusions are summarized in Section 4.

2. Materials and Methods

The framework of MPVT is shown in Figure 1. MPVT aims to fine-tune the pre-trained RGB target tracking model for multi-modal tracking tasks, including RGB-T, RGB-D, and RGB-E, through efficient transfer learning.
Based on the ViT encoder architecture, three components are designed to achieve efficient multi-modal transfer learning. (1) The decoupled input embedding enhancement module first applies resolution-decoupled position embedding to the inputs of both modalities and then adds type embeddings. (2) The dynamic adaptive prompt fusion module is the core of prompt learning; it uses N multi-modal prompt learning modules for transfer learning while significantly reducing the number of learnable parameters by freezing the backbone encoder layers. (3) The fully connected head network module implements classification and regression branches, avoiding the inductive bias of traditional convolutional head networks and providing stronger global feature interaction.

2.1. Decoupled Input Embedding Enhancement Module

For the input template images $Z_{RGB}, Z_{AUX} \in \mathbb{R}^{H_z \times W_z \times 3}$ and search regions $X_{RGB}, X_{AUX} \in \mathbb{R}^{H_x \times W_x \times 3}$, two image embedding layers are used to map them to token sequences. Since the backbone has been pre-trained, only the auxiliary modal embedding layer needs to be fine-tuned, while the RGB embedding layer is frozen. The embedding process can be written as

$$\tilde{X}_{RGB}^{0} = E_{RGB}(X_{RGB}), \quad \tilde{Z}_{RGB}^{0} = E_{RGB}(Z_{RGB}), \quad \tilde{X}_{AUX}^{0} = E_{AUX}(X_{AUX}), \quad \tilde{Z}_{AUX}^{0} = E_{AUX}(Z_{AUX}), \tag{1}$$

where $\tilde{X}_{RGB}^{0}, \tilde{X}_{AUX}^{0} \in \mathbb{R}^{L_x \times d}$, $\tilde{Z}_{RGB}^{0}, \tilde{Z}_{AUX}^{0} \in \mathbb{R}^{L_z \times d}$, $L_x = \frac{H_x}{s} \times \frac{W_x}{s}$, and $L_z = \frac{H_z}{s} \times \frac{W_z}{s}$, with the patch size being $(s, s)$.
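As a concrete illustration of Equation (1), the sketch below builds the two patch-embedding layers in PyTorch, freezing the pre-trained RGB branch so that only the auxiliary branch is trainable. The patch size of 14 matches the feature sizes in Table 1 (378/14 = 27, 196/14 = 14); the token width and class name are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class DualPatchEmbed(nn.Module):
    """Patch embedding E_RGB / E_AUX for the two modalities.

    The RGB branch is frozen (pre-trained); only the auxiliary branch is tuned.
    dim=768 is illustrative (Table 1 reports 1536 for the full 40-layer model).
    """
    def __init__(self, patch_size=14, dim=768):
        super().__init__()
        self.embed_rgb = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.embed_aux = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        for p in self.embed_rgb.parameters():   # freeze the pre-trained RGB embedding
            p.requires_grad = False

    def forward(self, x_rgb, x_aux):
        # Flatten the spatial grid into a token sequence: (B, d, H/s, W/s) -> (B, L, d)
        tok_rgb = self.embed_rgb(x_rgb).flatten(2).transpose(1, 2)
        tok_aux = self.embed_aux(x_aux).flatten(2).transpose(1, 2)
        return tok_rgb, tok_aux

# Example: 378x378 search regions for both modalities -> 27*27 tokens each
embed = DualPatchEmbed()
x_rgb = torch.randn(1, 3, 378, 378)
x_aux = torch.randn(1, 3, 378, 378)
tok_rgb, tok_aux = embed(x_rgb, x_aux)   # each of shape (1, 729, 768)
```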
After obtaining the embedding token sequences for each image, this module enhances the input token using two procedures: resolution-decoupled positional embedding enhancement and type embedding enhancement.
Traditional multi-modal tracking models independently embed the positions of the template and search regions using one-dimensional absolute embeddings. If the resolution of the template region is significantly lower than that of the search region, this disrupts the input distribution of the pre-trained ViT model, causing fine-tuning to fail. To solve this problem, unified position embedding through a resolution-decoupled position embedding method is presented. Specifically, let the one-dimensional absolute embedding be $R$; the corresponding two-dimensional form is $R_{2D}$, as shown in Equation (2). For a search region of size $h_x \times w_x$, if $h_x = h$ and $w_x = w$, $R_{2D}$ can be directly used as the position embedding of the search region, i.e., $R_x = R_{2D}$:

$$R_{2D} = \begin{pmatrix} r_{11} & \cdots & r_{1w} \\ \vdots & \ddots & \vdots \\ r_{h1} & \cdots & r_{hw} \end{pmatrix}. \tag{2}$$

A sub-matrix of $R_x$ is used as the position embedding of the template region of size $h_z \times w_z$, since its resolution is smaller than that of the search region. Specifically, starting from the top-left corner of $R_x$, a region whose size equals that of the template region is selected, which can be written as

$$R_z = \begin{pmatrix} r_{11} & \cdots & r_{1 w_z} \\ \vdots & \ddots & \vdots \\ r_{h_z 1} & \cdots & r_{h_z w_z} \end{pmatrix}. \tag{3}$$

Since the position embeddings of the template region $R_z$ and the search region $R_x$ are both derived from the same position embedding $R_{2D}$, which is consistent with the pre-trained ViT model, the uniformity of position encoding is ensured.
Additionally, two learnable embeddings, $E_z^{type} \in \mathbb{R}^d$ and $E_x^{type} \in \mathbb{R}^d$, are designed to distinguish the template-region tokens from the search-region tokens and, within the template region, the target tokens from the background tokens. $E_z^{type}$ is further divided into the target-region embedding $E_z^{o\text{-}type} \in \mathbb{R}^d$ and the background-region embedding $E_z^{b\text{-}type} \in \mathbb{R}^d$. Through resolution-decoupled position and type embedding enhancement, the template-region embedding $E_z$ and the search-region embedding $E_x$ can be represented as

$$E_x(i,j) = R_x(i,j) + E_x^{type} \tag{4}$$

and

$$E_z(i,j) = \begin{cases} R_z(i,j) + E_z^{o\text{-}type}, & \text{if } z(i,j) \text{ is in the target region}, \\ R_z(i,j) + E_z^{b\text{-}type}, & \text{otherwise}. \end{cases} \tag{5}$$
Accordingly, the enhanced features can be obtained as

$$X_{RGB}^{0} = \tilde{X}_{RGB}^{0} + E_x, \quad Z_{RGB}^{0} = \tilde{Z}_{RGB}^{0} + E_z, \quad X_{AUX}^{0} = \tilde{X}_{AUX}^{0} + E_x, \quad Z_{AUX}^{0} = \tilde{Z}_{AUX}^{0} + E_z. \tag{6}$$
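The following sketch, written against Equations (2)–(6), shows one way to implement the resolution-decoupled position embedding (template positions taken from the top-left sub-matrix of the search-region table) together with the search/target/background type embeddings. The grid sizes match Table 1; the class name and the boolean target mask interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledInputEnhancement(nn.Module):
    def __init__(self, dim=768, h_x=27, w_x=27, h_z=14, w_z=14):
        super().__init__()
        # One shared 2-D positional table sized to the search-region grid (R_2D).
        self.pos_2d = nn.Parameter(torch.zeros(h_x, w_x, dim))
        # Type embeddings: search region, template target, template background.
        self.type_x = nn.Parameter(torch.zeros(dim))
        self.type_z_obj = nn.Parameter(torch.zeros(dim))
        self.type_z_bg = nn.Parameter(torch.zeros(dim))
        self.h_z, self.w_z = h_z, w_z

    def forward(self, tok_x, tok_z, target_mask):
        """tok_x: (B, h_x*w_x, d); tok_z: (B, h_z*w_z, d);
        target_mask: (h_z*w_z,) bool, True where the template cell covers the target."""
        h_x, w_x, d = self.pos_2d.shape
        pos_x = self.pos_2d.reshape(h_x * w_x, d)                   # R_x = R_2D
        pos_z = self.pos_2d[: self.h_z, : self.w_z].reshape(-1, d)  # R_z: top-left sub-matrix
        type_z = torch.where(target_mask[:, None], self.type_z_obj, self.type_z_bg)
        return tok_x + pos_x + self.type_x, tok_z + pos_z + type_z
```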

2.2. Dynamic Adaptive Prompt Fusion Module

As shown in Figure 1, we assume that there are $N$ encoder layers, $M_1, \ldots, M_N$, in the ViT backbone and the corresponding $N$ multi-modal prompt learning modules $P_1, \ldots, P_N$. The prompt learning process is

$$C^{0} = \mathrm{Concat}(X_{RGB}^{0}, Z_{RGB}^{0}), \quad C_{AUX}^{0} = \mathrm{Concat}(X_{AUX}^{0}, Z_{AUX}^{0}), \quad C_{AUX}^{i} = P_i(C^{i-1}, C_{AUX}^{i-1}), \quad C^{i} = M_i(C^{i-1} + C_{AUX}^{i}), \tag{7}$$

where $\mathrm{Concat}$ denotes token concatenation, $C^{i}, C_{AUX}^{i} \in \mathbb{R}^{L \times d}$, and $L = L_x + L_z$. $P_i$ is the $i$-th dynamic adaptive prompt fusion module, which mainly includes three core components: (1) dynamic channel scaling convolution for modality-specific weighting, (2) low-rank cross-modal attention for multi-modal interaction, and (3) grouped linear mapping for efficient output mapping, as shown in Figure 2.
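A minimal sketch of the prompt-learning loop in Equation (7) is given below; `encoder_layers` stands for the frozen ViT blocks $M_i$ and `prompt_modules` for the trainable DAPF modules $P_i$, both passed in as placeholders rather than the actual implementation.

```python
import torch

def prompt_forward(x_rgb0, z_rgb0, x_aux0, z_aux0, encoder_layers, prompt_modules):
    """encoder_layers: frozen ViT blocks M_1..M_N (token-in, token-out).
    prompt_modules: trainable DAPF modules P_1..P_N taking (C, C_aux) -> C_aux."""
    c = torch.cat([x_rgb0, z_rgb0], dim=1)          # C^0
    c_aux = torch.cat([x_aux0, z_aux0], dim=1)      # C_AUX^0
    for layer, prompt in zip(encoder_layers, prompt_modules):
        c_aux = prompt(c, c_aux)                    # C_AUX^i = P_i(C^{i-1}, C_AUX^{i-1})
        c = layer(c + c_aux)                        # C^i = M_i(C^{i-1} + C_AUX^i)
    return c
```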
(1) Dynamic Channel Scaling Convolution
For the modal feature $C$, the statistic $s$ of the feature is calculated by

$$s = \frac{1}{N} \sum_{i=1}^{N} C[i,:] \in \mathbb{R}^{d}, \tag{8}$$

and the dynamic scaling factor $\alpha$ is calculated by

$$\alpha = \mathrm{Softmax}(W_2 \cdot \mathrm{GELU}(W_1 s)), \tag{9}$$

where $W_1 \in \mathbb{R}^{r \times d}$ and $W_2 \in \mathbb{R}^{K \times r}$ are learnable parameters, and $K$ is the number of dynamic convolution kernels.
Two types of features are separately learned using depth-wise separable convolution through

$$X = \mathrm{Reshape}(C) \in \mathbb{R}^{d \times H \times W}, \quad M = \mathrm{DWConv}(X) \odot \sum_{k=1}^{K} \alpha[k] \cdot C_k, \tag{10}$$

where $\mathrm{DWConv}: \mathbb{R}^{d \times H \times W} \rightarrow \mathbb{R}^{(d/\beta) \times H \times W}$ is a depth-wise separable convolution, $C_k \in \mathbb{R}^{(d/\beta) \times 1 \times 1}$ is a dynamic channel scaling vector, and $\beta$ is a hyper-parameter controlling the dimensionality reduction ratio.
The intermediate results $M$ and $M_{AUX}$ are obtained by applying the above dynamic channel scaling convolution to the two features $C^{i-1}$ and $C_{AUX}^{i-1}$.
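A possible PyTorch rendering of Equations (8)–(10) follows. It reads the combination in Equation (10) as channel-wise multiplication between the depth-wise convolution output and the $\alpha$-weighted scaling vectors, and it assumes the tokens form a single $h \times w$ grid (in the full model, the search and template tokens would each be reshaped to their own grids); these points, as well as the default sizes, are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicChannelScalingConv(nn.Module):
    """Sketch of Eqs. (8)-(10); the scaling vectors are applied by channel-wise
    multiplication to the depth-wise separable convolution output (our reading)."""
    def __init__(self, dim=768, reduce=4, n_kernels=4, rank=64):
        super().__init__()
        out_dim = dim // reduce                                   # d / beta
        self.w1 = nn.Linear(dim, rank)                            # W_1
        self.w2 = nn.Linear(rank, n_kernels)                      # W_2
        self.dwconv = nn.Sequential(                              # depth-wise separable conv
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, out_dim, 1),
        )
        self.scale_vectors = nn.Parameter(torch.ones(n_kernels, out_dim))  # C_k

    def forward(self, tokens, h, w):
        """tokens: (B, L, d) with L = h * w."""
        b, l, d = tokens.shape
        s = tokens.mean(dim=1)                                    # Eq. (8): token-wise mean
        alpha = F.softmax(self.w2(F.gelu(self.w1(s))), dim=-1)    # Eq. (9): shape (B, K)
        x = tokens.transpose(1, 2).reshape(b, d, h, w)            # Reshape(C)
        m = self.dwconv(x)                                        # (B, d/beta, h, w)
        scale = (alpha @ self.scale_vectors)[:, :, None, None]    # sum_k alpha[k] * C_k
        return m * scale                                          # Eq. (10)
```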
(2) Low-Rank Cross-Modal Attention
A low-rank attention mechanism for cross-modal learning is proposed to overcome the large number of parameters and the high complexity of traditional attention mechanisms. For $M$ and $M_{AUX}$, low-rank projections are obtained by

$$Q = M W_Q \in \mathbb{R}^{N \times d}, \quad K = [M, M_{AUX}] W_K \in \mathbb{R}^{2N \times d}, \quad V = [M, M_{AUX}] W_V \in \mathbb{R}^{2N \times d}, \tag{11}$$

where $W_Q, W_K, W_V \in \mathbb{R}^{(d/\beta) \times d}$ are learnable parameters. Based on Equation (11), a multi-modal interaction feature $O$ is obtained by dynamically fusing the attention output and the residual feature, as follows:

$$\mathrm{Attn} = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N \times 2N}, \quad O = \gamma \cdot (\mathrm{Attn}\, V) + (1 - \gamma) \cdot M_{AUX}, \tag{12}$$

where the fusion coefficient $\gamma$ is a learnable parameter.
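The sketch below follows Equations (11)–(12): queries come from the RGB-side feature $M$, keys and values from the concatenation $[M, M_{AUX}]$, and a learnable coefficient $\gamma$ blends the attention output with the auxiliary feature. The low-rank width and the output projection back to the token dimension are illustrative choices, not the exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankCrossModalAttention(nn.Module):
    """Sketch of Eqs. (11)-(12): queries from one stream, keys/values from both
    streams, all projected through low-rank matrices."""
    def __init__(self, dim=192, rank=48):
        super().__init__()
        self.q = nn.Linear(dim, rank, bias=False)    # W_Q
        self.k = nn.Linear(dim, rank, bias=False)    # W_K
        self.v = nn.Linear(dim, rank, bias=False)    # W_V
        self.out = nn.Linear(rank, dim, bias=False)  # project back to the token width
        self.gamma = nn.Parameter(torch.tensor(0.5)) # learnable fusion coefficient

    def forward(self, m, m_aux):
        """m, m_aux: (B, L, dim) token features of the two modalities."""
        kv_in = torch.cat([m, m_aux], dim=1)                     # [M, M_AUX]: (B, 2L, dim)
        q, k, v = self.q(m), self.k(kv_in), self.v(kv_in)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # (B, L, 2L)
        o = self.out(attn @ v)
        return self.gamma * o + (1 - self.gamma) * m_aux         # Eq. (12)
```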
(3) Grouped Linear Mapping
Efficient grouped linear mapping is used to compute the prompt output feature $C_{AUX}^{i} \in \mathbb{R}^{L \times d}$ in Equation (7), as follows:

$$C_{AUX}^{i} = \mathrm{GroupLinear}(O) = \mathrm{Concat}_{j=1}^{g}\big(O[:, (j-1) \cdot d/g : j \cdot d/g]\, W_j\big), \tag{13}$$

where $W_j \in \mathbb{R}^{(d/g) \times (d/g)}$ is the learnable parameter of the $j$-th group and $g$ is the number of groups. For computational efficiency, $g$ is set to 8 in this study.
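A compact implementation of the grouped linear mapping in Equation (13) can look as follows; with $g = 8$ the parameter count drops from $d^2$ to $d^2/8$. The initialization and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Sketch of Eq. (13): the channel dimension is split into g groups, each group
    has its own (d/g x d/g) weight W_j, and the outputs are re-concatenated."""
    def __init__(self, dim=768, groups=8):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.weights = nn.Parameter(torch.randn(groups, dim // groups, dim // groups) * 0.02)

    def forward(self, o):
        """o: (B, L, dim)."""
        b, l, d = o.shape
        o = o.reshape(b, l, self.groups, d // self.groups)        # split channels into groups
        out = torch.einsum('blgc,gcd->blgd', o, self.weights)     # per-group linear map W_j
        return out.reshape(b, l, d)                               # concatenate groups back
```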

2.3. Fully Connected Head Network Module

Traditional models such as OSTrack [32] use convolutional head networks for classification and regression, which have strong inductive biases, such as locality assumptions and translation invariance. The local receptive field of convolutional kernels limits the global feature interaction ability. For this reason, a fully connected head network is utilized in this study, which entirely uses multi-layer perceptrons for classification and regression.
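The idea can be sketched as two small MLPs applied to the search-region tokens, one scoring each token as target/background and one regressing a box per token; the layer widths, depth, and the per-token box parameterization below are assumptions for illustration, not the paper's exact head design.

```python
import torch
import torch.nn as nn

class FullyConnectedHead(nn.Module):
    """Sketch of an MLP-only head with classification and regression branches."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, out_dim))
        self.cls_branch = mlp(1)    # foreground score per token
        self.reg_branch = mlp(4)    # normalized box per token (parameterization assumed)

    def forward(self, search_tokens):
        """search_tokens: (B, L_x, dim) search-region tokens from the encoder."""
        scores = self.cls_branch(search_tokens).squeeze(-1).sigmoid()  # (B, L_x)
        boxes = self.reg_branch(search_tokens).sigmoid()               # (B, L_x, 4)
        return scores, boxes
```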
In this module, a multi-task joint loss function, $L_{total} = L_{cls} + L_{box}$, is used to train the model. The classification loss $L_{cls}$ adopts the binary cross-entropy (BCE) loss and can be defined as

$$L_{cls} = -\frac{1}{N_{pos}} \sum_{i,j} \big[\, \bar{y}_{i,j} \log \hat{y}_{i,j} + (1 - \bar{y}_{i,j}) \log (1 - \hat{y}_{i,j}) \,\big], \tag{14}$$

where $\bar{y}_{i,j}$ is the IoU value between the predicted box $B_{i,j}^{pred}$ and the ground-truth box $B^{gt}$ for each foreground point $(i,j)$, $\hat{y}_{i,j} = C[i,j] \in [0,1]$ is the predicted score of the classification branch, and $N_{pos}$ is the total number of foreground points.
The regression loss $L_{box}$ adopts the generalized IoU (GIoU), which provides effective gradients even when the predicted and ground-truth boxes do not overlap. It can be defined as

$$L_{box} = \frac{1}{N_{pos}} \sum_{i,j} \mathbb{I}\,[y_{i,j} = 1]\, \big(1 - \mathrm{GIoU}(B_{i,j}^{pred}, B^{gt})\big), \tag{15}$$

where $\mathbb{I}$ is the indicator function, so that only the loss of foreground points is calculated.
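A self-contained sketch of the joint loss in Equations (14)–(15) is given below. The IoU/GIoU computation, the box format, and the choice to set background classification labels to zero are our reading of the formulas; helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def box_iou_giou(pred, gt):
    """pred, gt: (N, 4) boxes as (x1, y1, x2, y2). Returns row-wise IoU and GIoU."""
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = (area_p + area_g - inter).clamp(min=1e-6)
    cx1, cy1 = torch.min(pred[:, 0], gt[:, 0]), torch.min(pred[:, 1], gt[:, 1])
    cx2, cy2 = torch.max(pred[:, 2], gt[:, 2]), torch.max(pred[:, 3], gt[:, 3])
    enclose = ((cx2 - cx1) * (cy2 - cy1)).clamp(min=1e-6)
    iou = inter / union
    return iou, iou - (enclose - union) / enclose

def joint_loss(scores, boxes, gt_box, fg_mask):
    """scores: (L,) in [0, 1]; boxes: (L, 4); gt_box: (4,); fg_mask: (L,) bool.
    Background classification labels are set to 0 (our reading of Eq. (14))."""
    n_pos = fg_mask.sum().clamp(min=1).float()
    gt = gt_box.unsqueeze(0).expand_as(boxes)
    iou, giou = box_iou_giou(boxes, gt)
    labels = torch.where(fg_mask, iou.detach(), torch.zeros_like(iou))   # soft labels
    l_cls = F.binary_cross_entropy(scores, labels, reduction='sum') / n_pos   # Eq. (14)
    l_box = (1.0 - giou[fg_mask]).sum() / n_pos                               # Eq. (15)
    return l_cls + l_box
```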

3. Experimental Results and Discussion

3.1. Datasets

The performance of MPVT was evaluated and compared using three multi-modal datasets, LasHeR [33], DepthTrack [7], and VisEvent [9], corresponding to three multi-modal scenarios: RGB-T, RGB-D, and RGB-E, respectively. LasHeR is the largest and most diverse publicly available dataset in RGB-T target tracking, consisting of 1224 visible light and thermal infrared video pairs with over 730K frames. DepthTrack includes 200 video sequences, with an average sequence length of 1473 frames, providing sufficiently long and continuously changing training data. In addition, it covers 40 different scenario types and 90 target categories. VisEvent contains 820 video pairs, which were collected in complex real-world scenarios such as those with low light, high-speed motion, and cluttered backgrounds. It is widely used for the performance evaluation of RGB-E tracking methods.

3.2. Evaluation Indices

Four evaluation indices are used to evaluate the performance of the methods: the success rate ($S_r$), the tracking recall rate ($TRe$), precision ($P_r$), and the F1-score.
The success rate is the ratio of successfully tracked frames to the total number of frames at a specific IoU threshold $\beta$. If the IoU value of a frame is greater than or equal to $\beta$, the tracking of that frame is considered successful. $S_r$ can be defined as

$$S_r = \frac{1}{N} \sum_{i=1}^{N} I(IoU_i \geq \beta), \tag{16}$$

$$IoU_i = \frac{|p_i \cap g_i|}{|p_i \cup g_i|}, \tag{17}$$

where $N$ is the total number of frames in a video, $IoU_i$ is the IoU value of the $i$-th frame, and $I(\cdot)$ is the indicator function whose value is 1 if the condition is satisfied and 0 otherwise. $p_i$ is the predicted bounding box and $g_i$ is the ground-truth bounding box.
Precision measures the degree of overlap between the predicted and ground-truth bounding boxes. It can be defined as

$$P_r = \frac{1}{N} \sum_{i=1}^{N} IoU_i. \tag{18}$$
The tracking recall rate ($TRe$) is also used to evaluate the tracking method. Unlike traditional recall, $TRe$ measures the tracker's ability to correctly detect targets in visible frames. It can be defined as

$$TRe = \frac{\sum_{t=1}^{N} I(\hat{v}_t > \beta,\ v_t = 1)}{\sum_{t=1}^{N} I(v_t = 1)}, \tag{19}$$

where $v_t \in \{0, 1\}$ is an annotation variable indicating whether the target is visible in frame $t$, and $\hat{v}_t \in [0, 1]$ is the confidence score for frame $t$.
Generally, a large $P_r$ leads to a low $TRe$ and vice versa. For a fairer comparison, a unified index, the F1-score, which combines $P_r$ and $TRe$, is introduced to evaluate the performance of a method. It can be defined as

$$F1 = \frac{2 \times P_r \cdot TRe}{P_r + TRe}. \tag{20}$$
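Given per-frame IoU values, confidence scores, and visibility labels, the four indices in Equations (16)–(20) can be computed as in the following sketch; the threshold value and function names are illustrative, and $P_r$ follows the text's definition as the mean IoU.

```python
import numpy as np

def evaluate(ious, confidences, visible, beta=0.5):
    """ious: per-frame IoU (Eq. 17); confidences: per-frame confidence in [0, 1];
    visible: per-frame 0/1 visibility labels; beta: IoU/confidence threshold."""
    ious, confidences, visible = map(np.asarray, (ious, confidences, visible))
    sr = np.mean(ious >= beta)                                        # Eq. (16): success rate
    pr = np.mean(ious)                                                # Eq. (18): precision
    vis = visible == 1
    tre = np.sum((confidences > beta) & vis) / max(np.sum(vis), 1)    # Eq. (19): recall
    f1 = 2 * pr * tre / max(pr + tre, 1e-9)                           # Eq. (20): F1-score
    return sr, pr, tre, f1
```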

3.3. Model Parameter Settings

The model was trained for 60 epochs with a batch size of 16, sampling $6 \times 10^5$ image pairs per epoch. AdamW was used as the optimizer, with a weight decay of $10^{-4}$. The learning rate was updated by the CosineLR schedule [34], with maximal and minimal learning rates of $10^{-4}$ and $10^{-6}$, respectively. The parameters of the different modules were set as frozen or learnable as shown in Figure 1. The learnable prompt parameters were initialized using the Xavier algorithm [35], and the other training parameters are shown in Table 1.
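A minimal sketch of this training configuration, assuming a `model` whose frozen/learnable flags are already set as in Figure 1; plain cosine annealing is used here in place of the exact CosineLR schedule of [34], and all names are illustrative.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, epochs=60):
    # Only parameters left trainable (prompt modules, auxiliary embedding, head) are optimized.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable, lr=1e-4, weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)
    return optimizer, scheduler
```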

3.4. Main Results

MPVT was compared with various methods on the three datasets, including TBSI [4], APFNet [5], FANet [17], CAT [12], MDNet [36], PrDIMP50 [37], LTMU [38], OSTrack [32], TransT [39], DAL [8], TATrack [40], TABBTrack [41], ProTrack [29], ViPT [19], and Un-Track [42].
For the RGB-T scenario, various types of RGB-T tracking models were compared with MPVT. Early methods such as FANet [17] mainly improved the complementarity of multi-modal information through feature fusion. Feature decoupling-based methods such as CAT [12] and APFNet [5] introduced challenge-aware attribute branches and progressive fusion mechanisms, respectively. TransT [39] applied the Transformer to visual tracking, significantly enhancing the global information interaction between templates and search regions. OSTrack [32] and TBSI [4] further optimized the Transformer architecture, achieving significant improvements in precision and generalization. ProTrack [29] utilized a multi-modal prompt tracking framework, and ViPT [19] further fine-tuned the pre-trained model through visual prompts. We conducted extensive comparisons with the methods mentioned above, and the experimental results are shown in the third and fourth columns of Table 2. MPVT achieved the best performance among all models, with a success rate of 0.578 and a precision of 0.724. Compared with the second-best model, TATrack [40], it increased the success rate and precision by 1.9% and 2.2%, respectively, and compared with ViPT [19], it increased the success rate and precision by 5.3% and 7.3%, respectively.
For the RGB-D scenario, various models were introduced as baselines to conduct a comprehensive comparison. Besides the methods used in the RGB-T scenario, some additional methods were compared with MPVT. DAL [8] constructed a depth-aware long-term tracking framework through an end-to-end deep feature fusion network. TABBTrack [41] adopted a three-stream architecture and combined temporal features to fuse depth images. The results are shown in columns 5–7 of Table 2. MPVT achieved the highest performance among all the compared models, with an F1-score of 0.628, a tracking recall rate of 0.626, and a precision of 0.631. Compared with ViPT [19], it increased the F1-score, $TRe$, and $P_r$ by 3.4%, 3.0%, and 3.9%, respectively, and compared with the second-best model, TABBTrack [41], it increased the F1-score, $TRe$, and $P_r$ by 1.0%, 1.1%, and 0.9%, respectively.
For the RGB-E scenario, MPVT achieved the highest tracking success rate (0.624) and precision (0.761) among all the compared models, with improvements of 3.2% and 0.3%, respectively, over the second-best model, ViPT [19], as shown in the eighth and ninth columns of Table 2.
Regarding degraded scenarios, Figure 3 and Figure 4 present precision plots for 15 degraded scenarios involving the LasHeR and VisEvent datasets, respectively (DepthTrack is excluded due to unavailable degradation labels). MPVT consistently outperformed baselines across all scenarios. The performance gain was notable under abrupt illumination variation (AIV), hyaline occlusion (HO), and background clutter (BC) in LasHeR and also under low illumination (LI), AIV, and HO in the VisEvent dataset. For instance, MPVT outperformed ViPT by 0.306 under AIV. These results demonstrate the model’s robust performance in degraded conditions.
Regarding efficiency, MPVT has only 0.9% (2.8 M) trainable parameters while achieving competitive tracking performance. This represents a much higher transfer efficiency than traditional fully fine-tuned models, whose trainable parameter ratio is 100% and whose learnable parameters generally exceed 100 M. ViPT, also a prompt-based model, has 0.4% (0.84 M) trainable parameters; although MPVT has more trainable parameters, it achieved better tracking performance across all three datasets. Experimental results across the RGB-T, RGB-D, and RGB-E scenarios demonstrate that MPVT preserves the feature extraction capabilities of pre-trained models. Through decoupled input enhancement and dynamic prompt fusion, it avoids feature degradation while effectively integrating complementary multi-modal information. Compared with full-parameter fine-tuning, MPVT maintains higher transfer efficiency.

3.5. Ablation Study

To analyze the effectiveness of the proposed components, ablation experiments were conducted on the three datasets. The experimental results are shown in Table 3. Thanks to the efficient prompt-learning training paradigm, MPVT achieved the best performance on RGB-T, RGB-D, and RGB-E while training only 0.9% of the parameters. First, when the fully connected head network module was replaced with a traditional convolutional head network (MPVT-FC), performance decreased by 0.01 to 0.02, indicating that fully connected head networks effectively avoid the inductive bias of traditional convolutional head networks. Second, when the dynamic adaptive prompt fusion module was further removed from MPVT-FC (MPVT-PF), the model reverted to a full-parameter fine-tuning strategy, and each index decreased by 0.04 to 0.05 on all three datasets. This indicates that the fusion module is crucial for efficient parameter learning and enables efficient transfer from a single-modal model pre-trained on large-scale RGB data to multi-modal scenarios such as RGB-T, RGB-D, and RGB-E. Finally, when the input enhancement module was also removed (MPVT-IE), performance decreased by about 0.01, showing that the proposed position and type embedding enhancement contributes to accurate target tracking.

3.6. Model Analysis

3.6.1. Sensitivity Analysis of Prompt Depth

The dynamic adaptive prompt fusion (DAPF) module can be inserted before any ViT encoder layer. The number of DAPF modules used determines the prompt depth: shallow integration applies DAPF only in the first layer, whereas deep integration applies it in every layer. To further assess the impact of prompt fusion depth on tracking performance, we conducted experiments using a scaled-down MPVT model (with 12 encoder layers and 768 hidden dimensions). Performance variations were observed through primary metrics across the three datasets.
This experiment evaluated tracking performance at prompt depths N p { 1 , 3 , 5 , 9 , 12 } . As shown in Figure 5, increasing the prompt depth consistently enhanced performance across all three datasets, demonstrating the effectiveness and scalability of our dynamic adaptive prompt fusion (DAPF) module.

3.6.2. Feature Visualization

A visualization of the tracking results of MPVT and a baseline model in three multi-modal scenarios is shown in Figure 6. From Figure 6d,e, it can be seen that the feature map of MPVT is more concentrated, clearer, and less noisy, while that of the baseline model is more scattered. This indicates that MPVT has a more stable discrimination of targets. From the tracking results in Figure 6b,c, it can also be observed that MPVT (red box) has higher tracking precision compared with the baseline model (green box).

3.6.3. Model Efficiency

The efficiency of MPVT was analyzed from three aspects: training time, GPU memory usage, and inference speed, as shown in Table 4. Compared with full-parameter fine-tuning, prompt learning significantly decreased training time and GPU memory usage, saving 43.8% of GPU memory and 62.9% of training time. In terms of inference speed, owing to the added parameters and computational complexity, inference speed decreased slightly, by about 1 fps. The dynamic adaptive prompt fusion (DAPF) module maintains consistent computational complexity for all auxiliary modalities. This minimal inference overhead enables significant practical advantages: the 43.8% memory reduction allows operation on resource-constrained edge devices, and the efficiency gains allow more video streams to be processed simultaneously.

3.6.4. Differences Between MPVT and Vision–Language Prompting

MPVT shares core principles with CLIP-style prompting [43] but differs in three aspects: (1) Token construction: CLIP uses discrete text tokens, while MPVT employs continuous visual prompts requiring no vocabulary constraints. (2) Fusion granularity: CLIP fuses global semantic features, whereas our low-rank cross-attention preserves spatial details critical for localization. (3) Knowledge transfer: Both leverage frozen backbones, but MPVT’s dynamic scaling adapts to sensor-specific noise patterns unseen in CLIP. These differences highlight how prompt engineering must be rethought for dense prediction tasks.

4. Conclusions

Focusing on the issues of low transfer efficiency, catastrophic forgetting of pre-trained knowledge, and weak multi-modal correlation modeling in traditional full-parameter fine-tuning, an efficient multi-modal prompt vision tracker is proposed to achieve efficient prompt learning on the basis of an RGB pre-trained model. The proposed model minimizes the cost of fine-tuning on downstream tasks while achieving higher multi-modal tracking performance. The model mainly includes three key components: a decoupled input enhancement module, a dynamic adaptive prompt fusion module, and a fully connected head network module. The decoupled input enhancement module enriches the input via position and type embeddings. The dynamic adaptive prompt fusion module achieves efficient prompt learning and multi-modal interaction through scaling convolution and low-rank cross-modal attention. The fully connected head network module overcomes the shortcomings of traditional convolutional head networks, such as inductive biases.
The performance of MPVT was evaluated under three multi-modal scenarios: RGB-T, RGB-D, and RGB-E. The experimental results show that the model achieved the best performance on all three multi-modal datasets, achieving tracking precisions of 72.4%, 63.1%, and 76.1%, respectively, and outperforming the second-best model by 2.2%, 0.9%, and 0.3%, respectively. Further ablation studies showed that all three proposed components contribute to more accurate target tracking. Furthermore, the model achieved efficient multi-modal transfer with less than 1% of learnable parameters. The efficiency analysis showed that MPVT is much more effective in training time and GPU memory usage than full-parameter fine-tuning models, saving up to 43.8% of GPU memory and 62.9% of training time.
Recently, many researchers have fused vision and text in various large models [44]. Extending MPVT to vision–language tracking requires solving the fundamental challenge of spatial–semantic alignment, i.e., bridging textual descriptions with visual coordinates while maintaining real-time efficiency. Key research directions include the following: (1) developing geometry-aware attention mechanisms to ground linguistic concepts (e.g., “left”, “occluded”) to pixel-level locations; (2) designing memory-augmented architectures that preserve object identity across evolving text descriptions; and (3) designing efficient fusion schemes to avoid computational explosion when combining visual tokens with linguistic tokens. In future work, we will introduce linguistic information into MPVT to further extend its multi-modal scenarios.

Author Contributions

Conceptualization, J.X. and D.C.; methodology, J.X., Y.F. (Yan Fu) and J.Z.; software, J.X. and T.H.; validation, T.H., X.W. and Y.F. (Yuke Fang); formal analysis, J.X., T.H. and X.W.; investigation, J.X., Y.F. (Yuke Fang) and D.C.; resources, X.W. and Y.F. (Yuke Fang); data curation, X.W. and Y.F. (Yuke Fang); writing—original draft preparation, J.X. and D.C.; writing—review and editing, J.X., J.Z., Y.F. (Yan Fu), X.W. and D.C.; visualization, T.H. and X.W.; supervision, J.Z. and D.C.; funding acquisition, Y.F. (Yan Fu) and D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Major Science and Technology Projects in Sichuan Province under Grant No. 2024ZDZX0021 and the Major Program of National Natural Science Foundation of China under Grant No. T2293771.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Three datasets used in this paper are available from public URLs. LasHeR is available at https://github.com/BUGPLEASEOUT/LasHeR (accessed on 15 January 2025), DepthTrack is available at https://github.com/xiaozai/DeT (accessed on 21 January 2025), and VisEvent is available at https://github.com/wangxiao5791509/VisEvent_SOT_Benchmark (accessed on 2 February 2025).

Conflicts of Interest

Authors Yan Fu, Junlin Zhou, Tianxiang He, Yuke Fang and Duanbing Chen were employed by the company Chengdu Union Big Data Tech. Inc. Author Xiaopeng Wang was employed by the company Zhuhai Yiyuan Technology Co., Ltd. Author Duanbing Chen was employed by the agency Suining Institute of Digital Economy. The remaining authors declare that this research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

1. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13598–13608.
2. Zhang, Z.; Peng, H. Deeper and Wider Siamese Networks for Real-Time Visual Tracking. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4586–4595.
3. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to sequence learning for visual object tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14572–14581.
4. Hui, T.; Xun, Z.; Peng, F.; Huang, J.; Wei, X.; Wei, X.; Dai, J.; Han, J.; Liu, S. Bridging search region interaction with template for RGB-T tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13630–13639.
5. Xiao, Y.; Yang, M.; Li, C.; Liu, L.; Tang, J. Attribute-based progressive fusion network for RGBT tracking. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2831–2838.
6. Feng, M.; Su, J. Learning reliable modal weight with transformer for robust RGBT tracking. Knowl.-Based Syst. 2022, 249, 108945.
7. Yan, S.; Yang, J.; Käpylä, J.; Zheng, F.; Leonardis, A.; Kämäräinen, J.K. DepthTrack: Unveiling the power of RGBD tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10705–10713.
8. Qian, Y.; Yan, S.; Lukežič, A.; Kristan, M.; Kämäräinen, J.K.; Matas, J. DAL: A deep depth-aware long-term tracker. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7825–7832.
9. Wang, X.; Li, J.; Zhu, L.; Zhang, Z.; Chen, Z.; Li, X.; Wang, Y.; Tian, Y.; Wu, F. VisEvent: Reliable object tracking via collaboration of frame and event flows. IEEE Trans. Cybern. 2024, 54, 1997–2010.
10. Zhang, J.; Wang, Y.; Liu, W.; Li, M.; Bai, J.; Yin, B.; Yang, X. Frame-event alignment and fusion network for high frame rate tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9781–9790.
11. Zhu, Z.; Hou, J.; Wu, D.O. Cross-modal orthogonal high-rank augmentation for RGB-event transformer-trackers. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 22045–22055.
12. Li, C.; Liu, L.; Lu, A.; Ji, Q.; Tang, J. Challenge-aware RGBT tracking. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12367, pp. 222–237.
13. Zhang, P.; Wang, D.; Lu, H.; Yang, X. Learning adaptive attribute-driven representation for real-time RGB-T tracking. Int. J. Comput. Vis. 2021, 129, 2714–2729.
14. Li, C.L.; Lu, A.; Zheng, A.H.; Tu, Z.; Tang, J. Multi-adapter RGBT tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2262–2270.
15. Lu, A.; Li, C.; Yan, Y.; Tang, J.; Luo, B. RGBT tracking via multi-adapter network with hierarchical divergence loss. IEEE Trans. Image Process. 2021, 30, 5613–5625.
16. Lu, A.; Qian, C.; Li, C.; Tang, J.; Wang, L. Duality-gated mutual condition network for RGBT tracking. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4118–4131.
17. Zhu, Y.; Li, C.; Tang, J.; Luo, B. Quality-aware feature aggregation network for robust RGBT tracking. IEEE Trans. Intell. Veh. 2021, 6, 121–130.
18. Wang, C.; Xu, C.; Cui, Z.; Zhou, L.; Zhang, T.; Zhang, X.; Yang, J. Cross-modal pattern-propagation for RGB-T tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7062–7071.
19. Zhu, Y.; Guo, R.; Wang, Y.; Lu, W. ViPT: Visual prompt multi-modal tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12234–12243.
20. Hou, R.; Xu, B.; Ren, T.; Wu, G. MTNet: Learning modality-aware representation with transformer for RGBT tracking. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1163–1168.
21. Xue, Y.; Zhang, J.; Lin, Z.; Li, C.; Huo, B.; Zhang, Y. SiamCAF: Complementary attention fusion-based siamese network for RGBT tracking. Remote Sens. 2023, 15, 3252.
22. Wang, H.; Xu, T.; Tang, Z.; Wu, X.J.; Kittler, J. Multi-modal adapter for RGB-T tracking. Inf. Fusion 2025, 118, 102940.
23. Hannuna, S.; Camplani, M.; Hall, J.; Mirmehdi, M.; Damen, D.; Burghardt, T.; Paiement, A.; Tao, L. DS-KCF: A real-time tracker for RGB-D data. J. Real-Time Image Process. 2019, 16, 1439–1458.
24. Xiao, J.; Stolkin, R.; Gao, Y.; Leonardis, A. Robust fusion of color and depth data for RGB-D target tracking using adaptive range-invariant depth models and spatio-temporal consistency constraints. IEEE Trans. Cybern. 2018, 48, 2485–2499.
25. Jiang, M.; Deng, C.; Shan, J.; Wang, Y.; Jia, Y.; Sun, X. Hierarchical multi-modal fusion FCN with attention model for RGB-D tracking. Inf. Fusion 2019, 50, 1–8.
26. Kart, U.; Lukežič, A.; Kristan, M.; Kämäräinen, J.K.; Matas, J. Object tracking by reconstruction with view-specific discriminative correlation filters. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1339–1348.
27. Gao, L.; Ke, Y.; Zhao, W.; Zhang, Y.; Jiang, Y.; He, G.; Li, Y. RGB-D visual object tracking with transformer-based multi-modal feature fusion. Knowl.-Based Syst. 2025, 322, 113531.
28. Zhang, J.; Yang, X.; Fu, Y.; Wei, X.; Yin, B.; Dong, B. Object tracking by jointly exploiting frame and event domain. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13023–13032.
29. Yang, J.; Li, Z.; Zheng, F.; Leonardis, A.; Song, J. Prompting for multi-modal tracking. In Proceedings of the MM ’22: 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3492–3500.
30. Hong, L.; Yan, S.; Zhang, R.; Li, W.; Zhou, X.; Guo, P.; Jiang, K.; Chen, Y.; Li, J.; Chen, Z.; et al. OneTracker: Unifying visual object tracking with foundation models and efficient tuning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19079–19091.
31. Shi, L.; Zhong, B.; Liang, Q.; Li, N.; Zhang, S.; Li, X. Explicit visual prompts for visual object tracking. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI’24), Vancouver, BC, Canada, 20–27 February 2024; AAAI Press: Cambridge, MA, USA, 2024; Volume 38, pp. 4838–4846.
32. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 341–357.
33. Li, C.; Xue, W.; Jia, Y.; Qu, Z.; Luo, B.; Tang, J.; Sun, D. LasHeR: A large-scale high-diversity benchmark for RGBT tracking. IEEE Trans. Image Process. 2022, 31, 392–404.
34. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2017.
35. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Italy, 13–15 May 2010; pp. 249–256.
36. Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302.
37. Danelljan, M.; Van Gool, L.; Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7181–7190.
38. Dai, K.; Zhang, Y.; Wang, D.; Li, J.; Lu, H.; Yang, X. High-performance long-term tracking with meta-updater. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6297–6306.
39. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8122–8131.
40. Wang, H.; Liu, X.; Li, Y.; Sun, M.; Yuan, D.; Liu, J. Temporal adaptive RGBT tracking with modality prompt. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; AAAI Press: Cambridge, MA, USA, 2024; Volume 38, pp. 5436–5444.
41. Ying, G.; Zhang, D.; Ou, Z.; Wang, X.; Zheng, Z. Temporal adaptive bidirectional bridging for RGB-D tracking. Pattern Recognit. 2025, 158, 111053.
42. Wu, Z.; Zheng, J.; Ren, X.; Vasluianu, F.A.; Ma, C.; Paudel, D.P.; Van Gool, L.; Timofte, R. Single-model and any-modality for video object tracking. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19156–19166.
43. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021.
44. Cheng, Z.; Chen, Q.; Zhang, J.; Fei, H.; Feng, X.; Che, W.; Li, M.; Qin, L. CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models. Proc. AAAI Conf. Artif. Intell. 2025, 39, 23678–23686.
Figure 1. The framework of MPVT.
Figure 2. The framework of dynamic adaptive prompt fusion.
Figure 3. Precision in 15 degraded scenarios in the LasHeR dataset.
Figure 4. Precision in 15 degraded scenarios in the VisEvent dataset.
Figure 5. Sensitivity analysis of prompt depth across three datasets.
Figure 6. Visualization of tracking in three multi-modal scenarios.
Table 1. The model parameter settings.

Parameter Name | Value
Layers (N) | 40
Hidden dimensions (d) | 1536
Size of search image ($H_x = W_x$) | 378
Size of template image ($H_z = W_z$) | 196
Feature size of search region | 27
Feature size of template region | 14
Epochs | 60
Batch size | 16
Learning rate | Maximal rate $10^{-4}$; minimal rate $10^{-6}$
Table 2. The results on LasHeR, DepthTrack, and VisEvent.

Method | R_TP * | LasHeR Sr (↑) | LasHeR Pr (↑) | DepthTrack F1 (↑) | DepthTrack TRe (↑) | DepthTrack Pr (↑) | VisEvent Sr (↑) | VisEvent Pr (↑)
TBSI [4] | 100% | 0.506 | 0.638 | - | - | - | - | -
APFNet [5] | 100% | 0.362 | 0.500 | - | - | - | - | -
FANet [17] | 100% | 0.309 | 0.441 | - | - | - | - | -
CAT [12] | 100% | 0.314 | 0.450 | - | - | - | - | -
MDNet [36] | 100% | - | - | - | - | - | 0.426 | 0.661
PrDIMP50 [37] | 100% | - | - | - | - | - | 0.453 | 0.644
LTMU [38] | 100% | - | - | 0.460 | 0.417 | 0.512 | 0.459 | 0.655
OSTrack [32] | 100% | 0.412 | 0.515 | 0.529 | 0.522 | 0.536 | 0.534 | 0.695
TransT [39] | 100% | 0.394 | 0.524 | - | - | - | 0.474 | 0.650
DAL [8] | 100% | - | - | 0.429 | 0.369 | 0.512 | - | -
TATrack [40] | 100% | 0.559 | 0.702 | - | - | - | - | -
TABBTrack [41] | 100% | - | - | 0.618 | 0.615 | 0.622 | - | -
ProTrack [29] | <1% | 0.420 | 0.538 | 0.578 | 0.573 | 0.583 | 0.471 | 0.632
ViPT [19] | 0.4% | 0.525 | 0.651 | 0.594 | 0.596 | 0.592 | 0.592 | 0.758
Un-Track [42] | 6.74% | 0.536 | 0.667 | - | - | - | 0.589 | 0.755
MPVT | 0.9% | 0.578 | 0.724 | 0.628 | 0.626 | 0.631 | 0.624 | 0.761

* R_TP — ratio of trainable parameters.
Table 3. Ablation results on LasHeR, DepthTrack, and VisEvent.

Method | R_TP * | LasHeR Sr (↑) | LasHeR Pr (↑) | DepthTrack F1 (↑) | DepthTrack TRe (↑) | DepthTrack Pr (↑) | VisEvent Sr (↑) | VisEvent Pr (↑)
MPVT | 0.9% | 0.578 | 0.724 | 0.628 | 0.626 | 0.631 | 0.624 | 0.761
MPVT-FC | 0.8% | 0.562 | 0.711 | 0.613 | 0.612 | 0.614 | 0.609 | 0.758
MPVT-PF | 100% | 0.523 | 0.654 | 0.570 | 0.569 | 0.571 | 0.594 | 0.754
MPVT-IE | 100% | 0.517 | 0.648 | 0.556 | 0.554 | 0.557 | 0.587 | 0.752

* R_TP — ratio of trainable parameters.
Table 4. Results of efficiency analysis.

Type | Training Time (Hours) ↓ | GPU Memory Usage (GB) ↓ | Inference Speed (fps) ↑
Full fine-tuning (removing prompt fusion module) | 62.2 | 38.1 | 18
Prompt learning (enabling prompt fusion module) | 23.1 | 21.4 | 17
