Article
Peer-Review Record

MPVT: An Efficient Multi-Modal Prompt Vision Tracker for Visual Target Tracking

Appl. Sci. 2025, 15(14), 7967; https://doi.org/10.3390/app15147967
by Jianyu Xie 1, Yan Fu 1,2, Junlin Zhou 1,2, Tianxiang He 2, Xiaopeng Wang 3, Yuke Fang 2 and Duanbing Chen 1,2,4,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 18 June 2025 / Revised: 11 July 2025 / Accepted: 14 July 2025 / Published: 17 July 2025
(This article belongs to the Special Issue Advanced Technologies Applied for Object Detection and Tracking)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article presents MPVT, a new, efficient multi-modal tracking model that relies on prompt tuning instead of fine-tuning all parameters, thereby enabling faster and more economical knowledge transfer.

The MPVT model introduces three main components: a decoupled input enhancement module, a dynamic adaptive prompt fusion module, and a fully connected head network.

This architecture, built on a pre-trained ViT encoder, allows most of the model layers to be frozen and only modality-specific modules to be trained, drastically reducing GPU memory requirements and training time.
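To make this parameter-efficiency argument concrete, the sketch below illustrates the general prompt-tuning pattern described here: freeze a large pre-trained encoder and train only a small, modality-specific prompt module. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation; the class and parameter names (PromptTunedTracker, prompt_dim, etc.) are assumptions made purely for illustration.

```python
# Minimal sketch (assumed, not the authors' code): freeze a heavy backbone,
# train only a small prompt module, and report the trainable fraction.
import torch
import torch.nn as nn

class PromptTunedTracker(nn.Module):
    def __init__(self, dim=768, depth=12, prompt_dim=8):
        super().__init__()
        # Stand-in for a pre-trained ViT encoder (kept frozen during training).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Small trainable prompt module injecting auxiliary-modality cues.
        self.prompt = nn.Sequential(nn.Linear(dim, prompt_dim), nn.GELU(),
                                    nn.Linear(prompt_dim, dim))

    def forward(self, rgb_tokens, aux_tokens):
        # Add a learned low-dimensional correction derived from the auxiliary
        # modality (thermal / depth / event) to the RGB token stream.
        return self.backbone(rgb_tokens + self.prompt(aux_tokens))

model = PromptTunedTracker()
for p in model.backbone.parameters():      # freeze the heavy backbone
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / total:.2%} of {total} parameters")
```

With these toy sizes the printed trainable fraction is a small fraction of a percent, the same order of magnitude as the 0.9% trainable-parameter figure reported for MPVT.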

The MPVT model was evaluated on three multi-modal benchmarks: LasHeR (RGB-T), DepthTrack (RGB-D), and VisEvent (RGB-E). It outperforms the compared state-of-the-art models in accuracy, recall, and F1-score while training only 0.9% of the parameters.

In addition to its performance, it reduces GPU memory consumption by 43.8% and training time by 62.9% compared to conventional approaches. Ablation studies show that each proposed component clearly contributes to tracking quality. Thus, MPVT offers an efficient, generalizable, and lightweight solution for multi-modal visual tracking.

Originality & Scientific impact 

The originality of the article lies in the intelligent combination of prompt learning, an efficient modular architecture, and parameter-efficient optimization, applied to a complex multi-modal problem and supported by solid empirical results.

The article effectively adapts this approach to multi-modal visual tracking (RGB-T, RGB-D, RGB-E). This strategy allows the pre-trained backbone to be frozen and only small, specific portions of the model to be trained, significantly reducing computational cost and transfer complexity.

The MPVT model introduces three original modules working in synergy: the decoupled input enhancement module, the dynamic adaptive prompt fusion module, and the fully connected head network.
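As an intuition aid for the second of these modules, the following hypothetical sketch shows one common way a dynamic, adaptive fusion of prompts can be realized: a learned gate that decides, per token and channel, how much of the auxiliary modality (thermal, depth, or event features) is blended with the RGB stream. The class name GatedPromptFusion and the gating formulation are assumptions for illustration and are not taken from the paper; MPVT's actual fusion mechanism is specified only in the paper itself.

```python
# Hypothetical illustration of gated multi-modal fusion (not from the paper).
import torch
import torch.nn as nn

class GatedPromptFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb, aux):
        # g in (0, 1) decides, per token and channel, how much of the
        # auxiliary modality is mixed into the fused feature.
        g = self.gate(torch.cat([rgb, aux], dim=-1))
        return g * rgb + (1.0 - g) * aux

fusion = GatedPromptFusion()
rgb = torch.randn(2, 196, 768)   # RGB tokens
aux = torch.randn(2, 196, 768)   # thermal / depth / event tokens
print(fusion(rgb, aux).shape)    # torch.Size([2, 196, 768])
```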

Recommendations for future improvements

  • Several paragraphs reintroduce modules that have already been explained without adding anything new (for example, the repeated description of how ViPT and ProTrack work).
  • Although the article focuses on visual data, prompt learning is often used in vision-language architectures (such as CLIP). The lack of analysis of this synergy is a missed opportunity.
  • It is unclear how the model reacts to degraded data (e.g., thermal noise, imprecise depth). This could have been an important evaluation criterion.
  • The results show a 5–6% decrease in inference speed, but provide no detailed analysis (e.g., per-modality computational complexity).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

 The paper proposes MPVT, a novel multi-modal tracking framework that leverages prompt learning for efficient fine-tuning of pre-trained vision models in RGB-T, RGB-D, and RGB-E tracking. It introduces three key modules—decoupled input enhancement, dynamic adaptive prompt fusion, and a fully connected head. It achieves state-of-the-art results on LasHeR, DepthTrack, and VisEvent while greatly reducing GPU memory and training time. Overall, it addresses a crucial challenge in multi-modal tracking with a technically sound, well-executed approach supported by comprehensive experiments. The paper contributes technical depth and experimental relevance to the community. It would benefit from modest improvements in explanation and minor polishing of language.
Comments and suggestions for improvement:
1. Clarity of some technical details: The dynamic adaptive prompt fusion is complex; while the equations are rigorous, an intuitive explanation or a small illustrative diagram showing how it processes multi-modal features would help readers unfamiliar with prompt learning.
The distinction between MPVT-PF (prompt fusion removed) and MPVT-IE (input enhancement removed) could be emphasized more clearly in the ablation discussions.

2. Inference speed trade-off: The paper notes a ~5-6% drop in inference FPS when enabling the prompt module. While modest, a brief discussion on the practical impact (for real-time scenarios) would strengthen the application perspective.

3. Future work could be expanded: The paper briefly mentions integrating LLMs for vision-text scenarios. Expanding on concrete challenges (e.g., aligning text cues with spatial tracking tasks) could inspire further research.

Comments on the Quality of English Language

 Some sentences are long and could be split for clarity. For example, the abstract’s sentence starting with “Existing multi-modal tracking methods typically…” is very dense.
There are occasional small errors (e.g., “full-connected head network” might be clearer as “fully-connected head network” consistently).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper presents a promising direction for efficient multi-modal tracking via prompt tuning. However, significant improvements are needed in methodological rigor, experimental validation, and reproducibility. 

The paper emphasizes reduced GPU memory (43.8%) and training time (62.9%) compared to full-parameter fine-tuning. However, it fails to provide concrete comparisons of absolute parameter counts (e.g., "MPVT uses 1.2M parameters vs. ViPT’s 130M"). Without this, claims of efficiency are unconvincing.

Missing baselines: compare against recent prompt-based trackers such as ProTrack [29] and ViPT [19] across all three modalities. The current comparison omits some state-of-the-art methods (e.g., [43, 44]).

Hyperparameter sensitivity: no analysis is provided of sensitivity to hyperparameters (e.g., the learning rate or the scaling factor α in Eq. 9). Please provide tuning details and ablation studies.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed all of my concerns; I have no further comments.
