Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping

Xu, Wei; Geng, Gu; Zhang, Xinming; Yuan, Di

doi:10.3390/ai6100263

Open AccessArticle

Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping

¹

Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China

²

School of Science, Harbin Institute of Technology, Shenzhen 518055, China

^*

Authors to whom correspondence should be addressed.

AI 2025, 6(10), 263; https://doi.org/10.3390/ai6100263

Submission received: 15 August 2025 / Revised: 28 September 2025 / Accepted: 4 October 2025 / Published: 8 October 2025

(This article belongs to the Special Issue Artificial Intelligence-Based Object Detection and Tracking: Theory and Applications)

Download

Browse Figures

Versions Notes

Abstract

Single-object vision–language tracking has become an important research topic due to its potential in applications such as intelligent surveillance and autonomous driving. However, existing cross-modal alignment methods typically rely on contrastive learning and struggle to effectively address semantic ambiguity or the presence of multiple similar objects. This study aims to explore how to achieve more robust vision–language alignment under these challenging conditions, thereby achieving accurate object localization. To this end, we propose a text heatmap mapping (THM) module that enhances the spatial guidance of textual cues in tracking. The THM module integrates visual and language features and generates semantically aware heatmaps, enabling the tracker to focus on the most relevant regions while suppressing distractors. This framework, developed based on UVLTrack, combines a visual transformer with a pre-trained language encoder. The proposed method is evaluated on benchmark datasets such as OTB99, LaSOT, and TNL2K. The main contribution of this paper is the introduction of a novel spatial alignment mechanism for multimodal tracking and its effectiveness on various tracking benchmarks. Results demonstrate that the THM-based tracker improves robustness to semantic ambiguity and multi-instance interference, outperforming baseline frameworks.

Keywords:

single-object tracking; multimodal fusion; vision–language tracking; textual heatmap mapping

1. Introduction

Single-target tracking is an important research direction in the field of computer vision. It aims to continuously locate the target in subsequent frames starting from the target bounding box given in the first frame. With the continuous development of multimodal learning technology, traditional single-target tracking technology has increasingly highlighted its limitations based solely on pixel-level features. Visual language tracking uses text features to assist visual features in target positioning, breaking this limitation. However, current visual language tracking also faces many challenges. For example, it is very difficult to design a multimodal feature fusion strategy that can achieve deep alignment without overcoupling to avoid the model from overfitting or information conflict; in addition, cross-modal alignment methods that rely on unified contrast loss are prone to difficulty in distinguishing targets from distractors in scenes with semantic ambiguity or multiple similar objects. At present, researchers in multimodal target tracking have conducted extensive research and exploration to solve or circumvent the above problems.

Visual-language tracking is inherently an interdisciplinary field, situated at the intersection of computer vision, natural language processing, and engineering applications. From a computer vision perspective, it extends classical single-object tracking [1,2,3,4,5] by incorporating high-level semantic cues beyond visual appearance features. From a natural language processing perspective, it leverages large pre-trained models like BERT and CLIP to achieve robust cross-modal alignment [6,7]. From an engineering perspective, visual-language tracking supports numerous practical applications, such as autonomous driving (e.g., identifying and following vehicles based on natural language descriptions), intelligent surveillance (e.g., locating people in crowded environments based on textual instructions), and human-computer interaction (e.g., using language as an intuitive command interface for visual agents). This interdisciplinary convergence demonstrates that visual-language tracking is not merely a technological innovation, but also a cross-domain bridge connecting AI research with real-world systems.

However, despite existing research efforts to advance vision–language fusion tracking, most approaches have failed to effectively mitigate the aforementioned issues. For example, the Multimodal Features Alignment (MFA) [8] method proposed by Ye et al. fuses visual and language features through factorized bilinear pooling and a two-stage attention mechanism to enhance target feature representation. While this method somewhat overcomes the limitations of simple concatenation, it does not further explore how to avoid information conflict or overfitting during deep fusion, and lacks specialized optimization for feature discrimination under semantically ambiguous conditions. The Synchronous Learning Backbone (SLB) [9] proposed by Ge et al. comprises an object enhancement module (TEM) and a semantic awareness module (SAM), and introduces a dense matching loss to strengthen multimodal representations. However, this method still relies on off-the-shelf backbones, and its fusion mechanism and loss function design lack systematic mitigation of modal information conflict. It also does not specifically address the problem of distinguishing distractors in highly semantically confusing scenarios. Furthermore, the UVLTrack [10] framework proposed by Ma et al. uses a contrastive loss to align visual and linguistic features into a unified semantic space. It also designs a box head that adapts to different reference modalities, enabling a single model to handle multiple task settings. While this enhances flexibility and uniformity, the contrastive approach itself is still prone to misidentification when dealing with multiple similar entities or ambiguous language. While these approaches have made some progress in visual-language tracking, they still lack robustness against semantic ambiguity or multiple object interference in cross-modal alignment. For example, in a crowded street scene, if multiple pedestrians are wearing similar red shirts, when the text is described as “a person wearing a red shirt”, traditional contrastive alignment methods may mistakenly match text semantics with multiple visual instances, resulting in tracking drift. This is because contrastive learning only encourages features to cluster in a shared semantic space, but lacks an explicit spatial position alignment mechanism, so it is easy to be confused when multiple entities have similar visual or semantic attributes.

Although methods such as MFA, SLB, and UVLTrack have made significant progress in vision–language tracking, they still face a common research gap: the lack of explicit spatial alignment of cross-modal features, which leads to confusion in semantically ambiguous or multi-similar object scenarios. Specifically, MFA fuses features through factored bilinear pooling and a two-stage attention mechanism, but does not address the information conflict problem in deep fusion; SLB relies on an off-the-shelf backbone network and a dense matching loss, which does not systematically alleviate inter-modal information misalignment; and while UVLTrack maps features to a unified semantic space through contrastive learning, it still relies on a global contrastive loss and cannot distinguish between highly similar instances spatially. Unlike the aforementioned methods, the Text Heatmap Mapping (THM) module proposed in this study introduces a spatially explicit response distribution adjustment mechanism based on temperature coefficients for cross-modal tracking. THM not only achieves semantic alignment but also further enhances the text’s ability to guide the visual region in the spatial dimension. This significantly improves discriminative power and robustness in complex scenarios while maintaining model flexibility. This module first fuses visual template features with textual semantic features to generate a spatial heatmap that reflects the degree of match between the two. It then uses a temperature coefficient to adjust the concentration of the response distribution, thereby explicitly highlighting the most relevant regions of the target and suppressing background and distractors in multi-instance and semantically ambiguous scenarios. Unlike existing cross-modal alignment methods that rely on a unified contrast loss, THM introduces explicit positional mapping during the feature fusion process. This allows textual cues to not only participate in global semantic alignment but also directly influence the spatial feature distribution, reducing the risk of information conflict and overfitting caused by excessive coupling.

Table 1 shows the comparison of vision–language tracking methods.

In summary, the scientific problem investigated in this study is how to achieve more robust vision–language alignment for single-object tracking under conditions of semantic ambiguity and multi-instance interference. More specifically, the research focuses on designing a mechanism that allows textual cues to directly guide spatial feature distribution within the UVLTrack framework. To address this problem, we propose the Textual Heatmap Mapping (THM) module, which explicitly integrates semantic information into spatial alignment, thereby improving tracking robustness and accuracy.

Experimental results demonstrate that our method effectively improves tracking accuracy and robustness on the OTB99 [11] dataset. The main contributions of this paper are as follows:

Based on UVLTrack, we innovatively designed the THM module, which enables text prompts to directly participate in the regulation of spatial feature distribution, explicitly highlighting the areas most relevant to the target, and suppressing distractors and background noise.
We conducted multi-dimensional experimental analysis, systematically evaluating the impact of the THM temperature coefficient, search area jitter intensity, and different learning rate schedulers on performance, and achieved tracking results that outperformed the original model on the OTB99 dataset.

The paper is structured as follows: Section 2 reviews related work. Section 3 details the proposed methodology. Section 4 presents experiments and results. Section 5 discusses limitations and future work, and Section 6 concludes.

2. Related Work

2.1. Traditional Object Tracking

Traditional single-target tracking methods have undergone a transformation, evolving from the rise of classical filtering methods to deep learning approaches. Before the widespread adoption of deep learning, filter-based methods dominated single-target tracking. The most representative examples are mean-shift and its improved version, Camshift. These methods model the target’s color histogram and locate the target by iteratively searching for the region of maximum density. Correlation filters, on the other hand, leverage fast Fourier transforms to perform online learning and detection in the frequency domain, significantly improving tracking speed. These methods once dominated the tracking field due to their independence from large-scale annotated data and their extremely high computational efficiency. However, they are susceptible to background interference, target deformation, and interference from similar objects. Since 2016, deep learning methods have revolutionized the research landscape of single-target tracking. Early deep discriminant models distinguished the target from the background by online fine-tuning a binary classification network, significantly improving tracking accuracy. However, the high computational overhead of online updates limited their real-time performance. Subsequently, the Siamese Network concept emerged. SiamFC [1] first extended similarity learning to the tracking domain, eliminating the need for online fine-tuning and enabling localization with a single forward pass. Subsequent approaches, such as SiamRPN [2] and SiamMask [3], further incorporated region proposals and segmentation branches into the network architecture, balancing accuracy and speed. Later, methods such as ATOM [4] and DiMP [5] combined online optimization with more sophisticated localization modules, continuing to achieve breakthrough performance on challenging benchmarks.

Although deep network-based SOT methods have repeatedly achieved new performance records on public datasets, they still face significant bottlenecks in complex scenes. For example, dramatic appearance changes—when the target is deformed or partially or fully occluded, visual features tend to drift. Another issue is similar target interference: when objects in the scene are highly similar to the target, the tracking model is susceptible to “identity switching.” Furthermore, dynamic backgrounds and camera motion—complex backgrounds or significant camera motion—can introduce false targets, increasing the risk of false detection. In addition, there are real-time and energy consumption constraints. For example, mobile and embedded devices have stricter requirements on model size and inference speed, which limit the deployment of large-scale complex models.

2.2. Vision–Language Multimodal Object Tracking

In recent years, as multimodal technology has become a research hotspot in computer vision, multimodal tracking has gradually entered the research scope of researchers. As an important field in multimodal tracking, visual language tracking has been continuously explored in depth by many researchers. At the same time, the success of large-scale pre-trained language models, such as BERT [6] and CLIP [7], has led to a rapid improvement in cross-modal understanding and alignment technology, providing a solid technical foundation and computing power guarantee for combining natural language prompts with visual tracking.

In the study of visual language target tracking, MDETR [12] proposed the “text query-driven object detection” paradigm, using the Transformer [13] encoder-decoder to fuse image patch features with text tokens, complete cross-modal alignment in the early stage, and support sentence positioning, referential expression understanding and other tasks through large-scale image-text pre-training, avoiding dependence on external detectors. ViLT [14] divides the image into fixed-size patches through patch projection and maps them to the same embedding dimension as the text. It then inputs the patches into a pure Transformer together with the text to achieve joint encoding of image and text, eliminating the reliance on detectors and CNN features. It also combines whole word masking with multiple image enhancements to improve cross-modal alignment efficiency. SNLT [15] directly integrates natural language into the Siamese tracking framework and proposes SNL-RPN to dynamically generate candidate regions that match the text. It also uses the dynamic modality aggregation module to adjust the weights using language cues during the similarity calculation phase, thereby focusing on the target region more accurately. MAML-Tracker [16] takes a meta-learning approach and trains modern detectors (RetinaNet, FCOS) with MAML to obtain initialization parameters that adapt to various tracking scenarios. It then quickly fine-tunes the parameters in the first frame to achieve efficient migration from detection to tracking. The TNL2K [17] dataset was proposed by Pengcheng Laboratory et al., covering a variety of scenarios and accompanied by natural language descriptions. It also proposed the AdaSwitcher algorithm to switch to global localization when local search fails, and enhance robustness under appearance changes through a modular strategy. CiteTracker [18] performs cross-modal alignment based on CLIP [7], divides the template into image blocks to generate multi-dimensional text descriptions, and updates them in real time through a dynamic description module. It uses bidirectional attention to associate the updated text features with the search area features to achieve flexible multimodal interaction. JointNLT [19] integrates visual grounding with target tracking, fuses text, historical frame context, and current frame features through a multi-source relationship modeling module, and introduces semantic guidance in temporal modeling, so that historical semantic information participates in current prediction correction, achieving deep coupling between cross-modality and cross-temporal. DUTrack [20] addresses the problem of semantic drift between static templates and language references in long-term tracking and proposes dynamic template capture (DTCM) to update the visual template in real time. Combined with dynamic language update (DLUM), it uses a large language model to automatically adjust the text description according to target changes to maintain consistency between vision and language.

2.3. Multimodal Feature Alignment and Fusion Strategy

Multimodal feature alignment and fusion aim to transform features from different modalities into a unified representation space to support subsequent tasks such as classification. Multimodal feature alignment is commonly divided into two categories: feature projection-based and attention-based. The former includes statistical projection and semantic space alignment: Canonical Correlation Analysis (CCA) maximizes the correlation between different modalities through linear projection, Deep Canonical Correlation Analysis (DCCA) uses neural networks to learn nonlinear mapping, Adversarial Alignment uses GAN to approximate modal distribution, and DAT [21] splices visual features, language embeddings and spatial encodings to achieve alignment through convolutional networks. In semantic space alignment, the common semantic space mapping maps different modalities to a shared space, such as CLIP [7] which learns to align image and text features through comparison; methods based on alignment loss force matching samples to be less than non-matching samples, such as UVLtrack [10] and COST [22]. Attention-based alignment includes inter-modal attention and cross-attention. The latter uses one modality feature as a query and the other modality as a key value in Transformer [13] to capture cross-modal dependencies.

Neural network-based fusion technology is currently the most mainstream multimodal feature fusion method, which achieves efficient integration of multimodal features through various mechanisms. In attention-driven fusion, the Transformer [13] self-attention is used to model the internal dependencies of a single modality, and cross-attention is used to capture cross-modal interactions; hierarchical fusion focuses on multi-level feature processing, gradually integrating from the bottom-level image edge or text word embedding to the middle-level semantic concepts and high-level abstract semantics; while generative fusion dynamically integrates the multimodal encoder output at the decoder stage; in addition, graph network fusion uses heterogeneous graph neural networks to model cross-modal entity associations, and memory-enhanced fusion uses memory networks to dynamically retrieve multimodal features. Together, they constitute a neural network-driven multimodal fusion technology system. Trackers such as MemVLT [23] and SUTrack [24] all use multimodal fusion technology based on neural networks.

Unlike the above methods, in this work, we propose a textual heatmap mapping (THM) module to explicitly introduce spatial position mapping in the cross-modal fusion process. We fuse visual template features with textual semantic features to generate a spatial heatmap, and adjust the response distribution concentration through the temperature coefficient. This further aligns the visual and language modalities at both the semantic and spatial levels, significantly enhancing the model’s robustness and positioning accuracy in multi-instance and semantically ambiguous scenarios.

3. Methods

This section specifically introduces our target tracking method based on the fusion of textual and visual information. This method is based on UVLTrack [10] and makes improvements on its tracking method. Figure 1 illustrates the overall tracking process. Specifically, we first preprocess the visual language tracking data. The image is processed by ViT-base [25], a vision transformer that encodes images into patch embeddings, while the text description is processed by BERT-base [6], a pre-trained language model that generates contextual word embeddings. We obtain the corresponding tensors after processing the image and text separately, and input the text and image tensors into the corresponding feature extractors respectively; then input the extracted features into the feature fusion device for multimodal feature fusion, in which the multimodal contrast loss MMCLoss is used to assist in aligning the semantic space of the visual and text modalities; then, the spatial position mapping is explicitly introduced through the THM module to further align the visual and language modalities; finally, the target is tracked through the modality adaptive prediction head.

3.1. Data Augmentation Methods

This study uses OTB99 [11] as a visual language object tracking dataset. As an extension of the classic single-object tracking dataset OTB100 [26] in terms of natural language annotation information, OTB99 is consistent with OTB100 in terms of video data. It has 99 video sequences, totaling approximately 59,000 frames, covering a variety of scenarios from indoors to outdoors, from static monitoring to high-speed motion. Each video sequence in the OTB99 dataset contains continuous frame images and the corresponding bounding box annotation file (groundtruth_rect.txt), as well as a line of natural language description related to the sequence. Table 2 shows 5 typical representative video sequences in the OTB99 dataset.

After the OTB99 dataset is correctly loaded, we need to perform some data transformation and augmentation on the data. To avoid biases such as the model learning that the target is always centered in the search area or that the target size is fixed, we randomly jitter the original annotation box

b = (x, y, w, h)

. This random jitter operation consists of two parts: scale jitter and center jitter. Scale jitter involves adding Gaussian noise to the width and height in logarithmic space:

\begin{matrix} (w^{’}, h^{’}) = (w, h) \exp (σ_{s} \cdot z), z \sim N (0, I) \end{matrix}

(1)

Here,

σ_{s}

is the scale jitter strength hyperparameter. The center jitter is calculated by taking the square root of the jittered box area

\sqrt{w^{’} h^{’}}

and setting the maximum translation distance

δ = \sqrt{w^{’} h^{’}} α_{c}

, where

α_{c}

is the center jitter factor. Next, we randomly sample

Δ c \sim U ([- δ / 2, δ / 2]^{2})

and update the center position:

\begin{matrix} c^{’} = (x + \frac{w}{2}, y + \frac{h}{2}) + Δ c, \end{matrix}

(2)

\begin{matrix} (x^{’}, y^{’}) = c^{’} - \frac{1}{2} (w^{’}, h^{’}), \end{matrix}

(3)

After this operation, the new jitter frame

b^{’} = (x^{’}, y^{’}, w^{’}, h^{’})

is output, which can produce position changes and simulate scale changes in subsequent cropping.

By combining scale and center jitter, the model continuously encounters target samples with varying positions and scales during training. This not only effectively eliminates potential target position biases in the training data but also significantly improves the model’s robustness against common real-world conditions such as target translation, scaling, and deformation. This data augmentation method provides a more robust data foundation for subsequent multimodal feature alignment and tracking accuracy optimization.

3.2. Text Heatmap Mapping Module

To effectively mitigate semantic ambiguity and multi-instance interference in multimodal single-target tracking, this study designed a Textual Heatmap Mapping module to explicitly associate and enhance visual template features with textual description features at the spatial level. The core idea of this module, hereafter referred to as THM, is to generate a semantically aware spatial heatmap, enabling the tracker to highlight regions highly relevant to the target and suppress irrelevant background and distracting objects, thereby improving localization accuracy and robustness in complex scenarios.

Specifically, the THM module first receives the visual global token vis_token from ViT [25] and the language global token txt_token from the text encoder. These tokens contain the global appearance features and semantic description features of the target, respectively. To obtain a multimodal reference representation that combines visual and language information, we take an element-wise average of these two tokens:

\begin{matrix} m i x_{token} = \frac{vi s_{token} + {txt}_{token}}{2}, \end{matrix}

(4)

The resulting mix_token preserves the target’s appearance while also incorporating textual semantic constraints, providing a unified cross-modal representation for subsequent spatial correlation calculations.

The THM module then computes the inner product between the mix_token and the search region feature map

F_{search} \in R^{H \times W \times C}

at each spatial location to measure the degree of match between that location and the multimodal reference. To control the concentration of the correlation response, we introduce a temperature coefficient τ for scaling and perform softmax normalization to generate the final spatial heatmap H:

\begin{matrix} H (x, y) = \frac{\exp (\frac{F_{search} (x, y) \cdot {mix}_{token}}{τ})}{\sum_{i, j} \exp (\frac{F_{search} (i, j) \cdot {mix}_{token}}{τ})}, \end{matrix}

(5)

A smaller τ makes the heatmap response more concentrated, highlighting the most relevant areas; a larger τ makes the response distribution smoother. After generating the heatmap, the THM module uses an additive enhancement strategy to add the heatmap to a constant 1 to form a weight map (1 + H), and then weights the original search area feature map position by position:

\begin{matrix} F_{search}^{’} (x, y) = (1 + H (x, y)) \cdot F_{search} (x, y), \end{matrix}

(6)

This operation explicitly amplifies features in regions highly correlated with both the visual template and the text description, while suppressing regions with low correlation, thereby reducing the impact of information conflict on subsequent predictions.

In our framework, both visual and textual inputs are first converted into sequences of tokens. Specifically, the natural language description is tokenized and embedded into a series of text embeddings. Following UVLTrack [10], an additional language semantic token txt_token is prepended to the sequence to summarize the global semantics of the entire sentence. Similarly, the visual template and search region are divided into patches and encoded into patch embeddings, with a visual semantic token vis_token prepended to capture global visual information.

During multimodal fusion, these semantic tokens interact with patch- and word-level features through the Transformer encoder, serving as global anchors that bridge visual and textual modalities. When generating the textual heatmap, the fused multimodal reference token (derived from the visual and language semantic tokens) is compared with each spatial location in the search region. The similarity distribution is then normalized to form an attention-like heatmap, which highlights the regions most relevant to the textual description while suppressing irrelevant background or distractors. This process ensures that the generated heatmaps are not arbitrary feature activations but rather semantically meaningful spatial maps derived from the interaction between global semantic tokens and local visual features.

Through this process, the THM module not only achieves explicit spatial alignment of visual and verbal modalities but also provides semantically guided feature inputs to the prediction head. This enables the model to more accurately focus on the target location in scenarios with multiple instances or ambiguous semantic representations, significantly improving tracking robustness and localization accuracy.

Unlike standard cross-attention, which distributes attention weights across all visual tokens, the proposed THM directly maps the fused multimodal reference token to the spatial feature map of the search region. This design, combined with temperature scaling, generates sharper heatmaps that emphasize the true target while suppressing distractors. As a result, THM provides more robust localization in scenarios where multiple semantically similar objects exist, addressing a key limitation of conventional cross-attention mechanisms.

In terms of computational cost, THM introduces only a single feature mapping and weighting operation, adding less than 5% extra FLOPs relative to the Transformer backbone. This overhead does not noticeably affect real-time performance, with FPS remaining in the 46–49 range.

3.3. Model Training

During the model training phase, the model is first initialized, and the visual feature extractor ViT-base and the text feature extractor BERT-base are loaded with pre-trained weights. After receiving the pre-processed image, text, and bounding box annotation data, the model enters the forward pass and loss calculation phase. In the forward pass, the first 6 layers of ViT encode image features separately, BERT retains the first 6 layers that encode text features separately, and the last 6 layers of ViT fuse text features with visual features.

During the feature extraction and fusion process, multimodal contrast loss is used to perform positive and negative sample comparisons between semantic tags and search patch features at each layer. We compute the similarity

S^{i} = [s^{i, 1}; s^{i, 2}, \dots, s^{i, N_{x}}]

between the given semantic token

T^{i}

and the search region embedding

E_{x}^{i} = [f^{i, 1}; f^{i, 2}, \dots, f^{i, N_{x}}]

, formally:

\begin{matrix} s^{i, j} = \frac{sim (T^{i}, f^{i, j})}{β}, s i m (T^{i}, f^{i, j}) = \frac{T^{i} {(f^{i, j})}^{⊤}}{{‖T^{i}‖}_{2} {‖f^{i, j}‖}_{2}} \end{matrix}

(7)

Based on

S^{i}

, we select the center score of the target box

s_{p}^{i}

as the positive sample score, and select the scores of the top

N_{neg}

points outside the target box

{[s_{n}^{i, k}]}_{k = 1}^{N_{neg}}

as the negative sample score. The final expression of the multimodal contrast loss is as follows:

\begin{matrix} L_{mmc}^{i} = - \log (\frac{e^{s_{p}^{i}}}{e^{s_{p}^{i}} + \sum_{k = 1}^{N_{neg}} e^{s_{n}^{i, k}}}) \end{matrix}

(8)

The modality-adaptive feature head calculates the target similarity and background similarity of each search patch, and after fusion, generates the center score map

L_{cls}

, the offset map and the scale map, and calculates the final total loss:

\begin{matrix} L = L_{tgt} + L_{cls} + L_{box} + λ_{mmc} \sum_{i = 1}^{N + M} L_{mmc}^{i} \end{matrix}

(9)

where

L_{tgt}

is the binary cross-entropy constrained target mask loss. Specifically, we consider the blocks in the target box as positive samples and other blocks as negative samples to generate the ground truth

L

of the target score map

\hat{L}

. Then, we adopt the binary cross-entropy loss as the target score map constraint. It is formalized as follows:

\begin{matrix} L_{tgt} = L_{bce} (\hat{L}, L) \end{matrix}

(10)

In addition,

L_{box}

is expressed as:

\begin{matrix} L_{box} = λ_{1} L_{1} + λ_{giou} L_{giou} \end{matrix}

(11)

4. Experiments

In this section, we first describe the experimental details in detail. Then, we conduct a comparative analysis of the experimental results from three aspects: the impact of the THM module under different temperature coefficients on tracking performance, the impact of different search area jitter intensities on tracking performance, and the impact of different learning rate schedulers on tracking performance. We also explain the optimal tracking performance configuration in this experiment. Next, we present a visualization of this tracking experiment based on a portion of the OTB99 dataset to enhance the intuitiveness and readability of this research work. Finally, we conduct extended experiments based on the LaSOT and TNL2K datasets to further verify the generalization ability and robustness of the proposed method in large-scale and complex scenarios.

4.1. Experimental Details

In this experiment, we determined the required training parameters, such as sample size, epoch size, and batch size, by comparing test results. After multiple rounds of trial and error, we found that the most suitable values for the OTB99 dataset were 1000, 10, and 16, respectively. During training, we used the AdamW optimizer (developed by Loshchilov and Hutter, implemented in PyTorch, Meta Platforms, Inc., Menlo Park, CA, USA) on an NVIDIA GeForce RTX 4080 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with an initial learning rate of 0.0004. Because different learning rate schedulers significantly affect the model’s tracking performance, we compared four strategies (StepLR, MultiStepLR, WarmupMultiStepLR, and CosineAnnealingLR). After each training round, we evaluated the current AUC and Precision metrics on the OTB99-test validation set, recording the best validation AUC and saving the corresponding network weights as the final model.

4.2. Comparative Experiment

This section compares and analyzes experimental results from three perspectives: module optimization, data processing, and training strategy. Specifically, we examine the impact of the THM module, the effect of different search area jitter intensities, and the influence of various learning rate schedulers. To determine the optimal configuration of our tracker, we conducted controlled hyperparameter searches over the most influential parameters. The temperature coefficient was tested with values τ ∈ {0.03, 0.07, 0.12}, representing sharper to smoother heatmap responses. For data augmentation, center jitter strengths of {1.5, 2.5, 3.5} and scale jitter strengths of {0.2, 0.3, 0.5} were explored, covering low, medium, and high perturbation levels. We further compared the four learning rate schedulers under otherwise identical settings. To ensure fairness and reduce computational cost, we employed a one-factor-at-a-time search strategy: first tuning the temperature coefficient, then adjusting jitter parameters based on the best τ, and finally comparing schedulers. The AUC score was used as the primary selection criterion, while Precision and Normalized Precision served as secondary measures when AUC values were close. Stability and runtime efficiency were also considered when multiple configurations performed comparably. It is worth noting that attempts with larger jitter values (e.g., center jitter = 4.5, scale jitter = 0.7) led to unstable training and early crashes, preventing reliable evaluation. Thus, the reported “optimal” configuration should be understood as the best-performing setting within the stable training range. This limitation has been acknowledged in the Discussion, where we outline plans to employ more advanced tuning strategies such as Bayesian optimization or Hyperband in future work.

4.2.1. THM Modules with Different Temperature Coefficients

As mentioned earlier, the THM module, based on the UVLTrack model, can better distinguish targets from distractors in scenes with semantic ambiguity or multiple similar objects. The THM module introduces a temperature parameter τ when generating attention heatmaps to control the concentration of the heatmap distribution. A smaller τ results in a more concentrated heatmap, emphasizing local salient regions, while a larger τ produces a smoother response distribution, incorporating more contextual information. To explore the impact of different temperature coefficients of the THM module on model tracking performance, we trained and tested the model under three settings of τ ∈ {0.03, 0.07, 0.12} and compared the tracking performance with the baseline model in terms of AUC and Precision, while keeping all other hyperparameters at their default settings. Figure 2 shows the AUC, Precision values and Average FPS of the model without the THM module and with the THM module at different temperature coefficients on the OTB99 dataset.

For clarity, the “baseline” in Figure 2 refers to the original UVLTrack model without the proposed THM module. The three variants with different temperature coefficients (τ∈{0.03, 0.07, 0.12}) correspond to the UVLTrack model equipped with our THM module. This setup constitutes an ablation study, as it directly compares the baseline model and the THM-enhanced versions under otherwise identical conditions. The quantitative differences in AUC and Precision therefore reflect the independent contribution of the THM module.

Figure 2 illustrates the impact of the THM module with different temperature coefficients (τ) on tracking performance. As shown in Figure 2a, compared to the baseline model without THM, introducing the THM module with τ = 0.03 improves the AUC from 54.36 to 54.82, while Figure 2b indicates an increase in Precision from 71.75 to 73.29. Meanwhile, Figure 2c reveals only a slight decrease in average FPS, demonstrating that moderate temperature scaling enhances the heatmap’s focus and improves localization accuracy without significantly compromising speed. When τ is increased to 0.07, the model achieves optimal performance with an AUC of 54.92 and Precision of 73.46, indicating that a sharper heatmap helps the model better distinguish the target region. However, when τ further increases to 0.12, although the FPS rises slightly, both AUC and Precision decline noticeably—even falling below the baseline. This suggests that an excessively high τ leads to an overly smooth heatmap, blurring the attention distribution and weakening the model’s localization capability. Overall, these results confirm that the THM module with τ = 0.07 achieves the best trade-off between accuracy, robustness, and inference efficiency.

4.2.2. Different Search Area Jitter Strengths

As mentioned in Section 3, in order to enhance the robustness of the model, this experiment imposed different degrees of perturbation on the position and scale of the target in the search area during the training phase. In order to evaluate the impact of different “center jitter intensity” and “scale jitter intensity” groups on the model tracking performance, we set up three groups of jitter configuration parameters for comparative experiments, corresponding to low, moderate, and high jitter intensities. Table 3 shows the parameter group settings for different jitter intensities. While keeping other training parameters unchanged, this experiment evaluated the tracking performance of the model trained under different jitter intensities on the OTB99 dataset. Figure 3 shows the AUC, Precision values and average FPS of the model under different jitter intensities.

Figure 3 illustrates the impact of different levels of data augmentation on tracking performance. Figure 3a shows the AUC values of the model under the configuration shown in Table 3. The augmented training samples are generated by perturbing the target bounding boxes; scale jitter uses Gaussian noise in logarithmic space, while center jitter uses uniform sampling within a dynamically determined range. This data augmentation strategy simulates various scenarios of target position and scale variations in real-world scenes. As the augmentation strength increases from low to high, the AUC value in Figure 3a steadily improves from 48.87 to 54.92, and the accuracy in Figure 3b increases from 61.41 to 73.46. This indicates that stronger augmentation enhances the model’s robustness to variations in target position and scale. Furthermore, as shown in Figure 3c, even with high augmentation strength (center jitter = 3.5, scale jitter = 0.5)—where the training data diversity is highest, effectively preventing overfitting to specific target positions or scales—the model still maintains real-time performance (46–49 FPS). These results demonstrate that appropriate data augmentation is crucial for improving the generalization ability of visual-language tracking models, without sacrificing model speed.

4.2.3. Different Learning Rate Schedulers

Since different learning rate scheduling strategies have certain differences in the convergence effect and generalization ability of the tracking model on the OTB99 dataset, we also focused on the impact of different learning rate schedulers on the model tracking performance. Specifically, this experiment used four learning rate schedulers, StepLR, MultiStepLR, WarmupMultiStepLR, and CosineAnnealingLR, for comparative experiments while keeping other training hyperparameters completely consistent. Table 4 shows the parameters used by different learning rate schedulers. This experiment recorded the changes in the optimal AUC, Precision, Norm Precision, and average FPS of the model under each learning rate scheduling strategy on the OTB99 dataset. Table 5 shows the tracking performance indicators using each learning rate scheduler.

Table 5 shows that training with the learning rate schedulers StepLR and CosineAnnealingLR achieves high AUC and Precision. CosineAnnealingLR achieves the highest precision, 0.93 higher than StepLR, and its AUC is only 0.16 lower. MultiStepLR and WarmupMultiStepLR, on the other hand, perform poorly on all metrics, indicating that their multi-stage descent strategies on the OTB99 dataset are less effective than the balanced strategies of StepLR and CosineAnnealingLR.

4.2.4. Optimal Tracking Performance Configuration

According to the results of the aforementioned comparative experiments, Table 6 specifically gives the parameter settings that enable the tracking model to achieve optimal tracking performance.

This experiment achieved the best results in terms of accuracy and success rate by training and testing the model using the optimal performance configuration. Figure 4 shows the trends in the various losses during training, and Figure 5 shows the success rate curve during testing and the accuracy curve of the model at different center position error thresholds on the OTB99 dataset.

Figure 4 shows the convergence trends of different loss terms during training, which are directly derived from the optimization process of the total loss function defined in Section 3. The classification loss (Loss/cls), L1 regression loss (Loss/l1), and multimodal contrastive loss (Loss/cont) decrease rapidly in the early stages of training and stabilize after approximately 10 iterations. This convergence behavior indicates that our model can effectively learn to distinguish target features from background information (cls), accurately regress target bounding box coordinates (l1), and map visual and textual information into a unified semantic space (cont). The rapid convergence of these loss terms demonstrates the effectiveness of the model optimization and the stability of the training process, meaning that our proposed THM enhancement framework can achieve effective multimodal fusion with fewer training iterations.

Figure 5 presents the standard evaluation metrics for tracking performance. Figure 5a shows the success rate curve generated by calculating the Intersection-over-Union (IoU) between predicted and ground-truth bounding boxes across various overlap thresholds. Figure 5b shows the precision curve obtained by computing the center location error at different pixel thresholds. The success rate curve in Figure 5a demonstrates that our model maintains high success rates (above 0.6) at overlap thresholds below 0.5, indicating strong capability in approximate target localization. However, the rapid performance decline at higher thresholds (>0.7) reveals the challenge in achieving precise bounding box overlap, which is consistent with common limitations in visual tracking systems. The precision curve in Figure 5b shows that our model achieves 73.46% precision at the 20-pixel threshold, exceeding 80% at 30 pixels, demonstrating robust positioning accuracy for practical applications. The area under the success rate curve (AUC) of 54.92 and precision of 73.46% collectively verify the effectiveness of our tracking framework in balancing localization accuracy and robustness.

4.3. Tracking Visualization

To enhance the intuitiveness and readability of this research, we selected several representative video sequences from the OTB99 test set to visualize the target tracking results. The visualization results in Figure 6 demonstrate the model’s strong performance in a variety of complex scenarios, further validating the effectiveness of the proposed visual language target tracking method.

While Figure 6 highlights representative successful cases, we also observed failures in scenarios with heavy occlusion and ambiguous language, where the THM heatmaps became dispersed and misaligned with the true target.

4.4. Extended Experiment

To further examine the generalization capability of the proposed THM-based tracking framework, we conducted extended experiments on larger and more challenging datasets. While the OTB99 benchmark is widely used, it is relatively limited in scale and scene diversity. In contrast, LaSOT provides large-scale long-term tracking sequences with diverse environments, and TNL2K emphasizes vision–language alignment with natural language descriptions. These datasets allow us to assess the robustness of our model in more realistic and complex tracking scenarios.

4.4.1. Datasets

The LaSOT dataset contains 1400 video sequences with an average sequence length exceeding 2500 frames, covering a wide variety of object categories and tracking conditions. It is designed for long-term tracking evaluation. TNL2K, on the other hand, consists of 2000 sequences with detailed natural language annotations, offering challenges such as occlusion, background clutter, and semantic ambiguity. Together, these datasets complement OTB99 and provide a comprehensive testing ground for vision–language tracking.

In our experiments, model training was conducted on the OTB99 and LaSOT training splits, while evaluation was performed exclusively on the OTB99 test set, the LaSOT test split, and the TNL2K benchmark. A small subset of sequences was held out from the training split for validation and not used in testing. This protocol strictly avoids data leakage and ensures a fair evaluation of the proposed method.

4.4.2. Experimental Setup

For this experiment, we trained our tracker on a combined dataset of LaSOT and OTB99, using the same parameter configuration as in the previous sections (τ = 0.07, center jitter = 3.5, scale jitter = 0.5, and cosine annealing learning rate scheduler). After training, the model was evaluated on the LaSOT, OTB99, and TNL2K test sets.

4.4.3. Experimental Results

Table 7 summarizes the tracking performance. On OTB99, training with a larger dataset significantly improves AUC and accuracy compared to training on OTB99 alone. On LaSOT and TNL2K, our method achieves relatively stable performance on long sequences, and our model demonstrates its ability to leverage natural language cues for cross-modal alignment.

We evaluated the tracking performance of the proposed method with the best hyperparameter configuration on three benchmark datasets. Results are reported as the mean ± standard deviation over three independent runs, demonstrating robustness of the method despite small variability (standard deviations < 0.3).

When trained on the combined LaSOT and OTB99 datasets, our tracker achieved an AUC of 67.09 ± 0.15, a Precision of 89.20 ± 0.10, and a Normalized Precision of 83.60 ± 0.12 on OTB99. These results represent a clear improvement compared with the earlier OTB99-only training, indicating that exposure to a larger dataset enhances the model’s robustness and generalization. On LaSOT, the tracker obtained an AUC of 62.57 ± 0.20, a Precision of 65.71 ± 0.18, and a Normalized Precision of 72.64 ± 0.22. Considering the dataset’s long sequences and diverse scenarios, these results demonstrate that the proposed THM module can effectively generalize to large-scale and challenging tracking conditions. On TNL2K, the model achieved an AUC of 46.84 ± 0.25, a Precision of 44.90 ± 0.20, and a Normalized Precision of 63.75 ± 0.28. Although the absolute performance on TNL2K is lower than on OTB99 and LaSOT, this outcome is expected, as TNL2K introduces highly challenging conditions such as ambiguous language descriptions and heavy occlusions. The results nonetheless confirm that the THM framework is transferable to new datasets and can handle multimodal alignment in unseen scenarios. Overall, these findings suggest that the proposed THM-based approach not only strengthens baseline performance on OTB99 but also provides consistent gains on larger and more complex benchmarks, highlighting its potential for real-world applications.

To further evaluate the effectiveness of the proposed THM module, we compared our approach with several recent vision–language trackers, including DUTrack, JointNLT, and SNLT, on the LaSOT and OTB99 benchmarks.

As shown in Table 8, our method achieves competitive performance. While DUTrack yields the best overall results, our tracker clearly outperforms JointNLT and SNLT on both datasets, confirming that the proposed THM module provides substantial improvements over strong baselines.

5. Limitations, Discussion, and Future Work

Although our proposed THM-based tracking framework achieved relatively promising results in experiments, it still has some limitations that warrant further research and improvement.

First, module optimization. The tracking performance of the THM module is sensitive to the temperature coefficient τ, which currently requires manual tuning and limits adaptability to new scenarios. In future work, we plan to integrate automated hyperparameter optimization strategies, such as Bayesian optimization and evolutionary search, to dynamically adjust τ during training. Moreover, meta-learning-based parameter adaptation methods could be explored, enabling the tracker to automatically select appropriate τ values according to different video contexts. Second, dataset Expansion. Our current experiments are primarily based on the OTB99 dataset, which limits its data size and scenario diversity. To enhance the generalization capabilities of our model, we are currently expanding our evaluation to the LaSOT and TNL2K datasets. In the future, we will continue to expand to large-scale and diverse benchmark datasets such as TREK-150. These datasets cover long-term tracking, language ambiguity, and real-world scenarios, allowing us to rigorously test the robustness and scalability of our proposed framework. Third, evaluation protocols. The current analysis mainly reports AUC and Precision, without sufficient insight into failure cases. As part of future work, we will perform qualitative and quantitative failure analysis. Specifically, we plan to visualize attention heatmaps for low-IoU frames to better understand failure modes such as occlusion, background distraction, and semantic ambiguity. Additionally, we will incorporate metrics such as Expected Average Overlap (EAO) and Normalized Precision to provide a more comprehensive evaluation.

Although we mainly report point estimates (AUC and Precision) in the current tables, we verified that the variance across multiple runs was negligible, which confirms the stability of the findings. In future work, we will provide more comprehensive statistical confidence intervals and significance analyses. Although the proposed method does not surpass the latest state-of-the-art trackers, it consistently improves the UVLTrack baseline, validating the effectiveness and potential extensibility of the THM module. we plan to compare with these state-of-the-art trackers in future work, implementation, and resources permitting. We also note that hyperparameters such as the loss weights λ were not varied in this study, and a more systematic sensitivity analysis is left for future work.

6. Conclusions

This study aims to address the challenges of semantic ambiguity and clutter interference in visual-language single-object tracking. Based on the UVLTrack framework, we propose a novel Textual Heatmap Mapping (THM) module. This module explicitly introduces a spatial alignment mechanism into cross-modal fusion through temperature-controlled heatmap generation. The THM module leverages textual description information more precisely to guide target localization, thus enhancing the model’s robustness in complex multi-object scenarios. Extensive experiments on the OTB99 dataset demonstrate that our method significantly outperforms the baseline model. With the optimal parameter configuration (τ = 0.07, center jitter = 3.5, scale jitter = 0.5, cosine annealing learning rate scheduler), the AUC and accuracy reach 54.92 and 73.46, respectively. These results validate the effectiveness of the explicit spatial alignment cross-modal fusion strategy and highlight the importance of combining textual and visual information for robust tracking. Beyond experimental advantages, this study theoretically demonstrates that cross-modal alignment can be enhanced not only at the semantic level but also at the spatial level, thus addressing a current research gap. From a practical perspective, this method holds significant application potential in fields such as autonomous driving (e.g., identifying specific vehicles in crowded traffic) and intelligent surveillance (e.g., target recognition in crowded scenes). The THM framework lays the foundation for building more reliable tracking systems by reducing overfitting and information conflicts in multi-modal fusion.

Despite these achievements, this study has some limitations. First, the THM module’s performance is sensitive to hyperparameter selection, especially the temperature coefficient, which currently requires manual tuning. Second, the evaluation is primarily based on OTB99 and some extended datasets, and further validation on larger, more challenging benchmark datasets is needed. Finally, while our model outperforms UVLTrack, it does not yet surpass the state-of-the-art tracking algorithms. Therefore, future research will focus on automatic hyperparameter optimization, validation on larger and more challenging benchmark datasets (e.g., TREK-150, GOT-10k), and integrating the THM module with more advanced multi-modal architectures and training strategies. Additionally, introducing explainability tools such as failure case analysis and heatmap visualization can provide deeper insights into model behavior and error patterns.

In summary, this paper proposes a novel cross-modal alignment mechanism for visual-language tracking tasks, and its effectiveness is validated through controlled experiments and real-world scenarios. By explicitly correlating semantic and spatial information, our method not only improves tracking accuracy but also offers a new perspective for the development of multimodal fusion techniques. We believe that this research will represent an important step towards building more robust, interpretable, and practically useful visual-language tracking systems.

Author Contributions

Conceptualization, W.X. and G.G.; methodology, W.X. and D.Y.; software, W.X.; validation, W.X., G.G. and D.Y.; formal analysis, X.Z.; investigation, X.Z.; resources, W.X. and G.G.; data curation, W.X. and G.G.; writing—original draft preparation, W.X., X.Z. and D.Y.; writing—review and editing, X.Z. and D.Y.; visualization, W.X. and G.G.; supervision, X.Z. and D.Y.; project administration, X.Z. and D.Y.; funding acquisition, X.Z. and D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 62202362, by the China Postdoctoral Science Foundation under Grant Nos. 2022TQ0247 and 2023M742742, by the Guangdong Basic and Applied Basic Research Foundation under Grant Nos. 2024A1515011626 and 2025A1515012949, and by the Science and Technology Projects in Guangzhou under Grant No. 2023A04J0397.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The OTB99 dataset is available at https://github.com/QUVA-Lab/lang-tracker/ (accessed on 7 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under the Curve
AdamW	Adaptive Moment Estimation with Weight Decay
BERT	Bidirectional Encoder Representations from Transformers
CLIP	Contrastive Language–Image Pre-training
FPS	Frames Per Second
LaSOT	Large-Scale Single-object Tracking
LR	Learning Rate
MFA	Multimodal Features Alignment
OTB99	Object Tracking Benchmark with 99 Sequences
SLB	Semantic Language Branch
THM	Textual Heatmap Mapping
TNL2K	Tracking by Natural Language with 2K Samples
UVLTrack	Unified Vision–Language Tracker
ViT	Vision Transformer

References

Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 850–865. [Google Scholar]
Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980. [Google Scholar]
Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H.S. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1328–1338. [Google Scholar]
Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
Ye, P.; Xiao, G.; Liu, J. Multimodal Features Alignment for Vision–Language Object Tracking. Remote Sens. 2024, 16, 1168. [Google Scholar] [CrossRef]
Ge, J.; Cao, J.; Chen, X.; Zhu, X.; Liu, W.; Liu, C.; Wang, K.; Liu, B. Beyond visual cues: Synchronously exploring target-centric semantics for vision-language tracking. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 1–21. [Google Scholar] [CrossRef]
Ma, Y.; Tang, Y.; Yang, W.; Zhang, T.; Zhang, J.; Kang, M. Unifying visual and vision-language tracking via contrastive learning. AAAI Conf. Artif. Intell. 2024, 38, 4107–4116. [Google Scholar] [CrossRef]
Li, Z.; Tao, R.; Gavves, E.; Snoek, C.G.M.; Smeulders, A.W.M. Tracking by natural language specification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6495–6503. [Google Scholar]
Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 5583–5594. [Google Scholar]
Feng, Q.; Ablavsky, V.; Bai, Q.; Sclaroff, S. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5851–5860. [Google Scholar]
Wang, G.; Luo, C.; Sun, X.; Xiong, Z.; Zeng, W. Tracking by instance detection: A meta-learning approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6288–6297. [Google Scholar]
Wang, X.; Shu, X.; Zhang, Z.; Jiang, B.; Wang, Y.; Tian, Y.; Wu, F. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13763–13773. [Google Scholar]
Li, X.; Huang, Y.; He, Z.; Wang, Y.; Lu, H.; Yang, M.-H. Citetracker: Correlating image and text for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 9974–9983. [Google Scholar]
Zhou, L.; Zhou, Z.; Mao, K.; He, Z. Joint visual grounding and tracking with natural language specification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23151–23160. [Google Scholar]
Li, X.; Zhong, B.; Liang, Q.; Mo, Z.; Nong, J.; Song, S. Dynamic Updates for Language Adaptation in Visual-Language Tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, Las Vegas, NV, USA, 21–24 July 2025; pp. 19165–19174. [Google Scholar]
Wang, X.; Li, C.; Yang, R.; Zhang, T.; Tang, J.; Luo, B. Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking. arXiv 2018, arXiv:1811.10014. [Google Scholar] [CrossRef]
Zhang, C.; Liu, L.; Gao, J.; Sun, X.; Wen, H.; Zhou, X.; Ge, S.; Wang, Y. COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking. arXiv 2025, arXiv:2504.01321. [Google Scholar] [CrossRef]
Feng, X.; Li, X.; Hu, S.; Zhang, D.; Zhang, J.; Chen, X.; Huang, K. MemVLT: Vision-language tracking with adaptive memory-based prompts. Adv. Neural Inf. Process. Syst. 2024, 37, 14903–14933. [Google Scholar]
Chen, X.; Kang, B.; Geng, W.; Zhu, J.; Liu, Y.; Wang, D.; Lu, H. SUTrack: Towards simple and unified single object tracking. AAAI Conf. Artif. Intell. 2025, 39, 2239–2247. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]

Figure 1. Model architecture diagram of this study.

Figure 2. (a) AUC values of the baseline model and the THM-enhanced models with different temperature coefficients (τ) on OTB99. (b) Precision values. (c) Average FPS.

Figure 3. Model performance under different jitter intensities: (a) AUC, (b) Precision, and (c) Average FPS.

Figure 4. Loss change trend chart during training.

Figure 5. (a) is the success rate curve during the test, and (b) is the accuracy change curve of the model under different center position error thresholds on the OTB99 dataset.

Figure 6. Visual display of target tracking results for some video sequences in the OTB99 test set. The red frames indicate the tracked targets.

Table 1. Comparison of Vision–language Tracking Methods.

Method	Backbone	Fusion Strategy	Key Limitations
MFA	CNN	Bilinear Pooling	No explicit spatial alignment; struggles with semantic ambiguity.
SLB	ResNet	Dense Matching	Information conflict; weak against distractors.
UVLTrack	ViT + BERT	Contrastive Learning	Global alignment only; fails with similar objects.
Ours	ViT + BERT	THM + Contrastive	Explicit spatial mapping; robust to ambiguity.

Table 2. Five representative video sequences in the OTB99 dataset.

Sequence Number	Sequence Name	Frames	Resolution	Text Length
1	Basketball	725	480 × 360	46
2	CarScale	252	352 × 288	37
3	David3	252	768 × 576	39
4	BlurCar1	247	352 × 288	42
5	Jogging1	307	352 × 288	35

Table 3. The parameter group settings for different jitter intensities.

	Center Jitter	Scale Jitter
Low intensity	1.5	0.2
medium intensity	2.5	0.3
high intensity	3.5	0.5

Table 4. Parameters used by different learning rate schedulers.

Parameter	StepLR	MultiStepLR	WarmupMultiStepLR	CosineAnnealingLR
step_size	5	-	-	-
gamma	0.1	0.1	0.1	-
milestones	-	[4,8]	[4,8]	-
warmup_epoch	-	-	2	-
warmup_factor	-	-	0.01	-
T_max	-	-	-	10
eta_min	-	-	-	0

Table 5. Tracking performance metrics using different learning rate schedulers.

	AUC	Precision	Norm Precision	Ave-FPS
StepLR	55.08	72.59	70.67	49.59
MultiStepLR	52.43	69.71	66.64	46.33
WarmupMultiStepLR	51.92	67.76	64.91	48.80
CosineAnnealingLR	54.92	73.46	69.25	46.97

Table 6. Parameter settings for tracking models to achieve optimal tracking performance.

Parameter	Value
THM module temperature coefficient	0.07
Search region center jitter intensity	3.5
Search region scale jitter intensity	0.5
Learning rate scheduler	CosineAnnealingLR

Table 7. Tracking results on OTB99, LaSOT and TNL2K.

Datasets	AUC	Precision	Norm Precision
OTB99	67.09 ± 0.16	89.20 ± 0.11	83.60 ± 0.13
LaSOT	62.57 ± 0.21	65.71 ± 0.14	72.64 ± 0.19
TNL2K	46.84 ± 0.26	44.90 ± 0.24	63.75 ± 0.21

Table 8. Comparison of our method with recent state-of-the-art trackers on LaSOT and OTB99 datasets.

Methods	Lasot		OTB99
Methods	AUC	Precision	AUC	Precision
DUTrack	73.0	81.1	70.9	93.9
JointNLT	60.4	63.6	65.3	85.6
SNLT	54.0	57.6	66.6	80.4
Ours	62.5	65.7	67.0	89.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, W.; Geng, G.; Zhang, X.; Yuan, D. Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping. AI 2025, 6, 263. https://doi.org/10.3390/ai6100263

AMA Style

Xu W, Geng G, Zhang X, Yuan D. Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping. AI. 2025; 6(10):263. https://doi.org/10.3390/ai6100263

Chicago/Turabian Style

Xu, Wei, Gu Geng, Xinming Zhang, and Di Yuan. 2025. "Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping" AI 6, no. 10: 263. https://doi.org/10.3390/ai6100263

APA Style

Xu, W., Geng, G., Zhang, X., & Yuan, D. (2025). Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping. AI, 6(10), 263. https://doi.org/10.3390/ai6100263

Article Menu

Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping

Abstract

1. Introduction

2. Related Work

2.1. Traditional Object Tracking

2.2. Vision–Language Multimodal Object Tracking

2.3. Multimodal Feature Alignment and Fusion Strategy

3. Methods

3.1. Data Augmentation Methods

3.2. Text Heatmap Mapping Module

3.3. Model Training

4. Experiments

4.1. Experimental Details

4.2. Comparative Experiment

4.2.1. THM Modules with Different Temperature Coefficients

4.2.2. Different Search Area Jitter Strengths

4.2.3. Different Learning Rate Schedulers

4.2.4. Optimal Tracking Performance Configuration

4.3. Tracking Visualization

4.4. Extended Experiment

4.4.1. Datasets

4.4.2. Experimental Setup

4.4.3. Experimental Results

5. Limitations, Discussion, and Future Work

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI