Abstract
Multi-object tracking (MOT) technology integrates multiple fields such as pattern recognition, machine learning, and object detection, demonstrating broad application potential in scenarios like low-altitude logistics delivery, urban security, autonomous driving, and intelligent navigation. However, in open-world scenarios, existing MOT methods often face challenges of imprecise target category identification and insufficient tracking accuracy, especially when dealing with numerous target types affected by occlusion and deformation. To address this, we propose a multi-object tracking strategy based on multi-cue fusion. This strategy combines appearance features and spatial feature information, employing BYTE and weighted Intersection over Union (IoU) modules to handle target association, thereby improving tracking accuracy. Furthermore, to tackle the challenge of large vocabularies in open-world scenarios, we introduce an open-vocabulary prompting strategy. By incorporating diverse sentence structures, emotional elements, and image quality descriptions, the expressiveness of text descriptions is enhanced. Combined with the CLIP model, this strategy significantly improves the recognition capability for novel category targets without requiring model retraining. Experimental results on the public TAO benchmark show that our method yields consistent TETA improvements over existing open-vocabulary trackers, with gains of 10% and 16% on base and novel categories, respectively. The results demonstrate that the proposed framework offers a more robust solution for open-vocabulary multi-object tracking in complex environments.
1. Introduction
Multi-object tracking (MOT) is a crucial research topic in computer vision. It involves integrating target detection and localization technologies with feature matching and data association algorithms from deep learning to build robust models capable of predicting spatiotemporally continuous trajectories for multiple targets in video sequences, thereby maintaining identity (ID) consistency over time. In recent years, with the widespread adoption of applications such as autonomous driving, unmanned aerial vehicle (UAV) logistics delivery, intelligent security surveillance, and smart traffic systems, the demand for MOT technology in practical, complex, open environments has grown significantly, accompanied by numerous new challenges.
In traditional application scenarios, MOT primarily focuses on limited categories like pedestrians and vehicles. However, in open environments, tracking targets exhibit high diversity and uncertainty. For instance, wildlife monitoring requires identifying and tracking non-fixed category targets like tigers and leopards; autonomous driving systems need to promptly identify obstacles on the road such as plastic bags or bottles, which are often undefined objects; in low-altitude logistics delivery, UAVs need to achieve the continuous tracking of various dynamic targets in complex urban or rural terrains. In such scenarios, target categories are numerous and morphologically variable, often accompanied by severe occlusion, rapid deformation, illumination changes, and interference from similar appearances, making traditional tracking methods based on closed vocabulary sets difficult to adapt. Particularly in dense, complex-motion low-altitude logistics environments, imaging equipment is affected by flight attitude and ground object occlusion, easily leading to target loss or ID confusion, severely impacting the accuracy and reliability of tracking systems.
To overcome the reliance of traditional MOT methods on predefined category sets, open-vocabulary MOT has emerged. Its core objective is to leverage the powerful semantic understanding and generalization capabilities of vision–language models (e.g., CLIP) to enable the recognition and tracking of novel category targets not seen during training. Open-vocabulary methods divide categories into base classes (known) and novel classes (unknown). By training on the base class label space, the model gains the ability to recognize novel classes, significantly expanding the applicability of tracking systems. However, current research on open-vocabulary MOT is still in its early stages, and representative works like OVTrack have several shortcomings. Especially in practical scenarios with frequent target motion and severe occlusion, existing methods have considerable room for improvement in terms of target association robustness, discriminative feature extraction capability, and cross-category generalization performance. Furthermore, vision–language models themselves are not specifically designed for MOT tasks and have limitations in distinguishing between different targets with similar appearances and handling complex motion patterns and occlusion situations.
To address these issues, we propose an open-vocabulary multi-object tracking framework built on multi-cue fusion. Unlike OVTrack, which relies primarily on appearance cues, and ByteTrack/QDTrack, which emphasize motion-based heuristics, our approach jointly leverages appearance, motion, and spatial cues through a unified association strategy that combines the BYTE data association mechanism with a Height-Weighted IoU (HWIoU) module, enabling more reliable matching under scale variation and deformation. In addition, we design an open-vocabulary prompting strategy with tracking-oriented templates that incorporate diverse sentence structures, affective descriptions, and image-quality cues, enriching the semantic expressiveness of CLIP and improving its recognition of and generalization to novel categories without retraining. Experiments on the public TAO benchmark show that our method improves the TETA metric by 10% and 16% on base and novel categories, respectively, validating its accuracy and robustness in complex open-world scenarios.
2. Related Work
2.1. Detection-Based Multi-Object Tracking Methods
The mainstream paradigm for multi-object tracking based on deep learning is Tracking by Detection (TBD). Object detection, as a key initial step in MOT, involves identifying and precisely locating regions of interest in images. There are two general approaches to object detection: two-stage and one-stage detection. Typical two-stage detectors include the R-CNN series algorithms, such as R-CNN [1], Fast R-CNN [2], and Faster R-CNN [3]. These methods improve detection accuracy by introducing a Region Proposal Network (RPN) but at the cost of increased computational overhead. One-stage detectors treat object detection as a regression problem, directly predicting object locations and categories from images. The main idea is to slide a detection network over multiple scales to predict target size, location, and category. Representative one-stage detectors include the YOLO [4] (You Only Look Once) series and SSD [5] (Single-Shot MultiBox Detector), which offer faster detection speeds.
In the feature extraction stage of MOT, features can be categorized into strong and weak features. Strong features include appearance and motion features. Appearance features are extracted from regions of interest via detection networks, reflecting the visual appearance of targets—exemplified by models like QDTrack [6]. Motion features are obtained by predicting the motion state of targets, revealing dynamic changes between frames, as seen in classic models like DeepSORT [7]. Weak features mainly include confidence information and detection box location information. Confidence information can indicate whether a target has undergone deformation, while detection box locations help infer spatial depth information.
In complex scenarios such as occlusion, pure appearance features often exhibit significant variation. In such cases, motion features can capture the spatial movement of targets. For example, the Simple Online and Real-time Tracking (SORT) algorithm [8] first performs object detection, then uses Kalman filtering to predict target position and velocity after obtaining bounding boxes, and finally employs the Hungarian algorithm to associate detection boxes with predicted trajectories. This method requires less computation than appearance-based methods and offers better real-time performance. Wojke et al. [7] improved SORT to propose DeepSORT, which combines appearance and motion features to enhance robustness in occluded and complex motion scenarios. Recent works such as ByteTrack [9] and Tracktor [10] have further promoted the application of motion features in MOT. ByteTrack uses Kalman filtering and multi-level matching strategies to significantly reduce missed detections and trajectory fragmentation, while Tracktor simplifies the tracking process through continuous regression of detection boxes and motion models while maintaining high tracking accuracy.
Target classification plays a vital role in MOT. Traditional MOT benchmarks [11] typically define a fixed set of semantic categories for training and testing. To address MOT in open-world scenarios, researchers have extended traditional frameworks. A. Ošep et al. [12] proposed a universal tracking method based on scene segmentation, performing tracking before classification to follow arbitrary objects. Other studies have used category-agnostic locators to track arbitrary objects [13].
Recently, Liu et al. [14] defined the open-world tracking task, focusing on evaluating the model’s ability to track unseen objects. This task requires tracking before classification, presenting two main challenges: the high cost of dense annotation for all objects in open-world scenes, and the lack of a predefined category system that blurs the definition of object concepts. Liu et al. used a recall-based evaluation method, which has limitations: it does not penalize false positives (FP), thus failing to fully measure tracker precision, and its category-agnostic nature prevents the assessment of the tracker’s ability to infer semantic categories. Open-vocabulary multi-object tracking [15] aims to track multiple objects outside the training dataset, focusing on the classification problem in tracking. It uses a recall-based evaluation method and assumes that test targets are equipped with corresponding category labels. Under this premise, researchers can use existing closed-set tracking evaluation metrics [16], which capture both precision and recall and assess the tracker’s ability to follow arbitrary objects during inference. With the emergence of OVTrack [15], which replaces traditional classifiers with an embedding head to measure semantic similarity between targets and categories—specifically by aligning image features from object detection with image and text embeddings from CLIP [17]—knowledge from CLIP is integrated into the tracking framework.
In MOT, similarity measurement methods can be broadly divided into traditional metrics and deep learning-based predictive metrics. Traditional methods rely on specific distance functions to compute feature similarity, with performance heavily dependent on the chosen feature representation and distance metric. A typical example is Intersection over Union (IoU) [17,18], which evaluates target similarity between frames by calculating the overlap between detection boxes. Although computationally efficient, IoU struggles in complex scenarios with occlusion and overlapping targets due to its reliance solely on spatial information. In contrast, deep learning-based methods use models like convolutional neural networks to learn feature representations and compute similarity based on these representations. These methods capture deep semantic information, better handling appearance variations and complex background interference, and often exhibit superior performance in MOT tasks.
2.2. Appearance-Only Multi-Object Tracking Strategies
In MOT, existing research typically adopts two main technical approaches to enhance tracking robustness and achieve target re-identification using instance appearance similarity: building independent appearance feature extraction models or integrating additional embedding heads into end-to-end trainable detection frameworks. However, these methods essentially rely on image similarity learning strategies to model appearance similarity. Metrics such as cosine distance are commonly used to compute feature similarity between instances.
Appearance similarity learning is formalized into two typical paradigms: one treats the number of identities N in the training set as the classification dimension to construct an N-class classification task; the other uses triplet loss functions to constrain the feature space. Both paradigms have significant limitations in practice: the former faces scalability issues in large-scale datasets, while the latter struggles to establish discriminative relationships in the global feature space with only triplet constraints.
Unlike these methods, QDTrack [6], as a pure appearance-based algorithm, employs a dense similarity learning method, densely sampling hundreds of regions on image pairs for contrastive learning. This approach can be combined with existing detection methods to form quasi-dense tracking without relying on displacement regression or motion priors. The resulting unique feature space allows efficient target matching via simple nearest-neighbor search during inference.
2.3. Spatially Based Multi-Object Tracking Strategies
Spatially based MOT strategies are widely used, particularly in high-frame-rate real-time tracking tasks. When the inter-frame interval is short, object motion is minimal and can be approximated as linear, making spatial information an important and accurate cue for short-term association. ByteTrack [9] is a widely used spatially based MOT method that improves upon DeepSORT [7]. Instead of simply discarding low-confidence detection boxes, ByteTrack retains them for continued matching, reducing identity switches caused by object motion or occlusion. The tracking process proceeds as follows: (1) video frames are input and object detection is performed to obtain detection boxes; (2) the boxes are divided into high- and low-confidence sets; (3) Kalman filters predict the current positions of existing trajectories; (4) high-confidence detection boxes are matched with the predicted trajectories using the Hungarian algorithm; (5) unmatched trajectories are matched with low-confidence detection boxes; (6) new trajectories are created for unmatched high-confidence boxes with sufficient confidence; and (7) unmatched low-confidence boxes are updated or discarded as required. The updated trajectory set represents the tracking paths of all targets in the video. These design choices allow ByteTrack to effectively handle occlusion, motion, and identity switches, significantly enhancing tracking stability and accuracy. The recent HybridSORT [19] leverages both representation and motion information for MOT.
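For illustration, the following is a minimal Python sketch of this high/low-confidence two-stage matching. The function names, the plain IoU cost, and the thresholds are simplifying assumptions for exposition, not ByteTrack's reference implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(tracks, dets, thresh=0.3):
    """Hungarian matching on an IoU cost matrix; returns matches and leftovers."""
    if not tracks or not dets:
        return [], list(range(len(tracks))), list(range(len(dets)))
    cost = np.array([[1.0 - iou(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1.0 - thresh]
    um_t = [i for i in range(len(tracks)) if i not in {r for r, _ in matches}]
    um_d = [j for j in range(len(dets)) if j not in {c for _, c in matches}]
    return matches, um_t, um_d

def byte_step(track_boxes, det_boxes, det_scores, high=0.6):
    """One BYTE-style frame: match high-score boxes first, then recover tracks with low-score ones."""
    high_dets = [b for b, s in zip(det_boxes, det_scores) if s >= high]
    low_dets = [b for b, s in zip(det_boxes, det_scores) if s < high]
    m1, um_tracks, um_high = match(track_boxes, high_dets)       # first association
    remaining = [track_boxes[i] for i in um_tracks]
    m2, still_um, _ = match(remaining, low_dets)                  # second association
    lost = [um_tracks[i] for i in still_um]                       # kept for later re-matching
    return m1, m2, um_high, lost                                  # um_high: candidates for new tracks
```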
2.4. Open-Vocabulary Multi-Object Tracking
Open-vocabulary multi-object tracking [15] primarily uses open-vocabulary learning to extend the tracker’s capability to novel categories beyond the training set, including both seen and unseen classes during training. Open-vocabulary trackers are initially trained on known categories and leverage large amounts of additional image–text data to acquire rich knowledge for handling open-vocabulary scenarios. This knowledge is extracted and integrated into the MOT framework to further optimize training and alignment, transforming closed-set MOT trackers [6] into open-vocabulary trackers capable of effectively detecting both seen and unseen objects. Open-vocabulary MOT not only tracks seen objects but also handles entirely new, unseen categories. It combines open-vocabulary object detection [20] with traditional MOT methods [6]. Current open-vocabulary MOT frameworks generally adopt a staged approach, decoupling open-vocabulary detection and MOT into independent processing modules.
Unlike existing MOT methods, we propose a multi-cue fusion strategy combining appearance, motion, and spatial information to better handle complex scenarios such as occlusion and deformation. Traditional methods often rely on single features, limiting their performance in dynamic and complex environments. The proposed method also introduces an open-vocabulary prompting strategy, enhancing the model’s ability to recognize unseen targets through a vision–language model without retraining. This allows the model to adapt more flexibly to new target categories in open-world scenarios. Compared to traditional detection-based MOT methods, the proposed approach demonstrates higher accuracy and robustness, promising practical applications in areas such as low-altitude logistics delivery and urban security.
3. Multi-Object Tracking Based on Open Vocabulary and Multi-Cue Fusion
3.1. Algorithm Architecture
As illustrated in Figure 1, open-vocabulary MOT involves a substantially richer set of category-level textual descriptions compared to conventional closed-set tracking. To effectively utilize this information, we first introduce a caption parser that extracts candidate category phrases using a scene parser [21], followed by a pre-trained vision–language model that computes image–text similarity for open-vocabulary detection. Building on this, our tracker performs multi-cue joint association—integrating appearance, motion, and spatial cues—to robustly assign identities, as depicted in Figure 2. This formulation enables open-vocabulary tracking without requiring category-specific model retraining.
Figure 1.
Category differences between open-vocabulary and traditional multi-object tracking. The open-world scenarios involve more category text information than closed-set MOT. By introducing a caption parser, category information is extracted from captions using a scene parser [21], and a pre-trained vision–language model is used to process similarity information between images and categories.
Figure 2.
Open-vocabulary tracking framework based on multi-cue fusion. A tracker associates targets to achieve MOT for the specified categories.
To address deformation and occlusion in open scenarios, a multi-cue fusion association strategy is introduced, with prompts adjusted for the MOT context. The detailed overall algorithm architecture is shown in Figure 3.
Figure 3.
Overall algorithm architecture. The proposed method builds on object detection, pre-trained vision–language model classification, and multi-cue fusion association. It uses a Faster R-CNN [3] detector to detect objects in videos and the CLIP vision–language model to obtain category labels, and combines strong motion–appearance cues with weak bounding-box height cues for target association.
3.2. Open-Vocabulary Object Detection
3.2.1. Object Detector
This study uses a backbone network based on ResNeXt-50 [22] and a Feature Pyramid Network, with Faster R-CNN as the detector. Faster R-CNN is a two-stage detector. In the first stage, it generates numerous candidate regions (proposals) with a Region Proposal Network (RPN). To localize objects of arbitrary and potentially unknown classes in videos, the Region Proposal Network and regression loss defined in [3] are used. This localization detection process [23] generalizes well to object classes unknown during training. During training, the RPN is used as the object proposer to achieve greater diversity, while during inference, the R-CNN output is used as the object proposals. Each proposal is defined by a confidence score and a bounding box. The detector's final stage is connected to the pre-distilled vision–language model CLIP for classification.
3.2.2. Open-Vocabulary Object Detection Pre-Training
This study used the DetPro [24] method combined with ViLD to extract the vision–language model CLIP and its pre-trained prompts. The CLIP model consists of a text encoder T and an image encoder I. T takes prompts representing classes as input and outputs corresponding text embeddings (class embeddings). I takes an image resized to 224 × 224 as input and outputs the corresponding image embedding.
The method [23] distills a two-stage object detector (student model) using a pre-trained open-set classification model as the teacher model. Specifically, the teacher model is used to encode category text and generated proposals, training the student detector to align text and region features. The region embedding is obtained by applying a lightweight head to the backbone features of each proposal, taking the output before the classification layer as the embedding.
The classifier of Faster R-CNN [3] is replaced with a text embedding processing module, whose goal is to train the region embeddings so that classification can be performed via text embeddings. Denoting a region embedding by $e_r$, the text embeddings of the base categories $C_B$ by $\{t_1, \dots, t_{|C_B|}\}$, and a learnable background embedding by $t_{bg}$, the loss of the text embedding processing module can be written as

$$\mathcal{L}_{\text{text}} = \mathcal{L}_{\text{CE}}\!\left(\mathrm{softmax}\!\left(\left[\cos(e_r, t_{bg}), \cos(e_r, t_1), \dots, \cos(e_r, t_{|C_B|})\right] / \tau\right),\ y\right),$$

where $\tau$ is a temperature parameter and $y$ is the ground-truth category label. The text embedding processing module uses only the text embeddings of the base categories for training. Regions that do not match any ground-truth label in $C_B$ are assigned to a background category; since the word "background" may not represent these unmatched regions well, the background category learns its own embedding $t_{bg}$. The cosine similarity between each region embedding and all category embeddings (including $t_{bg}$) is computed, and a softmax activation with temperature $\tau$ is applied before calculating the cross-entropy loss. To train the first stage (RPN) of the two-stage detector, regions extracted by the detector are utilized, and the detector is initialized and trained using the text embedding processing module.
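As an illustration of this classification step, the sketch below computes temperature-scaled cosine similarities between region embeddings and the category (plus learned background) embeddings and applies a cross-entropy loss. The tensor names and the temperature value are assumptions for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def text_embedding_loss(region_emb, base_text_emb, bg_emb, labels, temperature=0.01):
    """Cross-entropy over temperature-scaled cosine similarities.

    region_emb:    (N, D) region embeddings from the detector head
    base_text_emb: (C, D) frozen CLIP text embeddings of the base categories
    bg_emb:        (1, D) learnable background embedding (an nn.Parameter in practice)
    labels:        (N,)   index 0 = background, 1..C = base categories
    """
    class_emb = torch.cat([bg_emb, base_text_emb], dim=0)                      # (C + 1, D)
    sims = F.normalize(region_emb, dim=-1) @ F.normalize(class_emb, dim=-1).t()  # cosine similarities
    return F.cross_entropy(sims / temperature, labels)
```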
The overall detector training framework is shown in Figure 4.
Figure 4.
Open-vocabulary detector training and data distillation process. Text encoder distillation is performed only from base classes. Category names are input into the pre-trained text encoder to obtain text embeddings, and the inferred text embeddings are used to classify the detected regions. Image encoder distillation is performed from both base and novel classes because the proposal network might detect regions containing novel classes. First, the object proposal regions are filtered using the NMS algorithm [3], and then cropped from the original image. The cropped regions are fed into the image encoder of the CLIP model. The output of the vision–language model’s image encoder serves as the teacher region features. Subsequently, a Faster R-CNN detector is trained to learn the output of the teacher network, aligning the student output with the teacher image feature embeddings.
Image encoder distillation is introduced to extract image embeddings, with the aim of distilling knowledge from the image encoder V of the pre-trained vision–language model (teacher model) into the object detector (student model). The distillation aligns each region embedding with the corresponding image embedding produced by V. To improve training efficiency, M regions are extracted for each training image and the corresponding M image embeddings are precomputed, and an L1 loss is applied between the region and image embeddings to minimize their distance. Note that the candidate regions can contain objects from both base and novel categories, whereas the text encoder can only learn from the base categories.
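A minimal sketch of this distillation term is given below, assuming L2-normalized embeddings and precomputed, frozen teacher features; the normalization and function name are illustrative assumptions.

```python
import torch.nn.functional as F

def image_distillation_loss(region_emb, clip_image_emb):
    """L1 distillation between detector region embeddings (student) and precomputed
    CLIP image embeddings of the same cropped proposals (teacher), M proposals per image.
    Normalizing both sides before the L1 loss is an assumption of this sketch."""
    region_emb = F.normalize(region_emb, dim=-1)           # (M, D) student
    clip_image_emb = F.normalize(clip_image_emb, dim=-1)   # (M, D) teacher
    return F.l1_loss(region_emb, clip_image_emb.detach())  # no gradient into the teacher
```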
3.3. Multi-Cue Fusion Target Association Strategy
In open-world scenarios, target motion is complex and variable, prone to deformation, making methods relying solely on appearance matching ineffective under complex conditions like occlusion and deformation [24]. For example, in pedestrian and vehicle MOT scenarios (e.g., MOT17 [25]), target color changes are insignificant and displacement is small. However, in open-world scenarios, targets may undergo significant appearance changes due to their own motion, as shown in Figure 5.
Figure 5.
Appearance variations in traditional MOT scenes vs. open-vocabulary MOT scenes.
To alleviate these problems, we introduce a multi-cue fusion target association strategy that uses strong cues, such as motion combined with appearance, together with weak cues, such as candidate bounding box height, for target association. Appearance-based matching primarily relies on category-sample-based matching to find candidate associations for a query object within the tracklets.
The BYTE method is adopted to obtain motion cues, tracking targets by associating every detection box in the video, not just high-score boxes. This method excels in addressing issues of missed detections and track fragmentation caused by target occlusion or blurring, significantly enhancing track completeness and accuracy. Almost all detection boxes are retained and categorized into high-score and low-score boxes. In the processing pipeline, high-score boxes are first associated with previous tracks. Due to factors like occlusion, motion blur, or target size changes, some previous tracks might not find suitable high-score matches. Low-score boxes are then associated with these unmatched tracklets, recovering target objects from low-score boxes while effectively filtering out background information.
The association IoU is calculated using the Height-Weighted IoU (HWIoU) method. This method combines the maximum and minimum heights of the predicted box and the detection box from the previous frame to assess the intensity of target motion, weighting the motion cue accordingly to judge its reliability. After obtaining the motion cue, it is combined with the appearance cue to obtain the final target object similarity value. The Hungarian algorithm is then used for ID assignment.
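The following Python sketch illustrates one plausible form of such a height weighting: the standard IoU modulated by the ratio of the smaller to the larger box height, so that matches between boxes of very different heights (strong motion or deformation) are down-weighted. The function name and the exact weighting term are assumptions for illustration and may differ in detail from Equations (8) and (9).

```python
def hwiou(det, pred):
    """Height-Weighted IoU sketch for boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(det[0], pred[0]), max(det[1], pred[1])
    x2, y2 = min(det[2], pred[2]), min(det[3], pred[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_d = (det[2] - det[0]) * (det[3] - det[1])
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    iou = inter / (area_d + area_p - inter + 1e-9)

    h_d, h_p = det[3] - det[1], pred[3] - pred[1]
    height_weight = min(h_d, h_p) / (max(h_d, h_p) + 1e-9)  # assumed weighting term
    return iou * height_weight
```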
Tracker Module Training: The LVIS dataset [26], a large-scale image dataset with diverse categories and complex scenes, is used as the training set to train the tracker. The training flowchart is shown in Figure 6.
Figure 6.
Training flowchart for the multi-cue fusion target association strategy.
For each image in LVIS, a reference image is generated using the denoising diffusion (DDPM) [27] method. An instance similarity loss is used, employing the contrastive learning method from [6] for tracking training; the loss functions are calculated as shown in Equations (5)–(7).
Regions of Interest (RoIs) are extracted from the feature maps of the key image and the generated reference image, and IoU is used to match these RoIs with the annotations. For each matched RoI in the reference image, an appearance embedding (feature vector) is extracted. Positive examples are used to cluster objects with the same identity, while negative examples with different identities are pushed apart. An auxiliary loss function is applied to control the output scale of the model's final layer.
Appearance similarity learning is achieved by increasing the distance between positive and negative examples. Traditional data augmentation techniques like translation, scaling, and rotation are used to simulate video effects.
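For concreteness, the sketch below shows a generic multi-positive contrastive loss with an auxiliary cosine-similarity term in the spirit of the quasi-dense similarity learning of [6]. The pairwise log-sum-exp form, the auxiliary targets, and the 0.5 weight are assumptions for illustration rather than the exact Equations (5)–(7).

```python
import torch
import torch.nn.functional as F

def instance_similarity_loss(query, keys, same_id):
    """Multi-positive contrastive loss sketch.

    query:   (D,)   embedding of one RoI in the key image
    keys:    (K, D) embeddings of candidate RoIs in the reference image
    same_id: (K,)   bool mask, True where the candidate shares the query's identity
    """
    logits = query @ keys.t()                        # dot-product similarities, shape (K,)
    pos, neg = logits[same_id], logits[~same_id]
    pair = neg.unsqueeze(0) - pos.unsqueeze(1)       # all (negative - positive) pairs, (P, N)
    l_embed = torch.log1p(pair.exp().sum())          # log(1 + sum of exp(neg - pos))

    # auxiliary term pulling cosine similarity toward 1 for positives and 0 for negatives
    cos = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1)
    l_aux = ((cos - same_id.float()) ** 2).mean()
    return l_embed + 0.5 * l_aux                     # 0.5 is an assumed weight
```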
During testing, the BYTE method and HWIoU module are used, combined with appearance features for MOT. The Height-Weighted IoU calculation is shown in Equations (8) and (9).
In these equations, the two boxes denote the detection box in the current frame and the prediction box propagated from the previous frame, respectively. The inputs to the tracking procedure are a video sequence, an object detector, and a detection threshold. For each frame, the detector predicts detection boxes and scores, which are split into high-confidence and low-confidence sets according to the threshold, alongside the set of existing tracks T. For the high-score boxes, a Kalman filter predicts the new position of each existing target in the current frame, and the corresponding appearance feature embeddings are extracted. Cosine similarity and bi-softmax are used to measure the appearance similarity between target embeddings and matching candidates.
Suppose there are H high-confidence targets in the current frame with feature embeddings $h$ and M matching candidates from the past x frames with feature embeddings $e$; the cosine and bi-softmax similarities are

$$\mathrm{sim}_{\cos}(i,j) = \frac{h_i \cdot e_j}{\lVert h_i \rVert \, \lVert e_j \rVert}, \qquad \mathrm{sim}_{\mathrm{bi}}(i,j) = \frac{1}{2}\!\left[\frac{\exp(h_i \cdot e_j)}{\sum_{k=1}^{M}\exp(h_i \cdot e_k)} + \frac{\exp(h_i \cdot e_j)}{\sum_{k=1}^{H}\exp(h_k \cdot e_j)}\right].$$

The resulting appearance similarity is used as a re-identification cue, incorporated into the cost matrix, and the Hungarian algorithm performs matching based on this similarity.
Unmatched high-confidence detection boxes are used to initialize new tracklets, while tracks left unmatched after this first association are retained for a second round. These remaining tracks are then associated with the low-confidence detection boxes using HWIoU matching. Tracks that remain unmatched after both the high- and low-confidence rounds are moved to a lost-track set; lost tracks continue to be considered for matching in subsequent frames and are deleted only if they remain unmatched for a certain duration. The detection threshold was set to 0.6 in this study.
3.4. Open-Vocabulary Prompting Strategy for Multi-Object Tracking
In research on vision–language models like CLIP, text–image pairs typically consist of complete sentences rather than single words. This pairing method helps improve model performance. To bridge the vocabulary distribution gap in text–image pairing, research has found that using “prompt” text effectively enhances classification performance. For the open-vocabulary MOT task, CLIP’s ability to recognize novel categories is enhanced by adapting the “prompt” text. However, since CLIP’s weights are frozen and retraining requires substantial computational resources, prompt tuning becomes an effective means to improve task adaptability.
This study proposes a new image description template specifically designed for large-vocabulary scenarios. Compared to traditional methods, this template employs more diverse sentence structures, such as “This is a photo of a category” and “This is one category in the scene”. The template also introduces emotional descriptions (e.g., “a photo of a clean category” or “a dark photo of the category”), increasing the expressiveness and personalized characteristics of the text. By refining descriptions related to image quality, such as “a low resolution photo of a category”, the model’s classification ability for novel categories is significantly improved.
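The sketch below illustrates how such tracking-oriented templates can be ensembled into per-class text embeddings with the frozen CLIP text encoder, using the public OpenAI clip package. The template list follows the examples above, while the model variant and mean-pooling ensembling are assumptions of this sketch.

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

TEMPLATES = [
    "This is a photo of a {}.",
    "This is one {} in the scene.",
    "a photo of a clean {}.",
    "a dark photo of the {}.",
    "a low resolution photo of a {}.",
]

@torch.no_grad()
def class_embeddings(class_names, device="cpu"):
    """Ensemble the prompt templates for each class into a single text embedding."""
    model, _ = clip.load("ViT-B/32", device=device)
    embs = []
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in TEMPLATES]).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        embs.append(feats.mean(dim=0))               # average over templates
    embs = torch.stack(embs)
    return embs / embs.norm(dim=-1, keepdim=True)    # (num_classes, D)
```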
3.5. Interaction Between BYTE, HWIoU, Appearance Embeddings, and Prompt-Enhanced CLIP Features
To provide a clearer description of the methodology, we explicitly explain how the different cues interact within the proposed association framework. First, the open-vocabulary detector produces region-level appearance embeddings that have been enriched by our tracking-oriented CLIP prompts, enabling stronger visual discrimination for both base and novel categories. These embeddings form the appearance similarity branch in the association module. Second, the BYTE strategy performs a two-stage matching process, where high-confidence detections are first matched using appearance similarity combined with motion predictions from a Kalman filter. For unmatched trajectories, a second-stage association is conducted using low-confidence detections. In this stage, we introduce the Height-Weighted IoU (HWIoU) as a geometry-aware cue that emphasizes height consistency between predicted and detected boxes, which is particularly beneficial in open-vocabulary scenarios where object shapes and scales vary widely. Finally, appearance similarity, motion cost, and HWIoU are normalized and fused into a unified cost matrix before Hungarian matching. This integrated design ensures that CLIP-enhanced semantics guide the model toward correct category discrimination, while BYTE and HWIoU provide robust spatiotemporal constraints to mitigate occlusion, deformation, and large-scale variations.
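As a compact illustration of this fusion, the sketch below combines the three normalized affinity matrices into a single cost matrix before Hungarian matching. The equal fusion weights and the gating threshold are illustrative assumptions, since the exact weighting scheme is not specified above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_and_match(app_sim, motion_iou, hw_iou, w=(1/3, 1/3, 1/3), gate=0.7):
    """Fuse normalized appearance, motion, and HWIoU affinities (each in [0, 1],
    shape (num_tracks, num_dets)) into one cost matrix and run Hungarian matching."""
    affinity = w[0] * app_sim + w[1] * motion_iou + w[2] * hw_iou
    cost = 1.0 - affinity
    rows, cols = linear_sum_assignment(cost)
    # keep only assignments whose fused cost is below the gate
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]
```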
4. Experiments and Result Analysis
4.1. Datasets
In the dataset configuration, the training dataset used was the LVIS dataset, a large-scale, fine-grained vocabulary dataset with approximately 2 million high-quality instance segmentation annotations for over 1000 object categories, containing about 164k images. The validation and testing datasets use the TAO dataset. TAO is the first large-scale dataset for general multi-object tracking, including 2907 videos and 833 object categories.
4.2. Evaluation Metrics
To quantitatively analyze the open-vocabulary tracking performance of the model, this study adopted the Track Every Thing Accuracy (TETA) [16] metric from the MOT field. The calculation method is shown in Equation (12).
Here, the TETA metric decomposes tracking measurement into three factors: Localization Accuracy (LocA), Association Accuracy (AssocA), and Classification Accuracy (ClsA). A higher metric indicates better performance. TETA already covers the functional dimensions that traditional metrics measure while additionally incorporating open-set classification, making it better aligned with the goals of this work.
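For reference, TETA is commonly reported as the unweighted average of its three factors; the expression below assumes this standard form from [16].

```latex
% TETA as the mean of its three factors, following the standard definition in [16].
\mathrm{TETA} = \tfrac{1}{3}\left(\mathrm{LocA} + \mathrm{AssocA} + \mathrm{ClsA}\right)
```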
4.3. Experimental Setup
The experiments were conducted with the PyTorch v2.9.0 deep learning framework on the Ubuntu 20.04 operating system, using a GeForce RTX 4090 GPU accelerated with CUDA 11.1 and an Intel® Xeon® Gold 6430 CPU. With this setup, the proposed framework achieves an average inference speed of 28–33 ms per frame (approximately 30–35 FPS). The open-vocabulary detection module requires approximately 45–55 GFLOPs, while the multi-cue tracking and association components introduce an additional 20–30 GFLOPs, resulting in a total computational cost of around 70–85 GFLOPs.
This study implemented an object tracking model incorporating a ResNeXt-50 backbone network. The backbone was pre-trained on the ImageNet-1K dataset, and the full model was trained on the LVIS dataset. Training used the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001. All models were trained for 24 epochs with a step decay schedule, the learning rate decaying at 2/3 and 5/6 of the total training schedule (i.e., at the 16th and 20th epochs), and linear warmup was applied at the beginning of training. The classification loss used cross-entropy loss, the bounding box regression loss used L1 loss, the tracking loss used multi-positive cross-entropy loss, and the auxiliary tracking loss used L2 loss. During testing, multi-scale flipping augmentation was used, images were resized to (1333, 800), and normalization, padding, and tensor conversion were applied.
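For reproducibility, the optimizer and schedule described above can be expressed in PyTorch roughly as follows; the placeholder model, the 0.1 decay factor, and the warmup length are assumptions of this sketch.

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)  # placeholder for the detector backbone and heads
optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=1e-4)
# Step decay at 2/3 and 5/6 of the 24-epoch schedule (epochs 16 and 20);
# the 0.1 decay factor and 500-iteration linear warmup are assumed values.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 20], gamma=0.1)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.001, total_iters=500)
```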
4.4. Comparative Experiments
Table 1 lists the training data configurations designed for the TAO validation and test sets. Furthermore, Table 2 presents the experimental evaluation results on the TAO validation set. Table 3 shows the evaluation results on the TAO test set. Categories in the dataset are divided into base and novel categories.
Table 1.
Training data settings on TAO validation and test sets.
Table 2.
Comparative experiments on the TAO validation set.
Table 3.
Comparative experiments on the TAO test set.
The 'TAO' data column indicates training using TAO video sequences, while the 'LVIS' column indicates training using the LVIS dataset. The established baselines include both closed-set trackers and open-vocabulary trackers. Two state-of-the-art closed-set trackers, TETer [16] and QDTrack, which were trained on both novel and base categories, were selected. The off-the-shelf trackers DeepSORT and Tracktor++ [10] combined with the open-vocabulary detector ViLD served as baseline open-vocabulary trackers. These trackers, like OVTrack, were trained only on base categories. Our method and OVTrack used only static images for training, while other methods used video data for training.
Specifically, considering the characteristics of the open-vocabulary scenario, the test samples were divided into two subsets: base (categories seen during training) and novel (categories not seen during training).
On the TAO validation set (Table 2), the tracker utilizing the multi-cue fusion association strategy and MOT prompts improves the TETA metric by 5% on base categories and 15% on novel categories compared to the OVTrack model.
In Table 3, on the TAO test set, the tracker employing a multi-cue fusion association strategy and multi-object tracking prompts achieves a 10% improvement in the TETA metric for base classes and a 16% improvement for novel classes compared with the OVTrack model.
In the open-vocabulary tracking task on the TAO test set, the proposed method demonstrates clear advantages. Compared with the state-of-the-art OVTrack model, the TETA scores for both base and novel classes are improved, validating the generalization effectiveness of the multi-cue fusion strategy and tracking prompt mechanism. The association metric for base classes increases from 35.4 to 39.4, indicating that the multi-cue fusion strategy significantly enhances association performance. Meanwhile, compared with a closed-set tracker combined with an open-vocabulary detector, the proposed method achieves noticeably better classification performance on novel classes. Moreover, by leveraging visual–language model prompts, the model exhibits improved recognition capability for unseen categories compared with closed-set trackers.
4.5. Ablation Studies
Following the comparative experiments above, Table 4 compares the proposed algorithm model using the track prompt against using only a single prompt, CLIP’s prompt mechanism, ViLD’s prompt, OVTrack’s prompt, and a prompt template generated by GPT-4o [28] on the TAO validation set. The experimental results indicate that for the tracking model using only a single prompt template, the classification ability for novel classes significantly decreases compared to the OVTrack work, with its ClsA metric dropping by 0.9. This result highlights the limitations of a single prompt template in open-vocabulary scenarios. After introducing the track prompt, compared to the OVTrack work, the classification metric for base classes increases by 0.2, and for novel classes by 2.6. After incorporating the track prompt, the model shows minor fluctuations in LocA and AssocA metrics, while the ClsA for novel categories is effectively improved, enhancing the model’s generalization ability in open-vocabulary scenarios.
Table 4.
Ablation study on prompt templates on the TAO validation set.
Compared to other prompts, the model using the CLIP prompt shows decreased classification performance on base categories but improved classification and association performance on novel categories. For the model using the ViLD prompt, classification performance on base categories decreases while association performance increases; on novel categories, classification performance improves but association performance declines. For the model using prompts generated by the large language model GPT-4o, localization and association performance on base classes decreases significantly although classification improves; localization and association performance on novel classes also decreases noticeably, while classification performance improves.
In Table 5, using OVTrack as the baseline model, and employing the aforementioned vision–language model tracking prompts, we analyzed the use of only the BYTE method to obtain motion cues combined with appearance cues for target association and then added the HWIoU and WWIoU (Width-Weighted IoU) modules for experiments.
Table 5.
Ablation study on association modules on the TAO validation set.
Experiments on the TAO validation set show that using only the BYTE method improves the association capability for both base and novel classes compared to the baseline model, increasing the AssocA metric for base and novel classes by 13% and 5%, respectively. Using the BYTE method with the Width-Weighted IoU module increases the novel-class AssocA metric by 2%, while using it with the Height-Weighted IoU module increases the novel-class AssocA metric by 3%, showing that the width and height information of detection boxes has a measurable influence on capturing target motion. Adding the Height-Weighted IoU module has only a minor impact on target localization and classification performance. Overall, the BYTE method and the Height-Weighted and Width-Weighted IoU modules mainly affect the model's association capability and have a weaker effect on its classification ability.
In Table 6, experiments on the TAO test set show that using only the BYTE method increases the AssocA metric for base and novel classes by 8% and 9%, respectively. Using the BYTE method with the Width-Weighted IoU module increases the novel-class AssocA metric by 3%, and using it with the Height-Weighted IoU module also increases the novel-class AssocA metric by 3%. These results confirm that the width and height information of candidate boxes positively influences capturing target motion, contributing to improved multi-object tracking performance in open-vocabulary scenarios.
Table 6.
Performance comparison on testing set with different methods.
Figure 7.
Partial object tracking results on the TAO dataset.
Figure 7 shows partial tracking results of the proposed algorithm, including different target categories. It can be observed that the proposed algorithm achieves high accuracy in the open-vocabulary object tracking task, with fewer missed detections and false alarms, effectively tracking the motion trajectories of targets. Notably, even for the untrained ‘drone’ category, the proposed tracker can robustly track its flight trajectory in the air.
To visually demonstrate the improvement, the proposed method was tested on untrained video data and target categories and compared with the baseline model. Different bounding box colors represent different targets. At frame t (Figure 8), the baseline model detects only three of the performing fighter jets, missing one. The fighter jet marked by the blue bounding box is climbing rapidly and is about to be occluded by the fighter jet marked by the pink bounding box. The proposed model detects all four performing fighter jets at frame t. At frame t+1 (Figure 9), the baseline model, which uses a pure appearance strategy for target association, assigns the ID of the pink-box fighter jet from frame t to the fighter jet that was marked by the blue box at frame t, resulting in an incorrect ID assignment; meanwhile, the fighter jet shown in pink at frame t is treated as a new appearance and is marked by a red bounding box at frame t+1. In contrast, the proposed model, utilizing the multi-cue fusion target association strategy, maintains the ID assignment (blue box) for the rapidly climbing fighter jet from frame t even after its occlusion by the pink-box fighter jet at frame t+1, and the ID of the fighter jet marked by the pink box at frame t+1 also remains consistent with frame t. No ID switch occurs despite the deformation and occlusion, demonstrating that the multi-cue fusion strategy helps mitigate the effects of occlusion and deformation.
Figure 8.
Multi-object tracking comparison at frame t.
Figure 9.
Multi-object tracking comparison at frame t+1.
5. Discussion
This paper presents an open-vocabulary multi-object tracking framework that integrates multi-cue fusion with enhanced text–visual alignment. Using a caption parser to extract category-related textual cues and incorporating them into a vision–language model, the method improves the quality of open-vocabulary detection. During tracking, the algorithm jointly leverages appearance, motion, and spatial cues; adopts BYTE-style association; and introduces HWIoU to stabilize matching under scale variation. The high–low confidence separation and secondary matching further improve the utilization of detection results.
The effectiveness of our framework in open-vocabulary settings stems from the complementary strengths of its components. HWIoU provides more stable spatial affinity by mitigating scale drift and deformation, which are common in unseen categories where standard IoU becomes unreliable. Motion cues further compensate for appearance inconsistencies, especially when CLIP features are weak for sparsely supervised or novel classes. In addition, our tracking-oriented prompt templates enhance image–text alignment by enriching contextual descriptions beyond generic CLIP prompts, improving the recognition of rare or unseen categories. Together, these choices yield more robust associations and explain the observed gains on both base and novel categories.
Despite the consistent improvements achieved by our method across both base and novel categories, several limitations merit consideration. First, the qualitative examples included in this study are selective and do not fully illustrate failure cases or long-sequence ID-consistency, which are important for evaluating tracking robustness in challenging scenarios. Second, the method’s performance depends on the quality of CLIP-derived features; suboptimal or misaligned features may limit generalization to certain unseen categories or domains. Third, although the computational overhead of our framework is moderate on contemporary hardware, it remains higher than that of simpler baseline methods, which may constrain deployment in real-time or resource-limited settings. Addressing these limitations—through more comprehensive qualitative analyses, systematic failure case reporting, improved feature adaptation, and optimized computational efficiency—constitutes an important direction for future work.
6. Conclusions
This paper proposed an open-vocabulary multi-object tracking algorithm based on multi-cue fusion. The algorithm uses a caption parser to extract textual information, enhancing the text processing capability of the vision–language model. During the tracking phase, it combines location, motion, and appearance cues, adopts the BYTE data association method and the HWIoU module, categorizes detection boxes into high- and low-confidence levels, and performs secondary matching for unmatched tracks, thereby effectively utilizing detection boxes for target association. Compared with appearance-only baselines, the proposed framework offers more reliable localization and moderate association gains, leading to consistent TETA improvements on both base and novel categories. However, Classification Accuracy for many low-frequency novel classes remains limited, indicating that open-vocabulary recognition in long-tailed settings is still a challenging problem. Overall, the framework provides a stronger and more balanced solution for open-world MOT, while leaving room for future work on improving fine-grained novel-class classification.
In multi-object tracking based on multi-cue fusion and open-vocabulary learning, the model’s generalization ability largely depends on the recognition capability of the vision–language model. However, there is a contradiction between the high computational complexity of vision–language models and the real-time requirements of multi-object tracking. With advancements in vision–language model technology, video-scene-based vision–language models are expected to emerge. Future open-vocabulary multi-object tracking might involve joint end-to-end training with vision–language models. This would allow the algorithm to better leverage spatiotemporal information, demonstrating powerful adaptability and robustness in more complex and open scenarios, while achieving a better balance between computational efficiency and tracking accuracy.
Author Contributions
Conceptualization, L.X.; methodology, L.X. and C.L.; software, L.N. and J.B.; formal analysis, C.L.; resources, L.N. and J.B.; writing—original draft preparation, L.X.; writing—review and editing, C.L.; visualization, L.N. and J.B.; supervision, C.L.; project administration, C.L. All authors have read and agreed to the published version of the manuscript.
Funding
This study is supported by the QinXin Talents Cultivation Program: Beijing Information Science and Technology University, No. QXTCPB202105.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
- Fischer, T.; Huang, T.E.; Pang, J.; Qiu, L.; Chen, H.; Darrell, T.; Yu, F. Qdtrack: Quasi-dense similarity learning for appearance-only multiple object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15380–15393. [Google Scholar] [CrossRef] [PubMed]
- Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
- Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
- Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
- Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar] [CrossRef]
- Ošep, A.; Hermans, A.; Engelmann, F.; Klostermann, D.; Mathias, M.; Leibe, B. Multi-scale object candidates for generic object tracking in street scenes. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 3180–3187. [Google Scholar]
- Ošep, A.; Voigtlaender, P.; Weber, M.; Luiten, J.; Leibe, B. 4D generic video object proposals. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10031–10037. [Google Scholar]
- Liu, Y.; Zulfikar, I.E.; Luiten, J.; Dave, A.; Ramanan, D.; Leibe, B.; Ošep, A.; Leal-Taixé, L. Opening up open world tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19045–19055. [Google Scholar]
- Li, S.; Fischer, T.; Ke, L.; Ding, H.; Danelljan, M.; Yu, F. Ovtrack: Open-vocabulary multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5567–5577. [Google Scholar]
- Li, S.; Danelljan, M.; Ding, H.; Huang, T.E.; Yu, F. Tracking every thing in the wild. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 498–515. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Ma, W.; Wu, Y.; Cen, F.; Wang, G. Mdfn: Multi-scale deep feature learning network for object detection. Pattern Recognit. 2020, 100, 107149. [Google Scholar] [CrossRef]
- Yang, M.; Han, G.; Yan, B.; Zhang, W.; Qi, J.; Lu, H.; Wang, D. Hybrid-sort: Weak cues matter for online multi-object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6504–6512. [Google Scholar]
- Du, Y.; Wei, F.; Zhang, Z.; Shi, M.; Gao, Y.; Li, G. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14084–14093. [Google Scholar]
- Wu, H.; Mao, J.; Zhang, Y.; Jiang, Y.; Li, L.; Sun, W.; Ma, W.Y. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6609–6618. [Google Scholar]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
- Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv 2021, arXiv:2104.13921. [Google Scholar]
- Wu, J.; Cao, J.; Song, L.; Wang, Y.; Yang, M.; Yuan, J. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12352–12361. [Google Scholar]
- Dendorfer, P.; Ošep, A.; Milan, A.; Schindler, K.; Cremers, D.; Reid, I.; Roth, S.; Leal-Taixé, L. Motchallenge: A benchmark for single-camera multiple target tracking. Int. J. Comput. Vis. 2021, 129, 845–881. [Google Scholar] [CrossRef]
- Gupta, A.; Dollar, P.; Girshick, R. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5356–5364. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).