Article

Text-Guided Spatio-Temporal 2D and 3D Data Fusion for Multi-Object Tracking with RegionCLIP

by Youlin Liu, Zainal Rasyid Mahayuddin * and Mohammad Faidzul Nasrudin
Center for Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia, Bangi Selangor 43600, Malaysia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 10112; https://doi.org/10.3390/app151810112
Submission received: 11 August 2025 / Revised: 1 September 2025 / Accepted: 10 September 2025 / Published: 16 September 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

3D Multi-Object Tracking (3D MOT) is a critical task in autonomous systems, where accurate and robust tracking of multiple objects in dynamic environments is essential. Traditional approaches primarily rely on visual or geometric features, often neglecting the rich semantic information available in textual modalities. In this paper, we propose Text-Guided 3D Multi-Object Tracking (TG3MOT), a novel framework that incorporates Vision-Language Models (VLMs) into the YONTD architecture to improve 3D MOT performance. Our framework leverages RegionCLIP, a multimodal open-vocabulary detector, to achieve fine-grained alignment between image regions and textual concepts, enabling the incorporation of semantic information into the tracking process. To address challenges such as occlusion, blurring, and ambiguous object appearances, we introduce the Target Semantic Matching Module (TSM), which quantifies the uncertainty of semantic alignment and filters out unreliable regions. Additionally, we propose the 3D Feature Exponential Moving Average Module (3D F-EMA) to incorporate temporal information, improving robustness in noisy or occluded scenarios. Furthermore, the Gaussian Confidence Fusion Module (GCF) is introduced to weight historical trajectory confidences based on temporal proximity, enhancing the accuracy of trajectory management. We evaluate our framework on the KITTI dataset and compare it with the YONTD baseline. Extensive experiments demonstrate that although the overall HOTA gain of TG3MOT is modest (+0.64%), our method achieves substantial improvements in association accuracy (+0.83%) and significantly reduces ID switches (−16.7%). These improvements are particularly valuable in real-world autonomous driving scenarios, where maintaining consistent trajectories under occlusion and ambiguous appearances is crucial for downstream tasks such as trajectory prediction and motion planning. The code will be made publicly available.

1. Introduction

Multi-Object Tracking (MOT) is a core task in computer vision, widely used in autonomous driving, robotics, and surveillance. Its goal is to assign consistent identities to multiple objects across frames. Traditional MOT methods are commonly grouped into Tracking-by-Detection (TBD) [1,2,3,4,5,6,7], Joint Detection and Embedding (JDE) [8,9], and Joint Detection and Tracking (JDT) [10,11,12]. TBD separates detection and association, often suffering from latency and association complexity. JDE integrates detection and tracking for greater efficiency and joint optimization. JDT further improves by explicitly modeling temporal relationships, enhancing performance in challenging scenarios with occlusions or fast motion.
In recent years, deep learning-based trackers have significantly advanced MOT performance by introducing stronger detectors, re-identification modules, and temporal modeling techniques [6,13]. Despite this progress, MOT in unconstrained environments still faces difficulties such as long-term occlusions, crowded scenes, and drastic illumination changes [8]. These limitations are especially critical in safety-related applications like autonomous driving.
Most existing 3D MOT methods depend on single-modal data, such as RGB images or LiDAR, which may lack robustness. Multimodal fusion combines complementary cues from different sensors (e.g., LiDAR, camera, radar), showing superior results. However, current multimodal methods largely follow the TBD pipeline, limiting their potential due to complex association steps.
3D MOT introduces added challenges, such as sparse data, occlusions, and scale variations [14]. Recent methods [15,16] propose end-to-end frameworks to integrate detection and tracking, reducing reliance on hand-crafted associations and achieving state-of-the-art results without additional training. Still, many approaches over-rely on visual cues, making them fragile in ambiguous appearance conditions.
To address these limitations, several studies have explored multimodal and cross-modal tracking. For instance, Camera-LiDAR fusion has been used to improve spatial localization and robustness in autonomous driving [17,18]. Moreover, radar sensing has shown strong robustness under adverse environmental conditions such as rain, fog, and snow. The K-Radar dataset demonstrates that 4D radar (with elevation information) outperforms LiDAR-based methods for 3D object detection under challenging weather [19]. RadarNet further integrates radar and LiDAR using early and late fusion strategies to improve detection and velocity estimation, especially for moving objects in degraded visual conditions [20]. These advances highlight radar’s value as a complementary sensor in adverse weather environments. However, most existing fusion frameworks focus primarily on low-level geometric and appearance cues rather than high-level semantic understanding, which limits their ability to resolve ambiguous cases where appearance alone is insufficient.
Meanwhile, Vision-Language Models (VLMs) such as CLIP [21] and ALIGN [22] have shown strong cross-modal reasoning in tasks such as image-text retrieval [23], VQA [24], and zero-shot learning [25]. Recent works such as BLIP [26] and Flamingo [27] further demonstrate that large-scale vision-language pretraining enables richer semantic understanding and generalization across domains. These advances suggest that language supervision could serve as an additional modality to disambiguate visually similar objects and guide association in 3D MOT. However, their application in MOT remains underexplored, as illustrated in Figure 1.
Building on the success of [28], we propose TG3MOT, a novel framework that integrates Vision-Language Models (VLMs) into the 3D MOT pipeline. Specifically, we leverage CLIP to extract text-modal features, enabling the model to incorporate semantic cues from natural language descriptions. This enhances its ability to handle challenging scenarios where traditional visual or geometric cues fall short—such as severe occlusions, appearance variations, or identity ambiguity. To further improve robustness, we introduce three key modules: Target Semantic Matching, Gaussian Confidence Fusion, and 3D Feature EMA, which synergize multimodal fusion with text-guided understanding. Together, these innovations push the boundaries of multi-object tracking performance in complex real-world environments.
Our contributions include the following:
  • Extending the YONTD [29] framework with a semantic-aware RegionCLIP detector, achieving strong zero-shot detection without extra training;
  • Designing a Target Semantic Matching Module (TSM) to filter false positives based on semantic consistency;
  • Proposing a 3D Feature EMA module to improve regression accuracy and reduce false detections;
  • Introducing Gaussian Confidence Fusion (GCF) to stabilize trajectory confidence using historical weights;
  • Achieving superior results on KITTI [30], with our code publicly released to foster future research.

2. Related Work

2.1. Traditional MOT Methods

Multi-Object Tracking (MOT) methods mainly fall into Tracking-by-Detection (TBD) and Joint Detection and Embedding (JDE) paradigms. TBD separates detection and tracking: objects are detected per frame and then associated using algorithms such as the Hungarian method [31] or graph-based optimization [32]. Although effective, TBD depends heavily on detection quality and on a complex association step, both of which can be error-prone.
JDE integrates detection and feature embedding into a single model, streamlining the pipeline and improving speed. However, it often struggles with balancing detection accuracy and discriminative feature learning, especially under occlusion. Recent Joint Detection and Tracking (JDT) methods further unify both tasks in an end-to-end framework, eliminating explicit association steps. While promising, they usually require extensive labeled data and complex training.

2.2. 3D MOT Methods

3D MOT extends tracking into spatial depth, critical for autonomous driving and robotics [33]. It introduces challenges such as depth ambiguity and sparse data. LiDAR-based methods like AB3DMOT [15] and PointRCNN [16] exploit point clouds for detection and tracking, while multimodal fusion approaches like MMF [34] and CenterPoint [12] enhance robustness by integrating LiDAR and camera inputs. YONTD [29] offers an end-to-end solution, achieving state-of-the-art performance without extra training.
Camera-based 3D MOT methods, such as Mono3D [35] and Pseudo-LiDAR [36], reconstruct 3D cues from monocular images. Graph-based models like GNN3DMOT [37] capture spatial relationships via graph neural networks, improving performance in dense scenes. Overall, integrating sensor modalities and unifying detection and tracking tasks is key to advancing 3D MOT.

2.3. Vision-Language Models (VLMs) in MOT

Vision-Language Models (VLMs), such as CLIP [21], have shown promise in aligning visual and textual modalities. RegionCLIP [28] extends this to region-level tasks, highlighting potential for MOT applications like semantic association and occlusion handling.
Recent efforts like LaMOT [38], CiteTracker [39], and Adaptive Text Feature Updating [40] leverage textual cues for improved object association and tracking consistency. However, VLM integration in MOT remains limited, with most works relying on purely visual or geometric features.
Our method addresses this gap by incorporating RegionCLIP into the MOT framework, enabling zero-shot detection and semantic reasoning. Built upon YONTD, our end-to-end system introduces the Target Semantic Matching (TSM), 3D Feature-EMA (3D F-EMA), and Gaussian Confidence Fusion (GCF) modules to enhance tracking in complex scenes. Full details are presented in the following section.

3. Proposed Method

In this section, we present the proposed Text-Guided 3D Multi-Object Tracking framework, a novel approach that integrates Vision-Language Models (VLMs) into the YONTD architecture to address key challenges in 3D MOT. Our framework, illustrated in Figure 2, consists of several innovative components designed to enhance detection, tracking, and trajectory management. The overall architecture maintains an end-to-end structure, enabling seamless integration of semantic information without requiring additional training. The framework begins with LiDAR point cloud data [41] as input, which is processed by a 3D detector (with the 3D Feature EMA module embedded) to generate the current frame’s 3D trajectory states [42,43]. These 3D trajectories are then projected onto the 2D image plane, where the 2D trajectories and the current image frame are fed into a multimodal detector to obtain trajectory scores and semantic features [44]. The Target Semantic Matching Module is employed to filter out false positives by analyzing the semantic information of candidate trajectories. Subsequently, the Gaussian Confidence Fusion Module fuses the 2D and 3D historical trajectories, assigning weights based on a Gaussian distribution to compute the final trajectory score, which guides the Non-Maximum Suppression (NMS) process to determine the final trajectories and associations for the current frame. In the Trajectory Management Module, we adopt the strategy from DeepFusionMOT [45]. Each of these components is detailed in the following subsections.
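To make this data flow concrete, the structural sketch below outlines a single tracking step. It is not the authors' implementation: every component is passed in as a placeholder callable, and only the order of operations mirrors the description above.

```python
# Structural sketch of one TG3MOT frame update; every callable is a
# placeholder supplied by the caller, not an API from the released code.
from typing import Any, Callable, Sequence

def tg3mot_step(
    point_cloud: Any,
    image: Any,
    trajectories: Sequence[Any],
    text_features: Any,
    detector_3d: Callable,       # 3D detector with the 3D Feature EMA module embedded
    project_to_2d: Callable,     # projects 3D trajectories onto the image plane
    region_clip_score: Callable, # RegionCLIP scores region proposals against text
    tsm_filter: Callable,        # Target Semantic Matching filter
    gcf_fuse: Callable,          # Gaussian Confidence Fusion over trajectory history
    nms_associate: Callable,     # NMS and association for the current frame
    manage_tracks: Callable,     # DeepFusionMOT-style trajectory management
):
    detections_3d, scores_3d = detector_3d(point_cloud, trajectories)
    proposals_2d = project_to_2d(trajectories)
    region_feats, scores_2d = region_clip_score(image, proposals_2d, text_features)
    keep = tsm_filter(region_feats, text_features, scores_3d)
    fused_scores = gcf_fuse(trajectories, scores_2d, scores_3d, keep)
    current_tracks = nms_associate(detections_3d, proposals_2d, fused_scores, keep)
    return manage_tracks(trajectories, current_tracks)
```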

3.1. VLM and RegionCLIP

Most existing Multi-Object Tracking (MOT) methodologies predominantly rely on visual or geometric features, often overlooking the rich semantic information provided by textual modalities. A critical challenge addressed in our framework is the effective integration of semantic information into the classical 3D MOT framework. Our initial approach involved leveraging the CLIP [21] model on 2D images in conjunction with traditional detectors such as Faster R-CNN [46]. However, directly applying CLIP within the MOT framework resulted in suboptimal performance due to a significant domain shift: CLIP was trained to match an entire image to a textual description, lacking the capability to establish fine-grained alignment between specific image regions and corresponding text spans. Additionally, traditional detectors, typically trained on COCO datasets, exhibit a performance gap when applied to tracking scenarios, further exacerbating the issue. To address these limitations, we adopt RegionCLIP [28], a multimodal open-vocabulary detector that enables fine-grained alignment between image regions and textual concepts. Building upon the YONTD [29] framework, we integrate RegionCLIP as the 2D detector. For each object to be tracked, a set of textual descriptions is provided as $\mathrm{Text} = [\mathrm{text}_1, \mathrm{text}_2, \dots, \mathrm{text}_N]$. These descriptions are encoded using RegionCLIP’s text encoder to obtain corresponding text features $F_{\mathrm{text}} = [f_{t_1}, f_{t_2}, \dots, f_{t_N}]^{\top}$. For the current frame $t$, let $D_t^{3D}$ denote the 3D detection results, $T_t^{3D}$ represent the 3D trajectories, and $S_t^{3D}$ indicate the regression confidences of the 3D trajectories. $T_t^{3D}$ are projected onto the 2D image plane to generate region proposals, which are then fed into RegionCLIP to extract region image features $F_t^{\mathrm{region}} = [f_{r_1}, f_{r_2}, \dots, f_{r_M}]^{\top}$. The matching score is computed using the following formula:
$$S_{\mathrm{match}} \in \mathbb{R}^{M \times N} = \frac{F_t^{\mathrm{region}} \cdot F_{\mathrm{text}}^{\top}}{\left\lVert F_t^{\mathrm{region}} \right\rVert \cdot \left\lVert F_{\mathrm{text}}^{\top} \right\rVert}$$
This matching score is used to derive the classification output of the 2D detector. Furthermore, the maximum value of $S_{\mathrm{match}}$ corresponding to the predicted class is selected, and the 2D trajectory regression confidence $S_t^{2D}$ is computed as
$$S_t^{2D} = \max_{\dim = 1}\left(S_{\mathrm{match}}\right) \cdot S_t^{3D}$$
This approach not only bridges the domain gap between detection and tracking scenarios but also enhances the framework’s ability to leverage semantic information for more accurate and robust object tracking. By integrating RegionCLIP, our framework achieves fine-grained alignment between image regions and textual concepts, significantly improving performance in complex tracking environments.
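To make the two formulas above concrete, the following minimal NumPy sketch computes the cosine-similarity matching matrix between region and text features and derives the 2D regression confidence. The feature dimensions and random values are placeholders for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D = 4, 3, 512                          # regions, text prompts, feature dim (illustrative)
F_region = rng.normal(size=(M, D))           # stand-in for RegionCLIP region features
F_text = rng.normal(size=(N, D))             # stand-in for RegionCLIP text features
S_3d = rng.uniform(0.5, 1.0, size=M)         # stand-in 3D regression confidences

# Cosine similarity: L2-normalise the rows, then matrix-multiply -> (M, N) scores.
S_match = (F_region / np.linalg.norm(F_region, axis=1, keepdims=True)) @ \
          (F_text / np.linalg.norm(F_text, axis=1, keepdims=True)).T

# Per-region maximum over the text dimension, scaled by the 3D confidence.
S_2d = S_match.max(axis=1) * S_3d
print(S_match.shape, S_2d.shape)             # (4, 3) (4,)
```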

3.2. Target Semantic Matching Module

Although RegionCLIP offers robust detection capabilities in open-vocabulary scenarios, it faces significant challenges in complex tracking environments, particularly due to issues such as blurring, occlusion, and ambiguous object appearances. Even with the integration of semantic information, these challenges remain substantial.
To address these limitations, we focus on enhancing the alignment between image region features and textual features. Specifically, the matching score matrix $S_{\mathrm{match}}$ is computed to identify the most semantically relevant textual description for each image region. Through extensive experimentation, we observe that when the objects within a region are clear and easily identifiable, the matching score distribution across textual descriptions exhibits significant discriminability. Conversely, in cases of ambiguity or occlusion, the distribution becomes notably smoother, as demonstrated in Figure 3. Building on these insights, we propose a Target Semantic Matching Module (TSM), which quantifies the uncertainty of each region’s semantic alignment by calculating the variance of its matching score distribution, denoted as $V = \{v_1, v_2, \dots, v_M\}$. To further incorporate the influence of 3D detection confidence, we define the TSM score for each region as follows:
$$S_{\mathrm{TSM}} = \left[ \frac{v_i}{1 - S_i^{3D}} \right]_{i=1}^{M}$$
where $S_i^{3D}$ represents the 3D detection confidence for the $i$-th region. Regions with $S_{\mathrm{TSM}}$ below a predefined threshold are considered noisy and are subsequently discarded. This approach effectively filters out unreliable regions, ensuring that only semantically consistent and high-confidence regions contribute to the tracking process.
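A minimal sketch of the TSM filter is given below. The exact combination of the score variance and the 3D confidence follows the formula reconstructed above and should be read as an assumption; the threshold value is the one reported in Section 4.1.

```python
import numpy as np

def tsm_keep_mask(S_match, S_3d, threshold=0.0085):
    """Boolean mask of regions kept by Target Semantic Matching.

    S_match: (M, N) region-to-text matching scores.
    S_3d:    (M,)   3D detection confidences.
    """
    v = S_match.var(axis=1)            # flat (ambiguous) distributions -> small variance
    s_tsm = v / (1.0 - S_3d + 1e-6)    # assumed form of the TSM score (see the formula above)
    return s_tsm >= threshold

# A sharp score distribution survives; a flat one is discarded.
S_match = np.array([[0.90, 0.10, 0.10],      # clear, easily identifiable object
                    [0.34, 0.33, 0.33]])     # blurred / occluded region
print(tsm_keep_mask(S_match, S_3d=np.array([0.8, 0.6])))   # [ True False]
```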

3.3. 3D Feature Exponential Moving Average

Multi-Object Tracking (MOT) inherently operates on temporally sequential data, where frames are intrinsically interconnected over time. However, conventional 3D object detectors often process each frame independently, neglecting the valuable temporal correlations that exist across consecutive frames. This limitation becomes particularly pronounced in scenarios affected by sensor noise, occlusions, or dynamic environmental changes, where trajectories detected in frame $t-1$ may be lost in frame $t$, leading to false negatives, as illustrated in Figure 4. To tackle this challenge, we propose a novel 3D Feature Exponential Moving Average (EMA) Module, which incorporates temporal information into the 3D detection framework through historical feature fusion, thereby enhancing the robustness and continuity of object tracking.
The core idea of the 3D Feature EMA Module is to leverage the temporal consistency of object features across frames. Specifically, for each detected object, we maintain a dynamically updated feature representation that aggregates historical information from previous frames. Let $f_t^{\,i}$ denote the feature vector of the $i$-th object detected in frame $t$. The EMA-updated feature $\tilde{f}_t^{\,i}$ is computed as follows:
$$\tilde{f}_t^{\,i} = \alpha \cdot f_t^{\,i} + (1 - \alpha) \cdot \tilde{f}_{t-1}^{\,i}$$
where $\alpha \in (0, 1)$ is a smoothing factor that controls the contribution of the current frame’s feature relative to the historical features. This recursive formulation ensures that the feature representation of each object evolves smoothly over time, effectively mitigating the impact of transient detection failures caused by noise or occlusions.
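A minimal sketch of this recursive update is shown below. The per-identity dictionary store is an illustrative choice, and whether the "decay factor 0.9" reported in Section 4.1 corresponds to $\alpha$ or to $1-\alpha$ is not specified, so the value is used here only as an example.

```python
import numpy as np

class FeatureEMA:
    """Keeps one exponentially smoothed feature vector per tracked object."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha          # weight of the current frame in the equation above
        self.state = {}             # object id -> smoothed feature

    def update(self, obj_id, feature):
        prev = self.state.get(obj_id)
        if prev is None:
            self.state[obj_id] = feature                                  # no history yet
        else:
            self.state[obj_id] = self.alpha * feature + (1.0 - self.alpha) * prev
        return self.state[obj_id]

ema = FeatureEMA(alpha=0.9)
ema.update("car_7", np.ones(4))
print(ema.update("car_7", np.zeros(4)))    # smoothed toward the new (empty) observation
```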

3.4. Gaussian Confidence Fusion Module

In the original YONTD [29] framework, a trajectory confidence fusion mechanism is proposed to address the challenges posed by “strong objects” and “weak objects” in the tracking process. This module traditionally computes the regression confidence for the current frame t by averaging the trajectory scores from the previous k frames. However, this approach fails to account for the temporal weighting of different frames, which is crucial for accurate tracking.
We hypothesize that the state of a tracked object at frame $t$ is more similar to its state at frame $t-1$ than at frame $t-2$, implying that temporal proximity correlates with state similarity. To incorporate this temporal dependency, a weighted regression confidence fusion method is proposed, leveraging a Gaussian distribution to assign weights to the historical trajectory confidences. Specifically, the confidence for the current frame $t$ is computed as a weighted sum of the confidences from the previous $k$ frames, where the weights are derived from a Gaussian function centered at the current frame. This approach ensures that frames closer in time to the current frame have a greater influence on the computed confidence. The mathematical formulation of this weighted fusion is as follows:
$$\hat{S}_t^{\,n} = \frac{\sum_{i=1}^{k} \omega_i \cdot S_{t-i+1}^{\,n}}{\sum_{i=1}^{k} \omega_i}$$
where $\hat{S}_t^{\,n}$ is the fused regression confidence of the $n$-th trajectory at frame $t$, $S_{t-i+1}^{\,n}$ represents the confidence of the $n$-th trajectory at frame $t-i+1$, and $\omega_i$ is the weight assigned to the $i$-th historical frame, calculated using a Gaussian distribution:
$$\omega_i = \exp\left(-\frac{i^2}{2\sigma^2}\right)$$
Here, $\sigma$ is the standard deviation of the Gaussian distribution, controlling the rate at which the weights decay with increasing temporal distance from the current frame. This method effectively captures the temporal dynamics of object states, leading to more accurate and robust tracking performance. The implementation of Gaussian Confidence Fusion ($k = 9$) is illustrated in Figure 5.
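The weighting scheme can be sketched as follows; since $\sigma$ is not reported in the paper, the value used below is purely illustrative.

```python
import numpy as np

def gaussian_confidence_fusion(history, k=9, sigma=3.0):
    """Fuse the last k trajectory confidences with Gaussian temporal weights.

    history: confidences ordered oldest -> newest; i = 1 corresponds to the current frame.
    """
    recent = np.asarray(history[-k:][::-1], dtype=float)    # index 0 <-> i = 1
    i = np.arange(1, len(recent) + 1)
    w = np.exp(-(i ** 2) / (2.0 * sigma ** 2))               # weights decay with temporal distance
    return float((w * recent).sum() / w.sum())

# Recent, high confidences dominate older, low ones.
print(gaussian_confidence_fusion([0.20, 0.30, 0.90, 0.95, 0.97]))
```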

4. Experiments

In this section, we present the experimental setup, results, ablation studies, and discussions to evaluate the effectiveness of the proposed framework. All experiments are conducted on the KITTI dataset [30], and the performance is compared against the baseline YONTD framework, as well as other state-of-the-art methods.

4.1. Experimental Setup

Hardware: All experiments were conducted on a workstation with an NVIDIA RTX 4090 GPU (24GB VRAM) (NVIDIA, Santa Clara, CA, USA), Intel Xeon Platinum 8480+ CPU, and 2TB RAM (Intel, Santa Clara, CA, USA). Although not optimized for real-time performance, this setup enables efficient batch processing of large-scale LiDAR and image data.
Dataset: The proposed framework was evaluated on the KITTI tracking benchmark, which includes synchronized 3D LiDAR scans, RGB images, and annotations for multiple object categories (e.g., cars, pedestrians, cyclists). It consists of 21 training and 29 testing sequences.
Implementation: For 3D detection, we employed PVRCNN [47] and VoxelRCNN [48]. RegionCLIP [28] was adopted to extract region-level image features and align them with textual prompts. The textual prompts were manually designed following the template style used in RegionCLIP, such as “a photo of a car”, “a photo of a pedestrian”, and “a photo of a truck”. These standardized prompts ensure consistency across different categories and were encoded using RegionCLIP’s text encoder to guide semantic alignment. The TSM threshold was set to 0.0085 and the 3D F-EMA decay factor to 0.9.
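For reference, category prompts of this form can be tokenized and encoded as in the snippet below. Because RegionCLIP adopts CLIP's text tower, the openai clip package is used here only as a stand-in encoder; this is an assumption for illustration, not the authors' actual pipeline.

```python
import torch
import clip   # https://github.com/openai/CLIP, used here as a stand-in for RegionCLIP's text encoder

prompts = ["a photo of a car", "a photo of a pedestrian", "a photo of a truck"]
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(tokens)                               # (3, 512)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)  # unit-normalised
```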
Tracking performance was primarily assessed using Higher-Order Tracking Accuracy (HOTA) [49], which integrates detection, association, and localization quality. Additional metrics were also reported to provide a comprehensive evaluation (a toy MOTA computation is sketched after the definitions below):
MOTA (Multiple-Object Tracking Accuracy): Measures overall tracking accuracy by combining false positives, false negatives, and identity switches.
MOTP (Multiple-Object Tracking Precision): Measures localization precision of correctly matched targets.
AssA (Association Accuracy): Evaluates the correctness of identity association between frames.
LocA (Localization Accuracy): Measures the accuracy of object localization, independent of identity assignment.
DetA (Detection Accuracy): Reflects the quality of object detection, ignoring association errors.
IDSW (ID Switches): Counts the number of times a tracked identity changes, indicating identity fragmentation.
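As a toy illustration of how MOTA aggregates raw error counts (the standard CLEAR-MOT definition; the numbers below are invented, not results from this paper):

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, following the CLEAR-MOT definition."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

print(f"{mota(false_negatives=120, false_positives=80, id_switches=5, num_gt=2000):.3%}")
```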

4.2. Experimental Results

Table 1 reports our results on the KITTI test set, compared with several state-of-the-art (SOTA) methods, including mmMOT, DeepFusion-MOT, PolarMOT, C-TWIX, YONTD-MOT, and others. Our method, TG3MOT, achieved the highest HOTA (78.72%), outperforming MSA-MOT (78.52%) and C-TWIX (77.58%). TG3MOT also achieved superior association accuracy (AssA: 83.69%), exceeding PC3T (81.59%) and MSA-MOT (82.56%). Although PNAS-MOT reported higher DetA (77.69%), it suffered from 751 ID switches, in contrast to only 35 for TG3MOT, demonstrating our framework’s consistency. While C-TWIX and PNAS-MOT achieved slightly higher MOTA, their elevated IDSW indicates inferior identity preservation.
A head-to-head comparison with YONTD-MOT (Table 2), using the same 3D detector, shows TG3MOT consistently outperforms across all key metrics. Figure 6 visualizes bird’s-eye-view trajectories, highlighting TG3MOT’s superior temporal consistency and reduced tracking errors. These improvements are attributed to the effective fusion of 2D–3D features and the incorporation of semantic information.

4.3. Ablation Study

We performed ablation studies to evaluate each component’s contribution (Table 3):
TSM: Removing TSM reduced HOTA and AssA and increased IDSW, indicating its role in filtering noisy regions and improving association. Specifically, the semantic filtering of TSM suppresses visually similar but semantically irrelevant proposals, which reduces mismatches across frames. This directly improves association accuracy (AssA) and identity consistency (lower IDSW), while also leading to higher HOTA.
3D F-EMA: Excluding this module decreased DetA and MOTA, suggesting temporal feature fusion improves stability and continuity. Without 3D F-EMA, the detector relies only on per-frame features, making it more vulnerable to occlusions and motion blur. By integrating historical embeddings through an exponential moving average, temporal consistency is reinforced, thus enhancing detection accuracy (DetA) and overall multi-object tracking accuracy (MOTA). To further clarify its contribution, we report results with three-decimal precision. Compared to the configuration with only TSM+GCF, adding 3D F-EMA yields consistent improvements across all metrics: HOTA (+0.003), DetA (+0.004), MOTA (+0.003), and IDF1 (+0.012). Although these gains appear small, they highlight the stabilizing role of temporal information and contribute to more reliable identity preservation in long-term tracking.
GCF: Disabling GCF led to lower HOTA and DetA, confirming the benefit of adaptive confidence weighting. Gaussian Confidence Fusion models the distribution of detection confidence with a Gaussian kernel and incorporates the temporal weighting of different frames. By assigning larger weights to frames closer to the current time step, GCF adaptively fuses historical trajectory confidences, which enhances robustness under ambiguous detections and reduces the impact of noisy predictions. This leads to improved detection accuracy (DetA) and overall tracking performance (HOTA).
All Modules Combined: Full integration yielded the best results across all metrics, demonstrating the complementary effects of semantic filtering, temporal fusion, and confidence refinement. In particular, TSM ensures the semantic correctness of proposals, 3D F-EMA stabilizes temporal information, and GCF adaptively calibrates confidence. Their combination provides balanced improvements across detection (DetA), association (AssA), and identity-related metrics (IDF1, IDSW).

4.4. Discussion

Text-Guided Features: The integration of semantic cues enhances tracking robustness, especially in dense or occluded scenarios, by improving inter-object discrimination and reducing ID switches.
Practical Implications: While the overall HOTA gain of TG3MOT compared to the YONTD baseline is modest (+0.64%), the framework delivers clear advantages in terms of association accuracy (+0.83%) and ID switch reduction (-16.7%). These improvements translate into more consistent and reliable object trajectories, which are critical for downstream tasks such as trajectory prediction, motion planning, and behavior analysis in autonomous driving. In real-world deployments, reducing ID switches mitigates trajectory fragmentation, thereby providing more stable inputs to decision-making modules. Moreover, the integration of multimodal semantics introduces a new paradigm for 3D MOT, demonstrating that even incremental performance gains in benchmarks can yield disproportionately high benefits in safety-critical applications.
Module-Level Insights: Although the numerical gains from the 3D F-EMA module appear modest when combined with TSM and GCF, results with higher-precision reporting confirm consistent improvements (HOTA +0.003, DetA +0.004, MOTA +0.003, IDF1 +0.012). These seemingly small increments reflect enhanced temporal stability and reduced identity fragmentation, which are crucial in long-term tracking scenarios. Importantly, stable identity preservation ensures smoother trajectories that directly benefit downstream tasks such as prediction and planning, underscoring the practical value of even marginal module-level contributions.
Limitations and Future Work: The framework’s computational demands limit real-time deployment. Future research may focus on efficiency optimization via model compression, lightweight language models, or hardware acceleration. Additionally, robustness in dynamic environments could be improved by incorporating temporal priors or multimodal inputs (e.g., radar, thermal).
Dataset Scope: Our evaluation is currently restricted to the KITTI dataset, which, while widely adopted, does not cover the full diversity of real-world driving scenarios. As a result, the generalizability of TG3MOT to other domains remains to be fully validated. Future work will extend experiments to broader benchmarks such as nuScenes, Waymo, and BDD100K to strengthen the robustness and applicability of our findings.
Despite the benefits of integrating Vision-Language Models (VLMs), several limitations remain regarding their use for guiding visual perception. A key concern is the robustness to unseen or noisy textual prompts, as models trained on specific datasets may not generalize well to novel or corrupted inputs, leading to performance degradation. This challenge is further intensified by exposure bias, where discrepancies between the training distribution and the inference scenario cause errors to compound during deployment. Recent work by Pozzi et al. [62] addresses this problem in the context of large language model distillation using an imitation learning approach, showing improved generalization under distributional shifts. Exploring similar strategies in multimodal tracking could enhance the reliability of prompt-based guidance, especially in dynamic or ambiguous scenarios.

5. Conclusions

In this paper, we propose TG3MOT, a novel Text-Guided 3D Multi-Object Tracking framework that integrates Vision-Language Models to enhance tracking performance. Our key innovation lies in the seamless fusion of semantic information with traditional 3D MOT pipelines, enabling more robust and accurate object tracking in complex environments. By leveraging RegionCLIP for fine-grained region-text alignment and introducing target semantic matching, temporal feature fusion, and confidence-based trajectory refinement, our method effectively mitigates challenges such as occlusions, appearance ambiguity, and false positives. Extensive experiments on the KITTI benchmark demonstrate that TG3MOT outperforms existing approaches, establishing a new state-of-the-art in Text-Guided 3D MOT.
While our current evaluation primarily compares TG3MOT with representative baselines such as YONTD and recent works, we acknowledge that a broader range of multimodal approaches could provide additional context. Future work will aim to extend comparisons to other multimodal methods, including sensor fusion and text-guided tracking frameworks, as well as to evaluate performance on diverse benchmarks beyond KITTI, in order to further validate the generalization and robustness of TG3MOT.
Although the overall HOTA improvement appears modest, TG3MOT achieves clear practical benefits by improving association accuracy and substantially reducing ID switches. These gains lead to more consistent and reliable trajectories, which are critical for downstream tasks such as trajectory prediction and motion planning in autonomous driving. Thus, TG3MOT not only advances performance metrics but also strengthens the real-world applicability of 3D MOT through multimodal semantic integration.
We hope this work inspires further research into multimodal learning for 3D object tracking, bridging the gap between vision and language in autonomous perception systems.

Author Contributions

Conceptualization, Z.R.M. and Y.L.; methodology, Z.R.M., M.F.N. and Y.L.; software, Y.L.; validation, Y.L.; formal analysis, Z.R.M. and Y.L.; investigation, Z.R.M., M.F.N. and Y.L.; resources, Z.R.M. and M.F.N.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Z.R.M., M.F.N. and Y.L.; visualization, Y.L.; supervision, Z.R.M. and M.F.N.; project administration, Z.R.M.; funding acquisition, Z.R.M. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank the Universiti Kebangsaan Malaysia for providing financial support under the “FTM1-Peruntukan Dana Fakulti Teknologi dan Sains Maklumat, UKM”.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: http://www.cvlibs.net/datasets/kitti/, accessed on 1 August 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  2. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 9686–9696. [Google Scholar]
  3. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  4. He, J.; Fu, C.; Wang, X. 3d multi-object tracking based on uncertainty-guided data association. arXiv 2023, arXiv:2303.01786. [Google Scholar]
  5. Wang, L.; Zhang, X.; Qin, W.; Li, X.; Gao, J.; Yang, L.; Li, Z.; Li, J.; Zhu, L.; Wang, H. Camo-mot: Combined appearance-motion optimization for 3d multiobject tracking with camera-lidar fusion. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11981–11996. [Google Scholar] [CrossRef]
  6. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  7. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  8. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  9. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 107–122. [Google Scholar]
  10. Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
  11. Zhang, J.; Zhou, S.; Chang, X.; Wan, F.; Wang, J.; Wu, Y.; Huang, D. Multiple object tracking by flowing and fusing. arXiv 2020, arXiv:2001.11180. [Google Scholar] [CrossRef]
  12. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 474–490. [Google Scholar]
  13. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
  14. Zulkifley, M.A.; Rawlinson, D.; Moran, B. Robust Observation Detection for Single Object Tracking: Deterministic and Probabilistic Patch-Based Approaches. Sensors 2012, 12, 15638–15670. [Google Scholar] [CrossRef]
  15. Weng, X.; Wang, J.; Held, D.; Kitani, K. 3d multi-object tracking: A baseline and new evaluation metrics. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24–30 October 2020; pp. 10359–10366. [Google Scholar]
  16. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 770–779. [Google Scholar]
  17. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  18. Pang, Z.; Li, Z.; Wang, N. Simpletrack: Understanding and rethinking 3d multi-object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 680–696. [Google Scholar]
  19. Paek, D.H.; Kong, S.H.; Wijaya, K.T. K-radar: 4d radar object detection for autonomous driving in various weather conditions. Adv. Neural Inf. Process. Syst. 2022, 35, 3819–3829. [Google Scholar]
  20. Yang, B.; Guo, R.; Liang, M.; Casas, S.; Urtasun, R. Radarnet: Exploiting radar for robust perception of dynamic objects. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 496–512. [Google Scholar]
  21. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; PMLR: 2021. Volume 139, pp. 8748–8763. [Google Scholar]
  22. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; PMLR: 2021. Volume 139, pp. 4904–4916. [Google Scholar]
  23. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 104–120. [Google Scholar]
  24. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  25. Xian, Y.; Lampert, C.H.; Schiele, B.; Akata, Z. Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2251–2265. [Google Scholar] [CrossRef]
  26. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International conference on machine learning (ICML), Virtual Event, 17–23 July 2022; PMLR: 2022. Volume 162, pp. 12888–12900. [Google Scholar]
  27. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  28. Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Gao, J. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803. [Google Scholar]
  29. Wang, X.; Fu, C.; He, J.; Huang, M.; Meng, T.; Zhang, S.; Zhang, C. You only need two detectors to achieve multi-modal 3d multi-object tracking. arXiv 2023, arXiv:2304.08709. [Google Scholar]
  30. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  31. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  32. Berclaz, J.; Fleuret, F.; Turetken, E.; Fua, P. Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1806–1819. [Google Scholar] [CrossRef]
  33. Karim, T.; Mahayuddin, Z.R.; Hasan, M.K. Singular and Multimodal Techniques of 3D Object Detection: Constraints, Advancements and Research Direction. Appl. Sci. 2023, 13, 13267. [Google Scholar] [CrossRef]
  34. Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7345–7353. [Google Scholar]
  35. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27 June–2 July 2016; pp. 2147–2156. [Google Scholar]
  36. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
  37. Weng, X.; Wang, Y.; Man, Y.; Kitani, K.M. Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6499–6508. [Google Scholar]
  38. Li, Y.; Liu, X.; Liu, L.; Fan, H.; Zhang, L. LaMOT: Language-Guided Multi-Object Tracking. arXiv 2024, arXiv:2406.08324. [Google Scholar]
  39. Li, X.; Huang, Y.; He, Z.; Wang, Y.; Lu, H.; Yang, M.H. Citetracker: Correlating image and text for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–11 October 2023; pp. 9974–9983. [Google Scholar]
  40. Liu, X.; Zou, Z.; Hao, J. Adaptive Text Feature Updating for Visual-Language Tracking. In Proceedings of the International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 8–12 January 2024; Springer Nature: Cham, Switzerland, 2024; pp. 366–381. [Google Scholar]
  41. Wang, Y.; Abd Rahman, A.H.; Nor Rashid, F.A.; Razali, M.K.M. Tackling Heterogeneous Light Detection and Ranging-Camera Alignment Challenges in Dynamic Environments: A Review for Object Detection. Sensors 2024, 24, 7855. [Google Scholar] [CrossRef] [PubMed]
  42. Saif, F.M.S.; Mahayuddin, Z.R. Vision based 3D object detection using deep learning: Methods with challenges and applications towards future directions. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 203–214. [Google Scholar] [CrossRef]
  43. Mohammed, S.A.K.; Razak, M.Z.A.; Rahman, A.H.A. 3D-DIoU: 3D Distance Intersection over Union for Multi-Object Tracking in Point Cloud. Sensors 2023, 23, 3390. [Google Scholar] [CrossRef]
  44. Su, Z.; Adam, A.; Nasrudin, M.F.; Prabuwono, A.S. Proposal-Free Fully Convolutional Network: Object Detection Based on a Box Map. Sensors 2024, 24, 3529. [Google Scholar] [CrossRef] [PubMed]
  45. Wang, X.; Fu, C.; Li, Z.; Lai, Y.; He, J. Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association. IEEE Robot. Automat. Lett. 2022, 7, 8260–8267. [Google Scholar] [CrossRef]
  46. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  47. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X. PV-RCNN: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10526–10535. [Google Scholar]
  48. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual Event, 2–9 February 2021; Volume 35, pp. 1201–1209. [Google Scholar]
  49. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef]
  50. Zhang, W.; Zhou, H.; Sun, S.; Wang, Z.; Shi, J.; Loy, C.C. Robust multi-modality multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2365–2374. [Google Scholar]
  51. Kim, A.; Ošep, A.; Leal-Taixé, L. Eagermot: 3d multi-object tracking via sensor fusion. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11315–11321. [Google Scholar]
  52. Wu, H.; Han, W.; Wen, C.; Li, X.; Wang, C. 3D multi-object tracking in point clouds based on prediction confidence-guided data association. IEEE Transact. Intell. Transport. Syst. 2022, 23, 5668–5677. [Google Scholar] [CrossRef]
  53. Reich, A.; Wuensche, H.J. Monocular 3d multi-object tracking with an ekf approach for long-term stable tracks. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Annapolis, MD, USA, 5–8 July 2021; pp. 1–7. [Google Scholar]
  54. Zhu, Z.; Nie, J.; Wu, H.; He, Z.; Gao, M. MSA-MOT: Multi-stage association for 3D multimodality multi-object tracking. Sensors 2022, 22, 8650. [Google Scholar] [CrossRef]
  55. Kim, A.; Brasó, G.; Ošep, A.; Leal-Taixé, L. Polarmot: How far can geometric relations take us in 3d multi-object tracking? In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 41–58. [Google Scholar]
  56. Cho, M.; Kim, E. 3D LiDAR multi-object tracking with short-term and long-term multi-level associations. Remote Sens. 2023, 15, 5486. [Google Scholar] [CrossRef]
  57. Ninh, P.P.; Kim, H. CollabMOT Stereo Camera Collaborative Multi Object Tracking. IEEE Access 2024, 12, 21304–21319. [Google Scholar] [CrossRef]
  58. Peng, C.; Zeng, Z.; Gao, J.; Zhou, J.; Tomizuka, M.; Wang, X.; Ye, N. PNAS-mot: Multi-modal object tracking with pareto neural architecture search. IEEE Robot. Automat. Lett. 2024, 9, 4601–4608. [Google Scholar] [CrossRef]
  59. Zhou, T.; Ye, Q.; Luo, W.; Ran, H.; Shi, Z.; Chen, J. APPTracker+: Displacement Uncertainty for Occlusion Handling in Low-Frame-Rate Multiple Object Tracking. Int. J. Comput. Vis. 2024, 133, 2044–2069. [Google Scholar] [CrossRef]
  60. Miah, M.; Bilodeau, G.A.; Saunier, N. Learning data association for multi-object tracking using only coordinates. Pattern Recognit. 2025, 160, 111169. [Google Scholar] [CrossRef]
  61. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  62. Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating exposure bias in large language model distillation: An imitation learning approach. Neural Comput. Appl. 2025, 37, 12013–12029. [Google Scholar] [CrossRef]
Figure 1. The success of Vision-Language Models (VLMs) in various vision tasks, such as detection, segmentation, and generation, compared to their limited exploration in 3D Multi-Object Tracking (MOT). While VLMs have achieved remarkable progress in a wide range of applications, their integration into 3D MOT remains largely unexplored, highlighting a significant research gap.
Figure 2. Overall architecture of the proposed TG3MOT framework.
Figure 3. Objects that are clear and easily recognizable exhibit a steeper distribution of matching scores (red). In contrast, blurry or ambiguous objects (false detections) tend to have a smoother matching score distribution (blue).
Figure 4. Workflow of 3D Feature EMA. The object is consistently detected from frames $t-1$ to $t-n$, but missed at frame $t$. By leveraging 3D F-EMA, historical feature information is fused into the current frame $t$, enhancing the performance of the 3D detector.
Figure 5. Implementation of Gaussian Confidence Fusion with k = 9 , where the standard normal distribution is used as weights, inversely proportional to temporal distance.
Figure 6. Comparison of bird’s-eye-view trajectories between YONTD and our TG3MOT method. (a,b) show the input frames processed by YONTD and TG3MOT, respectively. (c,d) show the corresponding 2D tracking results, while (e,f) show the bird’s-eye-view trajectories. Annotations highlight the key differences: Red circles in (e) indicate false-positive trajectories (ghost tracks) generated by YONTD, which are successfully eliminated by our method, as shown in the corresponding areas of (f). Navy blue boxes mark instances of missed detections and track fragmentation in (e), while our approach (f) maintains full, continuous trajectories (green checkmarks). Overall, our method demonstrates superior performance with significantly reduced false alarms and more robust, consistent tracking.
Table 1. Performance comparison with SOTA 3D MOT methods on the KITTI test set. The optimal results are highlighted in bold.
Method | Published | HOTA (%) ↑ | DetA (%) ↑ | AssA (%) ↑ | LocA (%) ↑ | MOTA (%) ↑ | MOTP (%) ↑ | IDSW ↓
mmMOT [50] | ICCV 2019 | 62.05 | 72.29 | 54.02 | 86.58 | 83.23 | 85.03 | 733
EagerMOT [51] | ICRA 2021 | 74.39 | 75.27 | 74.16 | 87.17 | 87.82 | 85.69 | 239
PC3T [52] | TITS 2021 | 77.80 | 74.57 | 81.59 | 86.07 | 88.81 | 84.26 | 225
Mono-3D-KF [53] | FUSION 2021 | 75.47 | 74.10 | 77.63 | 85.48 | 88.48 | 83.70 | 162
MSA-MOT [54] | Sensors 2022 | 78.52 | 75.19 | 82.56 | 87.00 | 88.01 | 85.45 | 91
DeepFusion-MOT [45] | RA-L 2022 | 75.46 | 71.54 | 80.05 | 86.70 | 84.63 | 85.02 | 84
PolarMOT [55] | ECCV 2022 | 75.16 | 73.94 | 76.95 | 87.12 | 85.08 | 85.63 | 462
3DMLA [56] | Remote Sensing 2023 | 75.65 | 71.92 | 80.02 | 86.62 | 85.03 | 84.93 | 39
CollabMOT [57] | Access 2024 | 75.26 | 75.46 | 75.74 | 86.44 | 89.08 | 84.97 | 227
PNAS-MOT [58] | RA-L 2024 | 67.32 | 77.69 | 58.99 | 86.94 | 89.59 | 85.44 | 751
APPTracker+ [59] | IJCV 2024 | 75.19 | 75.55 | 75.36 | 86.59 | 89.09 | 85.03 | 176
C-TWIX [60] | Pattern Recognition 2025 | 77.58 | 76.97 | 78.84 | 86.95 | 89.68 | 85.50 | 381
YONTD-MOT [29] | | 78.08 | 74.16 | 82.86 | 88.23 | 85.09 | 86.98 | 42
TG3MOT (Ours) | | 78.72 | 74.59 | 83.69 | 87.64 | 86.15 | 86.26 | 35
Table 2. Tracking performance on KITTI training sequence using the same 3D detector for TG3MOT and YONTD.
Method | 3D Detector | 2D Detector | HOTA (%) ↑ | AssA (%) ↑ | MOTA (%) ↑ | IDSW ↓ | IDF1 ↑
YONTD-MOT | VoxelRCNN [48] | FasterRCNN [46] | 77.52 | 82.14 | 82.11 | 10 | 90.75
YONTD-MOT | VoxelRCNN [48] | MaskRCNN [61] | 76.49 | 81.12 | 80.35 | 8 | 88.85
YONTD-MOT | PVRCNN [47] | FasterRCNN [46] | 75.38 | 82.33 | 75.47 | 14 | 87.91
YONTD-MOT | PVRCNN [47] | MaskRCNN [61] | 75.05 | 81.78 | 75.14 | 13 | 87.68
TG3MOT (Ours) | VoxelRCNN [48] | RegionCLIP | 77.78 | 82.08 | 82.35 | 8 | 90.91
TG3MOT (Ours) | PVRCNN [47] | RegionCLIP | 75.93 | 81.27 | 78.61 | 4 | 89.01
Table 3. Impact of different modules on tracking performance metrics. (Results are reported with three-decimal precision to highlight marginal contributions.) The symbols ↑/↓ indicate that higher/lower values are better.
Module | TSM | 3D F-EMA | GCF | HOTA (%) ↑ | DetA (%) ↑ | AssA (%) ↑ | MOTA (%) ↑ | MOTP (%) ↑ | IDSW ↓ | IDF1 (%) ↑
Base | × | × | × | 76.432 | 71.852 | 81.453 | 79.382 | 88.402 | 5 | 88.423
TSM | ✓ | × | × | 77.741 | 73.412 | 82.481 | 81.732 | 88.521 | 2 | 90.821
3D F-EMA | × | ✓ | × | 76.441 | 71.863 | 81.462 | 79.403 | 88.413 | 5 | 88.434
GCF | × | × | ✓ | 76.592 | 72.271 | 80.041 | 80.842 | 88.423 | 5 | 88.743
TSM+GCF | ✓ | × | ✓ | 77.782 | 73.842 | 82.351 | 82.351 | 88.512 | 8 | 90.902
Full | ✓ | ✓ | ✓ | 77.785 | 73.846 | 82.083 | 82.354 | 88.514 | 8 | 90.914