Text-Guided Spatio-Temporal 2D and 3D Data Fusion for Multi-Object Tracking with RegionCLIP
Abstract
1. Introduction
- Extending the YONTD [29] framework with a semantic-aware RegionCLIP detector, achieving strong zero-shot detection without extra training;
- Designing a Target Semantic Matching Module (TSM) that filters false positives by checking semantic consistency between region and text embeddings (a minimal sketch follows this list);
- Proposing a 3D Feature Exponential Moving Average (3D F-EMA) module that smooths per-track 3D features over time to improve regression accuracy and reduce false detections (sketched below);
- Introducing Gaussian Confidence Fusion (GCF) to stabilize trajectory confidence by Gaussian-weighting historical confidences (sketched below);
- Achieving superior results on KITTI [30], with our code publicly released to foster future research.
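To make the TSM idea concrete, the sketch below scores each detection's RegionCLIP region embedding against the tracked category's text embedding and keeps only semantically consistent boxes. This is a minimal illustration, assuming L2-normalizable CLIP-style embeddings; the function name `semantic_match` and the 0.6 cutoff are illustrative choices, not values from the paper.

```python
import numpy as np

def semantic_match(region_feats, text_feat, sim_thresh=0.6):
    """Keep detections whose region embedding is semantically
    consistent with the tracked category's text embedding.

    region_feats: (N, D) RegionCLIP-style region embeddings, one per detection.
    text_feat:    (D,) text embedding of the tracked category prompt.
    sim_thresh:   illustrative cutoff; the paper's actual value may differ.
    """
    # L2-normalize so the dot product equals cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = r @ t                   # (N,) cosine similarities
    keep = sims >= sim_thresh      # boolean mask over detections
    return keep, sims
```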
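The 3D F-EMA module can be read as a standard exponential moving average over a track's 3D feature, so a single noisy frame cannot dominate box regression. A minimal sketch under that reading; the smoothing factor `alpha = 0.9` is an illustrative assumption:

```python
def ema_update(track_feat, det_feat, alpha=0.9):
    """Exponential moving average over a track's 3D feature.

    track_feat: accumulated feature for the track (array or scalar).
    det_feat:   feature of the newly matched detection.
    alpha:      values near 1 favor the history, damping noisy frames.
    """
    return alpha * track_feat + (1.0 - alpha) * det_feat
```

With `alpha = 0.9`, the newest observation contributes only 10% of the fused feature, which is what damps spurious single-frame detections.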
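GCF fuses a trajectory's confidence history with weights drawn from a Gaussian over frame age, so recent frames dominate while the history still anchors the estimate. The sketch below is one plausible reading; `sigma = 3.0` and the exact kernel are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def gaussian_confidence_fusion(conf_history, sigma=3.0):
    """Fuse a track's confidence history with Gaussian weights.

    conf_history: list of per-frame confidences, oldest first.
    sigma:        width of the Gaussian over frame age (illustrative).
    Recent frames receive the largest weights, so one low-confidence
    frame cannot collapse an otherwise stable trajectory.
    """
    ages = np.arange(len(conf_history))[::-1]          # 0 = most recent frame
    weights = np.exp(-(ages ** 2) / (2.0 * sigma ** 2))
    weights /= weights.sum()                           # normalize to sum to 1
    return float(weights @ np.asarray(conf_history))
```

For example, `gaussian_confidence_fusion([0.9, 0.85, 0.4])` returns roughly 0.70, rather than letting the single low-confidence frame (0.4) collapse the trajectory score.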
2. Related Work
2.1. Traditional MOT Methods
2.2. 3D MOT Methods
2.3. Vision-Language Models (VLMs) in MOT
3. Proposed Method
3.1. VLM and RegionCLIP
3.2. Target Semantic Matching Module
3.3. 3D Feature Exponential Moving Average
3.4. Gaussian Confidence Fusion Module
4. Experiments
4.1. Experimental Setup
4.2. Experimental Results
4.3. Ablation Study
4.4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
- Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 9686–9696. [Google Scholar]
- Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
- He, J.; Fu, C.; Wang, X. 3d multi-object tracking based on uncertainty-guided data association. arXiv 2023, arXiv:2303.01786. [Google Scholar]
- Wang, L.; Zhang, X.; Qin, W.; Li, X.; Gao, J.; Yang, L.; Li, Z.; Li, J.; Zhu, L.; Wang, H. Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11981–11996. [Google Scholar] [CrossRef]
- Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
- Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
- Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 107–122. [Google Scholar]
- Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
- Zhang, J.; Zhou, S.; Chang, X.; Wan, F.; Wang, J.; Wu, Y.; Huang, D. Multiple object tracking by flowing and fusing. arXiv 2020, arXiv:2001.11180. [Google Scholar] [CrossRef]
- Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 474–490. [Google Scholar]
- Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
- Zulkifley, M.A.; Rawlinson, D.; Moran, B. Robust Observation Detection for Single Object Tracking: Deterministic and Probabilistic Patch-Based Approaches. Sensors 2012, 12, 15638–15670. [Google Scholar] [CrossRef]
- Weng, X.; Wang, J.; Held, D.; Kitani, K. 3d multi-object tracking: A baseline and new evaluation metrics. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24–30 October 2020; pp. 10359–10366. [Google Scholar]
- Shi, S.; Wang, X.; Li, H. Pointrcnn: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
- Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
- Pang, Z.; Li, Z.; Wang, N. Simpletrack: Understanding and rethinking 3d multi-object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 680–696. [Google Scholar]
- Paek, D.H.; Kong, S.H.; Wijaya, K.T. K-radar: 4d radar object detection for autonomous driving in various weather conditions. Adv. Neural Inf. Process. Syst. 2022, 35, 3819–3829. [Google Scholar]
- Yang, B.; Guo, R.; Liang, M.; Casas, S.; Urtasun, R. Radarnet: Exploiting radar for robust perception of dynamic objects. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 496–512. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; PMLR: 2021. Volume 139, pp. 8748–8763. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; PMLR: 2021. Volume 139, pp. 4904–4916. [Google Scholar]
- Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 104–120. [Google Scholar]
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
- Xian, Y.; Lampert, C.H.; Schiele, B.; Akata, Z. Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2251–2265. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; PMLR: 2022. Volume 162, pp. 12888–12900. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Gao, J. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803. [Google Scholar]
- Wang, X.; Fu, C.; He, J.; Huang, M.; Meng, T.; Zhang, S.; Zhang, C. You only need two detectors to achieve multi-modal 3d multi-object tracking. arXiv 2023, arXiv:2304.08709. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
- Berclaz, J.; Fleuret, F.; Turetken, E.; Fua, P. Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1806–1819. [Google Scholar] [CrossRef]
- Karim, T.; Mahayuddin, Z.R.; Hasan, M.K. Singular and Multimodal Techniques of 3D Object Detection: Constraints, Advancements and Research Direction. Appl. Sci. 2023, 13, 13267. [Google Scholar] [CrossRef]
- Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7345–7353. [Google Scholar]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27 June–2 July 2016; pp. 2147–2156. [Google Scholar]
- Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
- Weng, X.; Wang, Y.; Man, Y.; Kitani, K.M. Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6499–6508. [Google Scholar]
- Li, Y.; Liu, X.; Liu, L.; Fan, H.; Zhang, L. LaMOT: Language-Guided Multi-Object Tracking. arXiv 2024, arXiv:2406.08324. [Google Scholar]
- Li, X.; Huang, Y.; He, Z.; Wang, Y.; Lu, H.; Yang, M.H. Citetracker: Correlating image and text for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9974–9983. [Google Scholar]
- Liu, X.; Zou, Z.; Hao, J. Adaptive Text Feature Updating for Visual-Language Tracking. In Proceedings of the International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 8–12 January 2024; Springer Nature: Cham, Switzerland, 2024; pp. 366–381. [Google Scholar]
- Wang, Y.; Abd Rahman, A.H.; Nor Rashid, F.A.; Razali, M.K.M. Tackling Heterogeneous Light Detection and Ranging-Camera Alignment Challenges in Dynamic Environments: A Review for Object Detection. Sensors 2024, 24, 7855. [Google Scholar] [CrossRef] [PubMed]
- Saif, F.M.S.; Mahayuddin, Z.R. Vision based 3D object detection using deep learning: Methods with challenges and applications towards future directions. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 203–214. [Google Scholar] [CrossRef]
- Mohammed, S.A.K.; Razak, M.Z.A.; Rahman, A.H.A. 3D-DIoU: 3D Distance Intersection over Union for Multi-Object Tracking in Point Cloud. Sensors 2023, 23, 3390. [Google Scholar] [CrossRef]
- Su, Z.; Adam, A.; Nasrudin, M.F.; Prabuwono, A.S. Proposal-Free Fully Convolutional Network: Object Detection Based on a Box Map. Sensors 2024, 24, 3529. [Google Scholar] [CrossRef] [PubMed]
- Wang, X.; Fu, C.; Li, Z.; Lai, Y.; He, J. Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association. IEEE Robot. Autom. Lett. 2022, 7, 8260–8267. [Google Scholar] [CrossRef]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10526–10535. [Google Scholar]
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual Event, 2–9 February 2021; Volume 35, pp. 1201–1209. [Google Scholar]
- Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef]
- Zhang, W.; Zhou, H.; Sun, S.; Wang, Z.; Shi, J.; Loy, C.C. Robust multi-modality multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2365–2374. [Google Scholar]
- Kim, A.; Ošep, A.; Leal-Taixé, L. Eagermot: 3d multi-object tracking via sensor fusion. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11315–11321. [Google Scholar]
- Wu, H.; Han, W.; Wen, C.; Li, X.; Wang, C. 3D multi-object tracking in point clouds based on prediction confidence-guided data association. IEEE Trans. Intell. Transp. Syst. 2022, 23, 5668–5677. [Google Scholar] [CrossRef]
- Reich, A.; Wuensche, H.J. Monocular 3d multi-object tracking with an ekf approach for long-term stable tracks. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Annapolis, MD, USA, 5–8 July 2021; pp. 1–7. [Google Scholar]
- Zhu, Z.; Nie, J.; Wu, H.; He, Z.; Gao, M. MSA-MOT: Multi-stage association for 3D multimodality multi-object tracking. Sensors 2022, 22, 8650. [Google Scholar] [CrossRef]
- Kim, A.; Brasó, G.; Ošep, A.; Leal-Taixé, L. Polarmot: How far can geometric relations take us in 3d multi-object tracking? In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 41–58. [Google Scholar]
- Cho, M.; Kim, E. 3D LiDAR multi-object tracking with short-term and long-term multi-level associations. Remote Sens. 2023, 15, 5486. [Google Scholar] [CrossRef]
- Ninh, P.P.; Kim, H. CollabMOT: Stereo Camera Collaborative Multi-Object Tracking. IEEE Access 2024, 12, 21304–21319. [Google Scholar] [CrossRef]
- Peng, C.; Zeng, Z.; Gao, J.; Zhou, J.; Tomizuka, M.; Wang, X.; Ye, N. PNAS-MOT: Multi-modal object tracking with pareto neural architecture search. IEEE Robot. Autom. Lett. 2024, 9, 4601–4608. [Google Scholar] [CrossRef]
- Zhou, T.; Ye, Q.; Luo, W.; Ran, H.; Shi, Z.; Chen, J. APPTracker+: Displacement Uncertainty for Occlusion Handling in Low-Frame-Rate Multiple Object Tracking. Int. J. Comput. Vis. 2024, 133, 2044–2069. [Google Scholar] [CrossRef]
- Miah, M.; Bilodeau, G.A.; Saunier, N. Learning data association for multi-object tracking using only coordinates. Pattern Recognit. 2025, 160, 111169. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating exposure bias in large language model distillation: An imitation learning approach. Neural Comput. Appl. 2025, 37, 12013–12029. [Google Scholar] [CrossRef]
| Method | Published | HOTA (%) ↑ | DetA (%) ↑ | AssA (%) ↑ | LocA (%) ↑ | MOTA (%) ↑ | MOTP (%) ↑ | IDSW ↓ |
|---|---|---|---|---|---|---|---|---|
| mmMOT [50] | ICCV 2019 | 62.05 | 72.29 | 54.02 | 86.58 | 83.23 | 85.03 | 733 |
| EagerMOT [51] | ICRA 2021 | 74.39 | 75.27 | 74.16 | 87.17 | 87.82 | 85.69 | 239 |
| PC3T [52] | TITS 2021 | 77.80 | 74.57 | 81.59 | 86.07 | 88.81 | 84.26 | 225 |
| Mono-3D-KF [53] | FUSION 2021 | 75.47 | 74.10 | 77.63 | 85.48 | 88.48 | 83.70 | 162 |
| MSA-MOT [54] | Sensors 2022 | 78.52 | 75.19 | 82.56 | 87.00 | 88.01 | 85.45 | 91 |
| DeepFusionMOT [45] | RA-L 2022 | 75.46 | 71.54 | 80.05 | 86.70 | 84.63 | 85.02 | 84 |
| PolarMOT [55] | ECCV 2022 | 75.16 | 73.94 | 76.95 | 87.12 | 85.08 | 85.63 | 462 |
| 3DMLA [56] | Remote Sensing 2023 | 75.65 | 71.92 | 80.02 | 86.62 | 85.03 | 84.93 | 39 |
| CollabMOT [57] | Access 2024 | 75.26 | 75.46 | 75.74 | 86.44 | 89.08 | 84.97 | 227 |
| PNAS-MOT [58] | RA-L 2024 | 67.32 | 77.69 | 58.99 | 86.94 | 89.59 | 85.44 | 751 |
| APPTracker+ [59] | IJCV 2024 | 75.19 | 75.55 | 75.36 | 86.59 | 89.09 | 85.03 | 176 |
| C-TWIX [60] | Pattern Recognition 2025 | 77.58 | 76.97 | 78.84 | 86.95 | 89.68 | 85.50 | 381 |
| YONTD-MOT [29] | — | 78.08 | 74.16 | 82.86 | 88.23 | 85.09 | 86.98 | 42 |
| TG3MOT (Ours) | — | 78.72 | 74.59 | 83.69 | 87.64 | 86.15 | 86.26 | 35 |
| Method | 3D Detector | 2D Detector | HOTA (%) ↑ | AssA (%) ↑ | MOTA (%) ↑ | IDSW ↓ | IDF1 (%) ↑ |
|---|---|---|---|---|---|---|---|
| YONTD-MOT | VoxelRCNN [48] | FasterRCNN [46] | 77.52 | 82.14 | 82.11 | 10 | 90.75 |
| YONTD-MOT | VoxelRCNN [48] | MaskRCNN [61] | 76.49 | 81.12 | 80.35 | 8 | 88.85 |
| YONTD-MOT | PVRCNN [47] | FasterRCNN [46] | 75.38 | 82.33 | 75.47 | 14 | 87.91 |
| YONTD-MOT | PVRCNN [47] | MaskRCNN [61] | 75.05 | 81.78 | 75.14 | 13 | 87.68 |
| TG3MOT (Ours) | VoxelRCNN [48] | RegionCLIP | 77.78 | 82.08 | 82.35 | 8 | 90.91 |
| TG3MOT (Ours) | PVRCNN [47] | RegionCLIP | 75.93 | 81.27 | 78.61 | 4 | 89.01 |
| Module | TSM | 3D F-EMA | GCF | HOTA (%) ↑ | DetA (%) ↑ | AssA (%) ↑ | MOTA (%) ↑ | MOTP (%) ↑ | IDSW ↓ | IDF1 (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Base | × | × | × | 76.432 | 71.852 | 81.453 | 79.382 | 88.402 | 5 | 88.423 |
| TSM | ✓ | × | × | 77.741 | 73.412 | 82.481 | 81.732 | 88.521 | 2 | 90.821 |
| 3D F-EMA | × | ✓ | × | 76.441 | 71.863 | 81.462 | 79.403 | 88.413 | 5 | 88.434 |
| GCF | × | × | ✓ | 76.592 | 72.271 | 80.041 | 80.842 | 88.423 | 5 | 88.743 |
| TSM+GCF | ✓ | × | ✓ | 77.782 | 73.842 | 82.351 | 82.351 | 88.512 | 8 | 90.902 |
| Full | ✓ | ✓ | ✓ | 77.785 | 73.846 | 82.083 | 82.354 | 88.514 | 8 | 90.914 |