Video Temporal Grounding with Multi-Model Collaborative Learning
Abstract
1. Introduction
- We propose a multi-model collaborative learning framework with a bidirectional knowledge transfer scheme that enables distinct models to learn from one another, moving beyond isolated single-model optimization and exploiting cross-model complementarity.
- We design a CLIP-guided pseudo-label generator that leverages CLIP’s cross-modal priors to improve the quality and robustness of pseudo-labels, and thereby the accuracy of knowledge transfer.
- We present an iterative multi-model training algorithm that alternately freezes and optimizes each model’s parameters, mitigating gradient conflicts, balancing cooperative and independent optimization, and improving training stability and overall performance (a minimal sketch of this alternating scheme is given after this list).
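The following PyTorch-style sketch illustrates how such an alternating collaboration step could be organized: each model is optimized in turn while its peer is frozen and only supplies pseudo-labels for a transfer loss. All names (`collaborative_step`, `grounding_loss`, `transfer_loss`, `make_pseudo_labels`) are hypothetical placeholders introduced for illustration; this is a minimal sketch under those assumptions, not the authors’ released implementation.

```python
# Minimal sketch of the alternating "freeze one, optimize the other" scheme.
# All names are hypothetical placeholders, not the authors' released code.
import torch
import torch.nn.functional as F


def grounding_loss(pred, target):
    # Placeholder supervised loss on predicted (start, end) spans.
    return F.smooth_l1_loss(pred, target)


def transfer_loss(pred, pseudo):
    # Placeholder knowledge-transfer loss against the peer's pseudo-labels.
    return F.mse_loss(pred, pseudo)


def collaborative_step(models, optimizers, batch, make_pseudo_labels):
    """One collaborative step: each model is optimized in turn while its
    peer stays frozen and only supplies pseudo-labels."""
    video, query, span = batch
    for i, (model, optim) in enumerate(zip(models, optimizers)):
        peer = models[1 - i]
        peer.eval()
        with torch.no_grad():            # frozen peer: no gradients flow into it
            pseudo = make_pseudo_labels(peer(video, query))
        model.train()
        pred = model(video, query)
        loss = grounding_loss(pred, span) + transfer_loss(pred, pseudo)
        optim.zero_grad()
        loss.backward()
        optim.step()
```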
2. Related Work
2.1. Video Temporal Grounding
2.2. Knowledge Flow in Video Temporal Grounding
2.3. Pseudo-Label in Video Temporal Grounding
3. Methodology
3.1. Problem Formulation
3.2. General Model
3.3. Pseudo-Label Generator
3.3.1. CLIP-Guided Module
3.3.2. Pseudo-Labels Generation
3.4. Mutual Knowledge Transfer
3.5. Iterative Training Algorithm
3.5.1. Consistency Assessment
3.5.2. Dynamic Loss Adjustment
3.5.3. Algorithmic Realization
Algorithm 1: Dynamic Loss Adjustment with IoU-based Weighting
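As a rough companion to Algorithm 1, the snippet below shows one way the consistency assessment and IoU-based weighting could be realized: the temporal IoU between the spans predicted by the two collaborating models is mapped to a weight for the cross-model transfer loss. The linear ramp and its `low`/`high` thresholds are illustrative assumptions, not the paper’s exact rule.

```python
# Illustrative sketch of IoU-based dynamic weighting of the transfer loss;
# the linear ramp and its thresholds are assumptions made for illustration.
import torch


def temporal_iou(span_a: torch.Tensor, span_b: torch.Tensor) -> torch.Tensor:
    """Temporal IoU between (start, end) span predictions of shape (batch, 2)."""
    inter = (torch.minimum(span_a[:, 1], span_b[:, 1])
             - torch.maximum(span_a[:, 0], span_b[:, 0])).clamp(min=0)
    union = (span_a[:, 1] - span_a[:, 0]) + (span_b[:, 1] - span_b[:, 0]) - inter
    return inter / union.clamp(min=1e-6)


def transfer_weight(iou: torch.Tensor, low: float = 0.3, high: float = 0.7) -> torch.Tensor:
    """Consistency assessment: trust cross-model pseudo-labels more when the
    two models agree (high IoU) and less when they disagree (low IoU)."""
    return ((iou - low) / (high - low)).clamp(0.0, 1.0)


# Example (hypothetical tensors): per-sample weighting of the transfer loss.
# weight = transfer_weight(temporal_iou(pred_a, pred_b))        # shape (batch,)
# weighted_loss = (weight * per_sample_transfer_loss).mean()
```

In such a scheme, samples on which the two models agree contribute fully to the transfer loss, while conflicting predictions are down-weighted, which is one plausible way to balance cooperative and independent optimization.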
4. Experiments
4.1. Datasets
4.2. Baselines
4.3. Implementation Details
4.3.1. Evaluation Metrics
4.3.2. Experimental Settings
4.4. Training and Inference
4.5. Comparison with State-of-the-Arts
4.5.1. Comparative Methods
4.5.2. Quantitative Analysis
4.6. Performance of Isomorphic Models
4.7. Ablation Studies
4.7.1. Loss Function
4.7.2. Hyperparameters of and
4.7.3. Hyperparameters of and
4.7.4. Effectiveness Analysis of the CLIP-Guided Module
4.7.5. Qualitative Analysis
4.8. Case Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gao, J.; Sun, C.; Yang, Z.; Nevatia, R. TALL: Temporal Activity Localization via Language Query. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Chu, Y.W.; Lin, K.Y.; Hsu, C.C.; Ku, L.W. End-to-End Recurrent Cross-Modality Attention for Video Dialogue. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2456–2464. [Google Scholar] [CrossRef]
- Ji, W.; Li, Y.; Wei, M.; Shang, X.; Xiao, J.; Ren, T.; Chua, T.S. VidVRD 2021: The Third Grand Challenge on Video Relation Detection. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, Chengdu, China, 20–24 October 2021; pp. 4779–4783. [Google Scholar] [CrossRef]
- Shang, X.; Li, Y.; Xiao, J.; Ji, W.; Chua, T.S. Video Visual Relation Detection via Iterative Inference. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, Chengdu, China, 20–24 October 2021; pp. 3654–3663. [Google Scholar] [CrossRef]
- Shang, X.; Ren, T.; Guo, J.; Zhang, H.; Chua, T.S. Video Visual Relation Detection. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, Mountain View, CA, USA, 23–27 October 2017; pp. 1300–1308. [Google Scholar] [CrossRef]
- Li, Y.; Wang, X.; Xiao, J.; Ji, W.; Chua, T.S. Invariant Grounding for Video Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 19–24 June 2022; pp. 2928–2937. [Google Scholar]
- Xiao, J.; Yao, A.; Liu, Z.; Li, Y.; Ji, W.; Chua, T.S. Video as Conditional Graph Hierarchy for Multi-Granular Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Online Conference, 22 February–1 March 2022; Volume 36, pp. 2804–2812. [Google Scholar] [CrossRef]
- Li, S.; Li, B.; Sun, B.; Weng, Y. Towards Visual-Prompt Temporal Answer Grounding in Instructional Video. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8836–8853. [Google Scholar] [CrossRef] [PubMed]
- Liang, R.; Yang, Y.; Lu, H.; Li, L. Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation. arXiv 2023, arXiv:2308.03725. [Google Scholar]
- Lan, X.; Yuan, Y.; Wang, X.; Wang, Z.; Zhu, W. A Survey on Temporal Sentence Grounding in Videos. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–33. [Google Scholar] [CrossRef]
- Zhang, Y.; Xu, Y.; Chen, M.; Zhang, Y.; Feng, R.; Gao, S. SPTNET: Span-based Prompt Tuning for Video Grounding. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2807–2812. [Google Scholar] [CrossRef]
- Ding, X.; Wang, N.; Zhang, S.; Huang, Z.; Li, X.; Tang, M.; Liu, T.; Gao, X. Exploring Language Hierarchy for Video Grounding. IEEE Trans. Image Process. 2022, 31, 4693–4706. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Li, K.; Li, J.; Chen, G.; Wang, M.; Guo, D. Dual-path temporal map optimization for make-up temporal video grounding. Multimed. Syst. 2024, 30, 140. [Google Scholar] [CrossRef]
- Zhang, Y.; Chen, X.; Jia, J.; Liu, S.; Ding, K. Text-Visual Prompting for Efficient 2D Temporal Video Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14794–14804. [Google Scholar]
- Bao, P.; Shao, Z.; Yang, W.; Ng, B.P.; Er, M.H.; Kot, A.C. Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 747–755. [Google Scholar] [CrossRef]
- Bao, P.; Xia, Y.; Yang, W.; Ng, B.P.; Er, M.H.; Kot, A.C. Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 738–746. [Google Scholar] [CrossRef]
- Weng, Y.; Li, B. Visual Answer Localization with Cross-Modal Mutual Knowledge Transfer. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Wu, J.; Jiang, Y.; Wei, X.Y.; Li, Q. PolySmart @ TRECVid 2024 Medical Video Question Answering. arXiv 2024, arXiv:2412.15514. [Google Scholar]
- Shi, G.; Li, Q.; Zhang, W.; Chen, J.; Wu, X.M. Recon: Reducing Conflicting Gradients from the Root for Multi-Task Learning. arXiv 2023, arXiv:2302.11289. [Google Scholar]
- Liu, B.; Liu, X.; Jin, X.; Stone, P.; Liu, Q. Conflict-Averse Gradient Descent for Multi-task Learning. In Proceedings of the Advances in Neural Information Processing Systems, Red Hook, NY, USA, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 18878–18890. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Volume 139, pp. 8748–8763. [Google Scholar]
- Zhang, H.; Sun, A.; Jing, W.; Zhou, J.T. Span-based Localizing Network for Natural Language Video Localization. arXiv 2020, arXiv:2004.13931. [Google Scholar]
- Zhang, S.; Peng, H.; Fu, J.; Luo, J. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12870–12877. [Google Scholar] [CrossRef]
- Zhang, S.; Peng, H.; Fu, J.; Lu, Y.; Luo, J. Multi-Scale 2D Temporal Adjacency Networks for Moment Localization With Natural Language. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9073–9087. [Google Scholar] [CrossRef] [PubMed]
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
- Zheng, M.; Gong, S.; Jin, H.; Peng, Y.; Liu, Y. Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; pp. 14197–14209. [Google Scholar] [CrossRef]
- Xu, Z.; Wei, K.; Yang, X.; Deng, C. Point-Supervised Video Temporal Grounding. IEEE Trans. Multimed. 2023, 25, 6121–6131. [Google Scholar] [CrossRef]
- Xu, Y.; Wei, F.; Sun, X.; Yang, C.; Shen, Y.; Dai, B.; Zhou, B.; Lin, S. Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 2959–2968. [Google Scholar]
- Jiang, X.; Xu, X.; Zhang, J.; Shen, F.; Cao, Z.; Shen, H.T. Semi-Supervised Video Paragraph Grounding With Contrastive Encoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 2466–2475. [Google Scholar]
- Luo, F.; Chen, S.; Chen, J.; Wu, Z.; Jiang, Y.G. Self-Supervised Learning for Semi-Supervised Temporal Language Grounding. IEEE Trans. Multimed. 2023, 25, 7747–7757. [Google Scholar] [CrossRef]
- Piao, Y.; Lu, C.; Zhang, M.; Lu, H. Semi-Supervised Video Salient Object Detection Based on Uncertainty-Guided Pseudo Labels. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 5614–5627. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features With 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Alaparthi, S.; Mishra, M. Bidirectional Encoder Representations from Transformers (BERT): A sentiment analysis odyssey. arXiv 2020, arXiv:2007.01127. [Google Scholar]
- Sigurdsson, G.A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; Gupta, A. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 510–526. [Google Scholar]
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Regneri, M.; Rohrbach, M.; Wetzel, D.; Thater, S.; Schiele, B.; Pinkal, M. Grounding Action Descriptions in Videos. Trans. Assoc. Comput. Linguist. 2013, 1, 25–36. [Google Scholar] [CrossRef]
- Liu, M.; Wang, X.; Nie, L.; Tian, Q.; Chen, B.; Chua, T.S. Cross-modal Moment Localization in Videos. In Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, Seoul, Republic of Korea, 22–26 October 2018; pp. 843–851. [Google Scholar] [CrossRef]
- Ge, R.; Gao, J.; Chen, K.; Nevatia, R. MAC: Mining Activity Concepts for Language-Based Temporal Localization. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 245–253. [Google Scholar] [CrossRef]
- Wang, W.; Huang, Y.; Wang, L. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Ghosh, S.; Agarwal, A.; Parekh, Z.; Hauptmann, A. ExCL: Extractive Clip Localization Using Natural Language Descriptions. arXiv 2019, arXiv:1904.02755. [Google Scholar]
- Zeng, R.; Xu, H.; Huang, W.; Chen, P.; Tan, M.; Gan, C. Dense Regression Network for Video Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Liu, D.; Qu, X.; Dong, J.; Zhou, P.; Cheng, Y.; Wei, W.; Xu, Z.; Xie, Y. Context-Aware Biaffine Localizing Network for Temporal Sentence Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11235–11244. [Google Scholar]
- Li, H.; Shu, X.; He, S.; Qiao, R.; Wen, W.; Guo, T.; Gan, B.; Sun, X. D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13734–13746. [Google Scholar]
- Ju, C.; Wang, H.; Liu, J.; Ma, C.; Zhang, Y.; Zhao, P.; Chang, J.; Tian, Q. Constraint and Union for Partially-Supervised Temporal Sentence Grounding. arXiv 2023, arXiv:2302.09850. [Google Scholar]
- Chen, J.; Chen, X.; Ma, L.; Jie, Z.; Chua, T.S. Temporally Grounding Natural Sentence in Video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; pp. 162–171. [Google Scholar] [CrossRef]
- Yuan, Y.; Mei, T.; Zhu, W. To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression. In Proceedings of the 2019 AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 9159–9166. [Google Scholar] [CrossRef]
- Liu, M.; Wang, X.; Nie, L.; He, X.; Chen, B.; Chua, T.S. Attentive Moment Retrieval in Videos. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, Ann Arbor, MI, USA, 8–12 July 2018; pp. 15–24. [Google Scholar] [CrossRef]
- Ji, W.; Qin, Y.; Chen, L.; Wei, Y.; Wu, Y.; Zimmermann, R. Mrtnet: Multi-Resolution Temporal Network for Video Sentence Grounding. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 2770–2774. [Google Scholar] [CrossRef]
- Chen, S.; Jiang, Y.G. Semantic Proposal for Activity Localization in Videos via Sentence Query. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8199–8206. [Google Scholar] [CrossRef]
- Xu, H.; He, K.; Plummer, B.A.; Sigal, L.; Sclaroff, S.; Saenko, K. Multilevel Language and Vision Integration for Text-to-Clip Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9062–9069. [Google Scholar] [CrossRef]
- Lu, C.; Chen, L.; Tan, C.; Li, X.; Xiao, J. DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; pp. 5144–5153. [Google Scholar] [CrossRef]
- Zhang, D.; Dai, X.; Wang, X.; Wang, Y.F.; Davis, L.S. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Chen, L.; Lu, C.; Tang, S.; Xiao, J.; Zhang, D.; Tan, C.; Li, X. Rethinking the Bottom-Up Framework for Query-Based Video Localization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10551–10558. [Google Scholar] [CrossRef]
- Yu, X.; Malmir, M.; He, X.; Chen, J.; Wang, T.; Wu, Y.; Liu, Y.; Liu, Y. Cross Interaction Network for Natural Language Guided Video Moment Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, Virtual, 11–15 July 2021; pp. 1860–1864. [Google Scholar] [CrossRef]
- Zhang, Z.; Lin, Z.; Zhao, Z.; Xiao, Z. Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Paris, France, 21–25 July 2019; pp. 655–664. [Google Scholar] [CrossRef]
- Wang, Z.; Wang, L.; Wu, T.; Li, T.; Wu, G. Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; pp. 2613–2623. [Google Scholar] [CrossRef]
- Zhang, H.; Sun, A.; Jing, W.; Zhen, L.; Zhou, J.T.; Goh, S.M.R. Parallel Attention Network with Sequence Matching for Video Grounding. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar] [CrossRef]
Methods | Publication | Rank@1 (IoU = 0.3) | Rank@1 (IoU = 0.5) | Rank@1 (IoU = 0.7) | mIoU
---|---|---|---|---|---
CTRL [1] | ICCV2017 | – | 23.63 | 8.9 | – |
ROLE [40] | MM2018 | 25.26 | 12.12 | – | – |
ACL [41] | WACV2019 | – | 30.48 | 12.20 | – |
SAP [52] | AAAI2019 | – | 27.42 | 13.36 | – |
SM-RL [42] | CVPR2019 | – | 24.36 | 11.17 | – |
QSPN [53] | AAAI2019 | 54.70 | 35.60 | 15.80 | – |
DEBUG [54] | EMNLP2019 | 54.95 | 37.39 | 17.92 | 36.34 |
ExCL [43] | ACL2019 | – | 44.10 | 22.40 | – |
MAN [55] | CVPR2019 | – | 46.53 | 22.72 | – |
GDP [56] | AAAI2020 | 54.54 | 39.47 | 18.49 | – |
DRN [44] | CVPR2020 | – | 42.90 | 23.68 | – |
CBLN [45] | CVPR2021 | – | 43.67 | 24.44 | – |
CI-MHA [57] | SIGIR2021 | 69.87 | 54.68 | 35.27 | – |
MMN [59] | AAAI2022 | – | 47.31 | 27.28 | – |
PS-VTG [28] | ITM2022 | – | 39.22 | 20.17 | – |
D3G [46] | ICCV2023 | – | 41.64 | 19.60 | – |
2D-TAN [23] | AAAI2020 | 57.31 | 39.70 | 23.31 | 39.23 |
MRTNet (2D-TAN) [51] | ICASSP2024 | 59.23 | 44.27 | 25.88 | 40.59 |
MMCL (2D-TAN) | – | 60.58 | 46.95 | 28.26 | 42.17 |
VSLNet [22] | ACL2020 | 70.46 | 54.19 | 35.22 | 50.02 |
MRTNet (VSLNet) [51] | ICASSP2024 | 70.88 | 56.19 | 36.37 | 50.74 |
MMCL (VSLNet) | – | 72.31 | 58.86 | 38.52 | 51.96 |
Methods | Publication | Rank@1 (IoU = 0.3) | Rank@1 (IoU = 0.5) | Rank@1 (IoU = 0.7) | mIoU
---|---|---|---|---|---
TGN [48] | EMNLP2018 | 43.81 | 27.93 | – | – |
CMIN [58] | SIGIR2019 | 63.61 | 43.40 | 23.88 | – |
QSPN [53] | AAAI2019 | 45.30 | 27.70 | 13.60 | – |
ABLR-af [49] | AAAI2019 | 53.65 | 34.91 | – | 35.72 |
ABLR-aw [49] | AAAI2019 | 55.67 | 36.79 | – | 36.99 |
DEBUG [54] | EMNLP2019 | 55.91 | 39.72 | – | 39.51 |
GDP [56] | AAAI2020 | 56.17 | 39.27 | – | 39.80 |
DRN [44] | CVPR2020 | 58.52 | 41.51 | 23.07 | 43.13 |
CI-MHA [57] | SIGIR2021 | 61.49 | 43.97 | 25.13 | – |
SeqPAN [60] | ACL2021 | 61.65 | 45.50 | 29.37 | 45.11 |
CBLN [45] | CVPR2021 | 66.34 | 48.12 | 27.60 | – |
MMN [59] | AAAI2022 | 65.05 | 48.59 | 29.26 | – |
PS-VTG [28] | ITM2022 | 59.71 | 39.59 | 21.98 | – |
PFU [47] | ARXIV2023 | 59.63 | 36.35 | 16.61 | 40.15 |
D3G [46] | ICCV2023 | 58.25 | 36.68 | 18.54 | – |
2D-TAN [23] | AAAI2020 | 59.45 | 44.51 | 26.54 | 43.29 |
MRTNet (2D-TAN) [51] | ICASSP2024 | 60.71 | 45.59 | 28.07 | 44.54 |
MMCL (2D-TAN) | – | 62.15 | 47.21 | 29.35 | 46.53 |
VSLNet [22] | ACL2020 | 63.16 | 43.22 | 26.16 | 43.19 |
MRTNet (VSLNet) [51] | ICASSP2024 | 64.17 | 44.09 | 27.43 | 44.82 |
MMCL (VSLNet) | – | 65.59 | 46.82 | 28.59 | 45.98 |
Methods | Publication | Rank@1 (IoU = 0.3) | Rank@1 (IoU = 0.5) | Rank@1 (IoU = 0.7) | mIoU
---|---|---|---|---|---
CTRL [1] | ICCV2017 | 18.32 | 13.30 | – | – |
ACRN [50] | SIGIR2018 | 19.52 | 14.62 | – | – |
TGN [48] | EMNLP2018 | 21.77 | 18.90 | – | – |
SM-RL [42] | CVPR2019 | 20.25 | 15.95 | – | – |
ACL [41] | WACV2019 | 22.07 | 17.78 | – | – |
ABLR-aw [49] | AAAI2019 | 18.90 | 9.30 | – | 12.50 |
DEBUG [54] | EMNLP2019 | 23.45 | 11.72 | – | 16.03 |
CMIN [58] | SIGIR2019 | 24.64 | 18.05 | – | – |
CBLN [45] | CVPR2021 | 38.98 | 27.65 | – | – |
SeqPAN [60] | ACL2021 | 31.72 | 27.19 | 21.65 | 25.86 |
MMN [59] | AAAI2022 | 38.57 | 27.24 | – | – |
PS-VTG [28] | ITM2022 | 23.64 | 10.00 | 3.35 | – |
D3G [46] | ICCV2023 | 27.27 | 12.67 | 4.70 | – |
2D-TAN [23] | AAAI2020 | 37.29 | 25.32 | 13.32 | 25.19 |
MRTNet (2D-TAN) [51] | ICASSP2024 | 37.81 | 26.01 | 14.95 | 26.29 |
MMCL (2D-TAN) | – | 39.50 | 27.50 | 15.12 | 27.85 |
VSLNet [22] | ACL2020 | 29.61 | 24.27 | 20.03 | 24.11 |
MRTNet (VSLNet) [51] | ICASSP2024 | 32.35 | 25.84 | 21.31 | 26.14 |
MMCL (VSLNet) | – | 34.15 | 26.81 | 22.57 | 27.83 |
Experimental Setting | | Rank@1 (IoU = 0.3) | Rank@1 (IoU = 0.5) | Rank@1 (IoU = 0.7) | mIoU
---|---|---|---|---|---
MMCL (VSLNet) | Ours | 72.31 | 58.86 | 38.52 | 51.96
| w/o | 68.56 | 52.82 | 33.89 | 48.11
| w/o | 70.12 | 56.71 | 36.25 | 49.92
| w/o | 70.85 | 56.71 | 36.79 | 50.25
| w/o | 71.92 | 57.91 | 37.65 | 51.35
MMCL (2D-TAN) | Ours | 60.58 | 46.95 | 28.26 | 42.17
| w/o | 59.41 | 45.63 | 27.11 | 41.03
| w/o | 55.81 | 38.89 | 22.18 | 37.84
| w/o | 59.83 | 46.42 | 27.94 | 41.72
| w/o | 58.78 | 45.56 | 27.21 | 40.85