Hierarchical Prototype Alignment for Video Temporal Grounding
Abstract
1. Introduction

- We propose a hierarchical prototype alignment method that decomposes cross-modal alignment into two complementary stages, namely object-phrase alignment and event-sentence alignment, thereby improving alignment accuracy from both spatial and temporal perspectives (a schematic code sketch follows this list).
- We develop two prototype aggregation networks to model spatial and temporal information in videos, enabling the framework to capture fine-grained local details and diverse semantic patterns.
- Instead of constructing moment candidates through pooling or stacked convolutions, we introduce a strategy that builds candidates directly from the alignment results, emphasizing the temporally localized regions most relevant to the textual semantics.
- Experiments on Charades-STA, ActivityNet Captions, and TACoS demonstrate that the proposed method achieves competitive or superior performance against existing approaches across multiple evaluation metrics.
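As a rough illustration of the pipeline these contributions describe, the following PyTorch sketch shows the two-stage idea: spatial (object) and temporal (event) prototypes are aggregated from video features, matched against phrase and sentence embeddings, and fused. Everything here is an illustrative assumption, not the authors' implementation: the module names, the cross-attention aggregation, the max-pooled similarities, and the sigmoid gate are stand-ins for the SPG/TPG networks, the matching modules, and the fusion mechanism named in the outline below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeAggregator(nn.Module):
    """Aggregates a set of features into K prototypes via learned-query
    cross-attention (an assumed design, not the paper's exact network)."""
    def __init__(self, dim: int, num_prototypes: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) -> prototypes: (B, K, D)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        protos, _ = self.attn(q, feats, feats)
        return protos

class HierarchicalPrototypeAlignment(nn.Module):
    """Two complementary alignment stages: object-phrase (spatial)
    and event-sentence (temporal), fused into a single score."""
    def __init__(self, dim: int = 256, k_obj: int = 8, k_evt: int = 4):
        super().__init__()
        self.spatial_protos = PrototypeAggregator(dim, k_obj)   # SPG stand-in
        self.temporal_protos = PrototypeAggregator(dim, k_evt)  # TPG stand-in
        self.gate = nn.Linear(2 * dim, 1)  # dynamic fusion weight

    def forward(self, region_feats, clip_feats, phrase_emb, sent_emb):
        # region_feats: (B, R, D) spatial regions; clip_feats: (B, T, D) clips
        # phrase_emb: (B, P, D) phrase embeddings; sent_emb: (B, D) sentence
        obj_p = self.spatial_protos(region_feats)   # (B, K_obj, D)
        evt_p = self.temporal_protos(clip_feats)    # (B, K_evt, D)

        # Object-phrase matching: best prototype-phrase cosine similarity.
        s_op = torch.einsum('bkd,bpd->bkp',
                            F.normalize(obj_p, dim=-1),
                            F.normalize(phrase_emb, dim=-1)).amax(dim=(1, 2))

        # Event-sentence matching: best event-prototype cosine similarity.
        s_es = torch.einsum('bkd,bd->bk',
                            F.normalize(evt_p, dim=-1),
                            F.normalize(sent_emb, dim=-1)).amax(dim=1)

        # Dynamic fusion: a query-conditioned gate balances the two stages.
        g = torch.sigmoid(self.gate(torch.cat(
            [phrase_emb.mean(1), sent_emb], dim=-1))).squeeze(-1)
        return g * s_op + (1.0 - g) * s_es
```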

2. Related Work
2.1. Video Temporal Grounding
2.2. Prototype Learning
3. Methodology
3.1. Problem Definition

3.2. Preparation
3.3. Object-Phrase Prototype Alignment
3.3.1. Spatial Prototype Generation

3.3.2. Object-Phrase Matching
3.4. Event-Sentence Prototype Alignment
3.4.1. Temporal Prototype Generation

3.4.2. Event-Sentence Matching
3.5. Fusion of Matching Results
3.6. Grounding Head
3.6.1. Moment Candidates Generation
3.6.2. Answer Prediction
3.7. Training Objective
3.7.1. Cross-Modal Alignment Loss
3.7.2. Binary Cross-Entropy Loss
4. Experiments
4.1. Datasets
4.2. Experimental Settings
4.2.1. Evaluation Metrics
4.2.2. Implementation Details
4.3. Comparison to State-of-the-Art Methods
4.3.1. Comparative Methods
4.3.2. Quantitative Comparison
Quantitative comparison on Charades-STA (%).

| Type | Method | R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | mIoU |
|---|---|---|---|---|---|
| Global Alignment | VSLNet [21] | 70.46 | 54.19 | 35.22 | 50.02 |
| | 2D-TAN [19] | 57.31 | 39.70 | 23.31 | 39.23 |
| | GDP [32] | 54.54 | 39.47 | 18.49 | – |
| | CPN [23] | 64.41 | 46.08 | 25.06 | 43.90 |
| | DCM [33] | – | 47.80 | 28.00 | 43.10 |
| | CI-MHA [61] | 69.87 | 54.68 | 35.27 | – |
| | MMN [20] | – | 47.31 | 27.28 | – |
| | PS-VTG [34] | – | 39.22 | 20.17 | – |
| | ViGA [35] | 60.22 | 36.72 | 17.20 | 38.62 |
| | D-TSG [36] | – | 65.05 | 42.77 | – |
| | D3G [37] | – | 41.64 | 19.60 | – |
| | MRTNet [22] | 59.23 | 44.27 | 25.88 | 40.59 |
| Local Alignment | DRN [39] | – | 42.90 | 23.68 | – |
| | LCNet [40] | 59.60 | 39.19 | 18.87 | 38.94 |
| | CBLN [24] | – | 43.67 | 24.44 | – |
| | SeqPAN [27] | 73.84 | 60.86 | 41.34 | 53.92 |
| | MGSL [28] | – | 63.98 | 41.03 | – |
| | PFU [41] | 71.57 | 54.66 | 28.34 | 48.65 |
| | VDI [42] | – | 46.47 | 28.63 | 41.60 |
| | UniVTG [62] | 70.81 | 58.01 | 35.65 | 50.10 |
| | MS-DETR [63] | 71.34 | 59.62 | 36.48 | 50.59 |
| | Ours | 69.91 | 49.32 | 33.68 | 45.03 |
Quantitative comparison on ActivityNet Captions (%).

| Type | Method | R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | mIoU |
|---|---|---|---|---|---|
| Global Alignment | 2D-TAN [19] | 59.45 | 44.51 | 26.54 | 43.29 |
| | VSLNet [21] | 63.16 | 43.22 | 26.16 | 43.19 |
| | GDP [32] | 56.17 | 39.27 | – | 39.80 |
| | CI-MHA [61] | 61.49 | 43.97 | 25.13 | – |
| | CPN [23] | 62.81 | 45.10 | 28.10 | 45.70 |
| | DCM [33] | – | 44.90 | 27.70 | 43.30 |
| | D-TSG [36] | – | 54.29 | 33.64 | – |
| | MMN [20] | 65.05 | 48.59 | 29.26 | – |
| | PS-VTG [34] | 59.71 | 39.59 | 21.98 | – |
| | ViGA [35] | 59.61 | 35.79 | 16.96 | 40.12 |
| | D3G [37] | 58.25 | 36.68 | 18.54 | – |
| | MRTNet [22] | 60.71 | 45.59 | 28.07 | 44.54 |
| Local Alignment | LGI [25] | 58.52 | 41.51 | 23.07 | 41.13 |
| | DRN [39] | 58.52 | 41.51 | 23.07 | 43.13 |
| | SeqPAN [27] | 61.65 | 45.50 | 29.37 | 45.11 |
| | CBLN [24] | 66.34 | 48.12 | 27.60 | – |
| | MAT [38] | – | 48.02 | 31.78 | – |
| | LCNet [40] | 48.49 | 26.33 | – | 34.29 |
| | MGSL [28] | – | 51.87 | 31.42 | – |
| | VDI [42] | – | 32.35 | 16.02 | 34.32 |
| | PFU [41] | 59.63 | 36.35 | 16.61 | 40.15 |
| | Ours | 67.73 | 56.81 | 35.48 | 46.03 |
Quantitative comparison on TACoS (%).

| Type | Method | R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | mIoU |
|---|---|---|---|---|---|
| Global Alignment | VSLNet [21] | 29.61 | 24.27 | 20.03 | 24.11 |
| | 2D-TAN [19] | 37.29 | 25.32 | 13.32 | 25.19 |
| | GDP [32] | 24.14 | – | – | 16.18 |
| | CPN [23] | 47.69 | 36.33 | – | 34.49 |
| | D-TSG [36] | 46.32 | 35.91 | – | – |
| | PS-VTG [34] | 23.64 | 10.00 | 3.35 | – |
| | ViGA [35] | 19.62 | 8.85 | 3.22 | 15.47 |
| | MMN [20] | 38.57 | 27.24 | – | – |
| | D3G [37] | 27.27 | 12.67 | 4.70 | – |
| | MRTNet [22] | 37.81 | 26.01 | 14.95 | 26.29 |
| Local Alignment | DRN [39] | – | 23.17 | – | – |
| | CBLN [24] | 38.98 | 27.65 | – | – |
| | SeqPAN [27] | 31.72 | 27.19 | 21.65 | 25.86 |
| | MAT [38] | 48.79 | 37.57 | – | – |
| | MGSL [28] | 42.54 | 32.27 | – | – |
| | MESM [26] | 52.69 | 39.52 | – | 36.94 |
| | UniVTG [62] | 51.44 | 34.97 | 17.35 | 33.60 |
| | MS-DETR [63] | 53.16 | 39.65 | 23.42 | 37.01 |
| | Ours | 49.65 | 34.77 | 21.86 | 33.89 |
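All three comparisons report Rank@1 at IoU thresholds of 0.3, 0.5, and 0.7 together with mIoU. Assuming the conventional definitions used throughout the temporal-grounding literature (a reference sketch, not code from the paper), these metrics can be computed as follows:

```python
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rank1_and_miou(preds: List[Tuple[float, float]],
                   gts: List[Tuple[float, float]],
                   thresholds=(0.3, 0.5, 0.7)):
    """Rank@1 accuracy at each IoU threshold, plus mean IoU, in percent."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    rank1 = {t: 100.0 * sum(i >= t for i in ious) / len(ious)
             for t in thresholds}
    miou = 100.0 * sum(ious) / len(ious)
    return rank1, miou
```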
4.4. Ablation Study
4.4.1. Effect of Individual Components
4.4.2. Analysis of Prototype Configuration
4.4.3. Analysis of the Fusion Mechanism
4.4.4. Parameter Sensitivity
4.4.5. Analysis of Kernel Size


4.5. Efficiency Analysis
4.6. Qualitative Analysis
4.6.1. Visualization of Prototypes
4.6.2. Visualization of Score Map
4.6.3. Grounding Results
4.7. Case Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Liong, V.E.; Lu, J.; Tan, Y.P.; Zhou, J. Cross-Modal Deep Variational Hashing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Zhang, A.; Fei, H.; Yao, Y.; Ji, W.; Li, L.; Liu, Z.; Chua, T.S. VPGTrans: Transfer Visual Prompt Generator across LLMs. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 20299–20319.
- Li, S.; Li, B.; Sun, B.; Weng, Y. Towards Visual-Prompt Temporal Answer Grounding in Instructional Video. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8836–8853.
- Hu, R.; Singh, A. UniT: Multimodal Multitask Learning with a Unified Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 1439–1449.
- Sun, J.; Li, Y.; Fang, H.S.; Lu, C. Three Steps to Multimodal Trajectory Prediction: Modality Clustering, Classification and Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 13250–13259.
- Su, S.; Zhong, Z.; Zhang, C. Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Gao, J.; Sun, C.; Yang, Z.; Nevatia, R. TALL: Temporal Activity Localization via Language Query. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Hendricks, L.A.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; Russell, B. Localizing Moments in Video with Natural Language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Chu, Y.W.; Lin, K.Y.; Hsu, C.C.; Ku, L.W. End-to-End Recurrent Cross-Modality Attention for Video Dialogue. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2456–2464.
- Ji, W.; Li, Y.; Wei, M.; Shang, X.; Xiao, J.; Ren, T.; Chua, T.S. VidVRD 2021: The Third Grand Challenge on Video Relation Detection. In Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), 20–24 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 4779–4783.
- Shang, X.; Li, Y.; Xiao, J.; Ji, W.; Chua, T.S. Video Visual Relation Detection via Iterative Inference. In Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), 20–24 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 3654–3663.
- Shang, X.; Ren, T.; Guo, J.; Zhang, H.; Chua, T.S. Video Visual Relation Detection. In Proceedings of the 25th ACM International Conference on Multimedia (MM ’17), 23–27 October 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 1300–1308.
- Li, Y.; Wang, X.; Xiao, J.; Ji, W.; Chua, T.S. Invariant Grounding for Video Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2928–2937.
- Xiao, J.; Yao, A.; Liu, Z.; Li, Y.; Ji, W.; Chua, T.S. Video as Conditional Graph Hierarchy for Multi-Granular Question Answering. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2804–2812.
- Zhong, Y.; Ji, W.; Xiao, J.; Li, Y.; Deng, W.; Chua, T.S. Video Question Answering: Datasets, Algorithms and Challenges. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 6439–6455.
- Liang, V.W.; Zhang, Y.; Kwon, Y.; Yeung, S.; Zou, J.Y. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 17612–17625.
- Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. VideoBERT: A Joint Model for Video and Language Representation Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Changpinyo, S.; Pont-Tuset, J.; Ferrari, V.; Soricut, R. Telling the What While Pointing to the Where: Multimodal Queries for Image Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 12136–12146.
- Zhang, S.; Peng, H.; Fu, J.; Luo, J. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language. arXiv 2020, arXiv:1912.03590.
- Wang, Z.; Wang, L.; Wu, T.; Li, T.; Wu, G. Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding. arXiv 2021, arXiv:2109.04872.
- Zhang, H.; Sun, A.; Jing, W.; Zhou, J.T. Span-based Localizing Network for Natural Language Video Localization. arXiv 2021, arXiv:2004.13931.
- Ji, W.; Qin, Y.; Chen, L.; Wei, Y.; Wu, Y.; Zimmermann, R. MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 2770–2774.
- Zhao, Y.; Zhao, Z.; Zhang, Z.; Lin, Z. Cascaded Prediction Network via Segment Tree for Temporal Video Grounding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4195–4204.
- Liu, D.; Qu, X.; Dong, J.; Zhou, P.; Cheng, Y.; Wei, W.; Xu, Z.; Xie, Y. Context-aware Biaffine Localizing Network for Temporal Sentence Grounding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11230–11239.
- Mun, J.; Cho, M.; Han, B. Local-Global Video-Text Interactions for Temporal Grounding. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10807–10816.
- Liu, Z.; Li, J.; Xie, H.; Li, P.; Ge, J.; Liu, S.A.; Jin, G. Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence (AAAI’24/IAAI’24/EAAI’24); AAAI Press: Washington, DC, USA, 2024.
- Zhang, H.; Sun, A.; Jing, W.; Zhen, L.; Zhou, J.T.; Goh, S.M.R. Parallel Attention Network with Sequence Matching for Video Grounding. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021.
- Liu, D.; Qu, X.; Di, X.; Cheng, Y.; Xu, Z.; Zhou, P. Memory-Guided Semantic Learning Network for Temporal Sentence Grounding. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1665–1673.
- Lin, C.; Wu, A.; Liang, J.; Zhang, J.; Ge, W.; Zheng, W.S.; Shen, C. Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 38655–38666.
- Choi, J.; Gao, C.; Messou, J.C.E.; Huang, J.B. Why Can’t I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
- Wang, J.; Ge, Y.; Cai, G.; Yan, R.; Lin, X.; Shan, Y.; Qie, X.; Shou, M.Z. Object-Aware Video-Language Pre-Training for Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3313–3322.
- Chen, L.; Lu, C.; Tang, S.; Xiao, J.; Zhang, D.; Tan, C.; Li, X. Rethinking the Bottom-Up Framework for Query-Based Video Localization. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10551–10558.
- Yang, X.; Feng, F.; Ji, W.; Wang, M.; Chua, T.S. Deconfounded Video Moment Retrieval with Causal Intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21), 11–15 July 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1–10.
- Xu, Z.; Wei, K.; Yang, X.; Deng, C. Point-Supervised Video Temporal Grounding. IEEE Trans. Multimed. 2023, 25, 6121–6131.
- Cui, R.; Qian, T.; Peng, P.; Daskalaki, E.; Chen, J.; Guo, X.; Sun, H.; Jiang, Y.G. Video Moment Retrieval from Text Queries via Single Frame Annotation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22), 11–15 July 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1033–1043.
- Liu, D.; Qu, X.; Hu, W. Reducing the Vision and Language Bias for Temporal Sentence Grounding. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), 10–14 October 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 4092–4101.
- Li, H.; Shu, X.; He, S.; Qiao, R.; Wen, W.; Guo, T.; Gan, B.; Sun, X. D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13688–13700.
- Zhang, M.; Yang, Y.; Chen, X.; Ji, Y.; Xu, X.; Li, J.; Shen, H.T. Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12669–12678.
- Zeng, R.; Xu, H.; Huang, W.; Chen, P.; Tan, M.; Gan, C. Dense Regression Network for Video Grounding. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10284–10293.
- Yang, W.; Zhang, T.; Zhang, Y.; Wu, F. Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding. IEEE Trans. Image Process. 2021, 30, 3252–3262.
- Ju, C.; Wang, H.; Liu, J.; Ma, C.; Zhang, Y.; Zhao, P.; Chang, J.; Tian, Q. Constraint and Union for Partially-Supervised Temporal Sentence Grounding. arXiv 2023, arXiv:2302.09850.
- Luo, D.; Huang, J.; Gong, S.; Jin, H.; Liu, Y. Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 23045–23055.
- Liu, C.; Zhao, R.; Chen, J.; Qi, Z.; Zou, Z.; Shi, Z. A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5622018.
- Liu, C.; Chen, K.; Zhang, H.; Qi, Z.; Zou, Z.; Shi, Z. Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5635616.
- Reed, S.K. Pattern recognition and categorization. Cogn. Psychol. 1972, 3, 382–407.
- Rosch, E.; Mervis, C.B.; Gray, W.D.; Johnson, D.M.; Boyes-Braem, P. Basic objects in natural categories. Cogn. Psychol. 1976, 8, 382–439.
- Graf, A.B.A.; Bousquet, O.; Rätsch, G.; Schölkopf, B. Prototype Classification: Insights from Machine Learning. Neural Comput. 2009, 21, 272–300.
- Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-shot Learning. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
- Pahde, F.; Puscas, M.; Klein, T.; Nabi, M. Multimodal Prototypical Networks for Few-Shot Learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2644–2653.
- Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification. Proc. AAAI Conf. Artif. Intell. 2019, 33, 6407–6414.
- Zhao, K.; Jin, X.; Bai, L.; Guo, J.; Cheng, X. Knowledge-Enhanced Self-Supervised Prototypical Network for Few-Shot Event Detection. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 6266–6275.
- Hu, Z.; Li, Z.; Xu, D.; Bai, L.; Jin, C.; Jin, X.; Guo, J.; Cheng, X. ProtoEM: A Prototype-Enhanced Matching Framework for Event Relation Extraction. arXiv 2023, arXiv:2309.12892.
- Huang, Y.; Yang, L.; Sato, Y. Compound Prototype Matching for Few-Shot Action Recognition. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 351–368.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 8748–8763.
- Li, P.; Xie, C.W.; Zhao, L.; Xie, H.; Ge, J.; Zheng, Y.; Zhao, D.; Zhang, Y. Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 4100–4110.
- Zhang, D.; Dai, X.; Wang, X.; Wang, Y.F.; Davis, L.S. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; Niebles, J.C. Dense-Captioning Events in Videos. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 706–715.
- Rohrbach, M.; Regneri, M.; Andriluka, M.; Amin, S.; Pinkal, M.; Schiele, B. Script data for attribute-based recognition of composite activities. In Proceedings of the 12th European Conference on Computer Vision (ECCV’12), Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 144–157.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
- Hosang, J.; Benenson, R.; Schiele, B. Learning Non-Maximum Suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Yu, X.; Malmir, M.; He, X.; Chen, J.; Wang, T.; Wu, Y.; Liu, Y.; Liu, Y. Cross Interaction Network for Natural Language Guided Video Moment Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21), 11–15 July 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1860–1864.
- Lin, K.Q.; Zhang, P.; Chen, J.; Pramanick, S.; Gao, D.; Wang, A.J.; Yan, R.; Shou, M.Z. UniVTG: Towards Unified Video-Language Temporal Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 2794–2804.
- Ma, H.; Wang, G.; Yu, F.; Jia, Q.; Ding, S. MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), 27–31 October 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 4514–4523.




Ablation study on ActivityNet Captions (%). OPPM: object-phrase prototype alignment (SPG: spatial prototype generation); ESPM: event-sentence prototype alignment (TPG: temporal prototype generation).

| ID | SPG (OPPM) | Matching (OPPM) | TPG (ESPM) | Matching (ESPM) | R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | mIoU |
|---|---|---|---|---|---|---|---|---|
| 1 | ✗ | ✗ | ✓ | ✓ | 61.39 | 52.93 | 32.91 | 42.05 |
| 2 | ✓ | ✓ | ✗ | ✗ | 63.39 | 53.88 | 33.43 | 43.35 |
| 3 | ⟲TPG | ✓ | ✓ | ✓ | 64.06 | 54.39 | 34.13 | 43.89 |
| 4 | ✓ | ✓ | ⟲SPG | ✓ | 63.39 | 53.91 | 33.82 | 43.93 |
| 5 | ✓ | P–P | ✓ | ✓ | 62.73 | 53.42 | 32.52 | 42.97 |
| 6 | ✓ | O–W | ✓ | ✓ | 62.37 | 52.91 | 32.21 | 42.57 |
| 7 | ✓ | ✓ | –F | ✓ | 64.73 | 54.37 | 33.74 | 43.81 |
| 8 | ✓ | ✓ | –M | ✓ | 65.40 | 54.86 | 34.04 | 44.27 |
| 9 | ✓ | ✓ | –R | ✓ | 64.06 | 54.88 | 33.13 | 44.89 |
| 10 | ✓ | F–W | ✓ | ✓ | 64.73 | 54.37 | 34.04 | 44.81 |
| 11 | ✓ | ✓ | ✓ | –S | 63.39 | 53.39 | 33.87 | 43.43 |
| 12 | ✓ | mean | ✓ | ✓ | 66.18 | 54.73 | 32.69 | 44.20 |
| 13 | ✓ | ✓ | ✓ | ✓ | 67.73 | 56.81 | 35.48 | 46.03 |
Analysis of prototype configurations on ActivityNet Captions (%).

| R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | mIoU |
|---|---|---|---|
| 60.31 | 51.82 | 32.13 | 41.95 |
| 61.24 | 52.41 | 33.34 | 42.69 |
| 63.33 | 54.52 | 33.86 | 43.17 |
| 62.52 | 53.83 | 32.95 | 42.52 |
| 63.73 | 54.81 | 33.48 | 43.79 |
| 62.55 | 53.33 | 32.34 | 42.98 |
| 67.73 | 56.81 | 35.48 | 46.03 |
Comparison of fusion strategies on ActivityNet Captions (%).

| Fusion Strategy | OPPM Weight | ESPM Weight | R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | mIoU |
|---|---|---|---|---|---|---|
| Static fusion | 0.0 | 1.0 | 61.39 | 52.93 | 32.91 | 42.05 |
| | 0.3 | 0.7 | 64.10 | 53.82 | 33.62 | 43.72 |
| | 0.5 | 0.5 | 65.55 | 54.23 | 34.85 | 44.68 |
| | 0.7 | 0.3 | 64.23 | 53.95 | 33.68 | 43.92 |
| | 1.0 | 0.0 | 63.39 | 53.88 | 33.43 | 43.35 |
| Dynamic fusion | – | – | 67.73 | 56.81 | 35.48 | 46.03 |
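The static rows fix the two branch weights by hand, whereas dynamic fusion infers the balance per query. A minimal sketch of the contrast follows; the query-conditioned sigmoid gate is an assumed realization of the dynamic variant, not the paper's exact module.

```python
import torch
import torch.nn as nn

def static_fusion(s_op: torch.Tensor, s_es: torch.Tensor,
                  w_op: float, w_es: float) -> torch.Tensor:
    """Fixed convex combination of the object-phrase and event-sentence
    matching scores, as in the 'Static fusion' rows above."""
    return w_op * s_op + w_es * s_es

class DynamicFusion(nn.Module):
    """Predicts a per-sample weight from the query embedding, so the
    balance between the two stages adapts to each sentence."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, s_op: torch.Tensor, s_es: torch.Tensor,
                query_emb: torch.Tensor) -> torch.Tensor:
        w = self.gate(query_emb).squeeze(-1)  # (B,) values in [0, 1]
        return w * s_op + (1.0 - w) * s_es
```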
| Dataset | FLOPs (B) | Params (M) | Time (ms) |
|---|---|---|---|
| Charades-STA | 1.454 | 162.261 | 130.674 |
| ActivityNet Captions | 2.275 | 162.261 | 560.992 |
| TACoS | 2.293 | 162.261 | 570.736 |
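Parameter counts and per-query latency of the kind reported above can be measured with plain PyTorch as sketched below; FLOP counting typically needs an external profiler and is omitted. `model` and `inputs` are placeholders for the grounding network and a prepared batch.

```python
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions, matching the 'Params (M)' column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def mean_latency_ms(model, inputs, warmup: int = 10, runs: int = 100) -> float:
    """Average forward-pass wall-clock time in milliseconds."""
    model.eval()
    for _ in range(warmup):          # warm-up passes excluded from timing
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # flush pending GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / runs
```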