Causal Visual–Semantic Enhancement for Video-Text Retrieval
Abstract
1. Introduction
- (1) We identify and analyze the contextual observational co-occurrence bias present in video-text retrieval, as well as the common-sense causal relationships implicitly embedded in visual semantics.
- (2) To eliminate the harmful effects of contextual observational co-occurrence bias, we construct a causal graph and adopt a back-door intervention to extract common-sense causal visual features. Furthermore, we use the text as a conditional guide for frame aggregation, enhancing the representation of semantically relevant frames while suppressing redundant ones; minimal sketches of both mechanisms follow this list.
- (3) We design experiments to verify the effectiveness of our method. Results on three large-scale benchmarks (MSR-VTT, MSVD, and LSMDC) show that the proposed method outperforms state-of-the-art retrieval models.
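The back-door intervention in contribution (2) follows the standard adjustment formula from Pearl's causality framework [23]: instead of conditioning on the observed context, the model stratifies over a confounder set $Z$ and averages out its influence. A minimal sketch is below; approximating the outer expectation with the normalized weighted geometric mean (NWGM) so that it can be computed in a single forward pass is a common practice in visual causal intervention, and the concrete confounder dictionary and fusion function $f$ are design choices of Section 3, not fixed by this formula:

```latex
P\big(Y \mid \mathrm{do}(X)\big)
  = \sum_{z \in Z} P\big(Y \mid X, z\big)\, P(z)
  \;\overset{\text{NWGM}}{\approx}\;
  \operatorname{Softmax}\Big( f\big(X,\ \textstyle\sum_{z \in Z} z \cdot P(z)\big) \Big)
```

Intuitively, frequent contexts (e.g., a kitchen co-occurring with cooking) no longer dominate the visual representation, because each confounder contributes according to its prior $P(z)$ rather than its co-occurrence with $X$.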
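The text-conditioned frame aggregation in contribution (2) can likewise be sketched as similarity-weighted pooling over frames, in the spirit of X-Pool [19]. The function name, the plain dot-product scoring, and the temperature value below are illustrative assumptions rather than the paper's exact module:

```python
import torch
import torch.nn.functional as F

def text_conditioned_pool(text_emb: torch.Tensor,
                          frame_embs: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Aggregate per-frame features into one video feature, guided by the query text.

    text_emb:   (B, D) sentence embedding of each text query.
    frame_embs: (B, T, D) frame embeddings of the paired candidate video.
    Frames semantically close to the text receive large weights;
    redundant or irrelevant frames are suppressed.
    """
    t = F.normalize(text_emb, dim=-1)                       # (B, D)
    f = F.normalize(frame_embs, dim=-1)                     # (B, T, D)
    sims = torch.einsum("bd,btd->bt", t, f)                 # text-frame cosine similarities
    weights = torch.softmax(sims / temperature, dim=-1)     # (B, T), sums to 1 over frames
    return torch.einsum("bt,btd->bd", weights, frame_embs)  # text-guided video embedding
```

The resulting video embedding is query-dependent: the same video is pooled differently for different captions, which is what allows semantically relevant frames to be emphasized per query.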
2. Related Work
3. Proposed Method
3.1. Causal Graph and Causal Intervention
3.2. Common-Sense Causal Relationships in Visual Semantics
3.3. Causal Visual–Semantic Enhancement for Video-Text Retrieval
3.4. Model Training
4. Experiments and Results
4.1. Experimental Data and Evaluation Metrics
4.2. Experimental Details
4.3. Experimental Results
4.4. Ablation Study
4.5. Qualitative Analyses
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
| Metric | Text to Video () |
|---|---|
| R@1 | 0.028 |
| R@5 | 0.017 |
| R@10 | 0.012 |
References
- Omama, M.; Li, P.H.; Chinchali, S.P. Exploiting Distribution Constraints for Scalable and Efficient Image Retrieval. arXiv 2025, arXiv:2410.07022. [Google Scholar]
- Tian, K.; Cheng, Y.; Liu, Y.; Hou, X.; Chen, Q.; Li, H. Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning. arXiv 2024, arXiv:2401.00701. [Google Scholar] [CrossRef]
- Shen, L.; Hao, T.; He, T.; Zhao, S.; Zhang, Y.; Liu, P.; Bao, Y.; Ding, G. TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval. arXiv 2025, arXiv:2409.01156. [Google Scholar]
- Bai, Z.; Xiao, T.; He, T.; Wang, P.; Zhang, Z.; Brox, T.; Shou, M.Z. Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach. arXiv 2025, arXiv:2408.07249. [Google Scholar]
- Maaz, M.; Rasheed, H.; Khan, S.; Khan, F.S. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv 2023, arXiv:2306.05424. [Google Scholar]
- Zhang, S.; Mu, H.; Li, Q.; Xiao, C.; Liu, T. Fine-Grained Features Alignment and Fusion for Text-Video Cross-Modal Retrieval. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024. [Google Scholar] [CrossRef]
- Zhu, C.; Jia, Q.; Chen, W.; Guo, Y.; Liu, Y. Deep learning for video-text retrieval: A review. Int. J. Multimed. Inf. Retr. 2023, 12, 3. [Google Scholar] [CrossRef]
- Yu, K.P.; Zhang, Z.; Hu, F.; Chai, J. Efficient in-context learning in vision-language models for egocentric videos. arXiv 2023, arXiv:2311.17041. [Google Scholar]
- Dong, J.; Li, X.; Xu, C.; Yang, X.; Yang, G.; Wang, X.; Wang, M. Dual encoding for video retrieval by text. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4065–4080. [Google Scholar] [CrossRef]
- Wen, J.; Chen, Y.; Shi, R.; Ji, W.; Yang, M.; Gao, D.; Yuan, J.; Zimmermann, R. HOVER: Hyperbolic Video-Text Retrieval. IEEE Trans. Image Process. 2025, 34, 6192–6203. [Google Scholar] [CrossRef]
- Chen, X.; Liu, D.; Yang, X.; Li, X.; Dong, J.; Wang, M.; Wang, X. PRVR: Partially relevant video retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 48, 1262–1277. [Google Scholar] [CrossRef]
- Wang, X.; Zhu, L.; Yang, Y. T2VLAD: Global-local sequence alignment for text-video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5079–5088. [Google Scholar]
- Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1728–1738. [Google Scholar]
- Liu, S.; Fan, H.; Qian, S.; Chen, Y.; Ding, W.; Wang, Z. HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval. In Proceedings of the 2021 IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Dzabraev, M.; Kalashnikov, M.; Komkov, S.; Petiushko, A. MDMMT: Multidomain multimodal transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 3354–3363. [Google Scholar] [CrossRef]
- Wang, J.; Wang, C.; Huang, K.; Huang, J.; Jin, L. VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models. arXiv 2024, arXiv:2410.00741. [Google Scholar]
- Deng, C.; Chen, Q.; Qin, P.; Chen, D.; Wu, Q. Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval. arXiv 2023. [Google Scholar] [CrossRef]
- Liu, Y.; Xiong, P.; Xu, L.; Cao, S.; Jin, Q. TS2-Net: Token shift and selection transformer for text-video retrieval. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 319–335. [Google Scholar]
- Gorti, S.K.; Vouitsis, N.; Ma, J.; Golestan, K.; Volkovs, M.; Garg, A.; Yu, G. X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 5006–5015. [Google Scholar] [CrossRef]
- Fang, H.; Xiong, P.; Xu, L.; Chen, Y. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv 2021. [Google Scholar] [CrossRef]
- Ma, Y.; Xu, G.; Sun, X.; Yan, M.; Zhang, J.; Ji, R. X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval. arXiv 2022, arXiv:2207.07285. [Google Scholar]
- Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; Li, T. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv 2021. [Google Scholar] [CrossRef]
- Pearl, J. Causality: Models, Reasoning and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009; p. 478. [Google Scholar]
- Park, J.J.; Choi, S.J. Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives. arXiv 2024, arXiv:2412.10720. [Google Scholar] [CrossRef]
- Li, Y.; Li, R.; Ma, Y.; Xue, Y.; Meng, L. FCA: A Causal Inference Based Method for Analyzing the Failure Causes of Object Detection Algorithms. In Proceedings of the 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), Chiang Mai, Thailand, 22–26 October 2023. [Google Scholar]
- Chen, Z.; Hu, L.; Li, W.; Shao, Y.; Nie, L. Causal Intervention and Counterfactual Reasoning for Multi-modal Fake News Detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 627–638. [Google Scholar]
- Xue, D.; Qian, S.; Xu, C. Variational Causal Inference Network for Explanatory Visual Question Answering. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
- Cai, L.; Fang, H.; Xu, N.; Ren, B. Counterfactual Causal-Effect Intervention for Interpretable Medical Visual Question Answering. IEEE Trans. Med. Imaging 2024, 43, 4430–4441. [Google Scholar] [CrossRef]
- Liu, Y.; Qin, G.; Chen, H.; Cheng, Z.; Yang, X. Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval. Proc. AAAI Conf. Artif. Intell. 2024, 38, 14052–14060. [Google Scholar] [CrossRef]
- Satar, B.; Zhu, H.; Zhang, H.; Lim, J.H. Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention. arXiv 2023, arXiv:2309.09311. [Google Scholar] [CrossRef]
- Cheng, D.; Kong, S.; Wang, W.; Qu, M.; Jiang, B. Long Term Memory-Enhanced Via Causal Reasoning for Text-To-Video Retrieval. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024. [Google Scholar]
- Wu, Y.; Qi, Z.; Wu, Y.; Sun, J.; Wang, Y.; Wang, S. Learning Fine-Grained Representations through Textual Token Disentanglement. 2025. Available online: https://openreview.net/forum?id=wGa2plE8ka (accessed on 27 May 2025).
- Zou, X.; Wu, C.; Cheng, L.; Wang, Z. TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval. arXiv 2022, arXiv:2209.13822. [Google Scholar]
- Wang, Z.; Sung, Y.L.; Cheng, F.; Bertasius, G.; Bansal, M. Unified Coarse-to-Fine Alignment for Video-Text Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1747–1756. Available online: https://openaccess.thecvf.com/content/ICCV2023/html/Wang_Unified_Coarse-to-Fine_Alignment_for_Video-Text_Retrieval_ICCV_2023_paper.html (accessed on 25 January 2026).
- Yang, X.; Zhu, L.; Wang, X.; Yang, Y. DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval. arXiv 2024, arXiv:2401.10588. [Google Scholar] [CrossRef]
- Chen, L.; Deng, Z.; Liu, L.; Yin, S. Multilevel semantic interaction alignment for video-text cross-modal retrieval. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6559–6575. [Google Scholar] [CrossRef]
- Chen, W.; Liu, Y.; Wang, C.; Zhu, J.; Zhao, S.; Li, G.; Liu, C.L.; Lin, L. Cross-Modal Causal Intervention for Medical Report Generation. arXiv 2024, arXiv:2303.09117. [Google Scholar] [CrossRef]
- Rehman, M.U.; Nizami, I.F.; Ullah, F.; Hussain, I. IQA Vision Transformed: A Survey of Transformer Architectures in Perceptual Image Quality Assessment. IEEE Access 2024, 12, 183369–183393. [Google Scholar] [CrossRef]
- Li, W.; Li, Z.; Yang, X.; Ma, H. Causal-ViT: Robust Vision Transformer by causal intervention. Eng. Appl. Artif. Intell. 2023, 126, 107123. [Google Scholar] [CrossRef]
- Li, Z.; Wang, H.; Liu, D.; Zhang, C.; Ma, A.; Long, J.; Cai, W. Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images. arXiv 2024, arXiv:2408.08105. [Google Scholar]
- Chen, C.; Merullo, J.; Eickhoff, C. Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
- Liu, J.; Liu, Z.; Zhang, Z.; Wang, L.; Liu, M. A New Causal Inference Framework for SAR Target Recognition. IEEE Trans. Artif. Intell. 2024, 5, 4042–4057. [Google Scholar] [CrossRef]
- Wu, X.; Guo, R.; Li, Q.; Zhu, N. Visual Commonsense Causal Reasoning from A Still Image. IEEE Access 2025, 13, 85084–85097. [Google Scholar] [CrossRef]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv 2016, arXiv:1602.07332. [Google Scholar] [CrossRef]
- Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar] [CrossRef]
- Chen, D.; Dolan, W.B. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 190–200. [Google Scholar]
- Rohrbach, A.; Rohrbach, M.; Tandon, N.; Schiele, B. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3202–3212. [Google Scholar]
- Wang, Z.; Zhang, D.; Hu, Z. LSECA: Local semantic enhancement and cross aggregation for video-text retrieval. Int. J. Multimed. Inf. Retr. 2024, 13, 30. [Google Scholar] [CrossRef]
| Datasets | Clips | Captions | Source | Caption Source | Length | Year |
|---|---|---|---|---|---|---|
| MSR-VTT [45] | 10,000 | 200,000 | YouTube | Manual | 10–30 s | 2016 |
| MSVD [46] | 1,970 | 78,800 | YouTube | Manual | 1–62 s | 2011 |
| LSMDC [47] | 118,081 | 118,081 | Movies | Manual | 2–30 s | 2015 |
| Category | Method | T2V R@1 | T2V R@5 | T2V R@10 | T2V MnR | T2V MdR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MnR | V2T MdR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-CLIP, causal inference | DFLB [30] | 24.64 | 52.99 | 66.09 | 26.26 | 5.0 | - | - | - | - | - |
| Non-CLIP, causal inference | Memory Enhancing [31] | 24.8 | 51.6 | 64.9 | - | - | - | - | - | - | - |
| CLIP-based, cross-modal temporal fusion | CLIP2Video [20] | 45.6 | 72.6 | 81.7 | 14.6 | - | 43.5 | 72.3 | 82.1 | 10.2 | 2.0 |
| CLIP-based, cross-modal temporal fusion | X-CLIP [21] | 46.1 | 73.0 | 83.1 | 13.2 | - | 46.8 | 73.3 | 84.0 | 9.1 | 2.0 |
| CLIP-based, cross-modal temporal fusion | TS2-Net [18] | 46.5 | 73.6 | 83.3 | 13.9 | - | 45.3 | 74.1 | 83.7 | 9.2 | 2.0 |
| CLIP-based, cross-modal temporal fusion | X-Pool [19] | 46.9 | 72.8 | 82.2 | 14.3 | 2.0 | 44.4 | 73.3 | 84.0 | 9.0 | 2.0 |
| CLIP-based, cross-modal temporal fusion | Prompt Switch [17] | 47.8 | 73.9 | 82.2 | 14.1 | - | 46.0 | 74.3 | 84.8 | 8.5 | - |
| CLIP-based, coarse-to-fine representation learning | CLIP4Clip [22] | 44.5 | 71.4 | 81.6 | 15.3 | 2.0 | 42.7 | 70.9 | 80.6 | 11.6 | 2.0 |
| CLIP-based, coarse-to-fine representation learning | MSIA [36] | 47.2 | 73.8 | 84.1 | - | 2.0 | - | - | - | - | - |
| CLIP-based, coarse-to-fine representation learning | LSECA [48] | 47.1 | 74.9 | 82.8 | 14.9 | 2.0 | 47.5 | 75.4 | 83.4 | 12.3 | 2.0 |
| CLIP-based, parameter/inference-efficient | DGL [35] | 45.8 | 69.3 | 79.4 | 16.3 | - | 43.5 | 70.5 | 80.7 | 13.1 | - |
| CLIP-based, parameter/inference-efficient | TempMe [3] | 46.1 | 71.8 | 80.7 | 14.8 | - | - | - | - | - | - |
| CLIP-based, causal inference | Ours | 49.0 | 74.1 | 83.3 | 12.4 | 2.0 | 45.8 | 74.4 | 83.4 | 11.5 | 2.0 |
| Category | Method | T2V R@1 | T2V R@5 | T2V R@10 | T2V MnR | T2V MdR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MnR | V2T MdR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP-based, cross-modal temporal fusion | CLIP2Video [20] | 47.0 | 76.8 | 85.9 | 9.6 | - | 58.7 | 85.6 | 91.6 | 4.3 | - |
| CLIP-based, cross-modal temporal fusion | X-CLIP [21] | 47.1 | 77.8 | - | 9.5 | - | 60.9 | 87.8 | - | 4.7 | - |
| CLIP-based, cross-modal temporal fusion | X-Pool [19] | 47.2 | 77.4 | 86.0 | 9.3 | - | 66.4 | 90.0 | 94.2 | 3.3 | - |
| CLIP-based, cross-modal temporal fusion | Prompt Switch [17] | 47.1 | 76.9 | 86.1 | 9.5 | - | 68.5 | 91.8 | 95.6 | 2.8 | - |
| CLIP-based, coarse-to-fine representation learning | CLIP4Clip [22] | 45.2 | 75.5 | 84.3 | 10.3 | 2.0 | 62.0 | 87.3 | 92.6 | 4.3 | 1.0 |
| CLIP-based, coarse-to-fine representation learning | MSIA [36] | 45.4 | 74.4 | 83.3 | - | 2.0 | - | - | - | - | - |
| CLIP-based, coarse-to-fine representation learning | LSECA [48] | 46.9 | 76.8 | 85.7 | 9.9 | 2.0 | - | - | - | - | - |
| CLIP-based, causal inference | Ours | 49.4 | 77.3 | 86.0 | 9.3 | 2.0 | 68.5 | 91.2 | 95.0 | 3.0 | 1.0 |
| Category | Method | T2V R@1 | T2V R@5 | T2V R@10 | T2V MnR | T2V MdR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MnR | V2T MdR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP-based, cross-modal temporal fusion | X-CLIP [21] | 23.3 | 43.0 | - | 56.0 | - | 22.5 | 42.2 | - | 50.7 | - |
| CLIP-based, cross-modal temporal fusion | TS2-Net [18] | 23.4 | 42.3 | 50.1 | 56.9 | - | - | - | - | - | - |
| CLIP-based, cross-modal temporal fusion | X-Pool [19] | 25.2 | 43.7 | 53.5 | 53.2 | - | 22.7 | 42.6 | 51.2 | 47.4 | - |
| CLIP-based, cross-modal temporal fusion | Prompt Switch [17] | 23.1 | 41.7 | 50.5 | 56.8 | - | 22.0 | 40.8 | 50.3 | 51.0 | - |
| CLIP-based, coarse-to-fine representation learning | CLIP4Clip [22] | 22.6 | 41.0 | 49.1 | 61.0 | 11.0 | 20.8 | 39.0 | 48.6 | 54.2 | 12.0 |
| CLIP-based, coarse-to-fine representation learning | MSIA [36] | 19.7 | 38.1 | 47.5 | - | 12.0 | - | - | - | - | - |
| CLIP-based, coarse-to-fine representation learning | LSECA [48] | 23.4 | 43.1 | 50.4 | 56.0 | 10.0 | - | - | - | - | - |
| CLIP-based, parameter/inference-efficient | DGL [35] | 21.4 | 39.4 | 48.4 | 64.3 | - | - | - | - | - | - |
| CLIP-based, parameter/inference-efficient | TempMe [3] | 23.5 | 41.7 | 51.8 | 53.5 | - | - | - | - | - | - |
| CLIP-based, causal inference | Ours | 26.0 | 43.3 | 54.3 | 53.5 | 10.0 | 22.9 | 42.9 | 50.7 | 50.0 | 12.0 |
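For readers reproducing the three benchmark tables above: R@K is the percentage of queries whose ground-truth item appears in the top K retrieved results (higher is better), while MdR and MnR are the median and mean rank of the ground truth (lower is better). A minimal sketch of how these metrics are typically computed, assuming a square similarity matrix with matching pairs on the diagonal, as in standard MSR-VTT/MSVD/LSMDC evaluation:

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Compute R@1/R@5/R@10, MdR, MnR from a (queries x candidates)
    similarity matrix where sim[i, i] is the ground-truth pair."""
    order = np.argsort(-sim, axis=1)            # candidate indices, best match first
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1  # 1-indexed rank of the ground truth
    return {
        "R@1":  100.0 * np.mean(ranks <= 1),
        "R@5":  100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MdR":  float(np.median(ranks)),
        "MnR":  float(np.mean(ranks)),
    }

# Text-to-video uses the text-video similarity matrix; video-to-text uses its transpose.
```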
| Method | T2V R@1 | T2V R@5 | T2V R@10 |
|---|---|---|---|
| Baseline (no CI, no text conditioning) | 44.7 | 71.2 | 80.0 |
| CI only | 47.1 | 72.8 | 81.7 |
| Text conditioning only | 47.6 | 73.2 | 82.1 |
| CI + text conditioning | 49.0 | 74.1 | 83.3 |