Automated Dataset Construction for Composed Video Retrieval in Soccer
Abstract
1. Introduction
- We propose a framework for automatic triplet construction in soccer videos that introduces a two-stage integration of visual and commentary-based textual similarity, enabling the construction of counterfactual triplets without manual annotation.
- We introduce an MLLM-based evaluation method that formulates triplet validity assessment as a set of structured questions, providing an automatic screening signal for identifying potentially unreliable triplets.
2. Related Work
2.1. Sports Video Understanding and Retrieval
2.2. Automated Triplet Construction for CoVR
3. Proposed Method
- (i) Identifying relevant target videos by jointly leveraging visual similarity and semantic information from commentary captions.
- (ii) Generating query text that captures semantic differences between videos through an LLM.
- (iii) Estimating triplet validity and identifying potentially unreliable samples via MLLM-based evaluation.
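As a rough illustration of step (i), the sketch below selects a target clip in two stages from precomputed embeddings. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `cosine` and `pick_target`, the `top_k` parameter, the order of the stages (visual shortlist first, caption re-ranking second), and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_target(query_idx, visual_emb, text_emb, top_k=2):
    """Return the index of a target clip for the clip at query_idx.

    Stage 1 (assumed): shortlist the top_k other clips by visual similarity.
    Stage 2 (assumed): pick the shortlisted clip with the most similar caption.
    """
    others = [i for i in range(len(visual_emb)) if i != query_idx]
    shortlist = sorted(
        others,
        key=lambda i: cosine(visual_emb[query_idx], visual_emb[i]),
        reverse=True,
    )[:top_k]
    return max(shortlist, key=lambda i: cosine(text_emb[query_idx], text_emb[i]))


# Toy 2-D embeddings, made up for illustration only.
visual = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]

target = pick_target(0, visual, text)
print(target)
```

In this toy case, clip 3 has the caption closest to the query but is visually unrelated, so it never reaches the second stage. Clip 2 is chosen from the visual shortlist because its caption is closer to the query's than clip 1's.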
3.1. Multimodal Identification of Target Videos
3.2. Generation of Query Text via LLM
3.3. Triplet Validity Evaluation via an MLLM
- Q1: This criterion verifies the visual-textual alignment. It is designed to filter out triplets in which the query text does not reflect the actual visual difference, often due to temporal misalignment in captions or LLM hallucinations.
- Q2: This criterion serves as a content filter to ensure the triplet captures meaningful event-related transformations. In soccer broadcasts, this is essential for excluding irrelevant changes, such as fluctuations in the displayed score or superficial textual variations, that do not pertain to the actual play.
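The two criteria can be turned into a simple screening step once the MLLM has answered both questions for each triplet. The sketch below is illustrative only: the function name `screen`, the `"Y"`/`"N"` answer encoding, the toy answers, and the rule that only triplets answered "Y" to both questions are kept are assumptions, not details confirmed by the paper.

```python
from collections import Counter


def screen(answers):
    """Split triplets into kept/flagged and report the share of each answer pair.

    answers maps a triplet id to a (q1, q2) tuple of "Y"/"N" strings.
    Assumption: a triplet is kept only if both answers are "Y".
    """
    kept = [tid for tid, qa in answers.items() if qa == ("Y", "Y")]
    flagged = [tid for tid, qa in answers.items() if qa != ("Y", "Y")]
    counts = Counter(answers.values())
    total = len(answers)
    # Percentage of triplets per (Q1, Q2) answer combination.
    shares = {combo: 100.0 * n / total for combo, n in counts.items()}
    return kept, flagged, shares


# Toy MLLM answers, made up for illustration only.
answers = {
    "t1": ("Y", "Y"),
    "t2": ("Y", "N"),
    "t3": ("Y", "Y"),
    "t4": ("N", "N"),
}

kept, flagged, shares = screen(answers)
print(kept, flagged, shares)
```

The `shares` dictionary has the same shape as an answer-distribution table with one percentage per (Q1, Q2) combination, while `flagged` lists the triplets a human reviewer might inspect first.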
4. Experimental Results
4.1. Experimental Settings
4.2. Quantitative Experimental Results
4.3. Qualitative Experimental Results
5. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- [1] Zhang, Y.; Zhang, X.; Xu, C.; Lu, H. Personalized retrieval of sports video. In Proceedings of the International Workshop on Multimedia Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2007; pp. 313–322.
- [2] Hughes, M.; Franks, I. Notational Analysis of Sport: Systems for Better Coaching and Performance in Sport; Psychology Press: New York, NY, USA, 2004.
- [3] Guo, Y.; Chen, C.; Peng, J.; Deng, L.; Yuan, T. Does visual training enhance athletes’ decision-making skills and sport-specific performance? A systematic review and meta-analysis. Scand. J. Med. Sci. Sport. 2025, 35, e70140.
- [4] Zhang, Y.; Xu, C.; Zhang, X.; Lu, H. Personalized retrieval of sports video based on multi-modal analysis and user preference acquisition. Multimed. Tools Appl. 2009, 44, 305–330.
- [5] Toderici, G.; Aradhye, H.; Pasca, M.; Sbaiz, L.; Yagnik, J. Finding meaning on YouTube: Tag recommendation and category discovery. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2010; pp. 3447–3454.
- [6] Guo, C. Research on sports video retrieval algorithm based on semantic feature extraction. Multimed. Tools Appl. 2023, 82, 21941–21955.
- [7] Fang, H.; Xiong, P.; Xu, L.; Chen, Y. CLIP2Video: Mastering video-text retrieval via image CLIP. arXiv 2021, arXiv:2106.11097.
- [8] Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. A CLIP-hitchhiker’s guide to long video retrieval. arXiv 2022, arXiv:2205.08508.
- [9] Lan, H.; Lv, C. Causal attention transformer for video text retrieval. IET Image Process. 2025, 19, e70093.
- [10] Radford, A.; Kim, J.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 8748–8763.
- [11] Le, H.M.; Carr, P.; Yue, Y.; Lucey, P. Data-driven ghosting using deep imitation learning. In Proceedings of the MIT Sloan Sports Analytics Conference; MIT: Cambridge, MA, USA, 2017; pp. 1–15.
- [12] Yurko, R.; Nguyen, Q.; Pelechrinis, K. NFL Ghosts: A framework for evaluating defender positioning with conditional density estimation. Ann. Appl. Stat. 2026, 20, 873–892.
- [13] Ventura, L.; Yang, A.; Schmid, C.; Varol, G. CoVR-2: Automatic data construction for composed video retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11409–11421.
- [14] Thawakar, O.; Naseer, M.; Anwer, R.; Khan, S.; Felsberg, M.; Shah, M.; Khan, F. Composed video retrieval via enriched context and discriminative embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 26896–26906.
- [15] Gupta, A.; Parmar, J.; Dave, I.; Shah, M. From play to replay: Composed video retrieval for temporally fine-grained videos. arXiv 2025, arXiv:2506.05274.
- [16] Zhang, K.; Li, J.; Li, Z.; Zhang, J.; Li, F.; Liu, Y.; Yan, R.; Jiang, Z.; Chen, N.; Zhang, L.; et al. Composed multi-modal retrieval: A survey of approaches and applications. arXiv 2025, arXiv:2503.01334.
- [17] Hummel, T.; Karthik, S.; Georgescu, M.; Akata, Z. EgoCVR: An egocentric benchmark for fine-grained composed video retrieval. In Proceedings of the European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–17.
- [18] Naik, B.; Hashmi, M.; Bokde, N. A comprehensive review of computer vision in sports: Open issues, future trends and research directions. Appl. Sci. 2022, 12, 4429.
- [19] Gao, T.; Zhang, M.; Zhu, Y.; Zhang, Y.; Pang, X.; Ying, J.; Liu, W. Sports video classification method based on improved deep learning. Appl. Sci. 2024, 14, 948.
- [20] Deliege, A.; Cioppa, A.; Giancola, S.; Seikavandi, M.; Dueholm, J.; Nasrollahi, K.; Ghanem, B.; Moeslund, T.; Van Droogenbroeck, M. SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 4508–4519.
- [21] Xu, H.; Baniya, A.; Well, S.; Bouadjenek, M.; Dazeley, R.; Aryal, S. Deep learning for sports video event detection: Tasks, datasets, methods, and challenges. arXiv 2025, arXiv:2505.03991.
- [22] Scott, A.; Uchida, I.; Onishi, M.; Kameda, Y.; Fukui, K.; Fujii, K. SoccerTrack: A dataset and tracking algorithm for soccer with fish-eye and drone videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3569–3579.
- [23] Lucey, P.; Oliver, D.; Carr, P.; Roth, J.; Matthews, I. Assessing team strategy using spatiotemporal data. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2013; pp. 1366–1374.
- [24] Honda, Y.; Kawakami, R.; Yoshihashi, R.; Kato, K.; Naemura, T. Pass receiver prediction in soccer using video and players’ trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3503–3512.
- [25] Doughty, H.; Damen, D.; Mayol-Cuevas, W. Who’s better? who’s best? pairwise deep ranking for skill determination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 6057–6066.
- [26] Shao, D.; Zhao, Y.; Dai, B.; Lin, D. FineGym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 2616–2625.
- [27] Xu, C.; Wang, J.; Lu, H.; Zhang, Y. A novel framework for semantic annotation and personalized retrieval of sports video. IEEE Trans. Multimed. 2008, 10, 421–436.
- [28] Gan, Y.; Togo, R.; Ogawa, T.; Haseyama, M. Scene retrieval in soccer videos by spatial-temporal attention with video vision transformer. In Proceedings of the IEEE International Conference on Consumer Electronics-Taiwan; IEEE: Piscataway, NJ, USA, 2022; pp. 453–454.
- [29] Haruyama, T.; Takahashi, S.; Ogawa, T.; Haseyama, M. Similar scene retrieval in soccer videos with weak annotations by multimodal use of bidirectional LSTM. In Proceedings of the 2nd ACM International Conference on Multimedia in Asia; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1–8.
- [30] Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 6836–6846.
- [31] Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991.
- [32] Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 1728–1738.
- [33] Xu, J.; Rao, Y.; Yu, X.; Chen, G.; Zhou, J.; Lu, J. FineDiving: A fine-grained dataset for procedure-aware action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 2949–2958.
- [34] Rao, J.; Wu, H.; Liu, C.; Wang, Y.; Xie, W. MatchTime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1671–1685.
- [35] Giancola, S.; Amine, M.; Dghaily, T.; Ghanem, B. SoccerNet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2018; pp. 1711–1721.
- [36] Rao, J.; Wu, H.; Jiang, H.; Zhang, Y.; Wang, Y.; Xie, W. Towards universal soccer video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2025; pp. 8384–8394.
- [37] AI@Meta. Llama 3 Model Card. 2024. Available online: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (accessed on 1 April 2026).
- [38] Zhang, Y.; Li, B.; Liu, H.; Lee, Y.; Gui, L.; Fu, D.; Feng, J.; Liu, Z.; Li, C. LLaVA-NeXT: A Strong Zero-Shot Video Understanding Model. April 2024. Available online: https://llava-vl.github.io/blog/2024-04-30-llava-next-video/ (accessed on 1 April 2026).
- [39] Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- [40] Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T. MPNet: Masked and permuted pre-training for language understanding. Adv. Neural Inf. Process. Syst. 2020, 33, 16857–16867.
- [41] Cioppa, A.; Giancola, S.; Deliege, A.; Kang, L.; Zhou, X.; Cheng, Z.; Ghanem, B.; Van Droogenbroeck, M. SoccerNet-tracking: Multiple object tracking dataset and benchmark in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3491–3502.
- [42] Cui, Y.; Zeng, C.; Zhao, X.; Yang, Y.; Wu, G.; Wang, L. SportsMOT: A large multi-object tracking dataset in multiple sports scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 9921–9931.
- [43] Hiemann, A.; Kautz, T.; Zottmann, T.; Hlawitschka, M. Enhancement of speed and accuracy trade-off for sports ball detection in videos—finding fast moving, small objects in real time. Sensors 2021, 21, 3214.
- [44] Zhang, B.; Gao, J.; Yuan, Y. A descriptive basketball highlight dataset for automatic commentary generation. In Proceedings of the 32nd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2024; pp. 10316–10325.






| Vision encoder | Text encoder | MLLM answers Q1: Y, Q2: Y [%] | MLLM answers Q1: Y, Q2: N [%] | MLLM answers Q1: N, Q2: Y [%] | MLLM answers Q1: N, Q2: N [%] | HVR [%] |
|---|---|---|---|---|---|---|
| MatchVision [36] | CLIP-ViT-B/16 [10] | 90.91 | 3.43 | 0.01 | 5.59 | 66 |
| CLIP-ViT-B/16 | CLIP-ViT-B/16 | 90.39 | 3.52 | 0.02 | 6.04 | 66 |
| MatchVision | all-roberta-large-v1 [39] | 87.23 | 6.64 | 0.03 | 6.00 | 54 |
| MatchVision | all-mpnet-base-v2 [40] | 86.92 | 6.73 | 0.03 | 6.29 | 60 |

| Encoders used: MatchVision (Vision) / CLIP-ViT-B/16 (Text) | MLLM answers Q1: Y, Q2: Y [%] | MLLM answers Q1: Y, Q2: N [%] | MLLM answers Q1: N, Q2: Y [%] | MLLM answers Q1: N, Q2: N [%] | HVR [%] |
|---|---|---|---|---|---|
| ✓ / ✓ | 76.47 | 5.99 | 0.02 | 17.33 | 62 |
| ✓ | 95.93 | 3.04 | 0.00 | 0.99 | 54 |
| ✓ | 74.18 | 6.82 | 0.11 | 18.78 | 48 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yoshida, R.; Goka, R.; Maeda, K.; Ogawa, T.; Haseyama, M. Automated Dataset Construction for Composed Video Retrieval in Soccer. Appl. Sci. 2026, 16, 5360. https://doi.org/10.3390/app16115360

