Automated Dataset Construction for Composed Video Retrieval in Soccer

Yoshida, Riku; Goka, Ryota; Maeda, Keisuke; Ogawa, Takahiro; Haseyama, Miki

doi:10.3390/app16115360

Open AccessArticle

Automated Dataset Construction for Composed Video Retrieval in Soccer

by

Riku Yoshida

¹

,

Ryota Goka

¹

,

Keisuke Maeda

²

,

Takahiro Ogawa

²

and

Miki Haseyama

^2,*

¹

Graduate School of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan

²

Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5360; https://doi.org/10.3390/app16115360

Submission received: 30 April 2026 / Revised: 14 May 2026 / Accepted: 19 May 2026 / Published: 27 May 2026

(This article belongs to the Collection Computer Science in Sport)

Download

Browse Figures

Versions Notes

Abstract

Composed Video Retrieval (CoVR) enables flexible video search by retrieving a target video that reflects a specified modification to a query video. The triplet datasets—consisting of query videos, query text, and target videos—required for model training have been collected manually. Recent studies have explored automatic construction of training triplets for CoVR; however, most existing approaches rely heavily on caption similarity. This limitation is particularly problematic in soccer videos, where identical or highly similar captions can correspond to visually distinct situations, making it difficult to construct triplets with appropriate relationships. To address this issue, this paper proposes a multimodal triplet construction framework specialized for soccer videos. The key idea is to explicitly incorporate visual similarity alongside textual similarity. Specifically, candidate target videos are selected by combining visual similarity with commentary caption filtering, enabling the identification of videos that are visually similar yet semantically different. The semantic difference between videos is then generated as query text using a large language model (LLM) without manual annotation. Furthermore, a multimodal large language model (MLLM) is introduced to estimate whether the generated modification is visually and semantically consistent with the video pair. Rather than replacing human verification, this step provides an automated screening signal to identify potentially unreliable triplets. The experiments show that the proposed framework automatically constructs triplets with reasonable validity under limited human validation. These results demonstrate the potential of scalable triplet construction for CoVR in soccer videos.

Keywords:

composed video retrieval; triplet construction; soccer; multimodal large language model

1. Introduction

Sports video content has expanded dramatically, coupled with the proliferation of social media and specialized streaming services such as DAZN (https://www.dazn.com/, accessed on 1 April 2026) and SPOTV NOW (https://www.spotvnow.com/, accessed on 1 April 2026). These platforms now offer not only full-match broadcasts but also a range of derivative content, including highlights and individual player compilations, enabling audiences to consume sports in more personalized and engaging ways. Accordingly, audiences increasingly seek to select specific moments or impressive plays rather than passively watching entire matches [1]. Such viewing styles are also of great importance in tactical analysis and player development. In many sports, coaches and analysts use video to examine players’ movements and decision-making in particular situations, such as successful plays and mistakes. These analyses provide players with concrete insights [2,3]. Hence, video data are continuously accumulated from daily training and official matches; however, specific interesting scenes are still identified and manually extracted from these extensive collections [4]. This manual process constitutes a significant bottleneck in providing timely feedback, and its impact has become increasingly severe as video data continues to grow. Therefore, in the sports field, there is an increasing demand for video retrieval systems that can quickly and accurately search for specific plays or scenes in response to users’ intent.

Video retrieval systems on social media and streaming platforms generally rely on metadata attached to content, such as titles, descriptions, and tags [5]. These systems, however, are limited in their ability to capture fine-grained contextual information within videos. Moreover, the more detailed the metadata is, the more it typically requires manual annotation, and such information is rarely available in daily videos. This limitation often renders metadata-driven video retrieval impractical. To address this gap, prior work has proposed numerous retrieval approaches based on content-feature similarity. Methods that rely on visual features [6] exist, and more recent work has explored multimodal approaches [7,8,9] using vision-language models such as contrastive language-image pretraining (CLIP) [10]. Although these methods are adequate for retrieving similar videos to a given input, video analysis also requires the ability to search across diverse scenarios, including counterfactual ones, which existing methods cannot support [11,12].

Composed Video Retrieval (CoVR) [13,14,15,16,17] has recently emerged as a method that better reflects user intent by using video-text pairs as queries. CoVR allows users to first specify a query video and then adjust their results by applying changes or adding details via textual instructions. Standard visual similarity aims to retrieve videos that are globally similar to the query in terms of appearance, composition, player arrangement, or camera viewpoint. In contrast, CoVR requires a target video that preserves the overall context of the query video while differing in a specific semantic factor described by the modification text, such as the event outcome, player action, or referee decision. Therefore, the desired target is not simply the nearest visual neighbor. It must be sufficiently similar to preserve the query context but differ in the specific aspect specified by the query text. This requirement is particularly important for counterfactual soccer retrieval, where users may seek a play similar to the query but with a different outcome. While CoVR demonstrates the potential for flexible and practical video retrieval, challenges remain in collecting the training data needed to achieve the desired performance levels for models, as most of these collections require manual annotation [13,17]. To address this challenge, several studies have proposed automated triplet construction methods [13], and some of these also target sports domains such as gymnastics and diving [15]. These methods construct triplets based on available information, such as caption similarity or predefined labels. Hence, the resulting triplets often struggle to capture only the nuanced and localized changes specified in query texts, leading to noisy supervision. This issue is particularly pronounced in sports videos such as soccer, where multiple players interact dynamically across a large field. In those videos, it is difficult to identify the specific elements for modification and to refine target videos, as even action labels, like passes and shots, can be applied across a wide range of situational contexts. Furthermore, it has been reported that automatically constructed triplets are noisy, in the sense that the query text does not accurately capture the differences between the two videos [13]. Although the use of noisy triplets for model training may negatively affect retrieval performance, the methods for evaluating their validity and reliability have not been sufficiently explored.

To overcome these challenges, this paper proposes a novel framework for automated triplet construction in soccer videos. The core novelty of the proposed framework lies in two key ideas: (i) a two-stage integration of visual similarity and commentary-based textual similarity, and (ii) a logically grounded validity assurance mechanism using a multimodal large language model (MLLM). Unlike existing approaches that rely primarily on caption similarity, our method explicitly incorporates visual information to address a critical issue in soccer videos: visually distinct situations may share highly similar textual descriptions. In particular, the two-stage design is essential, as visual similarity ensures contextual consistency, while textual similarity captures fine-grained semantic differences that cannot be distinguished by visual features alone. The proposed framework first narrows down candidate videos based on visual similarity to capture shared event contexts, and then refines them using commentary captions to identify subtle semantic differences between events. This multimodal integration enables the construction of triplets that capture counterfactual relationships, in which only specific aspects of an event are altered while preserving its overall context. For example, using the proposed framework, as shown in Figure 1, we can generate a triplet that transforms a scene in which the ball has been cleared into a scene in which the goalkeeper makes a save under similar conditions. Such counterfactual triplets are difficult to obtain using conventional retrieval-based or text-only methods, but our method allows them to be constructed without expert annotation. Furthermore, to address the inherent noise in automatically generated triplets, we introduce an MLLM-based evaluation mechanism that performs structured reasoning over the relationship between videos and query text. The MLLM evaluates triplet consistency through explicit question-based reasoning, such as whether the query text correctly reflects the visual change and whether the change corresponds to meaningful event content. This process provides a scalable screening mechanism for identifying potentially inconsistent samples in automatically constructed triplets.

In summary, the main contributions of this paper are as follows:

We propose a framework for automatic triplet construction in soccer videos that introduces a two-stage integration of visual and commentary-based textual similarity, enabling the construction of counterfactual triplets without manual annotation.
We introduce an MLLM-based evaluation method that formulates triplet validity assessment as a set of structured questions, providing an automatic screening signal for identifying potentially unreliable triplets.

The structure of this paper is as follows. Section 2 reviews the related work. Section 3 presents the proposed method. Section 4 shows the experimental results that verify the effectiveness of the proposed method. Section 5 presents the limitations of our method, and Section 6 concludes the paper.

2. Related Work

2.1. Sports Video Understanding and Retrieval

Researchers have extensively studied sports videos in computer vision due to the rich spatio-temporal information they provide regarding player movements and tactical situations [18,19]. Several studies have investigated the automatic detection of key events, such as fouls, shots, and goals, in soccer matches by modeling temporal dependencies and extracting visual features from broadcast videos [20,21]. Other studies have advanced player tracking by developing algorithms and datasets for fisheye and drone videos [22]. In addition, trajectory-based analysis has been used to study team strategies and player interactions, including transformer-based pass receiver prediction [23,24]. Furthermore, deep learning-based approaches, including pairwise deep ranking for skill determination [25] and hierarchical action understanding with fine-grained datasets [26], have been employed to evaluate athlete performance and skill levels. These approaches support applications such as coaching assistance, performance evaluation, and sports analytics.

While these fundamental tasks, including detection, tracking, and evaluation, provide a holistic understanding of a match, they do not directly address the practical need for retrieving specific scenes based on user intent. In practical scenarios such as coaching and match analysis, analysts often need to locate video scenes that meet specific criteria (e.g., similar plays or tactical situations). This requirement motivates the development of video retrieval systems that efficiently search large video collections for relevant scenes. Video retrieval methods in sports have evolved from keyword-based systems [27,28] to complex multimodal frameworks [29]. An early study established a framework for semantic annotation and personalized retrieval [27], utilizing manually assigned labels and user preferences to facilitate keyword-based searches. The subsequent study has explored the extraction of semantic features to bridge the gap between low-level visual data and high-level concepts, thus improving the representation of video content [6]. More recently, advancements in deep learning have introduced a more sophisticated visual-feature-based approach [28]; for instance, the use of spatial-temporal attention mechanisms combined with Video Vision Transformers [30] has enabled more precise scene retrieval in soccer by capturing the dynamic relationships between players and the ball. Furthermore, to address the inherent complexity of soccer matches, multimodal retrieval frameworks [29] have been developed that integrate diverse information sources, including video features, player positions, audio signals, and webcast text through bidirectional LSTMs [31].

These existing methods are primarily designed to find content that matches the query’s existing characteristics. In practical sports analysis, however, users often require a more flexible search capability that allows them to find scenes representing specific variations or “what-if” scenarios (e.g., “find a play similar to this one, but where the shot was blocked instead of scoring”). This requirement marks a significant shift from traditional similarity-based retrieval to intent-driven search. To address the limitations of existing analytical and retrieval frameworks, this study adopts CoVR. While traditional methodologies often rely on static metadata or near-duplicate visual matching, this research focuses on modeling complex, high-level semantic transformations such as counterfactual event outcomes within the intricate and dynamic multi-player environment of soccer. Our work establishes a novel pathway for practical sports analytics by facilitating the retrieval of scenes that reflect specific variations in play.

2.2. Automated Triplet Construction for CoVR

CoVR [13,14,15,17] has recently emerged as a promising framework to bridge the gap between fixed metadata and fluid user intent. By combining a query video and a text-based modification instruction, CoVR allows users to search for content that differs from the query in specific ways (e.g., “change the outcome of this shot to a goal”). Training CoVR models effectively requires high-quality triplet datasets comprising a query video, a query text, and a target video, which are traditionally collected through manual annotation. To scale these datasets, several automated construction frameworks [13,15] have been developed. A representative approach leverages caption similarity to identify candidate video pairs [13]. This method extracts text features from large-scale datasets such as WebVid2M [32] and identifies pairs whose captions are semantically similar but differ only in minor ways. An LLM is then employed to generate a query text that articulates the specific differences between these captions. This text-driven pipeline enables the creation of extensive datasets without manual intervention. Furthermore, a recent study proposed an automated construction framework for temporally fine-grained sports videos [15], such as gymnastics and diving [26,33]. Their approach uses structured labels, including technique names and rotation counts, to assess similarities and differences, with an LLM generating a query text from these categorical changes.

However, applying these text-centric or label-dependent methods to soccer presents a significant technical challenge: in dynamic multi-player sports, identical or highly similar captions can correspond to visually and semantically distinct situations. Our work distinguishes itself by proposing a multimodal construction framework that integrates visual and textual information in a two-stage process. In our framework, visual similarity is first used to ensure that the query and target videos share comparable scene context, and commentary caption similarity is then used to select pairs exhibiting meaningful semantic differences within that shared context. Additionally, we introduce an automated validity evaluation mechanism using an MLLM as a scalable screening step for constructed triplets. By formalizing the verification of visual and semantic consistency as a structured reasoning task, our framework facilitates the construction of candidate soccer CoVR datasets while reducing the manual inspection required.

3. Proposed Method

This section describes our framework for automated triplet construction in soccer videos, which consists of the following three stages:

(i): Identifying relevant target videos by jointly leveraging visual similarity and semantic information from commentary captions.
(ii): Generating query text that captures semantic differences between videos through an LLM.
(iii): Estimating triplet validity and identifying potentially unreliable samples via MLLM-based evaluation.

Figure 2 illustrates the overall pipeline of the proposed framework, which consists of three stages: target video identification, query text generation, and MLLM-based triplet validity evaluation.

3.1. Multimodal Identification of Target Videos

To identify target videos that preserve the scene context while differing in semantically meaningful ways from the query video, we adopt a two-stage filtering strategy that integrates visual and textual similarities. First, we identify visually similar candidates to ensure visual consistency. This is a prerequisite for CoVR, as the query text should represent a modification of the specific event depicted in the query video. We utilize the MatchTime dataset [34], which provides video clips with aligned commentary captions. These captions provide expert descriptions of key events and thus help identify semantically matched target videos. Let

V_{q}

denote the query video and

V_{i}

denote a candidate target video. Each video is processed to extract a global representation. Let

{f_{t}}_{t = 1}^{T}

be a set of T sampled frames from a video, where

f_{t}

denotes the visual feature of the t-th frame extracted by a vision encoder. The video-level feature v is obtained by temporal average pooling:

v = \frac{1}{T} \sum_{t = 1}^{T} f_{t} .

(1)

The visual similarity between a query video feature

v_{q}

and a candidate feature

v_{i}

is calculated using cosine similarity.

Second, to identify videos with semantic differences, we refine the visually similar candidates using their associated commentary captions. Let

C_{q}

denote the commentary caption associated with the query video

V_{q}

, and

C_{i}

denote the commentary caption associated with a candidate video

V_{i}

. Their text features are extracted using a text encoder as

t_{q} = TextEnc (C_{q})

and

t_{i} = TextEnc (C_{i})

. We then calculate the cosine similarity between

t_{q}

and

t_{i}

. To analyze triplet validity across different degrees of caption-level semantic relatedness, we define four threshold ranges based on preliminary qualitative inspection of caption pairs: below 0.70, [0.70, 0.79), [0.79, 0.85), and 0.85 or above. These ranges were not intended as globally optimal hyperparameters. Rather, they were introduced to stratify caption pairs into interpretable levels: largely unrelated, semantically different, partially different, and nearly identical pairs. For each query video, we construct four candidate triplets by selecting the top-1 candidate within each threshold range. This allows us to evaluate the effect of caption similarity on triplet validity across the full spectrum of similarity levels. This filtering strategy yields target videos that remain visually consistent with the query while exhibiting controlled semantic differences. The selected candidate video is denoted as the target video

V_{t}

, and its corresponding commentary caption is denoted as the target caption

C_{t}

.

3.2. Generation of Query Text via LLM

To generate query text that captures the nuanced semantic transformations between videos, we leverage the linguistic information embedded in commentary captions. Specifically, our method inputs the query caption

C_{q}

, the target caption

C_{t}

, and a prompt

P_{llm}

into an LLM to generate the query text

Q_{q t}

. The process is defined as follows:

\begin{matrix} Q_{q t} = LLM (C_{q}, C_{t}, P_{llm}) . \end{matrix}

(2)

To ensure the output aligns with the instructional format required for CoVR, prompt

P_{llm}

(detailed in Figure 3) incorporates an expert persona and explicit constraints. Specifically, the LLM is instructed to identify only the changes between the source and counterfactual events, focusing on transformations in outcomes, referee decisions, or player actions. Furthermore,

P_{llm}

includes few-shot examples, such as “Change the outcome from a goal to a missed shot”, to guide the model toward generating concise and single-sentence instructions without redundant information like team names or identical phrases. This ensures that the generated text reflects the actual semantic differences between scenes rather than merely rephrasing sentences. This approach is necessary since soccer videos contain complex player movements that coarse event labels (e.g., “Shot” or “Goal”) cannot sufficiently describe. This process enables the automated synthesis of event-focused query text that reflects semantic discrepancies across soccer scenes without requiring manual annotation.

3.3. Triplet Validity Evaluation via an MLLM

To automatically estimate the reliability and validity of the constructed triplets, we introduce an MLLM-based evaluation mechanism. This mechanism is intended to provide a scalable screening signal for potentially unreliable triplets, rather than to fully replace human verification. The MLLM analyzes a pair of videos

{V_{q}, V_{t}}

, each represented as a sequence of uniformly sampled frames (see Section 4.1 for sampling details), and the generated query text

Q_{q t}

to answer structured reasoning questions. As detailed in the prompt

P_{mllm}

shown in Figure 4, the MLLM assesses each triplet based on two primary validation criteria:

Q1:: This criterion verifies the visual-textual alignment. It is designed to filter out triplets in which the query text does not reflect the actual visual difference, often due to temporal misalignment in captions or LLM hallucinations.
Q2:: This criterion serves as a content filter to ensure the triplet captures meaningful event-related transformations. In soccer broadcasts, this is essential for excluding irrelevant changes, such as fluctuations in the displayed score or superficial textual variations, that do not pertain to the actual play.

In this study, a triplet is regarded as MLLM-valid when it satisfies both criteria. This dual-question evaluation is used to identify triplets that are likely to be visually and semantically consistent. The above MLLM-based evaluation process provides a structured validity indicator for large-scale candidate triplets. By identifying triplets exhibiting potential visual-textual misalignment or irrelevant content, the proposed framework enables scalable triplet construction while reducing the manual inspection required.

4. Experimental Results

This section evaluates the proposed framework through quantitative and qualitative experiments on a commentary-captioned soccer video dataset. We evaluate the proposed framework from three perspectives: the effect of various encoder combinations on triplet construction, the necessity of multimodal target video identification, and the validity of the constructed triplets through both MLLM-based and human evaluation.

4.1. Experimental Settings

In this study, we used the MatchTime dataset [34], which provides video clips aligned with commentary captions. The dataset contains 32,743 timestamped commentary captions from 471 soccer matches. Since many captions include non-play content, such as additional time, audience numbers, or substitutions, we filtered the captions using event labels and retained only those related to corner kicks and goals. As a result, 21,417 commentary captions were used in our experiments. For the video data, we followed the experimental settings of the MatchTime dataset and used the SoccerNet dataset [35] at a frame rate of 25 fps and a resolution of 224 × 224 pixels. As reported in the MatchTime paper, the timestamps of the commentary captions contain temporal misalignment. To mitigate this issue, we used 60-s video segments, corresponding to a ±30-s window centered on each commentary timestamp. This setting follows the MatchTime dataset, in which a ±30-s window was shown to be effective in an LLM-as-judge video-understanding evaluation [34]. A 60-s duration was adopted as a practical compromise: shorter clips may exclude the event described by the commentary, whereas longer clips may include multiple unrelated plays, making the relationship between the caption and video ambiguous. We sampled these segments at 1 fps to balance computational efficiency with the retention of essential semantic information.

As vision encoders, we employed CLIP (clip-vit-base-patch16), a widely used model, and MatchVision [36], a model pretrained for soccer event classification. For text encoders, we used the CLIP text encoder (clip-vit-base-patch16) and two SentenceTransformer models: all-mpnet-base-v2 and all-roberta-large-v1. To generate query texts, we used the Llama-3-8B-Instruct model [37]. We constructed triplets by combining these pretrained vision and text encoders.

Triplet validity was evaluated using an MLLM. We employed LLaVA-NeXT Video [38], an open-source MLLM. Given each triplet, the MLLM answered two “Yes” or “No” questions. For MLLM-based evaluation, we uniformly sample 8 frames from each 60-frame video clip at equal temporal intervals to limit the increase in token count. We first measured the proportion of triplets satisfying both criteria. A triplet was regarded as valid only when both answers were “Yes”. The valid triplet ratio

R_{valid}

was then computed as

R_{valid} = 100 \times \frac{N_{valid}}{M},

(3)

where

N_{valid}

denotes the number of triplets satisfying both criteria and M denotes the total number of evaluated triplets. Note that M varies depending on the specific combination of video and text encoders used for triplet construction. In our experiments, the values of M corresponding to the four rows in Table 1 were, from top to bottom, 73,385, 73,348, 67,294, and 65,347, respectively. We further conducted human validation on a subset of the valid triplets to verify the alignment between MLLM-judged validity and human assessment. For each encoder combination, 50 triplets were randomly sampled from those judged valid by the MLLM and evaluated by a human evaluator. For each triplet, the evaluator was presented with the query video, the target video, and the generated query text. The evaluator judged whether the query text correctly described the visual change from the query video to the target video and whether the change corresponded to soccer play or event content. A triplet was counted as human-valid only when both criteria were satisfied. The Human Validation Rate (HVR) was computed as

HVR = 100 \times \frac{N_{human}}{50},

(4)

where

N_{human}

denotes the number of sampled triplets judged appropriate by the evaluator. The human evaluation was conducted by a single evaluator with specialized expertise in soccer. We adopted a single evaluator setting since the task requires domain knowledge and careful comparison of two one-minute video clips and the query text for each triplet. In the quantitative evaluation, we compared validity across different model combinations, and in the qualitative evaluation, we examined whether triplets judged high validity indeed represented appropriate relationships. The 224 × 224 resolution was used for videos processed by the vision encoder and MLLM. Human validation was not restricted to this resolution; instead, the evaluator inspected the query and target video clips in a standard video-viewing environment to assess the visual changes and event content. Thus, human evaluation was used as domain-aware validation of MLLM-evaluated triplets rather than as an evaluation with exactly the same input resolution as the MLLM. A systematic analysis of the effect of video resolution on both MLLM-based evaluation and human validation remains future work.

4.2. Quantitative Experimental Results

Table 1 presents the results of triplet validity evaluation using different encoder combinations. The highest proportion of valid triplets was obtained when using the soccer-specific vision encoder (MatchVision) together with the CLIP text encoder. Notably, the valid triplet ratio decreased when MatchVision was paired with text encoders such as RoBERTa, suggesting that construction accuracy relies heavily on the cross-modal alignment between visual and textual information. Here, HVR should be interpreted as a supplementary indicator of validity rather than a definitive measure. The HVR of 66%, shared by both MatchVision and CLIP vision encoders when paired with the CLIP text encoder, indicates that approximately two-thirds of the triplets judged valid by the MLLM were also considered appropriate. This result suggests that the MLLM-based evaluation shows reasonable agreement with human judgment and serves as a useful automatic indicator of triplet validity. Although the HVR scores were identical, qualitative observations indicate that the soccer-specific encoder is more effective at preserving the contextual flow of play than the CLIP-only model. This implies that while the text encoder influences the validity ratio, the domain-specific vision encoder is essential for capturing tactical consistency in soccer scenes.

Table 2 presents an ablation study focusing on similar scenes using the encoder combination that achieved the highest valid triplet ratio in Table 1. The results show that the proportion of triplets where the MLLM answered “No” to both questions increases when only caption similarity is used. This is because caption-only filtering tends to select identical or nearly identical captions as target captions. In such cases, the generated query text is often “No change”, which yields negative answers from the MLLM. Moreover, the caption-only setting achieved the lowest HVR of 48%, whereas the proposed visual and textual setting achieved an HVR of 62%. This result suggests that caption similarity alone is insufficient for selecting target videos that preserve the visual context of the query video. In active multi-player soccer environments, similar captions can correspond to different player arrangements, camera viewpoints, and preceding play contexts, which makes caption-based triplet construction unreliable for fine-grained CoVR modifications.

In contrast, when only the soccer-specific vision encoder is used, the target video is selected based on visual similarity. Since captions for visually similar scenes are rarely identical, the proportion of negative responses decreases. Although the visual-only baseline achieved the highest MLLM-based valid triplet ratio (95.93%), its HVR was relatively low (54%). This discrepancy suggests that the MLLM may overestimate the validity of triplets when the query and target videos are visually similar. Since this baseline does not apply caption-based filtering, the semantic difference between the query and target captions is not explicitly controlled, which can lead to query texts that are less coherent as fine-grained CoVR modifications. Therefore, combining visual and textual filtering is necessary to construct triplets that are not only accepted by the MLLM but also judged valid by humans.

4.3. Qualitative Experimental Results

Figure 5 presents examples of triplet evaluation results obtained using the soccer-specific vision encoder and the CLIP text encoder. In each example, green indicates elements in the query text corresponding to the query video, while orange indicates the modified elements in the target video. Triplets for which the MLLM answered “Yes” to both questions correctly reflect the modification instructions in the query text, indicating that the model successfully recognizes the visual differences between the videos. A case in which the MLLM answered “Yes” to Q1 but “No” to Q2 typically involves subtle visual differences, such as the specific area of the goal at which the ball entered. These differences may be difficult for the model to interpret as play-related changes. A case in which the MLLM answered “No” to Q1 but “Yes” to Q2 is often caused by timestamp misalignment in the commentary captions. In such a case, the event described in the query text does not appear within the sampled video window. Finally, a case in which both answers are “No” corresponds to the generated query texts of “No change”. Even when captions are identical, soccer scenes often exhibit dynamic changes in player positions, resulting in slightly different visual content. As a result, the MLLM may fail to recognize the intended equivalence. Overall, the qualitative results support the usefulness of the proposed pipeline: valid triplets generally preserve the scene context while modifying only the event outcome or action of interest. At the same time, the failure cases reveal limitations in caption alignment and in the MLLM-based verification process. These examples also show that the proposed framework can construct triplets in which the camera angles differ, while the situations remain highly similar, and the outcomes differ.

Figure 6 compares triplets constructed using the soccer-specific vision encoder and those constructed using CLIP-ViT-B/16 for both vision and text encoders. Although both encoder combinations produce triplets with valid relationships, the target video retrieved using the soccer-specific encoder preserves the contextual flow of the play more effectively. In contrast, the CLIP-only configuration retrieves a scene with a different preceding action, resulting in a less consistent match with the query video. The quantitative results in Table 1 show that the MatchVision-CLIP setting achieved a slightly higher MLLM-based valid ratio than the CLIP-only setting, while both settings obtained the same HVR. Together with the qualitative example in Figure 6, this result suggests that MatchVision may help preserve soccer-specific contextual information in some cases. However, this evidence is not sufficient to conclude that MatchVision consistently outperforms CLIP in detecting soccer-specific visual details. A systematic quantitative comparison of domain-specific and generic vision encoders, along with a larger-scale human evaluation of contextual consistency, remains future work.

5. Limitations

The results indicate that the proposed framework is effective for automatically constructing triplets in soccer videos by combining visual and commentary-based textual information during target video identification. These findings support the paper’s hypothesis that constructing triplets solely from captions in soccer videos is insufficient and that visual filtering is necessary to preserve scene context during target video identification. This interpretation is consistent with previous studies on automated triplet construction for CoVR that rely mainly on textual similarity. However, the present results also suggest a limitation of such approaches in multi-player sports videos, where similar captions may correspond to visually different situations. These findings indicate that visual filtering plays an important role in constraining the pool of candidate target videos before text-based comparison.

The experiments also highlight both the usefulness and the limitations of the MLLM-based evaluation. The MLLM-based evaluation contributes to noise reduction by identifying triplets in which the generated query text is inconsistent with the visual change between the query and target videos or does not correspond to soccer play or event content. The human validation results suggest that the MLLM-based criteria are useful for automatically identifying triplets with reasonable validity, but the agreement is not perfect. A major limitation of the current validity evaluation is the discrepancy between the MLLM-based valid ratio and HVR. This discrepancy suggests that the MLLM may accept some triplets that appear plausible from sparsely sampled frames but are not sufficiently supported by careful human inspection. Possible reasons include temporal misalignment between commentary and visual events, sparse frame sampling, and subtle differences in events in broadcast soccer videos. Thus, the MLLM-based evaluation should be regarded as a scalable but imperfect screening mechanism, not as a complete substitute for human validation. In particular, failure cases were frequently associated with residual temporal misalignment between commentary captions and video clips, subtle differences in play content, and “No change” cases. Another possible factor is that query text generation relies on commentary captions rather than direct visual evidence. Although the prompt restricts the LLM to describing the difference between the two captions, it cannot guarantee that the generated modification is visually observable in the corresponding video clips. To mitigate this issue, we introduce an MLLM-based validity evaluation to assess whether the generated query text is visually consistent with the query and target videos. However, the extent to which this MLLM-based evaluation can quantitatively suppress hallucination or visually unsupported query texts remains to be further verified. These observations indicate that automatic triplet construction and verification should be improved jointly rather than treated as independent components.

Several limitations remain in the present study. First, human validation was conducted by a single evaluator, as the task requires soccer-specific knowledge and careful comparison of two one-minute video clips and the generated query text. Although this setting enables domain-aware evaluation of fine-grained soccer events, it prevents us from measuring inter-annotator agreement and may introduce subjective bias. Second, the present experiments evaluate the quality of the constructed triplets rather than the downstream retrieval performance of CoVR models trained on them. Therefore, future work should train CoVR models with the generated triplets and measure retrieval performance before and after MLLM-based screening. In addition, triplets judged as potentially noisy should not necessarily be simply removed, since reducing the dataset size may also reduce training diversity. A promising direction is to refine such triplets by regenerating query texts or reselecting target videos, thereby improving dataset quality while maintaining dataset scale. Third, the current framework is limited to corner kicks and goals, which restricts the diversity of the constructed dataset. The framework should be extended to additional soccer events beyond corner kicks and goals to improve the dataset’s diversity and generality.

There are also broader limitations related to practical sports video analysis. The current framework does not explicitly handle fast camera transitions and visual obstructions, which frequently occur in broadcast soccer videos. Although using 60-s segments increases the likelihood that the commentary’s event is present in the video, and uniform frame sampling yields sparse visual observations of the segment, these strategies do not guarantee that the decisive moment of the play is visible to the vision encoder or the MLLM. Future work should incorporate player and ball tracking [41,42,43] and video representations to preserve more detailed temporal dynamics. In particular, the current implementation represents each video segment using temporally averaged frame features, which do not explicitly preserve motion dynamics or event progression. Temporal motion features, such as temporal attention, would allow the framework to distinguish not only whether two clips are visually similar, but also how the play evolves over time and at which moment the event outcome changes.

Another issue to consider is the potential for bias accumulation when automatically generated textual descriptions are used to train retrieval models. Our framework relies on foundation models for query-text generation and MLLM-based validity evaluation. As a result, systematic biases or hallucinations from these models may be propagated into the constructed triplets. Future work should address this issue by incorporating human-in-the-loop auditing, multiple independent validation models, and downstream retrieval evaluation.

Finally, the proposed framework is not limited in principle to soccer. It may be applicable to other team sports, such as basketball [44], where videos are accompanied by commentary or event descriptions and where analysts need to retrieve plays with similar contexts but different outcomes. However, direct transfer is not guaranteed. The caption filtering thresholds, query-text generation prompts, and MLLM-based validity questions should be adapted to the target sport, as each sport has distinct event structures, spatial motion patterns, and definitions of meaningful event transformations.

6. Conclusions

In this paper, we have proposed an automatic triplet construction framework for composed video retrieval in soccer. By combining visual similarity with commentary-based textual similarity, the proposed framework identifies target videos that preserve scene context while exhibiting meaningful semantic differences. In addition, query text is generated automatically using an LLM, and an MLLM-based validity evaluation is introduced as a screening mechanism for identifying potentially unreliable triplets. The experimental results demonstrated that the proposed framework can construct triplets with reasonable validity and that incorporating visual information is important for preserving the contextual consistency of target scenes in soccer videos. These findings suggest that multimodal triplet construction is a promising approach to scalable CoVR dataset construction for soccer videos. By reducing dependence on manual triplet annotation, the proposed framework lowers the barrier to constructing customized sports-tech AI systems for retrieval-based match analysis and coaching support. At the same time, the results show that full automation remains challenging, and reliable deployment will require human-in-the-loop validation and downstream retrieval evaluation. Future work will further examine the reliability and downstream effectiveness of the constructed triplets, refine potentially noisy triplets, extend to additional events and other team sports, and integrate temporal motion features into triplet construction.

Author Contributions

Conceptualization, R.Y., R.G., K.M., T.O. and M.H.; methodology, R.Y., R.G., K.M., T.O. and M.H.; software, R.Y.; validation, R.Y., R.G., K.M., T.O. and M.H.; data curation, R.Y.; writing—original draft preparation, R.Y.; writing—review and editing, R.G., K.M., T.O. and M.H.; visualization, R.Y.; funding acquisition, R.G., K.M. and M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by JSPS KAKENHI Grant Numbers JP24K02942 (Miki Haseyama), JP23K11211 (Keisuke Maeda), and JP25KJ0520 (Ryota Goka).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, Y.; Zhang, X.; Xu, C.; Lu, H. Personalized retrieval of sports video. In Proceedings of the International Workshop on Multimedia Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2007; pp. 313–322. [Google Scholar]
Hughes, M.; Franks, I. Notational Analysis of Sport: Systems for Better Coaching and Performance in Sport; Psychology Press: New York, NY, USA, 2004. [Google Scholar]
Guo, Y.; Chen, C.; Peng, J.; Deng, L.; Yuan, T. Does visual training enhance athletes’ decision-making skills and sport-specific performance? A systematic review and meta-analysis. Scand. J. Med. Sci. Sport. 2025, 35, e70140. [Google Scholar] [CrossRef] [PubMed]
Zhang, Y.; Xu, C.; Zhang, X.; Lu, H. Personalized retrieval of sports video based on multi-modal analysis and user preference acquisition. Multimed. Tools Appl. 2009, 44, 305–330. [Google Scholar] [CrossRef]
Toderici, G.; Aradhye, H.; Pasca, M.; Sbaiz, L.; Yagnik, J. Finding meaning on youtube: Tag recommendation and category discovery. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2010; pp. 3447–3454. [Google Scholar]
Guo, C. Research on sports video retrieval algorithm based on semantic feature extraction. Multimed. Tools Appl. 2023, 82, 21941–21955. [Google Scholar] [CrossRef]
Fang, H.; Xiong, P.; Xu, L.; Chen, Y. Clip2video: Mastering video-text retrieval via image clip. arXiv 2021, arXiv:2106.11097. [Google Scholar]
Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. A clip-hitchhiker’s guide to long video retrieval. arXiv 2022, arXiv:2205.08508. [Google Scholar]
Lan, H.; Lv, C. Causal attention transformer for video text retrieval. IET Image Process. 2025, 19, e70093. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 8748–8763. [Google Scholar]
Le, H.M.; Carr, P.; Yue, Y.; Lucey, P. Data-driven ghosting using deep imitation learning. In Proceedings of the MIT Sloan Sports Analytics Conference; MIT: Cambridge, MA, USA, 2017; pp. 1–15. [Google Scholar]
Yurko, R.; Nguyen, Q.; Pelechrinis, K. NFL Ghosts: A framework for evaluating defender positioning with conditional density estimation. Ann. Appl. Stat. 2026, 20, 873–892. [Google Scholar] [CrossRef]
Ventura, L.; Yang, A.; Schmid, C.; Varol, G. CoVR-2: Automatic data construction for composed video retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11409–11421. [Google Scholar] [CrossRef] [PubMed]
Thawakar, O.; Naseer, M.; Anwer, R.; Khan, S.; Felsberg, M.; Shah, M.; Khan, F. Composed video retrieval via enriched context and discriminative embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 26896–26906. [Google Scholar]
Gupta, A.; Parmar, J.; Dave, I.; Shah, M. From play to replay: Composed video retrieval for temporally fine-grained videos. arXiv 2025, arXiv:2506.05274. [Google Scholar] [CrossRef]
Zhang, K.; Li, J.; Li, Z.; Zhang, J.; Li, F.; Liu, Y.; Yan, R.; Jiang, Z.; Chen, N.; Zhang, L.; et al. Composed multi-modal retrieval: A survey of approaches and applications. arXiv 2025, arXiv:2503.01334. [Google Scholar] [CrossRef]
Hummel, T.; Karthik, S.; Georgescu, M.; Akata, Z. Egocvr: An egocentric benchmark for fine-grained composed video retrieval. In Proceedings of the European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–17. [Google Scholar]
Naik, B.; Hashmi, M.; Bokde, N. A comprehensive review of computer vision in sports: Open issues, future trends and research directions. Appl. Sci. 2022, 12, 4429. [Google Scholar] [CrossRef]
Gao, T.; Zhang, M.; Zhu, Y.; Zhang, Y.; Pang, X.; Ying, J.; Liu, W. Sports video classification method based on improved deep learning. Appl. Sci. 2024, 14, 948. [Google Scholar] [CrossRef]
Deliege, A.; Cioppa, A.; Giancola, S.; Seikavandi, M.; Dueholm, J.; Nasrollahi, K.; Ghanem, B.; Moeslund, T.; Van Droogenbroeck, M. Soccernet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 4508–4519. [Google Scholar]
Xu, H.; Baniya, A.; Well, S.; Bouadjenek, M.; Dazeley, R.; Aryal, S. Deep learning for sports video event detection: Tasks, datasets, methods, and challenges. arXiv 2025, arXiv:2505.03991. [Google Scholar] [CrossRef]
Scott, A.; Uchida, I.; Onishi, M.; Kameda, Y.; Fukui, K.; Fujii, K. SoccerTrack: A dataset and tracking algorithm for soccer with fish-eye and drone videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3569–3579. [Google Scholar]
Lucey, P.; Oliver, D.; Carr, P.; Roth, J.; Matthews, I. Assessing team strategy using spatiotemporal data. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2013; pp. 1366–1374. [Google Scholar]
Honda, Y.; Kawakami, R.; Yoshihashi, R.; Kato, K.; Naemura, T. Pass receiver prediction in soccer using video and players’ trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3503–3512. [Google Scholar]
Doughty, H.; Damen, D.; Mayol-Cuevas, W. Who’s better? who’s best? pairwise deep ranking for skill determination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 6057–6066. [Google Scholar]
Shao, D.; Zhao, Y.; Dai, B.; Lin, D. Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 2616–2625. [Google Scholar]
Xu, C.; Wang, J.; Lu, H.; Zhang, Y. A novel framework for semantic annotation and personalized retrieval of sports video. IEEE Trans. Multimed. 2008, 10, 421–436. [Google Scholar]
Gan, Y.; Togo, R.; Ogawa, T.; Haseyama, M. Scene retrieval in soccer videos by spatial-temporal attention with video vision transformer. In Proceedings of the IEEE International Conference on Consumer Electronics-Taiwan; IEEE: Piscataway, NJ, USA, 2022; pp. 453–454. [Google Scholar]
Haruyama, T.; Takahashi, S.; Ogawa, T.; Haseyama, M. Similar scene retrieval in soccer videos with weak annotations by multimodal use of bidirectional lstm. In Proceedings of the 2nd ACM International Conference on Multimedia in Asia; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1–8. [Google Scholar]
Anurag, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 6836–6846. [Google Scholar]
Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar] [CrossRef]
Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 1728–1738. [Google Scholar]
Xu, J.; Rao, Y.; Yu, X.; Chen, G.; Zhou, J.; Lu, J. Finediving: A fine-grained dataset for procedure-aware action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 2949–2958. [Google Scholar]
Rao, J.; Wu, H.; Liu, C.; Wang, Y.; Xie, W. MatchTime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1671–1685. [Google Scholar]
Giancola, S.; Amine, M.; Dghaily, T.; Ghanem, B. Soccernet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2018; pp. 1711–1721. [Google Scholar]
Rao, J.; Wu, H.; Jiang, H.; Zhang, Y.; Wang, Y.; Xie, W. Towards universal soccer video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2025; pp. 8384–8394. [Google Scholar]
AI@Meta. Llama 3 Model Card. 2024. Available online: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (accessed on 1 April 2026).
Zhang, Y.; Li, B.; Liu, h.; Lee, Y.; Gui, L.; Fu, D.; Feng, J.; Liu, Z.; Li, C. LLaVA-NeXT: A Strong Zero-Shot Video Understanding Model. April 2024. Available online: https://llava-vl.github.io/blog/2024-04-30-llava-next-video/ (accessed on 1 April 2026).
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T. Mpnet: Masked and permuted pre-training for language understanding. Adv. Neural Inf. Process. Syst. 2020, 33, 16857–16867. [Google Scholar]
Cioppa, A.; Giancola, S.; Deliege, A.; Kang, L.; Zhou, X.; Cheng, Z.; Ghanem, B.; Van Droogenbroeck, M. Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3491–3502. [Google Scholar]
Cui, Y.; Zeng, C.; Zhao, X.; Yang, Y.; Wu, G.; Wang, L. Sportsmot: A large multi-object tracking dataset in multiple sports scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 9921–9931. [Google Scholar]
Hiemann, A.; Kautz, T.; Zottmann, T.; Hlawitschka, M. Enhancement of speed and accuracy trade-off for sports ball detection in videos—finding fast moving, small objects in real time. Sensors 2021, 21, 3214. [Google Scholar] [CrossRef] [PubMed]
Zhang, B.; Gao, J.; Yuan, Y. A descriptive basketball highlight dataset for automatic commentary generation. In Proceedings of the 32nd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2024; pp. 10316–10325. [Google Scholar]

Figure 1. Example of a constructed triplet in this work. The arrow indicates that the instructions in the query text are applied to the query video, resulting in the target video. Each color shows that the query text corresponds to a specific part of each caption.

Figure 2. Overview of the proposed pipeline, detailing the progression from visual/textual filtering to MLLM-based validation. The green boxes indicate selected features based on their similarity to the query feature.

Figure 3. Detailed prompt

P_{llm}

design for query text generation using Llama-3-8B-Instruct. The prompt includes few-shot examples and explicit constraints to ensure the generation of concise, event-focused instructions.

Figure 3. Detailed prompt

P_{llm}

design for query text generation using Llama-3-8B-Instruct. The prompt includes few-shot examples and explicit constraints to ensure the generation of concise, event-focused instructions.

Figure 4. Detailed prompt

P_{mllm}

for automated triplet validity evaluation using LLaVA-NeXT Video. It formulates the verification process as a reasoning task based on two specific criteria (Q1 and Q2).

Figure 4. Detailed prompt

P_{mllm}

for automated triplet validity evaluation using LLaVA-NeXT Video. It formulates the verification process as a reasoning task based on two specific criteria (Q1 and Q2).

Figure 5. Examples of constructed triplets. In each triplet, green indicates elements in the query text related to the query video that were modified, while orange indicates elements in the modified target video.

Figure 6. Example of triplets constructed by each encoder for a given query video. In each triplet, green indicates elements in the query text related to the query video that were modified, while orange indicates elements in the modified target video.

Table 1. Distribution of the four MLLM answer patterns and Human Validation Rate (HVR) for different combinations of video and text encoders. The bold values indicate the highest percentage of valid triplets among the encoder combinations compared.

Encoder		MLLM Answers [%]				HVR [%]
Vision	Text	Q1: Y, Q2: Y	Q1: Y, Q2: N	Q1: N, Q2: Y	Q1: N, Q2: N	HVR [%]
MatchVision [36]	CLIP-ViT-B/16 [10]	90.91	3.43	0.01	5.59	66
CLIP-ViT-B/16	CLIP-ViT-B/16	90.39	3.52	0.02	6.04	66
MatchVision	all-roberta-large-v1 [39]	87.23	6.64	0.03	6.00	54
MatchVision	all-mpnet-base-v2 [40]	86.92	6.73	0.03	6.29	60

Table 2. Ablation study on target video identification, evaluated only on triplets constructed from the nearly identical range (caption similarity 0.85 or above): distribution of the four MLLM answer patterns and HVR.

Encoder		MLLM Answers [%]				HVR [%]
MatchVision	CLIP-ViT-B/16	Q1: Y, Q2: Y	Q1: Y, Q2: N	Q1: N, Q2: Y	Q1: N, Q2: N
(Vision)	(Text)
✓	✓	76.47	5.99	0.02	17.33	62
✓		95.93	3.04	0.00	0.99	54
	✓	74.18	6.82	0.11	18.78	48

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yoshida, R.; Goka, R.; Maeda, K.; Ogawa, T.; Haseyama, M. Automated Dataset Construction for Composed Video Retrieval in Soccer. Appl. Sci. 2026, 16, 5360. https://doi.org/10.3390/app16115360

AMA Style

Yoshida R, Goka R, Maeda K, Ogawa T, Haseyama M. Automated Dataset Construction for Composed Video Retrieval in Soccer. Applied Sciences. 2026; 16(11):5360. https://doi.org/10.3390/app16115360

Chicago/Turabian Style

Yoshida, Riku, Ryota Goka, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. 2026. "Automated Dataset Construction for Composed Video Retrieval in Soccer" Applied Sciences 16, no. 11: 5360. https://doi.org/10.3390/app16115360

APA Style

Yoshida, R., Goka, R., Maeda, K., Ogawa, T., & Haseyama, M. (2026). Automated Dataset Construction for Composed Video Retrieval in Soccer. Applied Sciences, 16(11), 5360. https://doi.org/10.3390/app16115360

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Automated Dataset Construction for Composed Video Retrieval in Soccer

Abstract

1. Introduction

2. Related Work

2.1. Sports Video Understanding and Retrieval

2.2. Automated Triplet Construction for CoVR

3. Proposed Method

3.1. Multimodal Identification of Target Videos

3.2. Generation of Query Text via LLM

3.3. Triplet Validity Evaluation via an MLLM

4. Experimental Results

4.1. Experimental Settings

4.2. Quantitative Experimental Results

4.3. Qualitative Experimental Results

5. Limitations

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI