VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce prejudice in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise the proposal generator and post-processing to produce accurate segments from debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves competitive performance comparable to supervised methods. The code is available on https://github.com/YoucanBaby/VTG-GPT


Introduction
Given a linguistic query, video temporal grounding (VTG) aims to locate the most relevant temporal segments from an untrimmed video, each containing a start and end timestamp. An illustrative example of VTG is shown in Figure 1a. This task [1,2] has numerous practical applications in daily life, such as helping video platform users skip to relevant portions of a video. The field of natural language has witnessed a significant leap forward with the advent of GPT-4 [3]. This development has spurred the rise of large language models (LLMs) such as LLaMA [4] and Baichuan2 [5]. Concurrently, GPT-based (Generative Pre-trained Transformer) models like MiniGPT4 [6] and LLaVA [7] have made significant strides in vision and multimodal applications. A recent work, LLaViLo [8], reveals that training adapters alone can effectively leverage the video understanding capabilities of LLMs. However, this method requires designing a sophisticated fine-tuning strategy specifically for VTG, thereby introducing additional computing costs.
Existing VTG methods [1,9–12] primarily adopt supervised learning, which demands massive training resources and numerous annotated video-query pairs, as illustrated in Figure 1b. However, developing datasets for VTG is time-consuming and expensive; for instance, Moment-DETR [1] spent 1455 person-hours and USD 16,600 to create the QVHighlights dataset. Furthermore, ground-truth (GT) queries often contain human biases, such as (1) bias from erroneous word spellings, as depicted in Figure 2a, where the misspelled word "ociture" in the original query would be tokenized by language models into "o", "cit", "ure", leading to model misunderstanding; and (2) bias due to incorrect descriptions, as shown in Figure 2b, where the action "turns off the lights" mentioned in the query does not occur in the video.
In this paper, we propose a tuning-free zero-shot method named VTG-GPT to address the above issues. As shown in Figure 1c, VTG-GPT completely satisfies zero-shot settings, adopting a direct feed-forward approach without training or fine-tuning. To minimize biases arising from human-annotated queries, we employ Baichuan2 [5] to rephrase the original query and obtain debiased queries. As illustrated in Figure 2, the erroneous word "ociture" in query (a) has been accurately revised to "image", and the non-existent action "turn off the lights" in query (b) has been effectively refined to "a darkened environment". Furthermore, considering that videos inherently contain more redundant information than text, and inspired by the human approach to understanding video linguistically, we apply MiniGPT-v2 [6] to transform visual content into more precise textual descriptions. Finally, to generate accurate temporal proposals, we design a proposal generator that models debiased queries and image captions in the textual domain. In summary, our main contributions include:
(1) To the best of our knowledge, ours is the first zero-shot method to utilize GPT for VTG without training or fine-tuning.
(2) We present a novel framework, VTG-GPT, which effectively leverages GPT to mitigate human prejudice in annotated queries. Furthermore, VTG-GPT distinctively models debiased queries and video content within the linguistic domain to generate temporal segments.
(3) Comprehensive experiments demonstrate that VTG-GPT significantly surpasses SOTA (State-of-the-Art) methods in zero-shot settings. More importantly, this method achieves competitive performance comparable to supervised methods.

Video Temporal Grounding
For fully-supervised VTG, prior works [1,9–11,13–16] typically employ encoders to extract visual and textual features, followed by designing a VTG model (e.g., transformer encoder-decoder) to interact and align the two modalities, as depicted in Figure 1b. UniVTG [13] designs a multi-modal and multi-task learning pipeline, undergoing pretraining or fine-tuning on dozens of datasets. To accelerate the training convergence of VTG, GPTSee [14] introduces LLMs to generate prior positional information for the transformer decoder. However, these supervised approaches inevitably rely on extensive human-annotated data and training resources. To alleviate the dependence on annotations, PSVL [17], DSCNet [18], and Gao et al. [19] propose unsupervised frameworks that employ clustering to generate pseudo queries from video features. Similarly, PZVMR [20] and Kim et al. [21] leverage CLIP [22] for pseudo query generation. Yet, the above unsupervised methods unavoidably introduce biases from mismatched video-query pairs. In this paper, we adhere to the definitions of unsupervised and zero-shot settings as discussed by Luo et al. [23], classifying these approaches [17,20,21] as unsupervised.
To avoid any training or fine-tuning of the model, Diwan et al. [2] design the first zero-shot framework utilizing CLIP, but its reliance on shot transition detectors for obtaining temporal segments limits performance. Considering that CLIP (InternVideo [24]) pretrained on 400M image-text (12M video-text) pairs can align visual and textual inputs in a shared feature space, Luo et al. [23] develop a bottom-up pipeline to leverage the capabilities of vision-language models. Wattasseril et al. [25] employ a sparse frame-sampling strategy and BLIP2 [26] to reduce the computational cost of inference. However, these zero-shot methods [2,23,25] tend to generate redundant video features, introducing new biases that impair model performance. A recent study [27] found that large self-supervised models can be trained effectively even when over 75% of the input image is masked. Moreover, SeViLA [28] demonstrates that transforming visual signals into textual representations significantly reduces redundant information, thereby boosting performance in tasks such as video question answering and VTG.

Generative Pre-Trained Transformer
The groundbreaking success of GPT-4 [3] in the language domain has led to the development of a series of open-source LLMs [4,5,29,30]. Baichuan2 [5], containing 7 billion parameters and trained on 2.6 trillion tokens, excels in vertical domains such as technology and daily conversation. MiniGPT4 [6,31] introduces a large multi-modal model (LMM) based on GPT, adept at performing visual-linguistic tasks like image captioning and visual question answering. Recent studies demonstrate that leveraging GPT models effectively reduces prejudice originating from ground-truth labels, while simultaneously enhancing model performance in zero-shot multimodal tasks. This advancement is particularly notable in areas such as relation detection and information extraction, showcasing the robust generalization capabilities of GPT in these complex scenarios. To further capitalize on GPT's capabilities in video understanding, LLaViLo [8] designs specialized adapters for VTG, but this method still necessitates model training. To overcome these limitations, this paper proposes a novel zero-shot VTG pipeline aiming to eliminate human biases from GT queries while fully harnessing the visual comprehension capabilities of GPT, achieving a tuning-free framework.

Our Method
In this section, we first formulate the VTG task and then present the overall architecture of our VTG-GPT. Subsequently, we provide details of each module in the model.

Overview
Given an untrimmed video $V \in \mathbb{R}^{N_v \times H \times W \times 3}$ consisting of $N_v$ frames and a natural language query $T \in \mathbb{R}^{L_t}$ formed by $L_t$ words, the objective of video temporal grounding (VTG) is to precisely identify time segments $[t_s, t_e] \in \mathbb{R}^{N_s \times 2}$ in $V$ that semantically correspond to $T$, where each segment starts at timestamp $t_s$ and ends at timestamp $t_e$. The overview of our proposed VTG-GPT is illustrated in Figure 3. The core aim of VTG-GPT is to implement a tuning-free framework while reducing human bias in the process. To this end, our first step is employing Baichuan2 (Section 3.2) to refine the raw query $T$, resulting in debiased queries $Q \in \mathbb{R}^{N_q \times L_t}$. Then, we leverage MiniGPT-v2 (Section 3.3) to convert visual content in each frame into image captions $C \in \mathbb{R}^{N_v \times L_c}$, effectively reducing redundant information in video $V$. In Section 3.4, we compute similarity scores $S_s \in \mathbb{R}^{N_q \times N_v}$ between $Q$ and $C$; that is, we assess query-frame correlation in the linguistic domain via Sentence-BERT. Following this, a proposal generator is designed to yield temporal proposals $P \in \mathbb{R}^{N_p \times 2}$. Finally, in the post-processing stage (Section 3.5), we calculate final scores $S_f \in \mathbb{R}^{N_p}$ for each proposal while removing excessively overlapping proposals to produce predicted segments $\mathrm{Seg} \in \mathbb{R}^{N_s \times 2}$.

Query Debiasing
Mitigating biases in ground-truth queries represents a crucial and challenging problem for VTG, as these biases often originate from inherent human subjectivity. Such biases often include errors like misspellings and inaccurate descriptions of video content, as shown in Figure 2. Moreover, different annotators may characterize the same video segment in varying ways. A minority might opt for a formal language style, while others might gravitate towards colloquial or slang expressions. This difference in descriptions can inadvertently lead the model to prefer certain types of queries, thus introducing human prejudice and potentially diminishing the model's performance.
To address the aforementioned challenges, we utilize Baichuan2 to eliminate human biases inherent in original queries, as demonstrated in Figure 4a. In line with human linguistic comprehension [32], our first step is to rectify spelling and grammatical inaccuracies in the original query $T$, thus producing the corrected version $T_c$. We direct GPT with the instruction: "Please correct spelling and grammatical errors in the original query." Subsequently, we instruct Baichuan2 to rewrite $T_c$ to remove incorrect descriptions. The corresponding command is: "Please rephrase the corrected query using different wording while maintaining the same intent and information." Finally, we generate five semantically similar yet syntactically diverse queries $Q$ to prevent the model from relying on a specific query type. The command for this is: "Provide five different rephrasings." Although it is generally advisable to issue only one command per message in GPT dialogues to avoid model errors, as noted in [30], we find in our tests that aggregating all instructions into a single message to GPT is more effective, as shown in Figure 4a. Note that the red font in the figure is not present in the code. A case involving misspelled words is shown in Figure 5a, where the incorrectly spelled word "ociture" is corrected to "image" or "picture". Figure 5b demonstrates a scenario involving a non-existent action, where "turn off the lights" is rewritten as "lights off" or "a darkened environment", the latter being more congruent with the original video segment. In short, this debiasing strategy, featuring variations that differ in structure and word choice, deeply explores semantic information and enables the model to process various real-world queries effectively.
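The single-message strategy described above can be sketched as a small prompt builder. This is only an illustration: the three instruction strings are the ones quoted in the text, but the function name `build_debias_prompt`, the ordering, and the "Original query:" suffix are assumptions rather than the exact prompt shown in Figure 4a, and the example query is made up.

```python
# Instruction strings quoted in the text. How they are joined and how the
# query is appended are assumptions made for illustration only.
INSTRUCTIONS = (
    "Please correct spelling and grammatical errors in the original query.",
    "Please rephrase the corrected query using different wording while "
    "maintaining the same intent and information.",
    "Provide five different rephrasings.",
)


def build_debias_prompt(query: str) -> str:
    """Aggregate all debiasing instructions and the raw query into one message."""
    lines = list(INSTRUCTIONS)
    lines.append(f"Original query: {query.strip()}")
    return "\n".join(lines)


# Hypothetical query containing the misspelling discussed in the text.
prompt = build_debias_prompt("A woman shows an ociture to the camera.")
print(prompt)
```

The resulting string would be sent as one chat message to the LLM; the model call itself is omitted because it depends on the serving setup.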

Image Captioning
To retrieve corresponding video segments Seg based on the query, traditional zero-shot methods [2,23] initially employ pre-trained multi-modal models [22,24] for feature extraction from visual and textual modalities. These features are then used to calculate similarities to derive Seg. However, our preliminary experiments utilizing CLIP and InternVideo to assess cross-modal similarity, as shown in the upper part of Table 1, yielded mediocre results. We attribute this to the over-reliance of traditional methods on directly modeling raw frames, which is often influenced by background details, thereby reducing the accuracy of primary content recognition. Some recent works [28,33] suggest that videos contain abundant non-essential information and that translating visual signals into more abstract descriptions can enhance VTG performance.
Inspired by the above research, we incorporate a large multi-modal model (LMM), MiniGPT-v2 [6], to obtain more detailed image descriptions. As demonstrated in Figure 4b, our initial instruction to MiniGPT-v2 is "[image caption] Please describe the content of this image in detail.", where the bracketed tag emphasizes the task to be performed. Subsequently, we sequentially send frames in video $V$ to MiniGPT-v2, which provides us with detailed captions $C \in \mathbb{R}^{N_v \times L_c}$. Following this, we use the CLIP text encoder (CLIP-T) to extract linguistic features from $C$ and $Q$ and calculate their similarities. As illustrated in the third row of Table 1, the results are surprisingly effective, achieving significant gains with this straightforward approach. We ascribe this to the LMM's focus on capturing key image content, thereby reducing irrelevant background interference and enhancing semantic similarities between queries and frames. For instance, the last frame in Figure 4b is succinctly translated into the caption "A blonde woman in a white shirt is smiling at the camera, with a black background.", which semantically closely matches the query "The blonde girl chooses to vlog in a darkened environment alone."
Table 1. Preliminary experiment with different similarity models on the QVHighlights val split, using the proposal generator and proposal scorer but without NMS. CLIP-T is short for using the CLIP [22] text encoder only. Please refer to Section 4.1 for a detailed explanation of the evaluation metrics.

Proposal Generation
Computing query-frame similarity. In Section 3.3, we have articulated the significance of image captioning within VTG-GPT and employed CLIP-T to model debiased queries $Q$ and image captions $C$ within the textual domain. However, CLIP-T, as a multi-modal model, does not outperform specialized language models in NLP (natural language processing) tasks, as outlined in previous research [36]. Therefore, we explore the use of a language-specific model. We opt for RoBERTa [34] (Sentence-BERT [35]) to extract normalized pooling features of $Q \in \mathbb{R}^{N_q \times L_t}$ and $C \in \mathbb{R}^{N_v \times L_c}$, denoted as $f_q \in \mathbb{R}^{N_q \times d}$ and $f_c \in \mathbb{R}^{N_v \times d}$, respectively, where $d$ represents the dimensionality. We then compute the cosine scores between $f_q$ and $f_c$ as similarities $S_s \in \mathbb{R}^{N_q \times N_v}$:

$$S_s^{i,j} = \frac{f_q^{i} \cdot f_c^{j}}{\lVert f_q^{i} \rVert \, \lVert f_c^{j} \rVert}, \qquad i = 1, \dots, N_q, \quad j = 1, \dots, N_v.$$

As demonstrated in rows four to five of Table 1, the use of expert NLP models yielded significant improvements, which also validates the viewpoints presented in the report [36].
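As a minimal sketch of the query-frame similarity step, the following pure-Python code computes a cosine-similarity matrix between query features and caption features. The function names and the toy two-dimensional vectors are assumptions for illustration; in the actual pipeline the inputs would be the pooled sentence embeddings of the debiased queries and image captions, which are not computed here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_matrix(query_feats, caption_feats):
    """Return S with S[i][j] = cosine similarity of query i and caption j.

    The result has N_q rows and N_v columns.
    """
    q = [l2_normalize(v) for v in query_feats]
    c = [l2_normalize(v) for v in caption_feats]
    return [[sum(a * b for a, b in zip(qi, cj)) for cj in c] for qi in q]


# Toy features standing in for sentence embeddings (illustrative only).
S = similarity_matrix([[1.0, 0.0], [1.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
print(S)
```

Normalizing first means the dot product equals the cosine score, so the same code works whether or not the encoder already returns unit-length features.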
Proposal generator. After obtaining query-frame similarity scores $S_s$, we move towards generating temporal proposals $P \in \mathbb{R}^{N_p \times 2}$. A straightforward method would be to apply a fixed threshold, considering frames with similarity scores exceeding this threshold as potential start or end timestamps. However, each query-video pair exhibits a unique similarity distribution. To adaptively obtain proposals, we introduce a dynamic mechanism within our devised proposal generator. For clarity, we denote the similarity between the $i$-th debiased query $Q_i$ and video $V$ as $S_s^{i} \in \mathbb{R}^{N_v}$, and the similarity between $Q_i$ and the $j$-th frame in $V$ as $S_s^{i,j} \in \mathbb{R}^{1}$. To be specific, the generator begins by computing a histogram of $S_s^{i}$ with $N_b$ bins. It then selects the bins containing the top $k$ highest similarities to define the dynamic threshold $\theta$, where $N_b$ and $k$ are hyperparameters. For their specific values, please refer to the implementation details (Section 4.1) and ablations (Section 4.3). Next, we iteratively assess each frame; if $S_s^{i,j}$ exceeds $\theta$, its corresponding timestamp is considered the proposal's starting point. When more than $\lambda$ consecutive frames are all lower than $\theta$, the last frame with a similarity greater than $\theta$ is marked as the end timestamp of this proposal. Here, $\lambda$ denotes the continuity threshold. Finally, we produce proposals for all debiased queries in the same video using this process to form final temporal proposals $P \in \mathbb{R}^{N_p \times 2}$ (representing potentially relevant video segments).
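The dynamic thresholding and frame scan described above can be sketched as follows. The threshold rule is one possible reading of the description, stated here as an assumption: bins are accumulated from the highest downward until they cover at least k frames, and the lower edge of the last bin taken becomes the threshold. The comparison uses `>=` so that a score sitting exactly on that bin edge counts as a hit. Function names are hypothetical, and proposals are returned as inclusive frame indices.

```python
def dynamic_threshold(scores, n_bins=10, k=8):
    """Lower edge of the lowest histogram bin needed to cover the k highest scores.

    Assumed interpretation of the top-k bin selection; not the authors' code.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        # Clamp so that the maximum score falls into the last bin.
        counts[min(int((s - lo) / width), n_bins - 1)] += 1
    covered = 0
    for b in range(n_bins - 1, -1, -1):
        covered += counts[b]
        if covered >= k:
            return lo + b * width
    return lo


def generate_proposals(scores, theta, lam=6):
    """Scan frames and emit (start_idx, end_idx) pairs, both inclusive.

    A proposal is closed once more than `lam` consecutive frames fall below
    theta; its end is the last frame that reached theta.
    """
    proposals = []
    start = last_hit = None
    for j, s in enumerate(scores):
        if s >= theta:
            if start is None:
                start = j
            last_hit = j
        elif start is not None and j - last_hit > lam:
            proposals.append((start, last_hit))
            start = last_hit = None
    if start is not None:
        proposals.append((start, last_hit))
    return proposals


# Toy similarity curve for one debiased query (values are made up).
sims = [0.0, 0.05, 0.15, 0.25, 0.85, 0.95, 1.0]
theta = dynamic_threshold(sims, n_bins=10, k=3)
print(theta, generate_proposals(sims, theta, lam=1))
```

Running the scan once per debiased query and concatenating the results would give the pooled proposal set for a video.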

Post-Processing
Proposal scorer. In Section 3.4, we generate a set of temporal proposals $P$ through our designed proposal generator. To identify the most fitting video segments from $P$, it is essential to compute and rank each proposal's confidence score. Intuitively, a straightforward approach could be averaging the similarity scores for each frame within a proposal, or only considering frames exceeding the dynamic threshold $\theta$. However, these methods overlook the impact of proposal length on the score. In our experiments, we observe that within certain ground-truth segments containing scene transitions, the similarity of some frames significantly exceeds that of adjacent frames. This leads to an excessively high dynamic threshold, resulting in the predicted segments being truncated or fragmented. To address this issue, we develop a length-aware scoring mechanism for proposals, encouraging the model to generate longer segments. Specifically, the evaluation of each proposal considers both its duration and the query-frame similarity: the final score of each proposal $S_f \in \mathbb{R}^{N_p}$ combines the similarity scores with a length score $S_l = L_p / L_n$. Here, $L_p$ represents the count of frames within a proposal exceeding $\theta$, and $L_n$ denotes the total number of frames exceeding $\theta$ across the entire video. The balancing coefficient $\alpha$ is adjustable to tune the relative influence of length and similarity in the final score.
NMS. In the final stage, considering that multiple debiased queries will produce numerous overlapping proposals, we employ non-maximum suppression (NMS) to reduce redundant overlaps and derive the final predicted video segments $\mathrm{Seg} \in \mathbb{R}^{N_s \times 2}$, where segments exceeding the intersection over union (IoU) threshold $\mu$ are selectively eliminated. This method ensures that only the most representative and distinct video segments are retained, enhancing the accuracy and relevance of our VTG-GPT output.
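A minimal sketch of the post-processing stage is given below. The NMS part is the standard greedy procedure on one-dimensional segments. The scoring function is an assumption: it uses a convex combination of the length score and the mean frame similarity, which matches the description that the length score is switched off at α = 0, but the exact formula is not recoverable from the text. All function names are hypothetical.

```python
def proposal_score(frame_sims, theta, n_above_video, alpha=0.5):
    """Length-aware proposal score (assumed form, not the authors' formula).

    frame_sims: similarities of the frames inside the proposal.
    n_above_video: L_n, the number of frames above theta in the whole video.
    """
    l_p = sum(1 for s in frame_sims if s > theta)   # L_p
    if not frame_sims or n_above_video == 0:
        return 0.0
    length_score = l_p / n_above_video              # S_l = L_p / L_n
    mean_sim = sum(frame_sims) / len(frame_sims)
    return alpha * length_score + (1.0 - alpha) * mean_sim


def temporal_iou(a, b):
    """IoU of two segments given as (start, end) with end >= start."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(proposals, scores, mu=0.75):
    """Greedy NMS: keep a proposal unless its IoU with a kept one exceeds mu."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(proposals[i], proposals[j]) <= mu for j in kept):
            kept.append(i)
    return [proposals[i] for i in kept]


# Toy proposals in seconds with made-up scores.
segments = nms([(0, 10), (1, 10), (20, 30)], [0.9, 0.8, 0.7], mu=0.75)
print(segments)
```

Because the kept list is built in descending score order, the highest-scoring proposal always survives, so the top-1 prediction is unchanged by NMS.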

Experiments

Experimental Settings
Datasets. To demonstrate the superiority and effectiveness of our proposed tuning-free VTG-GPT framework, we conduct extensive experiments on three publicly available datasets: QVHighlights [1], Charades-STA [37], and ActivityNet-Captions [38], as these datasets encompass diverse types of videos. QVHighlights consists of 10,148 distinct YouTube videos, each accompanied by human annotations that include a textual query, a temporal segment, and frame-level saliency scores. Here, the saliency scores serve as the output for the highlight detection (HD) task, quantifying the relevance between a query and its corresponding frames. QVHighlights encompasses a wide array of themes, ranging from daily activities and travel in everyday vlogs to social and political events in news videos. For evaluation, Moment-DETR [1] allocates 15% of the data for validation and another 15% for testing, with consistent data distribution across both sets. Because the online test server (https://codalab.lisn.upsaclay.fr/competitions/6937, accessed on 1 September 2023) allows a maximum of five submissions, all our ablation studies are conducted on the validation split. Charades-STA, derived from the original Charades [39] dataset, includes 9848 videos of human indoor activities, accompanied by 16,128 annotations. For this dataset, a standard split of 3720 annotations is specifically designated for testing. ActivityNet-Captions, built upon the raw ActivityNet [40] dataset, comprises 19,994 long YouTube videos from various domains. Since the test split is reserved for competitive evaluation, we follow the setup used in 2D-TAN [16], utilizing 17,031 annotations for testing.
Metrics. To effectively evaluate performance on VTG, we employ several metrics, including Recall-1 at Intersection over Union (IoU) thresholds (R1@m), mean average precision (mAP), and mean IoU (mIoU). R1@m measures the percentage of queries in the dataset where the highest-scoring predicted segment has an IoU greater than m with the ground truth. mIoU calculates the average IoU across all test samples. For a fair comparison, our results on the QVHighlights dataset report R1@m with m values of 0.5 and 0.7, mAP at IoU thresholds of 0.5 and 0.75, and the average mAP across multiple IoU thresholds [0.5:0.05:0.95]. For the Charades-STA dataset, we report R1@m for m values of 0.3, 0.5, and 0.7, along with mIoU. Finally, we employ mAP and HIT@1 to evaluate the results of HD, thereby measuring the query-frame relevance. Here, HIT@1 represents the accuracy of the highest-scoring frame.
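To make the R1@m and mIoU definitions concrete, here is a small sketch that evaluates top-1 predictions against ground-truth segments. It assumes one ground-truth segment per query and uses a strict "greater than m" comparison as stated above; the function names and the toy segments are illustrative, and mAP is omitted for brevity.

```python
def temporal_iou(pred, gt):
    """IoU of two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, m):
    """Percentage of queries whose top-1 segment has IoU > m with the ground truth."""
    hits = sum(1 for p, g in zip(top1_preds, gts) if temporal_iou(p, g) > m)
    return 100.0 * hits / len(gts)


def mean_iou(top1_preds, gts):
    """Average top-1 IoU over all queries, as a percentage."""
    ious = [temporal_iou(p, g) for p, g in zip(top1_preds, gts)]
    return 100.0 * sum(ious) / len(ious)


# Two toy queries: a perfect match and a partial overlap (IoU = 1/3).
preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(0.0, 10.0), (5.0, 15.0)]
print(recall_at_1(preds, gts, 0.5), mean_iou(preds, gts))
```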
Implementation details. To mitigate video information redundancy, we downsample the QVHighlights and Charades-STA datasets to 0.5 frames per second. Considering the extended duration of videos in ActivityNet-Captions, we extract one frame every three seconds. In the image captioning stage, we utilize MiniGPT-v2 [6] based on LLaMA-2-Chat-7B [4]. For query debiasing, we employ Baichuan2-7B-Chat [5], also based on LLaMA-2 [4], generating five debiased queries ($N_q = 5$) per instance. The temperature coefficients for MiniGPT-v2 and Baichuan2 are set at 0.1 and 0.2, respectively. Drawing from the preliminary experiments in Section 3.4, we select Sentence-BERT [35] as our similarity model to evaluate query-frame correlations using cosine similarity. The histogram in our proposal generator is configured with ten bins ($N_b = 10$), with a selection of the top eight values ($k = 8$) and a continuity threshold $\lambda = 6$. During the post-processing phase, the balance coefficient ($\alpha$) in the proposal scorer is set to 0.5, and the IoU threshold ($\mu$) for non-maximum suppression (NMS) is set to 0.75. All pre-processing and experiments are conducted on eight NVIDIA RTX 3090 GPUs. It is important to note that our VTG-GPT is purely inferential, involving no training phase.
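For reference, the hyperparameters listed above can be gathered in one configuration object. The values are those stated in the text; the class name `VTGConfig`, the field names, and the index-to-seconds helper are hypothetical and only illustrate how a sampled frame index maps back to a timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VTGConfig:
    """Settings reported in the text; names are illustrative, not from the released code."""
    fps: float = 0.5                  # sampling rate for QVHighlights and Charades-STA
    num_queries: int = 5              # N_q, debiased queries per instance
    num_bins: int = 10                # N_b, histogram bins
    top_k: int = 8                    # k
    continuity: int = 6               # lambda, continuity threshold
    alpha: float = 0.5                # balance coefficient in the proposal scorer
    nms_iou: float = 0.75             # mu, IoU threshold for NMS
    caption_temperature: float = 0.1  # MiniGPT-v2
    debias_temperature: float = 0.2   # Baichuan2


# ActivityNet-Captions: one frame every three seconds.
ACTIVITYNET_CONFIG = VTGConfig(fps=1.0 / 3.0)


def frame_to_seconds(index: int, fps: float) -> float:
    """Timestamp of the index-th sampled frame, assuming sampling starts at t = 0."""
    return index / fps


cfg = VTGConfig()
print(frame_to_seconds(3, cfg.fps))
```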

Comparisons to the State-of-the-Art
In this section, we present a comprehensive comparison of our VTG-GPT with state-of-the-art (SOTA) methods in VTG. Firstly, we disclose results on the QVHighlights validation and test splits, as shown in Table 2. The approaches are categorized into fully supervised (FS), weakly supervised (WS), unsupervised (US), and zero-shot (ZS) methods. Notably, VTG-GPT significantly outperforms the previous SOTA zero-shot model (Diwan et al. [2]), demonstrating substantial improvements across five metrics. Specifically, R1@0.7 saw an increase of +7.49 and mAP@0.5 improved by +7.23. Remarkably, VTG-GPT also vastly exceeds all WS methods. Most impressively, our approach surpasses the FS baseline (Moment-DETR [1]) in most metrics, even achieving competitive performance compared with FS methods. Unlike these methods, VTG-GPT requires only a single inference pass, eliminating the need for training data and resources.

Table 2 (column headers only): Method, Year, Setup; R1@0.5, R1@0.7, mAP@0.5, mAP@0.75, and mAP Avg. on the QVHighlights test and val splits.

Subsequently, we report the performance on the Charades-STA test set and ActivityNet-Captions test set in Table 3. In Charades-STA, VTG-GPT surpasses the SOTA zero-shot method (Luo et al. [23]) with a +5.81 increase in R1@0.7 and a +1.89 improvement in mIoU. Furthermore, VTG-GPT significantly outperforms the best US method (Kim et al. [21]) across all metrics. However, on the ActivityNet-Captions dataset, our method falls slightly behind Luo et al. in two metrics, which we attribute to the high downsampling rate used for this dataset. Moreover, VTG-GPT approaches the performance of the fully supervised Moment-DETR, validating its capacity to handle diverse and complex video contexts without any training or fine-tuning. This underscores the robustness and adaptability of VTG-GPT in zero-shot VTG scenarios, demonstrating its potential as a versatile and efficient tool for video understanding.

In the Setup column, FS means fully supervised, WS means weakly supervised, US means unsupervised, and ZS means zero-shot.

Table 3 (column headers only): Method, Year, Setup; Charades-STA: R1@0.3, R1@0.5, R1@0.7, mIoU; ActivityNet-Captions: R1@0.3, R1@0.5, ...

To qualitatively validate the effectiveness of our VTG-GPT model, we present visual comparisons of grounding results from the Ground-Truth (GT), Moment-DETR, and VTG-GPT in Figure 5. Observations indicate that the tuning-free VTG-GPT achieves more precise localization than the supervised Moment-DETR. The primary reason is that Moment-DETR relies solely on the original queries, which contain human-annotated errors, thus failing to fully align with the video's semantic information. In contrast, VTG-GPT can correct erroneous queries and reduce the bias introduced by human annotations, leading to more accurate grounding. To be more specific, in Figure 5a, our model detects a spelling mistake in the query, where "ociture" is corrected to "image" or "picture". In Figure 5b, the action "turns off" is refined to terms more congruent with the video context, such as "lights off", "darkened environment", and "no lights". Additionally, the five rephrasings of each original query, in contrast to the original phrasing, exhibit more flexible grammatical structures, enabling the text encoder to comprehensively capture the semantic information of the original query.

Ablation Studies
To demonstrate the effectiveness of each module within our VTG-GPT framework, we perform in-depth ablation studies on the QVHighlights dataset.
Effect of debiased query. Firstly, we report saliency scores used to evaluate query-frame correlation. As delineated in Table 4, row three corresponds to VTG-GPT without debiasing, where we directly employ the similarity generated by Sentence-BERT as the saliency scores. Conversely, row four is VTG-GPT with debiasing, wherein we average the similarity of five debiased queries as saliency scores. The comparison reveals that row four significantly outperforms row three, demonstrating the efficacy of our debiasing strategy in mitigating human biases inherent in the original queries. Furthermore, comparing row two (UMT [9]) and row four, our VTG-GPT achieves a notable increase in HIT@1, recording a score of 62.29 (+2.3). This enhancement underscores VTG-GPT's superior reasoning capabilities in discerning challenging cases, affirming the value of our debiasing approach in refining model performance.
Then, we investigate the impact of different numbers of debiased queries ($N_q$) generated by Baichuan2 on the performance of the VTG-GPT model. Our findings, as depicted in Figure 6a, indicate that the model achieves optimal results when utilizing five debiased queries ($N_q = 5$). Compared to using solely the original biased query, implementing five debiased queries resulted in a notable increase in R1@0.5 to 54.26 (+3.87) and an improvement in mAP Avg. to 30.91 (+2.59). This evidence suggests that removing bias from queries significantly enhances the model's accuracy. However, an intriguing observation emerged: the performance metrics decline when $N_q$ exceeds 5. This pattern suggests that excessive rephrasing does not continually yield improvements, likely due to the finite number of synonymous rewrites and syntactic variations available to maintain the original intent of the query. Over-rephrasing can introduce irrelevant content, deviating from the semantic intent of the raw query, and potentially diminishing model performance. This finding underscores the critical need to balance the number of query rewrites, ensuring that debiased queries capture a spectrum of semantic nuances while retaining the essence of the original query. Future research should focus on developing advanced query debiasing techniques to enhance this equilibrium.

Figure 6. (a) Utilizing debiased queries can enhance model performance, yet increasing the number of debiased queries ($N_q$) does not always lead to better results. The model achieves optimal performance when $N_q$ is set to 5. (b) In the proposal scorer, proposal length significantly impacts the final outcomes, with the model performing optimally when $\alpha = 0.5$.
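The saliency computation used in this ablation, averaging the query-frame similarities over the debiased queries, can be sketched as below. The function names and the toy similarity values are illustrative assumptions; the top-frame helper only shows which frame a HIT@1-style check would inspect and does not reproduce the benchmark's scoring rules.

```python
def saliency_scores(sim_matrix):
    """Average S[i][j] over the query axis i to get one saliency score per frame j."""
    if not sim_matrix:
        return []
    n_q = len(sim_matrix)
    return [sum(column) / n_q for column in zip(*sim_matrix)]


def top_frame(saliency):
    """Index of the highest-scoring frame."""
    return max(range(len(saliency)), key=saliency.__getitem__)


# Two debiased queries, three frames (made-up similarities).
sal = saliency_scores([[0.2, 0.8, 0.1], [0.4, 0.6, 0.3]])
print(sal, top_frame(sal))
```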

LLMs and LMMs. In Table 5, we evaluate the capabilities of LLMs (LLaMA-v2 [4] and Baichuan2 [5]), alongside LMMs (MiniGPT-4 [31] and MiniGPT-v2 [6]) in handling biased queries and generating image captions. A comparison between rows two and five reveals that Baichuan2 outperforms LLaMA-v2, since it builds on LLaMA-v2 and is trained on more diverse data and tasks, enhancing its sentence rewriting capabilities. As illustrated in row three, MiniGPT-v2, also developed on the foundations of LLaMA-v2, shows moderate results in text dialogue. Comparing rows four and five, we observe an improvement in the performance of MiniGPT-v2 over MiniGPT-4. Overall, the results suggest that the integration of Baichuan2 for query debiasing combined with MiniGPT-v2 for image captioning emerges as the most effective strategy. This effectiveness stems from their complementary capabilities: Baichuan2 excels in handling complex multi-turn text dialogues, while MiniGPT-v2 is adept at detailed multimodal dialogues. This synergy maximizes the text comprehension ability of LLMs and the video understanding capacity of LMMs, thereby enhancing the overall performance of our framework.
Proposal generator. In our study, top-$k$ and the continuity threshold $\lambda$ within the proposal generator play a critical role. The parameter $k$, acting as a count threshold in our dynamic mechanism, directly influences the identified length of relevant proposals. In contrast, $\lambda$ determines the number of irrelevant consecutive frames tolerated within a proposal. To optimize these parameters, we conducted a series of ablation experiments on the proposal generator, as illustrated in Figure 7. The visualized results indicate that a combination of $k = 8$ and $\lambda = 6$ yields the most favorable outcomes. This specific pairing strikes a balance between segment length and threshold sensitivity. It avoids the drawbacks of excessively low thresholds, which could incorporate irrelevant frames into prediction results. Simultaneously, it averts the "tolerance trap" where an overly high number of discontinuous frames makes it difficult to determine when the segment ends.
Proposal scorer. To balance the quality and length of segments, we conduct experiments on our proposal scorer, as shown in Figure 6b. We explore integrating the length score $S_l$ into the scoring mechanism. Initially, without including the length score ($\alpha = 0$), mAP Avg. is 30.45. Upon incorporating $S_l$, mAP Avg. peaks at 30.91. Similarly, R1@0.5 increases from 51.21 to 54.24, indicating that incorporating a length-based scoring mechanism is crucial for generating the final segment scores.
IoU threshold $\mu$. Finally, we assess the effectiveness of IoU thresholds $\mu$ in the NMS process, focusing on their role in reducing segment overlap. It is important to note that NMS does not alter the values of R1@0.5 and R1@0.7, because the highest-scoring segment is always retained. Therefore, we report only the mAP metrics in Table 6. As illustrated in Table 6, setting $\mu$ to 0.75, compared to not employing NMS ($\mu = 1$), results in an increase of +0.53 in mAP Avg. This increment underscores the significance of eliminating excessively overlapping segments, affirming that reducing such overlaps can notably enhance the model's performance.

Conclusions
This paper proposes a tuning-free framework named VTG-GPT for zero-shot video temporal grounding. To minimize the bias from mismatched videos and queries, we employ Baichuan2 for refining human-annotated queries. Recognizing the inherent redundancy in video compared to text, we utilize MiniGPT-v2 to transform visual inputs into more exact descriptions. Moreover, we develop the proposal generator and post-processing to produce temporal segments from debiased queries and image descriptions. Comprehensive experiments validate that VTG-GPT significantly surpasses current SOTA methods in zero-shot settings. Remarkably, it achieves a level of performance on par with supervised approaches.

Discussion
Limitations. In our study, constrained by computational resources, we downsampled frames in the long-video dataset ActivityNet-Captions, which adversely affected performance. Future work should focus on developing a more efficient and faster GPT model to address this challenge. Moreover, due to the context-length limitations of video-based GPT, our framework relies solely on image-based GPT and therefore lacks temporal information modeling.
In future work, we will explore applying video-based GPT (such as VideoChat-GPT [45]) to enhance the capabilities of zero-shot VTG. In addition, crafting more efficient modules for query debiasing and proposal generation is paramount. Finally, leveraging GPT to implement zero-shot frameworks for other data-driven tasks (such as video summarization [13], depth estimation [46,47], and transformer diagnosis [48]) is very promising.
Ethical considerations. Our work is based on open-source LLMs and LMMs that require only direct inference without training, thereby reducing the carbon footprint. Additionally, we utilize common and safe prompts and have not observed the generation of harmful or offensive content by the model.

Figure 1. (a) An illustrative example of a video temporal grounding (VTG) task. (b) Previous methods require training for all modules. (c) Our proposed VTG-GPT operates without any training or fine-tuning. Moreover, it employs GPT to reduce bias in human-annotated queries.

Figure 2. Human biases in ground-truth queries arise from (a) misspelled words and (b) incorrect descriptions. Our approach effectively mitigates these biases by leveraging GPT to optimize raw queries.

Figure 4. (a) An example of query refinement using Baichuan2. (b) An example of image captioning using MiniGPT-v2. The red font employed here is for demonstration purposes only and is not present in the actual code.

Figure 5. Visualization of predictions on QVHighlights val split. (a) Misspelled words. (b) Incorrect descriptions. Our VTG-GPT achieves more precise localization compared to Moment-DETR [1], as it can correct errors in the original queries through rewriting and generate debiased queries, thereby facilitating more accurate grounding.

Figure 6. Ablation experiments on the QVHighlights val split focus on R1@0.5 and mAP Avg. (a) Utilizing debiased queries can enhance model performance, yet increasing the number of debiased queries (N_q) does not always lead to better results. The model achieves optimal performance when N_q is set to 5. (b) In the proposal scorer, proposal length significantly impacts the final outcomes, with the model performing optimally when α = 0.5.

Figure 7. Ablation experiments for top-k and continuity threshold (λ) in proposal generator on QVHighlights val split.When k = 8 and λ = 6, the model achieves the best performance (red triangle).

Table 2. Performance comparison on QVHighlights test and val split. FS means fully-supervised method, WS means weakly supervised, and ZS means zero-shot.

Table 3. Performance comparison on Charades-STA test split and ActivityNet-Captions test split.

Table 4. Comparison of video highlight detection (HD) on QVHighlights val split. VG stands for Very Good. ✓ and ✗ indicate the use and non-use of debiased queries, respectively.

Table 5. Ablation study of different LLMs and LMMs (Large Multi-modal Models) on QVHighlights val split.

Table 6. Comparison of different IoU thresholds (µ) in NMS on QVHighlights val split.