ETR: Event-Centric Temporal Reasoning for Question-Conditioned Video Question Answering
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- Could you please quantify "visual overload" with entropy or redundancy metrics so that arguments are not subjective.
- Clarify the limitations of data specificity, NExT-QA vs. STAR, to reinforce generalization.
- If you can add a failure taxonomy for long-form VideoQA, it will help the discussion of mixed causes.
- Justify the threshold t1 = 0.2 with validation curves, if possible.
- Could you clarify more "non_key_object", is it boolean? Or a confidence score? Explain further.
- Compare against strong prompt-only baselines with identical backbones. This will make attribution very clear.
- Lastly, please add compute metrics (GPU-hours, latency) per route, because “efficient” is claimed repeatedly. Make this justifiable.
Author Response
Dear Reviewer,
Thank you very much for your thorough and constructive feedback on our manuscript. We greatly appreciate your detailed analysis and the valuable suggestions for improvement. Your insights have helped us identify key areas where our manuscript can be strengthened. Below, we provide a point-by-point response to each of your comments and outline the modifications we have made to the revised manuscript.
Comments 1: Could you please quantify "visual overload" with entropy or redundancy metrics so that arguments are not subjective.
Response 1: Thank you for your insightful comment. We agree that quantifying concepts such as "visual overload" can strengthen the objectivity of the argument. In response, we have now clarified in the manuscript that our discussion of visual information redundancy is grounded in the findings of a prior study published at CVPR 2025 (now cited as [20]). We sincerely apologize for the omission of this reference in our initial submission, which has now been corrected. Regarding the use of entropy or redundancy metrics for quantification, we acknowledge that such metrics are not currently employed in our analysis. Our argument instead builds upon the established conclusions of the cited work. The revision, including the added citation, can be found on Page 2, Line 40 of the updated manuscript. We appreciate your understanding and hope that this clarification addresses your concern.
Comments 2: Clarify the limitations of data specificity, NExT-QA vs. STAR, to reinforce generalization.
Response 2: Thank you for your insightful suggestion. In response, we have clarified the limitations of data specificity for both the NExT-QA and STAR datasets, and we have elaborated on how these limitations affect model generalization. These revisions have been incorporated on Page 16 of the updated manuscript.
To strengthen the discussion on generalization and address the limitations arising from data specificity in the NExT-QA and STAR datasets, we have provided a more detailed comparative analysis in the revised manuscript:
- NExT-QA: As a large-scale benchmark designed to advance temporal and causal reasoning in Video Question Answering (VideoQA), NExT-QA exhibits certain data-specific limitations that constrain generalization. It comprises 5,440 short video clips, primarily depicting daily-life scenarios with fixed scene settings, along with 52,044 human-annotated question–answer pairs. These questions are categorized into three reasoning types: causal (48%), temporal (29%), and descriptive (23%). Causal questions focus on understanding intentions and outcomes, temporal questions emphasize reasoning over action sequences and dependencies, and descriptive questions target perception-based tasks such as object recognition and counting. The dataset is split into 3,870 videos for training, 570 for validation, and 1,000 for testing. On average, each question contains 11.6 words, and each answer 2.6 words. Compared to STAR, NExT-QA offers less diversity in video scenarios and lacks exploration of situational logic, which limits the transferability of models trained on it to more complex, real-world settings beyond simple daily interactions.
- STAR: Designed to support situated VideoQA and complex reasoning in realistic environments, STAR provides structured situational representations and logic-based annotations. It includes 22,000 trimmed video clips, approximately 60,000 question–answer pairs, and 240,000 candidate answers. While its video scenarios are more varied than those in NExT-QA, they are largely drawn from a fixed set of everyday environments (e.g., homes, offices), offering limited coverage of rare or atypical situations. This restricts the generalization of models trained on STAR to unconventional reasoning tasks. Each question is paired with a symbolic reasoning program and a ground-truth answer, facilitating compositional reasoning and interpretability. STAR organizes reasoning into four categories: Interaction (human–object relationships), Sequence (temporal order), Prediction (forecasting actions), and Feasibility (action plausibility). The dataset also includes 111 action predicates, 28 object categories, and 24 relationship types. Answer consistency scores across the four tasks—82.5%, 85.3%, 80.4%, and 78.5%—reflect varying levels of difficulty. Although STAR excels in situated and compositional reasoning, its limited scenario diversity and fixed predicate/relationship types still hinder generalization in cross-scenario or cross-domain VideoQA tasks.
These additions aim to provide a clearer understanding of how dataset characteristics influence model performance and to support claims regarding generalization more rigorously.
Comments 3: If you can add a failure taxonomy for long-form VideoQA, it will help the discussion of mixed causes.
Response 3: Thank you for your constructive suggestion. We agree that introducing a failure taxonomy can help disentangle the mixed causes of performance degradation in long-form VideoQA. In response, we have conducted an error case analysis, categorized the root causes of failures, and provided representative examples for each category. This addition can be found on Page 27 of the revised manuscript.
The revised text is as follows:
Despite the overall effectiveness of our approach, several limitations remain evident in our experimental analysis. To better understand the underlying causes of model failures in long-form VideoQA, we examine representative error cases and categorize them into two primary types based on the nature of the observed shortcomings.
- Failure Type I: Insufficient Robustness to Distractions.
The model sometimes struggles to filter out visually distracting or temporally irrelevant content when constructing event representations. As illustrated in the first example of Fig. 4, although the model captures a general understanding of the video, which depicts a boy unwrapping a gift, it is misled by an intermediate segment where the boy briefly shows the gift to a woman. This distraction causes the model to incorrectly select an answer involving "share with the girl" rather than the correct action "unwrap it." This suggests that the model's current mechanism for assessing the importance of event segments remains insufficient for suppressing task-irrelevant but salient visual cues.
- Failure Type II: Insufficient Sensitivity to Fine-Grained Visual Details.
In other cases, the model demonstrates a coarse understanding of the scene but fails to capture fine-grained visual information essential for accurate reasoning. The second example in Fig. 4 illustrates this limitation: given a video of a band performance, the model correctly identifies the general action (e.g., "playing an instrument") but overlooks critical details regarding how the action is performed (e.g., bowing vs. plucking). This indicates a need for enhanced perceptual granularity in frame-level feature extraction and cross-modal alignment.
By organizing failure cases into this preliminary taxonomy, we hope this analysis provides a more nuanced understanding of our model's limitations and directly addresses your concern.
Comments 4: Justify the threshold t1 = 0.2 with validation curves, if possible.
Response 4: Thank you for your insightful suggestion regarding the justification of threshold t1. We agree that providing empirical evidence to support the choice of t1=0.2 is essential for enhancing the transparency and rigor of our approach. Although we do not currently have a continuous validation curve, we have conducted experiments across five representative threshold values and summarized the results in Table 8. This addition can be found on Page 24 of the revised manuscript.
To investigate the impact of the threshold t1 on model performance, we evaluate five representative values and report the results in Table 8. The threshold t1 determines whether a sample is routed to event-level reasoning: samples with low semantic similarity between key frames and the question are processed by the event-level module, while those with high similarity are answered directly using key-frame information.
When t1 is set low, a large proportion of samples bypass the event-level understanding even when their key frames lack sufficient semantic relevance. This leads to degraded performance, as the model attempts to answer complex temporal or causal questions based on weakly related visual content, without capturing the full event structure or temporal dependencies. As t1 increases, performance improves initially, peaking at t1=0.2. At this point, the model achieves an optimal balance: samples with sufficiently informative key frames are handled efficiently via direct key-frame answering, while those requiring deeper temporal or causal reasoning are appropriately routed to the event-level module. However, when t1 becomes large, excessive samples are forced into event-level reasoning, including those whose key frames already provide adequate semantic context. This over-routing introduces redundant temporal information and unnecessary computational overhead, which can interfere with the model's judgment and lead to a decline in accuracy.
These empirical results validate our choice of t1=0.2 as the point that best balances routing sensitivity and overall video question answering performance.
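The routing behaviour described above can be illustrated with a minimal Python sketch. It is not the implementation used in the manuscript: the function names, the use of a mean to aggregate per-frame similarities, and the toy similarity values are all assumptions made for illustration. It only shows how the share of samples sent to event-level reasoning grows as t1 increases.

```python
from statistics import mean


def route_sample(frame_similarities, t1=0.2):
    """Pick a route from key-frame/question similarities.

    Assumption: per-frame similarities are aggregated with a mean;
    the manuscript may aggregate them differently.
    """
    score = mean(frame_similarities)
    # Low similarity -> event-level reasoning; otherwise answer from key frames.
    return "event_level" if score < t1 else "key_frame"


def event_level_fraction(samples, t1):
    """Fraction of samples routed to event-level reasoning at threshold t1."""
    routes = [route_sample(s, t1) for s in samples]
    return routes.count("event_level") / len(routes)


# Toy similarity scores (illustrative only, not taken from the paper).
toy_samples = [[0.05, 0.10], [0.15, 0.22], [0.30, 0.35], [0.50, 0.60]]
fractions = {t: event_level_fraction(toy_samples, t) for t in (0.1, 0.2, 0.4)}
print(fractions)
```

On these toy values a small t1 lets most samples bypass event-level reasoning, while a large t1 pushes most of them into it, which mirrors the under-routing and over-routing regimes discussed above.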
Comments 5: Could you clarify more "non_key_object", is it boolean? Or a confidence score? Explain further.
Response 5: Thank you for your question. We agree that the definition and representation of "non_key_object" require further clarification. In response, we have explicitly specified that "non_key_object" is a boolean flag rather than a confidence score. This revision can be found on Page 11 of the updated manuscript.
The revised text is as follows:
The selector operates in two successive stages. The first stage, referred to as the problem type selector, performs a coarse-grained analysis to distinguish questions that require detailed information and fine event reasoning from those that do not. Conditioned on this categorization, the second stage adopts a composite selector that jointly considers the presence of key objects and the frame-question relevance scores to determine the most appropriate processing route. Specifically, for questions identified as requiring detailed information, the composite selector further separates them into two branches based on confidence and object cues: one branch corresponds to cases where either (i) there exists at least one frame with low relevance and the boolean flag non_key_object is True (i.e., the object set from the Key Object Extractor is empty), or (ii) all frames have low relevance even when the boolean flag non_key_object is False; the other branch includes the remaining questions. This decision process is formalized as the following routing function:
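The branch condition described in words above can be sketched in Python as follows. This is an illustration rather than the manuscript's formal definition: the function name `takes_first_branch`, the parameter `low_threshold`, and its default of 0.2 are hypothetical, and "low relevance" is assumed to mean a score strictly below that threshold.

```python
def takes_first_branch(frame_relevance, non_key_object, low_threshold=0.2):
    """Return True when a question falls into the first branch.

    frame_relevance: per-frame relevance scores for the question.
    non_key_object:  boolean flag, True when the key-object set is empty.
    low_threshold:   hypothetical cut-off below which a frame counts as
                     "low relevance" (an assumption, not from the paper).
    """
    if not frame_relevance:
        raise ValueError("at least one frame score is required")
    low = [score < low_threshold for score in frame_relevance]
    # (i) some frame has low relevance and no key object was extracted
    cond_i = non_key_object and any(low)
    # (ii) every frame has low relevance although key objects exist
    cond_ii = (not non_key_object) and all(low)
    return cond_i or cond_ii


print(takes_first_branch([0.1, 0.5], non_key_object=True))   # condition (i)
print(takes_first_branch([0.1, 0.5], non_key_object=False))  # neither condition
```

Because non_key_object is a boolean rather than a confidence score, it acts only as a switch between the "any frame" and "all frames" tests.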
Thank you again for helping us improve the clarity of our manuscript.
Comments 6: Compare against strong prompt-only baselines with identical backbones. This will make attribution very clear.
Response 6: Thank you very much for your valuable suggestion. We fully agree that comparing against strong prompt-only baselines with identical backbones is essential for clearly attributing the source of performance gains. In response, we have explicitly included this comparison in the revised manuscript, as presented in Table 4.
The relevant results are as follows:
Baseline model (coarse keyframe selection only, no prompting): 63.63% accuracy; prompt-only baseline (same backbone with object-centric prompt only): 63.77% accuracy; our full model: 64.45% accuracy.
From this comparison, we observe the following. The prompt-only baseline yields only a marginal improvement over the baseline (+0.14 percentage points), indicating that textual prompts alone are insufficient to fully address the task. The larger improvement achieved by our full model (+0.82 percentage points over the baseline, +0.68 over the prompt-only baseline) therefore stems primarily from other components of our approach rather than from the prompting technique itself. We believe this controlled comparison clarifies the contribution of each component and addresses your concern. This addition can be found in the revised manuscript on Page 21.
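The attribution arithmetic can be checked with a few lines of Python. The three accuracies are the ones quoted above; the dictionary keys and the helper `gain` are names chosen here for illustration.

```python
# Accuracies (%) quoted in the response for the three configurations.
accuracy = {"baseline": 63.63, "prompt_only": 63.77, "full_model": 64.45}


def gain(system, reference):
    """Difference in accuracy, in percentage points, rounded to 2 decimals."""
    return round(accuracy[system] - accuracy[reference], 2)


prompt_gain = gain("prompt_only", "baseline")      # gain from prompting alone
full_gain = gain("full_model", "baseline")         # gain of the full model
beyond_prompt = gain("full_model", "prompt_only")  # gain beyond prompting
prompt_share = round(prompt_gain / full_gain, 2)   # share of the total gain

print(prompt_gain, full_gain, beyond_prompt, prompt_share)
```

On these figures, prompting alone accounts for about 17% of the total gain, and the remaining components account for the rest.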
Thank you again for helping us strengthen the rigor and clarity of our experimental analysis.
Comments 7: Lastly, please add compute metrics (GPU-hours, latency) per route, because “efficient” is claimed repeatedly. Make this justifiable.
Response 7: Thank you for your valuable suggestion. We fully agree that claims of efficiency must be supported by solid quantitative evidence. In response, we have added computational metrics to justify the efficiency statements made throughout our paper. Specifically, we clarify that our claim of "efficiency" refers to a relative comparison: the computational advantage of our hierarchical approach over the baseline scheme of performing event-centric temporal reasoning on all samples. To quantitatively validate this advantage, we have included a comparison of computational metrics for both schemes in the revised manuscript, summarized in Table 10.
[Updated text in the manuscript:]
Table 10 reports the efficiency analysis. We compare our proposed method (Ours) with a baseline that sends all samples through the T-Route, under the same hardware and dataset conditions. Ours achieves an average per-video inference latency of 1.01 seconds, approximately 20% lower than the baseline's 1.26 seconds. Both total wall-clock time and total GPU time are reduced by 25%. The primary source of acceleration lies in the T5 generation stage, where the average time per call decreases from 1.8–2.4 seconds in the baseline to 0.92 seconds in Ours (a reduction of roughly 49% to 62%), while peak and average memory consumption remain nearly identical (24.29 GB / 19.2 GB). If a full temporal sequence modeling capability were further introduced into the baseline, it would incur significant additional computational overhead and increased latency. In contrast, Ours achieves an effective balance between inference speed and temporal understanding capability under its current design.
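The relative reductions implied by the quoted timings can be reproduced with a short Python sketch. The timing values are those stated above; the helper name `relative_reduction` is chosen here for illustration.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# Per-video latency: 1.26 s (all T-Route baseline) vs 1.01 s (Ours).
latency_cut = relative_reduction(1.26, 1.01)

# T5 generation time per call: 1.8-2.4 s (baseline range) vs 0.92 s (Ours).
t5_cut_low = relative_reduction(1.8, 0.92)
t5_cut_high = relative_reduction(2.4, 0.92)

print(f"latency: {latency_cut:.1%}")
print(f"T5 per call: {t5_cut_low:.1%} to {t5_cut_high:.1%}")
```

The per-call saving depends on which end of the 1.8–2.4 second baseline range is used as the reference, so it is better reported as a range than as a single figure.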
We would like to thank you once again for your thorough and insightful review. Your comments have been invaluable in helping us improve the clarity, rigor, and completeness of our manuscript. We have carefully addressed all points raised and believe the revised version is substantially stronger as a result. We hope that our responses and modifications meet with your approval and look forward to your further consideration.
Sincerely,
The Authors
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript is dedicated to the problem of long-form Video Question Answering (VideoQA) and proposes an FEU (Fine Event Understanding) approach, which aims to reduce temporal reasoning and redundant visual and textual information. The authors aim to enhance video-text semantic matching through question-oriented hierarchical routing and object-based description generation. The proposed methodology is logically sound and has been evaluated through extensive experiments on the NExT-QA and STAR benchmarks. The topic is relevant and scientifically significant in the context of modern LLM-based VideoQA research.
However, the manuscript does not sufficiently analyze a number of methodological, experimental, and practical aspects. Although the results are positive, additional comments and analyses are required regarding the stability of the method, computational complexity, and practical application. Below are detailed comments that will help the authors improve the manuscript.
- The FEU model relies heavily on large language models such as Flan-T5-XL, which are used in the stages of frame relevance assessment, event description generation, and final response generation. However, the manuscript does not provide any quantitative analysis of how the inference time, FLOPs, memory consumption, or computational cost change with increasing video length. This makes it difficult to assess the applicability of the method in real-world situations, especially in resource-constrained environments.
- The number of sub-events (k = 3) and the threshold value t1 used for routing in the T-Route phase were chosen empirically. It is not reported in the manuscript how the model performance changes when these parameters are changed. Sensitivity analysis is also important to show how much the results depend on these specific values.
- The manuscript superficially shows cases where the model gives incorrect answers, but does not analyze the reasons for this, such as visual ambiguity, temporal confusion, or object misidentification. By including such analysis in the manuscript, it would be possible to better understand the limitations of the model.
- The FEU is based entirely on the Flan-T5-XL model. This leaves open the question of the generalizability of the method to other LLMs or smaller models. This limitation should be explicitly discussed in the manuscript.
- The architecture of the proposed method is quite complex, involving multi-stage clustering and routing mechanisms. However, the improvements observed in some benchmarks are relatively small (approximately 0.8-1.3%). The question of how much this level of complexity is justified by the results is not sufficiently substantiated.
Spelling, stylistic and grammatical errors.
- There are inconsistencies in the affiliation format. Author affiliation information is provided in various formats and is not fully compliant with the official MDPI journal requirements.
- The text contains some minor errors related to plural forms, use of manuscripts, and word division, which do not directly affect the overall quality but require editing. For example, “Videoqa” and “VideoQA” should be written in the same format; “frames is not totally continuous” must be “frames are not totally continuous”, and “these photo” must be “these photos”. It is recommended to review the manuscript again to avoid similar situations.
- Some of the figures (particularly Figures 1 and 2) are conceptually useful but visually complex. It is recommended that the figure captions be clearer and more relevant to the steps of the method.
Author Response
Dear Reviewer,
Thank you very much for your thorough and constructive feedback on our manuscript. We greatly appreciate your detailed analysis and the valuable suggestions for improvement. Your insights have helped us identify key areas where our manuscript can be strengthened. Below, we provide a point-by-point response to each of your comments and outline the modifications we have made to the revised manuscript.
Comments 1: The FEU model relies heavily on large language models such as Flan-T5-XL, which are used in the stages of frame relevance assessment, event description generation, and final response generation. However, the manuscript does not provide any quantitative analysis of how the inference time, FLOPs, memory consumption, or computational cost change with increasing video length. This makes it difficult to assess the applicability of the method in real-world situations, especially in resource-constrained environments.
Response 1: Thank you for your insightful comment. We fully agree that claims of efficiency must be supported by solid quantitative evidence, particularly regarding how computational costs scale with video length, a crucial factor for real-world deployment in resource-constrained environments.
In response, we have added computational metrics to substantiate the efficiency claims made throughout our paper. We clarify that our notion of "efficiency" refers to a relative comparison: the computational advantage of our hierarchical approach over the baseline scheme of performing event-centric temporal reasoning on all samples. To quantitatively validate this advantage, we have included a comparison of computational metrics for both schemes in the revised manuscript, summarized in Table 10.
[Updated text in the manuscript:]
Table 10 reports the efficiency analysis. We compare our proposed method (Ours) with a baseline that sends all samples through the T-Route, under the same hardware and dataset conditions. Ours achieves an average per-video inference latency of 1.01 seconds, approximately 20% lower than the baseline's 1.26 seconds. Both total wall-clock time and total GPU time are reduced by 25%. The primary source of acceleration lies in the T5 generation stage, where the average time per call decreases from 1.8–2.4 seconds in the baseline to 0.92 seconds in Ours (a reduction of roughly 49% to 62%), while peak and average memory consumption remain nearly identical (24.29 GB / 19.2 GB). If a full temporal sequence modeling capability were further introduced into the baseline, it would incur significant additional computational overhead and increased latency. In contrast, Ours achieves an effective balance between inference speed and temporal understanding capability under its current design.
Comments 2: The number of sub-events (k = 3) and the threshold value t1 used for routing in the T-Route phase were chosen empirically. It is not reported in the manuscript how the model performance changes when these parameters are changed. Sensitivity analysis is also important to show how much the results depend on these specific values.
Response 2: Thank you for your valuable suggestion. We fully agree that sensitivity analysis is essential for demonstrating the robustness of our method and justifying the empirical choice of key parameters. In response, we have conducted systematic evaluations of both the routing threshold t1 and the number of sub-events k, and have added the corresponding analyses to the revised manuscript.
Regarding the threshold t1: We evaluated five representative values and report the results in Table 8. The threshold t1 determines whether a sample is routed to event-level reasoning: samples with low semantic similarity between key frames and the question are processed by the event-level module, while those with high similarity are answered directly using key-frame information.
When t1 is set low, a large proportion of samples bypass event-level understanding even when their key frames lack sufficient semantic relevance. This leads to degraded performance, as the model attempts to answer complex temporal or causal questions based on weakly related visual content, without capturing the full event structure or temporal dependencies. As t1 increases, performance improves initially, peaking at t1=0.2. At this point, the model achieves an optimal balance: samples with sufficiently informative key frames are handled efficiently via direct key-frame answering, while those requiring deeper temporal or causal reasoning are appropriately routed to the event-level module. However, when t1 becomes large, excessive samples are forced into event-level reasoning, including those whose key frames already provide adequate semantic context. This over-routing introduces redundant temporal information and unnecessary computational overhead, which can interfere with the model's judgment and lead to a decline in accuracy. These empirical results validate our choice of t1=0.2 as the point that best balances routing sensitivity and overall video question answering performance.
Regarding the number of sub-events k: We analyzed the impact of different k values on overall model performance in Table 7. Specifically, performance reaches its lowest point at 63.51% when k=1. This is likely because treating all high-confidence time steps as a single pattern mixes distinct temporal structures, failing to effectively capture the sequential characteristics of events. As k increases to 4, overall performance slightly drops to 64.25% compared with k=3. Although temporal relationships are captured to some extent, over-segmentation may introduce redundancy and noise, leading to degraded performance. Overall, k=3 achieves the best balance between expressive power and model stability, enabling more effective modeling of the underlying temporal structure in high-confidence time steps.
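The effect of k can be illustrated with a toy Python sketch. It does not reproduce the clustering used in the manuscript: as a stand-in, it splits a sorted list of high-confidence time steps at the k-1 largest gaps. The function name `split_into_sub_events` and the example time steps are hypothetical.

```python
def split_into_sub_events(time_steps, k=3):
    """Split time steps into at most k contiguous groups at the largest gaps.

    A simple gap-based stand-in for clustering, used only to show how k
    changes the grouping; not the method described in the manuscript.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ts = sorted(time_steps)
    if not ts:
        return []
    # Candidate cut positions, ordered by the size of the gap before them.
    by_gap = sorted(range(1, len(ts)),
                    key=lambda i: ts[i] - ts[i - 1], reverse=True)
    cuts = sorted(by_gap[:k - 1])
    groups, start = [], 0
    for cut in cuts:
        groups.append(ts[start:cut])
        start = cut
    groups.append(ts[start:])
    return groups


# Hypothetical high-confidence time steps forming three natural groups.
steps = [1, 2, 3, 10, 11, 20, 21, 22]
one_group = split_into_sub_events(steps, k=1)
three_groups = split_into_sub_events(steps, k=3)
four_groups = split_into_sub_events(steps, k=4)
print(one_group, three_groups, four_groups, sep="\n")
```

With k=1 the three temporally distinct groups are merged into one, and with k=4 one natural group is split further, which corresponds to the mixing and over-segmentation effects described above.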
These analyses have been added to Sections 4.3.4 and 4.3.5 in the revised manuscript, with results summarized in Table 7 (for k) and Table 8 (for t1). The changes can be found on Pages 23–24.
We believe these additions provide clear empirical justification for our parameter choices and demonstrate that the model performs robustly across a reasonable range of values. Thank you again for helping us strengthen the rigor of our experimental analysis.
Comments 3: The manuscript superficially shows cases where the model gives incorrect answers, but does not analyze the reasons for this, such as visual ambiguity, temporal confusion, or object misidentification. By including such analysis in the manuscript, it would be possible to better understand the limitations of the model.
Response 3: Thank you for your constructive suggestion. We agree that a superficial discussion of error cases is insufficient, and that a deeper analysis of failure reasons—such as visual ambiguity, temporal confusion, and object misidentification—is essential for understanding model limitations. In response, we have conducted a systematic error case analysis, categorized the root causes of failures, and provided representative examples for each category. This addition can be found on Page 27 of the revised manuscript.
The revised text is as follows:
Despite the overall effectiveness of our approach, several limitations remain evident in our experimental analysis. To better understand the underlying causes of model failures in long-form VideoQA, we examine representative error cases and categorize them into two primary types based on the nature of the observed shortcomings.
- Failure Type I: Insufficient Robustness to Distractions.
The model sometimes struggles to filter out visually distracting or temporally irrelevant content when constructing event representations. As illustrated in the first example of Fig. 4, although the model captures a general understanding of the video, which depicts a boy unwrapping a gift, it is misled by an intermediate segment where the boy briefly shows the gift to a woman. This distraction causes the model to incorrectly select an answer involving "share with the girl" rather than the correct action "unwrap it." This suggests that the model's current mechanism for assessing the importance of event segments remains insufficient for suppressing task-irrelevant but salient visual cues.
- Failure Type II: Insufficient Sensitivity to Fine-Grained Visual Details.
In other cases, the model demonstrates a coarse understanding of the scene but fails to capture fine-grained visual information essential for accurate reasoning. The second example in Fig. 4 illustrates this limitation: given a video of a band performance, the model correctly identifies the general action (e.g., "playing an instrument") but overlooks critical details regarding how the action is performed (e.g., bowing vs. plucking). This indicates a need for enhanced perceptual granularity in frame-level feature extraction and cross-modal alignment.
We hope this analysis provides a more nuanced understanding of our model's limitations and directly addresses your concern regarding superficial error case discussion.
Comments 4: The FEU is based entirely on the Flan-T5-XL model. This leaves open the question of the generalizability of the method to other LLMs or smaller models. This limitation should be explicitly discussed in the manuscript.
Response 4: Thank you for raising this important point. We agree that the generalizability of our method to other LLMs or smaller models is a crucial aspect that requires explicit discussion. In response, we have added a dedicated discussion of this limitation in the revised manuscript.
The revised text is as follows:
The core design of ETR is largely model-agnostic, as it operates on general semantic representations and reasoning outputs rather than relying on architecture-specific components. Therefore, in principle, the method can be extended to other encoder–decoder LLMs as well as smaller-scale models. However, models with different scales or architectures may vary in their semantic representation quality and temporal reasoning capacity, which could affect ETR's performance. Systematically evaluating the method across diverse model backbones, including smaller models, is an important direction for future work.
This discussion has been added to the Limitations and Future Work section (Section 5) in the revised manuscript, on Page 27.
We believe this addition transparently addresses the current limitation regarding model specificity and outlines a clear path for future investigation. Thank you again for helping us improve the completeness of our manuscript.
Comments 5: The architecture of the proposed method is quite complex, involving multi-stage clustering and routing mechanisms. However, the improvements observed in some benchmarks are relatively small (approximately 0.8-1.3%). The question of how much this level of complexity is justified by the results is not sufficiently substantiated.
Response 5: Thank you for this critical and fair assessment. While the proposed framework may appear complex at a conceptual level, its actual computational cost during inference is mitigated by several key design choices. First, the routing mechanism dynamically selects among multiple processing pathways based on the question's characteristics: only a subset of samples is routed to the more expensive T-Route branch that requires event-level temporal reasoning, while the others are handled by lighter branches. This ensures that the additional complexity is incurred only when necessary.
Second, each branch serves a distinct and necessary purpose:
- T-Route handles questions requiring deep temporal understanding, using two-stage clustering to assess keyframe importance and model event structures.
- O-Route addresses questions that benefit from contextual information beyond keyframes, performing single-pass clustering to incorporate event-level context while maintaining efficiency.
- N-Route processes samples where keyframes alone provide sufficient semantic information, preserving overall efficiency.
Third, the necessity of the most complex branch has been validated through ablation studies (Table 5), which show that removing T-Route leads to a performance degradation of approximately 0.5%, confirming its contribution to the overall gain. Moreover, compared to a baseline that applies full event-level temporal reasoning to all samples, our method achieves a 20% reduction in inference latency and 25% lower GPU time (Table 10), demonstrating that the routing mechanism actually improves efficiency rather than compromising it.
Finally, we note that the entire framework is built upon a relatively lightweight foundation: the Flan-T5-XL backbone, with 4B parameters, inherently limits both the achievable performance ceiling and the overall computational footprint. This constraint further underscores that the observed gains are achieved within a modest computational budget.
We believe these points collectively demonstrate that the architectural complexity is not only justified by the performance gains but also carefully balanced with efficiency considerations. Thank you again for helping us strengthen this aspect of our work.
Comments 6: Spelling, stylistic and grammatical errors.
There are inconsistencies in the affiliation format. Author affiliation information is provided in various formats and is not fully compliant with the official MDPI journal requirements.
The text contains some minor errors related to plural forms, use of manuscripts, and word division, which do not directly affect the overall quality, but require editing, for example, “Videoqa” and “VideoQA” should be written in the same format. “frames is not totally continuous” - it must “frames are not totally continuous” or “these photo” - must be “these photos”. It is recommended to review the manuscript again to avoid similar situations.
Response 6: Thank you for your thorough and careful reading of our manuscript, and for identifying these language and formatting issues. We sincerely apologize for these oversights, which detract from the overall professionalism of the paper. In response, we have conducted a comprehensive proofreading of the entire manuscript to address all spelling, stylistic, and grammatical errors, as well as affiliation inconsistencies.
Specifically, we have made the following corrections and improvements:
Terminology standardization: We have ensured consistent use of "VideoQA" throughout the manuscript (replacing any instances of "Videoqa" or other variants).
Grammar and syntax: We have corrected all identified grammatical errors, including subject–verb agreement issues (e.g., "frames is" → "frames are") and pluralization errors (e.g., "these photo" → "these photos"). A thorough line-by-line review was conducted to catch similar issues throughout the text.
Affiliation format: We have carefully reviewed and unified the author affiliation information to comply with the official MDPI journal requirements. All affiliations now follow a consistent format across the title page.
Word division and typography: We have checked for proper word division, hyphenation, and typographical consistency throughout the manuscript.
To ensure thoroughness, we have also used professional proofreading tools and manually reviewed the entire text multiple times. All changes are marked in red in the revised manuscript for your convenience.
We believe these revisions have substantially improved the readability and professionalism of the manuscript. The corrections are distributed throughout the paper, with significant changes highlighted in the relevant sections.
Thank you again for your meticulous attention to detail, which has helped us enhance the overall quality of our work.
Comments 7: Some of the figures (particularly Figures 1 and 2) are conceptually useful but visually complex. It is recommended that the figure captions be clearer and more relevant to the steps of the method.
Response 7: Thank you for your helpful suggestion regarding the clarity of our figures. We agree that clear and precise figure captions are essential for helping readers understand the methodological steps. In response, we have revised the caption for Figure 1 to improve its accuracy and reduce potential ambiguity.
For Figure 1: The original caption has been revised to more precisely describe what the figure shows:
[Updated text in the manuscript:]
"An example illustrating different weighting schemes. We just show parts of frames of the video and the frames are not strictly consecutive to show a rough video development. The red boxes highlight frames which models focus."
This revision makes it explicit that only selected frames are shown and that they are not strictly consecutive, clarifying the illustrative nature of the figure and avoiding potential misinterpretation by readers.
For Figure 2: We have simplified the visual content to reduce complexity and better align with the actual method steps. The corresponding caption has also been updated to guide readers through the methodological flow more clearly.
These changes can be found on Pages 7-8 of the revised manuscript.
Thank you again for your valuable input in helping us improve the clarity of our presentation.
We would like to thank you once again for your thorough and insightful review. Your comments have been invaluable in helping us improve the clarity, rigor, and completeness of our manuscript. We have carefully addressed all points raised and believe the revised version is substantially stronger as a result. We hope that our responses and modifications meet with your approval and look forward to your further consideration.
Sincerely,
The Authors
Reviewer 3 Report
Comments and Suggestions for Authors
- The term “Fine Event Understanding” is non-standard in the literature and should be replaced with “fine-grained event understanding” or “event-centric temporal reasoning” to align with established terminology.
- The phrase “FEU (Fine Event Understanding videoqa)” is grammatically incorrect and conceptually redundant; it should be written as “FEU (Fine-grained Event Understanding)” and “VideoQA” should not appear inside the parentheses.
- The expression “question emotions” is conceptually inaccurate because the method models question semantics or intent rather than affect, so it should be replaced with “question intent,” “question semantics,” or “question-conditioned signals.”
- The title phrase “Adaptive Question-Guided” is somewhat redundant because VideoQA is inherently question-guided, so “question-conditioned” or “adaptive routing” would more precisely describe the contribution.
- The concept “fine event understanding” is inconsistently used as a capability, a question category, and a module name, and should be formally defined once and used consistently thereafter.
- The candidate answer set is denoted by different symbols (O, A, and also reused as a function name), which creates ambiguity and should be unified under a single notation such as 𝒜 for the answer set and a different symbol for the prediction module.
- The symbol O is simultaneously used for “object,” “object set,” and sometimes an operator/function, which leads to semantic overload and should be separated into distinct notations.
- The statement “apply DBSCAN clustering to segment the video into events” is unclear about the input features, so the paper should explicitly specify that clustering is performed on frame embeddings rather than raw frames.
- The assumption that each event consists of “beginning, climax, and conclusion” introduces an unjustified narrative prior and should be replaced with neutral temporal segments such as “early, middle, and late.”
- The sentence “The top-1 event with the highest scores are selected” contains subject–verb disagreement and should be “is selected.”
- The phrase “denoted by {E_j} whose j is in i” is grammatically invalid and should be rewritten clearly as “we denote the selected event as E.”
- The clause “mechanism which focus on” violates subject–verb agreement and should be “mechanism that focuses on.”
- The expression “aligned with questions semantics” should use the possessive form and article, i.e., “aligned with the question’s semantics.”
- The phrase “in these photo” contains a number mismatch and should be “in these photos” or “in these images.”
- The caption sentence “The sample of using different weight” is unnatural academic English and should be rewritten as “An example illustrating different weighting schemes.”
- The sentence “the frames is not totally continuous” has grammatical and lexical issues and should be “the frames are not temporally contiguous” or “not strictly consecutive.”
- The verb “finetune the importance of key frames” is imprecise for weighting operations and should be replaced with “reweight” or “refine keyframe importance.”
- The variable k is used both for the number of segments and the number of keyframes, which risks confusion and should be separated into different symbols such as K for keyframes and L for segments.
- Several places mix “photo,” “image,” and “frame,” and the terminology should be standardized to “frame” for video data.
- The phrase “question-guided description prompting mechanism” is verbose and non-idiomatic, and should be simplified to “question-conditioned prompting strategy” or “object-centric prompting module.”
- Some sentences use informal or vague verbs such as “make the model focus” or “help the model understand,” which should be replaced with precise academic verbs like “encourage,” “facilitate,” or “improve alignment.”
- The problem definition should explicitly state the probabilistic objective (e.g., argmax over candidate answers conditioned on video and question) to make the task formulation mathematically rigorous rather than purely descriptive.
Author Response
Dear Reviewer,
Thank you very much for your thorough and constructive feedback on our manuscript. We greatly appreciate your detailed analysis and the valuable suggestions for improvement. Your insights have helped us identify key areas where our manuscript can be strengthened. Below, we provide a point-by-point response to each of your comments and outline the modifications we have made to the revised manuscript.
Comments 1: The term “Fine Event Understanding” is non-standard in the literature and should be replaced with “fine-grained event understanding” or “event-centric temporal reasoning” to align with established terminology.
Response 1: Thank you for pointing this out. We agree that "Fine Event Understanding" is not a standard term in the literature. In response, we have replaced it with "event-centric temporal reasoning" throughout the manuscript to align with established terminology. This change has been made in the title, abstract, and main text.
Comments 2: The phrase “FEU (Fine Event Understanding videoqa)” is grammatically incorrect and conceptually redundant; it should be written as “FEU (Fine-grained Event Understanding)” and “VideoQA” should not appear inside the parentheses.
Response 2: Thank you for this correction. We have revised the problematic phrasing. The term is now written as "ETR (Event-centric Temporal Reasoning)" throughout the manuscript, and "VideoQA" has been removed from the parentheses. The corrected definition appears in the abstract.
Comments 3: The expression “question emotions” is conceptually inaccurate because the method models question semantics or intent rather than affect, so it should be replaced with “question intent,” “question semantics,” or “question-conditioned signals.”
Response 3: Thank you for this important terminological clarification. We agree that "question emotions" is conceptually inaccurate, as our method models question semantics rather than affect. We have replaced all instances of "question emotions" with "question intent" where appropriate. These changes can be found in the abstract.
Comments 4: The title phrase “Adaptive Question-Guided” is somewhat redundant because VideoQA is inherently question-guided, so “question-conditioned” or “adaptive routing” would more precisely describe the contribution.
Response 4: Thank you for this helpful observation. We agree that "Adaptive Question-Guided" is somewhat redundant given that VideoQA is inherently question-guided. In response, we have revised the title to "ETR: Event-centric Temporal Reasoning for Question-conditioned Video Question Answering" to more precisely reflect our contribution. The updated title appears on Page 1.
Comments 5: The concept “fine event understanding” is inconsistently used as a capability, a question category, and a module name, and should be formally defined once and used consistently thereafter.
Response 5: Thank you for highlighting this inconsistency. We agree that "fine event understanding" should be formally defined and used consistently throughout the manuscript. In response, we have carefully distinguished between three related but distinct concepts:
Event-centric temporal reasoning refers to the model's capability to understand temporal structures, causal relationships, and event-level dynamics in long-form videos.
Fine event understanding describes a question category: questions that require detailed comprehension of sub-events and their temporal/causal relations, as opposed to simple perceptual or descriptive questions.
Question intent denotes the module responsible for classifying the type of reasoning required by a given question and guiding the routing process accordingly.
We have revised the manuscript accordingly, ensuring consistent usage of these terms throughout. These revisions clarify the distinctions between the three concepts and enhance terminological precision. Thank you again for helping us improve the clarity of our manuscript.
Comments 6: The candidate answer set is denoted by different symbols (O, A, and also reused as a function name), which creates ambiguity and should be unified under a single notation such as 𝒜 for the answer set and a different symbol for the prediction module.
Response 6: Thank you for identifying this symbol ambiguity. We have unified the notation for the candidate answer set as A throughout the manuscript. The prediction module is now denoted by a separate symbol C to avoid confusion. These changes have been applied in Section 3 and across all equations where answer sets appear. The revisions can be found on Pages 7–15.
Comments 7: The symbol O is simultaneously used for “object,” “object set,” and sometimes an operator/function, which leads to semantic overload and should be separated into distinct notations.
Response 7: Thank you for this careful observation. We agree that using O simultaneously for "object," "object set," and an operator creates semantic overload. In response, we have introduced distinct notations throughout the manuscript:
O denotes the object set;
o_i denotes individual objects within the set;
C is used for the Answer Module function;
N represents the selector routes for the remaining question category.
The revisions can be found on Pages 7–15. Thank you again for helping us improve the clarity and rigor of our notation.
Comments 8: The statement “apply DBSCAN clustering to segment the video into events” is unclear about the input features, so the paper should explicitly specify that clustering is performed on frame embeddings rather than raw frames.
Response 8: Thank you for pointing out this lack of clarity. We have revised the statement to explicitly specify the input features for DBSCAN clustering. The updated sentence now reads: "given the visual embedding for input frames, we first apply DBSCAN clustering to segment the video into coarse-grained events." This revision appears on Page 13.
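To make concrete what clustering on frame embeddings rather than raw frames means, the following is a minimal sketch and not the manuscript's implementation. The 2-D "embeddings", the values of `eps` and `min_samples`, and the function name `dbscan` are all assumptions chosen for illustration; real frame embeddings would be high-dimensional vectors from the visual encoder.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal brute-force DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # core point: keep expanding the cluster
    return labels

# Hypothetical toy embeddings: two dense groups of frames and one outlier frame.
frames = [(0, 0), (0.1, 0), (0, 0.2), (0.1, 0.1),
          (5, 5), (5.1, 5), (5, 5.2), (5.1, 5.1),
          (20, 20)]
labels = dbscan(frames, eps=1.0, min_samples=2)
print(labels)
```

Each distinct non-negative label corresponds to one coarse-grained event in this toy setting; frames labelled -1 belong to no event.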
Comments 9: The assumption that each event consists of “beginning, climax, and conclusion” introduces an unjustified narrative prior and should be replaced with neutral temporal segments such as “early, middle, and late.”
Response 9: Thank you for this insightful comment. We agree that imposing a "beginning, climax, conclusion" structure introduces an unjustified narrative prior. In response, we have replaced these terms with neutral temporal descriptors: "early, middle, and late segments." The revised text can be found on Page 14, Line 387.
Comments 10: The sentence “The top-1 event with the highest scores are selected” contains subject–verb disagreement and should be “is selected.”
Response 10: Thank you for catching this grammatical error. We have corrected the subject–verb disagreement. The sentence now reads: "The top-1 event with the highest scores is selected." This correction appears on Page 13, Line 367.
Comments 11: The phrase “denoted by {E_j} whose j is in i” is grammatically invalid and should be rewritten clearly as “we denote the selected event as E.”
Response 11: Thank you for pointing out this grammatically invalid expression. We have rewritten the sentence clearly as: "We denote the selected event as E." The revision can be found on Page 13, Line 379.
Comments 12: The clause “mechanism which focus on” violates subject–verb agreement and should be “mechanism that focuses on.”
Response 12: Thank you for identifying this subject–verb agreement error. We have corrected "mechanism which focus on" to "mechanism that focuses on." This change appears in the abstract.
Comments 13: The expression “aligned with questions semantics” should use the possessive form and article, i.e., “aligned with the question’s semantics.”
Response 13: Thank you for this grammatical correction. We have revised the phrase to "aligned with the question's semantics." This change has been made in the abstract.
Comments 14: The phrase “in these photo” contains a number mismatch and should be “in these photos” or “in these images.”
Response 14: Thank you for catching this number mismatch. We have corrected "in these photo" to "in these photos." All changes are highlighted in red in the revised manuscript.
Comments 15: The caption sentence “The sample of using different weight” is unnatural academic English and should be rewritten as “An example illustrating different weighting schemes.”
Response 15: Thank you for this suggestion. We have revised the caption to more natural academic English: "An example illustrating different weighting schemes." This change can be found in the caption of Figure 1 on Page 2.
Comments 16: The sentence “the frames is not totally continuous” has grammatical and lexical issues and should be “the frames are not temporally contiguous” or “not strictly consecutive.”
Response 16: Thank you for pointing out the grammatical and lexical issues in this sentence. We have revised it to: "the frames are not strictly consecutive." This correction appears in the caption of Figure 1 on Page 2 and Figure 3 on Page 26.
Comments 17: The verb “finetune the importance of key frames” is imprecise for weighting operations and should be replaced with “reweight” or “refine keyframe importance.”
Response 17: Thank you for this terminological precision. We agree that "finetune the importance" is imprecise for weighting operations. We have replaced it with "refine keyframe importance" throughout the manuscript. The changes can be found on Page 13, Line 380.
Comments 18: The variable k is used both for the number of segments and the number of keyframes, which risks confusion and should be separated into different symbols such as K for keyframes and L for segments.
Response 18: Thank you for identifying this potential symbol confusion. We agree that using k for both the number of event segments and the number of keyframes could lead to ambiguity. In response, we have introduced distinct notations to clarify these two different concepts:
L denotes the number of event segments obtained from density-based clustering. This represents the temporal segmentation of the video into coarse-grained event units. K denotes the number of keyframes selected for fine-grained analysis. This quantity is aligned with the keyframe extraction process and is independent of the event segmentation.
Thank you again for helping us improve the clarity of our notation.
Comments 19: Several places mix “photo,” “image,” and “frame,” and the terminology should be standardized to “frame” for video data.
Response 19: Thank you for this consistency check. We have standardized the terminology to "frame" throughout the manuscript for video data. All instances of "photo" and "image" (when referring to video frames) have been replaced. All changes are highlighted in red in the revised manuscript.
Comments 20: The phrase “question-guided description prompting mechanism” is verbose and non-idiomatic, and should be simplified to “question-conditioned prompting strategy” or “object-centric prompting module.”
Response 20: Thank you for this stylistic suggestion. We have simplified the verbose phrase to "question-conditioned prompting strategy." This revision appears in the abstract.
Comments 21: Some sentences use informal or vague verbs such as “make the model focus” or “help the model understand,” which should be replaced with precise academic verbs like “encourage,” “facilitate,” or “improve alignment.”
Response 21: Thank you for this important stylistic guidance. We have replaced informal verbs throughout the manuscript with more precise academic language centered on the concept of alignment, to better reflect the underlying mechanism of our approach. All changes are highlighted in red in the revised manuscript. Thank you again for helping us improve the academic precision of our writing.
Comments 22: The problem definition should explicitly state the probabilistic objective (e.g., argmax over candidate answers conditioned on video and question) to make the task formulation mathematically rigorous rather than purely descriptive.
Response 22: Thank you for this valuable suggestion to enhance mathematical rigor. We have revised the problem definition to explicitly state the probabilistic objective. The changes can be found on Page 7.
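The response does not quote the revised formula. As a sketch of the general form the reviewer describes (an argmax over candidate answers conditioned on video and question), and assuming the symbols V for the video, Q for the question, and A for the candidate answer set, such an objective can be written as follows. The exact equation in the manuscript may differ.

```latex
\hat{a} \;=\; \arg\max_{a \in \mathcal{A}} \; p(a \mid V, Q)
```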
We would like to thank you once again for your thorough and insightful review. Your comments have been invaluable in helping us improve the clarity, rigor, and completeness of our manuscript. We have carefully addressed all points raised and believe the revised version is substantially stronger as a result. We hope that our responses and modifications meet with your approval and look forward to your further consideration.
Sincerely,
The Authors
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Thank you for answering all my comments. The paper now is very clear.
Author Response
Dear Reviewer,
Thank you very much for your valuable comments during the first round of review, which were instrumental in helping us improve the quality of our manuscript.
We are very grateful for your positive feedback and pleased to know that you are satisfied with our revisions and that the paper is now very clear.
In this round of revision, we have also addressed the additional comments raised by the other reviewer to further strengthen the manuscript.
Thank you again for your time and constructive feedback throughout the review process.
Sincerely,
The Authors
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have made significant efforts to revise the manuscript in response to previous reviews. In particular, the structure of the methodology has been clarified and the routing functions have been formalized, and additional explanations have been added to improve readability and consistency. These revisions have improved the overall presentation of the work.
However, several important methodological and experimental issues remain unresolved and require further clarification before the manuscript can be considered for publication.
- The authors have included a comparison of inference time and GPU time in Table 10. However, this only shows the results for a specific hardware and dataset configuration. The following questions remain unanswered:
- How does inference time increase as video length increases?
- What is the time complexity of DBSCAN and k-means steps?
- Where is the analysis on FLOPs?
- How does memory consumption change depend on video length?
Therefore, the applicability of the model in real-world conditions (especially long videos or resource-limited environments) has not yet been fully substantiated.
- The proposed system includes several complex components: Hierarchical selector, T-Route (two-stage clustering), O-Route, Object-centric prompting. However, the overall performance improvement is around ~0.8-1.3%. Also, although the authors show ablation results, the following are not provided:
- Statistical significance (p-value)
- Confidence interval
- Dispersion of results
Since the statistical significance of the increase in performance was not demonstrated, the architectural complexity is not fully scientifically substantiated.
- The manuscript appears to be primarily an engineering-level solution. The following are not sufficiently developed:
- A formal mathematical model of temporal reasoning
- Theoretical justification for the choice of routing threshold (t1)
- Analytical commentary on the optimal performance of the hierarchical routing mechanism
- Complexity analysis of clustering steps
If the manuscript is intended for a theoretical journal, the methodological basis needs to be deepened.
- The authors state in the discussion section that the model could work with other LLMs. However:
- No experimental testing has been conducted with any other model
- Results with a smaller model were not presented
- The degree of dependence of the model architecture on the LLM has not been analytically assessed
Therefore, the conclusion about the generalizability of the model is based on theoretical assumption, not empirical evidence.
- Although the authors state that full proofreading has been carried out, the text still contains grammatical and stylistic errors. For example:
- Subject-verb agreement errors
- Errors in plural-singular forms
- Repetitive and redundant phrases
This situation negatively affects the overall scientific and professional level of the manuscript.
Author Response
Dear Reviewer,
Thank you very much for your thorough and constructive feedback on our manuscript. We greatly appreciate your detailed analysis and the valuable suggestions for improvement. Your insights have helped us identify key areas where our manuscript can be strengthened. Below, we provide a point-by-point response to each of your comments and outline the modifications we have made to the revised manuscript.
Comments 1: The authors have included a comparison of inference time and GPU time in Table 10. However, this only shows the results for a specific hardware and dataset configuration. The following questions remain unanswered:
How does inference time increase as video length increases?
What is the time complexity of DBSCAN and k-means steps?
Where is the analysis on FLOPs?
How does memory consumption change depend on video length?
Response 1: We thank the reviewer for the valuable suggestion. Since this study adopts a fixed-frame sampling strategy, the system processes a constant number of input frames (e.g., 32 frames) for each inference. Therefore, the computational cost is determined solely by the number of sampled frames rather than the absolute duration of the original video. This implies that for input videos of arbitrary length, as long as the same number of frames is sampled, both computational and memory costs remain constant, ensuring stable and controllable resource consumption in long-video scenarios.
To further validate the scalability of the system and address the reviewer’s concern, we conducted additional experiments with different sampling sizes (24 / 32 / 48 frames). The reported results are obtained by averaging multiple repeated runs to ensure measurement stability. The visual encoder FLOPs increase from 37,172.75 GFLOPs for 24 frames to 49,563.66 GFLOPs for 32 frames and 74,345.49 GFLOPs for 48 frames, demonstrating a strictly linear scaling behavior. The corresponding visual encoding times are approximately 2.96 s, 3.69 s, and 5.16 s, respectively. The total inference times are 12.45 s, 13.41 s, and 15.39 s, while the peak GPU memory consumption is 20.14 GB, 24.19 GB, and 32.31 GB. All metrics exhibit stable monotonic growth without abnormal inflation.
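The linear-scaling claim for the visual encoder can be checked directly from the figures quoted above: if FLOPs grow linearly with the number of frames, the FLOPs per frame should be the same at every sampling size. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Visual encoder cost in GFLOPs as reported in the response, keyed by sampled frames.
flops = {24: 37172.75, 32: 49563.66, 48: 74345.49}

# Under linear scaling, GFLOPs per frame is constant across sampling sizes.
per_frame = {n: total / n for n, total in flops.items()}
spread = max(per_frame.values()) - min(per_frame.values())

for n, value in per_frame.items():
    print(f"{n} frames: {value:.2f} GFLOPs per frame")
print(f"spread across settings: {spread:.4f} GFLOPs")
```

All three settings come out at roughly 1,548.86 GFLOPs per frame, which is consistent with the linear behavior described for the encoder. The same check applied to the reported encoding times gives a per-frame time that decreases as the frame count grows, so the timing figures are monotonic but not strictly proportional.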
In the clustering stage, K-means and DBSCAN are jointly applied to the frame-level features produced by the visual encoder, and their computational scale is directly determined by the number of sampled frames N. The time complexity of K-means is O(NKT), where K is the number of clusters and T is the number of iterations; thus, for fixed K and T, when the number of sampled frames doubles from 24 to 48, the theoretical computational cost approximately doubles, although its absolute runtime remains significantly lower than that of the visual encoding stage. DBSCAN has an average time complexity of O(N log N) when spatial indexing is employed, and under the current frame scale its runtime accounts for only a negligible portion of the total inference time. Therefore, both clustering steps scale at most quasi-linearly with N, remain small relative to the overall pipeline, and do not constitute a dominant computational bottleneck.
Regarding memory consumption, GPU memory consists of model parameters and frame-dependent feature representations, with only the feature tensors scaling with the number of frames. Consequently, peak memory usage grows from 20.14 GB (24 frames) to 32.31 GB (48 frames), a factor of about 1.6, which is lower than the twofold growth in FLOPs over the same range. This further confirms that the system maintains stable and controlled computational and storage costs under fixed-frame sampling for long-video inputs.
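The decomposition into a fixed part (model parameters) and a frame-dependent part (feature tensors) can be illustrated by fitting a straight line to the three reported memory measurements. This is a sketch under the assumption that peak memory is approximately `intercept + slope * frames`; the helper `fit_line` is a plain least-squares fit written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Peak GPU memory (GB) reported in the response for 24, 32, and 48 sampled frames.
frames = [24, 32, 48]
peak_memory_gb = [20.14, 24.19, 32.31]

slope, intercept = fit_line(frames, peak_memory_gb)
print(f"frame-dependent cost: {slope:.3f} GB per frame")
print(f"fixed cost (parameters and other constants): {intercept:.2f} GB")
```

The fit gives roughly 0.5 GB per additional frame on top of a fixed cost of about 8 GB, which is why doubling the frame count raises peak memory by a factor of about 1.6 rather than 2.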
In summary, the experimental results demonstrate that under a fixed-frame sampling strategy, inference time, FLOPs, and memory consumption are primarily determined by the number of sampled frames rather than the original video duration. The K-means and DBSCAN clustering steps do not constitute system bottlenecks, thereby supporting the scalability and deployability of the model in long-video and resource-constrained environments. The revision can be found on Page 29 of the updated manuscript.
Comments 2: The proposed system includes several complex components: Hierarchical selector, T-Route (two-stage clustering), O-Route, Object-centric prompting. However, the overall performance improvement is around ~0.8-1.3%. Also, although the authors show ablation results, the following are not provided:
Statistical significance (p-value)
Confidence interval
Dispersion of results
Response 2: We thank the reviewer for this thoughtful comment regarding statistical validation. The reviewer raises an important point about demonstrating that the observed 0.8-1.3% improvement is meaningful and not merely due to chance, and we fully agree that statistical robustness is essential for validating experimental results. Our experiments are conducted with fixed test sets (using the standard splits of NExT-QA and STAR) and fixed random seeds to ensure reproducibility. Under these conditions, repeated runs produce identical results, as there are no stochastic elements during inference. We acknowledge that this deterministic evaluation protocol limits the direct application of p-value calculations. To nonetheless demonstrate that the observed improvements are not due to chance, we have provided multiple lines of evidence:
1. Consistency across multiple datasets: Our method achieves improvements on two distinct benchmarks (NExT-QA and STAR), with gains of 0.8-1.3% on NExT-QA and 0.6-6.9% on STAR (Tables 2 and 3). The fact that these improvements replicate across datasets with different characteristics and question types suggests they reflect genuine methodological advances rather than dataset-specific artifacts.
2. Consistency across diverse baselines: On NExT-QA, ETR compares favorably against 23 baseline methods spanning a wide range of architectures, from conventional non-LLM approaches to large-scale LLM-based systems. The improvement is not against a single baseline but consistent across the entire spectrum of existing methods.
3. Systematic ablation patterns: Table 4 shows a clear monotonic improvement as each component is progressively added (63.63% → 63.77% → 64.09% → 64.45%). This incremental gain pattern is characteristic of meaningful contributions, where each module adds complementary value. If the improvements were due to random chance, such a consistent stepwise progression would be unlikely.
4. Principled routing behavior: Table 5 provides particularly compelling evidence: when the T-Route module is misapplied to all questions ("all T-Route"), performance actually degrades to 63.11%, lower than simpler strategies. The fact that selective routing yields the best performance (64.45%) demonstrates that the improvement comes from intelligent deployment of modules rather than random chance. If the gains were merely coincidental, we would not expect such a clear contrast between appropriate and inappropriate routing strategies.
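The response does not report confidence intervals. For reference, the interval the reviewer asks for can in principle be estimated even when inference is deterministic, because the uncertainty of interest comes from the finite sample of test questions rather than from run-to-run noise. The sketch below shows one common approach, a paired bootstrap over per-question correctness. The correctness vectors are synthetic placeholders, not results from the manuscript, and `paired_bootstrap_ci` is a hypothetical helper name.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(a) - accuracy(b) on the same questions."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic per-question correctness (1 = correct), for illustration only.
baseline = [1] * 300 + [0] * 200
proposed = [1] * 290 + [0] * 10 + [1] * 30 + [0] * 170

observed = (sum(proposed) - sum(baseline)) / len(baseline)
lo, hi = paired_bootstrap_ci(proposed, baseline)
print(f"observed accuracy gain: {observed:.3f}")
print(f"95% bootstrap interval: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would indicate that the gain is unlikely to be an artifact of which questions happen to be in the test set. A paired test such as McNemar's test on the same per-question outcomes is a common alternative.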
Comments 3: The manuscript appears to be primarily an engineering-level solution. The following are not sufficiently developed:
A formal mathematical model of temporal reasoning
Theoretical justification for the choice of routing threshold (t1)
Analytical commentary on the optimal performance of the hierarchical routing mechanism
Complexity analysis of clustering steps
Response 3: We thank the reviewer for the valuable suggestions to deepen the theoretical foundation of our work. In response, we have made the following enhancements:
Formal mathematical model: To address this concern, we have revised Section 3.1 extensively, transforming the ETR reasoning process from a textual description into a complete mathematical formulation. Specifically, starting from the basic VideoQA objective p(a∣V,Q) defined in Eq. (1), we formalize keyframe selection as a Top-K selection problem in Eq. (2), which aims to identify the most question-relevant temporal points from the video sequence. Eq. (3) formulates the routing decision as a selection function that dynamically chooses the proper temporal granularity according to the question type and semantic features. For questions that need fine-grained temporal understanding, Eq. (4) uses DBSCAN clustering to segment the video into semantically coherent events, achieving temporal abstraction from the frame level to the event level. Eq. (5) selects the most question-relevant event by maximizing semantic alignment, thereby locating key temporal segments. Eq. (6) presents the core step of temporal reasoning: k-means clustering divides the selected event into early, middle, and late stages, and assigns different weights according to the question's temporal focus to capture the dynamic evolution within the event. Finally, Eq. (7) fuses the temporally weighted frame representations with textual information to generate the final answer. Through these mathematical formulations, we formally describe the temporal reasoning process of ETR as a hierarchical optimization problem. Each equation corresponds to a key component of temporal processing, and together they provide a concrete instantiation of the conditional probability p(a∣V,Q) in Eq. (1). We hope these revisions improve the theoretical formalization and depth of the manuscript and help address the reviewer's concern. The changes can be found on Pages 7-8.
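To make the first steps of this formulation concrete, the following minimal Python sketch illustrates Top-K keyframe selection (in the spirit of Eq. (2)) and selection of the most question-relevant event (in the spirit of Eq. (5)). It is an illustration only, not the implementation used in the manuscript: it assumes frame and question embeddings are plain lists of floats and that relevance is measured by cosine similarity. The function names (cosine, top_k_keyframes, select_key_event), the mean-pooled event score, and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_keyframes(frame_feats, question_feat, k):
    """Return (indices of the k most relevant frames in temporal order, all scores)."""
    scores = [cosine(f, question_feat) for f in frame_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores

def select_key_event(events, scores):
    """Pick the event (a list of frame indices) with the highest mean relevance.
    Mean pooling over frames is an assumption made for this sketch."""
    return max(events, key=lambda ev: sum(scores[i] for i in ev) / len(ev))

# Toy example: 6 frames with 2-D features; the question aligns with the first axis.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7], [1.0, 0.2]]
question = [1.0, 0.0]

keyframes, scores = top_k_keyframes(frames, question, k=3)
events = [[0, 1], [2, 3], [4, 5]]  # hypothetical event segments, e.g. from clustering
key_event = select_key_event(events, scores)
print(keyframes, key_event)
```

In this toy setting the three frames closest to the question direction are kept, and the first event is selected because its frames have the highest average relevance.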
Theoretical justification for t₁: Regarding the selection of threshold t₁, we appreciate the reviewer's attention to this important hyperparameter. In our original manuscript, the choice of t₁=0.2 was determined empirically based on experimental comparisons on the validation set. To provide a more detailed explanation of this empirical process and further validate our choice, we have conducted additional analysis and incorporated it into the revised manuscript.
In our initial experiments, we evaluated multiple threshold values on the validation set, as shown in Table 8. The results demonstrated that t₁=0.2 achieves the best overall accuracy (64.45%), with performance decreasing for both smaller and larger values. This pattern reflects an important trade-off: when t₁ is too small, many samples that require event-level analysis are instead answered directly using weakly relevant key frames, limiting the model's ability to capture overall event structure and temporal logic. When t₁ becomes too large, too many samples are unnecessarily routed to temporal reasoning, introducing redundant temporal noise and interfering with the model's judgment for samples whose key frames already have high semantic relevance.
Beyond the ablation results in Table 8, we conducted further analysis to understand why t₁=0.2 represents an optimal balance. Specifically, we performed additional statistical analysis on a random sample of videos from our dataset. Across this sample, on average 18.23% of frames have relevance scores below t₁=0.2, with a standard deviation of 4.68% across videos. This consistent proportion, approximately one-fifth of frames falling below the threshold, indicates that the threshold is appropriately calibrated across a wide range of video content. The 18.23% figure represents an optimal balance: it is high enough to meaningfully affect performance, since the overall 0.8-1.3% improvement comes precisely from the frames that benefit from deeper temporal analysis, yet low enough that the majority of frames (81.77%) can be processed efficiently through simpler routes, maintaining fast inference. The moderate standard deviation of 4.68% indicates that this balance is achieved consistently across diverse video types rather than being driven by a small subset of atypical examples.
Based on these analyses, we have enhanced the discussion in the experiment section. This addition can be found in Section 4.3.5 of the revised manuscript.
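The per-video statistic described above (the share of frames whose relevance score falls below t₁, summarized by its mean and standard deviation across videos) can be computed as in the following minimal sketch. The relevance scores are made-up values, not data from the manuscript, the function names are hypothetical, and the use of the population standard deviation is an assumption, since the response does not state which estimator was used.

```python
import statistics

def fraction_below(scores, t1=0.2):
    """Share of frames in one video whose relevance score is below t1."""
    return sum(s < t1 for s in scores) / len(scores)

def summarize(videos, t1=0.2):
    """Mean and population standard deviation (in percent) of the
    per-video below-threshold share."""
    shares = [100.0 * fraction_below(v, t1) for v in videos]
    return statistics.mean(shares), statistics.pstdev(shares)

# Made-up relevance scores for three videos (illustrative only).
videos = [
    [0.10, 0.30, 0.50, 0.70, 0.90],  # 1 of 5 frames below 0.2
    [0.15, 0.18, 0.40, 0.60, 0.80],  # 2 of 5 frames below 0.2
    [0.25, 0.35, 0.45, 0.55, 0.65],  # 0 of 5 frames below 0.2
]

mean_pct, std_pct = summarize(videos)
print(f"mean = {mean_pct:.2f}%, std = {std_pct:.2f}%")
```

Repeating the same computation for several candidate values of t₁ would show how the share of below-threshold frames changes with the threshold.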
Analytical commentary on routing mechanism: We have incorporated an analysis in Section 4.3.2 that leverages the ablation results in Table 5 to demonstrate why selective routing outperforms single-route baselines, establishing key principles that justify the architectural design. Table 5 compares different routing strategies and provides insights into the optimal performance conditions of our hierarchical routing mechanism. The "non" strategy, which applies a general approach to all questions, achieves 63.77% accuracy. Routing all questions uniformly through "non T-Route" or "T-Route" yields lower accuracy (63.67% and 63.11%, respectively), revealing two key insights: (i) Over-processing harms simple questions: The performance drop of "all T-Route" (63.11%) compared to "non" (63.77%) indicates that applying fine-grained event reasoning to simple descriptive questions introduces semantic noise rather than benefit. This validates that temporal reasoning should be deployed selectively. (ii) Under-processing harms complex questions: The moderate performance of "non" (63.77%) suggests that while it handles simple questions well, it fails to capture the temporal dynamics needed for complex temporal questions. Building on this observation, our proposed hierarchical selector adaptively identifies questions requiring event-centric temporal reasoning, increasing accuracy to 64.45%. This semantics-driven selection achieves the best performance among the compared strategies by matching each question to its most appropriate processing route, deploying the expensive T-Route only when the expected gain outweighs the risk of introducing noise. The 0.68-1.34% improvement over single-route baselines demonstrates that the value of architectural complexity lies not in the modules themselves but in the intelligent routing that deploys them selectively, a principle validated by the performance degradation when modules are misapplied. This addition can be found in Section 4.3.2 of the revised manuscript.
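The contrast between selective and uniform routing can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the selector from the manuscript: a sample is sent to T-Route when its best key-frame relevance score is below t₁ (using the maximum as the summary statistic is an assumption), the label "direct" for the non-temporal route is hypothetical, and the per-sample outcomes are contrived so that each route succeeds on a different kind of question.

```python
def choose_route(keyframe_scores, t1=0.2):
    """Send a sample to the temporal route only when its key frames are weakly
    relevant. Summarizing by the maximum score is an assumption of this sketch."""
    return "T-Route" if max(keyframe_scores) < t1 else "direct"

def accuracy(samples, policy):
    """samples: list of (keyframe_scores, {route_name: answered_correctly}) pairs."""
    hits = sum(outcomes[policy(scores)] for scores, outcomes in samples)
    return hits / len(samples)

# Contrived outcomes: which route would answer each sample correctly.
samples = [
    ([0.80, 0.60], {"direct": True, "T-Route": False}),   # simple question, noise hurts
    ([0.70, 0.50], {"direct": True, "T-Route": True}),
    ([0.10, 0.15], {"direct": False, "T-Route": True}),   # needs temporal reasoning
    ([0.05, 0.12], {"direct": False, "T-Route": True}),
]

acc_selective = accuracy(samples, choose_route)
acc_all_direct = accuracy(samples, lambda scores: "direct")
acc_all_troute = accuracy(samples, lambda scores: "T-Route")
print(acc_selective, acc_all_direct, acc_all_troute)
```

Because the outcomes are constructed by hand, the sketch shows only the mechanism by which per-sample routing can exceed both uniform strategies; it is not evidence for the reported accuracies.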
Complexity analysis of clustering: In our framework, both DBSCAN and k-means operate on the frame-level features produced by the visual encoder, and their computational cost is directly determined by the number of sampled frames N (fixed at 32 in our main experiments). Theoretical complexity: DBSCAN has an average time complexity of O(N log N) when spatial indexing is employed. In our implementation, with N=32, the actual number of distance computations is minimal. K-means has a time complexity of O(N·k·T), where k=3 is the number of clusters and T ≤ 10 is the number of iterations until convergence. Empirical validation: Under our fixed-frame sampling strategy (32 frames per video), both clustering steps complete in under 0.1 seconds, accounting for less than 5% of the total inference time. Even when scaling to 48 frames in our extended experiments (see response to Comment 1), the clustering overhead remains negligible compared to the visual encoding and LLM generation stages. Therefore, the clustering steps do not constitute a system bottleneck and scale efficiently with input size, supporting the model's deployability in resource-constrained environments. This addition can be found on Page 15 of the revised manuscript.
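As a rough worked example of these complexity terms, the sketch below counts distance computations for the stated settings (N = 32 or 48 frames, k = 3 clusters, at most T = 10 iterations). The function names are hypothetical, and the counts are back-of-the-envelope estimates that ignore constant factors and feature dimensionality; they are not measurements from the authors' implementation.

```python
import math

def dbscan_pairwise(n):
    """Unique pairwise distances for a naive DBSCAN neighbor search (no index)."""
    return n * (n - 1) // 2

def dbscan_indexed(n):
    """Order-of-magnitude count N * log2(N) for the indexed average case."""
    return n * math.log2(n)

def kmeans_ops(n, k=3, iters=10):
    """Point-to-centroid distance computations: N * k * T."""
    return n * k * iters

for n in (32, 48):
    print(n, dbscan_pairwise(n), round(dbscan_indexed(n)), kmeans_ops(n))
```

For N = 32 this gives 496 unique pairwise distances for the naive search, about 160 for the indexed estimate, and at most 960 point-to-centroid distances for k-means, all of which are small compared with a forward pass of a visual encoder or an LLM.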
Comments 4: The authors state in the discussion section that the model could work with other LLMs. However:
No experimental testing has been conducted with any other model
Results with a smaller model were not presented
Response 4: We thank the reviewer for raising this important point about validation across different language models. We appreciate this suggestion and would like to clarify the relationship between ETR and the underlying LLM, as well as outline our plans for future work.
The core function of ETR is to perform temporal enhancement for videos, including temporal segmentation, key event localization, and frame importance weighting. Importantly, these operations are applied to general semantic representations extracted from the video—they rely on visual features and semantic similarity calculations rather than the internal parameters or architecture of any specific language model. This design makes the ETR workflow stable and independent of the particular LLM being used. Regardless of whether a large language model (e.g., Flan-T5-XL) or a smaller model (e.g., T5-base) is employed, ETR can still function properly—it extracts temporal structure, segments events, and enhances frame importance in exactly the same way. The final question-answering quality is then determined by the upper bound of the capability of the underlying language model. In short: the strength of the LLM affects the final performance, but it does not affect the execution or effectiveness of ETR itself. This distinction is crucial: ETR provides a task-agnostic temporal enhancement layer that can be paired with any language model capable of understanding the generated descriptions and answering questions.
We fully agree with the reviewer that systematic evaluation across different LLMs would be valuable. However, in the current work, our primary focus was on establishing the ETR framework and demonstrating its effectiveness with a base model (Flan-T5-XL). Conducting comprehensive experiments across multiple LLMs presents practical challenges, including the substantial computational resources required and the need for careful tuning to ensure fair comparison across different model architectures. Given these constraints, we made the deliberate choice to first validate our approach thoroughly on one representative model.
We regard the reviewer's suggestion as highly important and constructive, and we have developed a comprehensive plan for future investigation:
1. Comprehensive comparison on LLMs of different scales. We will select a set of language models with different parameter sizes (e.g., small models like T5-base, medium models like Flan-T5-XL, and larger models such as LLaMA or GPT variants) and integrate them with ETR under identical experimental settings. By comparing the overall performance across models, we will clearly demonstrate how model scale influences the final results and verify that ETR consistently provides temporal enhancement across various LLMs.
2. Quantify the performance gain brought by ETR on different LLMs. For each selected LLM, we will compare the performance between using only the base LLM and using the "ETR + LLM" framework. This will allow us to quantify the actual improvement introduced by ETR and demonstrate that ETR delivers stable and consistent gains regardless of LLM scale. If the improvement magnitude remains similar across models, it would provide strong evidence that ETR's temporal enhancement works stably and reliably across different LLMs.
3. Verify generalization on more types of models. Beyond model size, we will extend our experiments to language models with different architectures and pre-training objectives, including lightweight models designed for efficiency and open-domain pre-trained models. This will help further validate the broad applicability of ETR and strengthen the practicality of the overall framework.
We hope this clarification addresses the reviewer's concern. While we cannot provide cross-LLM experiments in the current manuscript due to practical constraints, we believe our design rationale—that ETR operates on general semantic representations independently of the specific LLM—provides a solid theoretical foundation for its stable and consistent behavior across different LLMs. The future work outlined above will provide the empirical evidence to fully validate this claim. The revision can be found on Pages 31-32 of the updated manuscript.
Comments 5: Although the authors state that full proofreading has been carried out, the text still contains grammatical and stylistic errors. For example:
Subject-verb agreement errors
Errors in plural-singular forms
Repetitive and redundant phrases
Response 5: We sincerely thank the reviewer for identifying the remaining language issues in our manuscript, and we offer our sincere apologies for these oversights, which regrettably persisted despite our initial proofreading efforts. In response, we have conducted a thorough line-by-line revision of the entire manuscript, carefully checking and correcting all grammatical errors, including subject-verb agreement issues, plural/singular form inconsistencies, and redundant phrases. We paid particular attention to the specific error types mentioned by the reviewer and systematically addressed them throughout the text. All language-related improvements have been clearly marked in the revised manuscript to facilitate the review process. We believe these revisions have substantially enhanced the clarity and professionalism of our writing, and we are deeply grateful for your guidance in helping us achieve the expected language standards for publication. The revisions are shown as follows:
Subject-verb agreement errors:
- In this paper, we proposed ETR, a framework for VideoQA designed to enable precise temporal reasoning and object-centric visual-semantic alignment. In this paper, we propose ETR, a framework for VideoQA designed to enable precise temporal reasoning and object-centric visual-semantic alignment.
- Clustering-based methods are widely used for event segmentation in videos. Clustering-based methods have been widely used for event segmentation in videos.
- Finally, we obtained k key frames. Finally, we obtain k key frames.
Errors in plural-singular forms:
- However, whether at the frame or event level, most existing approaches assign similar importance to all selected frames, making it difficult to reflect their true contribution during reasoning. However, whether at the frame or event level, most existing approaches assign similar importance to all selected frames, making it difficult to reflect the true contributions during reasoning.
- We propose a hierarchical weight adjustment module alongside a question-guided fine event understanding route, enabling selective weighting of keyframes with questions intention. Furthermore, we introduce a novel object-centric description prompting mechanism that generates descriptions centered on key objects of questions. We propose a hierarchical weight adjustment module alongside a question-guided fine event understanding route, enabling selective weighting of keyframes with question intention. Furthermore, we introduce a novel object-centric description prompting mechanism that generates descriptions centered on key objects of the questions.
- Event-level textual descriptions are then generated under a object-centric prompt by Flan-T5-XL: (Equation.13) where [EP] represents the event-level description prompt: ``Describe in a few words what you see in these frame most relevant to answering:[Q] focus on key object. ``. Event-level textual descriptions are then generated under an object-centric prompt by Flan-T5-XL: (Equation.13) where [EP] represents the event-level description prompt: ``Describe in a few words what you see in these frames most relevant to answering: [Q] focus on key objects. ``.
- These are generated by Flan-T5-XL under a frame-level prompt [FP], formulated as: ``Describe in a few words what you see in these frames most relevant to answering: [Q] focus on key object. ``. These are generated by Flan-T5-XL under a frame-level prompt [FP], formulated as: ``Describe in a few words what you see in these frames most relevant to answering: [Q] focus on key objects. ``.
- Instead of performing fine temporal refinement, a representative key frame is selected from the center of each event cluster. Instead of performing fine temporal refinement, representative key frames are selected from the center of each event cluster.
- Table.2 summarizes the overall performance comparison between ETR and 23 representative approaches on the NExT-QA benchmark, including 7 methods without using a large-model and 16 large-model-based methods. Table.2 summarizes the overall performance comparison between ETR and 23 representative approaches on the NExT-QA benchmark, including 7 methods without using large models and 16 large-model-based methods.
- This framework dynamically assigns importance weights to key frames based on the objects emphasized in the question and captures multi-level visual–language interactions facilitate alignment the central subject. This framework dynamically assigns importance weights to key frames based on the objects emphasized in the question and captures multi-level visual–language interactions to facilitate alignment with the central subject.
- Perform Performance analysis in Table 10. Table 10 presents the performance analysis.
- Each case includes the original question–answer pair with candidate options, the detected key objects, the selected key frames (with higher weights assigned to the latter two frames), and the generated descriptions align the critical objects. Each case includes the original question–answer pair with candidate options, the detected key objects, the selected key frames (with higher weights assigned to the latter two frames), and the generated descriptions that align with the critical objects.
- The red boxes highlight frames ours focuses. The red boxes highlight frames ours focuses on.
Repetitive and redundant phrases:
- This synergistic design ultimately promotes more accurate event-centric temporal reasoning and supports robust, precise temporal reasoning in VideoQA. This synergistic design ultimately promotes more accurate event-centric temporal reasoning in VideoQA.
- This module unifies multiple complementary linguistic cues to stably extract question-centric objects across diverse syntactic structures, thereby facilitating alignment and interpretable vision-language alignment. This module unifies multiple complementary linguistic cues to stably extract question-centric objects across diverse syntactic structures, thereby facilitating interpretable vision-language alignment.
Other modifications:
- Video Question Answering (VideoQA) requires deep understanding of dynamic video content, integrating spatial reasoning, temporal dependencies, and language comprehension. Existing methods often struggle in long or semantically complex videos due to the lack of question-guided keyframe weight adjustment and the absence of question-aligned cross-modal description generation. Video Question Answering (VideoQA) requires a deep understanding of dynamic video content, integrating spatial reasoning, temporal dependencies, and language comprehension. Existing methods often struggle with long or semantically complex videos due to the lack of question-guided keyframe weight adjustment and the absence of question-aligned cross-modal description generation.
- The red boxes highlight frames which models focus. The red boxes highlight frames which the models focus on.
- As illustrated in Fig.1, even when key frames are correctly identified and semantically relevant descriptions are generated, nearly uniform weighting prevents the suppression of redundant or weakly relevant information, ultimately impairing final prediction accuracy. As illustrated in Fig.1, even when key frames are correctly identified and semantically relevant descriptions are generated, nearly uniform weighting prevents the suppression of redundant or weakly relevant information, ultimately impairing the final prediction accuracy.
- While question-guided or program-driven captioning approaches introduce partial flexibility, most existing methods still adopt a one-size-fits-all generation strategy, which prevents fine questions targeting specific objects (e.g., ``What is the boy wearing yellow doing?'') from facilitating alignment the truly critical visual entities. While question-guided or program-driven captioning approaches introduce partial flexibility, most existing methods still adopt a one-size-fits-all generation strategy, which prevents fine questions targeting specific objects (e.g., ``What is the boy wearing yellow doing?'') from facilitating alignment with truly critical visual entities.
- To address the above research questions, we propose ETR— a Event-centric Temporal Reasoning framework designed to enable precise temporal reasoning and object-centric visual-semantic alignment for VideoQA. To address the above research questions, we propose ETR — an Event-centric Temporal Reasoning framework designed to enable precise temporal reasoning and object-centric visual-semantic alignment for VideoQA.
- For RQ2, ETR incorporates a novel question-guided fine temporal reasoning route which includes two-stage clustering strategy (T-Route), in which videos are adaptively segmented into events and aligned with question semantics to identify the most relevant key events. For RQ2, ETR incorporates a novel question-guided fine temporal reasoning route which includes a two-stage clustering strategy (T-Route), through which videos are adaptively segmented into events and aligned with question semantics to identify the most relevant key events.
- By incorporating object cues extracted from the question into the prompt construction process, FEU guides large language models to generate object-centric descriptions that are more precisely aligned with the underlying question semantics. Through this unified and adaptive design, FEU enables event-centric temporal reasoning and object-centric semantic grounding, providing an efficient and semantically robust solution for VideoQA. By incorporating object cues extracted from the question into the prompt construction process, ETR guides large language models to generate object-centric descriptions that are more precisely aligned with the underlying question semantics. Through this unified and adaptive design, ETR enables event-centric temporal reasoning and object-centric semantic grounding, providing an efficient and semantically robust solution for VideoQA.
- Overview of ETR framework. Overview of the ETR framework.
- The initial processing stage identifies the preliminary keyframes based on the question Q, effectively mitigating the risk of visual information overload. The initial processing stage identifies preliminary keyframes based on the question Q, effectively mitigating the risk of visual information overload.
- The top-1 event with the highest scores is selected as key event, and we denote the selected event as E. The top-1 event with the highest scores is selected as the key event, and we denote the selected event as E.
- To thoroughly evaluate the effectiveness and generalization of our approach, we conduct experiments on two widely used VideoQA benchmarks: NExT-QA and STAR, which emphasize complementary reasoning capabilities. To thoroughly evaluate the effectiveness and generalizability of our approach, we conduct experiments on two widely used VideoQA benchmarks: NExT-QA and STAR, which emphasize complementary reasoning capabilities.
We would like to thank you once again for your thorough and insightful review. Your comments have been invaluable in helping us improve the clarity, rigor, and completeness of our manuscript. We have carefully addressed all points raised and believe the revised version is substantially stronger as a result. We hope that our responses and modifications meet with your approval and look forward to your further consideration.
Sincerely,
The Authors
Reviewer 3 Report
Comments and Suggestions for Authors
No other concerns.
Author Response
Dear Reviewer,
Thank you very much for your valuable comments during the first round of review, which were instrumental in helping us improve the quality of our manuscript.
We also sincerely appreciate your positive assessment in this round and are glad to know that you have no further concerns about our manuscript. We are pleased that the revised version has met your expectations.
In this round of revision, we have also addressed the additional comments from the other reviewer. We believe these revisions have further strengthened the manuscript while preserving the parts you have already approved.
Thank you again for your time and constructive feedback throughout the review process.
Sincerely,
The Authors
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have adequately responded to the review comments. I have no further comments.

