Accelerating Inference in Retrieval-Augmented Generation Models for Long-Form Question Answering via Dynamic Token Pruning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper is mostly fine, but there are two main issues that make it not ready for publication.
The first issue is the presentation of the proposed method. Sections 3.1 to 3.3 are not clear enough and should be re-organised and/or rewritten. It is unclear, for example, where equation 6 is used. Equation 9 is unclear and was not explained. Equation 10 is disconnected from the previous equations. Section 3.3 refers to equations 7, 9 and 10, giving the impression that they are used in both layers. If that is not the case, then equations 7, 9 and 10 should be presented in this section -- not in 3.2 -- for clarity.
The equations of subsection 3.4.3 should be re-organised to match the explanations in the text. For example, when explaining L_total, show equation (15) before equation (16); i.e., proceed from general to specific (from a high-level perspective to finer details).
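To make the suggestion concrete, a hypothetical general-to-specific ordering could look as follows; the loss terms and symbols here are illustrative placeholders, not the paper's actual equations:

```latex
% Hypothetical illustration only -- these are not the paper's equations.
% General first: the overall objective, shown as eq. (15).
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{gen}} + \lambda \, \mathcal{L}_{\text{prune}}
% Then the specific components it is built from, e.g. eq. (16).
\mathcal{L}_{\text{gen}} = -\sum_{t} \log p_{\theta}\!\left(y_t \mid y_{<t}, x\right)
```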
There are too many components in this paper, and it is hard to know which ones have a positive or negative impact. For example, "weight initialization" (subsection 3.4.4) was not evaluated systematically, so its effect on the results is unknown.
The second major issue is that the paper does not include enough baselines. Some of the references [12] to [16] could have been included to make the presented results more convincing.
Minor issues:
Explain a few more things about the datasets, such as the number and types of questions.
Please include the BERTScore metric (or equivalent) in your future experiments.
Author Response
We have attached our reply to your review feedback, together with the updated details. Thank you.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper presents the proposed method for token pruning very clearly.
I have some questions:
- Datasets are split into train/dev/test subsets, and results are usually reported on the test set. Why did you report the results on the dev set?
The scope of the experiments and the reported results are too modest. I suggest:
- It is often beneficial to separately measure false positives (incorrectly predicting an answer), and false negatives (failing to predict an answer).
- Add inference speed in tokens per second (the number of tokens processed per second) to evaluate computational throughput.
- Add memory usage in GB (the peak GPU memory used during inference); this reflects resource consumption. A sketch of these measurements, including the FP/FN counts, follows this list.
- Add experiments with a QA system that is published in the journal literature (with released source code) to position your system among state-of-the-art systems.
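A minimal sketch of the suggested measurements, assuming a PyTorch model with a Hugging Face-style generate() interface; the function names, the 128-token generation cap, and the boolean answer/no-answer representation are illustrative assumptions, not the authors' setup:

```python
import time
import torch

def measure_inference(model, tokenizer, questions, device="cuda"):
    """Report tokens/sec, peak GPU memory (GB), and absolute latency."""
    model.eval()
    torch.cuda.reset_peak_memory_stats(device)
    total_tokens = 0
    start = time.perf_counter()
    with torch.no_grad():
        for q in questions:
            inputs = tokenizer(q, return_tensors="pt").to(device)
            out = model.generate(**inputs, max_new_tokens=128)
            total_tokens += out.shape[-1]          # generated length
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    return {
        "tokens_per_sec": total_tokens / elapsed,  # throughput
        "peak_mem_gb": torch.cuda.max_memory_allocated(device) / 1024**3,
        "sec_per_query": elapsed / len(questions), # absolute latency
    }

def answer_error_counts(pred_has_answer, gold_has_answer):
    """Separate false positives (answering an unanswerable question)
    from false negatives (failing to answer an answerable one)."""
    fp = sum(p and not g for p, g in zip(pred_has_answer, gold_has_answer))
    fn = sum(g and not p for p, g in zip(pred_has_answer, gold_has_answer))
    return fp, fn
```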
Author Response
We have attached our reply to your review feedback, together with the updated details. Thank you.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes a dynamic token pruning method to accelerate the inference of retrieval-augmented generation (FiD) models on long-form question answering tasks. After analyzing the excessive computational cost of the decoder's cross-attention when processing multiple retrieved passages, the authors design a mechanism that dynamically identifies and removes tokens that contribute little to answer generation during the encoding stage. Experiments show that the method achieves up to a 1.78x inference speedup on the CLAPNQ and ASQA long-form QA datasets with only a slight degradation in answer quality, demonstrating that it significantly improves computational efficiency while maintaining generation quality.
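The summary above describes the core mechanism only at a high level; a hedged sketch of attention-based token pruning in this spirit (an illustration of the general idea, not the authors' exact scoring criterion or retention rate) might look like:

```python
import torch

def prune_tokens(hidden, attn, keep_ratio=0.5):
    """Keep the encoder tokens that receive the most attention mass.

    hidden: (batch, seq_len, dim) encoder states
    attn:   (batch, heads, q_len, seq_len) attention weights
    """
    # Importance of each token = attention it receives, averaged over
    # heads and summed over query positions.
    scores = attn.mean(dim=1).sum(dim=1)                 # (batch, seq_len)
    k = max(1, int(keep_ratio * hidden.size(1)))
    keep = scores.topk(k, dim=-1).indices
    keep = keep.sort(dim=-1).values                      # preserve order
    idx = keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    return hidden.gather(1, idx)                         # (batch, k, dim)
```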
- The font size of the section title "1.4.2.1" is inconsistent.
- Equations 3 and 4 lack an equals sign to properly connect the expressions.
- The color gradient in Figure 2 is difficult to distinguish, making it hard to interpret. Please revise it.
- Do existing methods like Transkimmer (token pruning based on hidden states) and LTP (token pruning based on attention thresholds) already partially cover your approach?
- The paper primarily uses speedup factors (1.78x) as the key metric but does not provide comparisons of absolute inference time or GPU memory usage. What is the actual inference time of the pruned model under the same hardware conditions?
- The paper mentions only a 0.1%-0.2% drop in F1/ROUGE-L scores but does not analyze the types of errors introduced. Could pruning lead to factual inaccuracies (e.g., incorrect entities or relations)?
- The paper employs exponential decay to adjust the dynamic target retention rate but does not clearly describe the decay schedule's design (e.g., β or step allocation). Would linear decay or alternative rates (slower/faster exponential) significantly affect answer quality (F1/ROUGE-L) or inference speed (TPQ)? See the schedule sketch below.
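To illustrate the contrast being asked about, here is a small sketch of the two schedules; r_start, r_end, and beta are placeholder values, not the paper's:

```python
import math

def retention_exponential(step, total_steps, r_start=1.0, r_end=0.5, beta=5.0):
    # Drops quickly early and flattens out; beta sets the decay speed.
    frac = min(step / total_steps, 1.0)
    return r_end + (r_start - r_end) * math.exp(-beta * frac)

def retention_linear(step, total_steps, r_start=1.0, r_end=0.5):
    # Decreases at a constant rate from r_start to r_end.
    frac = min(step / total_steps, 1.0)
    return r_start + (r_end - r_start) * frac
```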
Author Response
We have attached our reply to your review feedback, together with the updated details. Thank you.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have done a good job with the revisions in a short amount of time, which is commendable.
Reviewer 2 Report
Comments and Suggestions for Authors
All my concerns were satisfactorily addressed, and the paper was improved.