Screening Smarter, Not Harder: Budget Allocation Strategies for Technology-Assisted Reviews (TARs) in Empirical Medicine
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors intend to design budget allocation strategies for technology-assisted reviews in empirical medicine. However, there are too many shortcomings and problems in the paper, some of which are as follows:
- Most of the reference documents are too old and do not represent the latest developments in the issues under study.
- The author's writing level is not high and cannot accurately and clearly describe the complete process of the method used.
- The academic level of the authors needs to be improved, and no new method with theoretical value and reference value has been given.
- The authors did not apply the latest advances in machine learning and artificial intelligence to this paper.
- Labels for some formulas are not given.
- Some symbolic meanings are not given accurately.
- When describing algorithms and models, they are not given in a general form.
- During simulation, there is no comparison with existing methods to prove the superiority of the designed method.
- Simulation experiments cannot prove that the authors have made valuable innovative work, and the scientific rigor and correctness of the experiments cannot be guaranteed.
Author Response
We thank the reviewer for the evaluation of our manuscript. However, we believe that several of the concerns raised are either not applicable to the nature of our work or lack sufficient specificity for us to respond meaningfully. Below we address each point to the best of our ability:
- Most references are too old:
While our paper does cite foundational work in technology-assisted review (TAR), we also reference and analyze recent research, including benchmarks and metrics developed between 2017 and 2024. We would welcome suggestions of more recent or particularly relevant references that the reviewer believes are missing.
However, we have added a recent reference [3] from the ALTARS 2024 workshop, which is specifically dedicated to the most recent state of the art on TAR systems.
In addition, to the best of our knowledge, budget allocation in TAR systems is a new topic that has not yet been investigated thoroughly (compared to stopping strategies and recall optimisation).
Following the reviewer's suggestion, we have also added references covering the state of the art on budget allocation in information retrieval [16-20].
- The writing is not clear:
We have revised the parts of the manuscript where we felt clarity and precision could be improved. Nonetheless, we are happy to further revise any passages the reviewer found unclear; specific examples would be appreciated.
- No new method with theoretical value:
Our paper does not propose a new ML model or algorithm, but rather focuses on evaluating TAR systems under budget allocation constraints using novel metrics. These measures provide new ways to assess cost-effectiveness and are valuable for practical decision-making.
- Lack of ML/AI methods:
Our work analyzes the performance of existing systems from CLEF eHealth benchmarks, which include various ML-based TAR systems. The goal is to provide tools and metrics to better understand these systems, not to propose a new ML model.
- Missing labels or unclear symbols:
All formulas have been numbered and as far as we can see there are no missing labels.
Regarding symbols:
- in Eq. (1), T, B, and B_i are defined in the preceding paragraph;
- in Eq. (2), D_i is defined in the preceding paragraph;
- Eq. (3) uses the same symbols as Eqs. (1) and (2);
- in Eq. (4), tau and B_remaining are defined in the preceding line;
- in Eq. (5), k, TP_k, and P_total are defined in the line below;
- in Eq. (6), c is defined in the line below;
- Eq. (7) uses symbols that are already defined;
- in Eq. (8), g is defined in the line below.
If the reviewer could indicate which specific symbols or equations are problematic, we would be glad to correct them.
However, the reviewer is right that some table labels in the appendix were duplicated, and we have fixed those. Thank you very much for spotting it.
- No general meaning in algorithm descriptions:
As we do not propose new algorithms but evaluate existing runs, algorithmic descriptions are not the core focus. Nonetheless, we have aimed to generalize the evaluation methodology and clearly state assumptions.
- No comparison with existing methods in simulation:
Our evaluation compares multiple systems submitted to CLEF TAR tasks across three years using traditional and new metrics. These comparisons are explicitly detailed in the experimental section.
- Experiments are not valuable:
Our contribution lies in enhancing the accessibility and interpretability of TAR system evaluations through reproducibility, new cost-aware metrics, and an open, interactive platform. We respectfully suggest that this is a valuable contribution to the community, even if it does not introduce a new learning algorithm.
We hope this response clarifies the goals and contributions of our work. We are open to revising the manuscript based on more specific guidance.
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
Thank you for the opportunity to review your manuscript. The area that you sought to explore is interesting and relevant. Overall, the manuscript looks good. Please find my comments below, which I hope will help improve the clarity and quality of the work.
In your research you selected four budget allocation strategies (even, proportional, inverse proportional, and threshold-capped greedy) to explore how review effort can be efficiently distributed across topics. Could you please provide a justification for why you selected these four? Have you considered adaptive and performance-based allocation strategies?
You mentioned that "We observed that our replicated evaluation closely matched the original scores, with differences limited to a maximum of 2–3% in only a few isolated cases" and "we could confidently proceed to study the impact of budget allocation strategies, ensuring that any observed differences in performance are due to the evaluation framework rather than inconsistencies in the data or original runs." Could you please clarify how you analysed the variation and what you mean by the 'isolated' cases? Also, how do you ensure the differences in performance are due to the evaluation framework and not to the data?
Author Response
We thank the reviewer for their constructive comments and thoughtful suggestions, which are much appreciated. We address each point below:
- Justification of budget allocation strategies:
We selected the four strategies (even, proportional, inverse proportional, and threshold-capped greedy) to reflect a range of simple, interpretable policies with different assumptions about topic relevance distribution. These strategies provide a useful contrast between uniform effort (even), topic-prior-based effort (proportional/inverse), and a greedy baseline driven by estimated gains.
We agree that adaptive or performance-based allocation strategies (e.g., multi-armed bandit) are promising directions. We now mention this in the conclusion as an area for future work and plan to explore them in follow-up studies.
At the current stage, however, our primary goal is to establish a shared, reproducible evaluation framework for budget-aware TAR experiments. We believe that providing a transparent and well-documented foundation, based on simple yet meaningful allocation policies, will help foster community consensus on the problem setting. This foundation is important for the fair comparison and subsequent development of more sophisticated adaptive strategies in future work.
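To make the contrast between the four policies concrete, the following is a minimal sketch of how such allocations could be computed from a total budget B and per-topic pool sizes; the function names, the gain estimates, and the greedy cap are illustrative assumptions, not the exact implementation in the paper.

```python
from typing import Dict

def even_allocation(topics: Dict[str, int], B: int) -> Dict[str, int]:
    """Split the total budget B equally across topics."""
    share = B // len(topics)
    return {t: share for t in topics}

def proportional_allocation(topics: Dict[str, int], B: int) -> Dict[str, int]:
    """Allocate in proportion to each topic's candidate pool size."""
    total = sum(topics.values())
    return {t: int(B * n / total) for t, n in topics.items()}

def inverse_proportional_allocation(topics: Dict[str, int], B: int) -> Dict[str, int]:
    """Allocate more budget to smaller topics (inverse of pool size)."""
    weights = {t: 1.0 / n for t, n in topics.items()}
    total_w = sum(weights.values())
    return {t: int(B * w / total_w) for t, w in weights.items()}

def threshold_capped_greedy(gains: Dict[str, float], topics: Dict[str, int],
                            B: int, cap: int) -> Dict[str, int]:
    """Greedily fund topics with the highest estimated gain, never exceeding
    `cap` documents per topic, until the budget runs out."""
    alloc = {t: 0 for t in topics}
    remaining = B
    for t in sorted(gains, key=gains.get, reverse=True):
        spend = min(cap, topics[t], remaining)
        alloc[t] = spend
        remaining -= spend
        if remaining == 0:
            break
    return alloc

# toy example: three topics with candidate pools of different sizes
pools = {"CD001": 2000, "CD002": 500, "CD003": 1500}
print(even_allocation(pools, 900))           # {'CD001': 300, 'CD002': 300, 'CD003': 300}
print(proportional_allocation(pools, 900))   # roughly 450 / 112 / 337
```

The greedy variant is capped so that no single topic can absorb the entire budget, which mirrors the threshold-capped behaviour described above.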
- Clarification on replicated evaluation and isolated differences:
The term 'isolated cases' refers to specific topic-system combinations where the re-scored values differed slightly (by 2–3%) from the official CLEF 2017–2019 scores. These cases accounted for less than 10% of all topic-system pairs and all of them occurred in the Work Saved over Sampling (WSS) metric. This is due to the behavior of trec_eval, which applies a truncation mechanism when a run fails to retrieve all relevant documents, leading to slight variations in WSS calculation.
We computed the absolute and relative differences between our results and the official ones for each system-topic-year tuple and confirmed that the variation was minor and did not systematically favor any system, fairness across systems being the main goal of the CLEF evaluation.
We have now clarified this explanation in the text.
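For reference, the small WSS deviations discussed above can be traced to the recall-dependent part of the metric. Assuming the commonly used Cohen-style definition of Work Saved over Sampling at a target recall $R$ (a reference formulation only, not a contribution of the paper):

$$\mathrm{WSS@}R = \frac{TN + FN}{N} - (1 - R) = \frac{N - k_R}{N} - (1 - R),$$

where $N$ is the size of the candidate pool and $k_R$ is the rank at which recall $R$ is reached, so that $TN + FN$ counts the documents left unscreened. When a run never reaches the target recall, the cutoff $k_R$ depends on how the run is truncated, which is where the small differences with respect to the official scores can arise.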
- Ensuring that differences arise from the evaluation framework and not from the data:
Since our re-evaluation was conducted on the official released data (topics, qrels, and system runs), and the preprocessing steps were documented and minimal (e.g., removing duplicate entries, correcting encoding issues), we are confident that the source data matches the original. The only introduced variable is the evaluation framework—i.e., our cost-aware metrics and budget allocation strategies. Therefore, any performance variation observed in these new experiments is a result of applying these evaluation settings, not changes in the data itself.
We thank the reviewer again for their positive assessment and helpful remarks, which have helped us improve the clarity of the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for the opportunity to review this paper. This is an interesting paper on an important topic and a worthwhile contribution to the literature. I do not have any suggestions for improvement and deem it suitable for publication in its current form.
Author Response
We thank the reviewer for taking the time to evaluate our manuscript.
We are glad about the positive review.
Reviewer 4 Report
Comments and Suggestions for Authors
Dear Authors,
Based on my best understanding, your paper explores how to evaluate TAR systems under realistic resource constraints, specifically through budget allocation strategies and new cost-sensitive metrics. It's a valuable contribution, especially given the practical challenges of systematic reviews in medicine. The empirical analysis using CLEF eHealth data is thorough and provides a solid foundation for future work in cost-aware evaluation. That said, there are a few areas where the paper could be strengthened, which I recommend you consider:
- The use of a fixed cost per document makes the analysis clean, but may oversimplify real-world review settings where screening effort can vary. Please discuss how the metrics would behave under variable costs, or include a small simulation to illustrate this.
- The allocation strategies explored are easy to interpret, but not adaptive to dynamic evidence streams. Consider mentioning how adaptive or learning-based strategies could be integrated or explored in future work.
- While the paper discusses human burden, it doesn’t simulate or account for reviewer behaviour such as fatigue or early stopping. You might include a brief simulation or scenario-based reflection on how reviewer profiles could affect performance outcomes.
- The UG@B metric assumes fixed gain and cost values, which may not generalize to domains where false positives and false negatives have different impacts. Please consider discussing how UG@B could be tuned or extended to reflect domain-specific cost-benefit trade-offs.
- Stopping strategies are acknowledged but not addressed, despite being closely tied to resource usage. It would be useful to briefly discuss how stopping rules might complement budget allocation strategies, or at least highlight it more explicitly in future work.
- Although the paper touches on human-in-the-loop factors, it lacks empirical validation involving actual users or behavioural models. Please consider discussing how human decision variability, risk tolerance, or task-switching might affect the utility of different allocation strategies in practice.
- The study mentions bandit-based or learning-based methods in related work but does not implement or simulate any adaptive allocation baseline. A simple baseline using, for example, contextual bandits or dynamic feedback could make the comparative analysis more robust.
Author Response
We thank the reviewer for the thoughtful and constructive feedback. We appreciate the recognition of our work as a valuable contribution to the evaluation of TAR systems under realistic constraints and agree with the relevance of the points raised. Due to the very short timeline for this final revision, we were unfortunately unable to implement new simulations or empirical extensions.
Nonetheless, we have carefully revised the manuscript to integrate several of the reviewer's suggestions as clarifications and future directions, and we have also sketched the formal extension of the formulae to the case of variable costs.
We have added an entirely new section, Section 5, that discusses these matters.
Specifically:
Variable Screening Costs: We agree that the assumption of a fixed cost per document is a simplification. We have added a discussion in Section 5 to clarify how our proposed metrics (RFCU@k and UG@B) could be adapted to handle variable costs, for example by assigning cost weights to different documents based on difficulty or annotation effort (a brief formal sketch is given after this list).
Reviewer Fatigue and Behaviour: While there was not enough time to implement simulations of human behavior, we have included a paragraph in Section 5 discussing how factors such as fatigue, decision variability, and risk tolerance may influence the effectiveness of different allocation strategies. We believe these elements could be incorporated in future user-centric simulations or hybrid human-in-the-loop evaluations.
Adaptive Strategies: We now highlight in the conclusion and future work sections that adaptive allocation strategies, such as contextual bandits or performance-based learning approaches, represent a promising and necessary area for extension. We note that such methods would benefit from a stable, shared evaluation framework like the one we propose.
Stopping Strategies: We now explicitly clarify in the conclusion how stopping rules are complementary to budget allocation strategies. While not the focus of this paper, we agree that their integration into cost-sensitive TAR evaluation is an important direction for future work.
User Validation and Behavioural Models: We now acknowledge that the current work does not include empirical user studies, and we note in the discussion that experimental protocols involving human annotators or simulated user models would be a valuable addition to understanding the real-world utility of TAR systems under budget constraints.
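Returning to the variable screening costs point above, the following is a purely illustrative sketch (not the paper's exact formulation): the constant per-document cost $c$ is replaced by document-specific weights $c_i$, so that budget consumption becomes a weighted sum,

$$c \cdot k \;\longrightarrow\; \sum_{i=1}^{k} c_i, \qquad \text{subject to} \quad \sum_{i=1}^{k} c_i \le B,$$

where $c_i$ captures the screening effort for the $i$-th document reviewed (e.g., abstract length or annotation difficulty) and $B$ is the total budget. Budget-aware metrics such as RFCU@k and UG@B would then count weighted effort rather than raw document counts.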
We thank the reviewer once again for the valuable comments and suggestions for future directions (we are currently organizing a Dagstuhl workshop on TAR systems and these suggestions are of great value as food for thought for this workshop).
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors did not improve the paper well or respond point by point, and the quality of the paper did not improve. Some remaining shortcomings are as follows:
- The author's views are not new.
- The theory used by the author is so simple that there is no need to publish a paper.
- The summary work done by the author is also of no reference value.
- The practicability, superiority and effectiveness of the method given by the author cannot be proved.
- The author still doesn't compare it with the existing methods.
- The writing level and academic level of the paper are very low and there is no value for publication.
Author Response
We appreciate the time the reviewer has devoted to our submission. While we respectfully disagree with several of the assessments, we welcome the opportunity to clarify our objectives and the scope of our contribution.
- The author's views are not new.
Our work builds on established insights in the field of Technology-Assisted Review (TAR) but introduces a novel perspective: the explicit analysis of budget allocation strategies across topics in multi-topic TAR evaluations. While the concepts of cost and budget constraints have been discussed before, prior studies typically apply such constraints within single-topic or adaptive settings. Our contribution is to operationalize fixed budget allocation as a comparative lens across TAR systems in CLEF eHealth evaluations, which we believe offers a new and valuable framing for system comparison under realistic constraints.
- The theory used by the author is so simple that there is no need to publish a paper.
We intentionally chose simple, interpretable allocation strategies to isolate and analyze the effect of budget distribution on evaluation outcomes. As in many evaluation studies, the goal is not to introduce complex algorithms but to establish a transparent framework for comparing existing methods under shared constraints. We argue that conceptual simplicity in this context is a strength: it allows the community to better understand how evaluation metrics respond to allocation decisions and opens the door to more nuanced, adaptive strategies in future work.
- The summary work done by the author is also of no reference value.
Our re-analysis of CLEF eHealth TAR runs is designed to be a reference point for future evaluations involving cost-sensitive settings. By applying a consistent budget constraint across previously submitted systems, we provide insights into how different methods perform when effort is standardized, a condition often lacking in current evaluations. We also introduce new metrics such as RFCU@k and UG@B to encourage discussion around utility-driven evaluation frameworks.
- The practicability, superiority and effectiveness of the method given by the author cannot be proved.
We do not claim that our allocation strategies are superior in a practical deployment setting. Rather, our aim is to analyze how evaluation outcomes vary under different plausible strategies, particularly in scenarios where total annotation effort must be fixed. We are transparent in the paper about the limitations of these strategies and explicitly position our work as a first step toward more adaptive and behaviorally informed allocation approaches.
- The author still doesn't compare it with the existing methods.
Our study focuses on evaluating existing TAR systems from CLEF eHealth under a new budget allocation framework, rather than introducing a new retrieval method to be compared. That said, we do compare across multiple allocation strategies and across systems previously submitted to CLEF. We also position our work relative to adaptive and bandit-based strategies in the related work and conclusion sections, and outline these as directions for future comparative analysis.
- The writing level and academic level of the paper are very low and there is no value for publication.
We are sorry to hear that the writing did not meet the reviewer’s expectations. The manuscript has been revised extensively for clarity and academic tone, including the use of proofreading tools and external review. We welcome concrete suggestions for improvement in this area. As for the academic value, we believe that establishing a reproducible framework for budget-constrained evaluation is a necessary step to evolve TAR research from purely effectiveness-driven to cost-aware and human-centric paradigms, which are increasingly relevant in empirical medicine and systematic reviews.
Reviewer 4 Report
Comments and Suggestions for Authors
Dear Authors,
The updated version is clearer and more detailed; however, a few areas still merit further development:
- While you acknowledge the limitations of using static allocation strategies, the absence of even a basic adaptive baseline remains a missed opportunity.
- Similarly, the mention of reviewer behaviour variability and stopping strategies is helpful, but the paper would benefit from a more concrete treatment of how such human-in-the-loop factors could influence performance and utility.
- Your clarification of UG@B's flexibility is good, though demonstrating the effect of tuning gain/loss ratios in a small example would further strengthen your argument for metric generalizability.
Author Response
- While you acknowledge the limitations of using static allocation strategies, the absence of even a basic adaptive baseline remains a missed opportunity.
[Answer] Thank you again for pushing us to do this part. In the last month, thanks to your suggestions, we have started to explore simple yet very interesting multi-armed bandit strategies that serve as a baseline for comparison.
We have added the description of this approach in Section 4.1 and the results in Section 4.4.1 (and expanded Figures 1, 2, and 3).
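For readers who want the general flavour of such a baseline, the following is a minimal sketch of an epsilon-greedy bandit allocator in which arms correspond to topics and the reward is the number of relevant documents found in each screened batch; the batch size, epsilon value, and simulation are illustrative assumptions and not the implementation reported in Section 4.1.

```python
import random
from typing import Callable, Dict, List

def epsilon_greedy_allocation(topics: List[str],
                              screen_batch: Callable[[str, int], int],
                              total_budget: int, batch_size: int = 25,
                              epsilon: float = 0.1, seed: int = 42) -> Dict[str, int]:
    """Spend the budget in small batches; with probability epsilon explore a random
    topic, otherwise exploit the topic with the best average reward so far.
    `screen_batch(topic, n)` must return the number of relevant documents found
    when screening the next n documents of that topic (simulated from qrels)."""
    rng = random.Random(seed)
    pulls = {t: 0 for t in topics}      # batches screened per topic
    rewards = {t: 0.0 for t in topics}  # relevant documents found per topic
    spent = {t: 0 for t in topics}      # documents screened per topic
    budget = total_budget

    while budget >= batch_size:
        if rng.random() < epsilon or all(n == 0 for n in pulls.values()):
            topic = rng.choice(topics)  # explore
        else:
            topic = max(topics, key=lambda t: rewards[t] / max(pulls[t], 1))  # exploit
        found = screen_batch(topic, batch_size)
        pulls[topic] += 1
        rewards[topic] += found
        spent[topic] += batch_size
        budget -= batch_size
    return spent

# toy usage: simulate topics with different densities of relevant documents
densities = {"CD001": 0.05, "CD002": 0.20, "CD003": 0.01}
sim = lambda topic, n: sum(random.random() < densities[topic] for _ in range(n))
print(epsilon_greedy_allocation(list(densities), sim, total_budget=1000))
```

An epsilon-greedy rule is about the simplest bandit policy available; UCB-style or Thompson-sampling variants plug into the same loop by changing only the topic-selection line.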
- Similarly, the mention of reviewer behaviour variability and stopping strategies is helpful, but the paper would benefit from a more concrete treatment of how such human-in-the-loop factors could influence performance and utility.
[Answer] We thank the reviewer for this valuable suggestion. We agree that human-in-the-loop variability and stopping strategies are important factors that could significantly affect the observed utility of TAR policies. While modeling such aspects in full detail is beyond the scope of this paper, we have now expanded the discussion to provide a more concrete treatment of how reviewer behavior (e.g., annotation variability, fatigue, or early stopping decisions) could interact with allocation policies and evaluation metrics. We expanded the discussion in the conclusion section to address these issues.
- Your clarification of UG@B's flexibility is good, though demonstrating the effect of tuning gain/loss ratios in a small example would further strengthen your argument for metric generalizability.
[Answer] Thank you for suggesting this part. We have added Section 4.6 with an analysis of how the evaluation of systems under the UG@B metric changes when the cost/gain ratio is varied. This also gave us the opportunity to reflect on what the budget B can represent in terms of resources (e.g., the time reviewers can spend rather than the number of documents to review).
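As a toy illustration of this kind of sensitivity analysis, the sketch below assumes a simple linear utility of the form g * relevant_found - c * screened; this is a hypothetical stand-in rather than the exact UG@B definition, and the two "systems" are invented numbers.

```python
def utility(relevant_found: int, screened: int, g: float, c: float) -> float:
    """Hypothetical linear gain/cost trade-off (illustrative, not the paper's UG@B)."""
    return g * relevant_found - c * screened

# two invented systems: A screens fewer documents, B finds more relevant documents
systems = {"A": (40, 600), "B": (55, 1000)}  # (relevant_found, documents_screened)

for ratio in (1, 10, 50):  # gain-to-cost ratio g:c, with c fixed at 1
    best = max(systems, key=lambda s: utility(*systems[s], g=float(ratio), c=1.0))
    print(f"g/c = {ratio:>2}: preferred system = {best}")
```

With a low gain-to-cost ratio the cheaper run is preferred; once finding relevant documents is weighted heavily enough, the more exhaustive run wins, which is the kind of crossover a gain/cost sensitivity analysis can expose.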

