Article

Screening Smarter, Not Harder: Budget Allocation Strategies for Technology-Assisted Reviews (TARs) in Empirical Medicine

by
Giorgio Maria Di Nunzio
Department of Information Engineering, University of Padova, 35131 Padova, Italy
Mach. Learn. Knowl. Extr. 2025, 7(3), 104; https://doi.org/10.3390/make7030104
Submission received: 2 May 2025 / Revised: 11 September 2025 / Accepted: 16 September 2025 / Published: 20 September 2025

Abstract

In the technology-assisted review (TAR) area, most research has focused on ranking effectiveness and active learning strategies within individual topics, often assuming unconstrained review effort. However, real-world applications such as legal discovery or medical systematic reviews are frequently subject to global screening budgets. In this paper, we revisit the CLEF eHealth TAR shared tasks (2017–2019) through the lens of budget-aware evaluation. We first reproduce and verify the official participant results, organizing them into a unified dataset for comparative analysis. Then, we introduce and assess four intuitive budget allocation strategies—even, proportional, inverse proportional, and threshold-capped greedy—to explore how review effort can be efficiently distributed across topics. To evaluate systems under resource constraints, we propose two cost-aware metrics: relevant found per cost unit (RFCU) and utility gain at budget (UG@B). These complement traditional recall by explicitly modeling efficiency and trade-offs between true and false positives. Our results show that different allocation strategies optimize different metrics: even and inverse proportional allocation favor recall, while proportional and capped strategies better maximize RFCU. UG@B remains relatively stable across strategies, reflecting its balanced formulation. A correlation analysis reveals that RFCU and UG@B offer distinct perspectives from recall, with varying alignment across years. Together, these findings underscore the importance of aligning evaluation metrics and allocation strategies with screening goals. We release all data and code to support reproducibility and future research on cost-sensitive TAR.


1. Introduction

Systematic reviews are a cornerstone of evidence-based decision making across a range of disciplines. Unlike traditional literature reviews, systematic reviews follow a rigorous methodology to identify, evaluate, and synthesize all available evidence relevant to a specific research question. This process typically involves a comprehensive search of the literature, predefined inclusion and exclusion criteria, and systematic data extraction and analysis. While essential in building cumulative knowledge, systematic reviews are also time-consuming and labor-intensive, often requiring months of work by domain experts [1].
In medicine, systematic reviews play a critical role in synthesizing clinical evidence to inform guidelines, policy decisions, and individual patient care. These reviews are particularly demanding due to the volume and heterogeneity of biomedical literature, the need for high precision and recall, and the importance of reproducibility and transparency [2]. Given these challenges, technology-assisted review (TAR) systems have emerged as crucial tools in the healthcare domain, particularly in supporting the growing demand for timely and comprehensive systematic reviews. As the volume of biomedical literature continues to rise, traditional manual review processes are becoming increasingly insufficient to ensure that healthcare decisions are informed by the best available evidence. TAR systems leverage intelligent algorithms—ranging from classical machine learning to neural ranking models and active learning frameworks—to enhance the efficiency, accuracy, and scalability of the systematic review pipeline [3]. These systems not only reduce the cognitive and temporal burden on human reviewers but also contribute to reproducibility, transparency, and adaptability in evidence-based medicine [4].
Evaluating TAR systems presents distinct challenges that go beyond those of traditional information retrieval settings. Standard metrics used in text classification and information retrieval, such as recall, precision, and the area under the curve, primarily measure the system’s ability to retrieve relevant documents [5]. However, these metrics often assume an idealized environment where resources—time, money, and human effort—are unlimited. In practice, systematic reviews, particularly in medicine, operate under strict resource constraints: reviewers must balance the goal of achieving high recall with the reality of limited budgets and tight timelines [6,7]. Furthermore, existing metrics rarely account for the actual cost of reviewing documents, which includes the cognitive burden and time demands placed on human experts. Another complication arises when the true recall is unknown or difficult to estimate, especially in live systematic review projects where the entire set of relevant documents is not available [8]. As a result, there is growing recognition that the evaluation of TAR systems should incorporate cost awareness, risk management, and practical stopping criteria, rather than relying solely on traditional retrieval performance indicators. In real-world settings—like in medical domains—reviewers often operate under limited time, budgets, and cognitive resources. As such, cost-aware perspectives are critical in assessing the true utility of TAR systems and guiding their design [9].
The CLEF eHealth (https://clefehealth.imag.fr/ (accessed on 1 May 2025)) initiative has been a key venue in evaluating the performance of TAR systems in the context of medical systematic reviews. Between 2017 and 2019 [10,11,12], the CLEF eHealth TAR tasks provided a benchmark for comparing automated and semi-automated approaches using realistic, large-scale datasets derived from Cochrane reviews (https://www.cochranelibrary.com/ (accessed on 1 May 2025)). These tasks aimed to simulate real-world conditions by offering a continuous active learning setup, assessing systems based on their ability to retrieve relevant documents efficiently and effectively. This paper provides a critical overview of the experiments conducted in the CLEF eHealth TAR tasks over these three years, highlighting methodological trends, performance outcomes, and implications for future development in medical TAR systems.
This paper presents a comprehensive re-examination of the CLEF eHealth TAR tasks conducted from 2017 to 2019. In particular, we make the following key contributions in this study.
  • Reproduction of Official Results: We collect, organize, and verify the official participant runs from the CLEF eHealth TAR tasks (2017–2019), ensuring reproducibility and providing a unified view of system performance across editions.
  • Exploration of Budget Allocation Strategies: We investigate alternative strategies for allocating limited review budgets across multiple topics. By analyzing budget-aware screening performance, we provide practical insights for the optimization of systematic reviews under resource constraints.
  • Introduction of Cost-Aware Evaluation Metrics: We propose two novel measures—relevant found per cost unit (RFCU@k) and utility gain at budget (UG@B)—designed to capture the trade-offs between retrieval effectiveness and resource consumption in TAR systems.
This study makes a distinct contribution to the TAR literature by shifting the focus from single-topic ranking performance to the broader problem of budget allocation across multiple topics under realistic constraints. Whereas most prior evaluations have assumed unconstrained or uniform review budgets, we formalize allocation as a policy optimization problem and introduce two novel budget-aware metrics, RFCU and UG@B, that capture the complementary dimensions of efficiency and utility. To the best of our knowledge, no prior study has systematically investigated budget allocation strategies in evaluating TAR systems. By reanalyzing three years of CLEF eHealth TAR tasks under standardized budget assumptions, we show that allocation policies, whether simple heuristics or adaptive approaches, can materially influence review outcomes. These results provide both methodological advances and practical insights, offering the community new tools and benchmarks to better align TAR evaluation with the realities of systematic review practice.
The remainder of this paper is organized as follows. In Section 2, we introduce budget-aware perspectives on technology-assisted reviews and discuss their implications for screening strategies. Section 3 presents the CLEF eHealth TAR lab, describing its setting, tasks, and relevance for empirical medicine. Section 4 reports on the experimental analysis, evaluating different budget allocation strategies. Finally, Section 5 concludes the paper and outlines directions for future work.

2. Beyond Recall: Budget-Aware Perspectives in Technology-Assisted Reviews

Although metrics such as recall and workload reduction are standard in evaluating TAR systems, they only partially capture the practical constraints under which systematic reviews are conducted. In real-world settings—especially in medical domains—reviewers often operate under limited time, budgets, and cognitive resources. As such, cost-aware perspectives are critical in assessing the true utility of TAR systems and guiding their design. Several dimensions of cost remain underexplored in current TAR research.
First, most evaluation frameworks assume unlimited resources for document screening, with few efforts to simulate budget-constrained or time-sensitive review scenarios. Integrating explicit constraints—such as fixed review time or monetary limits—could yield insights into the robustness of different systems under realistic operational pressures. Second, human-in-the-loop efficiency remains largely nonquantified. The time per decision, fatigue effects, and the interpretability of ranking outputs are rarely measured but have substantial impacts on reviewer performance and user acceptance. Third, the computational cost of TAR systems, especially neural architectures, is seldom considered relative to their performance gains. Lightweight, non-neural models may offer similar effectiveness at a significantly lower cost in terms of infrastructure, energy consumption, and deployment complexity—factors that are particularly relevant for smaller institutions or global health contexts. Fourth, few studies account for the opportunity costs of screening inefficiencies or misclassifications. The delayed identification of relevant studies can affect the timeliness of clinical guideline implementation or policy decisions, which carries downstream consequences that are not reflected in retrieval metrics alone.
To address these limitations, research on budget allocation should adopt broader experimental protocols that model cost explicitly. This includes collecting timing data, incorporating economic analyses, and conducting mixed-method user studies. Comparative benchmarks should evaluate TAR systems not only by their retrieval effectiveness but also by their efficiency, transparency, and long-term sustainability.

2.1. Examples of Budget Allocation in Information Retrieval

While considerable research in the technology-assisted review (TAR) field has focused on stopping strategies and recall-oriented optimization, the problem of how to distribute review effort across multiple topics or tasks has received far less attention. In particular, budget allocation—how limited annotation or screening resources should be apportioned across topics—remains an underexplored area. To the best of our knowledge, budget allocation in TAR systems has not yet been systematically investigated, despite its growing relevance in scenarios involving large-scale, multitopic systematic reviews.
In this section, we examine related work on budget allocation in adjacent areas, highlighting strategies and frameworks that could inform future developments in TAR.
The TREC 2017 Common Core track explored how bandit techniques could be used not only to simulate but also to construct a test collection in real time. This requires addressing practical challenges beyond simulations, such as allocating annotation effort across topics and enabling assessors to familiarize themselves with each topic, while also developing infrastructure for real-time document selection and judgment acquisition [13].
Constructing reliable and low-cost IR test collections often depends on how annotation budgets are distributed across topics. Traditional strategies, as used in TREC, typically assign a uniform budget per topic, often tied to a fixed top-k pool size. However, more recent work has highlighted the limitations of such static allocation and has proposed dynamic approaches based on topic-specific needs [14].
Flexible budget models have also been studied in the context of sequential decision making. A generalization of restless multiarmed bandits (RMABs), known as flexible RMABs (F-RMABs), has been proposed to allow resource redistribution across decision rounds. This framework supports more realistic planning scenarios and provides a theoretically grounded algorithm for optimal budget allocation over time [15].
In multichannel advertising, reinforcement learning techniques have been applied to optimize long-term outcomes under budget constraints. A hybrid Q-learning and differential evolution (DE) approach has been introduced to dynamically allocate resources across channels, taking into account delayed effects and interdependencies between actions and outcomes [16].
Budget-sensitive approaches have also emerged in data annotation settings such as multilabel classification. A reverse auction framework has been proposed to select crowd workers based on cost and confidence, combined with systematic budget selection strategies (e.g., greedy and multicover selection) to address domain coverage limitations within fixed budgets [17].
Finally, contextual bandits, widely used in online recommendation systems, have been extended to handle biased user feedback caused by herding effects. This line of research, while not directly focused on budget, provides useful insights into sequential decision making under partial observability and feedback bias, which are relevant for adaptive resource allocation [18].

2.2. A Proposal of Budget Allocation Strategies in TAR Systems

In cost-constrained learning scenarios such as active learning, annotation campaigns, or document screening workflows, a central challenge is how to allocate a fixed labeling or review budget in order to maximize utility [16,19,20]. The budget may represent time, a monetary cost, or human effort and must be judiciously spent across a sequence of uncertain decisions. While most traditional active learning approaches focus on selecting the next most informative item, they often assume a static or uniform cost per query and overlook the global allocation question: how much effort should be spent on which data, tasks, or domains?
This budget allocation problem has been relatively understudied in TAR systems, despite its crucial role in real-world deployments. For example, in multiclass classification, multidomain retrieval, or document triage tasks, allocating too much effort to low-yield areas can waste resources, while underinvesting in promising areas reduces overall effectiveness. In [21], the authors mention that “it has been estimated that systematic reviews take, on average, 1139 h (range 216–2518 h) to complete and usually require a budget of at least $100,000”. Therefore, a robust allocation strategy must consider both the expected gain from labeling additional examples and the associated cost, potentially adapting over time as feedback is observed. This requires balancing the exploitation of known high-utility areas with the exploration of uncertain regions and motivates strategies that incorporate cost awareness, utility estimation, or uncertainty modeling into the allocation process.
Despite being a critical aspect, most TAR systems, when evaluated, focus on optimizing the within-topic ranking performance, assuming unlimited or per-topic isolated review budgets. However, in real-world scenarios such as legal reviews, systematic medical literature reviews, or multitopic classification, practitioners operate under a global screening budget that must be distributed across heterogeneous topics of varying size, relevance prevalence, and ranking quality [9,22,23,24,25,26,27,28].
In this study, we present a set of straightforward yet informative strategies for distributing review effort across topics. Rather than proposing complex adaptive policies, our objective is to analyze how different allocation schemes affect system performance and evaluation metrics under constrained review conditions. To this end, we wish to re-evaluate the CLEF eHealth TAR experiments from the 2017 to 2019 editions by applying a budget allocation strategy across all runs. This approach ensures that each TAR system is assessed under the same labeling cost constraints, allowing for a fair and consistent comparison of their effectiveness. By standardizing the annotation budget, we eliminate variability caused by differing cost assumptions in the original submissions, thereby focusing the evaluation on the intrinsic capabilities of the TAR methods themselves, rather than on disparities in resource expenditure.
In Section 2.2.1, we present four baseline strategies for budget allocation:
  • even allocation;
  • proportional allocation;
  • inversely proportional allocation; and
  • threshold-capped greedy allocation.
We selected these four strategies to reflect a range of simple, interpretable policies with different assumptions about topic relevance distribution. These strategies provide a useful contrast between uniform effort (even), topic prior-based effort (proportional/inverse), and a greedy baseline driven by estimated gains.
We assume that a total budget $B$ is available and must be allocated across the set of topics $T = \{t_1, t_2, \ldots, t_{|T|}\}$, where $|T|$ is the total number of topics. $B_i$ denotes the budget allocated to the $i$-th topic $t_i$, and $D_i$ denotes the total number of documents available for topic $t_i$.

2.2.1. Even Allocation

The even allocation strategy divides the total review budget equally across all topics:
$$ B_i = \frac{B}{|T|} $$
This approach ensures the uniform treatment of all topics, regardless of their size or difficulty. It is simple, fair, and effective in scenarios where topics are equally important or we lack prior estimations of relevance prevalence or difficulty. It may underperform when topic sizes are highly imbalanced or when some topics are significantly more relevant than others.
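As a minimal illustration in R (the language used for our analysis), with a hypothetical helper name and floor-based rounding of our own choosing:

    # Even allocation: split the global budget B equally across the |T| topics.
    # D is a vector of topic sizes and is used here only for its length.
    even_allocation <- function(D, B) {
      rep(floor(B / length(D)), length(D))
    }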

2.2.2. Proportional Allocation

Proportional allocation distributes the budget in proportion to the number of documents in each topic:
$$ B_i = \frac{D_i}{\sum_{j=1}^{|T|} D_j} \cdot B $$
This strategy assumes that larger topics deserve more review effort. This approach reflects the topic scale and avoids undersampling large collections, but it can overcommit to large topics with low relevance prevalence, reducing the overall retrieval efficiency.
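A corresponding sketch in R, under the same assumptions as the previous snippet:

    # Proportional allocation: each topic's share grows with its size D_i.
    proportional_allocation <- function(D, B) {
      floor(B * D / sum(D))
    }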

2.2.3. Inversely Proportional Allocation

Inversely proportional allocation gives more effort to smaller topics, based on the notion that they can be reviewed more thoroughly and may yield higher accuracy averaged across the topics:
$$ B_i = \frac{1/D_i}{\sum_{j=1}^{|T|} 1/D_j} \cdot B $$
It prioritizes small topics that may otherwise be overlooked, and it should improve the recall by ensuring better topic coverage under tight budgets. The limitation of this approach is that it may underallocate effort to large but relevance-rich topics.
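Again as a small R sketch (helper name ours):

    # Inversely proportional allocation: smaller topics receive larger shares.
    inverse_proportional_allocation <- function(D, B) {
      w <- (1 / D) / sum(1 / D)   # normalized inverse-size weights
      floor(B * w)
    }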

2.2.4. Threshold-Capped Greedy Allocation

This strategy allocates the budget in a greedy, depth-first manner, prioritizing smaller topics. Topics are sorted in ascending order of the number of available documents $D_i$, and the budget is allocated sequentially until it is exhausted. For each topic, the allocation is capped at either a fraction $\tau \in (0, 1]$ of its total size or the remaining budget $B_{\text{remaining}}$:
$$ B_i = \min\left(\tau \cdot D_i,\; B_{\text{remaining}}\right) $$
The idea of this approach is to quickly screen small topics, enabling broader coverage when only partial review is feasible. The drawback is that it may lead to the neglect of later (often larger) topics unless combined with a secondary reallocation strategy.
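A possible R sketch of this policy follows; the default cap tau = 0.5 is an illustrative value of ours, not a recommendation from the experiments.

    # Threshold-capped greedy allocation: visit topics from smallest to largest,
    # capping each at a fraction tau of its size or at the remaining budget.
    capped_greedy_allocation <- function(D, B, tau = 0.5) {
      alloc <- numeric(length(D))
      remaining <- B
      for (i in order(D)) {                          # ascending topic size
        alloc[i] <- min(floor(tau * D[i]), remaining)
        remaining <- remaining - alloc[i]
        if (remaining <= 0) break
      }
      alloc
    }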
These strategies provide foundational tools for managing review effort in multitopic TAR settings. While simple, they offer contrasting trade-offs between fairness, coverage, and efficiency. In Section 4, we analyze their empirical performance and explore adaptive alternatives based on relevance prevalence and utility gain.
One final remark concerns the fact that the budget allocation strategies may operate in parallel with a stopping strategy at the topic level. While a global screening budget constrains the total effort across all topics, individual topics may consume less than their allocated share if an early stopping condition is met—for example, when an estimated recall threshold is achieved or the marginal utility drops below a predefined threshold. In practice, this allows the reallocation of unused budget to other topics that remain uncertain or underexplored. However, the problem of recall estimation and stopping decision making introduces additional complexity, requiring reliable recall estimators or predictive models. In this work, we focus exclusively on the budget allocation problem under the assumption of fixed topic-level effort, and we leave the investigation of stopping strategies to future work.

2.3. Budget-Aware Evaluation Metrics

Traditional evaluations of TAR systems often prioritize effectiveness metrics, such as recall, while focusing on reducing the workload [10,11,12]. However, these metrics only offer a partial perspective on system utility, particularly when reviews must be performed under real-world resource constraints. To address this gap, we propose a budget-aware framework for evaluating TAR systems where we treat the systematic review process as a decision-making problem under budget constraints. Each document-screening action is associated with a cost—whether in terms of time, cognitive load, or financial expenditure—while the benefit is measured by how effectively the system retrieves relevant studies. To quantify this balance, we introduce budget-sensitive metrics like the relevant found per cost unit (RFCU) and utility gain at budget (UG@B), which measure system effectiveness while factoring in resource limitations.
These budget-aware metrics are intended to complement, rather than replace, traditional evaluation measures such as recall, precision, or F1. While standard metrics provide essential insights into the overall effectiveness of TAR systems, they do not capture how efficiently this performance is achieved in terms of the annotation cost. Budget-aware metrics fill this gap by explicitly linking performance to resource usage, offering a more refined view of system behavior under realistic constraints. By considering both sets of metrics, we gain a more comprehensive understanding of each system’s strengths—balancing absolute effectiveness with cost-efficiency—which is crucial for informed deployment in high-cost, time-sensitive review settings.
The proposed budget-aware metrics used in our evaluation operate under the realistic assumption that the total number of relevant documents is unknown—a typical scenario in technology-assisted review tasks. Rather than relying on oracle knowledge of how many relevant items remain undiscovered, these measures focus exclusively on the performance achieved within the constraints of the allocated budget. This means that we evaluate systems based on the actual annotation cost incurred and the utility of the results retrieved and not based on hypothetical outcomes that assume full awareness of the relevance distribution. This approach ensures that our evaluation remains grounded in the practical limitations of real-world applications, where exhaustive relevance judgments are rarely available.
We begin the presentation of the new metrics by reviewing Recall@k, which measures the proportion of relevant documents retrieved within the first $k$ screened documents:
$$ \text{Recall@}k = \frac{TP_k}{P_{\text{total}}} $$
where $TP_k$ is the number of relevant documents (true positives, TP) found up to rank $k$, and $P_{\text{total}}$ is the total number of relevant documents (positives, P) in the collection. While Recall@k effectively captures early retrieval performance, it assumes that screening cost is uniform and unlimited. In real-world TAR scenarios, reviewers operate under strict resource constraints. Consequently, Recall@k may fail to reflect the efficiency with which relevant documents are identified relative to the effort expended.
To address these limitations, we propose two complementary baseline budget-aware evaluation metrics: relevant found per cost unit (RFCU) and utility gain at budget (UG@B). Let B denote the reviewing budget, expressed as the number of documents that experts can examine for a systematic review. Given a ranked list of documents and their binary relevance labels, we wish to measure how effectively a system identifies relevant documents while minimizing wasted effort on nonrelevant ones.

2.3.1. Relevant Found per Cost Unit@k

RFCU@k measures how many relevant documents are retrieved per unit of cost within the first k documents. It is defined as
$$ \text{RFCU@}k = \frac{TP_k}{k \cdot c} $$
where
  • $TP_k$ is the number of relevant documents found within the first $k$ screened documents;
  • $c$ is the cost of reviewing a single document (e.g., in units of time, money, or effort).
RFCU@k captures the cost-efficiency of the screening process. Unlike Recall@k, it directly accounts for the screening cost, making it particularly appropriate when the objective is to maximize the yield of relevant documents per dollar or hour spent.
To better illustrate the difference between Recall@k and RFCU@k, consider the following hypothetical scenario: suppose that a document collection contains $P_{\text{total}} = 100$ relevant documents. A reviewer screens the top $k = 50$ documents, finding $TP_k = 20$ relevant documents. The cost per document assessed is $c = 2$ cost units (e.g., due to document length or review complexity).
Recall@k measures the proportion of relevant documents retrieved within the top k documents:
$$ \text{Recall@}k = \frac{TP_k}{P_{\text{total}}} = \frac{20}{100} = 0.2 $$
Thus, 20% of all relevant documents are found after screening 50 documents.
RFCU@k measures the number of relevant documents retrieved in the top-k documents per unit of cost spent:
$$ \text{RFCU@}k = \frac{TP_k}{k \cdot c} = \frac{20}{50 \cdot 2} = \frac{20}{100} = 0.2 $$
Thus, the system achieves an efficiency of 0.2 relevant documents per cost unit.
Although, in this example, the numerical values of Recall@k and RFCU@k coincide, they measure fundamentally different aspects:
  • Recall@k evaluates how much of the total relevant information has been retrieved;
  • RFCU@k evaluates how efficiently the review effort has been translated into relevant findings, accounting for the cost per document.
If the cost per document were higher (e.g., $c = 4$), RFCU@k would drop:
$$ \text{RFCU@}k = \frac{20}{50 \cdot 4} = \frac{20}{200} = 0.1 $$
whereas Recall@k would remain unchanged.
Another important aspect to highlight is that RFCU@k is similar to Precision@k, although the interpretation and the objectives again differ. Precision@k measures the proportion of relevant documents among the top-k retrieved documents:
$$ \text{Precision@}k = \frac{TP_k}{k} $$
Precision@k reflects the purity of the top-k results and is bounded between 0 and 1. It assumes that the cost of assessing each document is constant and does not influence the evaluation. RFCU@k, in contrast, accounts explicitly for the screening cost. When $c = 1$, RFCU@k reduces to Precision@k. However, when document assessment costs vary (e.g., due to document length or reviewer expertise requirements), RFCU@k provides an evaluation of system efficiency that takes screening costs into account.
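The worked example above can be reproduced with a few lines of R (variable names are ours):

    # 20 relevant documents found in the top k = 50, out of 100 relevant in total,
    # with a screening cost of 2 units per document.
    tp_k <- 20; k <- 50; p_total <- 100; cost <- 2
    recall_at_k    <- tp_k / p_total      # 0.2
    precision_at_k <- tp_k / k            # 0.4
    rfcu_at_k      <- tp_k / (k * cost)   # 0.2; equals Precision@k when cost = 1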

2.3.2. Utility Gain at Budget (UG@B)

UG@B formalizes the balance between the gain in finding relevant documents and the cost of reviewing nonrelevant ones. It is defined as
$$ \text{UG@}B = \sum_{j=1}^{B_i} u_j, \quad \text{where} \quad u_j = \begin{cases} +g, & \text{if the } j\text{-th document is relevant} \\ -c, & \text{if the } j\text{-th document is nonrelevant} \end{cases} $$
where
  • $B_i$ is the total number of documents reviewed for the $i$-th topic;
  • $g$ is the gain associated with reviewing a relevant document;
  • $c$ is the penalty associated with reviewing a nonrelevant document.
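A direct translation of this definition into R, assuming that `rel` is a 0/1 relevance vector ordered by the system's ranking (helper and argument names are ours):

    # Utility gain at budget: +gain for each relevant document reviewed,
    # -cost for each nonrelevant one, over the first B_i ranked documents.
    ug_at_b <- function(rel, B_i, gain = 1, cost = 1) {
      reviewed <- rel[seq_len(min(B_i, length(rel)))]
      gain * sum(reviewed == 1) - cost * sum(reviewed == 0)
    }
    # Example: ug_at_b(c(1, 0, 1, 1, 0, 0), B_i = 4) returns 3 - 1 = 2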
While Recall@k, like the other traditional measures, remains valuable for assessing early retrieval effectiveness, it does not account for the practical costs associated with reviewing documents. RFCU@k and UG@B extend the traditional evaluation by incorporating budget awareness and cost-efficiency, providing a more actionable framework for the assessment of TAR systems under real-world constraints.

3. The CLEF eHealth Technology-Assisted Review Lab

CLEF eHealth was a large and extensive evaluation challenge in the medical and biomedical domain within the CLEF initiative. The goal was to provide researchers with datasets, evaluation frameworks, and events. The CLEF eHealth TAR lab, conducted from 2017 to 2019 [10,11,12], aimed to advance the development of intelligent systems to support the systematic review process in evidence-based medicine. Systematic reviews are vital in synthesizing research findings and guiding clinical and policy decisions, yet they are often labor-intensive and time-consuming due to the exponential growth of the medical literature. The TAR lab addressed this challenge by providing a benchmark platform for evaluating automated and semi-automated methods capable of assisting reviewers in identifying relevant scientific articles.
The main objectives of the TAR lab were threefold:
  • To develop and evaluate technology-assisted methods that support different stages of systematic reviews, including document retrieval, title and abstract screening, and ranking;
  • To assess how well such systems could replicate or improve upon expert-generated results, particularly in the context of diagnostic test accuracy (DTA) reviews;
  • To create reusable test collections and protocols that could foster reproducible and comparative experimentation across the research community.
Each year, the lab introduced realistic tasks simulating distinct phases of the systematic review process. In 2017, the lab focused on prioritizing citations retrieved by expert Boolean queries. In 2018 and 2019, the lab expanded to two subtasks: (1) retrieving relevant documents using only structured review protocols (no Boolean search) and (2) ranking documents retrieved by expert searches (title and abstract screening). In particular, some of the challenges of the TAR tasks were as follows:
  • Data sparsity and imbalance: Relevant studies were a small minority within large retrieval sets, mimicking the real-world screening burden.
  • Domain specificity: The DTA reviews required an understanding of complex medical terminology and concepts.
  • Limited supervision: Especially in the no Boolean search subtask, systems had to operate with minimal training data and no expert-crafted queries.
Across the three editions, the TAR lab demonstrated that automatic and semi-automatic methods could substantially reduce the human effort required to complete systematic reviews without compromising their comprehensiveness. Participants employed a wide range of techniques, including classic machine learning, neural ranking models, and query expansion strategies. Some of the general findings were as follows:
  • High recall achievable with limited review effort: Several systems reached near-complete recall for relevant studies after reviewing only a fraction of the dataset.
  • Protocol-based retrieval remains challenging: The no Boolean search subtask proved to be the most difficult, with mixed results across teams and significant room for improvement.
  • Iterative strategies outperform static ones: Approaches that leveraged feedback loops, such as active learning or reranking, were generally more effective than static ranking methods.

3.1. CLEF eHealth 2017 Technology-Assisted Review Task

The CLEF eHealth 2017 evaluation lab introduced a new pilot task focusing on technology-assisted reviews (TAR) in empirical medicine [12,29]. The primary objective of this task was to support health scientists and policymakers by enhancing information access during the systematic review process. The TAR task specifically addressed the abstract and title screening phase of conducting diagnostic test accuracy (DTA) systematic reviews. Participants were challenged to develop and evaluate search algorithms aimed at efficiently and effectively ranking studies during this screening phase. The task involved constructing a benchmark collection of fifty systematic reviews, along with corresponding relevant and irrelevant articles identified through original Boolean queries. Fourteen teams participated, submitting a total of 68 automatic and semi-automatic runs, utilizing various information retrieval and machine learning algorithms over diverse text representations, in both batch and iterative modes.

3.2. CLEF eHealth 2018 Technology-Assisted Review Task

The CLEF eHealth 2018 evaluation lab continued its exploration into technology-assisted reviews (TAR) within empirical medicine, aiming to enhance the efficiency and effectiveness of systematic reviews [11,30]. Building upon the foundation laid in 2017, the 2018 lab introduced refinements to its tasks and methodologies. The 2018 TAR task was structured into two subtasks, each designed to simulate different stages of the systematic review process.
  • Subtask 1: No Boolean Search—Participants were provided with elements of a systematic review protocol, such as the objective, study type, participant criteria, index tests, target conditions, and reference standards. Using this information, they were to retrieve relevant studies from the PubMed database without constructing explicit Boolean queries. The goal was to emulate the initial search phase of a systematic review, relying solely on protocol information to identify pertinent literature.
  • Subtask 2: Title and Abstract Screening—In this subtask, participants received the results of a Boolean search executed by Cochrane experts, including the set of PubMed Document Identifiers (PMIDs) retrieved. The task involved ranking these abstracts to prioritize the most relevant studies, effectively simulating the screening phase, when reviewers assess titles and abstracts to determine inclusion in the systematic review.

3.3. CLEF eHealth 2019 Technology-Assisted Review Task

The CLEF eHealth 2019 evaluation lab, building upon the foundations established in previous years, introduced refinements to its tasks and methodologies [10,31]. The 2019 TAR task comprised the same two subtasks, each designed to simulate different stages of the systematic review process: subtask 1—no Boolean search; subtask 2—title and abstract screening.

4. Experimental Analysis

In this section, we present the experimental framework adopted to analyze the results of CLEF eHealth TAR tasks from 2017 to 2019. Our primary focus is on reproducing and systematically studying the official runs submitted by participants, with particular attention to the title and abstract screening task. We restrict our analysis to this task because it is the one with sufficient evaluation data across all three editions, whereas the other subtask (retrieval from protocols without Boolean search) was less consistently populated by participants and thus less suitable for comprehensive comparison. The goal of this experimental analysis is to reproduce the retrieval effectiveness of TAR systems under a recall-oriented perspective, reflecting the critical importance of high recall in systematic reviews. Accordingly, we adopt a set of well-established evaluation metrics that were used in the evaluation of the CLEF TAR labs, each capturing complementary aspects of system performance.
  • Recall@k: This measures the proportion of relevant documents retrieved within the top k screened documents. This metric provides a direct view of how quickly systems identify relevant studies. We will evaluate the runs when k is equal to the total number of relevant documents.
  • Area Under the Curve (AUC): Specifically, we compute the AUC of the recall versus documents screened curve, offering a global assessment of the retrieval efficiency across the screening process.
  • Mean Average Precision (MAP): This summarizes the precision achieved for each relevant document retrieved, providing an aggregate measure of the ranking quality. Although it is traditionally precision-biased, we have seen that this measure is strongly correlated with RFCU@k.
  • Work Saved over Sampling at Recall (WSS@r): This quantifies the reduction in screening effort relative to random sampling, at a given level of recall, thus highlighting the practical workload benefits offered by TAR systems.
These metrics collectively allow us to capture both the effectiveness (in terms of the retrieval and ranking of relevant documents) and the efficiency (in terms of effort saved) of the systems evaluated during the CLEF eHealth TAR challenges. This analysis serves as the foundation for comparing the results on the new cost- and budget-aware evaluation measures proposed in Section 2.3.
We implemented all measures in R and did not rely on the official scripts in order to be consistent with the R programming framework that we established for this analysis.
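For illustration, simplified per-topic versions of these measures can be written as follows (our own helpers, not the released evaluation code, and without the trec_eval truncation behavior discussed in Section 4.3); `rel` is a 0/1 relevance vector in ranked order.

    recall_at_k <- function(rel, k) sum(rel[seq_len(k)]) / sum(rel)
    # Average precision for one topic; MAP is its mean over topics.
    avg_precision <- function(rel) {
      prec <- cumsum(rel) / seq_along(rel)
      mean(prec[rel == 1])
    }
    # Area under the recall vs. fraction-screened curve (simple approximation).
    auc_recall <- function(rel) mean(cumsum(rel) / sum(rel))
    # Work saved over sampling at recall level r.
    wss_at_r <- function(rel, r = 0.95) {
      n <- length(rel)
      k <- which(cumsum(rel) >= ceiling(r * sum(rel)))[1]  # docs screened to reach recall r
      (n - k) / n - (1 - r)
    }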

4.1. Adaptive Baseline: Epsilon-Greedy Multiarmed Bandit

While our main analysis focuses on fixed, hand-crafted allocation strategies, we also include in this work a simple adaptive baseline to illustrate how learning-based methods can dynamically adjust budget allocation. We adopt an epsilon-greedy multiarmed bandit approach [32,33], where each review topic is treated as an “arm” that yields a reward when relevant documents are discovered. At each allocation step, the policy selects the topic with the highest estimated reward (exploitation) with probability $1 - \epsilon$ or a random topic (exploration) with probability $\epsilon$. In this work, the reward is defined as the number of relevant documents identified in the screened set, consistent with prior work in active learning [33,34].
This adaptive baseline highlights two key properties absent in static heuristics: (1) the ability to incorporate observed feedback into future allocation decisions and (2) the trade-off between exploiting high-yield topics and exploring uncertain ones. While epsilon-greedy is intentionally simple and does not model topic heterogeneity or review costs, it provides a principled adaptive comparison point and a foundation for more sophisticated strategies.
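A minimal sketch of such a policy in R, under simplifying assumptions of our own (allocation proceeds in fixed-size batches, and the reward is the number of relevant documents found in each batch); `rankings` is a list with one 0/1 relevance vector per topic, in ranked order.

    epsilon_greedy_allocate <- function(rankings, B, eps = 0.1, batch = 10) {
      n <- length(rankings)
      screened <- integer(n)   # documents screened so far per topic
      found    <- numeric(n)   # relevant documents found per topic (the reward)
      spent <- 0
      while (spent < B) {
        open <- which(screened < sapply(rankings, length))   # topics not yet exhausted
        if (length(open) == 0) break
        rate <- ifelse(screened > 0, found / screened, 0)    # estimated reward per document
        if (runif(1) < eps) {
          pick <- open[sample(length(open), 1)]              # explore: random open topic
        } else {
          pick <- open[which.max(rate[open])]                # exploit: best observed rate
        }
        take <- min(batch, length(rankings[[pick]]) - screened[pick], B - spent)
        idx  <- screened[pick] + seq_len(take)
        found[pick]    <- found[pick] + sum(rankings[[pick]][idx])
        screened[pick] <- screened[pick] + take
        spent <- spent + take
      }
      screened   # budget effectively spent on each topic
    }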

4.2. Datasets and Data Availability

All data used in this study were obtained from the official CLEF TAR GitHub repository (https://github.com/CLEF-TAR/tar/ (accessed on 1 May 2025)). This repository contains the complete set of topics, document collections, and submitted runs for the CLEF eHealth TAR tasks from 2017 to 2019, as well as associated relevance judgments and evaluation scripts. By relying on this openly accessible and curated resource, we ensure the transparency, reproducibility, and consistency of our experimental setup and results. We will also make available all the code that we implemented to reproduce the study online (https://github.com/gmdn (accessed on 1 May 2025)).
In Table 1, we provide a summary of the number of topics, documents, and relevant documents for each year of the CLEF eHealth TAR lab.

4.3. Reproducing the Original Results

In this section, we revisit and reproduce the results of the CLEF eHealth TAR lab from the 2017 to 2019 editions. By systematically reproducing the original runs, we aim to create a consistent evaluation framework that allows for direct comparison across years, systems, and strategies. This reproduction effort not only reinforces the transparency and reliability of previous findings but also lays the groundwork for introducing new, budget-aware evaluation methods that reflect the practical constraints of real-world TAR deployments in a better way.
To ensure completeness, we have included not only the runs that were officially evaluated during the original campaigns but also those that were submitted but not formally assessed at the time. This comprehensive approach allows for a more thorough comparison across systems and years and supports the introduction of new budget-aware evaluation metrics that reflect the practical constraints and priorities of real-world screening workflows.
In Table 2, Table 3 and Table 4, we show the reproduced results of the three editions, CLEF 2017, CLEF 2018, and CLEF 2019, respectively. We observed that our replicated evaluation closely matched the original scores. In fewer than 10% of the evaluated system–topic combinations, we observed minor deviations, with absolute differences typically ranging from 2% to 3%, and all of them occurred in the work saved over sampling (WSS) metric. This is due to the behavior of trec_eval (https://github.com/usnistgov/trec_eval (accessed on 1 May 2025)), the software used by CLEF lab organizers to compute results, which applies a truncation mechanism when a run fails to retrieve all relevant documents, leading to slight variations in WSS calculation. These isolated cases were limited in scope and did not exhibit any systematic bias in favor of a particular system. For each system–topic–year tuple, we computed both absolute and relative differences, confirming that the observed discrepancies were negligible and did not compromise the integrity or fairness of the comparative evaluation—a key objective of the CLEF benchmarking initiative. This high level of agreement confirms the reliability of the reproduction process and provides a solid foundation for further analysis. With this strongly reproduced dataset, we can confidently proceed to study the impact of budget allocation strategies, ensuring that any observed differences in performance are due to the evaluation framework rather than inconsistencies in the data or original runs.

4.4. Analysis of Budget Allocation Strategies

To explore the impact of budget-aware evaluation, we designed an experimental setting in which each TAR system is allocated a fixed annotation budget equivalent to 10% of the total number of documents in the collection and a value of 1 for both cost and gain. This budget simulates a realistic constraint on the number of documents that can be manually reviewed or labeled, reflecting common limitations in resources. All systems are evaluated under this uniform budget threshold to ensure comparability, and selection strategies are assumed to proceed until the budget is exhausted. This controlled setting enables us to assess not only how effectively systems retrieve relevant documents but also how efficiently they operate within predefined resource limits.
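To make the setup concrete, the following R sketch applies the 10% budget with unit gain and cost to a toy collection; the synthetic data, helper code, and the use of even allocation here are our own simplifications of the full experimental pipeline.

    set.seed(1)
    # Toy collection: three topics of different sizes with roughly 10% relevance prevalence.
    topic_rankings <- lapply(c(100, 40, 60), function(n) rbinom(n, 1, 0.1))
    B <- floor(0.10 * sum(lengths(topic_rankings)))     # global budget: 10% of all documents
    alloc <- rep(floor(B / length(topic_rankings)), length(topic_rankings))  # even allocation
    per_topic <- mapply(function(rel, b) {
      reviewed <- rel[seq_len(min(b, length(rel)))]
      c(recall = sum(reviewed) / max(sum(rel), 1),
        rfcu   = sum(reviewed) / length(reviewed),         # cost c = 1
        ug     = sum(reviewed == 1) - sum(reviewed == 0))  # gain g = 1, cost c = 1
    }, topic_rankings, alloc)
    rowMeans(per_topic)   # macro-averaged recall, RFCU, and UG@B for this run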
For better readability, we show the tables of all results in Appendix A. Our comparative evaluation across multiple years reveals that the effectiveness of budget allocation strategies varies substantially depending on the evaluation metric used. In particular, when optimizing for recall—the proportion of relevant documents retrieved—the best-performing allocation strategies were even allocation and inverse proportional allocation. These strategies prioritize topic coverage and allocate resources in a way that avoids overcommitting to large, potentially low-yield topics. In particular, inverse proportional allocation favors smaller topics, which may contain concentrated pockets of relevant documents and benefit from deeper review. This behavior aligns with recall’s sensitivity to missing relevant items, especially in underrepresented or niche topics.
In contrast, when optimizing for RFCU, the highest scores were achieved by the proportional and threshold-capped greedy strategies. These methods allocate budget relative to the topic size or limit effort given to large topics and thus avoid overspending review effort where relevant documents are sparse. RFCU emphasizes efficiency—finding the most relevant documents per unit cost—rather than coverage and is therefore better served by strategies that exploit expected high-yield regions while avoiding low-utility areas.
For UG@B, we observed no significant performance differences among the four allocation strategies. This suggests that UG@B is less sensitive to the specific distribution of the budget, possibly due to its balancing of gains (true positives) and penalties (false positives). UG@B captures the net benefit rather than pure coverage or efficiency and may tolerate a wider range of allocation behaviors without large variance in utility.
These results are better visualized in Figure 1, Figure 2 and Figure 3. For each year, we show a boxplot of the metrics of all experiments according to the allocation strategy. Strategies like even allocation and inverse proportional allocation promote fairness and topic-wide coverage, which benefits recall. Conversely, strategies such as proportional allocation and capped greedy allocation optimize for cost-effectiveness, which aligns with RFCU. UG@B’s neutrality reflects its more balanced formulation, integrating both gain and cost into a single score and, in this specific case, the fact that gains and costs are equal.
The epsilon-greedy bandit adaptive baseline achieved performance that was broadly comparable to that of the static policies, although it did not consistently outperform them. In particular, its ability to dynamically reallocate effort allowed it to avoid severe underinvestment in small but relevant-rich topics, yielding recall scores close to those of even and inverse proportional allocation. At the same time, the exploration component prevented the systematic neglect of large topics, producing RFCU values that were again similar to those of even allocation and inverse proportional allocation. However, under our uniform cost/gain assumptions, epsilon-greedy did show an advantage in terms of UG@B for all years. This behavior is similar to what we observed when using UG@B as the reward, rather than the number of relevant documents, for the adaptive baseline [35]. This reinforces the view that, in budget-constrained TAR settings, adaptive strategies offer flexibility, but their advantages depend strongly on the chosen rewards, task characteristics, and evaluation criteria. Overall, these results confirm that even the simplest adaptive strategy can provide a reasonable middle ground between coverage- and efficiency-oriented heuristics, and they highlight the potential of more advanced learning-based allocation methods.

4.5. Analysis of Correlations Between Evaluation Metrics

To better understand the relationships between evaluation metrics, we conducted a year-by-year correlation analysis comparing recall with RFCU and UG@B. The result, plotted in Figure 4 and Figure 5, shows that RFCU is, in some cases, only weakly correlated with recall—demonstrating that systems that achieve high recall may do so inefficiently, while others can yield high RFCU with relatively low recall. This divergence reinforces the view that RFCU captures a distinct, efficiency-oriented dimension of performance. In contrast, UG@B shows a more consistent and moderate correlation with recall across years, suggesting that it partially aligns with traditional effectiveness goals while also incorporating cost sensitivity. Importantly, the strength of these correlations varies year by year, reflecting differences in system behavior, task design, and relevance prevalence across CLEF eHealth editions. These findings support the use of multiple, complementary metrics in TAR evaluation to provide a broader understanding of system trade-offs under budget constraints.

4.6. On the Effect of Gain/Loss Ratios in UG@B

In our main experiments, the budget B was defined as the number of documents that could be reviewed. Under this interpretation, every reviewed document is either a true positive ($TP$) or a false positive ($FP$), and thus $TP + FP = B$. Let us rewrite Equation (8) in the following manner:
$$ \text{UG@}B = g \cdot TP - c \cdot FP $$
where $g > 0$ is the gain per relevant document found within the budget B, and $c > 0$ is the cost per nonrelevant document reviewed.
Substituting $FP = B - TP$, we obtain
$$ \text{UG@}B = (g + c) \cdot TP - c \cdot B. $$
Since B is constant for all systems, the ranking induced by UG@B is strictly increasing in $TP$. This implies that, when the budget is defined in terms of the number of documents reviewed, the relative weighting of gain and loss (i.e., the ratio $g/c$) does not affect the ranking between systems. A system that retrieves more relevant documents within the fixed number of screened documents will always achieve a higher UG@B, regardless of how heavily losses are penalized. In this setting, no crossover between systems (a crossover occurs when a system performs better or worse than another system according to the parameters of the metric—in this case, $g$ and $c$) can occur unless a stopping strategy is implemented for one of the systems (more details are given in the Conclusions).
As an alternative, the budget can be defined in terms of time rather than the number of screened documents. In this case, B represents the total annotation time available, where each reviewed document consumes a variable amount of time depending on whether it is relevant or not, and possibly also on the system’s workflow. For example, a system might route likely negatives through a faster triage process, while relevant items require more careful reading. Under this interpretation, the number of documents reviewed within the same time budget may differ across systems, and the mix of $TP$ and $FP$ achieved depends on both the effectiveness and the review cost. Let us consider this version of UG@B, which highlights the gain/loss ratio up to a multiplicative constant $c$:
$$ \text{UG@}B \propto \frac{g}{c} \cdot TP - FP. $$
This scenario renders UG@B more sensitive to the gain/loss ratio, and crossovers between systems become possible. To illustrate this, consider the following example. Suppose that the time budget is fixed at B = 100 min. We compare two systems with different review dynamics:
  • System A: True positives take 2.0 min to review, and false positives take 1.0 min. Within 100 min, this system can review $(TP_A, FP_A) = (33, 34)$ documents, since $2 \cdot 33 + 1 \cdot 34 = 100$.
  • System B: True positives take 2.0 min to review, and false positives take 1.33 min. Within 100 min, this system can review $(TP_B, FP_B) = (30, 30)$ documents, since $2 \cdot 30 + 1.33 \cdot 30 \approx 100$.
In this situation, the evaluation of the two systems would be
$$ UG_A = \frac{2}{1} \cdot 33 - 34 = 32, \qquad UG_B = \frac{2}{1.33} \cdot 30 - 30 \approx 15.1, $$
and system A would be preferred in terms of UG@B. The question now is the following: is there a gain/cost ratio value for which the preference of one system over the other would switch? Let us rewrite the utilities as a function of the gain/loss ratio $r = g/c$ (with $c = 1$ without loss of generality):
$$ UG_A(r) = 33r - 34, \qquad UG_B(r) = 30r - 30. $$
The crossover point occurs when $UG_A(r) = UG_B(r)$:
$$ 33r - 34 = 30r - 30 \;\Longrightarrow\; 3r = 4 \;\Longrightarrow\; r \approx 1.33. $$
This means that
  • For $r < 1.33$ (loss-dominant setting), system B is preferred because it minimizes wasted effort on nonrelevant items;
  • For $r > 1.33$ (gain-dominant setting), system A is preferred because its higher number of true positives outweighs the additional cost of false positives.
This example, displayed in Figure 6, shows that when the budget is defined in terms of time, the choice of gain/loss ratio directly influences the ranking of systems. UG@B thus provides a tunable framework to reflect different operational scenarios, such as prioritizing recall (high $g/c$) versus minimizing annotation effort (low $g/c$).
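The crossover can also be checked numerically with a few lines of R (function and variable names are ours):

    # UG as a function of the gain/loss ratio r = g/c, for fixed (TP, FP) counts.
    ug_ratio <- function(r, tp, fp) r * tp - fp
    r_grid <- seq(0.5, 2.5, by = 0.01)
    ug_a <- ug_ratio(r_grid, tp = 33, fp = 34)
    ug_b <- ug_ratio(r_grid, tp = 30, fp = 30)
    r_grid[which.min(abs(ug_a - ug_b))]   # crossover near r = 4/3, i.e., approximately 1.33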

5. Conclusions and Future Work

This study presents a comprehensive, budget-aware analysis of document screening strategies in technology-assisted review using real-world evaluation data from the CLEF eHealth shared tasks (2017–2019). Our investigation emphasizes the importance of modeling both retrieval performance and resource constraints when designing and evaluating TAR systems.
In particular, our main contributions are the following.
  • Reproduction of Official Results: We collect, organize, and verify the official participant runs from the CLEF eHealth TAR tasks, providing a unified and reproducible dataset spanning three years of evaluations.
  • Exploration of Budget Allocation Strategies: We investigate how limited screening budgets can be distributed across topics using several allocation strategies, including even, proportional, inverse proportional, and capped greedy approaches. Our analysis reveals that different strategies optimize different objectives: even and inverse proportional strategies favor recall, while proportional and capped strategies enhance efficiency.
  • Introduction of Cost-Aware Evaluation Metrics: We propose two novel measures—relevant found per cost unit (RFCU) and utility gain at budget (UG@B)—that directly account for resource expenditure. These metrics complement traditional recall by quantifying screening efficiency and net utility, and they help to expose trade-offs not visible through recall alone.
Our findings show that no single allocation strategy dominates across all evaluation criteria, underscoring the need to align strategy selection with screening goals—whether completeness, efficiency, or balanced utility. Furthermore, our correlation analysis demonstrates that cost-aware metrics like RFCU and UG@B provide complementary insights alongside recall, rather than duplicating it. The choice of evaluation metric in TARs should reflect the goals and constraints of the review task. In resource-unconstrained environments—such as legal e-discovery with completeness mandates—recall remains the gold standard. However, in practice, many reviews operate under strict cost or time budgets. In such settings, traditional recall may fail to capture a system’s efficiency or utility under constraints. To address this, we propose the following guideline for selecting evaluation metrics that align with practical review objectives: for a comprehensive performance analysis, we recommend a composite evaluation approach in which recall is used to ensure topic coverage, while RFCU and UG@B provide complementary perspectives on efficiency and cost sensitivity. This multimetric framework offers a more robust and actionable view of TAR performance, especially in real-world screening scenarios where budgets are finite and reviewers must prioritize effort.
This work opens up several promising directions. We envision the development of cost–risk frontiers, which plot the relationship between resource expenditure and the risk of missing relevant documents. These frontiers would enable a nuanced comparison of TAR strategies, not only by their performance but also by how well they handle the degradation of results under limited resources. Evaluating systems along cost–risk frontiers will provide a deeper understanding of their robustness when constrained by resource availability.
Incorporating human decision making into TAR system evaluation is a crucial avenue for future work, as the effectiveness of TAR workflows is ultimately shaped by human-in-the-loop dynamics. Reviewer behavior can vary widely—in terms of speed, risk tolerance, consistency, fatigue, and stopping strategies—introducing variance in the effective cost and duration of screening. These behavioral factors mean that the performance and utility of a policy can diverge from the outcomes predicted under the assumptions of this paper. To capture this complexity, research in this area will require access to real-world data on reviewer behavior. While simulations can approximate variations in speed, risk tolerance, or stopping strategies, they inevitably rely on simplifying assumptions that may not capture the full complexity of human decision making. In this sense, synthetic data can therefore be misleading, particularly when subtle patterns of fatigue, inconsistency, or adaptation to system guidance play a decisive role. To obtain reliable insights, future work should complement simulations with empirical studies involving actual users, ensuring that evaluations reflect the realities of human interaction with TAR systems.
Automatic early stopping is a crucial aspect of TAR systems, as it directly affects the trade-off between recall and cost. Stopping too early risks missing a substantial portion of relevant documents, while stopping too late expends resources without meaningful gains. Reviewer behavior further complicates this picture: a conservative reviewer may continue screening well beyond the optimal point, consuming unnecessary resources, while a more aggressive one may stop prematurely, jeopardizing recall. These behavioral and policy-driven factors highlight the importance of jointly modeling human decision making and system-level stopping strategies, ensuring that evaluations reflect both theoretical performance and the realities of practice. We propose the exploration of adaptive stopping mechanisms that rely on marginal utility or predictive confidence thresholds, allowing the review process to adjust dynamically to budget constraints. In future work, we aim to estimate the opportunity cost of early stopping by quantifying the risk of missing relevant documents when resources are limited, thereby identifying the most effective strategies for cost-sensitive review settings.
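As an illustration of the kind of adaptive rule we have in mind, the sketch below stops screening a topic when the last window of assessments yields too few relevant documents or when the remaining budget is exhausted. It is only an outline of the proposed idea, not the stopping strategy evaluated in this paper; the window size and threshold are placeholder parameters.

```python
from typing import Iterable

def review_until_diminishing_returns(relevance_stream: Iterable[int],
                                     budget: int,
                                     window: int = 50,
                                     min_hits_per_window: int = 1) -> int:
    """Screen documents in ranked order and stop early once the marginal
    utility (relevant documents found in the last `window` assessments)
    drops below `min_hits_per_window`, or once the budget is exhausted.
    Returns the number of documents reviewed."""
    reviewed = 0
    recent = []
    for is_relevant in relevance_stream:
        if reviewed >= budget:
            break
        reviewed += 1
        recent.append(is_relevant)
        if len(recent) > window:
            recent.pop(0)
        if len(recent) == window and sum(recent) < min_hits_per_window:
            break
    return reviewed
```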
Looking ahead, we aim to move beyond static budget allocation by enabling each TAR system to dynamically determine its own optimal budget usage based on performance signals and uncertainty estimates. Rather than imposing a fixed budget ceiling, future evaluations will explore adaptive budget strategies where systems decide how much annotation effort is necessary to achieve a given level of recall or confidence.

Funding

The APC was funded by the “National Biodiversity Future Center—NBFC” project funded under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.4—Call for tender No. 3138 of 16 December 2021, rectified by Decree n. 3175 of 18 December 2021 of Italian Ministry of University and Research funded by the European Union—NextGenerationEU. Project code CN_00000033.

Acknowledgments

This work is partially supported by the HEREDITARY Project, as part of the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101137074. During the preparation of this manuscript, the author used ChatGPT (GPT-5, OpenAI) exclusively for language refinement (e.g., grammar, style, and clarity). The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

In this appendix, we provide the detailed results of the different budget allocation approaches for each year of the CLEF TAR lab.
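Tables A5, A10, and A15 report results for the epsilon-greedy multiarmed bandit allocation, in which the global budget is spent one batch at a time, usually on the topic that has recently returned the most relevant documents, with occasional random exploration. The sketch below outlines this policy under simplifying assumptions (a fixed batch size and a reward equal to the number of relevant documents found in the last batch); the exact configuration behind the reported runs is available in the released code.

```python
import random
from typing import Dict, List

def epsilon_greedy_allocation(rankings: Dict[str, List[int]], total_budget: int,
                              batch_size: int = 10, epsilon: float = 0.1,
                              seed: int = 42) -> Dict[str, int]:
    """Spend the global budget batch by batch: with probability epsilon pick a
    random topic, otherwise the topic with the highest average reward
    (relevant documents per batch) observed so far. Returns the number of
    documents reviewed per topic. `rankings` maps each topic to its ranked
    relevance labels (1 = relevant), as in a simulated evaluation."""
    rng = random.Random(seed)
    spent = {t: 0 for t in rankings}          # documents reviewed per topic
    reward_sum = {t: 0.0 for t in rankings}   # relevant documents found per topic
    pulls = {t: 0 for t in rankings}          # batches drawn per topic
    budget_left = total_budget

    while budget_left > 0:
        # Only topics with unreviewed documents remain selectable.
        open_topics = [t for t in rankings if spent[t] < len(rankings[t])]
        if not open_topics:
            break
        if rng.random() < epsilon:
            topic = rng.choice(open_topics)
        else:
            # Unexplored topics get priority; otherwise pick the best average reward.
            topic = max(open_topics,
                        key=lambda t: reward_sum[t] / pulls[t] if pulls[t] else float("inf"))
        start = spent[topic]
        batch = rankings[topic][start:start + min(batch_size, budget_left)]
        spent[topic] += len(batch)
        reward_sum[topic] += sum(batch)
        pulls[topic] += 1
        budget_left -= len(batch)
    return spent
```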
Table A1. CLEF TAR 2017 lab. Even allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@95RFCU@kUG@B
1auth.simple.run6_NO_OVERVIEW0.980.980.980.980.820.15−274.67
2auth.simple.run8_NO_OVERVIEW0.980.980.980.980.820.15−274.67
3auth.simple.run7_NO_OVERVIEW0.980.980.980.980.820.14−274.93
4auth.simple.run5_NO_OVERVIEW0.950.950.950.950.720.13−284.53
5auth.simple.run30.890.280.840.270.330.12−301.73
6auth.simple.run10.890.300.840.280.340.11−302.40
7auth.simple.run20.890.300.830.280.340.11−302.67
8auth.simple.run40.890.300.830.280.340.11−302.60
9waterloo.B-rank-cost_NO_OVERVIEW0.890.320.840.310.350.12−299.87
10waterloo.B-rank-normal0.890.320.840.310.350.12−299.87
11waterloo.B-thresh0.890.320.840.310.350.12−299.93
12waterloo.B-thresh-cost_NO_OVERVIEW0.890.320.840.310.350.12−299.93
13waterloo.A-rank-cost_NO_OVERVIEW0.880.290.830.270.310.12−300.20
14waterloo.A-rank-normal0.880.290.830.270.310.12−300.20
15waterloo.A-thresh0.880.290.830.270.310.12−300.20
16waterloo.A-thresh-cost_NO_OVERVIEW0.880.290.830.270.310.12−300.20
17padua.iafap_m10p5f0m100.840.290.780.260.260.11−309.33
18padua.iafapc_m10p20f0t300p2m10.NO_OVERVIEW0.840.300.800.260.290.12−290.03
19padua.iafas_m10k50f0m100.840.280.780.250.260.11−310.80
20padua.iafap_m10p2f0m100.830.280.760.240.210.11−310.93
21padua.iafa_m10k150f0m100.830.300.780.270.280.10−311.93
22uos.sis.AL30Q0.830.210.740.200.170.11−310.50
23padua.iafapc_m10p20f0t150p2m10.NO_OVERVIEW0.820.300.770.260.210.12−292.93
24padua.iafapc_m10p10f0t150p2m10.NO_OVERVIEW0.790.280.730.240.140.12−268.60
25ncsu.abs0.730.080.600.110.120.09−322.67
26padua.iafapc_m10p5f0t0p2m10.NO_OVERVIEW0.710.260.690.220.080.16−213.27
27uos.bm25_10.710.180.620.160.110.09−328.63
28uos.bm25_20.710.180.620.160.110.09−328.63
29uos.bm25_2.50.710.180.620.160.110.09−328.63
30uos.sis.BM25_NO_OVERVIEW0.710.180.620.160.110.09−328.63
31uos.sis.bm25_1.5_NO_OVERVIEW0.710.180.620.160.110.09−328.63
32uos.sis.TMAL30Q_BM250.700.150.610.150.160.09−323.43
33cnrs.noaffull.all0.700.180.610.160.110.09−325.87
34sheffield.run30.700.210.620.180.150.08−329.53
35eth.m10.700.220.630.200.150.09−324.67
36eth.m20.700.220.630.200.150.09−324.67
37eth.m40.700.220.630.200.150.09−324.67
38sheffield.run10.690.180.610.150.190.08−334.47
39sheffield.run20.690.220.610.200.160.09−328.87
40sheffield.run40.690.220.610.200.160.09−328.80
41cnrs.noaf.all0.680.160.570.130.110.08−335.67
42ncsu.simple0.660.070.530.100.060.10−316.80
43amc.run0.610.130.500.110.090.07−338.47
44cnrs.abrupt.all0.610.160.510.130.040.07−340.67
45ecnu.run30.590.200.540.17−0.010.10−280.47
46qut.ca_bool_ltr0.570.130.460.100.030.08−337.00
47ecnu.run20.570.200.520.16−0.020.11−265.93
48qut.ca_pico_ltr0.570.160.480.140.060.07−339.53
49uos.sis.TMBEST_BM250.570.120.460.110.140.07−343.57
50cnrs.gradual.all0.560.150.460.130.050.08−337.53
51qut.rf_bool_ltr0.550.110.420.090.000.08−338.20
52iiit.run10.550.160.510.140.050.12−196.37
53iiit.run20.550.160.510.140.050.12−196.37
54iiit.run30.550.160.510.140.050.12−196.37
55iiit.run40.550.160.510.140.050.12−196.37
56iiit.run5_NO_OVERVIEW0.550.160.510.140.050.12−196.37
57iiit.run6_NO_OVERVIEW0.550.160.510.140.050.12−196.37
58iiit.run7_NO_OVERVIEW0.550.160.510.140.050.12−196.37
59iiit.run8_NO_OVERVIEW0.550.160.510.140.050.12−196.37
60qut.rf_pico_ltr0.540.130.430.100.050.07−344.40
61qut.pico_es0.500.160.450.110.050.08−290.47
62ntu.run10.490.070.350.06−0.020.06−348.80
63qut.bool_es0.490.150.440.120.030.07−305.33
64ecnu.run10.470.090.350.08−0.000.06−356.13
65ntu.run20.420.050.290.05−0.010.05−358.20
66ntu.run30.420.040.280.040.020.04−369.47
67ucl.run_abstract0.400.050.260.05−0.000.05−362.47
68ucl.run_fulltext0.380.050.250.04−0.000.04−371.27
69uos.sis.pubmed.random_NO_OVERVIEW0.360.040.210.03−0.040.04−369.33
Table A2. CLEF TAR 2017 lab. Proportional allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1auth.simple.run6_NO_OVERVIEW0.940.940.940.940.720.33−276.73
2auth.simple.run8_NO_OVERVIEW0.940.940.940.940.720.33−276.73
3auth.simple.run7_NO_OVERVIEW0.940.940.940.940.720.33−277.00
4auth.simple.run5_NO_OVERVIEW0.910.910.910.910.620.32−285.80
5waterloo.B-rank-cost_NO_OVERVIEW0.740.290.720.250.110.24−300.20
6waterloo.B-rank-normal0.740.290.720.250.110.24−300.20
7waterloo.B-thresh0.730.290.710.250.110.24−298.23
8waterloo.B-thresh-cost_NO_OVERVIEW0.730.290.710.250.110.24−298.23
9auth.simple.run10.730.270.710.220.080.22−302.53
10auth.simple.run30.710.250.700.210.050.21−302.67
11waterloo.A-rank-cost_NO_OVERVIEW0.710.260.690.200.110.22−302.20
12waterloo.A-rank-normal0.710.260.690.200.110.22−302.20
13waterloo.A-thresh0.710.260.690.200.110.22−302.20
14waterloo.A-thresh-cost_NO_OVERVIEW0.710.260.690.200.110.22−302.20
15auth.simple.run20.700.270.680.220.110.20−303.20
16auth.simple.run40.690.270.680.220.080.20−303.27
17padua.iafap_m10p5f0m100.680.270.660.210.080.20−309.13
18padua.iafa_m10k150f0m100.670.270.660.210.080.20−310.80
19padua.iafas_m10k50f0m100.670.260.650.200.080.19−310.60
20padua.iafapc_m10p20f0t300p2m10.NO_OVERVIEW0.660.270.650.200.080.19−283.07
21padua.iafapc_m10p5f0t0p2m10.NO_OVERVIEW0.640.240.630.190.080.16−312.07
22padua.iafap_m10p2f0m100.640.260.620.190.050.18−312.13
23padua.iafapc_m10p10f0t150p2m10.NO_OVERVIEW0.630.270.620.200.010.19−215.57
24padua.iafapc_m10p20f0t150p2m10.NO_OVERVIEW0.630.270.620.200.010.19−215.57
25uos.sis.AL30Q0.580.190.560.150.010.15−312.87
26uos.sis.TMAL30Q_BM250.510.140.490.10−0.020.14−323.40
27cnrs.noaffull.all0.510.170.490.12−0.020.15−326.67
28eth.m10.510.200.490.150.010.17−327.07
29eth.m20.510.200.490.150.010.17−327.07
30eth.m40.510.200.490.150.010.17−327.07
31sheffield.run40.500.200.490.15−0.020.16−330.20
32sheffield.run20.500.200.480.15−0.020.16−330.00
33sheffield.run30.490.200.480.130.010.15−330.87
34sheffield.run10.470.160.450.100.010.15−338.13
35uos.bm25_10.450.160.440.11−0.020.12−334.00
36uos.bm25_20.450.160.440.11−0.020.12−334.00
37uos.bm25_2.50.450.160.440.11−0.020.12−334.00
38uos.sis.BM25_NO_OVERVIEW0.450.160.440.11−0.020.12−334.00
39uos.sis.bm25_1.5_NO_OVERVIEW0.450.160.440.11−0.020.12−334.00
40cnrs.noaf.all0.440.150.430.09−0.020.13−341.40
41ncsu.abs0.430.060.420.06−0.050.10−334.47
42ecnu.run30.430.170.420.12−0.020.15−302.13
43iiit.run10.430.150.410.10−0.020.13−225.37
44iiit.run20.430.150.410.10−0.020.13−225.37
45iiit.run30.430.150.410.10−0.020.13−225.37
46iiit.run40.430.150.410.10−0.020.13−225.37
47iiit.run5_NO_OVERVIEW0.430.150.410.10−0.020.13−225.37
48iiit.run6_NO_OVERVIEW0.430.150.410.10−0.020.13−225.37
49iiit.run7_NO_OVERVIEW0.430.150.410.10−0.020.13−225.37
50iiit.run8_NO_OVERVIEW0.430.150.410.10−0.020.13−225.37
51ecnu.run20.420.180.410.12−0.020.16−291.37
52ncsu.simple0.420.070.400.06−0.020.11−317.60
53qut.bool_es0.370.140.360.09−0.050.13−338.53
54qut.pico_es0.370.150.360.08−0.020.12−340.87
55cnrs.abrupt.all0.370.140.360.08−0.050.12−347.40
56amc.run0.340.110.330.07−0.050.12−341.67
57qut.ca_pico_ltr0.340.150.330.10−0.050.13−343.20
58cnrs.gradual.all0.330.130.320.08−0.050.14−342.53
59qut.ca_bool_ltr0.310.120.300.06−0.050.09−345.47
60uos.sis.TMBEST_BM250.310.110.290.07−0.050.12−351.07
61qut.rf_bool_ltr0.280.100.270.06−0.050.09−347.27
62qut.rf_pico_ltr0.260.110.250.06−0.050.10−351.07
63ecnu.run10.200.080.190.04−0.050.08−364.33
64ntu.run10.200.060.190.03−0.050.07−358.47
65ntu.run20.150.050.150.01−0.050.05−367.13
66ucl.run_abstract0.110.040.100.02−0.050.05−374.80
67ucl.run_fulltext0.100.030.100.01−0.050.04−383.07
68ntu.run30.090.030.090.01−0.050.03−381.53
69uos.sis.pubmed.random_NO_OVERVIEW0.090.030.090.01−0.050.04−379.67
Table A3. CLEF TAR 2017 lab. Inverse proportional allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1auth.simple.run6_NO_OVERVIEW0.980.980.980.980.820.15−275.37
2auth.simple.run8_NO_OVERVIEW0.980.980.980.980.820.15−275.37
3auth.simple.run7_NO_OVERVIEW0.980.980.980.980.820.15−275.63
4auth.simple.run5_NO_OVERVIEW0.950.950.950.950.720.14−284.50
5auth.simple.run30.880.280.830.270.330.12−305.23
6auth.simple.run10.880.300.830.280.360.12−305.63
7auth.simple.run20.880.300.820.280.370.12−305.03
8auth.simple.run40.880.300.820.280.340.12−305.43
9waterloo.A-rank-cost_NO_OVERVIEW0.880.280.820.260.280.12−303.00
10waterloo.A-rank-normal0.880.280.820.260.280.12−303.00
11waterloo.A-thresh0.880.280.820.260.280.12−303.00
12waterloo.A-thresh-cost_NO_OVERVIEW0.880.280.820.260.280.12−303.00
13waterloo.B-rank-cost_NO_OVERVIEW0.870.320.830.300.290.12−303.07
14waterloo.B-rank-normal0.870.320.830.300.290.12−303.07
15waterloo.B-thresh0.870.320.830.300.290.12−303.07
16waterloo.B-thresh-cost_NO_OVERVIEW0.870.320.830.300.290.12−303.07
17padua.iafas_m10k50f0m100.830.280.780.250.250.11−313.30
18padua.iafa_m10k150f0m100.830.300.780.260.250.11−314.37
19padua.iafapc_m10p20f0t300p2m10.NO_OVERVIEW0.820.300.770.260.270.12−286.80
20padua.iafap_m10p5f0m100.820.290.760.250.230.11−312.10
21uos.sis.AL30Q0.820.210.730.200.200.11−313.37
22padua.iafap_m10p2f0m100.820.270.740.240.180.11−313.50
23padua.iafapc_m10p20f0t150p2m10.NO_OVERVIEW0.810.300.760.250.220.12−281.70
24padua.iafapc_m10p10f0t150p2m10.NO_OVERVIEW0.780.280.730.230.140.13−250.77
25ncsu.abs0.750.080.610.110.180.09−324.10
26uos.bm25_10.710.170.610.160.110.09−329.90
27uos.bm25_20.710.170.610.160.110.09−329.90
28uos.bm25_2.50.710.170.610.160.110.09−329.90
29uos.sis.BM25_NO_OVERVIEW0.710.170.610.160.110.09−329.90
30uos.sis.bm25_1.5_NO_OVERVIEW0.710.170.610.160.110.09−329.90
31cnrs.noaffull.all0.700.180.600.160.180.09−329.03
32uos.sis.TMAL30Q_BM250.700.150.600.140.200.09−326.03
33eth.m10.690.220.620.200.170.09−326.43
34eth.m20.690.220.620.200.170.09−326.43
35eth.m40.690.220.620.200.170.09−326.43
36sheffield.run30.690.210.610.180.150.08−331.13
37padua.iafapc_m10p5f0t0p2m10.NO_OVERVIEW0.690.260.670.220.040.16−194.37
38sheffield.run20.690.220.610.200.160.08−330.67
39sheffield.run10.680.180.590.150.190.08−336.87
40sheffield.run40.680.220.600.200.160.08−331.13
41ncsu.simple0.670.070.520.100.060.10−320.03
42cnrs.noaf.all0.660.160.550.130.130.08−337.57
43amc.run0.600.120.490.110.100.07−340.77
44cnrs.gradual.all0.600.150.470.130.080.08−338.17
45cnrs.abrupt.all0.590.160.490.130.050.07−343.57
46ecnu.run30.580.190.520.160.000.10−272.90
47qut.ca_pico_ltr0.570.160.480.140.060.07−340.70
48qut.ca_bool_ltr0.570.120.450.100.030.08−338.37
49ecnu.run20.560.200.520.160.000.11−258.07
50uos.sis.TMBEST_BM250.560.120.450.110.140.07−345.10
51qut.rf_bool_ltr0.550.110.420.090.000.07−339.30
52iiit.run10.550.160.500.140.050.12−190.10
53iiit.run20.550.160.500.140.050.12−190.10
54iiit.run30.550.160.500.140.050.12−190.10
55iiit.run40.550.160.500.140.050.12−190.10
56iiit.run5_NO_OVERVIEW0.550.160.500.140.050.12−190.10
57iiit.run6_NO_OVERVIEW0.550.160.500.140.050.12−190.10
58iiit.run7_NO_OVERVIEW0.550.160.500.140.050.12−190.10
59iiit.run8_NO_OVERVIEW0.550.160.500.140.050.12−190.10
60qut.rf_pico_ltr0.540.130.430.100.050.07−345.23
61ntu.run10.520.070.360.06−0.020.06−349.50
62qut.bool_es0.500.150.440.120.030.07−297.00
63ecnu.run10.490.090.360.08−0.000.05−356.37
64qut.pico_es0.490.160.440.110.050.08−277.13
65ntu.run20.440.050.290.05−0.000.05−358.43
66ntu.run30.430.040.290.040.020.04−367.80
67ucl.run_abstract0.420.050.260.05−0.000.05−361.53
68uos.sis.pubmed.random_NO_OVERVIEW0.400.040.220.03−0.040.04−367.93
69ucl.run_fulltext0.400.050.260.04−0.000.04−369.60
Table A4. CLEF TAR 2017 lab. Capped allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1auth.simple.run6_NO_OVERVIEW0.980.980.980.980.820.23−274.67
2auth.simple.run8_NO_OVERVIEW0.980.980.980.980.820.23−274.67
3auth.simple.run7_NO_OVERVIEW0.980.980.980.980.820.23−274.93
4auth.simple.run5_NO_OVERVIEW0.950.950.950.950.720.22−284.20
5auth.simple.run10.820.300.790.270.210.17−300.13
6auth.simple.run30.810.280.780.260.140.17−300.00
7waterloo.B-rank-cost_NO_OVERVIEW0.810.320.780.290.210.17−298.47
8waterloo.B-rank-normal0.810.320.780.290.210.17−298.47
9auth.simple.run20.810.300.770.260.220.17−300.87
10auth.simple.run40.800.300.770.260.190.17−300.87
11waterloo.B-thresh0.800.320.780.290.210.17−296.50
12waterloo.B-thresh-cost_NO_OVERVIEW0.800.320.780.290.210.17−296.50
13waterloo.A-rank-cost_NO_OVERVIEW0.790.290.760.250.200.16−300.20
14waterloo.A-rank-normal0.790.290.760.250.200.16−300.20
15waterloo.A-thresh0.790.290.760.250.200.16−300.20
16waterloo.A-thresh-cost_NO_OVERVIEW0.790.290.760.250.200.16−300.20
17padua.iafa_m10k150f0m100.780.300.750.250.210.16−308.13
18padua.iafapc_m10p20f0t300p2m10.NO_OVERVIEW0.770.300.740.240.200.16−289.23
19padua.iafas_m10k50f0m100.770.280.740.240.210.16−308.40
20padua.iafap_m10p5f0m100.760.290.730.240.170.16−307.00
21padua.iafapc_m10p20f0t150p2m10.NO_OVERVIEW0.740.300.710.240.130.16−221.73
22padua.iafap_m10p2f0m100.730.280.690.220.090.15−309.80
23padua.iafapc_m10p10f0t150p2m10.NO_OVERVIEW0.710.280.690.220.070.16−221.50
24uos.sis.AL30Q0.710.210.660.190.070.15−309.67
25padua.iafapc_m10p5f0t0p2m10.NO_OVERVIEW0.690.260.670.210.110.16−304.23
26uos.sis.TMAL30Q_BM250.640.150.580.130.060.15−319.97
27eth.m10.610.220.570.190.040.13−323.87
28eth.m20.610.220.570.190.040.13−323.87
29eth.m40.610.220.570.190.040.13−323.87
30sheffield.run40.600.220.560.180.030.13−327.40
31sheffield.run20.600.220.560.190.030.13−327.13
32sheffield.run30.600.210.560.160.030.13−328.40
33cnrs.noaffull.all0.600.180.560.150.010.13−324.00
34uos.bm25_10.570.180.530.140.010.12−331.07
35uos.bm25_20.570.180.530.140.010.12−331.07
36uos.bm25_2.50.570.180.530.140.010.12−331.07
37uos.sis.BM25_NO_OVERVIEW0.570.180.530.140.010.12−331.07
38uos.sis.bm25_1.5_NO_OVERVIEW0.570.180.530.140.010.12−331.07
39sheffield.run10.570.180.520.140.050.12−335.40
40ncsu.abs0.550.080.490.08−0.020.12−330.47
41cnrs.noaf.all0.530.160.490.110.030.10−339.47
42ncsu.simple0.520.070.460.080.030.13−314.53
43iiit.run10.520.160.490.130.020.12−218.67
44iiit.run20.520.160.490.130.020.12−218.67
45iiit.run30.520.160.490.130.020.12−218.67
46iiit.run40.520.160.490.130.020.12−218.67
47iiit.run5_NO_OVERVIEW0.520.160.490.130.020.12−218.67
48iiit.run6_NO_OVERVIEW0.520.160.490.130.020.12−218.67
49iiit.run7_NO_OVERVIEW0.520.160.490.130.020.12−218.67
50iiit.run8_NO_OVERVIEW0.520.160.490.130.020.12−218.67
51ecnu.run30.510.200.480.15−0.020.14−304.47
52ecnu.run20.490.200.470.15−0.020.13−293.90
53cnrs.abrupt.all0.450.160.420.11−0.010.09−345.27
54amc.run0.440.130.400.09−0.020.10−339.53
55qut.ca_pico_ltr0.430.160.390.12−0.000.10−341.00
56qut.bool_es0.420.150.400.11−0.050.10−333.73
57cnrs.gradual.all0.420.150.390.110.010.11−340.33
58uos.sis.TMBEST_BM250.420.120.360.09−0.030.11−348.30
59qut.pico_es0.410.160.390.10−0.000.10−335.67
60qut.ca_bool_ltr0.400.130.360.08−0.040.08−343.67
61qut.rf_bool_ltr0.360.110.330.07−0.040.08−344.87
62qut.rf_pico_ltr0.350.130.320.09−0.000.09−348.80
63ecnu.run10.300.090.270.06−0.050.07−362.13
64ntu.run10.290.070.250.05−0.040.07−356.20
65ntu.run20.250.050.210.03−0.040.06−364.93
66ntu.run30.210.040.160.03−0.050.05−378.33
67ucl.run_abstract0.190.050.150.03−0.040.05−372.47
68ucl.run_fulltext0.180.050.150.02−0.050.03−381.00
69uos.sis.pubmed.random_NO_OVERVIEW0.160.040.120.02−0.050.04−378.07
Table A5. CLEF TAR 2017 lab. Epsilon-greedy multiarmed bandit. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1waterloo.B-thresh-cost_NO_OVERVIEW0.830.310.780.300.400.12−282.67
2waterloo.B-thresh0.820.310.770.300.400.12−279.07
3waterloo.A-rank-normal0.780.280.740.250.350.14−286.73
4waterloo.A-thresh-cost_NO_OVERVIEW0.780.280.730.250.350.14−287.37
5waterloo.A-thresh0.780.280.730.250.350.13−286.00
6waterloo.A-rank-cost_NO_OVERVIEW0.770.280.730.250.320.14−285.50
7padua.iafa_m10k150f0m100.770.300.720.260.290.13−292.50
8padua.iafapc_m10p20f0t150p2m10.NO_OVERVIEW0.760.290.710.250.180.14−221.47
9padua.iafapc_m10p20f0t300p2m10.NO_OVERVIEW0.750.290.710.250.240.14−278.77
10padua.iafas_m10k50f0m100.750.280.710.240.240.13−292.63
11waterloo.B-rank-cost_NO_OVERVIEW0.750.310.720.290.360.14−286.70
12waterloo.B-rank-normal0.740.310.710.290.360.13−287.17
13padua.iafap_m10p5f0m100.740.280.690.250.220.13−290.97
14padua.iafap_m10p2f0m100.700.270.660.220.140.14−289.90
15padua.iafapc_m10p10f0t150p2m10.NO_OVERVIEW0.680.270.650.220.100.15−199.30
16sheffield.run40.650.220.590.200.120.10−306.60
17padua.iafapc_m10p5f0t0p2m10.NO_OVERVIEW0.630.260.610.210.080.18−171.23
18sheffield.run30.630.210.570.180.090.10−308.17
19sheffield.run20.620.210.570.200.080.10−307.67
20sheffield.run10.610.180.540.150.100.08−310.57
21uos.sis.AL30Q0.600.200.540.180.110.13−289.87
22uos.sis.TMAL30Q_BM250.560.150.490.130.130.11−301.03
23ncsu.abs0.560.070.490.090.060.09−283.47
24uos.bm25_10.550.170.480.140.030.11−306.23
25iiit.run5_NO_OVERVIEW0.550.160.500.140.020.12−242.03
26uos.bm25_2.50.540.170.480.140.020.11−304.93
27iiit.run40.540.160.490.140.020.12−241.63
28iiit.run7_NO_OVERVIEW0.540.160.490.140.020.12−243.47
29uos.bm25_20.540.170.480.140.020.11−307.77
30iiit.run20.540.160.490.140.020.12−239.20
31iiit.run6_NO_OVERVIEW0.540.160.490.140.020.12−239.20
32uos.sis.bm25_1.5_NO_OVERVIEW0.540.170.470.140.020.11−303.43
33iiit.run8_NO_OVERVIEW0.540.160.480.140.020.12−243.43
34iiit.run30.530.160.480.140.020.12−241.17
35qut.ca_pico_ltr0.530.160.460.13−0.010.09−314.43
36uos.sis.BM25_NO_OVERVIEW0.530.170.470.14−0.000.11−307.50
37amc.run0.470.120.410.100.020.09−308.30
38qut.bool_es0.450.150.410.12−0.050.09−276.53
39uos.sis.TMBEST_BM250.440.120.370.100.020.09−316.33
40qut.pico_es0.430.150.390.11−0.030.10−257.17
41qut.rf_pico_ltr0.430.120.370.09−0.010.08−319.87
42qut.ca_bool_ltr0.430.120.360.08−0.030.07−314.70
43qut.rf_bool_ltr0.420.100.350.08−0.040.07−317.93
44ntu.run30.380.040.260.04−0.020.04−339.77
45ntu.run10.370.070.310.05−0.030.06−314.77
46ncsu.simple0.340.050.280.050.090.06−265.93
47ucl.run_fulltext0.320.050.210.04−0.040.04−323.73
48ntu.run20.320.050.240.03−0.030.05−329.80
49uos.sis.pubmed.random_NO_OVERVIEW0.290.040.180.03−0.040.04−336.97
50ucl.run_abstract0.240.050.190.03−0.040.04−339.50
Table A6. CLEF TAR 2018 lab. Even allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1auth_run30.930.390.890.380.510.14−512.50
2auth_run10.920.400.890.390.510.14−512.77
3auth_run20.920.400.890.390.510.14−512.77
4UWB0.920.390.880.370.450.14−510.90
5UWA0.920.370.880.350.440.14−512.37
6cnrs_RF_bi0.840.330.800.290.270.12−550.03
7cnrs_comb0.840.340.800.320.300.12−553.50
8shef-feed0.840.560.800.590.350.12−543.83
9cnrs_RF_uni0.830.350.790.300.230.12−551.70
10unipd_t5000.830.360.770.300.260.12−551.43
11unipd_t10000.820.360.770.300.170.12−553.23
12unipd_t15000.820.360.770.300.170.12−553.23
13shef-general0.750.280.690.240.200.10−579.97
14shef-query0.720.240.660.210.190.10−586.83
15uic_model70.640.210.550.16−0.020.08−608.57
16uic_model80.630.200.540.16−0.020.08−610.30
17ECNU_RUN30.610.200.510.13−0.040.08−605.83
18ECNU_RUN10.600.190.490.12−0.040.08−612.70
19ECNU_RUN20.460.120.400.07−0.050.07−606.10
Table A7. CLEF TAR 2018 lab. Proportional allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1auth_run10.790.380.770.330.170.28−535.03
2auth_run20.790.380.770.330.170.28−535.03
3auth_run30.780.370.760.320.170.28−535.57
4UWB0.770.360.760.300.140.27−535.30
5UWA0.770.350.750.290.110.27−536.57
6cnrs_comb0.720.310.700.260.110.26−567.30
7shef-feed0.720.540.710.550.110.29−563.50
8cnrs_RF_bi0.700.320.680.240.080.24−567.23
9unipd_t5000.690.330.670.250.110.24−568.97
10cnrs_RF_uni0.690.330.670.250.080.24−568.97
11unipd_t15000.680.330.670.250.080.24−572.37
12unipd_t10000.680.330.660.250.080.24−572.83
13shef-general0.550.260.530.18−0.050.22−614.90
14shef-query0.500.220.480.14−0.050.20−625.43
15uic_model70.420.190.410.11−0.050.17−640.43
16uic_model80.410.180.400.11−0.050.16−642.17
17ECNU_RUN30.410.180.400.09−0.050.16−634.43
18ECNU_RUN10.380.170.370.09−0.050.14−646.03
19ECNU_RUN20.260.100.250.04−0.050.11−657.90
Table A8. CLEF TAR 2018 lab. Inverse proportional allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1UWB0.910.390.870.360.470.14−520.83
2auth_run10.910.400.870.380.480.14−523.87
3auth_run20.910.400.870.380.480.14−523.87
4auth_run30.910.390.870.380.480.14−523.07
5UWA0.900.370.860.340.470.14−523.10
6cnrs_RF_bi0.840.330.790.290.320.12−554.50
7cnrs_comb0.830.340.790.320.370.11−557.77
8shef-feed0.830.560.800.590.400.12−547.77
9cnrs_RF_uni0.820.350.770.290.250.12−558.17
10unipd_t5000.820.360.760.300.270.12−556.77
11unipd_t10000.820.360.760.300.190.12−558.23
12unipd_t15000.820.360.760.300.190.12−558.23
13shef-general0.730.280.680.240.200.09−585.43
14shef-query0.710.240.650.210.210.09−590.63
15uic_model70.650.210.550.17−0.000.08−607.63
16uic_model80.640.200.540.16−0.000.08−609.97
17ECNU_RUN30.620.200.500.13−0.040.08−604.70
18ECNU_RUN10.600.190.480.13−0.040.08−610.57
19ECNU_RUN20.450.120.390.07−0.050.07−589.57
Table A9. CLEF TAR 2018 lab. Capped allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1auth_run30.850.380.830.350.260.24−532.37
2auth_run10.850.380.830.350.260.24−532.23
3auth_run20.850.380.830.350.260.24−532.23
4UWB0.850.370.820.330.230.24−531.83
5UWA0.840.360.810.310.200.24−533.23
6cnrs_comb0.780.320.760.290.180.22−564.83
7cnrs_RF_bi0.770.320.740.270.150.22−563.30
8shef-feed0.770.550.750.560.190.24−561.10
9cnrs_RF_uni0.750.340.730.270.150.22−565.37
10unipd_t5000.750.340.730.270.130.20−566.30
11unipd_t15000.740.340.720.270.100.20−569.77
12unipd_t10000.740.340.720.270.100.20−570.30
13shef-general0.610.260.590.20−0.000.19−611.70
14shef-query0.570.220.550.170.000.17−622.23
15uic_model70.470.190.450.13−0.040.15−638.10
16uic_model80.460.190.440.13−0.030.14−639.97
17ECNU_RUN30.440.190.420.10−0.050.13−633.23
18ECNU_RUN10.410.170.390.10−0.050.11−644.90
19ECNU_RUN20.320.110.290.05−0.050.11−655.80
Table A10. CLEF TAR 2018 lab. Epsilon-greedy multiarmed bandit. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1auth_run30.890.390.850.370.510.15−483.23
2auth_run20.880.400.850.370.510.15−490.10
3auth_run10.880.400.840.370.510.15−490.03
4UWB0.870.380.830.350.490.15−489.37
5UWA0.840.350.810.330.490.15−499.77
6shef-feed0.830.560.800.590.330.13−487.33
7unipd_t5000.810.350.750.290.230.13−512.27
8cnrs_RF_bi0.800.330.750.290.320.13−514.90
9unipd_t15000.790.350.740.290.170.13−511.53
10cnrs_comb0.790.330.750.310.350.13−525.73
11unipd_t10000.790.350.730.290.170.13−512.63
12cnrs_RF_uni0.790.340.740.290.240.13−519.07
13shef-general0.720.270.670.240.270.10−544.50
14shef-query0.670.230.620.200.260.10−546.57
15ECNU_RUN30.570.190.480.12−0.050.09−565.67
16uic_model70.550.200.490.160.020.08−550.17
17ECNU_RUN10.540.180.460.12−0.040.08−571.93
18uic_model80.540.200.470.150.030.08−551.33
19ECNU_RUN20.430.110.380.07−0.050.07−559.73
Table A11. CLEF TAR 2019 lab. Even allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
12018_stem_original_p10_t10000.800.260.720.240.240.14−193.39
22018_stem_original_p10_t15000.800.260.720.240.240.14−193.39
32018_stem_original_p10_t3000.800.260.720.240.240.14−193.39
42018_stem_original_p10_t4000.800.260.720.240.240.14−193.39
52018_stem_original_p10_t5000.800.260.720.240.240.14−193.39
6distributed_effort_p10_t10000.800.260.720.230.230.14−193.39
7distributed_effort_p10_t15000.800.260.730.240.240.14−193.39
82018_stem_original_p50_t10000.800.260.730.240.190.14−193.42
92018_stem_original_p50_t15000.800.260.730.240.190.14−193.42
102018_stem_original_p50_t3000.800.260.730.240.190.14−193.42
112018_stem_original_p50_t4000.800.260.730.240.190.14−193.42
122018_stem_original_p50_t5000.800.260.730.240.190.14−193.42
132018_stem_original_p10_t2000.800.260.720.230.240.14−193.58
14distributed_effort_p10_t5000.790.260.720.230.190.14−194.16
152018_stem_original_p50_t2000.790.260.730.240.210.14−193.42
16distributed_effort_p10_t4000.790.260.710.230.190.14−194.81
17distributed_effort_p10_t3000.790.260.710.230.190.14−195.32
18distributed_effort_p10_t2000.780.250.700.230.180.14−195.84
19abs-hh-ratio-ilps0.730.460.690.510.180.15−186.71
20distributed_effort_p10_t1000.730.240.620.210.120.13−198.35
212018_stem_original_p50_t1000.720.260.660.230.210.13−197.10
222018_stem_original_p10_t1000.710.260.640.230.180.13−197.65
23abs-th-ratio-ilps0.710.430.670.480.210.14−193.90
24sheffield-baseline0.630.210.560.190.120.10−218.65
25baseline_bm25_t1000.570.180.500.160.110.09−223.10
26baseline_bm25_t2000.560.180.500.160.140.09−223.94
27baseline_bm25_t10000.560.180.490.160.140.09−224.00
28baseline_bm25_t15000.560.180.490.160.140.09−224.00
29baseline_bm25_t3000.560.180.490.160.140.09−224.00
30baseline_bm25_t4000.560.180.490.160.140.09−224.00
31baseline_bm25_t5000.560.180.490.160.140.09−224.00
Table A12. CLEF TAR 2019 lab. Proportional allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
1abs-hh-ratio-ilps0.630.440.620.460.140.29−194.81
22018_stem_original_p10_t5000.620.230.610.180.040.21−206.06
32018_stem_original_p50_t5000.620.230.600.180.040.21−206.13
42018_stem_original_p10_t4000.620.230.600.180.040.21−206.26
52018_stem_original_p50_t4000.620.230.600.180.040.21−206.32
62018_stem_original_p10_t10000.620.230.600.180.040.21−206.32
72018_stem_original_p10_t15000.620.230.600.180.040.21−206.32
8distributed_effort_p10_t10000.620.230.600.180.040.21−206.32
9distributed_effort_p10_t15000.620.230.600.180.040.21−206.32
10distributed_effort_p10_t3000.620.230.600.180.040.21−206.32
11distributed_effort_p10_t4000.620.230.600.180.040.21−206.32
12distributed_effort_p10_t5000.620.230.600.180.040.21−206.32
132018_stem_original_p50_t10000.620.230.600.180.040.21−206.39
142018_stem_original_p50_t15000.620.230.600.180.040.21−206.39
152018_stem_original_p10_t3000.620.230.600.180.010.21−206.45
162018_stem_original_p50_t3000.620.230.600.180.010.21−206.52
172018_stem_original_p10_t2000.610.230.590.180.010.21−207.10
182018_stem_original_p50_t2000.610.230.590.180.010.21−207.16
19distributed_effort_p10_t2000.600.230.580.170.100.20−207.13
20abs-th-ratio-ilps0.590.400.580.420.140.27−202.61
212018_stem_original_p10_t1000.540.240.520.17−0.020.21−209.29
222018_stem_original_p50_t1000.540.240.520.17−0.020.21−209.29
23distributed_effort_p10_t1000.520.220.500.160.010.17−211.13
24sheffield-baseline0.450.190.440.140.010.15−229.68
25baseline_bm25_t1000.410.160.400.110.010.14−232.06
26baseline_bm25_t4000.410.160.390.110.010.14−232.32
27baseline_bm25_t3000.410.160.390.110.010.14−232.77
28baseline_bm25_t2000.410.160.390.110.010.14−232.90
29baseline_bm25_t5000.400.160.390.110.010.14−232.71
30baseline_bm25_t10000.400.160.390.110.010.14−233.10
31baseline_bm25_t15000.400.160.390.110.010.14−233.10
Table A13. CLEF TAR 2019 lab. Inverse proportional allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
12018_stem_original_p10_t10000.780.260.700.230.250.14−196.16
22018_stem_original_p10_t15000.780.260.700.230.250.14−196.16
32018_stem_original_p10_t4000.780.260.700.230.250.14−196.16
42018_stem_original_p10_t5000.780.260.700.230.250.14−196.16
5distributed_effort_p10_t10000.780.260.700.230.240.14−196.16
62018_stem_original_p10_t3000.780.260.700.230.240.14−196.23
7distributed_effort_p10_t15000.780.260.700.230.240.14−196.29
82018_stem_original_p10_t2000.780.260.700.230.250.14−195.71
92018_stem_original_p50_t2000.780.260.710.240.250.14−195.74
10distributed_effort_p10_t5000.770.260.690.230.220.14−196.42
11distributed_effort_p10_t4000.770.260.690.230.210.14−196.48
12distributed_effort_p10_t3000.770.260.690.230.210.14−196.61
132018_stem_original_p50_t10000.770.250.710.240.220.14−196.19
142018_stem_original_p50_t15000.770.250.710.240.220.14−196.19
152018_stem_original_p50_t3000.770.250.710.240.220.14−196.19
162018_stem_original_p50_t4000.770.250.710.240.220.14−196.19
172018_stem_original_p50_t5000.770.250.710.240.220.14−196.19
18distributed_effort_p10_t2000.770.250.680.220.180.14−197.13
19abs-hh-ratio-ilps0.730.460.690.510.190.16−187.23
20distributed_effort_p10_t1000.710.240.600.200.130.13−199.00
212018_stem_original_p50_t1000.710.260.650.230.220.13−198.19
222018_stem_original_p10_t1000.710.260.640.230.200.13−198.42
23abs-th-ratio-ilps0.710.430.660.480.210.14−194.16
24sheffield-baseline0.640.210.560.190.140.10−217.68
25baseline_bm25_t10000.590.180.500.160.150.09−220.90
26baseline_bm25_t15000.590.180.500.160.150.09−220.90
27baseline_bm25_t4000.590.180.500.160.150.09−220.90
28baseline_bm25_t5000.590.180.500.160.150.09−220.90
29baseline_bm25_t3000.590.180.500.160.150.09−220.90
30baseline_bm25_t2000.590.180.500.160.160.09−221.10
31baseline_bm25_t1000.590.180.500.160.130.09−220.58
Table A14. CLEF TAR 2019 lab. Capped allocation of budget. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
12018_stem_original_p50_t5000.700.250.670.210.110.18−202.19
22018_stem_original_p10_t5000.700.260.660.200.110.18−202.42
32018_stem_original_p50_t4000.700.250.670.210.110.18−202.32
42018_stem_original_p10_t4000.700.260.660.200.110.18−202.55
52018_stem_original_p50_t10000.700.250.670.210.110.18−202.45
62018_stem_original_p50_t15000.700.250.670.210.110.18−202.45
72018_stem_original_p10_t10000.700.260.660.200.110.18−202.68
82018_stem_original_p10_t15000.700.260.660.200.110.18−202.68
9distributed_effort_p10_t10000.700.260.660.200.100.18−202.68
10distributed_effort_p10_t15000.700.260.660.200.110.18−202.68
11distributed_effort_p10_t3000.700.260.660.200.100.18−202.42
12distributed_effort_p10_t4000.700.260.660.200.100.18−202.48
13distributed_effort_p10_t5000.700.250.660.200.100.18−202.68
142018_stem_original_p50_t3000.700.250.660.210.080.18−202.52
152018_stem_original_p10_t3000.700.260.650.200.080.18−202.74
162018_stem_original_p50_t2000.690.250.660.210.080.18−203.23
172018_stem_original_p10_t2000.690.260.650.200.080.18−203.45
18abs-hh-ratio-ilps0.680.450.660.490.170.21−191.87
19distributed_effort_p10_t2000.680.250.630.200.070.17−203.26
20abs-th-ratio-ilps0.650.420.640.450.150.20−199.19
212018_stem_original_p50_t1000.620.260.590.200.050.18−205.55
222018_stem_original_p10_t1000.620.260.580.200.050.17−205.77
23distributed_effort_p10_t1000.600.230.560.180.040.16−207.39
24sheffield-baseline0.530.210.500.160.060.12−226.26
25baseline_bm25_t1000.480.180.460.140.030.11−229.23
26baseline_bm25_t3000.480.180.450.140.030.11−229.87
27baseline_bm25_t4000.480.180.450.140.030.11−229.55
28baseline_bm25_t2000.480.180.450.140.030.11−230.06
29baseline_bm25_t5000.480.180.450.140.030.11−230.00
30baseline_bm25_t10000.470.180.450.140.030.11−230.06
31baseline_bm25_t15000.470.180.450.140.030.11−230.06
Table A15. CLEF TAR 2019 lab. Epsilon-greedy multiarmed bandit. The results are ordered decreasingly by recall. Runs that were not included in the official overview are named accordingly.
RunRecallRecall@kAUCMAPWSS@r95RFCU@kUG@B
12018_stem_original_p10_t5000.710.250.640.220.190.13−187.00
22018_stem_original_p10_t10000.700.250.640.220.190.13−185.29
3distributed_effort_p10_t5000.700.250.630.220.180.13−185.94
4distributed_effort_p10_t3000.700.250.630.220.170.13−187.65
5abs-hh-ratio-ilps0.700.460.670.500.210.16−173.35
62018_stem_original_p10_t3000.690.250.630.220.190.13−184.77
72018_stem_original_p50_t4000.690.250.640.230.180.13−184.77
8distributed_effort_p10_t10000.690.250.630.220.180.13−185.52
92018_stem_original_p10_t15000.690.250.630.220.190.13−186.23
102018_stem_original_p10_t4000.690.250.630.220.190.13−185.00
11distributed_effort_p10_t15000.680.250.630.220.190.13−184.10
12abs-th-ratio-ilps0.680.430.650.470.200.14−179.97
132018_stem_original_p10_t2000.680.250.620.220.190.13−188.13
142018_stem_original_p50_t2000.680.250.630.230.180.13−187.42
152018_stem_original_p50_t10000.680.240.630.220.160.13−186.52
162018_stem_original_p50_t3000.660.240.620.220.180.12−184.68
172018_stem_original_p50_t5000.660.230.620.210.160.12−185.87
18distributed_effort_p10_t2000.660.240.590.200.140.12−187.87
19distributed_effort_p10_t4000.650.240.600.210.160.12−188.42
202018_stem_original_p50_t15000.650.220.610.210.140.13−186.52
212018_stem_original_p10_t1000.620.240.570.200.180.12−190.77
22sheffield-baseline0.610.200.540.180.140.10−205.58
232018_stem_original_p50_t1000.610.230.570.200.170.13−192.81
24distributed_effort_p10_t1000.590.220.530.180.110.11−191.06
25baseline_bm25_t1000.520.180.460.160.120.08−207.29
26baseline_bm25_t10000.520.180.460.150.110.08−210.45
27baseline_bm25_t5000.520.180.450.150.120.08−210.58
28baseline_bm25_t2000.510.180.450.150.120.08−206.77
29baseline_bm25_t15000.510.180.450.150.120.08−207.97
30baseline_bm25_t3000.510.180.450.150.120.08−208.65
31baseline_bm25_t4000.510.180.450.150.120.08−207.00

References

  1. Needleman, I.G. A guide to systematic reviews. J. Clin. Periodontol. 2002, 29, 6–9. [Google Scholar] [CrossRef] [PubMed]
  2. Timsina, P.; Liu, J.; El-Gayar, O. Advanced analytics for the automation of medical systematic reviews. Inf. Syst. Front. 2016, 18, 237–252. [Google Scholar] [CrossRef]
  3. Di Nunzio, G.M. Technology Assisted Review Systems: Current and Future Directions. In Proceedings of the 3rd Workshop on Augmented Intelligence for Technology-Assisted Reviews Systems (ALTARS 2024), Glasgow, UK, 28 March 2024; Di Nunzio, G.M., Kanoulas, E., Majumder, P., Eds.; CEUR Workshop Proceedings. ACM: New York, NY, USA, 2024; Volume 3832, ISSN 1613-0073. [Google Scholar]
  4. Cormack, G.V.; Grossman, M.R. Engineering Quality and Reliability in Technology-Assisted Review. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, New York, NY, USA, 17–21 July 2016; SIGIR ’16. pp. 75–84. [Google Scholar] [CrossRef]
  5. Manning, C.; Schutze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999. [Google Scholar]
  6. Jørgensen, L.; Paludan-Müller, A.S.; Laursen, D.R.T.; Savović, J.; Boutron, I.; Sterne, J.A.C.; Higgins, J.P.T.; Hróbjartsson, A. Evaluation of the Cochrane tool for assessing risk of bias in randomized clinical trials: Overview of published comments and analysis of user practice in Cochrane and non-Cochrane reviews. Syst. Rev. 2016, 5, 80. [Google Scholar] [CrossRef] [PubMed]
  7. Koffel, J.B. Use of Recommended Search Strategies in Systematic Reviews and the Impact of Librarian Involvement: A Cross-Sectional Survey of Recent Authors. PLoS ONE 2015, 10, e0125931. [Google Scholar] [CrossRef] [PubMed]
  8. Elliott, J.H.; Synnot, A.; Turner, T.; Simmonds, M.; Akl, E.A.; McDonald, S.; Salanti, G.; Meerpohl, J.; MacLehose, H.; Hilton, J.; et al. Living systematic review: 1. Introduction—The why, what, when, and how. J. Clin. Epidemiol. 2017, 91, 23–30. [Google Scholar] [CrossRef] [PubMed]
  9. Yang, E.; Lewis, D.D.; Frieder, O. On minimizing cost in legal document review workflows. In Proceedings of the 21st ACM Symposium on Document Engineering, New York, NY, USA, 24–27 August 2021; DocEng ’21. pp. 1–10. [Google Scholar] [CrossRef]
  10. Kelly, L.; Suominen, H.; Goeuriot, L.; Neves, M.; Kanoulas, E.; Li, D.; Azzopardi, L.; Spijker, R.; Zuccon, G.; Scells, H.; et al. Overview of the CLEF eHealth Evaluation Lab 2019. In Proceedings of the Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lugano, Switzerland, 9–12 September 2019; Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D.E., Heinatz Bürki, G., Cappellato, L., Ferro, N., Eds.; Springer: Cham, Switzerland, 2019; pp. 322–339. [Google Scholar] [CrossRef]
  11. Suominen, H.; Kelly, L.; Goeuriot, L.; Névéol, A.; Ramadier, L.; Robert, A.; Kanoulas, E.; Spijker, R.; Azzopardi, L.; Li, D.; et al. Overview of the CLEF eHealth Evaluation Lab 2018. In Proceedings of the Experimental IR Meets Multilinguality, Multimodality, and Interaction, Avignon, France, 10–14 September 2018; Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J.Y., Soulier, L., SanJuan, E., Cappellato, L., Ferro, N., Eds.; Springer: Cham, Switzerland, 2018; pp. 286–301. [Google Scholar] [CrossRef]
  12. Goeuriot, L.; Kelly, L.; Suominen, H.; Névéol, A.; Robert, A.; Kanoulas, E.; Spijker, R.; Palotti, J.; Zuccon, G. CLEF 2017 eHealth Evaluation Lab Overview. In Proceedings of the Experimental IR Meets Multilinguality, Multimodality, and Interaction, Dublin, Ireland, 11–14 September 2017; Jones, G.J., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N., Eds.; Springer: Cham, Switzerland, 2017; pp. 291–303. [Google Scholar] [CrossRef]
  13. Voorhees, E.M. On Building Fair and Reusable Test Collections using Bandit Techniques. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, New York, NY, USA, 22–26 October 2018; CIKM ’18. pp. 407–416. [Google Scholar] [CrossRef]
  14. Rahman, M.M.; Kutlu, M.; Lease, M. Constructing Test Collections using Multi-armed Bandits and Active Learning. In Proceedings of the The World Wide Web Conference, New York, NY, USA, 13–17 May 2019; WWW ’19. pp. 3158–3164. [Google Scholar] [CrossRef]
  15. Rodriguez-Diaz, P.; Killian, J.A.; Xu, L.; Suggala, A.S.; Taneja, A.; Tambe, M. Flexible budgets in restless bandits: A primal-dual algorithm for efficient budget allocation. In Proceedings of the Thirty–Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; AAAI’23/IAAI’23/EAAI’23. AAAI Press: Washington, DC, USA, 2023; Volume 37, pp. 12103–12111. [Google Scholar] [CrossRef]
  16. Li, M.; Zhang, J.; Alizadehsani, R.; Pławiak, P. A Multi-Channel Advertising Budget Allocation Using Reinforcement Learning and an Improved Differential Evolution Algorithm. IEEE Access 2024, 12, 100559–100580. [Google Scholar] [CrossRef]
  17. Suyal, H.; Singh, A. Multilabel classification using crowdsourcing under budget constraints. Knowl. Inf. Syst. 2024, 66, 841–877. [Google Scholar] [CrossRef]
  18. Xu, L.; Wang, L.; Xie, H.; Zhou, M. Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications. In Proceedings of the PRICAI 2024: Trends in Artificial Intelligence, Kyoto, Japan, 19 November 2024; Hadfi, R., Anthony, P., Sharma, A., Ito, T., Bai, Q., Eds.; Springer: Singapore, 2025; pp. 132–144. [Google Scholar] [CrossRef]
  19. Stradiotti, L.; Perini, L.; Davis, J. Combining Active Learning and Learning to Reject for Anomaly Detection. In ECAI 2024; IOS Press: Amsterdam, The Netherlands, 2024; pp. 2266–2273. [Google Scholar] [CrossRef]
  20. Yang, Y.; Zhang, J.; Qin, R.; Li, J.; Wang, F.Y.; Qi, W. A Budget Optimization Framework for Search Advertisements Across Markets. IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 2012, 42, 1141–1151. [Google Scholar] [CrossRef]
  21. Tricco, A.C.; Antony, J.; Zarin, W.; Strifler, L.; Ghassemi, M.; Ivory, J.; Perrier, L.; Hutton, B.; Moher, D.; Straus, S.E. A scoping review of rapid review methods. BMC Med. 2015, 13, 224. [Google Scholar] [CrossRef] [PubMed]
  22. O’Halloran, T.; McManus, B.; Harbison, A.; Grossman, M.R.; Cormack, G.V. Comparison of Tools and Methods for Technology-Assisted Review. In Proceedings of the Information Management, Suva, Fiji, 23–25 January 2024; Li, S., Ed.; Springer Nature: Cham, Switzerland, 2024; pp. 106–126. [Google Scholar] [CrossRef]
  23. Cormack, G.V.; Grossman, M.R. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 6–11 July 2014; SIGIR ’14. pp. 153–162. [Google Scholar] [CrossRef]
  24. Wallace, B.C.; Dahabreh, I.J.; Schmid, C.H.; Lau, J.; Trikalinos, T.A. Modernizing the systematic review process to inform comparative effectiveness: Tools and methods. J. Comp. Eff. Res. 2013, 2, 273–282. [Google Scholar] [CrossRef] [PubMed]
  25. Kusa, W.; Lipani, A.; Knoth, P.; Hanbury, A. An analysis of work saved over sampling in the evaluation of automated citation screening in systematic literature reviews. Intell. Syst. Appl. 2023, 18, 200193. [Google Scholar] [CrossRef]
  26. Molinari, A.; Esuli, A.; Sebastiani, F. Improved risk minimization algorithms for technology-assisted review. Intell. Syst. Appl. 2023, 18, 200209. [Google Scholar] [CrossRef]
  27. Di Nunzio, G.M. A Study on a Stopping Strategy for Systematic Reviews Based on a Distributed Effort Approach. In Proceedings of the Experimental IR Meets Multilinguality, Multimodality, and Interaction, Thessaloniki, Greece, 22–25 September 2020; Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N., Eds.; Springer: Cham, Switzerland, 2020; pp. 112–123. [Google Scholar] [CrossRef]
  28. Lewis, D.D.; Yang, E.; Frieder, O. Certifying One-Phase Technology-Assisted Reviews. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, New York, NY, USA, 1–5 November 2021; CIKM ’21. pp. 893–902. [Google Scholar] [CrossRef]
  29. Névéol, A.; Robert, A.; Anderson, R.; Cohen, K.B.; Grouin, C.; Lavergne, T.; Rey, G.; Rondet, C.; Zweigenbaum, P. CLEF eHealth 2017 Multilingual Information Extraction task Overview: ICD10 Coding of Death Certificates in English and French. In Proceedings of the Working Notes of CLEF 2017—Conference and Labs of the Evaluation Forum, Dublin, Ireland, 11–14 September 2017; Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T., Eds.; CEUR Workshop Proceedings. RWTH: Aachen, Germany, 2017; Volume 1866, ISSN 1613-0073. [Google Scholar]
  30. Kanoulas, E.; Li, D.; Azzopardi, L.; Spijker, R. CLEF 2018 Technologically Assisted Reviews in Empirical Medicine Overview. In Proceedings of the Working Notes of CLEF 2018—Conference and Labs of the Evaluation Forum, Avignon, France, 10–14 September 2018; Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L., Eds.; CEUR Workshop Proceedings. RWTH: Aachen, Germany, 2018; Volume 2125, ISSN 1613-0073. [Google Scholar]
  31. Kanoulas, E.; Li, D.; Azzopardi, L.; Spijker, R. CLEF 2019 Technology Assisted Reviews in Empirical Medicine Overview. In Proceedings of the Working Notes of CLEF 2019—Conference and Labs of the Evaluation Forum, Lugano, Switzerland, 9–12 September 2019; Cappellato, L., Ferro, N., Losada, D.E., Müller, H., Eds.; CEUR Workshop Proceedings. RWTH: Aachen, Germany, 2019; Volume 2380, ISSN 1613-0073. [Google Scholar]
  32. Tran-Thanh, L.; Chapman, A.; de Cote, E.M.; Rogers, A.; Jennings, N.R. Epsilon–First Policies for Budget–Limited Multi-Armed Bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; Volume 24, pp. 1211–1216. [Google Scholar] [CrossRef]
  33. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-Time Analysis of the Multiarmed Bandit Problem. Mach. Learn. 2002, 47, 235–256. [Google Scholar] [CrossRef]
  34. Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web, New York, NY, USA, 26–30 April 2010; WWW ’10. pp. 661–670. [Google Scholar] [CrossRef]
  35. Di Nunzio, G. POLAR: Policy Optimization for Literature Analysis under Review Constraints. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, CIKM 2025, Seoul, Republic of Korea, 10–14 November 2025; ACM: New York, NY, USA, 2025. [Google Scholar] [CrossRef]
Figure 1. Boxplots of recall by budget allocation strategy across years.
Figure 2. Boxplots of RFCU by budget allocation strategy across years.
Figure 3. Boxplots of UG@B by budget allocation strategy across years.
Figure 4. Correlation between RFCU and recall across years.
Figure 5. Correlation between UG@B and recall across years.
Figure 6. UG@B for systems A and B under a fixed time budget of B = 100 min. System A reviews (33, 34) true and false positives, while system B reviews (30, 30). The curves show the utility as a function of the gain/loss ratio r. The vertical dashed line marks the crossover point (r ≈ 1.33), where both systems achieve an equal UG@B.
Table 1. Summary of CLEF eHealth TAR collections from 2017 to 2019.
Year | Topics (Test) | Documents | Relevant Documents
2017 | 30 | 117,562 | 1857
2018 | 30 | 218,496 | 3964
2019 | 31 | 82,421 | 1682
Table 2. CLEF TAR 2017 lab. The results shown are the ones recalculated in this experimental work, while the last three columns display the differences between our results and the official results.
No. | Run | AUC | MAP | WSS@95 | AUC_diff | MAP_diff | WSS@95_diff
1 | amc.run | 0.75 | 0.13 | 0.32 | 0.01 | 0.00 | 0.01
2 | auth.simple.run1 | 0.93 | 0.30 | 0.70 | 0.00 | 0.00 | −0.00
3 | auth.simple.run2 | 0.92 | 0.29 | 0.70 | 0.00 | −0.00 | −0.00
4 | auth.simple.run3 | 0.92 | 0.29 | 0.67 | 0.00 | −0.00 | 0.00
5 | auth.simple.run4 | 0.92 | 0.29 | 0.69 | 0.00 | 0.00 | −0.00
6 | cnrs.abrupt.all | 0.73 | 0.14 | 0.24 | 0.01 | 0.00 | 0.01
7 | cnrs.gradual.all | 0.70 | 0.15 | 0.26 | 0.01 | −0.00 | 0.03
8 | cnrs.noaf.all | 0.77 | 0.14 | 0.35 | 0.01 | 0.00 | 0.01
9 | cnrs.noaffull.all | 0.83 | 0.18 | 0.49 | 0.00 | −0.00 | 0.00
10 | ecnu.run1 | 0.62 | 0.09 | 0.11 | 0.01 | 0.00 | 0.01
11 | ntu.run1 | 0.60 | 0.08 | 0.10 | 0.01 | −0.00 | 0.01
12 | ntu.run2 | 0.59 | 0.06 | 0.12 | 0.01 | −0.00 | 0.01
13 | ntu.run3 | 0.53 | 0.05 | 0.07 | 0.01 | −0.00 | 0.00
14 | padua.iafa_m10k150f0m10 | 0.89 | 0.28 | 0.50 | 0.00 | 0.00 | 0.01
15 | padua.iafap_m10p2f0m10 | 0.87 | 0.25 | 0.47 | 0.00 | −0.00 | 0.01
16 | padua.iafap_m10p5f0m10 | 0.88 | 0.27 | 0.48 | 0.00 | −0.00 | 0.01
17 | padua.iafas_m10k50f0m10 | 0.89 | 0.27 | 0.51 | 0.00 | 0.00 | 0.01
18 | qut.ca_bool_ltr | 0.73 | 0.11 | 0.27 | 0.01 | −0.00 | 0.02
19 | qut.ca_pico_ltr | 0.74 | 0.15 | 0.29 | 0.01 | −0.00 | 0.01
20 | qut.rf_bool_ltr | 0.70 | 0.11 | 0.24 | 0.01 | 0.00 | 0.03
21 | qut.rf_pico_ltr | 0.72 | 0.12 | 0.29 | 0.01 | 0.00 | 0.01
22 | sheffield.run1 | 0.81 | 0.17 | 0.42 | 0.00 | −0.00 | −0.00
23 | sheffield.run2 | 0.84 | 0.22 | 0.50 | 0.00 | −0.00 | −0.01
24 | sheffield.run3 | 0.84 | 0.20 | 0.48 | 0.00 | 0.00 | −0.00
25 | sheffield.run4 | 0.84 | 0.22 | 0.49 | 0.00 | 0.00 | −0.01
26 | ucl.run_abstract | 0.50 | 0.06 | 0.06 | 0.01 | 0.00 | 0.01
27 | ucl.run_fulltext | 0.51 | 0.05 | 0.06 | 0.01 | 0.00 | 0.02
28 | uos.sis.TMAL30Q_BM25 | 0.83 | 0.16 | 0.52 | 0.00 | −0.00 | 0.01
29 | uos.sis.TMBEST_BM25 | 0.72 | 0.12 | 0.32 | 0.01 | −0.00 | 0.00
30 | waterloo.A-rank-normal | 0.93 | 0.28 | 0.71 | 0.00 | −0.00 | −0.01
31 | waterloo.B-rank-normal | 0.93 | 0.32 | 0.72 | 0.00 | −0.00 | −0.02
Table 3. CLEF TAR 2018 lab. The results shown are the ones recalculated in this experimental work, together with the differences between our results and the official results. We also show the different values of the official R@k (recall at the number of documents reviewed) and our Recall@k (recall at the number of relevant documents).
No. | Run | AUC | MAP | WSS@95 | MAP_diff | WSS@95_diff | R@k | Recall@k
1 | auth_run1 | 0.95 | 0.40 | 0.77 | 0.00 | −0.02 | 1.00 | 0.40
2 | auth_run2 | 0.95 | 0.40 | 0.77 | 0.00 | −0.02 | 0.94 | 0.40
3 | auth_run3 | 0.95 | 0.39 | 0.76 | −0.00 | −0.03 | 0.94 | 0.39
4 | cnrs_RF_uni | 0.89 | 0.31 | 0.52 | −0.00 | −0.00 | 1.00 | 0.35
5 | cnrs_RF_bi | 0.92 | 0.31 | 0.63 | −0.00 | −0.01 | 1.00 | 0.33
6 | cnrs_comb | 0.93 | 0.34 | 0.68 | 0.00 | −0.02 | 1.00 | 0.34
7 | ECNU_RUN1 | 0.66 | 0.14 | 0.02 | 0.00 | 0.01 | 0.52 | 0.19
8 | ECNU_RUN2 | 0.59 | 0.08 | −0.03 | 0.00 | 0.05 | 0.37 | 0.12
9 | ECNU_RUN3 | 0.68 | 0.15 | 0.02 | 0.00 | 0.01 | 0.53 | 0.20
10 | unipd_t500 | 0.91 | 0.32 | 0.62 | 0.00 | −0.01 | 0.86 | 0.36
11 | unipd_t1000 | 0.90 | 0.32 | 0.58 | −0.00 | −0.00 | 0.92 | 0.36
12 | unipd_t1500 | 0.89 | 0.32 | 0.55 | −0.00 | −0.00 | 0.94 | 0.36
13 | shef-feed | 0.92 | 0.61 | 0.63 | −0.00 | 0.01 | 1.00 | 0.56
14 | shef-general | 0.87 | 0.26 | 0.55 | −0.00 | −0.00 | 1.00 | 0.28
15 | shef-query | 0.85 | 0.22 | 0.51 | −0.00 | −0.01 | 1.00 | 0.24
16 | uic_model8 | 0.73 | 0.17 | 0.22 | −0.00 | 0.04 | 0.51 | 0.20
17 | uic_model7 | 0.74 | 0.18 | 0.23 | 0.00 | 0.03 | 0.58 | 0.21
18 | UWA | 0.94 | 0.36 | 0.76 | −0.00 | −0.01 | 0.99 | 0.37
19 | UWB | 0.95 | 0.38 | 0.77 | 0.00 | −0.01 | 0.93 | 0.39
Table 4. CLEF TAR 2019 lab. The results shown are the ones recalculated in this experimental work. We do not show any differences compared with the official results since, in the original evaluation campaign, the runs were split among the different types of reviews.
No. | Run | AUC | MAP | WSS@95 | Recall@k
1 | 2018_stem_original_p10_t100 | 0.82 | 0.25 | 0.51 | 0.26
2 | 2018_stem_original_p10_t1000 | 0.86 | 0.25 | 0.60 | 0.26
3 | 2018_stem_original_p10_t1500 | 0.86 | 0.25 | 0.60 | 0.26
4 | 2018_stem_original_p10_t200 | 0.86 | 0.25 | 0.59 | 0.26
5 | 2018_stem_original_p10_t300 | 0.86 | 0.25 | 0.59 | 0.26
6 | 2018_stem_original_p10_t400 | 0.86 | 0.25 | 0.60 | 0.26
7 | 2018_stem_original_p10_t500 | 0.86 | 0.25 | 0.60 | 0.26
8 | 2018_stem_original_p50_t100 | 0.83 | 0.25 | 0.53 | 0.26
9 | 2018_stem_original_p50_t1000 | 0.88 | 0.26 | 0.61 | 0.26
10 | 2018_stem_original_p50_t1500 | 0.88 | 0.26 | 0.61 | 0.26
11 | 2018_stem_original_p50_t200 | 0.87 | 0.26 | 0.58 | 0.26
12 | 2018_stem_original_p50_t300 | 0.87 | 0.26 | 0.59 | 0.26
13 | 2018_stem_original_p50_t400 | 0.87 | 0.26 | 0.61 | 0.26
14 | 2018_stem_original_p50_t500 | 0.87 | 0.26 | 0.61 | 0.26
15 | abs-hh-ratio-ilps | 0.85 | 0.53 | 0.48 | 0.46
16 | abs-th-ratio-ilps | 0.82 | 0.49 | 0.46 | 0.43
17 | baseline_bm25_t100 | 0.77 | 0.18 | 0.40 | 0.18
18 | baseline_bm25_t1000 | 0.76 | 0.18 | 0.38 | 0.18
19 | baseline_bm25_t1500 | 0.76 | 0.18 | 0.39 | 0.18
20 | baseline_bm25_t200 | 0.77 | 0.18 | 0.40 | 0.18
21 | baseline_bm25_t300 | 0.76 | 0.18 | 0.39 | 0.18
22 | baseline_bm25_t400 | 0.76 | 0.18 | 0.39 | 0.18
23 | baseline_bm25_t500 | 0.76 | 0.18 | 0.39 | 0.18
24 | distributed_effort_p10_t100 | 0.79 | 0.23 | 0.46 | 0.24
25 | distributed_effort_p10_t1000 | 0.86 | 0.25 | 0.59 | 0.26
26 | distributed_effort_p10_t1500 | 0.86 | 0.25 | 0.60 | 0.26
27 | distributed_effort_p10_t200 | 0.85 | 0.25 | 0.55 | 0.26
28 | distributed_effort_p10_t300 | 0.86 | 0.25 | 0.57 | 0.26
29 | distributed_effort_p10_t400 | 0.86 | 0.25 | 0.57 | 0.26
30 | distributed_effort_p10_t500 | 0.86 | 0.25 | 0.58 | 0.26
31 | sheffield-baseline | 0.80 | 0.21 | 0.40 | 0.21
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
