Building High-Quality Datasets for Information Retrieval Evaluation at a Reduced Cost

Abstract: Information Retrieval is no longer exclusively about document ranking; new tasks are continuously proposed in this and sibling fields. With this proliferation of tasks, it becomes crucial to have a cheap way of constructing test collections to evaluate new developments. Building test collections is time and resource consuming: it takes time to obtain the documents and to define the user needs, and it requires the assessors to judge a large number of documents. To reduce the latter, pooling strategies aim to decrease the assessment effort by presenting to the assessors a sample of the corpus containing as many relevant documents as possible. In this paper, we propose the preliminary design of different techniques to easily and cheaply build high-quality test collections without the need for participant systems.


Introduction
In Information Retrieval, test collections are the most widespread technique to evaluate the effectiveness of new developments [1]. These collections are formed by the document set, the information needs (topics) and the human judgments [2]. They are complex to construct because of the need for human work to obtain the judgments [3,4]. General-purpose datasets like TREC (https://trec.nist.gov), NTCIR (http://research.nii.ac.jp/ntcir) and CLEF (http://www.clef-initiative.eu) are useful, but sometimes research teams need to build their own collections for a specific task [5].
Pooling methods allow building larger datasets with less effort [6]. When using a pooling approach, only a subset of the whole document set, the pool, is assessed for relevance. The pool is built by taking the union of the top k documents retrieved by each participant system, the runs. In TREC competitions these pools are built from the runs sent by the competition participants, who execute their algorithms on the original dataset and send back their results [2]. Historically, TREC applied the most basic pooling approach (DocID) [2], but recent publications [7,8] have shown that it is possible to reduce the assessors' work without harming the quality of the obtained dataset. In particular, in the TREC Common Core Track [9] NIST applied these techniques for the first time. The drawback of these techniques is that they are tied to having participant systems, a condition that is not always met.
In some cases it may be necessary to obtain collections prior to the competition. In these cases, it is not possible to use approaches that require participants, as in the CLEF eRisk competition (http://erisk.irlab.org) [10-12], where training data is released before the competition starts.
We propose a method to build the pool before having participant systems. Here, the role of the runs is played by different query variants and out-of-the-box retrieval strategies: the top k documents from the runs produced by multiple combinations of query variants and retrieval strategies are used to build the pool.
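A minimal sketch of this idea is shown below. It assumes a hypothetical `search(model, query, k)` helper that returns a ranked list of document identifiers for a given retrieval model and query; the helper and its signature are assumptions for illustration, not part of the original text.

```python
def build_pool(models, query_variants, k, search):
    """Build the pool as the union of the top-k documents
    retrieved by every (model, query variant) combination.

    `search(model, query, k)` is a hypothetical function that
    returns a ranked list of document identifiers.
    """
    pool = set()
    for model in models:
        for query in query_variants:
            # Each (model, variant) pair plays the role of one run.
            pool.update(search(model, query, k))
    return pool
```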

Experiments
We performed a series of experiments to preliminarily compare the effectiveness of different pooling approaches. In particular, we want to test whether the use of query variants is adequate.

Systems and Query Variants
We use four different retrieval models: BM25, TF-IDF, LM Jelinek-Mercer and LM Dirichlet. We want to test the effectiveness of this approach when only a few different systems are available.
To combine with the described models, we build a series of query variants from the original query. With the model × query variant combinations, we can obtain larger pools, ideally containing more relevant documents. To build a query variant, we combine the original query with one of the five terms from the topic description with the highest IDF, i.e., the most specific terms.
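The following sketch illustrates one way this variant construction could be implemented. It assumes a pre-tokenized topic description and a `doc_freq` mapping from terms to their collection document frequencies; both inputs are assumptions made for the example.

```python
import math

def top_idf_terms(description_terms, doc_freq, num_docs, n=5):
    """Select the n description terms with the highest IDF,
    i.e., the most specific ones. `doc_freq` maps a term to its
    document frequency in the collection (assumed input)."""
    idf = {t: math.log(num_docs / (1 + doc_freq.get(t, 0)))
           for t in set(description_terms)}
    return sorted(idf, key=idf.get, reverse=True)[:n]

def query_variants(original_query, description_terms, doc_freq, num_docs):
    """Build one variant per selected term: the original query
    plus one high-IDF term from the topic description."""
    terms = top_idf_terms(description_terms, doc_freq, num_docs)
    return [f"{original_query} {t}" for t in terms]
```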
Combining these variants with the systems, we obtain a number of different runs equal to no. systems (4) × no. variants (5) = 20.

Pooling Algorithms
To perform these experiments we use two pooling algorithms. The first one is the traditional pooling strategy used in TREC competitions [2], i.e., DocID. The second one, DocPoolFreq, is a simple adaptation of the former, where we order the documents by the number of times they appear in the pool and, in case of ties, by DocID. This is based on the intuition, shared by many more complex pooling algorithms [13], that a document retrieved by more systems is more likely to be relevant than one retrieved by fewer.
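A minimal sketch of the two orderings, assuming each run is simply a ranked list of DocID strings, could look as follows; the function names are illustrative, not from the original paper.

```python
from collections import Counter

def docid_order(runs, k):
    """DocID pooling: the pooled documents (union of each run's
    top-k) are presented to assessors in lexicographic DocID order."""
    pool = set()
    for run in runs:
        pool.update(run[:k])
    return sorted(pool)

def docpoolfreq_order(runs, k):
    """DocPoolFreq: order by the number of runs that contribute the
    document to the pool (descending), breaking ties by DocID."""
    counts = Counter()
    for run in runs:
        counts.update(set(run[:k]))
    return sorted(counts, key=lambda d: (-counts[d], d))
```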

Results
We performed these experiments using the TREC-5 dataset (disks 2 & 4, topics 251-300). The results can be observed in Figure 1. When it comes to finding more relevant documents, we can observe that the approaches that use the query variants outperform the other two. This is because when using the variants we have more runs, which results in having more relevant documents in the pool. We can also observe that DocPoolFreq outperforms DocID, as it finds relevant documents earlier in the process. These results confirm that with only four models and query variants it is possible to obtain 40% of the relevant documents found in TREC-5, where 61 systems were used to build the pool.

Discussion
Results show that our research direction is promising. We also open a line of investigation: comparing the quality of the datasets built with a participant-based approach against techniques that do not need participant systems, like the one presented in this paper.