A Study of a Gain Based Approach for Query Aspects in Recall Oriented Tasks

Evidence-based healthcare integrates the best research evidence with clinical expertise in order to make decisions based on the best available practices. In this context, collecting all the relevant information, a recall-oriented task, in order to make the right decision within a reasonable time frame has become an important issue. In this paper, we investigate the problem of building effective Consumer Health Search (CHS) systems that use query variations to achieve high recall and fulfill the information needs of health consumers. In particular, we study an intent-aware gain metric used to estimate the amount of missing information and to predict the achievable recall of each query reformulation during a search session. We evaluate and propose alternative formulations of this metric using the standard test collection of the CLEF 2018 eHealth Evaluation Lab CHS task.


Introduction
The study of query representation in Information Retrieval has drawn a lot of interest in recent years [1][2][3][4][5]. Several earlier works [6][7][8] showed the positive effect on retrieval results of fusing runs retrieved with multiple human-made formulations of the same information need. Recent studies have shown that query reformulations automatically extracted from query logs can be as effective as those manually created by users [9]. Furthermore, the performance of a system can greatly improve when the "right" formulation of an information need is selected [4,10]. One of the main challenges in this research area is suggesting the best performing query (or queries) among the possible variations [4,[10][11][12]. For example, Thomas et al. [4] observed that the most prominent effect in predicting the performance of a query formulation is due to the information need and not to the "query wording". In this sense, query performance predictors actually predict the complexity of the information need, rather than that of the query itself. Zendel et al. [10] pursue a slightly different task: following the literature on reference lists [13,14], they try to predict the performance of a query using information about other queries representing the same information need. Benham et al. [15] define a fusion approach for multiple query formulations based on the concept of "topic centroid", which describes the information need as a combination of its formulations. Dang et al. [11] also address the problem of improving ranking results through a query formulation selection phase. Notably, they show that they are often capable of placing the best query in the first two positions, further evidence of the complexity of the task.
A use case of query performance prediction is the systematic compilation of literature reviews. Systematic reviews are scientific investigations that use strategies to include a comprehensive search of all potentially relevant articles. As time and resources for compiling a systematic review are limited, the search must be bounded: for example, one may want to estimate how far the horizon of the search should extend (i.e., over all possible cases/documents that could exist in the literature) in order to stop before the resources are exhausted [16]. Scells et al. [12] apply several state-of-the-art Query Performance Predictors to select the best query in the systematic review domain, and show that current Query Performance Prediction approaches perform poorly on this specific task. International evaluation campaigns have organized labs to study this problem in terms of the evaluation, through controlled simulation, of methods designed to achieve very high recall [17,18]. The CLEF initiative has promoted the eHealth track since 2013, and the CLEF 2018 eHealth Evaluation Lab Consumer Health Search (CHS) task [19] investigated the problem of building search engines that are robust to query variations to support the information needs of health consumers.
In this paper, we study an alternative formulation of the intent-aware metric proposed by Umemoto et al. [20], in which the authors analyze a metric to estimate the amount of missing information for each query reformulation during a search session. Note that in [20] the authors do not propose an approach capable of predicting the recall of different formulations. Nevertheless, our perception is that their approach can be easily adapted, with good results, to the predictive task as well. Our research goal is to understand whether a gain-based approach can be used to predict the relative importance of each reformulation in terms of recall performance, in the context of Consumer Health Search, where users need support for medical information needs.
In this sense, with respect to [20], our contribution is two-fold:

• we show that it is possible to apply the GAIN measure proposed in [20] to obtain a recall predictor over a set of formulations for the same topic;

• we furthermore show how to improve the results of such a predictor by also exploiting the information obtained through the various formulations.
The paper is organized as follows: in Sec. 2, we present the original gain metric, while in Sec. 3 we define our alternative version to predict performance in a recall-oriented fashion. In Sec. 4, we discuss the experimental analysis and results, while in Sec. 5 we give our final remarks.

A GAIN-based Approach
In [20], Umemoto et al. define the intent-aware gain metric and the requirements that it should satisfy. They identify the following properties: importance, documents relevant to a central aspect of the search topic produce higher gain than those relevant to a peripheral one; relevance, highly relevant documents produce higher gain than partially relevant ones; novelty, documents relevant to an unexplored aspect produce higher gain than those relevant to a fully explored aspect.
The set of aspects A_t of a topic t is estimated through the process described in [21]: first, a set of subtopics S_t is mined given a topic t; then, the subtopics are grouped into a set of clusters C_t. These clusters are regarded as the "facets" of t. The most representative subtopic s is chosen from each cluster C_a as the formulation of the topic aspect a, using a = argmax_{s ∈ C_a} Imp_t(s), where the importance of a subtopic s is defined as:

Imp_t(s) = Σ_{d ∈ D^N_s ∩ D^N_t} 1 / Rank_t(d)    (1)

where D^N_s and D^N_t denote the sets of the top N retrieved documents for a subtopic s and the topic t, respectively, and Rank_t(d) is the rank of the document d in the ranked list for t.
It is crucial to stress that the definitions of importance, and of the gain that follows, derive from the assumption that there is a known "reference" topic t that completely describes the information need, for which the retrieved documents can differ from those observed for a query representing just one aspect a of the topic. In Fig. 1, we show an example of a number of subtopics found for a topic t and grouped into three clusters, each one with a representative aspect. The Intent-Aware Gain is defined for a set of documents D as:

GainIA_t(D) = Σ_{a ∈ A_t} P(a|t) · Gain_{t,a}(D)    (2)

which is a sort of expected value of the gain across the different aspects. P(a|t) is the probability that an aspect a is important to the topic t, and Gain_{t,a}(D) is the gain that can be obtained for the aspect a from the documents D. The importance probability for an aspect of a topic is computed as:

P(a|t) = Imp_t(a) / Σ_{a' ∈ A_t} Imp_t(a')    (3)

while the gain, which measures how the documents D retrieved for a query contribute to increment the information relative to a specific aspect of the topic, is:

Gain_{t,a}(D) = 1 − ∏_{d ∈ D} (1 − Rel_{t,a}(d))    (4)

The last term required to compute the Intent-Aware Gain is Rel_{t,a}(d), the relevance degree of a document d with respect to an aspect a, estimated as follows:

Rel_{t,a}(d) = Σ_{s ∈ C_a} Imp_t(s) · Rel_s(d) / Σ_{s ∈ C_a} Imp_t(s)    (5)

where C_a ∈ C_t is the cluster of subtopics belonging to the aspect a, and Rel_s(d) is the relevance degree of a document d to a subtopic s, estimated as Rel_s(d) = 1 / Rank_s(d).
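To make the definitions above concrete, the following sketch computes the intent-aware gain from ranked lists of document ids. It is a minimal illustration under our reading of Eqs. 1-5 (all function and variable names are our own; the importance of an aspect is taken as that of its highest-importance subtopic, following the argmax above):

```python
# Minimal sketch of the intent-aware GAIN (Eqs. 1-5); names are illustrative.

def rank(doc, ranking):
    """1-based rank of doc in a ranked list, or None if not retrieved."""
    return ranking.index(doc) + 1 if doc in ranking else None

def imp(s_ranking, t_ranking, n=100):
    """Eq. 1: Imp_t(s), summing 1/Rank_t(d) over docs in both top-n lists."""
    shared = set(s_ranking[:n]) & set(t_ranking[:n])
    return sum(1.0 / rank(d, t_ranking) for d in shared)

def rel_aspect(doc, cluster, t_ranking, rankings, n=100):
    """Eq. 5: importance-weighted average of Rel_s(d) = 1/Rank_s(d)."""
    num = den = 0.0
    for s in cluster:
        w = imp(rankings[s], t_ranking, n)
        r = rank(doc, rankings[s])
        num += w * (1.0 / r if r else 0.0)
        den += w
    return num / den if den > 0 else 0.0

def gain_ia(docs, clusters, t_ranking, rankings, n=100):
    """Eq. 2: sum over aspects of P(a|t) * Gain_{t,a}(D) (Eqs. 3-4)."""
    # Importance of each aspect = importance of its representative subtopic.
    imps = {a: max(imp(rankings[s], t_ranking, n) for s in c)
            for a, c in clusters.items()}
    total = sum(imps.values()) or 1.0
    g = 0.0
    for a, cluster in clusters.items():
        prod = 1.0  # product form of Eq. 4: 1 - prod(1 - Rel)
        for d in docs:
            prod *= 1.0 - rel_aspect(d, cluster, t_ranking, rankings, n)
        g += (imps[a] / total) * (1.0 - prod)
    return g
```

Adding a document that covers a previously unexplored aspect raises the gain, which is exactly the novelty property listed above.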

A Gain for Query Reformulations
Our initial hypothesis in this work is that: a) we have one information need expressed with different query reformulations, and b) the topic t is unknown. In particular, given an information need i and its set of reformulations V_i, we assume that each reformulation q ∈ V_i is able to 'reveal' different facets of i. Consequently, we need to redefine the expression of the gain of Eq. 4 as:

Gain_i(D_q) = 1 − ∏_{d ∈ D_q} (1 − Rel_{i,q}(d))    (6)

where i is the information need and q a specific (re)formulation. The main differences with respect to the original approach, apart from the change of variable names, are that i) we do not have a 'reference' topic t that completely describes the information need i, and ii) we have one single cluster of query reformulations, or variants, V_i. For these reasons, we also need an alternative definition of relevance that adapts to our case study:

Rel_{i,q}(d) = Σ_{s ∈ V_i} Imp_q(s) · Rel_s(d) / Σ_{s ∈ V_i} Imp_q(s)    (7)

where the relevance of d, retrieved by the query variant q of the information need i, is computed as the weighted average of the relevance of d with respect to all the alternative reformulations in V_i. The two terms Imp_q(s) and Rel_s(d) remain unaltered compared to the previous definitions.
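Under the same reading, the reformulation gain of Eqs. 6-7 can be sketched as follows; each variant's run is a ranked list of doc ids, and all names are our own illustrative choices:

```python
# Sketch of the reformulation gain (Eqs. 6-7); names are illustrative.

def rel_s(doc, ranking):
    """Rel_s(d) = 1/Rank_s(d), or 0 if d was not retrieved by s."""
    return 1.0 / (ranking.index(doc) + 1) if doc in ranking else 0.0

def imp_q(s_ranking, q_ranking, n=100):
    """Importance of variant s w.r.t. q, using q's ranking as reference."""
    shared = set(s_ranking[:n]) & set(q_ranking[:n])
    return sum(1.0 / (q_ranking.index(d) + 1) for d in shared)

def rel_iq(doc, q, variants, n=100):
    """Eq. 7: weighted average relevance of d over all variants in V_i."""
    num = den = 0.0
    for s_ranking in variants.values():
        w = imp_q(s_ranking, variants[q], n)
        num += w * rel_s(doc, s_ranking)
        den += w
    return num / den if den > 0 else 0.0

def gain_i(q, variants, n=100):
    """Eq. 6: 1 - prod over docs retrieved by q of (1 - Rel_{i,q}(d))."""
    prod = 1.0
    for d in variants[q][:n]:
        prod *= 1.0 - rel_iq(d, q, variants, n)
    return 1.0 - prod
```

Note how a variant sharing no retrieved documents with the others degenerates: its relevance weights come from its own ranking only, which anticipates the caveat discussed in Sec. 4.1.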

A Similarity Matrix for Recall Prediction
In the proposed context, we can think of an 'optimal' query as the one capable of combining all the diverse facets of the information need it represents. In order to estimate which query reformulation q is the closest to the unknown optimal one, we propose the following procedure:

1. we define D_q as the set of documents retrieved by q;

2. D_i = ∪_{q ∈ V_i} D_q as the set of all documents retrieved by at least one reformulation q;

3. R ∈ R^{|V_i| × |D_i|} as the matrix of rankings for the information need i, where each row corresponds to a specific reformulation and each column to a document. The value of an element r_{q,d} of R is defined as |D_q| − ρ_{q,d}, where ρ_{q,d} is the rank of document d in the list retrieved by q. Finally, each row of R is ℓ2-normalized.
At this point, we build a similarity matrix to predict the impact, in terms of recall, that each reformulation will have on the retrieval. We compute the cosine similarity between each pair of rows of R, obtaining a symmetric matrix S where each row (or column) represents how similar a reformulation is to the others. We use the sum of the k-th row (or column) of S to predict the importance of the k-th query; then, we order the query reformulations by decreasing score, where greater values indicate a higher probability of retrieving more relevant documents. This measure describes how close each query is to the ideal "centroid" query that perfectly describes the topic.
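The procedure above can be sketched with NumPy in a few lines (a minimal illustration; the function names are ours):

```python
import numpy as np

def rank_matrix(runs):
    """Build R: one row per reformulation, one column per retrieved doc."""
    docs = sorted({d for ranking in runs.values() for d in ranking})
    col = {d: j for j, d in enumerate(docs)}
    R = np.zeros((len(runs), len(docs)))
    for k, q in enumerate(sorted(runs)):
        for rho, d in enumerate(runs[q], start=1):
            R[k, col[d]] = len(runs[q]) - rho  # r_{q,d} = |D_q| - rho_{q,d}
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    return R / np.where(norms == 0, 1.0, norms)  # l2-normalize each row

def predict_order(runs):
    """Order reformulations by their summed cosine similarity to the others."""
    R = rank_matrix(runs)
    S = R @ R.T                # cosine similarities (rows are unit-norm)
    scores = S.sum(axis=1)     # row sums: closeness to the "centroid" query
    names = sorted(runs)
    return [names[i] for i in np.argsort(-scores)]
```

Reformulations whose rankings agree with the rest of the pool rise to the top, while outliers that share no documents with their siblings sink to the bottom.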

Experiments and Analysis
In this section, we describe the analysis of our experiments. In particular, we want to compare the performance, in terms of predicted recall, of: i) the gain defined in Eq. 6; ii) an alternative definition that mitigates some arithmetical issues; iii) the similarity matrix.

Test Collection and Retrieval Model
The CLEF 2018 eHealth Evaluation Lab Consumer Health Search (CHS) task [19] investigated the problem of retrieving Web pages to support the information needs of health consumers confronted with a health problem or a medical condition. One subtask (i.e., subtask 3) of this lab aimed to foster research into building search systems that are robust to query variations (https://sites.google.com/view/clef-ehealth-2018/task-3-consumer-health-search).

Queries. There are 50 information needs, each with 7 query reformulations, for a total of 350 queries: the original 50 queries issued by the general public, augmented with 6 query variations issued individually by 6 research students with no medical knowledge.

Collection. The collection contains 5,535,120 Web pages and was created by compiling Web pages of selected domains acquired from CommonCrawl [19].

Relevance Assessments. For each information need, the organizers of the task provided about 500 assessed documents, for a total of 25,000 topic-document pairs.

Retrieval Model. The index provided by the organizers of the task, an ElasticSearch index version 5.1.1, comes with a standard BM25 model with parameters b = 0.75 and k1 = 1.2.

Caveat. Queries 160006 and 164007 do not retrieve any document in common with the other variants of the same information need (at least for N ≤ 1000). The content of query 160006 is only "nan", while query 164007 contains a typo, "pros and cons spirculina" instead of spirulina, a type of algae. For those queries, it is not possible to compute the value of the gain, by definition.
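For reference, the ranking function can be reproduced outside the provided index in a few lines. This is a pure-Python sketch of BM25 with the task's default parameters (k1 = 1.2, b = 0.75), not the actual ElasticSearch/Lucene implementation, whose idf and length-norm encoding differ in minor details:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a bag of terms) against a query over a tiny corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue  # terms absent from the collection contribute nothing
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(t)
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```

A misspelled term such as "spirculina" has zero document frequency and contributes nothing to any score, which is consistent with query 164007 sharing no retrieved documents with its sibling variants.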

Traditional Query Performance Predictors Applied to Recall Prediction for Systematic Reviews
To get a better grasp of the peculiarities of the problem, we first try to apply traditional Query Performance Prediction (QPP) techniques to our specific setting. More in detail, we select a set of well-known QPP models in order to determine whether they can be satisfactorily applied to recall prediction with the documents and queries at hand. Traditionally, Query Performance Predictors are divided into two macro-categories, according to the information they exploit to formulate the prediction: pre-retrieval predictors, which analyze query and corpus statistics prior to retrieval [22][23][24][25][26][27], and post-retrieval predictors, which also analyze the retrieval results [13,[28][29][30][31][32][33][34][35]. Even though pre-retrieval predictors have the advantage of being faster, since they do not need to retrieve the documents for a certain run, post-retrieval predictors typically perform better. Table 1 reports the predictors included in our analyses, together with a brief description of how they work. It is important to notice that, as with many QPP models, the models we selected do not actually predict the performance measure itself. They associate a score with each query, which is expected to correlate with the performance measure, but is on a different scale and cannot be used directly as an estimate of the performance.
The traditional strategy to evaluate how good a query performance predictor is consists in computing a standard retrieval performance measure, such as Average Precision (AP), for each query, and determining how much such a measure correlates with the prediction scores computed by the QPP model [22][23][24][25]27,[31][32][33][34]37,40,41]. Notice that there are two main aspects that might impair traditional QPP models in our specific setting:

• we do not need to estimate AP, which is a precision-based measure; our aim is to predict which query will have the best recall;

• we do not compare queries meant for different information needs, which is the typical evaluation scenario for QPP models; instead, we aim at understanding which one, among a set of queries representing the same information need, achieves the best result.
To determine whether we are impaired by the first problem, we apply the traditional QPP models considering only the default formulation of each topic, and we check whether the predictors are capable of correctly determining the inter-topic performance. More in detail, with this first experiment we are interested in understanding whether the baseline predictors can predict which topic will have the best recall, using a single formulation for each of them. Table 2 reports the results of this analysis. We can observe that the results are in line with previous similar experiments in the literature, such as [5,42]. Almost all the predictors achieve a significant correlation with the recall (at level α = 0.01). Two noticeable exceptions are nqc and smv: traditionally, they are considered among the best predictors, but in this specific scenario they fail, with correlations not statistically different from 0.

Table 1: Pre- and post-retrieval predictive baseline models considered.

pre-retrieval:
- max-idf [26]: considers the maximum value of the idf over the query terms;
- mean-idf [36]: computes the mean value of the idf over the query terms;
- std-idf [36]: uses the standard deviation of the idf over the query terms;
- sum-scq [27]: measures similarity to the corpus based on cf.idf, summed over the query terms;
- mean-scq [27]: relies on the same value as sum-scq, but normalizes it by the length of the query;
- max-scq [27]: relies on the same value as sum-scq, but considers only the maximum value.

post-retrieval:
- wig [37]: difference between the mean retrieval score of the top documents and the score of the entire corpus;
- nqc [38]: standard deviation of the top document scores in the retrieval list, scaled by the score of the entire corpus;
- smv [39]: computes the prediction considering both the standard deviation and the magnitude of the retrieval scores.

Our hypothesis is that, while pre-retrieval predictors tend to be estimators of the recall base of a query, and therefore tend to correlate with the recall itself, post-retrieval predictors tend to compute their predictions based on the scores that the retrieval model assigns to the top-ranked documents. In this sense, post-retrieval predictors are "top-heavy": they focus on the upper part of the ranked list of documents. This behaviour favours predicting the performance for top-heavy measures, such as Average Precision or nDCG. Instead, our task consists in predicting the recall, given a long list of documents. It is not unlikely that the upper part of the list of retrieved documents is saturated with relevant ones; nevertheless, we are more interested in being sure that every relevant document has been considered, rather than in saying whether the top part of the ranked list contains relevant documents.

We now switch the focus from predicting the performance across topics to predicting the performance within topics. Instead of comparing the performance that the standard formulation is expected to achieve for each topic, we try to sort the different formulations of the same topic according to their predicted performance. Table 3 reports the results of our analysis. Compared to the results observed in Table 2, the performance achieved by traditional predictors for the "within"-topics prediction is much lower, with very few cases of significantly positive correlation between the predicted and observed recall.
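As an illustration, the idf-based pre-retrieval predictors of Table 1 reduce to a few lines of Python. This is a sketch under our own naming: `df` maps terms to document frequencies and `n_docs` is the collection size.

```python
import math
import statistics

def idf(term, df, n_docs):
    """Inverse document frequency; 0 for terms absent from the collection."""
    return math.log(n_docs / df[term]) if df.get(term, 0) > 0 else 0.0

def idf_predictors(query_terms, df, n_docs):
    """max-idf, mean-idf, and std-idf over the query terms (cf. Table 1)."""
    idfs = [idf(t, df, n_docs) for t in query_terms]
    return {
        "max-idf": max(idfs),
        "mean-idf": statistics.mean(idfs),
        "std-idf": statistics.pstdev(idfs),
    }
```

A rare term dominates max-idf, which is why such predictors act as rough estimators of how selective (and thus how recall-limited) a query is.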
Note that, even though we agree with [12] that predicting the best query among a series of formulations of a topic is a hard task, we end up with diametrically opposite conclusions. Scells et al. [12] observed severe flaws in traditional QPP techniques when predicting the performance across topics, while they found the task of predicting the performance within topics (which they refer to as Query Variation Performance Prediction, QVPP) to be easier, achieving higher (although still very low) results. We observe the opposite: the worst results when predicting performance within topics, and performance in line with previous literature for predictions across topics.

Analysis of the Results
Given what we observed in Subsection 4.2, we are interested in understanding whether the GAIN-based approach proposed by [20] (cf. Eq. 4) can overcome the problems shown by traditional QPP models in this specific setting. The results are shown in Figures 2a, 2d, and 2h. Each figure is divided into two parts: on top, we show the distribution of the values of the GAIN (or similarity), ordered increasingly, for each query reformulation (350 in total); on the bottom, we plot for each topic (50 topics) the Kendall τ correlation between the query reformulations ordered by decreasing GAIN (or similarity) and the reformulations ordered by decreasing true recall. Blue dots indicate a statistically significant correlation greater (or lower) than zero, while black dots mark the topics for which it is not possible to compute the correlation.

Saturated GAIN distribution
In Figures 2a, 2d, and 2h, we show that the value of the gain saturates to 1 for most query reformulations. This is more evident when we increase the number of documents N of Eq. 6 from N = 100 up to N = 10000. This behavior, due to the product in Eq. 6 that multiplies N factors smaller than one, makes the GAIN not useful to discriminate among the different query variants of an information need, since every variant will have a gain equal to 1. In addition, when all the reformulations have the same gain, it is impossible to compute the Kendall τ correlation to predict the performance (black dots with correlation value 0 in the figure). Not saturating is not by itself a desirable feature for a gain measure; nevertheless, the faster the gain saturates, the harder it is to discriminate between different formulations. In this sense, a GAIN measure capable of better spreading the values over the entire domain is preferable.
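The saturation is easy to reproduce numerically: with a product aggregation, even modest per-document relevance values drive the gain to 1 long before N = 10000 (the relevance values below are illustrative, not taken from our runs):

```python
def product_gain(rels):
    """Product form of the gain: 1 - prod(1 - rel) over the documents."""
    prod = 1.0
    for r in rels:
        prod *= 1.0 - r
    return 1.0 - prod

# 100 vs 10000 documents, each with a small relevance degree of 0.05.
g_100 = product_gain([0.05] * 100)
g_10000 = product_gain([0.05] * 10000)
```

Already at N = 100 the gain exceeds 0.99; at N = 10000 it is numerically indistinguishable from 1, so every variant collapses to the same value.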

Alternative GAIN Definition
In order to mitigate the aforementioned problems, we propose an alternative definition of the gain of Eq. 6, substituting the product with an average:

Gain_i(D_q) = (1 / |D_q|) Σ_{d ∈ D_q} Rel_{i,q}(d)    (8)

The results of this new formulation are shown in Figures 2b, 2e, and 2g. The distribution of the gain is more spread across all the reformulations and does not saturate to one. There is also a more stable prediction of the performance for each topic: the number of statistically significant positive correlations increases with N. This indicates (as we may expect) that with more information (more documents, greater N) we can better predict the order of importance, in terms of recall, of each reformulation.

Table 3: Performance achieved by traditional predictors, applied to our specific case. Each predictor has been used to predict the performance of the different formulations. We report the mean score and standard deviation of the correlation computed over the different topics. We also report the first quartile, the third quartile, and the number of topics for which the correlation between the predicted and observed recalls of their (re)formulations is significantly greater than 0.
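A small numerical comparison shows why the averaged form discriminates where the product form cannot (relevance values are invented for illustration):

```python
def product_gain(rels):
    """Original aggregation: 1 - prod(1 - rel)."""
    prod = 1.0
    for r in rels:
        prod *= 1.0 - r
    return 1.0 - prod

def mean_gain(rels):
    """Averaged aggregation: the mean of the relevance degrees."""
    return sum(rels) / len(rels) if rels else 0.0

strong = [0.5] * 50 + [0.0] * 50  # variant retrieving many relevant docs
weak = [0.05] * 100               # variant retrieving weakly relevant docs
```

The product form scores both variants at essentially 1, while the mean keeps them clearly apart (0.25 vs 0.05).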

Using Similarity Matrix for Recall Prediction
In Figures 2c, 2f, and 2i, we show the ability to predict the performance of a query reformulation using the similarity-based approach presented in Sec. 3.1. The values of the similarity are spread out and do not saturate to the maximum possible value of the sum of a row of S (in our experiments equal to 7). By increasing the number N of documents, we improve the capability of predicting the performance of the query reformulations; in particular, there are no statistically significant negative correlations, and the total number of negative correlations decreases from N = 100 to N = 10000.
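The per-topic evaluation used throughout this section can be sketched as follows: compute Kendall τ between the predicted scores and the true recall of the reformulations. The scores below are invented for illustration, and this is a plain τ without tie correction (unlike the τ-b variant of most statistics packages):

```python
def kendall_tau(x, y):
    """Plain Kendall tau (no tie correction) between two score lists."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Seven reformulations of one topic: predicted scores (e.g., row sums of S)
# against the observed recall of each reformulation.
predicted = [5.8, 4.9, 4.1, 3.6, 2.2, 1.9, 1.0]
true_recall = [0.80, 0.75, 0.60, 0.66, 0.30, 0.25, 0.10]
tau = kendall_tau(predicted, true_recall)
```

A τ close to 1 means the predictor orders the seven variants almost exactly as their true recall does.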
Besides the qualitative aspects, Table 4 also reports the numerical performance comparison between the GAIN as proposed by [20], its version employing the mean, and the similarity-based gain.

In this last section of the analysis of the results, we want to briefly summarize our findings. As a reminder, we point out that the GAIN measure proposed by [20] was originally used to estimate the missing information that the user could have gained by using different subtopic formulations, shown in a user interface. Although such a task shares aspects with that of predicting the recall, the two do not fully overlap. Our main contributions in this paper are:

• First, adapting an already established technique to a different task. To the best of our knowledge, this is the first effort in adapting the GAIN measure proposed by Umemoto et al. [20] to the query formulation recall prediction task.

• Secondly, its "mean" version, which we refer to as "Mean Gain", is observed here for the first time, as a better adaptation of [20] to the predictive task.

• Finally, the similarity-based gain is a completely new contribution of this manuscript, which exploits elements similar to the gain measure proposed by Umemoto et al. [20].

Table 4 shows that the similarity-based gain has the overall best performance, both compared to the other gain-based measures and to the traditional predictors (cf. Table 3). Interestingly, while the original gain worsens as the cutoff increases (as observed in both Tables 2 and 3), both the mean-based and the similarity-based versions tend to improve their performance when the cutoff increases. The original gain suffers from the "saturated gain" problem reported in 4.3.1, while our proposals (both mean and similarity) improve as new relevant information is added.

Conclusions and Future Work
In this paper, we have presented a study that evaluates different definitions of the GAIN of a reformulation for an information need. We adapted the definition of gain proposed by Umemoto et al. [20] to the context of Consumer Health Search, and we used a standard test collection to evaluate our hypotheses: can we use the gain metric to predict the performance of each reformulation? Is there a better formulation that can produce an order of the importance of each reformulation in terms of recall?
We found that for recall-based tasks, where the number of documents to retrieve may be large (N > 100), the original definition of GAIN saturates quickly to 1. We proposed an alternative definition that mitigates this problem, and we also presented a similarity-based approach that tries to capture the 'optimal' query reformulation among all the available formulations of an information need. The analysis of the results confirms that our approach significantly improves the prediction of the order of importance of each reformulation in terms of recall.
We are currently investigating the possibility of smoothing the contribution of each reformulation in the similarity matrix S with a locality parameter w. This parameter can be used as an exponent for each element of S to decide whether to bring reformulations closer, w < 1, or push them further apart, w > 1, in order to create sub-clusters of reformulations and obtain a better prediction.

Figure 2: Results for the original GAIN proposed in [20], the mean aggregation, and the similarity-based aggregation strategy. Each subfigure shows (top) the distribution of the GAIN, ordered increasingly, over the 350 queries and (bottom) the correlation between the reformulations ordered by predicted GAIN (or similarity) and the reformulations ordered by the true recall.
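A sketch of the smoothing we have in mind, with an invented 3×3 similarity matrix; w is a free hyperparameter still to be tuned:

```python
import numpy as np

def smoothed_scores(S, w):
    """Raise each element of S to the exponent w, then sum the rows."""
    return np.power(S, w).sum(axis=1)

# Invented similarity matrix: q1 and q2 agree, q3 is an outlier.
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
```

With w < 1 small similarities are boosted (reformulations get 'closer'); with w > 1 they are suppressed, sharpening the separation between sub-clusters.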