# Analytics Methods to Understand Information Retrieval Effectiveness—A Survey

## Abstract


## 1. Introduction

- Can we better understand IR system effectiveness, that is to say the successes and failures of systems, using data analytics methods?

- Did the literature allow conclusions to be drawn from the analysis of international evaluation campaigns and the analysis of the participants’ results?
- Did data-driven analysis, based on a thorough examination of IR components and hyper-parameters, lead to different or better conclusions?
- Did we learn from query performance prediction?

- Can an understanding of system effectiveness be used in a comprehensive way in IR to solve system failures and to design more effective systems? Can we design a model that is transparent in terms of its performance on a query?

## 2. Related Work

#### 2.1. Surveys on a Specific IR Component

**Query expansion component**. Carpineto and Romano's survey [23] covers the different applications of query expansion, as well as the different techniques. They suggested a classification of QE approaches that Azad and Deepak [24] completed with a four-level taxonomy. To analyse the different methods, Carpineto and Romano did not use any data analytics; rather, they used both a classification with various criteria and a comparison of method effectiveness. More precisely, the criteria they used are as follows: the data source used in the expansion (e.g., WordNet, top-ranked documents, ...), the candidate feature extraction method, the feature selection method, and the expanded query representation. With regard to effectiveness, they report mean average precision on TREC collections (sparse results). Mean average precision is the average of average precision over a query set. Average precision is one of the main evaluation measures in IR: it is the area under the precision–recall curve which, in practice, is replaced with an approximation based on precision at every position in the ranked sequence of documents (see https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173, accessed on 15 May 2022). The authors concluded that, for query expansion, linguistic techniques are considered less effective than statistics-based methods. In particular, local analysis seems to perform better than corpus-based analysis. The authors also mentioned that the methods seem to be complementary and that this should be exploited more. Their final conclusion is that the best choice depends on many factors, among which are the type of collection being queried, the availability and features of the external data, and the type of queries. The authors did not detail the link between these features and the choice of a query expansion mechanism.
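To make the measure just described concrete, the following sketch computes average precision from a binary relevance list and MAP over a query set. The relevance lists are invented for illustration, and we assume (as the approximation above does) that all relevant documents appear in the retrieved list:

```python
def average_precision(relevance):
    """Mean of the precision values at the rank of each relevant
    document: the practical approximation of the area under the
    precision-recall curve. `relevance` is 1/0 per ranked document."""
    precisions = []
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """MAP: average of average precision over a query set."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Relevant documents retrieved at ranks 1 and 3:
print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
```

In real evaluations the denominator is the total number of relevant documents in the collection, so an unretrieved relevant document lowers the score; the simplification here only holds when every relevant document is retrieved.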

Moral et al. [25] surveyed **stemming algorithms** applied in indexing and query pre-processing, and their effect. They mainly considered rule-based stemmers and classified the stemmers according to their features, such as their strength (the aggressiveness with which the stemmer strips term endings), the number of rules and suffixes considered, and their use of a recoding phase, partial matching, and constraint rules. They also compared the algorithms according to their conflation rate or index compression factor. The authors did not compare the algorithms in terms of effectiveness but rather refer to other papers for this aspect.

Kamphuis et al. conducted a large-scale reproducibility study of variants of the **BM25 scoring function** [26]. The authors considered three TREC collections and used precision at 30 documents: the proportion of relevant documents within the retrieved document list when this list is cut off at the 30th retrieved document. They showed that there is no significant effectiveness difference between the different implementations of BM25.

#### 2.2. Effectiveness and Relevance

**relevance**.

#### 2.3. Typical Evaluation Report in IR Literature

Regarding **hyper-parameters**, we should mention that it is now common practice in IR experimental evaluation (https://www.sigir.org/sigir2012/paper-guidelines.php, accessed on 15 May 2022, is an example of paper guidelines for writing IR papers) to analyse the hyper-parameters of the method one has developed. Analysing the results is generally performed by comparing effectiveness in tables or graphs that show the effectiveness for different values of the hyper-parameters (see Figure 2, which represents typical reports on comparisons of methods and hyper-parameters in IR papers). In these figures and tables, the parameter values change, and either different effectiveness measures, different evaluation collections, or both are reported.

## 3. Materials and Methods

#### 3.1. Data Analysis Methods

A **boxplot** is a graphical representation of a series of numerical values that shows their locality, spread, and skewness based on their quartiles. Whiskers extend the Q1–Q3 box, indicating variability outside the upper and lower quartiles. Beyond the whiskers, outliers that differ significantly from the rest of the dataset are plotted as individual points. Effectiveness under different conditions (different queries, different values of a component parameter) is a typical series that can be represented in the form of a boxplot.
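The summary statistics behind such a plot can be computed directly; a minimal sketch using Tukey's common 1.5 × IQR whisker rule (the per-query AP values below are invented for illustration, and quartile conventions vary slightly between tools):

```python
def boxplot_stats(values, whisker=1.5):
    """Five-number summary behind a boxplot: quartiles, whisker
    limits (Tukey's rule: whisker * IQR beyond Q1/Q3), outliers."""
    s = sorted(values)
    def median(xs):
        n = len(xs)
        mid = n // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2
    half = len(s) // 2
    q1, q2, q3 = median(s[:half]), median(s), median(s[-half:])
    iqr = q3 - q1
    lo, hi = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in s if lo <= v <= hi]
    return {"q1": q1, "median": q2, "q3": q3,
            "whisker_low": min(inside), "whisker_high": max(inside),
            "outliers": [v for v in s if v < lo or v > hi]}

# AP of one system over a query set (hypothetical values):
ap = [0.12, 0.25, 0.28, 0.30, 0.33, 0.35, 0.40, 0.95]
print(boxplot_stats(ap)["outliers"])  # [0.95]: one unusually easy query
```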

**Correlation** is a family of analyses that measure the relationship between two variables: its strength and its direction. Correlation calculation results in a value that ranges between $-1$ (strong negative correlation) and 1 (strong positive correlation), with 0 indicating that the two variables are not correlated. The p-value indicates the confidence, or risk of error, in rejecting the hypothesis that the two variables are independent. The most familiar measure of correlation is the Pearson product-moment correlation coefficient, which is a normalised form of the covariance. The covariance between two random variables measures their joint deviation from their expected values (the mean, for numerical data). Pearson's $\rho $ assumes a linear relationship between the two variables. Spearman's correlation (r) considers the ranks rather than the values and measures how far apart the variable ranks are; r is equivalent to Pearson computed on ranks. Spearman's assumes a monotonic relationship between the two variables. Kendall's correlation also measures correlation on ranks, that is, the similarity of the orderings of the data when ranked by each of the variable values. It is affected by whether the ranks of observations agree or not, without considering how far apart they are, as opposed to r; it is thus considered more appropriate for discrete variables. With regard to system effectiveness, correlation is used in query performance prediction to evaluate the accuracy of the prediction: the two analysed variables are the predictor (either a single predictor or a complex one) and the observed effectiveness.
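The QPP use case above can be sketched in a few lines: Pearson on the raw values, Spearman as Pearson on ranks. The predictor scores and observed AP values are invented for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation: normalised covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    """1-based rank of each value; ties get the average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's r: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical QPP evaluation: predictor score vs. observed AP per query.
predictor = [0.2, 0.5, 0.1, 0.8, 0.4]
observed_ap = [0.15, 0.40, 0.05, 0.70, 0.35]
print(round(spearman(predictor, observed_ap), 3))  # 1.0: same ordering
```

Libraries such as SciPy provide these coefficients together with p-values; the point here is only to show what is being computed.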

**Analysis of variance (ANOVA)** encompasses different statistical models and estimation procedures used to highlight differences or dependencies between several statistical groups. It is used to analyse the difference between the means of more than two groups. In ANOVA, the observed variance in a particular variable is partitioned into components that are attributable to different sources of variation. A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two independent variables. The General Linear Mixed Model [30] extends the General Linear Model [31] so that the linear predictor contains random effects in addition to the usual fixed effects.
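The partition of variance for the one-way case reduces to a short computation: the F statistic is the between-group variance over the within-group variance. A sketch on invented per-query AP values for three systems (a real analysis would also compare F against the F-distribution to get a p-value):

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: mean square between groups
    divided by mean square within groups."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# AP of three systems over the same queries (hypothetical values):
sys_a = [0.30, 0.35, 0.32, 0.31]
sys_b = [0.29, 0.33, 0.30, 0.34]
sys_c = [0.50, 0.55, 0.52, 0.53]
print(one_way_anova_f([sys_a, sys_b, sys_c]))  # large F: means differ
```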

**Factorial analysis** is used to describe variability among observed, correlated variables; it uses factors, here combinations of the initial variables, to represent individuals or data in a space of lower dimension. It relies on singular value decomposition and is appropriate for visualising the links between elements (individuals) that are initially represented in a high-dimensional space (variables). Factorial analysis is also the core model used in the Latent Semantic Indexing model [32], where documents are considered in the high-dimensional space of words, and it is linked to the matrix factorisation principle used, for example, in recommender systems [33]. Two variants of factorial analysis are used in the context of IR system performance analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA) [34], which differ in the pre-treatment applied to the initial analysed matrix and in the distance used to find the links between variables and individuals. While PCA reduces the dimensionality of the data by keeping the most important dimensions, as determined by the eigenvalues of the variance/covariance matrix, using the Euclidean distance, CA uses the ${\chi}^{2}$ distance on contingency matrices. Factorial analysis results in visual representations which can be manually interpreted. Among other things, one interesting property of CA compared to PCA is that individuals and features can be observed together in the same projected space.
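To give a feel for the PCA side, the dominant axis can be obtained by power iteration on the variance/covariance matrix; a self-contained sketch (the two-dimensional points are invented for illustration, and a full PCA would extract further axes and project the data onto them):

```python
def first_principal_component(data, iters=200):
    """First PCA axis via power iteration on the covariance matrix.
    `data` is a list of observation rows (lists of floats)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(centred[i][a] * centred[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Repeatedly multiply a vector by the matrix and renormalise:
    # it converges to the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Points spread along the y = x diagonal: first axis ≈ (0.707, 0.707).
pts = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]]
print(first_principal_component(pts))
```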

**Clustering methods** are a family of methods that aim to group similar objects or individuals together. This family includes agglomerative clustering and k-means. In agglomerative clustering, each individual initially corresponds to a cluster; at each processing step, the two closest clusters are merged; the process ends when a single cluster remains. With Ward's criterion [35], the pair of clusters to merge is the one that minimises the increase in the error sum of squares. The resulting dendrogram (tree-like structure) can be cut at any level to produce a partition of the objects. Depending on its level, the cut will result in either numerous but homogeneously-composed clusters or few but heterogeneously-composed clusters. Another popular clustering method is k-means, where a number of seeds, corresponding to the desired number of clusters, is chosen. Objects are associated with the closest seed. Objects can then be re-allocated to a different cluster if they are closer to the centroid of another cluster. For system effectiveness analysis, clustering can be used to group queries, systems, or even measures.
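The k-means loop just described (assign to the closest centroid, recompute centroids, repeat) fits in a few lines. The query feature vectors and seeds are invented for illustration; real uses would pick seeds randomly and add a convergence test:

```python
def kmeans(points, seeds, iters=20):
    """Plain k-means: assign each point to the closest centroid,
    then recompute each centroid as the mean of its cluster."""
    centroids = [list(s) for s in seeds]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c))
                     for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(x) / len(cl) for x in zip(*cl)]
    return centroids, clusters

# Queries described by two features (hypothetical): two groups emerge.
queries = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.8, 0.9], [0.9, 0.8]]
centroids, clusters = kmeans(queries, seeds=[[0.0, 0.0], [1.0, 1.0]])
print([len(c) for c in clusters])  # [3, 2]
```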

**Regression methods** aim to approximate the value of a dependent variable (the variable to be predicted) from one or several independent variables (the variables, or features, used to predict it). The regression is based on a function model with one or more parameters (e.g., a linear function in linear regression; a polynomial, ...). Logistic regression covers the case where the variable to explain is binary (e.g., whether the individual belongs to a class or not). It is used, for example, in query performance prediction.
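For a single independent variable, the least-squares fit has a closed form; a minimal sketch on invented data linking a query feature to observed AP:

```python
def linear_regression(x, y):
    """Ordinary least squares fit y ≈ slope * x + intercept
    (closed form for one independent variable)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical link between a query feature and observed AP:
feature = [1.0, 2.0, 3.0, 4.0]
ap = [0.2, 0.3, 0.4, 0.5]
slope, intercept = linear_regression(feature, ap)
print(slope, intercept)  # slope ≈ 0.1, intercept ≈ 0.1
```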

**Decision trees** form a family of non-parametric supervised learning methods that are used for classification and regression. The resulting model predicts the value of a target variable by learning simple decision rules inferred from the data features. CART [36] and random forests [37] are the most popular of these methods. They have been shown to be very competitive. An extra advantage is that the model can combine both quantitative and qualitative variables. In addition, the obtained models are explainable. For system effectiveness analysis, the target variable is an effectiveness measurement or a class of query difficulty (easy, medium, hard, for example). The system hyper-parameters or query features are used to infer the rules.
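The smallest possible tree, a one-level stump, already illustrates the rule-inference idea: find the single threshold on one feature that best separates the difficulty classes. The clarity scores and labels below are invented for illustration; CART grows such splits recursively and uses criteria like Gini impurity rather than raw accuracy:

```python
def best_stump(feature, labels):
    """One-level decision tree: the threshold on a single feature
    that best separates two classes by training accuracy."""
    best = (None, -1.0)
    for t in sorted(set(feature)):
        pred = ['easy' if f >= t else 'hard' for f in feature]
        acc = sum(p == l for p, l in zip(pred, labels)) / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best

# Hypothetical: query clarity score vs. observed difficulty class.
clarity = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
difficulty = ['easy', 'easy', 'easy', 'hard', 'hard', 'hard']
threshold, accuracy = best_stump(clarity, difficulty)
print(threshold, accuracy)  # 0.7 1.0 — "clarity >= 0.7" predicts 'easy'
```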

Regarding **deep learning** methods as a means to analyse and understand information retrieval effectiveness: deep learning is increasingly popular in IR, but these models still lack interpretability. The artificial intelligence community is re-investigating the explainability and interpretability challenge of neural-network-based models [38]. For example, a recent review focused on explainable recommendation systems [39]. Still, model explainability is mainly grounded in model interpretability, and the prominent interpretable models are more conventional machine learning ones, such as regression models and decision tree models [39].

#### 3.2. Data and Data Structures for System Effectiveness Analysis

## 4. System Performance Analysis Based on Participation in Evaluation Challenges

## 5. Analyses Based on Systems That Were Generated for the Study—The System Factor

## 6. The Query Factor

#### 6.1. Considering the Queries and Their Pre- and Post-Retrieval Features

- Combination of query features might;
- It may explain why systems, in general, will fail.

#### 6.2. Relationship between the Query Factor and the System Factor

## 7. Discussion and Conclusions

Regarding **evaluation forums and shared tasks**, although participants provide some detailed description of the systems they designed, the information is not sufficiently structured or detailed to draw conclusions, except in broad strokes. The main conclusions from the analyses of shared task results are:

- C1: it is possible to distinguish between effective and non-effective systems on average over a query set;
- C2: effectiveness of systems has increased over years thanks to the effort put in the domain;

Regarding **automatically generated query processing chains**, we have deep knowledge of the systems: we know exactly which components are used, with which hyper-parameters. From the analyses that used these data, we can conclude:

- C4: some components and hyper-parameters are more influential than others and informed choices can be made;
- C5: the choice of the most appropriate components depends on the query level of difficulty.

Regarding **query analyses**, we can conclude:

- C6: no single query feature, nor any combination of features, has been proven to fully explain system effectiveness;
- C7: query features can, to some extent, explain system effectiveness.

## Funding

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| AP | Average Precision |
| CA | Correspondence Analysis |
| CIKM | Conference on Information and Knowledge Management |
| CLEF | Conference and Labs of the Evaluation Forum |
| IR | Information Retrieval |
| MAP | Mean Average Precision |
| PCA | Principal Component Analysis |
| QE | Query Expansion |
| QPP | Query Performance Prediction |
| SIGIR | Conference of the Association for Computing Machinery Special Interest Group in Information Retrieval |
| SQE | Selective Query Expansion |
| TREC | Text Retrieval Conference |

## References

- Salton, G.; Wong, A.; Yang, C.S. A vector space model for automatic indexing. Commun. ACM
**1975**, 18, 613–620. [Google Scholar] [CrossRef] - Robertson, S.E.; Jones, K.S. Relevance weighting of search terms. J. Am. Soc. Inf. Sci.
**1976**, 27, 129–146. [Google Scholar] [CrossRef] - Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond; Now Publishers Inc.: Delft, The Netherlands, 2009; pp. 333–389. [Google Scholar]
- Ponte, J.M.; Croft, W.B. A Language Modeling Approach to Information Retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 275–281. [Google Scholar] [CrossRef]
- Ounis, I.; Amati, G.; Plachouras, V.; He, B.; Macdonald, C.; Johnson, D. Terrier information retrieval platform. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2005; pp. 517–519. [Google Scholar]
- Taylor, M.; Zaragoza, H.; Craswell, N.; Robertson, S.; Burges, C. Optimisation methods for ranking functions with multiple parameters. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Arlington, VA, USA, 6–11 November 2006; pp. 585–593. [Google Scholar]
- Ayter, J.; Chifu, A.; Déjean, S.; Desclaux, C.; Mothe, J. Statistical analysis to establish the importance of information retrieval parameters. J. Univers. Comput. Sci.
**2015**, 21, 1767–1789. [Google Scholar] - Tague-Sutcliffe, J.; Blustein, J. A Statistical Analysis of the TREC-3 Data; NIST Special Publication SP: Washington, DC, USA, 1995; p. 385. [Google Scholar]
- Banks, D.; Over, P.; Zhang, N.F. Blind men and elephants: Six approaches to TREC data. Inf. Retr.
**1999**, 1, 7–34. [Google Scholar] [CrossRef] - Dinçer, B.T. Statistical principal components analysis for retrieval experiments. J. Am. Soc. Inf. Sci. Technol.
**2007**, 58, 560–574. [Google Scholar] [CrossRef] - Mothe, J.; Tanguy, L. Linguistic analysis of users’ queries: Towards an adaptive information retrieval system. In Proceedings of the 2007 Third International IEEE Conference on Signal-Image Technologies and Internet-Based System, Shanghai, China, 16–18 December 2007; pp. 77–84. [Google Scholar]
- Harman, D.; Buckley, C. The NRRC reliable information access (RIA) workshop. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 25–29 July 2004; pp. 528–529. [Google Scholar]
- Mizzaro, S.; Robertson, S. Hits hits trec: Exploring ir evaluation results with network analysis. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 23–27 July 2007; pp. 479–486. [Google Scholar]
- Harman, D.; Buckley, C. Overview of the reliable information access workshop. Inf. Retr.
**2009**, 12, 615. [Google Scholar] [CrossRef] - Bigot, A.; Chrisment, C.; Dkaki, T.; Hubert, G.; Mothe, J. Fusing different information retrieval systems according to query-topics: A study based on correlation in information retrieval systems and TREC topics. Inf. Retr.
**2011**, 14, 617. [Google Scholar] [CrossRef] - Ferro, N.; Silvello, G. A general linear mixed models approach to study system component effects. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 25–34. [Google Scholar]
- Ferro, N.; Silvello, G. Toward an anatomy of IR system component performances. J. Assoc. Inf. Sci. Technol.
**2018**, 69, 187–200. [Google Scholar] [CrossRef] - Louedec, J.; Mothe, J. A massive generation of ir runs: Demonstration paper. In Proceedings of the IEEE 7th International Conference on Research Challenges in Information Science (RCIS), Paris, France, 29–31 May 2013; pp. 1–2. [Google Scholar]
- Wilhelm, T.; Kürsten, J.; Eibl, M. A tool for comparative ir evaluation on component level. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, 24–28 July 2011; pp. 1291–1292. [Google Scholar]
- Carmel, D.; Yom-Tov, E.; Darlow, A.; Pelleg, D. What makes a query difficult? In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, USA, 6–11 August 2006; pp. 390–397. [Google Scholar]
- Mothe, J.; Tanguy, L. Linguistic features to predict query difficulty. In ACM Conference on Research and Development in Information Retrieval, SIGIR, Predicting Query Difficulty-Methods and Applications Workshop; ACM: New York, NY, USA, 2005; pp. 7–10. [Google Scholar]
- Zamani, H.; Croft, W.B.; Culpepper, J.S. Neural query performance prediction using weak supervision from multiple signals. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 105–114. [Google Scholar]
- Carpineto, C.; Romano, G. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. (CSUR)
**2012**, 44, 1–50. [Google Scholar] [CrossRef] - Azad, H.K.; Deepak, A. Query expansion techniques for information retrieval: A survey. Inf. Process. Manag.
**2019**, 56, 1698–1735. [Google Scholar] [CrossRef] - Moral, C.; de Antonio, A.; Imbert, R.; Ramírez, J. A survey of stemming algorithms in information retrieval. Inf. Res. Int. Electron. J.
**2014**, 19, n1. [Google Scholar] - Kamphuis, C.; de Vries, A.P.; Boytsov, L.; Lin, J. Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. In Advances in Information Retrieval; Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 28–34. [Google Scholar]
- Mizzaro, S. How many relevances in information retrieval? Interact. Comput.
**1998**, 10, 303–320. [Google Scholar] [CrossRef] - Ruthven, I. Relevance behaviour in TREC. J. Doc.
**2014**, 70, 1098–1117. [Google Scholar] [CrossRef] - Hofstätter, S.; Lin, S.C.; Yang, J.H.; Lin, J.; Hanbury, A. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event. 11–15 July 2021; pp. 113–122. [Google Scholar]
- Breslow, N.E.; Clayton, D.G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc.
**1993**, 88, 9–25. [Google Scholar] - McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall: London, UK, 1989. [Google Scholar]
- Dumais, S.T. LSA and information retrieval: Getting back to basics. Handb. Latent Semant. Anal.
**2007**, 293, 322. [Google Scholar] - Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Application of Dimensionality Reduction in Recommender System—A Case Study; Technical Report; Department of Computer Science and Engineering, University of Minnesota: Minneapolis, MN, USA, 2000. [Google Scholar]
- Benzécri, J.P. Statistical analysis as a tool to make patterns emerge from data. In Methodologies of Pattern Recognition; Elsevier: Amsterdam, The Netherlands, 1969; pp. 35–74. [Google Scholar]
- Ward, J.H., Jr. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc.
**1963**, 58, 236–244. [Google Scholar] [CrossRef] - Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and regression trees (CART). Biometrics
**1984**, 40, 358–361. [Google Scholar] - Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282. [Google Scholar]
- Gunning, D. Explainable Artificial Intelligence; Defense Advanced Research Projects Agency (DARPA): Arlington, VA, USA, 2017; p. 2. [Google Scholar]
- Zhang, Y.; Chen, X. Explainable recommendation: A survey and new perspectives. Found. Trends® Inf. Retr.
**2020**, 14, 1–101. [Google Scholar] [CrossRef] - Harman, D. Overview of the First Text Retrieval Conference (trec-1); NIST Special Publication SP: Washington, DC, USA, 1992; pp. 1–532. [Google Scholar]
- Harman, D. Overview of the first TREC conference. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, USA, 27 June–1 July 1993; pp. 36–47. [Google Scholar]
- Buckley, C.; Mitra, M.; Walz, J.A.; Cardie, C. SMART high precision: TREC 7; NIST Special Publication SP: Washington, DC, USA, 1999; pp. 285–298. [Google Scholar]
- Clarke, C.L.; Craswell, N.; Soboroff, I. Overview of the Trec 2009 Web Track; Technical Report; University of Waterloo: Waterloo, ON, Canada, 2009. [Google Scholar]
- Collins-Thompson, K.; Macdonald, C.; Bennett, P.; Diaz, F.; Voorhees, E.M. TREC 2014 Web Track Overview; Technical Report; University of Michigan: Ann Arbor, MI, USA, 2015. [Google Scholar]
- Kompaore, D.; Mothe, J.; Baccini, A.; Dejean, S. Query clustering and IR system detection. Experiments on TREC data. In Proceedings of the ACM International Workshop for Ph. D. Students in Information and Knowledge Management (ACM PIKM 2007), Lisboa, Portugal, 5–10 November 2007. [Google Scholar]
- Hanbury, A.; Müller, H. Automated component–level evaluation: Present and future. In International Conference of the Cross-Language Evaluation Forum for European Languages; Springer: Berlin/Heidelberg, Germany, 2010; pp. 124–135. [Google Scholar]
- Arslan, A.; Dinçer, B.T. A selective approach to index term weighting for robust information retrieval based on the frequency distributions of query terms. Inf. Retr. J.
**2019**, 22, 543–569. [Google Scholar] [CrossRef] - Di Buccio, E.; Dussin, M.; Ferro, N.; Masiero, I.; Santucci, G.; Tino, G. Interactive Analysis and Exploration of Experimental Evaluation Results. In European Workshop on Human-Computer Interaction and Information Retrieval EuroHCIR; Citeseer: Nijmegen, The Netherlands, 2011; pp. 11–14. [Google Scholar]
- Compaoré, J.; Déjean, S.; Gueye, A.M.; Mothe, J.; Randriamparany, J. Mining information retrieval results: Significant IR parameters. In Proceedings of the First International Conference on Advances in Information Mining and Management, Barcelona, Spain, 23–29 October 2011; Volume 74. [Google Scholar]
- Hopfgartner, F.; Hanbury, A.; Müller, H.; Eggel, I.; Balog, K.; Brodt, T.; Cormack, G.V.; Lin, J.; Kalpathy-Cramer, J.; Kando, N.; et al. Evaluation-as-a-service for the computational sciences: Overview and outlook. J. Data Inf. Qual. (JDIQ)
**2018**, 10, 1–32. [Google Scholar] [CrossRef] - Kürsten, J.; Eibl, M. A large-scale system evaluation on component-level. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2011; pp. 679–682. [Google Scholar]
- Angelini, M.; Fazzini, V.; Ferro, N.; Santucci, G.; Silvello, G. CLAIRE: A combinatorial visual analytics system for information retrieval evaluation. Inf. Process. Manag.
**2018**, 54, 1077–1100. [Google Scholar] [CrossRef] - Dejean, S.; Mothe, J.; Ullah, M.Z. Studying the variability of system setting effectiveness by data analytics and visualization. In International Conference of the Cross-Language Evaluation Forum for European Languages; Springer: Cham, Switzerland, 2019; pp. 62–74. [Google Scholar]
- De Loupy, C.; Bellot, P. Evaluation of document retrieval systems and query difficulty. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000) Workshop, Athens, Greece, 31 May–2 June 2000; pp. 32–39. [Google Scholar]
- Banerjee, S.; Pedersen, T. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the IJCAI 2003, Acapulco, Mexico, 9–15 August 2003; pp. 805–810. [Google Scholar]
- Patwardhan, S.; Pedersen, T. Using WordNet-based context vectors to estimate the semantic relatedness of concepts. In Proceedings of the Workshop on Making Sense of Sense: Bringing Psycholinguistics and Computational Linguistics Together, Trento, Italy, 4 April 2006. [Google Scholar]
- Cronen-Townsend, S.; Zhou, Y.; Croft, W.B. Predicting query performance. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 11–15 August 2002; pp. 299–306. [Google Scholar]
- Scholer, F.; Williams, H.E.; Turpin, A. Query association surrogates for web search. J. Am. Soc. Inf. Sci. Technol.
**2004**, 55, 637–650. [Google Scholar] [CrossRef] - He, B.; Ounis, I. Inferring query performance using pre-retrieval predictors. In International Symposium on String Processing and Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2004; pp. 43–54. [Google Scholar]
- Hauff, C.; Hiemstra, D.; de Jong, F. A survey of pre-retrieval query performance predictors. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA, 26–30 October 2008; pp. 1419–1420. [Google Scholar]
- Zhao, Y.; Scholer, F.; Tsegay, Y. Effective pre-retrieval query performance prediction using similarity and variability evidence. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2008; pp. 52–64. [Google Scholar]
- Sehgal, A.K.; Srinivasan, P. Predicting performance for gene queries. In Proceedings of the ACM SIGIR 2005 Workshop on Predicting Query Difficulty-Methods and Applications; Available online: http://www.haifa.il.ibm.com/sigir05-qp (accessed on 15 May 2022).
- Zhou, Y.; Croft, W.B. Ranking robustness: A novel framework to predict query performance. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Arlington, VA, USA, 6–11 November 2006; pp. 567–574. [Google Scholar]
- Vinay, V.; Cox, I.J.; Milic-Frayling, N.; Wood, K. On ranking the effectiveness of searches. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, USA, 6–11 August 2006; pp. 398–404. [Google Scholar]
- Aslam, J.A.; Pavlu, V. Query hardness estimation using Jensen-Shannon divergence among multiple scoring functions. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2007; pp. 198–209. [Google Scholar]
- Zhou, Y.; Croft, W.B. Query performance prediction in web search environments. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 23–27 July 2007; pp. 543–550. [Google Scholar]
- Shtok, A.; Kurland, O.; Carmel, D. Predicting query performance by query-drift estimation. In Conference on the Theory of Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2009; pp. 305–312. [Google Scholar]
- Carmel, D.; Yom-Tov, E. Estimating the query difficulty for information retrieval. Synth. Lect. Inf. Concepts Retr. Serv.
**2010**, 2, 1–89. [Google Scholar] - Cummins, R.; Jose, J.; O’Riordan, C. Improved query performance prediction using standard deviation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, 24–28 July 2011; pp. 1089–1090. [Google Scholar]
- Roitman, H.; Erera, S.; Weiner, B. Robust standard deviation estimation for query performance prediction. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, Amsterdam, The Netherlands, 1–4 October 2017; pp. 245–248. [Google Scholar]
- Chifu, A.G.; Laporte, L.; Mothe, J.; Ullah, M.Z. Query performance prediction focused on summarized letor features. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 1177–1180. [Google Scholar]
- Zhang, Z.; Chen, J.; Wu, S. Query performance prediction and classification for information search systems. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data; Springer: Cham, Switzerland, 2018; pp. 277–285. [Google Scholar]
- Khodabakhsh, M.; Bagheri, E. Semantics-enabled query performance prediction for ad hoc table retrieval. Inf. Process. Manag.
**2021**, 58, 102399. [Google Scholar] [CrossRef] - Molina, S.; Mothe, J.; Roques, D.; Tanguy, L.; Ullah, M.Z. IRIT-QFR: IRIT query feature resource. In International Conference of the Cross-Language Evaluation Forum for European Languages; Springer: Cham, Switzerland, 2017; pp. 69–81. [Google Scholar]
- Macdonald, C.; He, B.; Ounis, I. Predicting query performance in intranet search. In Proceedings of the SIGIR 2005 Query Prediction Workshop, Salvador, Brazil, 15–19 August 2005. [Google Scholar]
- Faggioli, G.; Zendel, O.; Culpepper, J.S.; Ferro, N.; Scholer, F. sMARE: A new paradigm to evaluate and understand query performance prediction methods. Inf. Retr. J.
**2022**, 25, 94–122. [Google Scholar] [CrossRef] - Hashemi, H.; Zamani, H.; Croft, W.B. Performance Prediction for Non-Factoid Question Answering. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, Paris, France, 21–25 July 2019; pp. 55–58. [Google Scholar]
- Roy, D.; Ganguly, D.; Mitra, M.; Jones, G.J. Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction. Inf. Process. Manag.
**2019**, 56, 1026–1045. [Google Scholar] [CrossRef] - Anscombe, F.J. Graphs in Statistical Analysis. Am. Stat.
**1973**, 27, 17–21. [Google Scholar] - Grivolla, J.; Jourlin, P.; de Mori, R. Automatic Classification of Queries by Expected Retrieval Performance; SIGIR: Salvador, Brazil, 2005. [Google Scholar]
- Raiber, F.; Kurland, O. Query-performance prediction: Setting the expectations straight. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, 6–11 July 2014; pp. 13–22. [Google Scholar]
- Mizzaro, S.; Mothe, J.; Roitero, K.; Ullah, M.Z. Query performance prediction and effectiveness evaluation without relevance judgments: Two sides of the same coin. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 1233–1236. [Google Scholar]
- Aslam, J.A.; Savell, R. On the Effectiveness of Evaluating Retrieval Systems in the Absence of Relevance Judgments. In Proceedings of the 26th ACM SIGIR, Toronto, ON, Canada, 28 July–1 August 2003; pp. 361–362. [Google Scholar]
- Baccini, A.; Déjean, S.; Lafage, L.; Mothe, J. How many performance measures to evaluate information retrieval systems? Knowl. Inf. Syst. **2012**, 30, 693–713. [Google Scholar] [CrossRef]
- Amati, G.; Carpineto, C.; Romano, G. Query difficulty, robustness, and selective application of query expansion. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2004; pp. 127–137. [Google Scholar]
- Cronen-Townsend, S.; Zhou, Y.; Croft, W.B. A framework for selective query expansion. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, Washington, DC, USA, 8–13 November 2004; pp. 236–237. [Google Scholar]
- Zhao, L.; Callan, J. Automatic term mismatch diagnosis for selective query expansion. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, OR, USA, 12–16 August 2012; pp. 515–524. [Google Scholar]
- Deveaud, R.; Mothe, J.; Ullah, M.Z.; Nie, J.Y. Learning to Adaptively Rank Document Retrieval System Configurations. ACM Trans. Inf. Syst. (TOIS) **2018**, 37, 3. [Google Scholar] [CrossRef]
- Bigot, A.; Déjean, S.; Mothe, J. Learning to Choose the Best System Configuration in Information Retrieval: The Case of Repeated Queries. J. Univers. Comput. Sci. **2015**, 21, 1726–1745. [Google Scholar]
- Deveaud, R.; Mothe, J.; Nie, J.Y. Learning to Rank System Configurations. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, Indianapolis, IN, USA, 24–28 October 2016; ACM: New York, NY, USA, 2016; pp. 2001–2004. [Google Scholar]
- Mothe, J.; Ullah, M.Z. Defining an Optimal Configuration Set for Selective Search Strategy-A Risk-Sensitive Approach. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Online, 1–5 November 2021; pp. 1335–1345. [Google Scholar]

**Figure 1.**An online search process consists of four main phases to retrieve an ordered list of documents that answer the user’s query. The component used at each phase has various hyper-parameters to tune.

**Figure 2.**A common practice in IR literature is to analyse the effect of hyper-parameters on the overall system effectiveness and to present the results in the form of tables or graphs. The top part of this figure is a typical table reporting hyper-parameter or method comparisons. Here, a deep learning-based model was used and comparisons are reported on the different training types, encoders, and batch sizes; using different effectiveness measures (nDCG@10, MRR@10, and R@1K), on different collections (here TREC DL’19, TREC-DL’20, and MSMARCO DEV). The best results are highlighted in bold font. The bottom part is a typical graph comparing the effect of different variants or hyper-parameters on effectiveness. Here, the lines represent different combinations of hyper-parameters; effectiveness is measured in terms of recall (Y-axis) for different cut-offs of the retrieved document list. Table and Figure adapted with permission from [29], Copyright 2021, Sebastian Hofstätter et al.

**Figure 3.**The 3D matrices obtained from participants’ results to shared ad hoc information retrieval tasks, which report effectiveness measurements per system, topic, and effectiveness measure, can be transformed into 2D matrices that fit many data analysis methods.

**Figure 4.**More complex data structures can be used that integrate features on topics, on systems or on both.

**Figure 6.**A 2D matrix representing the effectiveness of different systems (X-axis) on different topics (Y-axis). This matrix is an extract of the one representing the AP (effectiveness measure) for TREC 7 ad hoc participants on the topic set of that track.
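Such a system × topic matrix lends itself to simple per-topic statistics. The sketch below builds one with hypothetical AP values (topic and system names are invented; the real matrix comes from TREC 7 ad hoc runs) and computes the per-topic median, a simple easy/hard indicator, and the per-topic spread across systems:

```python
# Sketch of the system x topic effectiveness matrix of Figure 6,
# filled with hypothetical AP values for illustration only.
from statistics import median

ap = {  # topic -> {system: average precision}
    "351": {"sysA": 0.62, "sysB": 0.58, "sysC": 0.60},
    "352": {"sysA": 0.05, "sysB": 0.08, "sysC": 0.04},
    "353": {"sysA": 0.75, "sysB": 0.12, "sysC": 0.40},
}

# Per-topic median effectiveness: a simple way to label topics easy or hard.
topic_median = {t: median(scores.values()) for t, scores in ap.items()}

# Per-topic spread: topics on which systems disagree the most.
topic_spread = {t: max(s.values()) - min(s.values()) for t, s in ap.items()}
```

On these toy values, topic "352" is hard for every system (low median, small spread) while topic "353" strongly depends on the system (large spread), mirroring the distinction drawn in Figures 8 and 18.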

**Figure 7.**Effective systems are effective whatever the measure used. Web track 09 participants’ results considering mean subtopic recall (X-axis) and mean precision (Y-axis); each dot is a participant’s system. Figure reprinted with permission from [43], Copyright 2009, Charles Clarke et al.

**Figure 8.**Among the easiest topics according to the median effectiveness of the participants’ results, there are both topics with very diverse system effectiveness results (e.g., topic 298) and very similar ones (e.g., topic 285)—Web track 2014—topics are ordered by decreasing err@20 of the best system. Figure reprinted with permission from [44], Copyright 2014, Kevyn Collins-Thompson et al.

**Figure 9.**System failure and effectiveness depend on queries—not all systems succeed or fail on the same queries. The visualisation shows the first two principal components of a Principal Component Analysis applied to the effectiveness obtained on each topic by each participant’s run. MAP measure of TREC 12 Robust Track participants’ runs. Figure reprinted with permission from [10], Copyright 2007, John Wiley and Sons.

**Figure 10.**The first-ranked system differs according to the query cluster. The rank of the system is on the Y-axis and the system is on the X-axis. Blue diamonds correspond to the ranks of the systems when considering all the queries, pink squares when considering query cluster 1, brown triangles query cluster 2, and green crosses cluster 3. Systems on the X-axis are ordered by decreasing average effectiveness over the query set. Figure reprinted with permission from [45], Copyright 2007, Kompaore et al.

**Figure 11.**The choice of the weighting model has more impact than the stemmer used. Individual boxplots represent average precision on the TREC 7 and 8 topics when a given component is used in a query processing chain—80,000 query processing chains or component combinations were used. Figure reprinted with permission from [7], Copyright 2015, J.UCS.

**Figure 12.**Interaction between component choices. The curves used in this representation are somewhat misleading since the variables are not continuous, but they can nevertheless be understood; we thus kept the original figures from [16], where we added letters to each sub-figure for clarity. On the first row, the stop list effect is shown for different stemmers (**A**) and different weighting models (**B**). On the second row, the effect of stemmers is reported for different stop lists (**C**) and different weighting models (**D**). On the last row, the weighting model effect is reported for different stop lists (**E**) and different stemmers (**F**). Figure adapted with permission from [16], Copyright 2016, Nicola Ferro et al.

**Figure 13.**No correlation between pre- or post-retrieval predictors and the actual effectiveness—IDF pre-retrieval predictor and BM25 post-retrieval predictor (X-axis) against ndcg (Y-axis) values on the WT10G TREC collection. Although the correlation values reach $0.35$, no clear correlation appears.

**Figure 14.**A Pearson correlation value higher than 0.8 does not mean the two variables are correlated. Anscombe’s quartet presents four datasets that share the same means, the same number of values, and the same Pearson correlation value ($\rho $ = 0.816), but for which this value does not always mean the two variables X and Y are linearly related. In #4, X and Y are not correlated despite the high $\rho $ value. In #2, X and Y are perfectly related, but not linearly (Pearson cannot measure other than linear correlations). #1 and #3 illustrate two cases of linear correlation. Figures generated from the data in [79].
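The quartet’s point can be checked numerically. A minimal sketch, using the published values of datasets #1 (genuinely linear) and #4 (a single influential outlier) from [79], computes Pearson’s $\rho $ from first principles:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Anscombe's dataset #1: a genuine linear relation.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Anscombe's dataset #4: one influential outlier drives the correlation.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
```

Both datasets yield $\rho \approx 0.816$ even though only #1 exhibits a real linear relation, which is exactly why correlation values reported for query performance predictors should always be accompanied by a scatter plot.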

**Figure 15.**Predicted AP is correlated to actual AP for easy queries (the ones on the right part of the plot), although they are sparse. Figure reprinted with permission from Roy et al. [78], Copyright 2019, Elsevier.

**Figure 16.**AS feature [83] is correlated to the average effectiveness of a set of systems. TREC7 Adhoc collection. Pearson correlation between AAP and (**a**) QF [66], (**b**) AS [83]. Dots correspond to actual and predicted AAP for individual topics; the cones represent the confidence intervals. Figure reprinted with permission from [82], Copyright 2018, Mizzaro et al.

**Figure 17.**The parameters that affect retrieval effectiveness the most depend on the query difficulty. On (**a**), for easy queries, the most important parameter for search effectiveness optimisation is the choice of the query expansion component; on (**b**), for hard queries, the most important parameter is the topic parts used for building the query, then the weighting component, and third the query expansion model. Figure reprinted with permission from [7], Copyright 2015, J.UCS.

**Figure 18.**Some queries are easy for all the systems, some are hard for all, and others depend on the system. On TREC topic 297, all the analysed systems obtained an NDCG@20 of at least 0.5, half of them obtained more than 0.65, and some obtained 0.8, which is high. On topic 255, all but three systems failed, and only one obtained more than 0.3. The rightmost boxplot, as opposed to the left-side ones, shows that for topic 298 the system effectiveness spans a large range, from 0 to almost 1.

**Figure 19.**When considering a given system and a given query, the effectiveness measure used to compare the systems does not matter much: all are strongly correlated. Pearson correlation values between pairs of effectiveness measures, each measured on a given (system, query) pair. Correlations are represented using a divergent palette (a central colour, yellow, and two shades depending on whether the values are negative—red—or positive—blue).

**Table 1.**Correlation between query features and ndcg. WT10G TREC collection. * marks the usual <0.05 p-Value significance.

Measure | BM25_MAX | BM25_STD | IDF_MAX | IDF_AVG
---|---|---|---|---
Pearson $\rho $ | 0.294 * | 0.232 * | 0.095 | 0.127
p-Value | 0.0034 | 0.0224 | 0.3531 | 0.2125
Spearman r | 0.260 * | 0.348 * | 0.236 * | 0.196
p-Value | 0.0100 | <0.001 | 0.0202 | 0.0544
Kendall $\tau $ | 0.172 * | 0.230 * | 0.159 * | 0.136 *
p-Value | 0.0128 | <0.001 | 0.0215 | 0.0485
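Table 1 shows that Pearson and Spearman can disagree (IDF_MAX is significant under Spearman but not under Pearson). The reason is that Spearman is simply Pearson applied to ranks, so it captures any monotonic relation rather than only linear ones. A minimal pure-Python sketch, with hypothetical data chosen to be monotonic but strongly non-linear:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """Rank positions starting at 1 (assumes distinct values, no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(x, y):
    """Spearman correlation = Pearson computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotonic but non-linear relation between a query
# feature x and an effectiveness value y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 5 for v in x]
# spearman(x, y) is exactly 1.0, while pearson(x, y) stays below 1.
```

A non-linear but monotonic link between a query feature and ndcg thus inflates rank correlations relative to Pearson, which is consistent with the pattern observed for IDF_MAX in Table 1.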


© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mothe, J.
Analytics Methods to Understand Information Retrieval Effectiveness—A Survey. *Mathematics* **2022**, *10*, 2135.
https://doi.org/10.3390/math10122135
