Topic Models Ensembles for AD-HOC Information Retrieval

Abstract: Ad hoc information retrieval (ad hoc IR) is a challenging task consisting of ranking text documents for bag-of-words (BOW) queries. Classic approaches based on query and document text vectors use term-weighting functions to rank the documents. One limitation of these methods is their inability to handle polysemic concepts. In addition, these methods introduce fake orthogonalities between semantically related words. To address these limitations, model-based IR approaches based on topics have been explored. Specifically, topic models based on Latent Dirichlet Allocation (LDA) allow building representations of text documents in the latent space of topics, better modeling polysemy and avoiding orthogonal representations between related terms. We extend LDA-based IR strategies using different ensemble strategies. Model selection obeys the ensemble learning paradigm, for which we test two successful approaches widely used in supervised learning. We study Boosting and Bagging techniques for topic models, using each model as a weak IR expert. Then, we merge the ranking lists obtained from each model using a simple but effective top-k list fusion approach. We show that our proposal strengthens the results in precision and recall, outperforming classic IR models and strong baselines based on topic models.


Introduction
Information retrieval (IR) studies techniques and methods to retrieve information from unstructured or semi-structured data sources [1]. Unstructured data sources often correspond to collections of documents that cover a variety of subjects. The primary descriptor of the content of a document is its text. For this reason, IR methods construct representations based on the content of the documents, using words as content descriptors.
IR is an essential research area that pushes the development of information technologies and applications in many domains across the industry. IR-based systems are at the core of many search engines, supporting tasks such as query routing [2], spam filtering [3], multimedia retrieval [4], and user interest mining [5]. IR is also a fundamental building block of many content-based recommender systems [6]. Other content modeling approaches have also helped drive the development of these technologies, highlighting, for example, the emergence of semantic web technologies [7], representation learning [8], and formal concept analysis [9].
An IR system provides a query engine capable of retrieving an ordered list of documents according to their relevance to a given query [10]. Many classic IR methods achieve this goal using term-weighting functions, which measure the match between query words and documents. The higher the match between a query and a document, the higher the document ranks [11]. Classic IR approaches based on the term-matching principle, such as TF-IDF [12], achieve good results in precision and recall, being strong baselines for other more sophisticated IR methods [13].
One of the main limitations of the classic IR methods is their inability to work with polysemic terms [14]. A polysemic term is a word that has different meanings depending on the context. As IR term-matching systems rely on lexical matching, they can rank highly documents that match the query lexically but not semantically. Another weakness of the classic IR methods is the production of fake orthogonalities between semantically related terms. This pitfall arises because two lexically different terms can denote the same meaning, yet a term-weighting IR scheme will process them as unrelated terms. Model-based IR methods have been introduced to address these limitations [15]. These models perform the query term-matching process on a latent feature space. Usually, the latent space is inferred using techniques based on topic models, such as latent Dirichlet allocation [16]. Topic models can identify semantic relationships between related terms, generating vector representations of terms whose proximity is defined by the match in the topic space of the document collection. Inferred representations in latent spaces capture semantic relationships between related terms and can better handle polysemy [17].
Topic models have shown great utility in different domains, allowing improvements in the descriptive capacity of documents based on the lists of related terms detected on each topic. For example, Li et al. [18] show that topic models can improve the predictive capacity of graded qualifications inference systems, which are widely used in reviews systems. Another successful application of topic models shows their usefulness in user interest mining, a relevant problem in social networks where the connections between users are defined from shared topics. Dhelim et al. [5] show that topic modeling improves the precision and recall of user interest recommender systems, increasing interactions between users and favoring activity growth in the network of shared interests.
In a seminal paper on model-based IR, Wei and Croft introduced LDA-based IR [19], a term-weighting scheme computed in the latent space of document topics. The model's core is based on the query likelihood model for IR [20], in which each document is scored by the likelihood of its topic model generating the formulated query. While the classic query likelihood strategy is based on maximum likelihood estimators calculated directly on the documentary collection, the LDA-based model calculates the likelihood from each document's topic distribution. In this way, two documents that show a lexical match with the query could rank differently, conditioned on the distribution of topics of each document.
One of the limitations of LDA is its sensitivity to hyperparameter tuning [16]. LDA requires the user to choose the number of topics. In addition, some hyperparameters define the characteristics of Dirichlet's priors. Wei and Croft [19] show that these parameters must be chosen carefully to avoid creating an uninformative topic model, with dire consequences for ad hoc IR tasks. Unfortunately, hyperparameter tuning requires an exhaustive search for possible configurations, which must be evaluated in curated data. Tuning a model based on a curated dataset requires several conditions to avoid overfitting, such as data variety and volume. Both conditions are challenging in the context of text IR.
One way to address the parametric sensitivity of LDA is to use ensemble learning [21]. Ensemble-based learning uses the outputs of various models to infer a model outcome. In this way, the probability of errors generated by model artifacts is minimized. Topic model ensembles have received attention due to their abilities to deal with the parametric sensitivity of LDA [22]. Topic model ensembles have the potential for applications such as distributed topic modeling for large corpora and incremental topic modeling for rapidly growing corpora, being applied in various fields such as healthcare [23], biomedicine [24], hospital readmission cost optimization [25], and social media content summarization [26].
We extend topic modeling ensembles to deal with ad hoc IR, studying the performance of Bagging [27] and Boosting [28]. Then, we use a simple but effective list ranking fusion strategy that combines the partial rankings delivered by each ensemble model into a single ranking list. Using benchmark data to examine the performance of different IR models, we found that our proposal outperforms classic IR methods and the method proposed by Wei and Croft [19] in terms of precision and recall.
The main contributions of this work are the following:
- We extend topic modeling ensembles to the ad hoc IR domain, showing that this approach performs well in precision and recall in benchmark data;
- We combine the partial lists of each model into a consolidated ranking list. Our results show that the strategy is effective.
The main purpose of this work is to determine whether LDA-based ensemble strategies are helpful in IR. Furthermore, the design and study of different IR-based models and their validation in benchmark data will help elucidate whether the ensemble strategies that have proven successful in text classification are also competitive in IR. Accordingly, this work addresses the following research questions:
- RQ1: What level of improvement do ensembles of LDA-based models introduce in IR?
- RQ2: Which LDA-based ensemble strategies are most useful in IR?
This work is organized in the following sections. In Section 2, we review related work. Topic modeling ensembles for IR are introduced in Section 3. In Section 4, we present experimental results. We discuss implications of results and limitations of this study in Section 5. Finally, we conclude in Section 6, providing concluding remarks and outlining future work.

Related Work
A pioneering work on the use of ensemble learning for text processing is BoosTexter [29]. The proposed method was based on boosting algorithms for multilabel multiclass text categorization, outperforming text classifiers based on TF-IDF [12] and naive Bayes. The use of LDA-based features in boosting algorithms was introduced by La et al. [30]. The method, named LDABoost, uses latent topics extracted from one LDA model as text features. As base classifiers, LDABoost uses naive Bayes. The authors use mutual information as a metric for combining the base classifiers, generating a strong classifier. The experimental results show that LDABoost outperforms BoosTexter and other classical text classification methods. LDABoost has been explored in Chinese language corpora [31], showing promising results in high volume data and outperforming, in terms of precision, other text classification methods based on the BOW approach. The use of LDA features for ensemble-based classifiers has been applied in different tasks, such as visual concept detection in video [32], phishing website detection [33], and classification of grants [34]. Wang and Guo [35] also use LDA in text classification based on boosting. The proposed method uses several LDA-based methods, each of which is used to build a classifier. The authors estimate the classification error to calculate the weight of each classifier. Finally, a new classifier is built as a linear combination of the weak classifiers. The experimental results demonstrate that this algorithm performs better than classical methods in multilabeled corpora. Al-Salemi et al. [36] used supervised LDA [37] as a base model for text feature selection. This method makes use of labeled corpora to obtain the supervised topic model. The authors use a word selection method based on the LDA-topic weights to construct vector representations of the documents. These representations are used with AdaBoost for multilabel text categorization, showing promising results and outperforming classical methods for text classification.
Shen et al. [22] propose separating the corpus into subpartitions, fitting an LDA model in each data partition. Then, a representation of the terms is obtained in the latent topic space, concatenating the vectors of terms of each base topic model. The idea of partitioning the corpus to build base LDA models was later applied to different domains, since it allowed the information coming from the original corpus to be obfuscated. These privacy guarantees were explored in applications to healthcare systems [23], biomedicine [24], hospital readmission cost optimization [25], and social media content summarization [26]. Belford et al. [38] propose a method for topic modeling ensembles based on Non-Negative Matrix Factorization (NNMF) [39]. The proposal disaggregates the matrix representation of a corpus of tweets into two factors obtained using NNMF. To address the instability limitations produced by matrix factorization, the method integrates several NNMF-based models, consolidating the term-topic base matrices in a single term matrix representation. The procedure is evaluated in text clustering, improving the results obtained with a single NNMF-based model.
The use of data fusion methods for text clustering has also been explored in topic models. Pourvali et al. [40] propose calculating several topic models based on LDA, with different configurations according to the number of topics. For each of them, the proposed method leads a topic selection process at the level of each document. These topics are used to create a vector representation of each document. Finally, the technique conducts a document clustering process. Experimental results on different datasets show improvements in clustering results. Topic selection was also explored by Mendoza et al. [41], building document vector representations based on selected LDA-based topics according to topic coherence. Experimental results show that the proposal outperforms TF-IDF in text clustering tasks.
While LDA has been explored in ad hoc IR [19], topic modeling ensembles in IR remain almost unexplored. A closely related work but with very different evaluation assumptions is AdaRank [42]. AdaRank is an IR method based on boosting in the context of learning to rank. Learning to rank models make use of relevance-labeled corpora to train a supervised model for IR. In this context, the model is trained on pairs of documents and queries labeled with relevance scores. This valuable information allows a supervised learning algorithm to optimize the IR measure (e.g., mean average precision). AdaRank makes use of AdaBoost to fulfill this purpose. It should be noted that the context of ad hoc IR is different from that of learning to rank since ad hoc IR systems do not have relevance scores to build their models, assuming an unsupervised learning scenario.

Background
We introduce the necessary knowledge background to present our proposal. The environment needed for this work consists of the ad hoc IR method proposed by Wei and Croft [19], which extends the query likelihood model using topic models.
Formally, let C be a text corpus. Each document $d_i \in C$ is represented by a topic distribution $\theta_i = (\theta_{i,1}, \ldots, \theta_{i,K})$, where K represents the number of topics. The topic model provides a probability distribution $\phi_j$ over the words for each topic j. Accordingly, the topic model of C corresponds to the collection of topics $\Phi = \{\phi_1, \phi_2, \ldots, \phi_K\}$.
The method proposed by Wei and Croft [19] for ad hoc IR is based on the query likelihood model, which uses a probabilistic language model to infer the likelihood of generating a query Q from a document d:
$$P(Q|d) = \prod_{q \in Q} P(q|d),$$
where q is a query term and P(Q|d) is the likelihood of Q conditioned on d. P(q|d) is specified using Dirichlet smoothing [20]:
$$P(q|d) = \frac{N_d}{N_d + \mu} P_{ML}(q|d) + \left(1 - \frac{N_d}{N_d + \mu}\right) P_{ML}(q|C),$$
where $P_{ML}(q|d)$ is the maximum likelihood estimator of the query term q conditioned on the document d, given by
$$P_{ML}(q|d) = \frac{n_{d,q}}{N_d},$$
where $n_{d,q}$ is the number of occurrences of q in d, and $N_d$ is the number of tokens in d. $P_{ML}(q|C)$ is the maximum likelihood estimator of q conditioned on C, i.e., the term bias of q on the corpus C, also known as the prior of q. The µ parameter corresponds to the Dirichlet prior, which controls the relative weight of each term in the estimate. Note that if q does not appear in d, the first term of the estimate goes to zero, but the estimate P(q|d) is not zero due to the term bias factor $P_{ML}(q|C)$. The smoothing effect improves the chances of recovering more relevant documents in the ranking list. Empirical results on benchmark data show that µ can be fixed at 1000, offering good results in ad hoc IR tasks.
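The Dirichlet-smoothed query likelihood described above can be illustrated with a minimal Python sketch. The function name, the toy documents, and the use of log-probabilities to avoid numerical underflow are assumptions made for illustration; they are not part of the original method description.

```python
import math
from collections import Counter


def dirichlet_query_log_likelihood(query, doc, collection_counts, collection_len, mu=1000.0):
    """Return log P(Q|d) using Dirichlet smoothing (illustrative helper)."""
    doc_counts = Counter(doc)
    n_d = len(doc)
    weight = n_d / (n_d + mu)  # N_d / (N_d + mu)
    log_p = 0.0
    for q in query:
        p_ml_doc = doc_counts[q] / n_d if n_d else 0.0   # P_ML(q|d)
        p_ml_col = collection_counts[q] / collection_len  # P_ML(q|C)
        p = weight * p_ml_doc + (1.0 - weight) * p_ml_col
        if p == 0.0:
            # The term never occurs in the collection, so smoothing cannot help.
            return float("-inf")
        log_p += math.log(p)
    return log_p


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "topic models for information retrieval".split(),
    "d2": "boosting and bagging for classification".split(),
}
collection_counts = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection_counts.values())

query = "topic retrieval".split()
ranking = sorted(
    docs,
    key=lambda d: dirichlet_query_log_likelihood(
        query, docs[d], collection_counts, collection_len
    ),
    reverse=True,
)
```

In this sketch, d2 contains none of the query terms but still receives a finite score through the collection term, which is the smoothing effect discussed in the text.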
The estimate P(q|d) is critical for the retrieval task. Wei and Croft [19] propose combining the original document model with the model obtained using LDA through a linear combination of both approaches:
$$P(q|d) = \lambda \left( \frac{N_d}{N_d + \mu} P_{ML}(q|d) + \left(1 - \frac{N_d}{N_d + \mu}\right) P_{ML}(q|C) \right) + (1 - \lambda) P_{lda}(q|d),$$
where λ controls the relative weight between Dirichlet smoothing and LDA, with λ ∈ [0, 1]. When λ = 1, P(q|d) corresponds to the estimate proposed by Zhai and Lafferty [20]. Wei and Croft [19] have shown that λ = 0.7 offers a good balance between Dirichlet smoothing and LDA. As LDA models word correlations, $P_{lda}(q|d)$ may reach a high value if d includes words that correlate with q even if q does not appear in d. $P_{lda}(q|d)$ is obtained from a generative expression of q using Dirichlet priors:
$$P_{lda}(q|d) = \int P(\theta_d|\alpha) \left( \sum_{z_n} P(z_n|\theta_d) P(q|z_n, \beta) \right) d\theta_d,$$
where $\theta_d$ indicates topic proportions in d. Then, $z_n$, the latent variable that produces q, is conditioned on β and represents the sampling probability of q on d. The β parameter controls the level of smoothness of the density function of the vocabulary simplex. Typically, β is fixed at 0.01. The α parameter is known as the Dirichlet hyperparameter of LDA and controls the level of smoothness/sharpness of the density function around the centroid of the simplex.
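The interpolation between Dirichlet smoothing and LDA can be sketched as follows. This sketch assumes that an already fitted topic model supplies point estimates of the document-topic proportions (`theta_d`) and the topic-word distributions (`phi`), so that the LDA term is approximated by a sum over topics of P(q|z) P(z|d). All function names and the toy values are hypothetical.

```python
def p_lda(term, theta_d, phi):
    """Point-estimate approximation: sum over topics of P(term|z) * P(z|d)."""
    return sum(theta_d[z] * phi[z].get(term, 0.0) for z in range(len(theta_d)))


def p_combined(term, doc_counts, n_d, p_col, theta_d, phi, lam=0.7, mu=1000.0):
    """Linear combination of the Dirichlet-smoothed estimate and the LDA estimate."""
    weight = n_d / (n_d + mu)
    p_smooth = weight * (doc_counts.get(term, 0) / n_d) + (1.0 - weight) * p_col
    return lam * p_smooth + (1.0 - lam) * p_lda(term, theta_d, phi)


# Toy model with two topics (hypothetical values, for illustration only).
phi = [
    {"car": 0.5, "automobile": 0.5},  # topic 0: vehicles
    {"bank": 0.6, "river": 0.4},      # topic 1: geography
]
theta_d = [0.9, 0.1]      # the document is mostly about topic 0
doc_counts = {"car": 4}   # the document never mentions "automobile"
n_d = 4
p_col = 0.01              # assumed collection probability of "automobile"

with_lda = p_combined("automobile", doc_counts, n_d, p_col, theta_d, phi, lam=0.7)
without_lda = p_combined("automobile", doc_counts, n_d, p_col, theta_d, phi, lam=1.0)
```

Here the query term "automobile" does not occur in the document, yet the combined estimate is much higher than the purely smoothed one because the document's dominant topic assigns a high probability to the term.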

Topic Modeling Ensembles
The general scheme of the strategies studied in this work is shown in Figure 1. For all the ensemble learning strategies studied in this work, the corpus is divided into m document partitions, and an LDA model is fitted in each of them. To tackle the ad hoc information retrieval task, we use the LDA-based IR strategy proposed by Wei and Croft [19]. Then, given a BOW query, we produce a ranking list from each model. Finally, we build a consolidated document ranking list using a list merge method known as CombMNZ, successfully validated in ad hoc IR [43].
We study three ensemble learning strategies for ad hoc information retrieval. First, we split the corpus at random into m disjoint partitions. Accordingly, the models fitted to these partitions are trained independently of each other. A second approach is based on Bagging [27], in which we sample the corpus at random with replacement. Accordingly, the models are related to each other because the partitions overlap, so some documents are included in several LDA models simultaneously. Finally, we examine the performance of Boosting [28], sampling the corpus with an adaptive resampling strategy in which documents with a lower quality of fit to an LDA model have a higher probability of being sampled. This approach defines a chained resampling strategy, in which the sampling probability at the document level depends on the goodness of fit of the previous models. We now explain each of these strategies in detail.
Disjoint corpus partitioning (LDA Ens): This setting is similar to the one proposed by Shen et al. [22], in which the topic model ensembles are used for multiclass text classification. In our problem setting, the m disjoint partitions are obtained by splitting the corpus at random. All the documents in the corpus are used to build the partitions.
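A random split of the corpus into m disjoint partitions can be sketched as follows. The function name and the round-robin assignment after shuffling are illustrative assumptions; the text only states that the split is random and that every document is used.

```python
import random


def disjoint_partitions(doc_ids, m, seed=0):
    """Shuffle the document ids and deal them into m disjoint partitions."""
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    # Partition i receives every m-th document starting at position i.
    return [shuffled[i::m] for i in range(m)]


parts = disjoint_partitions(range(10), m=3)
```

Each document lands in exactly one partition, so one LDA model would then be fitted per partition.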
Bagging-based corpus sampling (BAGG Ens): BAGG Ens works using document sampling with replacement. This strategy implies that the probability of sampling one document in a partition is independent of the probability of sampling other documents. Since the sampling is with replacement, a document can be included in more than one partition and more than once in the same partition. BAGG Ens considers each sample corpus to be the same size as the original corpus. In this way, the strategy obtains m versions of the corpus, introducing diversity between them.
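Bootstrap sampling of the corpus, in which a document may be drawn several times and each sample has the size of the original corpus, can be sketched as follows. The function name is hypothetical.

```python
import random


def bootstrap_samples(doc_ids, m, seed=0):
    """Draw m bootstrap samples, each the same size as the original corpus."""
    rng = random.Random(seed)
    docs = list(doc_ids)
    # random.choices samples with replacement, so duplicates are allowed.
    return [rng.choices(docs, k=len(docs)) for _ in range(m)]


samples = bootstrap_samples(range(100), m=5)
```

Each of the m samples would then be used to fit one LDA model of the ensemble.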
Boosting-based corpus sampling (ADA Ens): ADA Ens works using adaptive boosting (AdaBoost) [28]. This strategy fits the LDA models in sequence. Given a new model in the ensemble, the partition on which the new model fits is obtained by sampling the documents according to the document's error of fitness to the immediately previous model. The sampling probabilities are proportional to the error of fitness so that the new models specialize in representing documents that have not been adequately modeled. To quantify the error of fitness, we build a probabilistic language model $M_d$ from each document d in the corpus. We create a unigram language model, so the word order is irrelevant. Accordingly, the language model of a document d corresponds to a multinomial distribution over words:
$$P(d|M_d) = \frac{L_d!}{TF_{t_1,d}! \, TF_{t_2,d}! \cdots TF_{t_M,d}!} \prod_{i=1}^{M} P(t_i|M_d)^{TF_{t_i,d}},$$
where $TF_{t_i,d}$ is the frequency of the term $t_i$ in d, $L_d$ is the length of d measured as the number of terms that compose it, and M is the number of words in the vocabulary of the corpus. The first factor on the right-hand side is the multinomial coefficient, which accounts for all possible orderings of words. We can estimate the probabilities of words from the document language model $M_d$ using maximum likelihood:
$$P(t_i|M_d) = \frac{TF_{t_i,d}}{L_d}.$$
Given an LDA topic model $M_{lda}$, we can estimate the probability of a term t conditioned on the document d from a generative expression based on Dirichlet priors:
$$P(t|d, M_{lda}) = \int P(\theta_d|\alpha) \left( \sum_{z_n} P(z_n|\theta_d) P(t|z_n, \beta) \right) d\theta_d,$$
where $\theta_d$ indicates topic proportions in d and $z_n$ is the latent variable that produces t, conditioned on β. Therefore, to compute the error of fitness, we measure the divergence between the word probabilities of the language model and those of the LDA model, defined from the Kullback-Leibler divergence:
$$KL(M_d \,\|\, M_{lda}) = \sum_{i=1}^{M} P(t_i|M_d) \log \frac{P(t_i|M_d)}{P(t_i|d, M_{lda})}.$$
Finally, we use the standard AdaBoost framework, defining an error coefficient from the error of fitness of each document in the corpus. Then, we define the document sampling probability for the (t + 1)-th iteration of the ensemble:
$$D^{(t+1)}_d = \frac{D^{(t)}_d \exp\left(\alpha^{(t)}_d\right)}{Z_t},$$
where $D^{(t+1)}_d$ and $D^{(t)}_d$ are the probabilities of sampling d in iterations t + 1 and t, $\alpha^{(t)}_d$ is the error coefficient of d in the t-th iteration, and $Z_t$ is a normalization factor. To initialize the sampling probabilities, in the first iteration $D^{(1)}_d = 1/N$, where N is the number of documents in the corpus C. ADA Ens works using document sampling with replacement. Therefore, a document can be included in more than one partition and more than once in the same partition. ADA Ens considers each sample corpus to be the same size as the original corpus.
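The adaptive resampling step can be sketched in Python as follows. Two assumptions are made loudly here: the per-document error coefficient is taken to be the Kullback-Leibler error of fitness itself (the text does not specify how the coefficient is derived from the error), and the update multiplies the previous probability by the exponential of that coefficient before normalizing. The function names and toy probabilities are hypothetical.

```python
import math
import random
from collections import Counter


def kl_error(doc_tokens, p_lda_doc, eps=1e-12):
    """KL divergence between the document's ML unigram model and the LDA model.

    Terms absent from the document contribute zero, so only observed terms are summed.
    """
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    err = 0.0
    for term, tf in counts.items():
        p_ml = tf / length
        p_model = max(p_lda_doc.get(term, 0.0), eps)  # guard against log of zero
        err += p_ml * math.log(p_ml / p_model)
    return err


def update_sampling_probs(probs, coeffs):
    """AdaBoost-style update: scale by exp(coefficient), then normalize."""
    unnorm = {d: probs[d] * math.exp(coeffs[d]) for d in probs}
    z = sum(unnorm.values())  # normalization factor
    return {d: v / z for d, v in unnorm.items()}


def sample_corpus(probs, seed=0):
    """Sample a corpus of the original size with replacement, using the weights."""
    rng = random.Random(seed)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[d] for d in ids], k=len(ids))


# Toy example: d1 is well modeled by LDA, d2 is poorly modeled (hypothetical values).
corpus = {"d1": ["a", "a", "b"], "d2": ["c", "c", "c"]}
lda_probs = {"d1": {"a": 2 / 3, "b": 1 / 3}, "d2": {"c": 0.1, "a": 0.9}}

probs = {d: 1 / len(corpus) for d in corpus}  # uniform initialization, 1/N
# ASSUMPTION: the error coefficient equals the KL error of fitness.
coeffs = {d: kl_error(corpus[d], lda_probs[d]) for d in corpus}
new_probs = update_sampling_probs(probs, coeffs)
next_sample = sample_corpus(new_probs)
```

In this toy case the poorly modeled document d2 receives most of the sampling mass in the next iteration, which is the behavior the strategy aims for.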

Ranking Fusion Strategy
We combine top-k lists of relevant documents from each LDA model using a ranking fusion strategy based on a linear combination of scores. The strategy takes advantage of the fact that different retrieval models may retrieve different documents for a single query. Thus, the potential global relevance of a document correlates with the number of models that suggest it. Specifically, we use CombMNZ [44], which multiplies the number of top-k lists where the document occurs by the sum of the scores obtained across all lists:
$$CombMNZ(d) = \left|\{l : d \in l\}\right| \cdot \sum_{l : d \in l} P_l(q|d),$$
where $P_l(q|d)$ is the score of d in the top-k rank list l. CombMNZ is a simple but effective technique for ranking fusion that has shown good performance in TREC datasets, which is the reason why we adopt it as a ranking fusion method for our proposal.
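A minimal Python sketch of CombMNZ fusion follows. It assumes that the scores in each list are non-negative and comparable across lists (for example, after a normalization step that is not described in the text); the function name and the toy lists are hypothetical.

```python
from collections import defaultdict


def comb_mnz(ranked_lists, k=10):
    """Fuse top-k lists of (doc_id, score) pairs with CombMNZ.

    Each input list is assumed to be sorted by descending score.
    """
    score_sum = defaultdict(float)
    hits = defaultdict(int)
    for ranked in ranked_lists:
        for doc_id, score in ranked[:k]:
            score_sum[doc_id] += score  # sum of scores across lists
            hits[doc_id] += 1           # number of lists containing the document
    fused = {d: hits[d] * score_sum[d] for d in score_sum}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy lists from three ensemble members (hypothetical scores).
lists = [
    [("a", 0.9), ("b", 0.5)],
    [("b", 0.8), ("c", 0.7)],
    [("b", 0.6), ("a", 0.4)],
]
fused = comb_mnz(lists, k=10)
```

Document b appears in all three lists, so its summed score is multiplied by three and it ranks first even though it never tops an individual list.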

Experimental Results
We evaluate the proposal on four standard benchmark document collections. These datasets are MED, CRAN, CISI, and CACM, which can be freely accessed (http://ir.dcs.gla.ac.uk/resources/test_collections/ accessed on 1 March 2021). Table 1 shows basic statistics of these datasets. We compare the performance of our three methods, LDA Ens, BAGG Ens, and ADA Ens, with two strong baselines: LDA, the method introduced by Wei and Croft [19], and TF-IDF [12], a classic and successful term-weighting scheme used in ad hoc IR. In addition, we included in the evaluation two model-based IR methods. The first one is DBNIRM (Dependency Bayesian Network-based Information Retrieval Model) [45], a Bayesian network-based IR model that achieves good retrieval performance by detecting the most salient dependencies between terms in a term-based Bayesian network. Identifying pairs of related terms is helpful in IR, as it helps determine semantic relations between documents and query terms. We also included a second model-based IR method named CCLR (Concept Coupling Learning Retrieval) [9], which uses concept lattices to model dependency relationships between document terms. Like DBNIRM, CCLR allows identifying the pairs of concepts that are most strongly related, combining criteria of intra- and inter-document conceptual coupling.
We use Mean Average Precision (MAP), Precision, Recall, and F1 at top-k lists with 5, 10, and 20 results as evaluation metrics. As the four datasets have vocabularies of comparable sizes, we use the same number of topics for all the datasets. In [41], we show that using a high number of topics in these datasets allows finding topics with high coherence. Accordingly, we set the number of topics at 100 to help the topics identify lists of highly correlated descriptive words.
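The evaluation metrics at a cutoff k can be sketched as follows. The normalization of average precision at k varies between tools; this sketch divides by the smaller of k and the number of relevant documents, which is an assumption rather than the definition confirmed by the text. MAP is then the mean of this value over all queries. Function names and the toy ranking are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def f1_at_k(ranked, relevant, k):
    """Harmonic mean of precision and recall at k."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision_at_k(ranked, relevant, k):
    """Average of the precision values at the ranks of relevant hits."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)  # ASSUMPTION: one common normalization choice
    return total / denom if denom else 0.0


ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
f5 = f1_at_k(ranked, relevant, 5)
ap5 = average_precision_at_k(ranked, relevant, 5)
```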
Since our methods depend on the sampling process, each ensemble-based model was evaluated five times. Accordingly, the reported results are the average over the five trials. In TF-IDF, LDA, DBNIRM and CCLR, the results do not vary between different trials because they do not operate on corpus samples but on the entire collection. For LDA, we tested 20 runs over different hyperparameter settings for α and β. We did not find significant differences in terms of MAP for the different configurations used. Accordingly, we decided to use the values proposed in [46], that is, α = 50/K and β = 0.01. We evaluate the effect of the number of models in each ensemble. We measure the impact of the number of models in terms of the four performance measures, finding that they show consistent results. We report the results in terms of MAP in Table 2 in top-10 lists. Table 2 shows the lack of a clear pattern of dependency between the number of models required to obtain the best configuration and the ensemble model. For LDA Ens, the best results in MED, CRAN, and CISI are obtained using five models. However, in CACM, LDA Ens requires ten models. BAGG Ens achieves its best results in MED and CRAN using 15 models. In CISI, the best results are achieved using 20 models, but in CACM, only one is needed. Finally, ADA Ens obtains its best result in MED and CACM using only one model, while in CRAN, it requires five and in CISI ten.
In most cases, the performance improves when using more models. In the case of LDA Ens, the best results are always obtained with five or more models. When using BAGG Ens, MED, CRAN, and CISI all require at least 15 models. Regarding the datasets, the most difficult is CACM, in which all strategies consistently obtain the lowest results. In this dataset, BAGG Ens and ADA Ens show that ensemble learning achieves no performance improvements. To compare the results of these strategies with the baselines, we use the best configurations in terms of the number of models indicated in Table 2. The results in terms of MAP, Precision, Recall, and F1 are shown for lists @5, @10, and @20 in Tables 3-5, respectively. Differences between models and baselines are statistically significant with 95% confidence according to the Wilcoxon test. The results in Tables 3-5 show that LDA is very competitive, outperforming TF-IDF in MED and CACM in all comparisons. However, the LDA results in CRAN and CISI show a deterioration compared to those obtained by TF-IDF. DBNIRM is also a competitive method, outperforming CCLR and achieving competitive results with TF-IDF on all datasets. This result indicates that identifying dependencies between pairs of terms is relevant to improving the description of documents and better matching the query terms. This idea is also exploited by topic models, which identify, for each topic, lists of related terms that improve the descriptive capacity of the documents. Specifically, DBNIRM obtains very competitive results in CRAN and CISI, especially in lists @10 and @20, where it manages to surpass LDA and TF-IDF in MAP and precision but obtains lower results in recall. On the other hand, CCLR consistently shows lower results than DBNIRM, showing its best results in MED and CACM for @20 lists. Extending LDA with ensemble learning yields significant improvements in many cases. For example, BAGG Ens outperforms all its competitors in MED, CRAN, and CACM by a substantial margin in results @5. The difference between BAGG Ens and LDA narrows in @10 and @20 results. BAGG Ens outperforms its competitors in MED and CRAN in results @10. In results @20, LDA is the most robust method, being only surpassed by BAGG Ens in CACM. LDA Ens is also a competitive method, obtaining good performance in results @10 and achieving the best results in MAP for CRAN and CISI. LDA Ens maintains its good performance in CISI for results @20, obtaining the best performance in MAP. Regarding ADA Ens, this strategy outperforms its competitors only in results @5 in CISI. In the rest of the comparisons, ADA Ens fails to beat its competitors.
The fact that ADA Ens fails to outperform its competitors indicates that adaptive sampling is ineffective when working in tandem with topic models. On the other hand, domain partitioning based on disjoint partitions (LDA Ens) or bootstrap resampling (BAGG Ens) shows greater effectiveness. This finding is related to the potentialities and limitations of the topic models used to generate the ensembles, which fail to identify more valuable topics for complex documents. Instead, LDA takes more advantage of non-adaptive resampling strategies. Resampling allows discarding documents in specific partitions, introducing a greater variety in the samples.

Discussion
An interesting result shown in Tables 3-5 is related to the effectiveness of the ensemble learning techniques in terms of the lengths of the results lists. While ensemble learning results are better on shorter lists (@5), they deteriorate as the lists become longer. In fact, in @20 lists, LDA outperforms ensemble learning in MED, CRAN, and CISI, while BAGG Ens only maintains its performance in CACM. This finding indicates that ensemble learning techniques allow identifying more relevant results only in the first positions of the lists, suggesting that the descriptive word lists of the topics found may differ. This fact would explain the differences between the ensemble strategies.
To illustrate the differences between the four methods based on topic models, we compare the top-5 words of the highly coherent topics detected for LDA in each dataset. These topics were searched in the other methods (LDA Ens, BAGG Ens and ADA Ens), identifying the differences between these word lists. For each topic model strategy, we selected the model closest to the average performance shown in Tables 3-5, making the comparison consistent and fair. The results of this comparative analysis are shown in Table 6.
In Table 6, we highlight the words that complement the list of words detected by LDA. First, for each topic, we computed the IDF scores of the top-5 LDA words. Then, new words identified by LDA Ens, BAGG Ens, or ADA Ens whose IDF is above the maximum or below the minimum of these scores are considered words with more specific or more general meanings, respectively. The most generic words are indicated in red, while the most specific ones are displayed in blue. Table 6 shows that the three ensemble strategies manage to identify new words for the topics detected by LDA. While most of the detected words are generic, some specific words complement the description of the original topic. All the words added by these strategies have a semantic relationship to the original topic, except for drum (indicated in green), which has no apparent semantic connection with topic 2 in CACM. LDA Ens, BAGG Ens, and ADA Ens all seem to detect specific words depending on the topic. This finding is interesting since it shows that the topics detected may have more or less specificity depending on the ensemble strategy. We note some differences between the strategies. LDA Ens works on independent partitions of the corpus, and this partitioning strategy allows it to detect more generic words. BAGG Ens and ADA Ens, because they specialize in documents that are more complex to model, tend to detect more specific words. To corroborate this intuition, we show in Figure 2 the IDF distributions for each strategy in each dataset studied in this work.
Table 6. Top-5 words per topic for the proposed ensemble strategies. The most generic words are indicated in red, while the most specific ones are displayed in blue. Off-topic words are displayed in green.
To create the boxplots in Figure 2, we selected the top-20 highly coherent topics of each strategy in each dataset. Then, for each of these topics, we picked its top-10 most descriptive terms and calculated their IDF scores in the dataset. The boxplots of Figure 2 show some interesting results. The IDF distributions in MED are the most disparate, with BAGG Ens and ADA Ens being the strategies that manage to identify more specific words. This result coincides with the performances obtained by these strategies, which are the best found in this study. On the other hand, in both CRAN and CACM, ADA Ens cannot identify specific words, having the lowest median IDF of the four strategies. In these datasets, LDA and BAGG Ens slightly outperform the other strategies in median IDF. Finally, in CISI, none of the strategies can identify more specific words than the rest. This result coincides with the fact that the performances of the four strategies indicated in Tables 3-5 are quite even. In summary, Figure 2 shows that the ability of each strategy to identify specific words in each topic varies according to the dataset. While BAGG Ens and LDA identify specific words, the other strategies do not seem to have a significant ability to detect specific words in each topic.
Now, we study the nature of the queries in which the proposed methods perform better than their competitors. First, we determine the set of queries where any of the LDA-based methods beats its contenders by at least 10% in MAP@5, so that the advantage obtained by the method is significant. The average performance model indicated in Table 3 is used to conduct this analysis, favoring a fair comparison between the different strategies considered in this work. Queries where none of the methods managed to gain a significant margin were excluded from the analysis. We show in Table 7 the list of queries for each dataset where a clear winning method was observed in MAP@5. We show the id of the query, its query words, and the name of the winning method.
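The query selection described above can be illustrated with a minimal sketch. It is not the implementation used in this study, and it rests on several assumptions: the per-query score is average precision at 5 (AP@5), normalized by min(k, number of relevant documents); the 10% margin is read as a relative advantage over the best competitor; and the function names, document ids, and rankings are hypothetical toy values rather than data from the collections.

```python
from typing import Dict, List, Optional, Set


def average_precision_at_k(ranking: List[str], relevant: Set[str], k: int = 5) -> float:
    """AP@k for one query: sum of precision@i at each relevant hit in the
    top-k, divided by min(k, number of relevant documents) (assumed variant)."""
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for i, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / i  # precision at rank i
    return score / min(k, len(relevant))


def clear_winner(scores: Dict[str, float], margin: float = 0.10) -> Optional[str]:
    """Return the method that beats the runner-up by at least `margin`
    (relative, an assumption), or None if no method has a clear advantage."""
    if len(scores) < 2:
        return None
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_name, best), (_, second) = ordered[0], ordered[1]
    if best > 0 and best >= (1.0 + margin) * second:
        return best_name
    return None


# Toy example: one query, two relevant documents, four hypothetical rankings.
relevant_docs = {"d1", "d4"}
rankings = {
    "LDA": ["d1", "d2", "d4", "d3", "d5"],
    "LDA Ens": ["d2", "d3", "d1", "d5", "d4"],
    "BAGG Ens": ["d1", "d4", "d2", "d3", "d5"],
    "ADA Ens": ["d2", "d1", "d3", "d4", "d5"],
}
scores = {name: average_precision_at_k(r, relevant_docs) for name, r in rankings.items()}
winner = clear_winner(scores)  # queries with winner None would be excluded
print(scores, winner)
```

Under these assumptions, a query for which `clear_winner` returns `None` would be left out of the analysis, which mirrors the exclusion rule stated above. Reading the margin as an absolute difference instead of a relative one would only change the comparison inside `clear_winner`.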
The results in Table 7 show that LDA and BAGG Ens are the methods that most often surpass their competitors in MAP@5. In MED and CISI, BAGG Ens outperforms its competitors in more queries than the rest of the methods, while in CRAN and CACM, both BAGG Ens and LDA are very competitive. LDA Ens does not significantly outperform its competitors in MAP@5 in any query, showing that this method, although it obtains an interesting average result, does not outperform the rest consistently. On the other hand, ADA Ens only outperforms its competitors in some CRAN queries. Overall, LDA and BAGG Ens are the methods that outperform the rest, offering competitive results in all datasets. The column that indicates the length of the queries shows no relationship between this variable and the winning method. Both BAGG Ens and LDA exhibit the best performance in long and short queries, and we observe no clear pattern of dependence between the type of ensemble strategy and the query length.
The results show another important finding. While CRAN has twice as many queries as CISI, the number of queries in which our ensemble methods outperform their competitors shows a ratio of 4 to 1. This ratio can be attributed to the fact that CRAN's vocabulary is smaller than CISI's, which would make it easier to model. The results of Tables 3-5 show that the datasets in which the methods obtain better results are MED and CRAN, which are the datasets with the smaller vocabularies.

Limitations of This Study
Due to the high computational cost of the experiments, which required carrying out several trials for each topic model, it was not easy to experiment on larger datasets, such as the Tipster datasets (TREC), which are not in the public domain.
Instead, given the limited access to computational resources, the experiments were carried out on smaller datasets, which allowed us to control the use of the resources available for this study. Although this limitation is important, it does not undermine the validity of the conclusions, since the four datasets used in the experimental validation are frequently used in studies of this type. It would be desirable to overcome these limitations with work that studies different aspects of the efficiency of these methods, allowing them to scale to larger document collections. The study of these aspects exceeds the objectives of this article, although we understand that they are fundamental for the applicability of these methods.

Conclusions
This study has extended ensemble strategies based on topic models to the ad hoc IR domain. These classic machine learning strategies have been widely studied in text classification, but their use in IR still seems incipient. Accordingly, we have studied three different ensemble strategies for IR, showing that these strategies manage to identify more relevant documents than two competitive baselines at the top of the results lists. However, when the results lists are longer, the differences between these methods decrease. Our experiments show that performance is related to the specificity of the words detected in the topics, for which BAGG Ens emerges as the most effective strategy. No dependence was detected between the performance of the methods and the length of the queries.
Concerning RQ1, this work shows that model ensemble strategies based on LDA topic models are competitive in IR, offering improvements over solid baselines such as TF-IDF and outperforming IR strategies based on Bayesian networks of terms or conceptual lattices. The advantages they offer over other strategies are especially relevant in the first positions of the results lists, but they lose effectiveness as the lists become longer. Regarding RQ2, this work shows that the most effective strategy is BAGG Ens. This strategy is especially effective on @5 lists, in which it achieves statistically significant advantages over other competitive methods. In contrast, although ADA Ens manages to identify more specific words in some queries, which improves the descriptive capacity of queries and documents, this does not necessarily translate into better precision or recall. A similar result has been observed in models based on networks of terms such as DBNIRM and in models based on concepts such as CCLR, which effectively identify pairs of related terms without necessarily improving precision and recall.
In future work, the efficiency aspects of these methods should be studied with care. In addition, the enormous volume of data on the web indicates that the scalability of these methods is an issue that needs to be addressed carefully in future studies.