Generalized Analysis of a Distribution Separation Method

Separating two probability distributions from a mixture model composed of the two is essential to a wide range of applications. For example, in information retrieval (IR), there often exists a mixture distribution consisting of a relevance distribution that we need to estimate and an irrelevance distribution that we hope to remove. Recently, a distribution separation method (DSM) was proposed to approximate the relevance distribution by separating a seed irrelevance distribution from the mixture distribution. It was successfully applied to an IR task, namely pseudo-relevance feedback (PRF), where the query expansion model is often a mixture term distribution. Although initially developed in the context of IR, DSM is a general mathematical formulation for probability distribution separation. Thus, it is important to further generalize its basic analysis and to explore its connections to other related methods. In this article, we first extend DSM’s theoretical analysis, which was originally based on the Pearson correlation coefficient, to entropy-related measures, including the KL-divergence (Kullback–Leibler divergence), the symmetrized KL-divergence and the JS-divergence (Jensen–Shannon divergence). Second, we investigate the distribution separation idea in a well-known method, namely the mixture model feedback (MMF) approach. We prove that MMF also complies with the linear combination assumption, so that DSM’s linear separation algorithm can largely simplify the EM algorithm in MMF. These theoretical analyses, together with further empirical evaluation results, demonstrate the advantages of our DSM approach.


Introduction
In information retrieval, a typical post-query process is relevance feedback, which builds a refined query model (often a term distribution) based on a set of feedback documents, in order to better represent the user's information need [1]. There are three types of relevance feedback methods, i.e., explicit, implicit and pseudo-relevance feedback. Among them, pseudo-relevance feedback (PRF) is a fully automatic approach to query expansion, which assumes that the top-ranked documents returned by an information retrieval (IR) system are relevant. A widely-used PRF method is the relevance model (RM) [2], which utilizes the top-ranked documents D to construct a relevance term distribution R. One limitation of RM-based methods is that the feedback document set D is often a mixture of relevant and irrelevant documents, so that R is very likely to be a mixture distribution rather than the true relevance distribution.
In DSM [6], a part of the irrelevance distribution (the seed irrelevance distribution, denoted as $I_S$) is available, while the other part of the irrelevance distribution is unknown (denoted as $I_{\bar{S}}$). In the notation used below, $F(i)$ is the probability of the i-th term in any distribution F, and $l(F, G)$ is a linear combination of distributions F and G.
The task of DSM is defined as follows: given a mixture distribution M and a seed irrelevance distribution $I_S$, derive an output distribution that approximates R as closely as possible. Specifically, as shown in Figure 1, the task of DSM can be divided into two problems: (1) how to separate $I_S$ from M and derive a less noisy distribution $l(R, I_{\bar{S}})$, which is a mixture of R and $I_{\bar{S}}$; (2) how to further refine $l(R, I_{\bar{S}})$ to approximate R as closely as possible. In this article, we focus on the first problem and on the linear separation algorithm that derives $l(R, I_{\bar{S}})$. Note that $l(R, I_{\bar{S}})$ is also an estimate of R, whose quality depends on how much irrelevance data are available. The theoretical analysis proposed in this article is mainly related to the linear separation algorithm and its lower bound analysis.

Linear Combination Analysis
DSM adopts a linear combination assumption, which states that the mixture term distribution is a linear combination of the relevance and irrelevance distributions. Under this condition, the mixture distribution M is a linear combination of R and I. As shown in Figure 1, M can also be written as a linear combination of two distributions, $I_S$ and $l(R, I_{\bar{S}})$, where $l(R, I_{\bar{S}})$ is a linear combination of R and $I_{\bar{S}}$. We have:
$$M = \lambda \cdot l(R, I_{\bar{S}}) + (1 - \lambda) \cdot I_S \qquad (1)$$
where λ (0 < λ ≤ 1) is the linear coefficient. The problem of estimating $l(R, I_{\bar{S}})$ generally does not have a unique solution, since the value of the coefficient λ is unknown. Therefore, the key is to estimate λ. Let $\hat{\lambda}$ ($0 < \hat{\lambda} \le 1$) denote an estimate of λ, and correspondingly, let $\hat{l}(R, I_{\bar{S}})$ be the estimate of the desired distribution $l(R, I_{\bar{S}})$. According to Equation (1), we have:
$$\hat{l}(R, I_{\bar{S}}) = \frac{1}{\hat{\lambda}} \cdot M + \left(1 - \frac{1}{\hat{\lambda}}\right) \cdot I_S \qquad (2)$$
Once the right $\hat{\lambda}$ is obtained, Equation (2) is the main equation used to carry out the distribution separation in linear time. However, there are infinitely many possible choices of $\hat{\lambda}$ and its corresponding $\hat{l}(R, I_{\bar{S}})$. To obtain a solution for $\hat{\lambda}$, we need to find its lower bound, by introducing the constraint that all values in the distribution should be nonnegative [6]. Based on this constraint and Equation (2), we have:
$$\hat{\lambda} \times \mathbf{1} \succeq \mathbf{1} - M ./ I_S \qquad (3)$$
Effectively, Equation (3) sets a lower bound $\lambda_L$ of $\hat{\lambda}$:
$$\lambda_L = \max(\mathbf{1} - M ./ I_S) \qquad (4)$$
where $\mathbf{1}$ stands for a vector in which all of the entries are one, $./$ denotes the entry-wise division of M by $I_S$ and max(•) denotes the maximum value in the resultant vector $\mathbf{1} - M ./ I_S$. The lower bound $\lambda_L$ itself also determines an estimate of $l(R, I_{\bar{S}})$, denoted as $l_L(R, I_{\bar{S}})$.
The calculation of the lower bound $\lambda_L$ is critical to the estimation of λ. We now present an important property of $\lambda_L$ in Lemma 1. Lemma 1 guarantees that if the distribution $l(R, I_{\bar{S}})$ contains a zero value, then $\lambda = \lambda_L$, so that the distribution $l_L(R, I_{\bar{S}})$ w.r.t. $\lambda_L$ is exactly the desired distribution $l(R, I_{\bar{S}})$ w.r.t. λ.

Lemma 1.
If there exists a zero value in $l(R, I_{\bar{S}})$, then $\lambda = \lambda_L$, leading to $l(R, I_{\bar{S}}) = l_L(R, I_{\bar{S}})$.
The proof can be found in [6]. In a density estimation problem or a specific IR model estimation task in which a smoothing method is used, $l(R, I_{\bar{S}})$ would contain many small values instead of zero values. In this case, $l_L(R, I_{\bar{S}})$ is still approximately equal to $l(R, I_{\bar{S}})$, so that $\lambda_L$ remains approximately equal to λ. The detailed description of this remark can be found in [6].
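The lower bound and the separation step can be illustrated numerically. The following is a minimal sketch on toy data, not the authors' implementation: the function names `lower_bound` and `separate` and the four-term distributions are assumptions made for illustration. The hidden target distribution contains a zero entry, so by Lemma 1 the lower bound should coincide with the true coefficient.

```python
def lower_bound(mixture, seed):
    """Equation (4): largest entry of 1 - M ./ I_S.

    Entries with I_S(i) = 0 impose no constraint and are skipped.
    """
    return max(1.0 - m / s for m, s in zip(mixture, seed) if s > 0.0)


def separate(mixture, seed, lam_hat):
    """Equation (2): (1 / lam_hat) * M + (1 - 1 / lam_hat) * I_S, entry by entry."""
    xi = 1.0 / lam_hat
    return [xi * m + (1.0 - xi) * s for m, s in zip(mixture, seed)]


# Toy inputs (hypothetical): a target with one zero entry,
# mixed with the seed irrelevance distribution at lambda = 0.6.
target = [0.5, 0.3, 0.2, 0.0]
seed = [0.1, 0.2, 0.3, 0.4]
true_lambda = 0.6
mixture = [true_lambda * t + (1.0 - true_lambda) * s for t, s in zip(target, seed)]

lam_l = lower_bound(mixture, seed)
recovered = separate(mixture, seed, lam_l)
print(round(lam_l, 6))
print([round(p, 6) for p in recovered])
```

Both steps are a single pass over the vocabulary, which is where the linear running time comes from.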

Minimum Correlation Analysis
In this section, we study in depth another property of the combination coefficient and its lower bound. Specifically, we analyse how the correlation between $\hat{l}(R, I_{\bar{S}})$ and $I_S$ changes as the coefficient $\hat{\lambda}$ decreases. The Pearson product-moment correlation coefficient ρ [8] is used as the correlation measurement.
Proposition 1. If $\hat{\lambda}$ ($\hat{\lambda} > 0$) decreases, the correlation coefficient between $\hat{l}(R, I_{\bar{S}})$ and $I_S$, i.e., $\rho(\hat{l}(R, I_{\bar{S}}), I_S)$, will decrease.
The proof of Proposition 1 can be found in [6]. According to Proposition 1, among all $\hat{\lambda} \in [\lambda_L, 1]$, $\lambda_L$ corresponds to the minimum correlation coefficient between $\hat{l}(R, I_{\bar{S}})$ and $I_S$, i.e., min(ρ). We can also change the objective from the minimum correlation coefficient (i.e., min(ρ)) to the minimum squared correlation coefficient (i.e., min($\rho^2$)). For details on solving this optimization problem, please refer to [6].
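The monotonic behaviour stated in Proposition 1 can be checked on a small example. The sketch below is only an illustration under assumed toy inputs (the distributions and the helper names `pearson` and `separate` are hypothetical); it evaluates the Pearson correlation between the separated distribution of Equation (2) and the seed irrelevance distribution for a decreasing sequence of coefficients.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def separate(mixture, seed, lam_hat):
    """Equation (2), applied entry by entry."""
    xi = 1.0 / lam_hat
    return [xi * m + (1.0 - xi) * s for m, s in zip(mixture, seed)]


# Toy inputs (hypothetical); for these values the lower bound of Equation (4) is 0.3.
mixture = [0.19, 0.23, 0.30, 0.28]
seed = [0.1, 0.2, 0.3, 0.4]
lambdas = [1.0, 0.8, 0.6, 0.4, 0.3]

correlations = [pearson(separate(mixture, seed, lam), seed) for lam in lambdas]
for lam, rho in zip(lambdas, correlations):
    print(f"lambda_hat={lam:.1f}  rho={rho:+.4f}")
```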

Extended Analysis of DSM on Entropy-Related Measurements
As we can see from the previous section, although DSM was proposed in the pseudo-relevance feedback scenario, its algorithm and analysis are not restricted to query term distributions derived by PRF techniques. DSM is actually a mathematical formulation for probability distribution separation, and it is important to further investigate its theoretical properties.
In this section, we describe the generalization of DSM's analysis in terms of some entropy-related measures. Specifically, we will extend the aforementioned minimum correlation analysis to the analysis of the maximum KL-divergence, the maximum symmetrized KL-divergence and the maximum JS-divergence.

Effect of DSM on KL-Divergence
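Recall that in Section 2.2, Proposition 1 shows that after the distribution separation process, the Pearson correlation coefficient between DSM's output distribution $\hat{l}(R, I_{\bar{S}})$ and the seed irrelevance distribution $I_S$ can be minimized. Here, we further analyse the effect of DSM on the KL-divergence between $\hat{l}(R, I_{\bar{S}})$ and $I_S$.
Specifically, we propose the following Proposition 2, which states that if $\hat{\lambda}$ decreases, the KL-divergence between $\hat{l}(R, I_{\bar{S}})$ and $I_S$ increases monotonically.
Proposition 2. If $\hat{\lambda}$ ($\hat{\lambda} > 0$) decreases, the KL-divergence between $\hat{l}(R, I_{\bar{S}})$ and $I_S$ will increase.
Proof. Using the simplified notations in Table 2, and writing $\hat{l}(i)$ for the probability of the i-th term in $\hat{l}(R, I_{\bar{S}})$, let the KL-divergence between $\hat{l}(R, I_{\bar{S}})$ and $I_S$ be formulated as:
$$D(\hat{l}(R, I_{\bar{S}}), I_S) = \sum_{i=1}^{m} \hat{l}(i) \log \frac{\hat{l}(i)}{I_S(i)} \qquad (5)$$
Now, let $\xi = 1/\hat{\lambda}$, as we did in the proof of Proposition 1 (see [6]). According to Equation (2), we have:
$$\hat{l}(R, I_{\bar{S}}) = \xi M + (1 - \xi) I_S \qquad (6)$$
Based on Equations (5) and (6), we get:
$$D(\hat{l}(R, I_{\bar{S}}), I_S) = \sum_{i=1}^{m} \left[\xi M(i) + (1 - \xi) I_S(i)\right] \log \frac{\xi M(i) + (1 - \xi) I_S(i)}{I_S(i)} \qquad (7)$$
Let $D(\xi) = D(\hat{l}(R, I_{\bar{S}}), I_S)$. The derivative of $D(\xi)$ can be calculated as:
$$D'(\xi) = \sum_{i=1}^{m} \left[M(i) - I_S(i)\right] \log \frac{\xi M(i) + (1 - \xi) I_S(i)}{I_S(i)} + \sum_{i=1}^{m} \left[M(i) - I_S(i)\right] \qquad (8)$$
Since $\sum_{i=1}^{m} M(i) = 1$ and $\sum_{i=1}^{m} I_S(i) = 1$, the sum $\sum_{i=1}^{m} [M(i) - I_S(i)]$ becomes zero. We then have:
$$D'(\xi) = \sum_{i=1}^{m} \left[M(i) - I_S(i)\right] \log \frac{\xi M(i) + (1 - \xi) I_S(i)}{I_S(i)} \qquad (9)$$
Consider the i-th term in the summation of Equation (9):
$$\left[M(i) - I_S(i)\right] \log \left(1 + \xi \, \frac{M(i) - I_S(i)}{I_S(i)}\right)$$
It turns out that when $M(i) > I_S(i)$, both factors are positive; when $M(i) < I_S(i)$, both factors are negative; and when $M(i) = I_S(i)$, the term is zero. Hence, every term is nonnegative, and at least one term is strictly positive whenever M differs from $I_S$. In conclusion, we have $D'(\xi) > 0$. This means that $D(\xi)$ (i.e., $D(\hat{l}(R, I_{\bar{S}}), I_S)$) increases as ξ increases. Since $\hat{\lambda} = 1/\xi$, when $\hat{\lambda}$ decreases, $D(\hat{l}(R, I_{\bar{S}}), I_S)$ will increase.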

[Table 2: simplified notations (Original, Simplified, Linear Coefficient).]
According to Proposition 2, if $\hat{\lambda}$ is reduced to its lower bound $\lambda_L$, then the corresponding KL-divergence $D(l_L(R, I_{\bar{S}}), I_S)$ will be the maximum value over all of the legal $\hat{\lambda}$ ($\lambda_L \le \hat{\lambda} \le 1$). In this case, the output distribution of DSM will have the maximum KL-divergence with the seed irrelevance distribution.
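Proposition 2 can also be observed numerically. The following sketch uses assumed toy distributions and hypothetical helper names; it computes the KL-divergence between the separated distribution of Equation (2) and the seed irrelevance distribution for a decreasing sequence of coefficients, all kept strictly above the lower bound so that every entry stays positive.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum_i p(i) * log(p(i) / q(i)); terms with p(i) = 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def separate(mixture, seed, lam_hat):
    """Equation (2), applied entry by entry."""
    xi = 1.0 / lam_hat
    return [xi * m + (1.0 - xi) * s for m, s in zip(mixture, seed)]


# Toy inputs (hypothetical); the lower bound for these values is 0.3.
mixture = [0.19, 0.23, 0.30, 0.28]
seed = [0.1, 0.2, 0.3, 0.4]
lambdas = [1.0, 0.8, 0.6, 0.4]

divergences = [kl_divergence(separate(mixture, seed, lam), seed) for lam in lambdas]
for lam, d in zip(lambdas, divergences):
    print(f"lambda_hat={lam:.1f}  KL={d:.4f}")
```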

Effect of DSM on Symmetrized KL-Divergence
Having shown the effect of reducing the coefficient $\hat{\lambda}$ on the KL-divergence between $\hat{l}(R, I_{\bar{S}})$ and $I_S$, we now investigate the effect on the symmetrized KL-divergence between the two distributions by proving the following proposition.
Proposition 3. If $\hat{\lambda}$ ($\hat{\lambda} > 0$) decreases, the symmetrized KL-divergence between $\hat{l}(R, I_{\bar{S}})$ and $I_S$ will increase.
The proof of Proposition 3 can be found in Appendix A.1. According to the above proposition, if $\hat{\lambda}$ is reduced to its lower bound $\lambda_L$, the corresponding symmetrized KL-divergence between $I_S$ and $\hat{l}(R, I_{\bar{S}})$ will be the maximum value over all of the legal $\hat{\lambda}$ ($\lambda_L \le \hat{\lambda} \le 1$). This means that the output distribution of DSM given this lower bound estimate has the maximum symmetrized KL-divergence with the seed irrelevance distribution.

Effect of DSM on JS-Divergence
Now, let us further study the reduction of the coefficient $\hat{\lambda}$ in terms of its role in maximizing the JS-divergence between DSM's output distribution $\hat{l}(R, I_{\bar{S}})$ and the seed irrelevance distribution $I_S$, by presenting the following proposition.
Proposition 4. If $\hat{\lambda}$ ($\hat{\lambda} > 0$) decreases, the JS-divergence between $\hat{l}(R, I_{\bar{S}})$ and $I_S$ will increase.
The proof of Proposition 4 can be found in Appendix A.2. Based on the above proposition, if $\hat{\lambda}$ is reduced to its lower bound $\lambda_L$, then the corresponding JS-divergence $JS(\hat{l}(R, I_{\bar{S}}), I_S)$ will be the maximum value over all of the legal $\hat{\lambda}$ ($\lambda_L \le \hat{\lambda} \le 1$).
In summary, we have extended the analysis of DSM's lower bound combination coefficient from the minimum correlation analysis to the maximum KL-divergence analysis, the maximum symmetrized KL-divergence analysis and the maximum JS-divergence analysis. These extended analyses enrich DSM's theoretical properties.
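The same kind of numerical check applies to the symmetrized KL-divergence and the JS-divergence. In the sketch below, the symmetrized KL-divergence is taken as the sum of the two directed divergences, which is an assumption (a version scaled by one half behaves identically with respect to monotonicity); the toy distributions and helper names are likewise hypothetical.

```python
import math


def kl_divergence(p, q):
    """D(p || q), with zero-probability terms of p contributing zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def symmetrized_kl(p, q):
    """Assumed form: D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def js_divergence(p, q):
    """Jensen-Shannon divergence with the midpoint distribution (p + q) / 2."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def separate(mixture, seed, lam_hat):
    """Equation (2), applied entry by entry."""
    xi = 1.0 / lam_hat
    return [xi * m + (1.0 - xi) * s for m, s in zip(mixture, seed)]


# Toy inputs (hypothetical); coefficients stay above the lower bound 0.3,
# so the separated distribution has no zero entries.
mixture = [0.19, 0.23, 0.30, 0.28]
seed = [0.1, 0.2, 0.3, 0.4]
lambdas = [1.0, 0.8, 0.6, 0.4]

sym_values = [symmetrized_kl(separate(mixture, seed, lam), seed) for lam in lambdas]
js_values = [js_divergence(separate(mixture, seed, lam), seed) for lam in lambdas]
for lam, s, j in zip(lambdas, sym_values, js_values):
    print(f"lambda_hat={lam:.1f}  symKL={s:.4f}  JS={j:.4f}")
```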
The above theoretical properties of DSM rest on one basic condition, i.e., the linear combination assumption. In the next section, we will investigate how to apply the distribution separation idea and algorithm in other methods. The main idea is to verify whether the well-known mixture model feedback (MMF) approach complies with this linear combination assumption. If so, DSM's linear separation algorithm can be applied in MMF, and the associated theoretical properties of DSM are valid for MMF's solution as well.

Generalized Analysis of DSM's Linear Combination Condition in MMF
Now, we will investigate the relation between DSM and a related PRF model, namely the mixture model feedback (MMF) approach [7]. MMF assumes that feedback documents are generated from a mixture model with two multinomial components, i.e., the query topic model and the collection model [7].
The estimation of the output "relevant" query model of MMF tries to purify the feedback documents by eliminating the effect of the collection model, since the collection model contains background noise, which can be regarded as the "irrelevant" content in the feedback documents [7]. In this sense, similar to DSM, the task of MMF can also be regarded as a process that removes the irrelevant part of the mixture model. However, to our knowledge, researchers have not investigated whether the linear combination assumption is valid in MMF. We will prove that the mixture model in MMF is indeed a linear combination of "relevant" and "irrelevant" parts. This theoretical result leads to a simplified version of MMF based on the linear separation equation of DSM (see Equation (2)).

Review of the Mixture Model Feedback Approach
We first review the mixture model feedback approach, where the log-likelihood of the feedback documents ($\mathcal{F}$) can be written as:
$$\log p(\mathcal{F} \mid \theta_F) = \sum_{d \in \mathcal{F}} \sum_{w \in d} c(w; d) \log \left[\lambda \, p(w \mid \theta_F) + (1 - \lambda) \, p(w \mid C)\right] \qquad (10)$$
where $c(w; d)$ is the count of a term w in a document d, $p(w \mid \theta_F)$ is the query topic model, which can be regarded as the relevance distribution to be estimated, and $p(w \mid C)$ is the collection model (i.e., the distribution of term frequency in the whole document collection), which is considered as the background distribution/noise. The empirically-assigned parameter λ is the proportion of the true relevance distribution, and 1 − λ indicates the proportion of background noise, i.e., the influence of C in the feedback documents. An EM method [7] is developed to estimate the relevance distribution by maximizing the likelihood in Equation (10). It iterates two steps [9]:
$$p(z_w = 1 \mid \mathcal{F}, \theta_F^{(n)}) = \frac{(1 - \lambda) \, p(w \mid C)}{\lambda \, p(w \mid \theta_F^{(n)}) + (1 - \lambda) \, p(w \mid C)} \qquad (11)$$
$$p(w \mid \theta_F^{(n+1)}) = \frac{\sum_{d \in \mathcal{F}} \left(1 - p(z_w = 1 \mid \mathcal{F}, \theta_F^{(n)})\right) c(w; d)}{\sum_{d \in \mathcal{F}} \sum_{w' \in d} \left(1 - p(z_{w'} = 1 \mid \mathcal{F}, \theta_F^{(n)})\right) c(w'; d)} \qquad (12)$$
where $p(z_w = 1 \mid \mathcal{F}, \theta_F^{(n)})$ is the probability that the word w is from the background distribution, given the current estimate of the relevance distribution ($\theta_F^{(n)}$). This estimation can be regarded as a procedure that obtains relevant information from the feedback documents while filtering out the influence of the collection distribution, leading to a more discriminative relevance model. It should be noted that Equation (10), due to the log operator within the summations (i.e., $\sum_{d \in \mathcal{F}} \sum_{w \in d} c(w; d)$), does not directly show that the mixture model is a linear combination of the collection model and the query topic model. Therefore, an EM algorithm is adopted to estimate the query topic model $\theta_F$.
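The EM iteration for this two-component mixture can be sketched as follows. This is a minimal illustration, not the implementation of [7] or [9]: the function name `mmf_em`, the initialization with the feedback term frequencies and the toy inputs are assumptions. Term counts are aggregated over all feedback documents, which is sufficient here because the posterior that a term comes from the collection model does not depend on the document. The collection model is assumed to be strictly positive.

```python
def mmf_em(term_counts, collection, lam, iterations=20):
    """EM for a mixture lam * theta + (1 - lam) * collection with lam held fixed.

    term_counts[w] is the total count of term w in the feedback documents.
    Returns the estimated relevance distribution theta.
    """
    total = sum(term_counts)
    theta = [c / total for c in term_counts]  # assumed initial guess
    for _ in range(iterations):
        # E-step: probability that term w was generated by the collection model.
        z = [
            (1.0 - lam) * pc / (lam * th + (1.0 - lam) * pc)
            for th, pc in zip(theta, collection)
        ]
        # M-step: re-estimate theta from the counts attributed to the topic model.
        weights = [c * (1.0 - zw) for c, zw in zip(term_counts, z)]
        norm = sum(weights)
        theta = [w / norm for w in weights]
    return theta


# Toy inputs (hypothetical).
term_counts = [34, 26, 24, 16]
collection = [0.1, 0.2, 0.3, 0.4]
tf = [c / sum(term_counts) for c in term_counts]
theta = mmf_em(term_counts, collection, lam=0.8, iterations=20)
print([round(p, 4) for p in theta])
```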

The Simplification of the EM Algorithm in MMF via DSM's Linear Separation Algorithm
Now, we explore the connections between DSM and MMF. In both methods, once λ is given (either by the estimation in DSM or by an assigned value in MMF), the next step is to estimate the true relevance distribution R. We first demonstrate the following proposition.
Proposition 5. If the EM algorithm (in MMF) converges, the mixture model of the feedback documents is a linear combination of the collection model and the output model of the EM iterative algorithm.
The proof of Proposition 5 can be found in Appendix A.3. Based on this proof, it is shown that:
$$tf(w, \mathcal{F}) = \lambda \, p(w \mid \theta_F^{(n)}) + (1 - \lambda) \, p(w \mid C) \qquad (13)$$
where $tf(w, \mathcal{F})$ is the mixture model, which represents the term frequency in the feedback documents, $p(w \mid C)$ is the collection model and $p(w \mid \theta_F^{(n)})$ is the estimated relevance model output by the n-th step of the EM iterative algorithm in MMF. Equation (13) shows that the mixture model $tf(w, \mathcal{F})$ is a linear combination of the collection model $p(w \mid C)$ and the output relevance model $p(w \mid \theta_F^{(n)})$. The above equation can be rewritten as:
$$p(w \mid \theta_F^{(n)}) = \frac{1}{\lambda} \, tf(w, \mathcal{F}) + \left(1 - \frac{1}{\lambda}\right) p(w \mid C) \qquad (14)$$
Now, if we regard $p(w \mid \theta_F^{(n)})$ as an estimated relevance distribution, $tf(w, \mathcal{F})$ as a kind of mixture distribution and $p(w \mid C)$ as a kind of irrelevance distribution, then Equation (13) fits Equation (1), and Equation (14) is the same distribution separation process as Equation (2), where $\hat{l}(R, I_{\bar{S}})$ is the estimated relevance distribution. This demonstrates that the EM iterative steps in MMF can actually be replaced by the linear separation solution in Equation (14), which follows the same distribution separation idea as Equation (2).
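The relation between the EM output and the linear separation solution can be illustrated on a small example. The sketch below is written under stated assumptions: the toy counts, the function names and the EM initialization are hypothetical, and the inputs are chosen so that the closed-form solution has no negative entries (behaviour outside that case is not covered by this sketch). It reports the largest entry-wise gap between the two solutions after different numbers of EM iterations.

```python
def mmf_em(term_counts, collection, lam, iterations):
    """EM for the mixture lam * theta + (1 - lam) * collection, with lam fixed."""
    total = sum(term_counts)
    theta = [c / total for c in term_counts]
    for _ in range(iterations):
        z = [
            (1.0 - lam) * pc / (lam * th + (1.0 - lam) * pc)
            for th, pc in zip(theta, collection)
        ]
        weights = [c * (1.0 - zw) for c, zw in zip(term_counts, z)]
        norm = sum(weights)
        theta = [w / norm for w in weights]
    return theta


def linear_separation(tf, collection, lam):
    """Closed form: (1 / lam) * tf + (1 - 1 / lam) * collection, entry by entry."""
    return [t / lam + (1.0 - 1.0 / lam) * pc for t, pc in zip(tf, collection)]


# Toy inputs (hypothetical).
term_counts = [34, 26, 24, 16]
collection = [0.1, 0.2, 0.3, 0.4]
lam = 0.8
tf = [c / sum(term_counts) for c in term_counts]

closed_form = linear_separation(tf, collection, lam)
gaps = {}
for n in (1, 5, 20, 100):
    theta = mmf_em(term_counts, collection, lam, iterations=n)
    gaps[n] = max(abs(a - b) for a, b in zip(theta, closed_form))
    print(f"iterations={n:3d}  max gap={gaps[n]:.2e}")
```

The closed form needs one pass over the vocabulary, whereas each EM iteration needs its own pass.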

Comparisons between DSM and Related Models
In this section, we compare DSM with other related works, including mixture model feedback (MMF) [7], fast mixture model feedback (FMMF) [10], regularized mixture model feedback (RMMF) [11], as well as a mixture multinomial distribution framework and a query-specific mixture modelling feedback (QMMF) approach [12]. Since the above models are built on two basic relevance feedback models, i.e., the relevance model (RM) and mixture model feedback (MMF), we will also compare RM (we use RM to denote RM1 in [2]) and MMF. These comparative discussions and analyses are given in the following, in order to clarify the position of DSM in the IR literature and our contributions to the IR community.

DSM and MMF
As discussed in the previous section, DSM and MMF share a similar strategy: the irrelevant part should be eliminated from the mixture model, so that the output relevant query model is purified. In MMF, the collection model is considered as the irrelevance model that contains background noise, and an EM iterative method [7] is developed to estimate the relevance distribution by maximizing the likelihood in Equation (10).
To our knowledge, researchers have not investigated whether the linear combination assumption is valid in MMF. We, for the first time, prove Proposition 5, which shows that if the EM algorithm (in MMF) converges, the mixture model of the feedback documents is a linear combination of the collection model and the output model of the EM iterative algorithm. This proposition directly results in a simplified solution for MMF, obtained by replacing the EM iterative steps in MMF with DSM's linear distribution separation solution (see Equation (14)).
Besides providing a simplified solution with linear complexity to the EM method in MMF, DSM shows an essential difference regarding the coefficient λ. In MMF, the proportion of the relevance model in the assumed mixture model $tf(w, \mathcal{F})$ is controlled by λ, which is a free parameter and is empirically assigned a fixed value before running the EM algorithm. On the other hand, in DSM, as previously mentioned in Section 2, λ for each query is estimated adaptively via an analytical procedure based on its linear combination analysis (see Section 2.1), a minimum correlation analysis (see Section 2.2) and a maximal KL-divergence analysis (described in Section 3.1).

DSM and FMMF
Another simplified solution to MMF was proposed in [10]. This solution is derived by the Lagrange multiplier method, and the complexity of its divide-and-conquer algorithm ranges from $O(n)$ (on average) to $O(n^2)$ (in the worst case). On the other hand, our simplified solution in Equation (14) was analytically derived from the convergence condition of the EM method in the MMF approach, and the complexity of the linear combination algorithm in Equation (14) is further reduced to a fixed linear complexity, i.e., $O(n)$.

DSM and RMMF
To deal with the problems of the manually-tuned interpolation coefficient λ in MMF (see also the discussions in Section 5.1), Tao and Zhai [11] proposed a regularized MMF (RMMF), which yields an adaptive solution for estimating λ and achieves good performance. Specifically, RMMF adds a conjugate Dirichlet prior function to the original objective function in MMF, and the original query model is used as the prior. In RMMF, a regularized EM method is developed to adapt the linear coefficients and the prior confidence value. The main strategy in this EM method is to gradually lower the prior confidence value µ, starting with a very high value, and the learned interpolation coefficient λ (in [11], λ is denoted by $\alpha_D$) varies across queries.
Although both RMMF and DSM can estimate an adaptive interpolation coefficient λ for the mixture model MMF, their algorithms are quite different. In RMMF, an EM iterative algorithm is still used, as in the original MMF. Therefore, the computational cost is relatively high. On the other hand, as described in Section 5.1, with DSM the adaptive solution of the interpolation coefficient of MMF can be obtained in linear time via an analytical procedure, which is supported by a minimum correlation analysis and a maximal KL-divergence analysis. Moreover, for the estimation of the output relevance distribution, in contrast to the iteratively-learned solution in RMMF, the solution of MMF can be obtained in closed form via Equation (14).

DSM and Mixture Multinomial Distribution Framework
Chen et al. [12] proposed a unified framework by considering several query expansion models, e.g., RM and MMF, as mixture multinomial distributions. In addition, they built a query-specific mixture model feedback (QMMF) approach, which modifies RMMF by replacing the original query model with the relevance model (actually RM1) in the prior function of RMMF. QMMF was then successfully applied in speech recognition and summarization tasks.
Although Chen et al. have summarized RM and MMF in the mixture multinomial distribution framework [12], they have not shown that both RM and MMF comply with the linear combination assumption. With the proof in Appendix A.3, we establish Proposition 5, which shows that MMF complies with the linear combination assumption. This theoretical result leads to a simplified solution for MMF (see Equation (14)). In Appendix A.4, we also show the validity of the linear combination assumption in RM. Therefore, to some extent, DSM unifies RM and MMF from another point of view, i.e., the linear combination assumption, and DSM's analysis and algorithm can be applied to both of them.
With regard to QMMF, since it is based on RMMF and MMF, the difference between QMMF and DSM also comes down to the EM algorithm's solution in MMF-based methods versus the linear separation solution in DSM, as discussed in Sections 5.1 and 5.3. Indeed, in RMMF and QMMF, adopting the original query model or the relevance model as a prior to constrain the estimation of the interpolation coefficient and the relevance feedback distribution brings obvious benefits. In our future work, we will investigate whether similar relevance information can be adopted to regularize the separation algorithm in DSM.

RM and MMF
Exploiting relevance feedback for query expansion [13] is a popular strategy in the information retrieval area to improve retrieval performance [14]. Many models with relevance feedback have been proposed [2,7,11,15–17], among which the relevance model (RM) [2] and the mixture model feedback (MMF) [7] are two basic models on which many other models are built.
RM extends the original query with an expanded term distribution generated from the feedback documents. The resultant distribution of RM is calculated by combining the distributions of the individual feedback documents, with the normalized query likelihood of each document as its weight. Therefore, the effectiveness of RM depends on the quality of the feedback documents. Since feedback documents may contain collection noise, which affects the quality of the relevance model, the mixture model approach [7] was proposed to handle this problem. It assumes that the feedback documents are generated from a mixture model of relevance information and collection noise, and an EM iterative method is used to learn a relevance feedback distribution.
Although empirical results have shown that MMF can perform better than RM on some collections [14], we cannot say that one is definitely better than the other, since the retrieval performance of a feedback-based query model depends on the quality of the feedback documents. Low-quality feedback documents may not reflect the user's information need well, which affects the effectiveness of models based on feedback documents.
With respect to time complexity, due to the EM learning procedure in MMF [7,9], MMF is more time consuming than RM. In this paper, we provide a simplified solution for MMF in Section 4.2. Equipped with this linear separation algorithm, MMF can also be implemented efficiently.
As discussed in Section 5.4, DSM's generalized analysis can unify RM and MMF, since the linear combination assumption holds in both models, and DSM's analysis and algorithm can be applied to both of them. Moreover, DSM can guide the improvement of both. Specifically, for RM, DSM can separate an irrelevance distribution from the mixture model to approach the pure relevance distribution, and for MMF, the linear separation algorithm of DSM can be utilized to simplify the solution of MMF, significantly reducing its algorithmic complexity.

Contributions of DSM in Information Retrieval
Based on the above comparisons between DSM and other related models, we summarize our contributions as follows:

• We, for the first time, prove that mixture model feedback (MMF) complies with the linear combination assumption.
• Based on the above proof, MMF's EM algorithm can be simplified by a linear separation algorithm in DSM.
• DSM can unify RM and MMF, in the sense that DSM's analysis and algorithm can be applied to both of them.
• The solution of DSM is associated with solid mathematical proofs in the linear combination analysis, the minimum correlation analysis, as well as the analyses with the maximal KL-divergence, the maximal symmetrized KL-divergence and the maximal JS-divergence.
We believe that, alongside the empirical contributions on retrieval performance improvements, the theoretical contributions of DSM are also important to the IR community. The generalized analyses of DSM are validated by mathematical proofs, and their validity is to some extent independent of particular parameters or test collections. Although many feedback-based query expansion models have been proposed, relatively little attention has been paid to the rigorous analysis (through the proof of lemmas or propositions) of a retrieval model. There are a few works on the theoretical analysis of relevance feedback. For example, Clinchant and Gaussier [18] recently studied the statistical characteristics of the terms selected by several pseudo-relevance feedback methods and proposed properties that may be helpful for effective relevance feedback models. However, to our knowledge, the literature lacks a generalized analysis of DSM and an investigation of the linear combination condition in MMF.

Experiments
We have theoretically described the relation between the mixture model feedback (MMF) approach and our DSM method. The main experiments in this section provide empirical comparisons of these two methods in an ad hoc retrieval task. In addition, since we compare RM and DSM in Sections 5.4 and 5.5, we conduct an implicit feedback task with RM as the baseline (for the empirical comparison between RM and DSM in the ad hoc retrieval task, please refer to [6]). This additional experiment is expected to show the flexibility of DSM across different tasks.

Experimental Setup
The evaluation involves four standard TREC (Text REtrieval Conference) collections, including WSJ (87-92, 173,252 documents) and AP (88-89, 164,597 documents) in TREC Disks 1 and 2, ROBUST 2004 (528,155 documents) in TREC Disks 4 and 5 and WT10G (1,692,096 documents). These datasets involve a variety of texts, ranging from newswire articles to web/blog data. Both the WSJ and AP datasets are tested on Queries 151-200, while the ROBUST 2004 and WT10G collections are tested on Queries 601-700 and 501-550, respectively. The title field of the queries is used to reflect typical keyword-based search scenarios. In query expansion models, the top 100 terms in the corresponding distributions are selected as expanded terms. The top 50 documents in the initial ranked list obtained by the query likelihood approach are selected as the feedback documents. The top 1000 documents retrieved by the negative KL-divergence between the expanded query model and the document language model [19] are used for retrieval performance evaluation. The Lemur 4.7 toolkit [20] is used for indexing and retrieval. All collections are stemmed using the Porter stemmer, and stop words are removed in the indexing process.
As the evaluation metric for retrieval performance, we use the mean average precision (MAP), which is the mean value of average precision over all queries. In addition, we use the Wilcoxon significance test to examine the statistical significance of the improvements over the baseline (the baseline model and the significance test results are shown in the result tables).
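For reference, MAP can be computed as in the following sketch. It uses the common definition in which the average precision of a query is the sum of the precision values at the ranks of the relevant documents, divided by the total number of relevant documents for that query; the function names and the toy rankings are assumptions for illustration and are not taken from the evaluation scripts used in the experiments.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """Mean of the per-query average precision; runs is a list of (ranking, relevant set)."""
    return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)


# Toy rankings (hypothetical).
ap_first = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
ap_second = average_precision(["a", "b", "c"], {"b"})
map_score = mean_average_precision(
    [(["d1", "d2", "d3", "d4"], {"d1", "d3"}), (["a", "b", "c"], {"b"})]
)
print(round(ap_first, 4), round(ap_second, 4), round(map_score, 4))
```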

Evaluation on Retrieval Performance
As previously mentioned, the EM iteration algorithm of MMF can be simplified as a distribution separation procedure (see Equation (14)) whose inputs are two distributions, $tf(w, \mathcal{F})$ (TF for short) and $p(w \mid C)$, where TF is the mixture distribution in which the probability of a term is its frequency in the feedback documents, and C is the distribution of the term frequency in the whole document collection. It has been shown in Section 4 that Equation (14) is actually a special case of DSM, in which TF and C are DSM's input distributions and λ is assigned empirically without principled estimation. We denote this special case as DSM (λ fixed). Now, we compare MMF (with the EM algorithm) and DSM (λ fixed) to test Proposition 5 empirically. First, we directly measure the KL-divergence between the resultant distributions of MMF and DSM (λ fixed). We report the results of Queries 151-160 on WSJ 87-92 with λ = 0.8 in Figure 2; the results of other queries and datasets show the same trends. It can be observed in Figure 2 that the KL-divergence between the resultant distributions of MMF and DSM (λ fixed) tends to zero as the EM algorithm (of MMF) converges (as the number of iterations approaches 20). This observation supports the equivalence proven in Proposition 5.
Next, we compare the retrieval performance of MMF and DSM (λ fixed). For MMF, we set λ to the value with the best retrieval performance, and this optimal value is also used in DSM (λ fixed). Experimental results are shown in Table 3. We find that the performances of these two methods are very close, which is consistent with the analysis in Section 4. The results again confirm that the EM algorithm in MMF can be simplified by Equation (14), which is the linear separation algorithm used in DSM. As previously mentioned, DSM (λ fixed) is just a special case of DSM in which λ is empirically assigned. This λ is the same for all of the concerned queries. For DSM, we can use the lower bound of λ, and this estimate (i.e., the lower bound $\lambda_L$) is computed adaptively for each query. In addition, DSM involves a refinement step for the input distributions (see the algorithm in [6]). In Table 3, we denote DSM with the lower bound of λ as DSM ($\lambda_L$) and DSM with the refinement step as DSM (+refine).
We now test DSM ($\lambda_L$) and DSM (+refine) when TF and C are the mixture distribution and the seed irrelevance distribution, respectively, as used in DSM (λ fixed). Table 3 shows that the performances of both DSM ($\lambda_L$) and DSM (+refine) are significantly better than that of MMF. This is because, although MMF and DSM (λ fixed) empirically tune λ for each collection, the value of λ is the same for every query. In contrast, DSM ($\lambda_L$) and DSM (+refine) adopt a principled estimation of λ for each concerned query, computed adaptively based on the linear combination analysis, the minimum correlation analysis and the maximum KL-divergence analysis. This set of experiments demonstrates that the estimation method for λ in DSM is crucial and effective for eliminating the irrelevance distribution.

Evaluation on Running Time
We now report the running time of DSM in comparison with the iterative EM method of MMF. The running times are recorded on a Dell PowerEdge R730 with one six-core CPU. Each recorded time is computed over a number of topical queries, and this number is 50, 50, 100 and 50 for WSJ 87-92, AP 88-89, ROBUST 2004 and WT10G, respectively. We run each method 100 times and report the average running time of the MATLAB code. The number of iterations used in the EM algorithm of MMF is set to 20, since in our experiments, the EM algorithm cannot converge well with fewer than 20 iterations.
The running time comparisons between DSM and MMF are shown in Figure 3. In each figure, the left column is for the DSM method (with distribution refinement), which has more computation steps and is thus slower than DSM (with λ fixed) and DSM (with the lower bound λ_L) described in the previous experiment, while the right column is for the EM algorithm used in MMF. These results demonstrate the acceleration effect of DSM for MMF. DSM, with its linear time complexity, is much more efficient than the iterative EM algorithm used for MMF.
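The source of the speed-up can be seen in a small sketch that contrasts the two computations. It assumes the standard two-component form TF = (1 − noise) · θ + noise · C, where `noise` is the fixed weight of the collection model (the orientation of λ in the article's equations may differ); the function names and numbers are hypothetical, and the code is Python rather than the MATLAB used for the timings.

```python
def mmf_em(tf, coll, noise, iterations=20):
    """EM for a two-component mixture tf ~ (1 - noise) * theta + noise * coll.

    tf:    empirical term distribution of the feedback documents
    coll:  collection (background) model, held fixed
    noise: weight of the collection model, held fixed
    """
    theta = [1.0 / len(tf)] * len(tf)  # uniform starting point
    for _ in range(iterations):
        # E-step: probability that an occurrence of each term came from theta.
        post = [(1 - noise) * t / ((1 - noise) * t + noise * c)
                for t, c in zip(theta, coll)]
        # M-step: re-estimate theta from the expected topical mass.
        weighted = [f * p for f, p in zip(tf, post)]
        total = sum(weighted)
        theta = [w / total for w in weighted]
    return theta


def linear_separation(tf, coll, noise):
    """Closed form: solve tf = (1 - noise) * theta + noise * coll for theta."""
    return [(f - noise * c) / (1 - noise) for f, c in zip(tf, coll)]


TF = [0.5, 0.3, 0.2]  # hypothetical feedback term distribution
C = [0.2, 0.3, 0.5]   # hypothetical collection model

closed = linear_separation(TF, C, noise=0.3)
em_estimate = mmf_em(TF, C, noise=0.3, iterations=200)
gap = max(abs(a - b) for a, b in zip(em_estimate, closed))
print(closed, em_estimate, gap)
```

The closed form makes a single pass over the vocabulary, while EM repeats a pass of similar cost at every iteration. In this toy case the closed-form solution is non-negative, and the EM estimate approaches it as the number of iterations grows.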

Application of DSM on Implicit Irrelevant Feedback Using Commercial Query Logs
To further show DSM's flexibility, in this section, we give an empirical evaluation of DSM in implicit irrelevant feedback using query log data from a real commercial search engine. Since users' behaviour can be exploited in implicit feedback to infer their preferences [21], it can be used here for identifying seed irrelevant documents. Without loss of generality, we assume that each query is part of a search session, which consists of a sequence of queries issued within a time interval (30 min in our case). A list of returned documents is associated with each query, and whether or not each document was clicked is recorded. In addition, the clicked documents can be divided into satisfied clicks and unsatisfied clicks according to the hovering time over the document. We call clicked documents with a short hovering time (e.g., less than 30 s) "unsatisfied", because a user clicking a document but closing it very quickly gives a hint that the user is not interested in this document [22–24]. We can now obtain the seed irrelevant document set, which consists of the unsatisfied clicked documents in the history that appear in the current returned document list, denoted as D_unsatisfied. The corresponding seed irrelevance distribution is: where Z_{I_S} = ∑_{d ∈ D_unsatisfied} p(q|d).
We obtain the seed irrelevance distribution from unsatisfied clicked (USC) documents as the implicit feedback documents. As for the mixture term distribution, we use the term distribution derived from RM as an input for DSM. The detailed calculation for this mixture distribution can be found in Appendix A.4. We compare DSM to the initial result of the search engine. In addition, we compare DSM to two kinds of relevance feedback methods: pseudo-relevance feedback using all of the returned documents for query expansion and implicit relevance feedback using the clicked documents in the log. (The number of pseudo-relevant documents is not a constant, since the number of returned documents differs from query to query. Some sessions may have no unsatisfied clicked documents; we simply ignore these sessions.) For implicit irrelevance feedback, the top 50 terms are selected as expanded terms in query expansion models, and performance evaluations are conducted on all of the returned documents.
We sort all of the sessions based on the click entropy of the current query [25,26], which is a metric for the variability in click results across different users. Low click entropy means that the clicks from most users fall within a small number of returned documents, which leaves less potential to benefit from users' implicit feedback and little room for improvement. To clearly compare different implicit feedback-based re-ranking methods, we take the top n (n = 100, 200) sessions with the largest click entropy. In Table 4, each column records the result for the top n sessions with the largest click entropy.
In Table 4, "Initial" denotes the initial performance of the search engine. "Pseudo-Relevance Feedback" and "Implicit Relevance Feedback" denote the cases when we use all of the returned documents and the clicked documents, respectively, as the feedback documents, based on which we carry out query expansion using RM.
From Table 4, we can observe that DSM with unsatisfied clicked (USC) documents (as the seed irrelevance documents) largely improves the initial ranking performance and works better than both pseudo-relevance feedback and implicit relevance feedback. These results demonstrate that DSM is effective for implicit irrelevance feedback.
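A minimal sketch of this step is given below. It assumes an RM-style weighting, I_S(w) = (1/Z_{I_S}) ∑_{d ∈ D_unsatisfied} p(w|d) p(q|d), which is consistent with the normalizer above but is an assumption about the exact form of the distribution. The function names, the 30 s default and the toy log entries are hypothetical, and the intersection with the current returned list is omitted for brevity.

```python
def unsatisfied_clicks(clicks, max_dwell_seconds=30.0):
    """Return ids of clicked documents whose hovering time is below the threshold."""
    return [doc_id for doc_id, dwell in clicks if dwell < max_dwell_seconds]


def seed_irrelevance(doc_models, query_likelihood, doc_ids):
    """Assumed form: I_S(w) = (1/Z) * sum_d p(w|d) * p(q|d), with Z = sum_d p(q|d)."""
    z = sum(query_likelihood[d] for d in doc_ids)
    dist = {}
    for d in doc_ids:
        for term, p in doc_models[d].items():
            dist[term] = dist.get(term, 0.0) + p * query_likelihood[d] / z
    return dist


# Hypothetical click log: (document id, hovering time in seconds).
clicks = [("d1", 12.0), ("d2", 95.0), ("d3", 4.5)]
# Hypothetical document language models p(w|d) and query likelihoods p(q|d).
doc_models = {
    "d1": {"jaguar": 0.5, "car": 0.5},
    "d2": {"jaguar": 0.4, "speed": 0.6},
    "d3": {"jaguar": 0.25, "cat": 0.75},
}
query_likelihood = {"d1": 0.02, "d2": 0.05, "d3": 0.06}

usc = unsatisfied_clicks(clicks)  # d2 is excluded: hovering time of 95 s
I_S = seed_irrelevance(doc_models, query_likelihood, usc)
print(usc, I_S)
```

Documents with a higher query likelihood contribute more to I_S, and dividing by Z_{I_S} makes the result sum to one.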
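The session selection can be sketched as follows, assuming the common definition of click entropy as the Shannon entropy (base 2) of the distribution of clicks over documents for a query. The function names and the toy click lists are hypothetical.

```python
import math
from collections import Counter


def click_entropy(clicked_docs):
    """Shannon entropy (base 2) of the click distribution over documents."""
    counts = Counter(clicked_docs)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def top_sessions(session_clicks, n):
    """Return the ids of the n sessions with the largest click entropy."""
    ranked = sorted(session_clicks,
                    key=lambda s: click_entropy(session_clicks[s]),
                    reverse=True)
    return ranked[:n]


# Hypothetical aggregated clicks for the current query of each session.
session_clicks = {
    "s1": ["d1", "d1", "d1", "d1"],  # all users click the same document
    "s2": ["d1", "d2", "d3", "d4"],  # clicks spread over four documents
    "s3": ["d1", "d1", "d2", "d2"],  # clicks split between two documents
}

selected = top_sessions(session_clicks, n=2)
print(selected)
```

Queries whose clicks concentrate on one document get an entropy of zero and are ranked last, which matches the motivation for keeping only high-entropy sessions.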

Conclusions and Future Work
In this paper, we have systematically investigated the theoretical properties of the distribution separation method (DSM). Specifically, we have proven that the minimum correlation analysis in DSM is generalizable to maximum (original and symmetrized) KL-divergence analysis, as well as JS-divergence. We have also proven that the solution to the well-known mixture model feedback (MMF) can be simplified using the linear combination technique in DSM, and this is also empirically verified using standard TREC datasets. We summarize the theoretical contributions of DSM for IR research in Section 5.6.
The experimental results on the ad hoc retrieval task show that DSM with an analytically-derived combination coefficient λ not only achieves better retrieval performance, but also largely reduces the running time, compared to the EM algorithm used in MMF. An additional experiment on the query log shows that DSM also works well in the implicit feedback task, which indicates the flexibility of DSM on different tasks.
In our future work, we are going to investigate whether it is possible to adopt the original query model (or other relevance information) to regularize the separation algorithm in DSM. The empirical evaluation will then be based on RMMF and QMMF as the baselines, in order to compare different regularization strategies, given the same prior information for the regularization. Moreover, since the EM algorithm is widely used in many fields, e.g., machine learning and data mining, it is interesting to investigate the distribution separation idea in certain applications of EM algorithms (e.g., the Gaussian mixture model). We will also endeavour to make DSM more applicable to various methods/tasks.

If M(i) > I_S(i), since M(i) − I_S(i) > 0 and 0 < I_S(i)/l(i) < 1, we have:
If M(i) < I_S(i), since M(i) − I_S(i) < 0 and I_S(i)/l(i) > 1, we have:
We then have:
We now have D′(ξ) > 0. This means that D(ξ) (i.e., D(I_S, l(R, I_S))) will increase after ξ increases. Since λ = 1/ξ, after λ decreases, D(I_S, l(R, I_S)) will increase. Combined with the result proven in Proposition 2, we can conclude that when λ decreases, the symmetrized KL-divergence D(l(R, I_S), I_S) + D(I_S, l(R, I_S)) will increase monotonically.
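The inequalities behind this argument can be sketched as follows. The sketch assumes that l(R, I_S) = ξM + (1 − ξ)I_S with ξ = 1/λ, that l(i) > 0 for every term, and that D(ξ) = ∑_i I_S(i) log(I_S(i)/l(i)); these forms are inferred from the stated conditions and are not quoted from the proof.

```latex
\begin{align*}
D'(\xi) &= -\sum_i \frac{I_S(i)}{l(i)}\,\bigl(M(i) - I_S(i)\bigr), \\
M(i) > I_S(i) &\;\Rightarrow\;
  \frac{I_S(i)}{l(i)}\,\bigl(M(i) - I_S(i)\bigr) < M(i) - I_S(i), \\
M(i) < I_S(i) &\;\Rightarrow\;
  \frac{I_S(i)}{l(i)}\,\bigl(M(i) - I_S(i)\bigr) < M(i) - I_S(i), \\
\sum_i \frac{I_S(i)}{l(i)}\,\bigl(M(i) - I_S(i)\bigr)
  &< \sum_i \bigl(M(i) - I_S(i)\bigr) = 0
  \;\Rightarrow\; D'(\xi) > 0 .
\end{align*}
```

The first case multiplies a positive quantity by a factor below one, and the second multiplies a negative quantity by a factor above one, so both shrink the term. Terms with M(i) = I_S(i) contribute zero on both sides, so the strict inequality requires M ≠ I_S for at least one term. The final sum vanishes because M and I_S each sum to one.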

A.2. Proof of Proposition 4
Proposition 4. If λ (λ > 0) decreases, the JS-divergence between l(R, I_S) and I_S will increase.
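The monotonic behaviour stated in the propositions can be checked numerically with a small sketch. It assumes l(R, I_S) = (1/λ)M + (1 − 1/λ)I_S and uses hypothetical toy distributions, with values of λ chosen so that every separated probability stays strictly positive.

```python
import math


def kl(p, q):
    """KL-divergence D(p, q), with the convention that terms with p(i) = 0 vanish."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js(p, q):
    """JS-divergence: average KL-divergence of p and q to their midpoint."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def separate(mixture, seed, lam):
    """Assumed linear separation: l = (1/lam) * M + (1 - 1/lam) * I_S."""
    xi = 1.0 / lam
    return [xi * m + (1.0 - xi) * s for m, s in zip(mixture, seed)]


M = [0.5, 0.3, 0.2]    # hypothetical mixture distribution
I_S = [0.2, 0.3, 0.5]  # hypothetical seed irrelevance distribution

rows = []
for lam in (1.0, 0.9, 0.8, 0.7):  # decreasing lambda
    l = separate(M, I_S, lam)
    rows.append((kl(l, I_S), kl(l, I_S) + kl(I_S, l), js(l, I_S)))

for lam, row in zip((1.0, 0.9, 0.8, 0.7), rows):
    print(lam, row)  # KL, symmetrized KL, JS
```

Each column grows as λ decreases, which is the behaviour the propositions describe for the KL-divergence, the symmetrized KL-divergence and the JS-divergence.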

Figure 1. An illustration of the linear combination l(•, •) between two term distributions.

Proposition 5. If the EM algorithm (in MMF) converges, the mixture model of the feedback documents is a linear combination of the collection model and the output relevance model of the EM iterative algorithm.

Figure 2. KL-divergence between the resultant distributions of the distribution separation method (DSM) (with λ fixed) and mixture model feedback (MMF) at each iteration of the EM method.

Figure 3. Running time of DSM and MMF.

A.3. Proof of Proposition 5
Proposition 5. If the EM algorithm (in MMF) converges, the mixture model of the feedback documents is a linear combination of the collection model and the output relevance model of the EM iterative algorithm.
Proof. When the EM method converges in MMF, without loss of generality, let p). In addition, we can replace the p(z_w = 1|F, θ_F^(n)) in Equation (12) using Equation (11) and then get:

Table 4. Evaluation of DSM with implicit approaches to the seed irrelevance distribution. (Statistically significant improvement over the initial at the 0.05 (*) and 0.01 (**) levels.)