Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling

In this paper, we propose a novel and generic family of multiple importance sampling estimators. We first revisit the celebrated balance heuristic estimator, a widely used Monte Carlo technique for the approximation of intractable integrals. Then, we establish a generalized framework for the combination of samples simulated from multiple proposals. Our approach is based on considering as free parameters both the sampling rates and the combination coefficients, which are the same in the balance heuristic estimator. Thus, our novel framework contains the balance heuristic as a particular case. We study the optimal choice of the free parameters in such a way that the variance of the resulting estimator is minimized. A theoretical variance study shows that the optimal solution is always at least as good as the balance heuristic estimator (the two coincide only in degenerate cases). We also give sufficient conditions on the parameter values for the new generalized estimator to be better than the balance heuristic estimator, and one necessary and sufficient condition related to the χ² divergence. Using five numerical examples, we first show the gap in efficiency between the new and the classical balance heuristic estimators, for equal sampling and for several state-of-the-art sampling rates. Then, for the same five examples, we compute the variances for some notable selections of parameters, showing that, for the important case of an equal count of samples, our new estimator with an optimal selection of parameters outperforms the classical balance heuristic. Finally, new heuristics that exploit the theoretical findings are introduced.


Introduction
Multiple importance sampling (MIS) is a Monte Carlo technique widely used in the signal processing, computational statistics, and computer graphics literature for approximating complicated integrals. In its basic configuration, it works by drawing random samples from several proposal distributions (also called techniques) and weighting them appropriately, in such a way that an estimator built with the weighted samples is consistent. Since the publication of [1], the celebrated balance heuristic estimator has been extensively used in the Monte Carlo literature, with an unprecedented success in the computer graphics industry (Eric Veach has received several awards for his contributions to the MIS literature, of which the balance heuristic is arguably the most relevant). In the balance heuristic method, different samples are simulated from each proposal and the traditional IS weight is assigned to each of them. Unlike in the standard IS estimator, all the weighted samples are combined with an extra weighting, in such a way that the resulting estimator typically shows a reduced variance. Its superiority in terms of variance with regard to other traditional combination schemes has recently been shown in [2], where a framework is established for sampling and weighting in MIS under an equal number of samples per technique. The balance heuristic, also called the deterministic mixture [3], has been widely used in the MIS literature. Further efficient variance reduction techniques are proposed in [4][5][6], also in the context of MIS, still with equal counts from each technique. Provably better estimators [7] and heuristically better ones [8,9] have been presented that use a different count of samples than an equal count for all techniques. In [10], the relationship between a better count of samples and generalized weighted means has been shown.
The balance heuristic is also present in most successful adaptive IS (AIS) methods, see [11][12][13][14][15], in particular in the case where all techniques are used to simulate the same number of samples. Recently, MIS has returned to the main focus in computer graphics: in [16], where it is shown that, by allowing weights to be negative, the reduction over the balance heuristic of the resulting MIS estimator can be higher than the one predicted by Veach's bounds [17]; in [18], where one of the sampling techniques is optimized to decrease the overall variance of the resulting MIS estimator; in [19], where the weights are made proportional to the quotients of the second moments divided by the variances of the independent techniques; and in [20], where MIS is generalized to uncountably infinite sets of techniques.
Interestingly, the balance heuristic has two properties in the assigned weights. First, all techniques appear in the denominator of the weight of a specific technique. Second, they appear in the form of a mixture, with coefficients proportional to the number of samples simulated from each technique. In this paper, we relax this constraint, providing a generalized weighting/combining family of estimators that has the balance heuristic as a particular case. First, we show that it is possible to use a specific set of coefficients to decide the amount of samples per technique, and a different set of coefficients to be applied in the importance weight. Second, we study four different cases, fixing some of these coefficients (sampling and/or weighting), and we give the optimal solution for the remaining coefficients in such a way that the variance of the MIS estimator is minimized. Note that the novel estimator always performs at least as well as the balance heuristic under the optimal choice of those coefficients. In five numerical examples we show that, under an adequate choice of parameters, the novel estimator outperforms the celebrated balance heuristic.
The rest of the paper is structured as follows. Section 2 revisits the balance heuristic estimator. In Section 3, we propose the new family of estimators that generalizes the balance heuristic. We address five cases of special interest, depending on which parameters are free and on the number of samples simulated from each technique. Then, in Section 4, we discuss the different singular points in the variances of the estimators considered. In Section 5, we give a necessary and sufficient condition, in terms of the χ² divergence (and its approximation as a Kullback-Leibler divergence), for the variance of the new estimator to be lower than that of the balance heuristic. Finally, we present five numerical examples in Section 6, propose some heuristics, and give some conclusions in Section 7.

Balance Heuristic Estimator
The goal in IS is usually the estimation of the value of the integral µ = ∫ f(x) dx. In MIS, n_i samples, {X_{i,j}}_{j=1}^{n_i}, are simulated from each of a set of available probability density functions (pdfs), {p_i(x)}_{i=1}^n. The MIS estimator introduced by Veach and Guibas [1] is given by

Z = ∑_{i=1}^n (1/n_i) ∑_{j=1}^{n_i} w_i(X_{i,j}) f(X_{i,j})/p_i(X_{i,j}),   (1)

where w_i(x) is a weight function associated with the i-th proposal that fulfills both of the following conditions. First, the weights must sum to one at all points of the domain where the value of the function is different from zero, i.e., ∑_{i=1}^n w_i(x) = 1, ∀x where f(x) ≠ 0. Second, for all x where p_i(x) = 0, then w_i(x) = 0. In this paper, we consider only weighting functions that yield an unbiased Z.
The balance heuristic estimator is a particular case of Equation (1) where the weight function is given by

w_i(x) = n_i p_i(x) / ∑_{k=1}^n n_k p_k(x),   (2)

which can also be written as

w_i(x) = α_i p_i(x) / ∑_{k=1}^n α_k p_k(x),   (3)

where n_i = α_i N. In this case, the estimator in Equation (1) becomes the balance heuristic or deterministic mixture estimator, given by

F = (1/N) ∑_{i=1}^n ∑_{j=1}^{n_i} f(X_{i,j}) / ∑_{k=1}^n α_k p_k(X_{i,j}).   (4)
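As an illustration, the balance heuristic of Equations (2)-(4) can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code; the integrand f (a standard normal density, so that µ = 1) and the two Gaussian techniques are hypothetical choices made only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def balance_heuristic(f, pdfs, samplers, alphas, N):
    """Deterministic-mixture MIS estimate of mu = integral of f, Equation (4):
    n_i = alpha_i * N samples per technique, mixture sum_k alpha_k p_k in the weight."""
    total = 0.0
    for p_i, draw_i, a_i in zip(pdfs, samplers, alphas):
        x = draw_i(int(round(a_i * N)))                    # n_i samples from p_i
        mix = sum(a * p(x) for a, p in zip(alphas, pdfs))  # sum_k alpha_k p_k(x)
        total += np.sum(f(x) / mix)
    return total / N

# hypothetical toy problem: f is the N(0,1) density, so the true value is mu = 1
f = lambda x: normal_pdf(x, 0.0, 1.0)
pdfs = [lambda x: normal_pdf(x, -1.0, 1.5), lambda x: normal_pdf(x, 1.0, 0.8)]
samplers = [lambda n: rng.normal(-1.0, 1.5, n), lambda n: rng.normal(1.0, 0.8, n)]
mu_hat = balance_heuristic(f, pdfs, samplers, [0.5, 0.5], 10_000)
print(mu_hat)  # close to 1
```

Because the wide technique covers both tails here, the weighted ratio f/∑ α_k p_k stays bounded and the estimate is stable.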

Interpretation of F and General Notation of the Paper
Note that F is also called the deterministic mixture scheme [3] or multi-sample MIS estimator [7,17], as opposed to the case where all samples are drawn independent and identically distributed (i.i.d.) from the mixture ψ_α(x) = ∑_{k=1}^n α_k p_k(x). The latter alternative is called the randomized balance heuristic or one-sample MIS estimator, and is denoted by ℱ. This alternative is a subtle variation of F that simply modifies the sampling by simulating the N total samples i.i.d. from the mixture ψ_α (instead of deterministically choosing the number of samples per proposal); the estimator then has the exact same expression as Equation (4). This randomized version ℱ also allows for a re-interpretation of the deterministic balance heuristic F, which can be seen as mixture sampling (note that the mixture is present in the denominator of all importance weights) with variance reduction through stratified sampling (see Appendix 1 of [2] for a detailed discussion). In the randomized version, n_i is then a random variable with expected value α_i N, for all i. In summary, in this paper we refer to the cases where the number of samples per technique is deterministic, e.g., F, as deterministic mixture estimators, and to those with i.i.d. sampling from the mixture (and hence with n_i a random variable), e.g., ℱ, as random mixture estimators. We represent the randomized estimators with calligraphic letters. All estimators, unless the opposite is clearly stated, are deterministic mixture estimators.
Finally, we denote estimators with the superscript 1 when they are versions of a specific estimator with the number of samples normalized to 1; e.g., F¹ is the normalized version of F. In other words, even though the estimators require the numbers of samples per technique to be n_i ∈ ℕ, we use these normalized estimators to denote the variance normalized to one sample, which simplifies the comparison across estimators (for N total samples, the variance of the estimator is just the variance of F¹ divided by N).
In Table 1 we show the naming convention used in this paper.

Rationale
In Theorems 9.2 and 9.4 of [17], the variances of Z¹ and its randomized version 𝒵¹ are given, i.e., the version where, instead of deterministically selecting n_i, all the samples are directly simulated from ψ_α(x) = ∑_{k=1}^n α_k p_k(x). By subtracting their values we have

V[𝒵¹] − V[Z¹] = ∑_{i=1}^n α_i µ_i² − µ²,   (5)

where (we use here a notation consistent with [7])

µ_i = (1/α_i) ∫ w_i(x) f(x) dx,   (6)

and thus

µ = ∑_{i=1}^n α_i µ_i.   (7)

This result generalizes several particular cases derived in [7] and [2], in the context of a variance analysis of MIS estimators. From Equation (6) (see Appendix A) we have that

V[Z¹] ≤ V[𝒵¹],   (9)

and equality only happens (apart from the case when both variances are zero) when all the µ_i are equal. One example is given by taking, in Equation (7), w_i(x) = w_i constant and α_i = w_i for all i. We also show in Appendix A that, for the particular case when α_i = 1/n and f(x) ≥ 0, the upper bound for the improvement of the deterministic over the non-deterministic estimator is given by

V[𝒵¹] − V[Z¹] ≤ (n − 1) µ²,   (10)

where the bound is approached when there is an index k such that µ_k >> µ_i for all i ≠ k. In Theorem 9.4 of [17], it is shown that the optimal weighting functions w_i(x) for 𝒵 (i.e., those that minimize its variance V[𝒵]) are the balance heuristic weights of Equation (3); therefore, the optimal case is 𝒵 ≡ ℱ, where ℱ is the random mixture estimator. That is, for any estimator Z using the same distribution of samples, also taking into account Equation (9) (see also [7]), it always holds that V[ℱ¹] ≤ V[𝒵¹]. Further, in Theorem 9.2 of [17], it is proved that the estimator that minimizes the second moment of the Z¹ estimator is the balance heuristic estimator. Thus, it seems clear that for improvement we have to look for a deterministic estimator that generalizes the balance heuristic mixture estimator F. This is presented in the next section.
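The stratification argument above can be checked numerically: with the same weighting, deterministic per-technique counts (F) never increase the variance over i.i.d. sampling from the mixture (ℱ). The integrand and the two techniques below are hypothetical choices for the experiment, picked so that the µ_i differ and the gap of Equation (5) is clearly visible.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

f = lambda x: normal_pdf(x, 0.0, 1.0)     # integrand, mu = 1
means, sds = [-3.0, 0.0], [1.0, 1.0]      # two techniques with very different mu_i
alphas = np.array([0.5, 0.5])

def mixture(x):
    return sum(a * normal_pdf(x, m, s) for a, m, s in zip(alphas, means, sds))

def estimate(N, deterministic):
    if deterministic:                      # F: fixed counts n_i = alpha_i * N
        counts = (alphas * N).astype(int)
    else:                                  # randomized F: multinomial counts
        counts = rng.multinomial(N, alphas)
    total = 0.0
    for n_i, m, s in zip(counts, means, sds):
        x = rng.normal(m, s, n_i)
        total += np.sum(f(x) / mixture(x))
    return total / N

ests_F = [estimate(200, True) for _ in range(2000)]
ests_R = [estimate(200, False) for _ in range(2000)]
print(np.var(ests_F), np.var(ests_R))     # deterministic counts give the smaller variance
```

The empirical gap between the two variances matches the per-sample term ∑ α_i µ_i² − µ² of Equation (5) divided by N.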

Global Illumination
The rendering equation [21] tells us that the radiance L_out(x, ω_out) exiting point x in direction ω_out is given by

L_out(x, ω_out) = L_e(x, ω_out) + ∫_Ω ρ(x, ω_in, ω_out) L_in(x, ω_in) cos θ dω_in,   (12)

where L_e(x, ω_out) is the emitted radiance, L_in(x, ω_in) is the incoming radiance at x from direction ω_in, the integral extends over the hemisphere Ω above x, θ is the angle between the normal at x and ω_in, and ρ is the bidirectional reflectance distribution function, which gives the fraction of radiance incoming from ω_in that is scattered in the direction ω_out. To approximate the integral in (12), multiple importance sampling is applied with pdfs that sample, respectively, the incoming radiance and the bidirectional reflectance.

Bayesian Inference
Multiple importance sampling (MIS) is often applied in Bayesian inference [22]. In this context, the posterior distribution is usually intractable, and computing its moments analytically is impossible; therefore, approximate methods are required. MIS is particularly well suited for cases when the posterior distribution is multimodal or needs to be approximated by a mixture of densities [2].
Typically, a set of observations, y ∈ R^{d_y}, is available. The inference task consists in probabilistically estimating some hidden parameters and/or latent variables x ∈ R^{d_x}, which are connected to the observations through a statistical model. The relation between observations and unknown parameters is encoded in the likelihood, ℓ(y|x), and the prior knowledge on x is described through the prior distribution p_0(x). Through Bayes' rule, we can obtain the posterior distribution of the unknowns as

π(x|y) = ℓ(y|x) p_0(x) / Z(y),

where Z(y) = ∫ ℓ(y|x) p_0(x) dx is the marginal likelihood. In this scenario, we may be interested in computing a moment of the posterior, which is an expectation (hence an integral) of the following form:

E_π[h(x)] = ∫ h(x) π(x|y) dx,

where h : R^{d_x} → R. In connection with the notation of this paper, f(x) = h(x) π(x|y). In the context of Bayesian inference, there are several ways to select the techniques, {p_i(x)}_{i=1}^n. For instance, they can be deterministically set a priori, e.g., as Laplace approximations at different points [23]. Alternatively, they can be adapted over iterations via iterative stochastic mechanisms [24]. The latter technique is called adaptive importance sampling (AIS) and its literature is vast (see [15] for a recent review).
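To make this concrete, the following sketch estimates a posterior mean by MIS with a mixture of two Gaussian techniques and balance-heuristic weights. The model (a hypothetical conjugate 1-D Gaussian prior and likelihood, chosen so the true posterior mean is known to be 1.2) and the technique locations are illustrative assumptions; the self-normalized form avoids computing Z(y).

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# hypothetical conjugate model: x ~ N(0, 2^2), y | x ~ N(x, 1), observed y = 1.5;
# the exact posterior is N(1.2, 0.8), so E[x | y] = 1.2
y = 1.5
unnorm_post = lambda x: normal_pdf(y, x, 1.0) * normal_pdf(x, 0.0, 2.0)

# two techniques p_1, p_2 (e.g. Laplace-style approximations at different points)
alphas, means, sds = np.array([0.5, 0.5]), [0.0, 2.0], [1.5, 1.5]
mix = lambda x: sum(a * normal_pdf(x, m, s) for a, m, s in zip(alphas, means, sds))

N = 20_000
xs = np.concatenate([rng.normal(m, s, int(a * N))
                     for a, m, s in zip(alphas, means, sds)])
w = unnorm_post(xs) / mix(xs)            # balance-heuristic weights, unnormalized target
post_mean = np.sum(w * xs) / np.sum(w)   # self-normalized IS: Z(y) cancels
print(post_mean)                          # close to 1.2
```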

Generalized Multiple Importance Sampling Balance Heuristic Estimator
In the classic balance heuristic, both the numbers of samples {n_i = α_i N}_{i=1}^n and the weights of the mixture of pdfs, {α_i}_{i=1}^n, are given by the same set of parameters. Let us now consider the estimator of Equation (4), where we relax the dependence between the number of samples n_i and the associated coefficient α_i, i.e., now n_i = β_i N, with β_i > 0 and ∑_{i=1}^n β_i = 1, where in general α_i ≠ β_i (otherwise, we recover F). We now define the estimator

G = (1/N) ∑_{i=1}^n (α_i/β_i) ∑_{j=1}^{n_i} f(X_{i,j}) / ∑_{k=1}^n α_k p_k(X_{i,j}),   (16)

with X_{i,j} ~ p_i and n_i = β_i N. Note that G is a particular case of Z, with the weights w_i(x) of Equation (3). Note also that the balance heuristic F is a particular case of G, i.e., in general we do not impose the restriction α_i = n_i/N. Although the allocation of samples will be different for F and G, we remark that both estimators require the same number of simulations and evaluations. Finally, we note that a close examination of (16) allows us to re-interpret the importance weight associated with each sample. The mixture parametrized by {α_i}_{i=1}^n appears in its denominator, but since this mixture does not reflect the sampling procedure (the true mixture proposal in sampling is the one parametrized by {β_i}_{i=1}^n), the importance weight of a sample simulated from the i-th technique is corrected by the factor α_i/β_i in order to account for this mismatch.
Estimator G can be rewritten as

G = (1/N) ∑_{i=1}^n (α_i/β_i) ∑_{j=1}^{n_i} f(X_{i,j}) / ψ_α(X_{i,j}),

where ψ_α(x) = ∑_{k=1}^n α_k p_k(x). Note that G depends on two sets of parameters, {α_i}_{i=1}^n and {β_i}_{i=1}^n. In the particular case where β_i = α_i, ∀i, the estimator G becomes F. Each sample X_{i,j} ~ p_i contributes the term (α_i/β_i) f(X_{i,j})/ψ_α(X_{i,j}), whose expectation and variance determine those of G, as we show next.
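A minimal sketch of the estimator G, separating the sampling rates {β_i} from the weighting coefficients {α_i}; the integrand and proposals are hypothetical choices for illustration. With β_i = α_i the function reduces to the balance heuristic F, and the factor α_i/β_i implements the correction discussed above.

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def G_estimate(f, pdfs, samplers, alphas, betas, N):
    """Generalized balance heuristic, Equation (16): sample n_i = beta_i * N from p_i,
    weight with the alpha-mixture, and correct each technique by alpha_i / beta_i."""
    total = 0.0
    for p_i, draw_i, a_i, b_i in zip(pdfs, samplers, alphas, betas):
        x = draw_i(int(round(b_i * N)))
        psi_alpha = sum(a * p(x) for a, p in zip(alphas, pdfs))  # sum_k alpha_k p_k(x)
        total += (a_i / b_i) * np.sum(f(x) / psi_alpha)
    return total / N

f = lambda x: normal_pdf(x, 0.0, 1.0)   # true integral: mu = 1
pdfs = [lambda x: normal_pdf(x, -1.0, 1.5), lambda x: normal_pdf(x, 1.0, 0.8)]
samplers = [lambda n: rng.normal(-1.0, 1.5, n), lambda n: rng.normal(1.0, 0.8, n)]
mu_hat = G_estimate(f, pdfs, samplers, [0.5, 0.5], [0.7, 0.3], 20_000)
print(mu_hat)  # unbiased: close to 1 even though alphas != betas
```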

Theorem 1. For any set of weights {α_i}_{i=1}^n and sampling rates {β_i}_{i=1}^n (with β_i > 0 and ∑_{i=1}^n β_i = 1), the estimator G is unbiased, i.e., E[G] = µ.
Proof. The estimator G is unbiased, since

E[G] = (1/N) ∑_{i=1}^n (α_i/β_i) n_i ∫ (f(x)/ψ_α(x)) p_i(x) dx = ∑_{i=1}^n α_i ∫ (f(x)/ψ_α(x)) p_i(x) dx = ∫ f(x) dx = µ,

using n_i = β_i N. The variance of G is given by

V[G] = (1/N) ∑_{i=1}^n (α_i²/β_i) σ_i²,

where σ_i² is the variance of f(X)/ψ_α(X) with X ~ p_i. For the sake of the theoretical analysis, we define G¹, a normalized version of G with N = 1 (see Section 2.1), with variance

V[G¹] = ∑_{i=1}^n (α_i²/β_i) σ_i².   (24)

Next we study four special cases of the estimator G¹.

Remark 1.
We could also consider the one-sample estimator 𝒢, the randomized version of G. However, 𝒢 is a particular case of the general estimator 𝒵, and we have seen in Section 2.2 that the optimal case for 𝒵 is 𝒵 ≡ ℱ; thus it only makes sense to consider the extension G of the multi-sample estimator F.

Remark 2.
The estimator of Grittmann et al. [19] can be interpreted as a G estimator with a particular choice of the {α_i} and {β_i} parameters.

Case 1: β_i = α_i

In this particular case, the estimator G reverts to F. The variance is, by simple substitution in Equation (24),

V[F¹] = ∑_{i=1}^n α_i σ_i².   (26)

We aim at finding the optimal {α_i*}_{i=1}^n such that the variance in Equation (26) is minimized.

Theorem 2.
The optimal estimator F* in terms of variance is achieved when, ∀j ∈ {1, . . . , n}, the values in Equation (27) are equal. See Appendix B for a proof. Compare Equation (27) with the condition for the optimal ℱ*, which is [9] that the following expression is equal ∀j ∈ {1, . . . , n}:

σ_j² + µ_j².   (28)

The optimal {α_i*}_{i=1}^n solutions are in general different for F, Equation (27), and for ℱ, Equation (28), although observe that, when at {α_i*}_{i=1}^n all the µ_i are equal, the optimality condition in Equation (27) implies that the σ_i² are equal too, ∀i ∈ {1, . . . , n}, and thus the optimal solutions for F and ℱ coincide, in concordance with V[F] = V[ℱ] for this special case; see Section 2.2 and Appendix A.
Another interesting result regarding the µ_i values is given by the following theorem.

Proof. Such a point is obviously a local (and global) minimum for V[ℱ] (a convex function), and thus, from Equation (28), σ_i² + µ_i² is equal for all i, hence the result.

Case 2: Fixed {α_i}, Free {β_i}

Consider now that {α_i}_{i=1}^n are fixed, and hence also the {σ_i²}_{i=1}^n. By the Cauchy-Schwartz inequality applied to the sequences {α_i σ_i/√β_i} and {√β_i},

(∑_{i=1}^n α_i σ_i)² ≤ (∑_{i=1}^n α_i² σ_i²/β_i)(∑_{i=1}^n β_i).   (29)

As the left term is fixed, equality in Equation (29) gives the minimum of the right term, which equals V[G¹] since ∑_{i=1}^n β_i = 1; however, equality can only happen when the two sequences are proportional, this is, when for all i, β_i ∝ α_i σ_i. Thus the optimal {β_i}_{i=1}^n are given by

β_i* = α_i σ_i / ∑_{k=1}^n α_k σ_k,   (30)

and the optimal (minimum) variance is

V[G*¹] = (∑_{i=1}^n α_i σ_i)².   (31)

Theorem 4. Given an estimator F with {α_i}_{i=1}^n values, we can always find a better estimator G by sampling with β_i* ∝ α_i σ_i, which is strictly better whenever not all the σ_i are equal.
Proof. Observe that, by applying the Cauchy-Schwartz inequality to the sequences {√α_i σ_i} and {√α_i},

(∑_{i=1}^n α_i σ_i)² ≤ (∑_{i=1}^n α_i σ_i²)(∑_{i=1}^n α_i) = ∑_{i=1}^n α_i σ_i² = V[F¹],   (32)

and the left-hand side of the inequality is the optimal variance of Equation (31); hence the estimator G* always outperforms the estimator F (when comparing estimators F and G, we consider, unless explicitly stated, the same set of {α_i}_{i=1}^n values). Equality in Equation (32) only happens when all the σ_i are equal. In that case, β_i* = α_i for all i, and we revert to the F estimator.
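Theorem 4 and Equations (30)-(32) can be verified with a few lines of arithmetic. The σ_i and α_i below are hypothetical per-technique values, not taken from any example in the paper.

```python
import numpy as np

sigma = np.array([0.2, 1.0, 3.0])   # hypothetical std devs of f/psi_alpha under each p_i
alpha = np.array([1.0, 1.0, 1.0]) / 3.0   # fixed weighting coefficients

V_F = np.sum(alpha * sigma**2)                      # Equation (26): V[F^1], beta_i = alpha_i
beta_star = alpha * sigma / np.sum(alpha * sigma)   # Equation (30): beta_i* prop. alpha_i sigma_i
V_G = np.sum(alpha**2 * sigma**2 / beta_star)       # V[G^1] at the optimal sampling rates
print(V_G, V_F)  # V_G = (sum alpha_i sigma_i)^2 = 1.96, below V_F ~ 3.347
```

The more spread out the σ_i are, the larger the gap between the two variances, as the remark below quantifies.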
Since V[F¹]/V[G*¹] = (∑_{i=1}^n α_i σ_i²)/(∑_{i=1}^n α_i σ_i)², the maximum possible acceleration by using the optimal β_i* values when α_i = 1/n for all i is equal to n (as observed in [7]). This acceleration is approached when there is an index k such that σ_k >> σ_i for all i ≠ k. In general, the more different the σ_i are, the higher the acceleration.

A particular case of Theorem 4 is when α_i = 1/n for all i, so that the optimal sampling rates are

β_i* = σ_i / ∑_{k=1}^n σ_k.   (33)

This case was introduced in Section 4 of [7], where it was shown that this estimator is provably better than F with α_i = 1/n, ∀i, Equation (33) being the optimal case. Examples showing the improvement obtained were also given in [7].

Optimal Efficiency
Let us now take into account that the cost of each sampling technique can be different, as is usually considered in the literature [25]. Let us denote the cost of sampling technique i as c_i. The inverse of the efficiency of the estimator G is given by

E⁻¹_G = (∑_{i=1}^n β_i c_i)(∑_{i=1}^n α_i² σ_i²/β_i).   (35)

Note that this quantity represents the total cost multiplied by the variance of the estimator. Using the Cauchy-Schwartz inequality with the sequences {√(β_i c_i)} and {α_i σ_i/√β_i},

(∑_{i=1}^n α_i σ_i √c_i)² ≤ (∑_{i=1}^n β_i c_i)(∑_{i=1}^n α_i² σ_i²/β_i).   (36)

The optimal sampling rates (for maximizing the efficiency) are those that yield Equation (36) as an equality, which happens when β_i ∝ α_i σ_i/√c_i. Observe that, using again the Cauchy-Schwartz inequality, now with the sequences {√(α_i c_i)} and {√α_i σ_i},

(∑_{i=1}^n α_i σ_i √c_i)² ≤ (∑_{i=1}^n α_i c_i)(∑_{i=1}^n α_i σ_i²),   (37)

where the left-hand side is E⁻¹_G with the optimal sampling rates, and the right-hand side is E⁻¹_F. Note that equality only happens when σ_i ∝ √c_i for all i, in which case β_i* = α_i and we revert to the F estimator. This is summarized in the following theorem.
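The efficiency bound of Equations (35)-(37) is easy to check numerically. The σ_i and α_i below are hypothetical values; the cost vector reuses the (1, 6.24, 3.28) costs that appear in the numerical examples.

```python
import numpy as np

sigma = np.array([0.2, 1.0, 3.0])        # hypothetical per-technique std devs
alpha = np.array([1.0, 1.0, 1.0]) / 3.0  # fixed weighting coefficients
c = np.array([1.0, 6.24, 3.28])          # per-sample costs

def inv_eff(beta):
    """Equation (35): (sum beta_i c_i) * (sum alpha_i^2 sigma_i^2 / beta_i)."""
    return np.sum(beta * c) * np.sum(alpha**2 * sigma**2 / beta)

beta_star = alpha * sigma / np.sqrt(c)
beta_star /= beta_star.sum()             # optimal rates: beta_i prop. alpha_i sigma_i / sqrt(c_i)
bound = np.sum(alpha * sigma * np.sqrt(c)) ** 2   # Cauchy-Schwartz lower bound, Eq. (36)
print(inv_eff(beta_star), bound, inv_eff(alpha))  # the optimum attains the bound
```

Any other normalized β (e.g., β_i = α_i, which gives the F estimator) yields a larger cost-times-variance product.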

Theorem 5.
Given an estimator F with {α_i}_{i=1}^n values and sampling costs {c_i}_{i=1}^n, we can always find a strictly more efficient estimator G* by sampling with β_i* ∝ α_i σ_i/√c_i, unless σ_i ∝ √c_i for all i (in which case we revert to F).

Case 3: Fixed {β_i}, Free {α_i}

In this case, the optimal {α_i*}_{i=1}^n can be found using Lagrange multipliers. Observe that the σ_i² values depend on the {α_i}_{i=1}^n values.

Theorem 6. The optimal values {α_j*}_{j=1}^n are those that obey, for all j, the condition in Equation (38).

Proof. The derivation can be found in Appendix C.
Note that, in the general case, the optimal α_i* ≠ β_i. Moreover, aside from the optimal values {α_j*}_{j=1}^n, we can find cases where the estimator G¹ is more efficient than the estimator F¹. Observe that, from Equation (38), a necessary and sufficient condition for the optimal solution to satisfy α_i* = β_i for all i is that Equation (39) holds (see Appendix C), which is satisfied when µ_i = µ for all i. In this particular case, with µ_i = µ for all i, G ≡ F.

Case 4: β_i = 1/n, Free {α_i}

In the case when β_i = 1/n for all i, the variance becomes

V[G¹] = n ∑_{i=1}^n α_i² σ_i².

Note that this is a usual case in the MIS literature [2,[4][5][6]] and in the adaptive IS (AIS) literature [11][12][13][14][15][24], since all the techniques have the same number of counts. Setting β_i = 1/n in Equation (38) gives the optimal {α_i*}_{i=1}^n, and the corresponding variance V[G*] will be less than or equal to the variance of F¹.

Proof. Apply Theorem 9, Corollary 5 from [26], obtaining the inequality in Equation (42). See also [10,27].
Observe that equality in Equation (43) happens when, for all i, β_i ∝ α_i² σ_i². In the same way we can prove the reverse case; the proof requires, for countermonotonicity, a change of sign in the inequality of Theorem 9, Corollary 5 from [26].
Let us now make the analogous comparison. Proof. The proof is as above, making use of Theorem 9, Corollary 5 from [26], obtaining the inequality in Equation (45).
Observe that equality in Equation (46) happens when, for all i, α_i ∝ 1/σ_i². In the same way we can prove the reverse case of Theorem 9.
As a direct consequence of Theorems 9 and 10, we have the following corollary. By setting the exponent to −1 in Corollary 2, the products α_i σ_i are equal for all i, which is equivalent to saying that α_i σ_i ∝ 1/n = β_i. Observe that this corresponds to the optimal solution {β_j*} given by Equation (30). Observe now that the inequality for this optimal solution, V[G¹] ≤ V[F¹], can be obtained by successive application of Theorems 7 and 9.
Proof. The result is obtained by successive application of Theorems 7 and 9.
As a direct consequence of Theorems 11 and 12, we have the following corollary. Finally, observe that we do not have a sufficient condition for the general case; only heuristics are available [7,9,10,28].

Case 5: General Case
Let us consider the global optimum of the G estimator, that is, without imposing any constraint on the parameters {α_i}, {β_i}. For this we should optimize the Lagrangian function with both normalization constraints. Differentiating with respect to α_i, we obtain Equation (38). Differentiating with respect to β_i, we obtain Equation (30). Thus, the optimal parameters {α_i}_{i=1}^n and {β_i}_{i=1}^n jointly solve Equations (38) and (30).

Singular Solutions
The variance of ℱ can be written as in [7]. The optimal parameters {α_i}_{i=1}^n occur when σ_i² + µ_i² is equal for all i, which gives a local minimum for V[ℱ] (in this case a global minimum too, as V[ℱ] is convex). Observe that this makes all the terms factored by {α_i} in the third inequality equal, and also corresponds to equality in the Cauchy-Schwartz inequality. If all the σ_i are equal, then we cannot improve the F estimator with a suitable selection of the {β_i}_{i=1}^n parameters for the G estimator, because the optimal choice is β_i* = α_i. Observe that, by Cauchy-Schwartz, the second equality in Equation (53) holds with equality only when, for all i, the α_i β_i σ_i² are equal.
In the third equality in Equation (53), the solution in which the corresponding terms are equal for all i gives the optimal solution for V[G] when the {α_i} are fixed, Equation (30). By Cauchy-Schwartz, equality only happens when the terms are equal for all i. We summarize the different possibilities in Table 2.

Relationship with the χ² Divergence: A Necessary and Sufficient Condition for V[G¹] ≤ V[F¹]
The variance of an importance sampling estimator can be written in terms of a χ² divergence when f(x) ≥ 0 [29,30]. As the F estimator can be seen as an importance sampling estimator with pdf ψ_α(x) = ∑_{i=1}^n α_i p_i(x), its variance can be written as

V[F¹] = µ² χ²(f̄, ψ_α),   (56)

where f̄(x) = f(x)/µ is the normalized integrand. Observe that from the χ² expression in Equation (56) the relation to the divergence is clear, and even when the parameters differ, we can still relate the variances to the χ² divergence. In general, ∑_{k=1}^n α_k σ_k > 0; note that if ∑_{k=1}^n α_k σ_k = 0, then σ_k = 0 for all k, the degenerate case already discussed when the {α_i}_{i=1}^n are fixed, as seen in Section 3.2.
As we have taken ∑_{k=1}^n α_k σ_k > 0, we can divide the first and last terms of Equations (58) and (59). Hence the following theorem. Observe that Theorem 13 does not impose any a priori condition on the {α_i}, {β_i} parameters, and it generalizes Theorem 12.
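The identity behind Equation (56) — the variance of mixture importance sampling written as µ² χ²(f̄, ψ_α) — can be checked by simple quadrature for the one-sample case. The integrand and mixture below are hypothetical 1-D choices; both sides agree up to discretization error.

```python
import numpy as np

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

x = np.linspace(-12.0, 12.0, 200_001)        # quadrature grid
dx = x[1] - x[0]
f = normal_pdf(x, 0.0, 1.0)                  # f >= 0 with mu = integral of f = 1
psi = 0.5 * normal_pdf(x, -1.0, 1.5) + 0.5 * normal_pdf(x, 1.0, 0.8)   # psi_alpha

mu = np.sum(f) * dx
var_is = np.sum(f**2 / psi) * dx - mu**2       # one-sample IS variance with proposal psi
chi2 = np.sum((f / mu) ** 2 / psi) * dx - 1.0  # chi^2 divergence between f/mu and psi
print(var_is, mu**2 * chi2)                    # equal up to quadrature error
```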
In the same way, we can generalize Theorems 7-11 with the following corollaries.
The same approximation can be used for Corollaries 3 and 4.

Efficiency Comparison between F and G Estimators
We compare the efficiencies of the F estimator and of the optimal G estimator defined in Theorem 5 with five examples. Table 3 shows the inverse of the efficiencies, i.e., the product of variance and cost, for the F estimator and for the optimal G estimator, for the following sets of {α_k}_{k=1}^n: (i) equal count of samples, α_k = 1/n; (ii) count of samples inversely proportional to the variances of the independent techniques times the cost of sampling the technique, α_k ∝ 1/(c_k v_k) [8,9]; and the three balance heuristic estimators defined in [28], which are (iii) count of samples inversely proportional to the second moments of the independent techniques times the cost of sampling the technique, (iv) count proportional to σ_{k,eq} and inversely proportional to the square root of the cost, α_k ∝ σ_{k,eq}/√c_k, and (v) the third estimator defined in [28]. Below, we describe the five examples. From the results in Table 3, we can see the following:

• As expected, Theorems 4 and 5 hold in all the cases.
• Examples 1-4 show a general gain in efficiency, around 20%.
• Example 5 with equal costs shows a gain in efficiency from 2% to around 20%.
• Example 5 with different costs shows big gains in efficiency; in particular, for an equal count of samples the efficiency is doubled.
We see in all our examples an increase in efficiency when the estimator G is used instead of the estimator F, showing in some cases, as in Example 5, a 50% improvement. Theorems 4 and 5 ensure that estimator G is always better than estimator F if the optimal {β_i} values are used; however, for equal costs (i.e., comparing only variances), the amount of improvement strongly depends on the example considered, from very small or negligible (see the first three rows of Table 4 for the first four examples) to an important one (compare Table 3 for Example 5 with equal costs with the first three rows of Table 4 for Example 5). The values in Tables 3 and 4 are computed using numerical approximations obtained with the Mathematica software. It would also be possible to select the parameters in an adaptive and iterative way; we defer the investigation of such adaptive schemes to future work.

Example 1
Suppose we want to solve the integral by MIS, sampling from pdfs proportional to the functions x, (x² − xπ), and sin(x), respectively. We first find the normalization constants; e.g., ∫ sin(x) dx = 1.88816. The costs of sampling the techniques are (1, 6.24, 3.28).

Example 2
Let us solve the integral µ using the same functions x, (x² − xπ), and sin(x) as before.

Example 3
As the third example, let us solve the integral using the same functions as before. The optimal α values for this example, yielding zero variance, are known; the potential gain is ≈ 100.
Example 4

In this case we know again the optimal (zero variance) α values: (0.3, 0.3, 0.4). The difference with the previous example is that this case should be the most favorable to an equal count of samples.
Example 5

As the last example, consider solving the following integral by MIS, sampling from pdfs proportional to 2 − x and sin²(x).

Variances of Examples 1-5 for Some Notable Cases
In Table 4 we present some notable cases for the variances of Examples 1-5. Observe that here we do not consider the cost of sampling. The results in row 9 correspond to the estimator defined in [19], where the m_i are the second moments of the independent techniques and β_i = 1/n. The results are slightly better than F(1/n) for Examples 1 and 2, much better for Example 3, much worse for Example 4, and slightly worse for Example 5. Observe that the m_i, v_i values can be easily approximated using a first batch of samples in a Monte Carlo integration, as was already done in [7,9,10]. In row 10 we introduce the estimator corresponding to α_i ∝ 1/v_i and β_i = 1/n, which gives better results than F(1/n) for Examples 1 and 2, much better for Example 3, worse for Example 5, and much worse for Example 4. This estimator improves on the one in [19] except for Example 4. Observe that these α_i weights correspond to the optimal weights in the linear combination of estimators when the sampling rates (β_i) are fixed, Equation (13). From Table 4, row 6, we can conclude that there can be room for improvement on the F(1/n) estimator using a G estimator with a suitable selection of the α_i parameters. Heuristics as in rows 9 and 10 can give a large improvement in some cases, at the cost of no improvement at all, and even a huge variance, in other cases; thus they are not robust heuristics. More sophisticated heuristics, as in rows 11 and 12, based on the theoretically justified heuristic in row 8, show a robust behavior.
A very promising estimator is the one in row 7, where all the µ_i are equal (to µ); in all five examples its variance is very near the optimal V[F]. Observe that this estimator has a strong theoretical justification, derived from the study of the G estimator. The {α_i} parameters that approximate µ_i = µ for all i could be obtained in an adaptive way; however, a practical way to construct this estimator remains to be devised.
Table 3 shows E⁻¹, i.e., the inverse of the efficiency, or the product of variance and cost, for the F estimator and for the optimal G estimator, using the same {α_k} values for both. The optimal efficiencies of the G estimator, for each set of {α_k} values, are computed using the left-hand side of Equation (37), while the efficiencies of the F estimator use the right-hand side of Equation (37). We consider the five numerical examples using an equal count of samples, a count inversely proportional to the variances of the independent estimators [8,9], and the three estimators defined in [28]. The sampling costs are (1, 6.24, 3.28). For Example 5, we present the case with equal costs (1, 1) and with different costs (1, 5).

Conclusions
In this paper, we have proposed a multiple importance sampling estimator that combines samples simulated from different techniques. The novel estimator generalizes the Monte Carlo balance heuristic estimator, widely used in the signal processing, computational statistics, and computer graphics literature. In particular, the new estimator relaxes the connection between the coefficients that select the number of samples per proposal and those that appear in the mixture of techniques in the denominator of the importance weight. This flexibility yields a relevant improvement in the variance of the combined estimator with regard to the balance heuristic estimator (which is included as a particular case of the novel estimator). We have studied the optimal choice of the free coefficients in such a way that the variance of the resulting estimator is minimized. In addition, numerical results have shown that the significant gap in variance between both estimators justifies the use of the novel estimator whenever possible. In the particular, but widely used, case of equal sample counts, our new estimator shows a large potential for improvement. Future work may include the application of this variance-reduction technique in the context of adaptive importance sampling [15] or within MCMC-based methods that include reweighting [32].

Appendix A. Difference between the Variances of Deterministic and Randomized Multiple Importance Sampling Estimators
Proof of Equation (9). The difference between the variances of the deterministic multiple importance sampling estimator, Z, and the randomized one, 𝒵, is given by [17] (we normalize here to one sample)

V[𝒵¹] − V[Z¹] = ∑_{i=1}^n α_i µ_i² − µ²,

which is non-negative, since by Cauchy-Schwartz

µ² = (∑_{i=1}^n α_i µ_i)² ≤ (∑_{i=1}^n α_i)(∑_{i=1}^n α_i µ_i²) = ∑_{i=1}^n α_i µ_i²,

and equality only happens (apart from the case when both V[Z¹] and V[𝒵¹] are zero) when, for all i, α_i µ_i² ∝ α_i, i.e., all the µ_i are equal.
Proof of Equation (10). Observe that if f(x) ≥ 0 then µ_i ≥ 0 for all i, and, with α_i = 1/n,

∑_{i=1}^n (1/n) µ_i² ≤ (1/n)(∑_{i=1}^n µ_i)² = n µ²,

and thus

V[𝒵¹] − V[Z¹] ≤ n µ² − µ² = (n − 1) µ²,

where equality is approached when there is an index k such that µ_k >> µ_i for all i ≠ k. In general, the more different the µ_i are, the higher the difference between the variances.

Appendix B. Derivation of Case 1

Setting the derivative of the Lagrangian to zero, for all j the values in Equation (A6) have to be equal. Observe that all derivatives in (A6) are negative for the optimal {α_i} values and equal to −∑_{j=1}^n α_j σ_j².

Appendix C. Derivation of Case 3
Proof of Theorem 6. We have to optimize the target function subject to the constraint ∑_{i=1}^n α_i = 1. Taking partial derivatives with respect to α_i, as the β_i values are constant, we obtain the condition in Equation (38).

Proof of Equation (39).
For the optimal solution to satisfy α_i = β_i for all i, the following equation has to hold for all j. (A20)

Appendix D. An Alternative Perspective Based on the χ 2 Divergence
Let us define ψ_α(x) = ∑_{k=1}^n α_k p_k(x) and ψ_β(x) = ∑_{k=1}^n β_k p_k(x). Further, let us define the normalizing constant Z_{fβα} = ∫ f(x) (ψ_β(x)/ψ_α(x)) dx. We also recall that the χ² divergence between the pdf f̄(x) = f(x)/µ and ψ is given by

χ²(f̄, ψ) = ∫ (f̄(x) − ψ(x))²/ψ(x) dx.

Since in the randomized generalized balance heuristic estimator, 𝒢, all samples are simulated i.i.d. from ψ_β, the variance can be expressed as an integral that can be seen as a modified χ² divergence, with two differences: (a) the normalizing constant Z_{fβα} is modified with regard to µ when ψ_α ≠ ψ_β, and (b) there is a ratio ψ_β/ψ_α that appears as a multiplying factor (evoking an importance weight between the denominator we have used, ψ_α, and the proposal used to simulate, ψ_β). Note that, when ψ_α = ψ_β, Z_{fβα} = µ, the ratio of mixtures is equal to one, and V_{ψ_β}[𝒢] = (1/N) µ² χ²(f̄, ψ_β).