Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling

Sbert, Mateu; Elvira, Víctor

doi:10.3390/e24020191

Open AccessArticle

Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling

by

Mateu Sbert

^1,*

and

Víctor Elvira

^2,3

¹

Informatics and Applications Institute, Girona University, 17003 Girona, Spain

²

School of Mathematics, University of Edinburgh, Edinburgh EH8 9YL, UK

³

The Alan Turing Institute, London NW1 2DB, UK

^*

Author to whom correspondence should be addressed.

Entropy 2022, 24(2), 191; https://doi.org/10.3390/e24020191

Submission received: 13 December 2021 / Revised: 19 January 2022 / Accepted: 25 January 2022 / Published: 27 January 2022

(This article belongs to the Special Issue Adaptive Signal Processing and Machine Learning Using Entropy and Information Theory)

Download Versions Notes

Abstract

:

In this paper, we propose a novel and generic family of multiple importance sampling estimators. We first revisit the celebrated balance heuristic estimator, a widely used Monte Carlo technique for the approximation of intractable integrals. Then, we establish a generalized framework for the combination of samples simulated from multiple proposals. Our approach is based on considering as free parameters both the sampling rates and the combination coefficients, which are the same in the balance heuristics estimator. Thus our novel framework contains the balance heuristic as a particular case. We study the optimal choice of the free parameters in such a way that the variance of the resulting estimator is minimized. A theoretical variance study shows the optimal solution is always better than the balance heuristic estimator (except in degenerate cases where both are the same). We also give sufficient conditions on the parameter values for the new generalized estimator to be better than the balance heuristic estimator, and one necessary and sufficient condition related to

χ^{2}

divergence. Using five numerical examples, we first show the gap in the efficiency of both new and classical balance heuristic estimators, for equal sampling and for several state of the art sampling rates. Then, for these five examples, we find the variances for some notable selection of parameters showing that, for the important case of equal count of samples, our new estimator with an optimal selection of parameters outperforms the classical balance heuristic. Finally, new heuristics are introduced that exploit the theoretical findings.

Keywords:

Monte Carlo; importance sampling; balance heuristic; variance reduction; chi-square divergence; Kullback–Leibler divergence; cross entropy

1. Introduction

Multiple importance sampling (MIS) is a Monte Carlo technique widely used in the literature of signal processing, computational statistics, and computer graphics for approximating complicated integrals. In its basic configuration, it works by drawing random samples from several proposal distributions (also called techniques) and weighting them appropriately in such a way that an estimator built with the pairs of weighted samples is consistent. Since the publication of [1], the celebrated balance heuristic estimator has been extensively used in the Monte Carlo literature, with an unprecedented success in the computer graphics industry (Eric Veach has been awarded with several prizes because of his contributions in the MIS literature, where the balance heuristic is arguably the most relevant one). In the balance heuristic method, different samples are simulated from each proposal and the traditional IS weight is assigned to each of them. Unlike the standard IS estimator, all the weighted samples are combined with an extra weighting, in such a way the resulting estimator typically shows a reduced variance. Its superiority in terms of variance with regard to other traditional combination schemes has been recently shown in [2], where a framework is established for sampling and weighting in MIS under equal number of samples per technique. The balance heuristic, also called the deterministic mixture [3], has been widely used in the literature of MIS. Further efficient variance reduction techniques are proposed in [4,5,6] also in the context of MIS, still with equal counts from each technique. Provably better estimators [7] and heuristically better ones [8,9] have been presented that use a different count of samples than equal count for all techniques. In [10], it has been shown the relationship of a better count of samples with generalized weighted means. The balance heuristic is also present in most of successful adaptive IS (AIS) methods, see [11,12,13,14,15], in particular in the case where all techniques are used to simulate the same number of samples. Recently, MIS has returned to the main focus in computer graphics in [16], where it is shown that allowing weights to be negative, the reduction over balance heuristics of the resulting MIS estimator can be higher than the one predicted by Veach bounds [17]; in [18], where one of the sampling techniques is optimized to decrease the overall variance of the resulting MIS estimator; in [19], where the weights are made proportional to the quotients of second moments divided by the variances of the independent techniques; in [20], where MIS is generalized to uncountably infinite sets of techniques.

Interestingly, the balance heuristic has two properties in the assigned weights. First, all techniques appear at the denominator of the weight of a specific technique. Second, they appear in a form of a mixture, with coefficients proportional to the number of samples simulated from each technique. In this paper, we relax this constraint providing a generalized weighting/combining family of estimators that has the balance heuristic as a particular case. First, we show that it is possible to use a specific set of coefficients to decide the amount of samples per technique, and a different set of coefficients to be applied as the importance weight. Second, we study four different cases fixing some of these coefficients (sampling and/or weighting), and we give the optimal solution for the rest of coefficients in such a way the variance of the MIS estimator is minimized. Note that, the novel estimator always outperforms the balance heuristic under the optimal choice of those coefficients. In five numerical examples we show that, under an adequate choice of parameters, the novel estimator outperforms the celebrated balance heuristic.

The rest of the paper is structured as follows. Section 2 revisits the balance heuristic estimator. In Section 3, we propose the new family of estimators that generalizes the balance heuristic. We address five cases of special interest, depending on which parameters are free and the number of samples simulated from each technique. We then, in Section 4, discuss the different singular points in the variances of the estimators considered. In Section 5, we give a necessary and sufficient condition, in terms of

χ^{2}

divergence (and its approximation as a Kullback–Leibler divergence), for the variance of the new estimator to be better than the one of balance heuristics. Finally, we conclude the paper with five numerical examples in Section 6, propose some heuristics, and give some conclusions in Section 7.

2. Balance Heuristic Estimator

The goal in IS is usually the estimation of the value of integral

μ = \int f (x) d x

. In MIS,

n_{i}

samples,

{X_{i, j}}_{j = 1}^{n_{i}}

, are simulated from a set of available probability density functions (pdfs),

{p_{i}}_{i = 1}^{n}

. The MIS estimator introduced by Veach and Guibas [1] is given by

Z = \sum_{i = 1}^{n} \frac{1}{n_{i}} \sum_{j = 1}^{n_{i}} w_{i} (X_{i, j}) \frac{f (X_{i, j})}{p_{i} (X_{i, j})},

(1)

where

w_{i} (x)

is a weight function associated to the i-th proposal that fulfills both following conditions. First, the weights must sum up to one in all points of the domain where the value of the function is different from zero, i.e.,

\sum_{i = 1}^{n} w_{i} (x) = 1

,

\forall x

where

f (x) \neq 0

, Second, for all x where

p_{i} (x) = 0

, then

w_{i} (x) = 0

. In this paper, we consider only weighting functions that yield Z unbiased.

The balance heuristic estimator is a particular case of Equation (1) where the weight function is given by

w_{i} (x) = \frac{n_{i} p_{i} (x)}{\sum_{k = 1}^{n} n_{k} p_{k} (x)},

(2)

which can be written too as

w_{i} (x) = \frac{α_{i} p_{i} (x)}{\sum_{k = 1}^{n} α_{k} p_{k} (x)},

(3)

where

n_{i} = α_{i} N

. In this case, the estimator in Equation (1) becomes the balance heuristic or deterministic mixture estimator given by

\begin{matrix} F & = & \sum_{i = 1}^{n} \frac{1}{n_{i}} \sum_{j = 1}^{n_{i}} \frac{α_{i} f (X_{i, j})}{\sum_{k = 1}^{n} α_{k} p_{k} (X_{i, j})} \end{matrix}

(4)

\begin{matrix} = & \frac{1}{N} \sum_{i = 1}^{n} \sum_{j = 1}^{n_{i}} \frac{f (X_{i, j})}{\sum_{k = 1}^{n} α_{k} p_{k} (X_{i, j})} . \end{matrix}

(5)

2.1. Interpretation of F and General Notation of the Paper

Note that F is also called deterministic mixture scheme [3] or multi-sample MISestimator [7,17], as opposed to the case where all independent and identically distributed (i.i.d.) samples are from the mixture

ψ_{α} = \sum_{k = 1}^{n} α_{k} p_{k} (x)

. The latter alternative is called the randomized balance heuristic or one-sample MIS estimator, and the estimator is denoted by

F

. This alternative is a subtle variation of F that simply modifies the sampling by simulating the N total samples i.i.d. from the mixture

ψ_{α}

(instead of deterministically choosing the number of samples per proposal). Then, the estimator is the exact same expression of Equation (4). This randomized version

F

allows also for a re-interpretation of the deterministic balance heuristic, F, that can be seen as a mixture sampling (note that the mixture is present in the denominator of all importance weights) with variance reduction using stratified sampling (see Appendix 1 of [2] for a detailed discussion). In the randomized version,

n_{i}

is then a random variable with expected value

α_{i} N

, for all i. In summary, in this paper we refer as deterministic mixture estimator the cases where the number of samples per technique is deterministic, e.g., F, and random mixture estimator when there is i.i.d. sampling from the mixture (and hence

n_{i}

is a random variable), e.g.,

F

. We represent the randomized estimators with calligraphic letters. All estimators, unless the opposite is clearly stated, are deterministic mixture estimators.

Finally, we denote estimators with the superindex

^{1}

when they are versions of a specific estimator but with a number of samples normalized to 1, e.g.,

F^{1}

is the normalized version of F. In other words, even if the estimators require that all the numbers of samples per technique are

n_{i} \in R

, we use these normalized estimators to denote the variance normalized to 1 sample, which simplifies the comparison across estimators (for N total samples, the variance of the estimator would be just the variance of

F^{1}

divided by N).

In Table 1 we show the naming convention used in this paper.

2.2. Rationale

In Theorems 9.2 and 9.4 of [17], the variances of

Z^{1}

and its randomized version

Z^{1}

are given, i.e., the version where instead of deterministically selecting

n_{i}

, all the samples are directly simulated from

\sum_{k = 1}^{n} α_{k} p_{k} (x)

. By subtracting their values we have

\begin{matrix} V [Z^{1}] - V [Z^{1}] = \sum_{i} α_{i} {μ^{'}}_{i}^{2} - μ^{2}, \end{matrix}

(6)

where (we use here a notation to be consistent with notation in [7])

\begin{matrix} {μ^{'}}_{i} = \frac{1}{α_{i}} \int w_{i} (x) f (x) d x, \end{matrix}

(7)

and thus

\begin{matrix} \sum_{i} α_{i} {μ^{'}}_{i} = μ . \end{matrix}

(8)

This result generalizes several particular cases derived in [7] and [2], in the context of a variance analysis of MIS estimators. From Equation (6) (see Appendix A) we have that

\begin{matrix} V [Z^{1}] \leq V [Z^{1}], \end{matrix}

(9)

and equality only happens (apart from the case when both variances

V [Z^{1}], V [Z^{1}]

are zero) when for all i all

μ_{i}^{'}

are equal. One example is given by taking in Equation (7) for all i,

w_{i} (x) = w_{i}

constant and

α_{i} = w_{i}

. We also show in Appendix A that, for the particular case when

α_{i} = 1 / n

and

f (x) \geq 0

, the upper bound for improvement of deterministic versus non-deterministic estimator is given by

\begin{matrix} V [Z^{1}] - V [Z^{1}] \leq (n - 1) μ^{2}, \end{matrix}

(10)

where the bound would be approached when there is an index k such that for all

i \neq k

,

μ_{k}^{'} > > μ_{i}^{'}

. In Theorem 9.4 of [17], it is shown that the optimal weighting functions

w_{i} (x)

for

Z

(i.e., those that minimize their variance

V [Z]

), are the balance heuristic weights, Equation (3); therefore, the optimal case is

Z \equiv F

, where

F

is the random mixture estimator. This is, for any estimator

Z

, when using the same distribution of samples, also taking into account Equation (9) (see also [7]), it always holds that

\begin{matrix} V [F] \leq V [F] \leq V [Z] . \end{matrix}

(11)

Further, in Theorem 9.2 of [17], it is proved that the estimator that optimizes the second moment of

Z^{1}

estimator, this is,

V [Z^{1}] + \sum_{i} α_{i} {μ^{'}}_{i}^{2}

, is the balance heuristic estimator. Thus, it seems clear that for improvement we have to look for a deterministic estimator, which should be a generalization of balance heuristic mixture estimator F. This is presented in the next section.

2.3. Realistic Applications

2.3.1. Global Illumination

The rendering equation [21] tells us that the radiance

L_{o u t} (x, ω_{o u t})

exiting point x in direction

ω_{o u t}

is given by

L_{o u t} (x, ω_{o u t}) = L_{e} (x, ω_{o u t}) + \int_{Ω} L_{i n} (x, ω_{i n}) ρ (x, ω_{i n}, ω_{o u t}) cos θ d ω_{i n},

(12)

where

L_{e} (x, ω_{o u t})

is the emitted radiance,

L_{i n} (x, ω_{i n})

is the incoming radiance at x from direction

ω_{i n}

, with the integral extended to the hemisphere

Ω

above x,

θ

is the angle of the normal at x with

ω_{i n}

, and

ρ

is the bidirectional reflectance distribution function, which gives the fraction of luminance incoming from

ω_{i n}

and scattered in the direction

ω_{o u t}

. To approximate the integral in (12), multiple importance sampling is applied with pdfs that sample, respectively, the incoming radiance and the bidirectional reflectance.

2.3.2. Bayesian Inference

Multiple importance sampling (MIS) is often applied in Bayesian inference [22]. In this context, the posterior distribution is usually intractable, and computing its moments analytically is impossible; therefore, approximate methods are required. MIS is particularly well suited for cases when the posterior distribution is multimodal or needs to be approximated by a mixture of densities [2].

Typically, a set of observations,

y \in R^{d_{y}}

, are available. The inference task consists in estimating probabilistically some hidden parameters and/or latent variables

x \in R^{d_{y}}

, which are connected to the observations through a statistical model. The relation between observations and unknown parameters is encoded in the likelihood,

ℓ (y | x)

, and the prior knowledge on

x

is described through the prior distribution

p_{0} (x)

. Through the Bayes’ rule, we can obtain the posterior distribution of the unknowns as

\tilde{π} (x | y) = \frac{ℓ (y | x) p_{0} (x)}{Z (y)},

(13)

where

Z (y) = \int ℓ (y | x) p_{0} (x) d x

is the marginal likelihood.

In this scenario, we can be interested in computing a moment

h (x)

of the posterior, which is an expectation (hence an integral) of the following form:

μ = E_{\tilde{π}} [h (x)] = \int h (x) \tilde{π} (x | y) d x,

(14)

where

h : R^{d_{x}} \to R

. In connection with the notation of this paper,

f (x) = h (x) \tilde{π} (x | y)

. In the context of Bayesian inference, there are several ways to select the techniques,

{p_{i} (x)}_{i = 1}^{n}

. For instance, they can be deterministically set a priori, e.g., as Laplace approximations at different points [23]. Alternatively, they can be adapted over iterations via iterative stochastic mechanisms [24]. The latter technique is called adaptive importance sampling (AIS) and its literature is vast (see [15] for a recent review).

3. Generalized Multiple Importance Sampling Balance Heuristic Estimator

In classic balance heuristic, both the number of samples

{n_{i} = α_{i} N}_{i = 1}^{n}

and the weights of the mixture of pdfs,

{α_{i}}_{i = 1}^{n}

, are given by the same set of parameters. Let us consider now the estimator of Equation (4), where we relax the dependence between the number of samples

n_{i}

, and the associated coefficient

α_{i}

, i.e., now

n_{i} = β_{i} N

,

β_{i} > 0

,

\sum_{i = 1}^{n} β_{i} = 1

, where in general

α_{i} \neq β_{i}

(otherwise, we recover F). We now define the estimator

\begin{matrix} G & = & \sum_{i = 1}^{n} \frac{α_{i}}{n_{i}} \sum_{j = 1}^{n_{i}} \frac{f (X_{i, j})}{\sum_{k = 1}^{n} α_{k} p_{k} (X_{i, j})} \end{matrix}

(15)

\begin{matrix} = & \frac{1}{N} \sum_{i = 1}^{n} \frac{α_{i}}{β_{i}} \sum_{j = 1}^{n_{i}} \frac{f (X_{i, j})}{\sum_{k = 1}^{n} α_{k} p_{k} (X_{i, j})} . \end{matrix}

(16)

Note that G is a particular case of Z, with weights

w_{i} = \frac{α_{i} p_{i} (x)}{\sum_{k = 1}^{n} α_{k} p_{k} (X_{i, j})}

in Equation (1). Note also that the balance heuristic F is a particular case of G, i.e., in general we do not impose the restriction of

α_{i} = \frac{n_{i}}{N}

. Although the allocation of samples will be different for F and G, we remark that both estimators require the same number of simulations and evaluations. Finally, we note that a close examination of (16) allows us to re-interpret the importance weight associated to each sample. The mixture parametrized with

{n_{i} = α_{i} N}_{i = 1}^{n}

appears in its denominator, but since this mixture does not reflect the sampling procedure (the true mixture proposal in sampling is the one parametrized by

{n_{i} = β_{i} N}_{i = 1}^{n}

), the importance weight of a sample simulated from the technique

β_{i}

, is corrected by the factor

α_{i} / β_{i}

in order to account for this mismatch.

Estimator G can be rewritten as

G = \sum_{i = 1}^{n} α_{i} G_{i}

, where

\begin{matrix} G_{i} & = & \frac{1}{n_{i}} \sum_{j = 1}^{n_{i}} \frac{f (X_{i, j}))}{\sum_{k = 1}^{n} α_{k} p_{k} (X_{i, j}))} . \end{matrix}

(17)

Note also that G depends of two sets of parameters,

{α_{i}}_{i = 1}^{n} a n d {β_{i}}_{i = 1}^{n}

. In the particular case where

β_{i} = α_{i}, \forall i

, the estimator G becomes F. Let us consider the case with

n_{i} = 1

. Then,

\begin{matrix} G_{i}^{'} & = & \frac{f (x)}{\sum_{k = 1}^{n} α_{k} p_{k} (x)}, \end{matrix}

(18)

with expectation

\begin{matrix} E [G_{i}^{'}] = \int \frac{f (x) p_{i} (x)}{\sum_{k = 1}^{n} α_{k} p_{k} (x)} d x \equiv μ_{i}^{'}, \end{matrix}

(19)

and variance

\begin{matrix} {σ^{'}}_{i}^{2} & = & \int \frac{f^{2} (x) p_{i} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x - {(μ_{i}^{'})}^{2} . \end{matrix}

(20)

Observe that

E [G_{i}] = E [G_{i}^{'}] = μ_{i}^{'}

, and

V [G_{i}] = \frac{1}{n_{i}} V [G_{i}^{'}] = \frac{1}{n_{i}} {σ^{'}}_{i}^{2}

.

Theorem 1.

For any set of weights

{α_{i}}_{i = 1}^{n}

, such as

\sum_{i = 1}^{n} α_{i} = 1

and any set of weights

{β_{i}}_{i = 1}^{n}

, such as

\sum_{i = 1}^{n} β_{i} = 1

, G is an unbiased estimator of μ.

Proof.

The estimator G is unbiased, since

\begin{array}{l} (21) & E [G] & = \sum_{i = 1}^{n} α_{i} E [G_{i}] = \sum_{i = 1}^{n} α_{i} μ_{i}^{'} = \sum_{i} α_{i} \int \frac{f (x) p_{i} (x)}{\sum_{k = 1}^{n} α_{k} p_{k} (x)} d x \\ (22) & = \int \frac{f (x) \sum_{i = 1}^{n} α_{i} p_{i} (x)}{\sum_{k = 1}^{n} α_{k} p_{k} (x)} d x \\ = \int f (x) d x \equiv μ . \end{array}

□

The variance of G is given by

\begin{matrix} V [G] = V [\sum_{i = 1}^{n} α_{i} G_{i}] = \sum_{i = 1}^{n} α_{i}^{2} V [G_{i}] = \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{n_{i}} . \end{matrix}

(23)

For the sake of the theoretical analysis, we define

G^{1}

, a normalized version of G with

N = 1

(see Section 2.1), with variance

\begin{matrix} V [G^{1}] = \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} . \end{matrix}

(24)

Next we study four special cases of estimator

G^{1}

.

Remark 1.

We could also consider the one-sample estimator

G

, randomized version of G. However,

G

is a particular case of the general estimator

Z

, and we have seen in Section 2.2 that the optimal case for

Z

is when

Z \equiv F

, thus it only makes sense to consider the extension G of the multi-sample estimator F.

Remark 2.

Grittmann et al. estimator [19], can be interpreted as a G estimator where

α_{i} \propto β_{i} \frac{m_{i}^{2}}{v_{i}} = β_{i} \frac{v_{i} + μ^{2}}{v_{i}}

, where

v_{i}

is the variance of i independent technique and

m_{i}^{2} = v_{i} + μ^{2}

the second moment,

v_{i} = \int \frac{f {(x)}^{2}}{p_{i} (x)} d x .

(25)

3.1. Case 1: $α_{i} = β_{i}$ , $\forall i$

In this particular case, the estimator G reverts to F. The variance is

\begin{matrix} V [F^{1}] = \sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2}, \end{matrix}

(26)

by simple substitution in Equation (24). We aim at finding the optimal

{α_{i}^{*}}_{i = 1}^{n}

such the variance of Equation (26) is minimized.

Theorem 2.

The optimal estimator

F^{*}

in terms of variance is achieved when

\forall j \in {1, \dots, n}

, the following values are equal

\begin{matrix} {σ^{'}}_{j}^{2} + 2 {μ_{j}^{'}}^{2} - 2 \sum_{i = 1}^{n} α_{i}^{*} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{*} p_{k} (x))}^{2}} d x . \end{matrix}

(27)

See Appendix B for a proof.

Compare Equation (27) with the condition for optimal

F^{*}

, which is [9] that the following expression is equal

\forall j \in {1, \dots, n}

,

\begin{matrix} {σ^{'}}_{j}^{2} + {μ_{j}^{'}}^{2} . \end{matrix}

(28)

The optimal

{α_{i}^{*}}_{i = 1}^{n}

solutions are in general different for F, Equation (27), and for

F^{*}

, Equation (28), although observe that, when for

{α_{i}^{*}}_{i = 1}^{n}

all i the

μ_{i}^{'}

are equal, then the optimality condition Equation (27) implies that

\forall i \in {1, \dots, n}

the

{σ^{'}}_{i}^{2}

are equal too, and thus the optimal solutions for F and

F

are the same, in concordance with being

V [F] = V [F]

for this special case, see Section 2.2 and Appendix A.

Another interesting result regarding the

μ_{i}^{'}

values is the following theorem,

Theorem 3.

If

V [F (F)] = 0

then for all i,

μ_{i}^{'} = μ

.

Proof.

As

V [F] \leq V [F (F)]

then

V [F] = \sum_{i}^{n} α_{i} {σ_{i}^{'}}^{2} = 0

, and thus for all i,

σ_{i}^{'} = 0

; however,

V [F (F)] = 0

is obviously a local (and global) minimum for

V [F (F)]

(a convex function) and thus from Equation (28) for all i,

{σ^{'}}_{i}^{2} + {μ_{i}^{'}}^{2}

are equal and hence the result. Alternatively, we know that

V [F] = V [F (F)]

⇔ for all i, all

μ_{i}^{'} = μ

. □

3.2. Case 2: Fixed ${α_{i}}_{i = 1}^{n}$

Consider now that

{α_{i}}_{i = 1}^{n}

are fixed, and hence also

{{σ^{'}}_{i}^{2}}_{i = 1}^{n}

, are fixed. Applying Cauchy–Schwartz inequality to sequences

{\sqrt{β_{i}}}_{i = 1}^{n}

and

{\frac{α_{i} {σ^{'}}_{i}}{\sqrt{β_{i}}}}_{i = 1}^{n}

we obtain

\begin{matrix} {(\sum_{i = 1}^{n} α_{i} σ_{i}^{'})}^{2} \leq (\sum_{i = 1}^{n} β_{i}) (\sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}) . \end{matrix}

(29)

As the left term is fixed, equality in Equation (29) will give the minimum of right term, being this term

V [G^{1}]

as

\sum_{i = 1}^{n} β_{i} = 1

; however, equality can only happen when the right term sequences are proportional, this is, for all i,

\sqrt{β_{i}} \propto \frac{α_{i} {σ^{'}}_{i}}{\sqrt{β_{i}}}

, and thus the optimal

{β_{i}}_{i = 1}^{n}

are given by

\begin{matrix} β_{i}^{*} \propto α_{i} σ_{i}^{'}, i = 1, . . . n, \end{matrix}

(30)

and the optimal (minimum) variance is

\begin{matrix} V [G^{1 *}] = {(\sum_{i = 1}^{n} α_{i} σ_{i}^{'})}^{2} . \end{matrix}

(31)

Theorem 4.

Given an estimator F with

{α_{i}}_{i = 1}^{n}

values, we can always find a better estimator G by sampling as

β_{i}^{*} \propto α_{i} σ_{i}^{'}

, which is strictly better whenever not all

{σ^{'}}_{i}

are equal.

Proof.

Observe that, by applying Cauchy–Schwartz inequality to the sequences

{\sqrt{α_{i}}}_{i = 1}^{n}

and

{\sqrt{α_{i}} {σ^{'}}_{i}}_{i = 1}^{n}

,

\begin{matrix} {(\sum_{i = 1}^{n} α_{i} σ_{i}^{'})}^{2} \leq (\sum_{i = 1}^{n} α_{i}) (\sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2}), \end{matrix}

(32)

but the left-hand side of inequality is

V [G^{1 *}]

, and as

\sum_{i = 1}^{n} α_{i} = 1

, the right-hand side is

V [F^{1}]

. Hence, for the optimal values

{β_{i}^{*}}_{i = 1}^{n}

as in Equation (30), the estimator

G^{*}

always outperforms the estimator F (When comparing estimators F and G, we consider, unless explicitly stated, the same set of

{α_{i}}_{i = 1}^{n}

values).

Equality in Equation (32) only happens when for all i,

\sqrt{α_{i}} \propto \sqrt{α_{i}} {σ^{'}}_{i}

, i.e., when all

{σ^{'}}_{i}

are equal. In that case we have for all i

β_{i}^{*} = α_{i}

, and we revert to the F estimator. □

Remark 3.

From the inequality

(\sum_{i = 1}^{n} \frac{1}{n} {σ^{'}}_{i}^{2}) \leq n {(\sum_{i = 1}^{n} \frac{1}{n} σ_{i}^{'})}^{2}

, the maximum possible acceleration by using the optimal

β_{i}^{*}

values when for all i,

α_{i} = 1 / n

, is equal to n (as observed in [7]). This acceleration would be approached when there is an index k such that for all

i \neq k

,

σ_{k}^{'} > > σ_{i}^{'}

. In general, the more different the

σ_{i}^{'}

are, the higher the acceleration.

A particular case of Theorem 4 is when

α_{i} = \frac{1}{n}

,

\forall i

, where

V [G^{1}]

becomes

\begin{matrix} V [G^{1}] = \frac{1}{n^{2}} \sum_{i = 1}^{n} \frac{{σ^{'}}_{i}^{2}}{β_{i}} . \end{matrix}

(33)

This case was introduced in Section 4 of [7]. It was shown that this estimator is provably better than F with

α_{i} = 1 / n, \forall i

when

\begin{matrix} β_{i}^{*} \propto σ_{i}^{'}, i = 1, \dots, n, \end{matrix}

(34)

which is the optimal case of Equation (33). Examples showing the improvement obtained were also given in [7].

Optimal Efficiency

Let us now take into account that the cost of each sampling technique can be different, as it is usually considered in the literature [25]. Let us denote the cost of sampling technique i as

c_{i}

. The inverse of efficiency for the estimator G is given by

\begin{matrix} E_{G}^{- 1} = (\sum_{i = 1}^{n} β_{i} c_{i}) (\sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}) . \end{matrix}

(35)

Note that this quantity represents the total cost multiplied by the variance of the estimator. Using Cauchy–Schwartz inequality with the sequences

{\sqrt{β_{i} c_{i}}}_{i = 1}^{n}

and

{\frac{α_{i} {σ^{'}}_{i}}{\sqrt{β_{i}}}}_{i = 1}^{n}

, we obtain

\begin{matrix} {(\sum_{i = 1}^{n} α_{i} σ_{i}^{'} \sqrt{c_{i}})}^{2} \leq (\sum_{i = 1}^{n} β_{i} c_{i}) (\sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}) . \end{matrix}

(36)

The optimal sampling rates (for maximizing the efficiency) are those that yield Equation (36) as an equality, which happens when

β_{i}^{*} \propto \frac{α_{i} {σ^{'}}_{i}}{\sqrt{c_{i}}}

. Observe that, using again the Cauchy–Schwartz theorem, with sequences

{\sqrt{α_{i} c_{i}}}_{i = 1}^{n}

and

{\sqrt{α_{i}} {σ^{'}}_{i}}_{i = 1}^{n}

, we obtain

\begin{matrix} {(\sum_{i = 1}^{n} α_{i} σ_{i}^{'} \sqrt{c_{i}})}^{2} \leq (\sum_{i = 1}^{n} α_{i} c_{i}) (\sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2}), \end{matrix}

(37)

where the left-hand side is

E_{G}^{- 1}

with the optimal sampling rates, and the right-hand side is

E_{F}^{- 1}

. Note that equality only happens when for all i,

\sqrt{α_{i} c_{i}} = \sqrt{α_{i}} {σ^{'}}_{i}

, i.e.,

c_{i} \propto {σ^{'}}_{i}^{2}

, where we have

β_{i}^{*} = α_{i}

and we revert to the F estimator. This is summarized in the following theorem.

Theorem 5.

Given an estimator F with

{α_{i}}_{i = 1}^{n}

values, and sampling costs

{c_{i}}_{i = 1}^{n}

, we can always find a strictly more efficient estimator

G^{*}

when for all i,

β_{i}^{*} \propto α_{i} \frac{σ_{i}^{'}}{\sqrt{c_{i}}}

, and not all

c_{i} \propto {σ^{'}}_{i}^{2}

.

3.3. Case 3: Fixed ${β_{i}}_{i = 1}^{n}$

Theorem 6.

Consider now a fixed set

{β_{i}}_{i = 1}^{n}

. The optimal set

{α_{i}^{*}}_{i = 1}^{n}

can be found using Lagrange multipliers with target function

Λ ({α_{i}}_{i = 1}^{n}, λ) = \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} + λ (\sum_{i = 1}^{n} α_{i} - 1) .

Observe that the

{σ^{'}}_{i}^{2}

values depend on the

{α_{i}}_{i = 1}^{n}

values. The optimal values are those that obey, for all j, the following expression

\begin{matrix} \frac{α_{j}^{*} {σ^{'}}_{j}^{2}}{β_{j}} = \sum_{i = 1}^{n} \frac{α_{i}^{* 2}}{β_{i}} \times \\ (\int \frac{f^{2} (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{*} p_{k} (x))}^{3}} d x - μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{*} p_{k} (x))}^{2}} d x) . \end{matrix}

(38)

Proof.

The derivation can be found in the Appendix C. □

Note that in the general case, the optimal

α_{i}^{*} \neq β_{i}

.

Moreover, aside from the optimal values

{α_{j}^{*}}_{j = 1}^{n}

, we can find cases where

V [G^{1}] \leq V [F^{1}]

. Given

{β_{i}}_{i = 1}^{n}

, from Theorem 4 we have that if there exist

{α_{i}}_{i = 1}^{n}

such that

β_{i} \propto α_{i} σ_{i}^{'}, i = 1, . . . n

, then

V [G^{1}] \leq V [F^{1}]

. From Theorem 5, we have that if there exist

{α_{i}}_{i = 1}^{n}

such that

β_{i} \propto α_{i} \frac{σ_{i}^{'}}{\sqrt{c_{i}}}, \forall i = 1, . . . n

, then the estimator

G^{1}

is more efficient than estimator

F^{1}

.

Observe that from Equation (38), a necessary and sufficient condition for the optimal solution

α_{i}^{*}

to be such that

α_{i}^{*} = β_{i}

for all i is that Equation (39) holds (see Appendix C),

\begin{matrix} {μ_{j}^{'}}^{2} = \sum_{i = 1}^{n} α_{i} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x, \end{matrix}

(39)

which is satisfied when for all i,

μ_{i}^{'} = μ

. In this particular case,

μ_{i}^{'} = μ

,

G \equiv F

, and

V [G] = V [F] = V [F]

. In other words, when choosing a sampling rate of

{β_{i}} = {α_{i}^{⋆}}

, where

{α_{i}^{⋆}}

are such that for all i,

μ_{i}^{'} = μ

, the optimal

{α_{i}}

parameters for G estimator are precisely

{α_{i}^{⋆}}

. That is, we can not improve the

F ({α_{i}^{⋆}})

estimator with the G estimator with a suitable selection of

{β_{i}}

parameters. Observe that it is similar to the case studied in Section 3.2 for the

{α_{i}}

that make all

{σ^{'}}_{i}

equal.

Equation (39) might have additional solutions to the one (if exists) that makes all

μ_{i}^{'}

equal, in this case

V [G] = V [F]

too but

V [F] \neq V [F]

.

3.4. Case 4: $β_{i} = 1 / n, \forall i$

In the case when for all i,

β_{i} = 1 / n

, the variance becomes

\begin{matrix} V [G^{1}] = \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{1 / n} = n \sum_{i = 1}^{n} α_{i}^{2} {σ^{'}}_{i}^{2} . \end{matrix}

(40)

Note that this is a usual case in the MIS literature strategies [2,4,5,6] and the adaptive IS (AIS) literature [11,12,13,14,15,24], since all the techniques have the same number of counts. By setting in Equation (38) for all i,

β_{i} = 1 / n

, and if we can optimize

{α_{j}}_{j = 1}^{n}

, we can find the minimum variance values

{α_{j}^{*}}_{j = 1}^{n}

. Thus the minimum variance

V [G^{*}]

corresponds, if the solution exists, to the values

{α_{j}^{*}}_{j = 1}^{n}

that satisfy

\begin{matrix} α_{j}^{*} {σ^{'}}_{j}^{2} = \sum_{i = 1}^{n} α_{i}^{* 2} \\ \times (\int \frac{f^{2} (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{*} p_{k} (x))}^{3}} d x - μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{*} p_{k} (x))}^{2}} d x) . \end{matrix}

(41)

The corresponding variance

V [G^{*}]

will be less or equal than the variance of

V [G]

for all

{α_{j}}_{j = 1}^{n}

, and in particular for

α_{i} = 1 / n

, where G converts into F as

β_{i} = α_{i} = 1 / n

, the classic balance heuristic estimator, thus

V [G^{*} ({α_{i}^{*}}, {β_{i} = 1 / n})] \leq V [F ({α_{i} = 1 / n})]

. Observe that in general

{α_{i}^{*} \neq 1 / n}

, because substituting

{α_{i}^{*} = 1 / n}

in Equation (41) the resulting equation Equation (39) is not satisfied in general.

Comparing $V [G^{1} ({α_{i}}, {β_{i}})]$ , $V [G^{1} ({α_{i}}, {β_{i} = 1 / n})]$ and $V [F^{1} ({α_{i}})]$

Let us see first when

V [G^{1} ({α_{i}}, {β_{i}})] \leq V [G^{1} ({α_{i}}, {β_{i} = 1 / n}))]

and viceversa.

Theorem 7.

If the sequences

{β_{i}}, {\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}}

are comonotonic, i.e.,

β_{i} \leq β_{j} \Rightarrow \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} \leq \frac{α_{j}^{2} {σ^{'}}_{j}^{2}}{β_{j}}

, then

\begin{matrix} V [G^{1} ({α_{i}}, {β_{i}})] \leq V [G^{1} ({α_{i}}, {β_{i} = 1 / n})] . \end{matrix}

(42)

Proof.

Applying Theorem 9, Corollary 5 from [26],

\begin{matrix} \sum_{i = 1}^{n} \frac{1}{n} (\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}) & \leq & \sum_{i = 1}^{n} β_{i} (\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}), \end{matrix}

(43)

obtaining the inequality in Equation (42). See also [10,27]. □

Observe that equality in Equation (43) happens when for all i,

β_{i} \propto α_{i}^{2} {σ^{'}}_{i}^{2}

.

In the same way we can proof the inverse case as follows,

Theorem 8.

If the sequences

{β_{i}}

and

{\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}}

are countermonotonic, i.e.,

β_{i} \leq β_{j} \Rightarrow \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} \geq \frac{α_{j}^{2} {σ^{'}}_{j}^{2}}{β_{j}}

, then

\begin{matrix} V [G^{1} ({α_{i}}, {β_{i}})] \geq V [G^{1} ({α_{i}}, {β_{i} = 1 / n}))] . \end{matrix}

(44)

The proof implies for countermonotonicity a change of sign in the inequality in Theorem 9, Corollary 5 from [26].

Let us compare now

V [G^{1} ({α_{i}}, {β_{i} = 1 / n})]

and

V [F^{1} ({α_{i}})]

.

Theorem 9.

If the sequences

{α_{i}}, {α_{i} {σ^{'}}_{i}^{2}}

are countermonotonic, i.e.,

α_{i} \leq α_{j} \Rightarrow α_{i} {σ^{'}}_{i}^{2} \geq α_{j} {σ^{'}}_{j}^{2}

, then

\begin{matrix} V [G^{1} ({α_{i}}, {β_{i} = 1 / n})] \leq V [F^{1} ({α_{i}})] . \end{matrix}

(45)

Proof.

The proof is as above, making use of Theorem 9, Corollary 5 from [26],

\begin{matrix} \sum_{i = 1}^{n} \frac{1}{n} (α_{i} {σ^{'}}_{i}^{2}) & \geq & \sum_{i = 1}^{n} α_{i} (α_{i} {σ^{'}}_{i}^{2}), \end{matrix}

(46)

obtaining the inequality in Equation (45). □

Observe that equality in Equation (46) happens when for all i,

α_{i} \propto 1 / {σ^{'}}_{i}^{2}

. In the same way we can prove the reverse case of Theorem 9.

Theorem 10.

If the sequences

{α_{i}}, {α_{i} {σ^{'}}_{i}^{2}}

are comonotonic, i.e.,

α_{i} \leq α_{j} \Rightarrow α_{i} {σ^{'}}_{i}^{2} \leq α_{j} {σ^{'}}_{j}^{2}

, then

\begin{matrix} V [G^{1} ({α_{i}}, {β_{i} = 1 / n})] \geq V [F^{1} ({α_{i}})] . \end{matrix}

(47)

As a direct consequence of Theorems 9 and 10 we have the following corollary:

Corollary 1.

When

α_{i} \propto 1 / {σ^{'}}_{i}^{2 + ϵ}

, if

- 2 \leq ϵ \leq 0

then

V [G^{1} ({α_{i}}, {β_{i} = 1 / n})] \leq V [F^{1} ({α_{i}})]

holds, and if

ϵ \geq 0

then

V [G^{1} ({α_{i}}, {β_{i} = 1 / n})] \geq V [F^{1} ({α_{i}})]

holds.

By setting

ϵ = - 1

in Corollary 1, the products

α_{i} {σ^{'}}_{i}

are equal for all i, which is equivalent to saying that

α_{i} {σ^{'}}_{i} \propto 1 / n = β_{i}

. Observe that this corresponds to the optimal solution

{β_{j}^{*}}

given by Equation (30). Observe now that the inequality for this optimal solution,

V [G^{1}] \leq V [F^{1}]

, can be obtained by successive application of Theorems 7 and 9.

Theorem 11.

If the sequences

{α_{i}}, {α_{i} {σ^{'}}_{i}^{2}}

are countermonotonic, and the sequences

{β_{i}}, {\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}}

are comonotonic, then

\begin{matrix} V [G^{1} ({α_{i}}, {β_{i}})] \leq V [F^{1} ({α_{i}})] . \end{matrix}

(48)

Proof.

The result is obtained by successive application of Theorems 7 and 9. □

Theorem 12.

If the sequences

{α_{i}}, {α_{i} {σ^{'}}_{i}^{2}}

are comonotonic, and the sequences

{β_{i}}, {\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}}}

are countermonotonic, then

\begin{matrix} V [G^{1} ({α_{i}}, {β_{i}})] \geq V [F^{1} ({α_{i}})] . \end{matrix}

(49)

As a direct consequence of Theorems 11 and 12 we have the following corollary:

Corollary 2.

When

α_{i} \propto 1 / {σ^{'}}_{i}^{2 + ϵ}

and

β_{i} \propto {(α_{i} {σ^{'}}_{i})}^{γ}

, if

- 2 \leq ϵ \leq 0

and

0 < γ \leq 2

then

V [G^{1} ({α_{i}}, {β_{i}})] \leq V [F^{1} ({α_{i}})]

holds, and if

ϵ \geq 0

and

γ > 2

then

V [G^{1} ({α_{i}}, {β_{i}})] \geq V [F^{1} ({α_{i}})]

holds.

For example, take

α_{i} \propto 1 / {σ^{'}}_{i}^{2}

, and

β_{i} \propto α_{i} {σ^{'}}_{i} \propto 1 / {σ^{'}}_{i}

, then

V [G^{1} ({α_{i}}, {β_{i}})] \leq V [F^{1} ({α_{i}})] .

Finally, observe that we do not have a sufficient condition for

V [F^{1} ({α_{i}})] \leq V [F^{1} ({α_{i} = 1 / n})]

, only there are heuristics [7,9,10,28].

3.5. Case 5: General Case

Let us consider the global optimum for G estimator, this is, without imposing any constraint to the parameters

{α_{i}}, {β_{i}}

. For this we should optimize the lagrangian function

Λ ({α_{i}}_{i = 1}^{n}, {β_{i}}_{i = 1}^{n}, λ_{1}, λ_{2}) = \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} + λ_{1} (\sum_{i = 1}^{n} α_{i} - 1) + λ_{2} (\sum_{i = 1}^{n} β_{i} - 1) .

Differentiating with respect to

α_{i}

,

\frac{\partial Λ ({α_{i}}_{i = 1}^{n}, {β_{i}}_{i = 1}^{n}, λ_{1}, λ_{2})}{\partial_{α_{j}}} = 0,

we obtain Equation (38).

Differentiating with respect to

β_{i}

,

\frac{\partial Λ ({α_{i}}_{i = 1}^{n}, {β_{i}}_{i = 1}^{n}, λ_{1}, λ_{2})}{\partial_{β_{j}}} = 0,

we obtain Equation (30). Thus, the optimal parameters

{α_{i}^{⋆}}_{i = 1}^{n}

and

{β_{i}^{⋆}}_{i = 1}^{n}

will solve for Equations (38) and (30) simultaneously.

Let us consider now the

{α_{i}^{†}}_{i = 1}^{n}

, if they exist, such that for all i,

σ_{i}^{'}

are equal. Then, for Equation (30) to hold for

{α_{i}^{†}}_{i = 1}^{n}

, we have that

β_{i}^{†} = α_{i}^{†}

for all

i = 1, \dots, n

. From Equation (38), taking

{β_{i}^{†}}_{i = 1}^{n} = {α_{i}^{†}}_{i = 1}^{n}

, Equation (39) has to hold for all

α_{i}^{†}

.

Furthermore, the holding of Equation (39), together with the hypothesis that for all j,

{σ^{'}}_{j}^{2}

are equal, allows Equation (27) to hold, and then

{α_{i}^{†}}_{i = 1}^{n}

is an optimum for F too,

V [F] = V [G]

. If in addition, if

{α_{i}^{†}}

are such that they make all

{μ^{'}}_{j = 1}^{n}

equal, then they would be optimal for

F

too,

V [F] = V [F] = V [G]

.

4. Singular Solutions

The variance of

F

can be written as [7]

\begin{matrix} V [F] & = & \sum_{i = 1}^{n} α_{i} ({σ^{'}}_{i}^{2} + {μ^{'}}_{i}^{2} - μ^{2}) = \sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2} + \sum_{i = 1}^{n} α_{i} {μ^{'}}_{i}^{2} - μ^{2} . \\ = & \sum_{i = 1}^{n} α_{i} ({σ^{'}}_{i}^{2} + {μ^{'}}_{i}^{2}) - μ^{2} . \end{matrix}

(50)

The optimal parameters

{α_{i}}_{i = 1}^{n}

happen when for all i,

{σ^{'}}_{i}^{2} + {μ^{'}}_{i}^{2}

equal, which gives a local minimum for

V [F]

(in this case a global minimum too, as

V [F]

is convex). Observe that this makes, in the third inequality, all terms factored by

{α_{i}}

equal, and also corresponds to equality in the Cauchy–Schwartz inequality

\begin{matrix} {(\sum_{i = 1}^{n} α_{i} \sqrt{({σ^{'}}_{i}^{2} + {μ^{'}}_{i}^{2})})}^{2} \leq (\sum_{i = 1}^{n} α_{i}) (\sum_{i = 1}^{n} α_{i} ({σ^{'}}_{i}^{2} + {μ^{'}}_{i}^{2})), \end{matrix}

(51)

with equality only when, for all i,

{σ^{'}}_{i}^{2} + {μ^{'}}_{i}^{2}

are equal.

If there exist

{α_{i}^{⋆}}_{i = 1}^{n}

values such that for all i,

μ_{i}^{'} = μ

and all

{σ^{'}}_{i}

are equal, these values correspond to the global optimum

V [F^{⋆}] = V [F^{⋆}] = V [G^{⋆}]

, and

F^{⋆} \equiv G^{⋆}

, but

F^{⋆} ≢ F^{⋆}

. Suppose there exist

{α_{i}^{⋆}}_{i = 1}^{n}

such that for all i,

μ_{i}^{'} = μ

, then we can not improve the F estimator with a G estimator with a suitable selection of

{β_{i}}_{i = 1}^{n}

parameters, because the optimal

{α_{i}}

parameters for the G estimator when

{β_{i}}_{i = 1}^{n} = {α_{i}^{⋆}}_{i = 1}^{n}

are precisely

{α_{i}^{⋆}}

. This is, the optimal of

V [G ({α_{i}}_{i = 1}^{n}, {β_{i}}_{i = 1}^{n} = {α_{i}^{⋆}}_{i = 1}^{n})]

is for

{α_{i}}_{i = 1}^{n} = {α_{i}^{⋆}}_{i = 1}^{n}

.

Let us see now

V [F]

:

V [F] = \sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2} .

Suppose there exist

{α_{i}^{⋆}}_{i = 1}^{n}

such that all

σ_{i}^{'}

are equal, then we can not improve the F estimator with a suitable selection of

{β_{i}}_{i = 1}^{n}

parameters for the G estimator, because the optimal

{β_{i}^{⋆}}_{i = 1}^{n}

are such that

{β_{i}^{⋆}}_{i = 1}^{n} = {α_{i}^{⋆}}_{i = 1}^{n}

. This is, the optimal of

V [G ({α_{i}^{⋆}}_{i = 1}^{n}, {β_{i}}_{i = 1}^{n})]

is for

{β_{i}}_{i = 1}^{n} = {α_{i}^{⋆}}_{i = 1}^{n}

. Observe that by Cauchy–Schwartz

\begin{matrix} {(\sum_{i = 1}^{n} α_{i} {σ^{'}}_{i})}^{2} \leq (\sum_{i = 1}^{n} α_{i}) (\sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2}), \end{matrix}

(52)

with equality only when, for all i, all the

{σ^{'}}_{i}^{2}

are equal. Observe thus that both the solutions for

μ_{i}^{'}

equal and for

{σ^{'}}_{i}^{2}

equal are singular solutions for G, where it reverts to F.

Let us consider now

V [G]

:

\begin{matrix} V [G] = \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} {σ^{'}}_{i}^{2} = \sum_{i = 1}^{n} α_{i} (\frac{α_{i}}{β_{i}} {σ^{'}}_{i}^{2}) = \sum_{i = 1}^{n} β_{i} (\frac{α_{i}^{2}}{β_{i}^{2}} {σ^{'}}_{i}^{2}) . \end{matrix}

(53)

In the second equality in Equation (53), the solution for all i,

\frac{α_{i}}{β_{i}} {σ^{'}}_{i}^{2}

equal, or

β_{i} \propto α_{i} {σ_{i}^{'}}^{2}

, makes

V [G] = V [F]

, because then for all i,

α_{i} {σ^{'}}_{i}^{2} = K β_{i}

, and substituting in the expressions for

V [G], V [F]

we have

V [G] = V [F] = K

, but

G ≢ F

. By Cauchy–Schwartz,

\begin{matrix} {(\sum_{i = 1}^{n} \sqrt{\frac{α_{i}^{3} {σ^{'}}_{i}^{2}}{β_{i}}})}^{2} \leq (\sum_{i = 1}^{n} α_{i}) (\sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} {σ^{'}}_{i}^{2}), \end{matrix}

(54)

with equality only when, for all i,

\frac{α_{i}}{β_{i}} {σ^{'}}_{i}^{2}

are equal.

In the third equality in Equation (53), the solution for all i,

\frac{α_{i}^{2}}{β_{i}^{2}} {σ^{'}}_{i}^{2}

, equal, or

β_{i} \propto α_{i} σ_{i}^{'}

, gives the optimal solution for

V [G]

when

{α_{i}}

are fixed, Equation (30). By Cauchy–Schwartz,

\begin{matrix} {(\sum_{i = 1}^{n} α_{i} {σ^{'}}_{i})}^{2} \leq (\sum_{i = 1}^{n} β_{i}) (\sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} {σ^{'}}_{i}^{2}), \end{matrix}

(55)

with equality only when, for all i,

\frac{α_{i}^{2}}{β_{i}^{2}} {σ^{'}}_{i}^{2}

are equal.

We summarize in Table 2 the different possibilities.

5. Relationship with $χ^{2}$ Divergence: A Necessary and Sufficient Condition for $V [G] \leq V [F]$

The variance of an importance sampling estimator can be written in terms of a

χ^{2}

divergence if

f (x) \geq 0

[29,30]. As the

F

estimator can be seen as an importance sampling estimator with pdf

p (x) = \sum_{i}^{n} p_{i} (x)

, its variance can be written as

\begin{matrix} V [F] & = & \int \frac{f {(x)}^{2}}{\sum_{i}^{n} α_{i} p_{i} (x)} d x - μ^{2} \\ = & μ^{2} \int \frac{\frac{f {(x)}^{2}}{μ^{2}}}{\sum_{i}^{n} α_{i} p_{i} (x)} d x - 2 μ^{2} \int \frac{f (x)}{μ} d x + μ^{2} \int (\sum_{i}^{n} α_{i} p_{i} (x)) d x \\ = & μ^{2} \int \frac{{(\frac{f (x)}{μ} - (\sum_{i}^{n} α_{i} p_{i} (x)))}^{2}}{\sum_{i}^{n} α_{i} p_{i} (x)} d x \\ = & μ^{2} χ^{2} (\frac{f (x)}{μ}, \sum_{i}^{n} α_{i} p_{i} (x)) . \end{matrix}

(56)

Observe that from the

χ^{2}

expression in Equation (56) it is clear that

V [F] = 0 \Leftrightarrow f (x) \propto \sum_{i}^{n} α_{i} p_{i} (x)

except possibly in a zero measure set. As a consequence, for all i

μ_{i}^{'} = μ

, and thus

V [F] = V [F] = 0

. As

V [F] \leq V [F]

, we have the result

V [F] = 0 \Leftrightarrow V [F] = 0 .

(57)

A possible generalization of Equation (56) to the

G

estimator, the randomized version of G, is given in Appendix D.

Although neither

V [F]

nor

V [G]

can be written as in Equation (56), we can still relate them to

χ^{2}

divergence. In general,

\sum_{i}^{n} α_{k} {σ^{'}}_{k} > 0

. Note that, if

\sum_{k}^{n} α_{k} {σ^{'}}_{k} = 0

, then, for all k,

{σ^{'}}_{k} = 0

and

V [F] = V [G] = 0

. Thus both

V [G], V [F] = 0

or both

V [G], V [F] > 0

. Let us denote

K = \sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}

. Then,

\begin{matrix} V [G] & = & \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} \\ = & K^{2} \sum_{i = 1}^{n} \frac{\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{K^{2}}}{β_{i}} + K^{2} \sum_{i = 1}^{n} β_{i} - K^{2} + K^{2} \sum_{i = 1}^{n} 2 \frac{α_{i} {σ^{'}}_{i}}{K} - K^{2} \sum_{i = 1}^{n} 2 \frac{α_{i} {σ^{'}}_{i}}{K} \\ = & K^{2} (\sum_{i = 1}^{n} \frac{\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{K^{2}} - 2 \frac{α_{i} {σ^{'}}_{i}}{K} β_{i} + β_{i}^{2}}{β_{i}}) - K^{2} + K^{2} \sum_{i = 1}^{n} 2 \frac{α_{i} {σ^{'}}_{i}}{K} \\ = & K^{2} (\sum_{i = 1}^{n} \frac{\frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{K^{2}} - 2 \frac{α_{i} {σ^{'}}_{i}}{K} β_{i} + β_{i}^{2}}{β_{i}}) + K^{2} \\ = & K^{2} (\sum_{i = 1}^{n} \frac{{(\frac{α_{i} {σ^{'}}_{i}}{K} - β_{i})}^{2}}{β_{i}} + 1) \\ = & {(\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k})}^{2} (χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {β_{i}}) + 1) . \end{matrix}

(58)

For

{β_{i} \propto α_{i} σ^{'}}_{i = 1}^{n}

the

χ^{2}

divergence value is zero, and

V [G] = {(\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k})}^{2}

. This is the optimal of

V [G]

when the

{α_{i}}_{i = 1}^{n}

are fixed, as seen in Section 3.2.

Substituting

{β_{i} = α_{i}}_{i = 1}^{n}

in Equation (58) we obtain

\begin{matrix} V [F] = {(\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k})}^{2} (χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {α_{i}}) + 1) . \end{matrix}

(59)

As we have taken

\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k} > 0

, we can divide first and last terms of Equations (58) and (59),

\begin{matrix} V [G] = V [F] \frac{χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {β_{i}}) + 1}{χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {α_{i}}) + 1} . \end{matrix}

(60)

Hence the following theorem,

Theorem 13.

If

V [G], V [F] > 0

, then

V [G] \leq V [F] \Leftrightarrow χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {β_{i}}) \leq χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {α_{i}}) .

Observe that Theorem 13 does not impose any a priori condition on the

{α_{i}}, {β_{i}}

parameters and generalizes Theorem 12. In the same way we can generalize Theorems 7–11 with the following corollaries,

Corollary 3.

If

V [G], V [G ({β_{i} = 1 / n})] > 0

, then

V [G] \leq V [G ({β_{i} = 1 / n}]) \Leftrightarrow χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {β_{i}}) \leq χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {1 / n}) .

Corollary 4.

If

V [G ({β_{i} = 1 / n})], V [F] > 0

, then

V [G ({β_{i} = 1 / n})] \leq V [F] \Leftrightarrow χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {1 / n}) \leq χ^{2} ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {α_{i}}) .

Remark 4.

Observe that, using the second order

χ^{2}

approximation of

K L

(Kullback–Leibler) distance between two distributions

X_{1}, X_{2}

,

χ^{2} (X_{1}, X_{2}) \approx 2 K L (X_{1}, X_{2})

[31], Theorem 13 can be written as

\begin{matrix} V [G] & \leq & V [F] \Leftrightarrow K L ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {β_{i}}) \leq K L ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {α_{i}}) \\ \Leftrightarrow & C E ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {β_{i}}) \leq C E ({\frac{α_{i} {σ^{'}}_{i}}{\sum_{k = 1}^{n} α_{k} {σ^{'}}_{k}}}, {α_{i}}), \end{matrix}

where cross entropy

C E (X_{1}, X_{2}) = H (X_{1}) + K L (X_{1}, X_{2})

, and H stands for entropy.

The same approximation can be used for Corollaries 3 and 4.

6. Numerical Examples

6.1. Efficiency Comparison between F and G Estimators

We compare the efficiencies for the F estimator and the optimal G estimator defined in Theorem 5 with five examples. Table 3 shows the inverse of the efficiencies,

E_{F}^{- 1} = V [F] \cdot C o s t [F]

and

E_{G}^{- 1} = V [G] \cdot C o s t [G]

, i.e., the product of variance and cost for the F estimator and for the optimal G estimator, for these possible sets of

{α_{k}}_{k = 1}^{n}

: (i) equal count of samples,

{α_{k} = 1 / n}

; (ii) count of samples inversely proportional to the variances of the independent techniques times the cost of sampling the technique,

{α_{k} \propto \frac{1}{c_{k} v_{k}}}

[8,9]; and the three balance heuristic estimators defined in [28], which are (iii) count of samples inversely proportional to the second moments of the independent techniques times the cost of sampling the technique,

{α_{k} \propto \frac{1}{c_{k} m_{k}^{2}}}

; (iv) count of samples proportional to

σ_{k, e q} = \frac{σ_{k}^{'} ({α_{i} = 1 / n}_{i = 1}^{n})}{n}

, and inversely proportional to the square root of the cost,

{α_{k} \propto \frac{σ_{k, e q}}{\sqrt{c_{k}}}}

; and (v)

{α_{k} \propto \frac{M_{k, e q}}{\sqrt{c_{k}}}}

where

M_{k, e q} = \sqrt{\int \frac{f^{2} (x) p_{k} (x)}{{(\sum_{i} p_{i} (x))}^{2}}}

.

Below, we describe the five examples. From the results in Table 3, we can see:

As expected, Theorems 4 and 5 hold for all the cases.
Examples 1–4 show a general gain in efficiency, around 20%.
Example 5 with equal costs shows a gain in efficiency from 2% to around 20%.
Example 5 with different costs shows big gains in efficiency, in particular for equal count of sampling the efficiency is doubled.

We see in all our examples an increase in efficiency when the estimator G is used instead of the estimator F, showing in some cases, as in Example 5, a 50% improvement. Theorems 4 and 5 ensure that estimator G is always better than estimator F if the optimal

{β_{i}}

values are used; however, for equal costs (i.e., comparing only variances), the amount of improvement strongly depends on the example considered, from very small o negligible (see first three rows of Table 4 for the first four examples) to an important one (compare Table 3 for Example 5 with equal costs with the first three rows of Table 4 for Example 5). The values are computed for the results in Table 3 and Table 4 by using numerical approximations obtained with Mathematica software. It would also be possible to select them in an adaptive and iterative way. We defer the investigation of these adaptive schemes for a future work.

6.1.1. Example 1

Suppose we want to solve the integral

μ = \int_{\frac{3}{2 π}}^{π} x (x^{2} - \frac{x}{π}) sin (x) d x \approx 10.29

(61)

by MIS sampling on functions

x

,

(x^{2} - \frac{x}{π})

, and

sin (x)

, respectively. We first find the normalization constants:

\int_{\frac{3}{2 π}}^{π} x d x = 4.82082

,

\int_{\frac{3}{2 π}}^{π} (x^{2} - \frac{x}{π}) d x = 8.76463

,

\int_{\frac{3}{2 π}}^{π} sin (x) d x = 1.88816

. The costs for sampling the techniques are (1; 6.24; 3.28).

6.1.2. Example 2

Let us solve the integral

μ = \int_{\frac{3}{2 π}}^{π} (x^{2} - \frac{x}{π}) {sin}^{2} (x) d x \approx 3.60

(62)

using the same functions

x

,

(x^{2} - \frac{x}{π})

, and

sin (x)

as before.

6.1.3. Example 3

As the third example, let us solve the integral

μ = \int_{\frac{3}{2 π}}^{π} x + (x^{2} - \frac{x}{π}) + sin (x) d x \approx 15.47

(63)

using the same functions as before. The optimal

α

values for this example, with zero variance, are

\frac{1}{4.82082 + 8.76463 + 1.88816} (4.82082, 8.76463, 1.88816)

.

6.1.4. Example 4

As the fourth example, consider the integral of the sum of the three pdfs

\begin{matrix} μ & = \int_{\frac{3}{2 π}}^{π} 30 \frac{x}{4.82082} + 30 \frac{(x^{2} - \frac{x}{π})}{8.76463} + 40 \frac{sin (x)}{1.88816} d x \\ \approx 100 . \end{matrix}

(64)

In this case we know again the optimal (zero variance)

α

values:

(0.3, 0.3, 0.4)

. The difference with the previous example is that this case should be most favorable to equal count of samples.

6.1.5. Example 5

As the last example, consider solving the following integral

\begin{matrix} μ = \int_{0.01}^{π / 2} (\sqrt{x} + sin x) d x \approx 2.31175 . \end{matrix}

(65)

by MIS sampling on functions

2 - x

, and

{sin}^{2} (x)

.

6.2. Variances of Examples 1–5 for Some Notable Cases

In Table 4 we present some notable cases for the variance of Examples 1–5. Observe that here we do not consider the cost of sampling. The values are computed for the results in Table 4 with Mathematica software. We have

In first, second and third row we present the optimal values of $V [F] \geq V [F] \geq V [G]$ , respectively. In the first example those values are equal up to the fourth decimal position. In the second example they are equal up to the second decimal position, in the third and fourth examples they are equal, and in the fifth example the gain of $V [G]$ with respect to $V [F]$ is more than 30%.
In the fourth row we present the values of $V [F (1 / n)]$ , which are compared against several strategies and heuristics in the following rows
The values in the fifth row for $V [G]$ , that correspond to Equation (30) (or the left-hand side of Equation (37) with all costs equal to 1), although being better than the values for $V [F]$ in the fourth row, as expected, do not give any significant improvement. As we have seen in Table 3 the improvements for this case come rather from efficiency.
The values in the sixth row for $V [G]$ , that correspond to the solution of Equation (40), this is, the optimal ${α_{i}}$ for ${β_{i} = 1 / n}$ , are smaller than the variances of $V [F]$ for equal sampling rate, fourth row, and much smaller in some of the examples, with variances equal (as in Examples 3 and 4), or less, as in Example 5, than the optimal $V [F]$ .
Although in general there is no ${α_{i}}$ solution for all $μ_{i}^{'}$ equal, for instance the numerical approximation obtained in Example 1 is $μ_{1}^{'} = 10.2624$ , $μ_{2}^{'} = μ_{3}^{'} = 10.2876$ , the variances obtained, in the seventh row of Table 4, are for all the examples very near to the optimal value of $V [F]$ . Observe that if these ${α_{i}^{⋆}} = a r g \forall i, μ_{i}^{'} e q u a l$ values exist, then they are the optimal ${α_{i}}$ values for $V [G (β_{i} = α_{i}^{⋆})]$ , i.e., for the sampling rates ${β_{i} = α_{i}^{⋆}}$ the G estimator can not improve on the F estimator.
In row 8 we show the G estimator corresponding $α_{i} σ_{i}^{'} \approx e q u a l$ , with $β_{i} = 1 / n$ . It improves on $F (1 / n)$ for all examples except the third one, where it scores closely, beating estimators in rows 9 and 10 except for Example 3. It is indeed a theoretical estimator as there is no easy way to solve this equation. However, it can be approximated by the heuristic in line 11, see below. Observe that if ${α_{i}^{⋆}}$ are the solutions of for all i, $α_{i} σ_{i}^{'}$ equal, then the optimal ${β_{i}}$ for $G ({α_{i}^{⋆}}, {β_{i}})$ are ${β_{i} = 1 / n}$ .
The results in row 9 correspond to the estimator defined in [19], $α_{i} \propto m_{i}^{2} / v_{i}$ , where $m_{i}^{2} = v_{i} + μ^{2}$ are the second moments of the independent techniques, and $β_{i} = 1 / n$ . The results are slightly better than $F (1 / n)$ for Examples 1 and 2, much better for Example 3, much worse for Example 4 and slightly worse for Example 5. Observe that the $m_{i}, v_{i}$ values can be easily approximated using a first batch of samples in a Monte Carlo integration, as it was already performed in [7,9,10].
In row 10 we introduce the estimator corresponding to $α_{i} \propto 1 / v_{i}$ , and $β_{i} = 1 / n$ , which gives better results than $F (1 / n)$ for Examples 1 and 2, much better for Example 3, worse for Example 5, and much worse for Example 4. This estimator improves on the one in [19] except for Example 4. Observe that these $α_{i}$ weights correspond to the optimal weights in the linear combination of estimators when the sampling rates ( $β_{i}$ ) are fixed, Equation (13) of [9].
In row 11 we approximate the estimator in row 8 by using ${σ_{i, 1 / n}^{'}}$ , which corresponds to ${α_{i} = 1 / n}$ , to approximate ${σ_{i}^{'}}$ . Observe that ${σ_{i, 1 / n}^{'}}$ values can be easily obtained with a first batch of samples in a Monte Carlo integration. In all cases, except in Example 3, we improve on $F (1 / n)$ . The big error on Example 4 of the estimator in lines 9 and 10 is controlled.
In row 12 we modulate the estimator in row 11 with the $μ_{i}^{'}$ values. Observe that now for all the examples the variance is less than for $F (1 / n)$ .
The results in row 13 show that, even if $F ({α_{i}}) \leq F ({1 / n})$ , we can not guarantee that $G ({α_{i}}, {β_{i} = 1 / n}) \leq F ({1 / n})$ .

From the results in Table 4 row 6, we can conclude that there can be room for improvement on

F (1 / n)

estimator using a G estimator, with a suitable selection of

α_{i}

parameters. Heuristics as in rows 9 and 10 can give much improvement in some cases at the cost of no improvement at all and even a huge variance in other cases, thus they are not robust heuristics. More sophisticated heuristics as in rows 11 and 12, based on the theoretically justified heuristic in row 8, show a robust behavior.

A very promising estimator is the one in row 7, for all i all

μ_{i}^{'}

equal (to

μ

), in all five examples the variance is very near the optimal

V [F]

. Observe that this estimator has a strong theoretical justification, derived from the study of G estimator. The

{α_{i}}

parameters that approximate for all i,

μ_{i}^{'} = μ

could be obtained in an adaptive way; however, a practical way to obtain this estimator should be devised.

7. Conclusions

In this paper, we have proposed a multiple importance sampling estimator that combines samples simulated from different techniques. The novel estimator generalizes the Monte Carlo balance heuristic estimator, widely used in the literature of signal processing, computational statistics, and computer graphics. In particular, this estimator relaxes the connection between the coefficients that select the number of samples per proposal, and the samples that appear in the mixture of techniques at the denominator of the importance weight. This flexibility shows a relevant improvement in terms of variance in the combined estimator with regard to the balance heuristic estimator (which is included as a particular case in the novel estimator). We have studied the optimal choice of the free coefficients in such a way that the variance of the resulting estimator is minimized. In addition, numerical results have shown that the significant gap in terms of variance between both estimators justifies the use of the novel estimator whenever possible. In the particular, but much used case, of equal sampling, our new estimator shows a big potential for improvement. Future work may include the application of this variance-reduction technique in the context of adaptive importance sampling [15] or within MCMC-based methods that include re-weighting [32].

Author Contributions

Conceptualization, M.S.; methodology, M.S.; software, M.S.; validation, M.S. and V.E.; formal analysis, M.S. and V.E.; investigation, M.S. and V.E.; writing—original draft preparation, M.S.; writing—review and editing, M.S. and V.E.; funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by grant PID2019-106426RB-C31 funded by Spanish Ministry of Science and Innovation MCIN/AEI/ 10.13039/501100011033.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Difference between the Variances of Deterministic and Randomized Multiple Importance Sampling Estimators

Proof of Equation (9).

The difference between the variances of the deterministic multiple importance sampling estimator, Z, and the randomized one,

Z

, is given by [17] (we normalize here to one sample)

\begin{matrix} V [Z^{1}] - V [Z^{1}] & = & \sum_{i} α_{i} {μ^{'}}_{i}^{2} - μ^{2} \\ = & \sum_{i} α_{i} {μ^{'}}_{i}^{2} - {(\sum_{i} α_{i} {μ^{'}}_{i})}^{2}, \end{matrix}

(A1)

which is positive, as by Cauchy-Schwartz

\begin{matrix} {(\sum_{i} α_{i} {μ^{'}}_{i})}^{2} \leq (\sum_{i} α_{i}) (\sum_{i} α_{i} {μ^{'}}_{i}^{2}), \end{matrix}

(A2)

and equality only happens (apart from the case when both

V [Z^{1}], V [Z^{1}]

are zero) when for all i,

α_{i} \propto α_{i} {μ^{'}}_{i}^{2}

, i.e., all

{μ^{'}}_{i}

are equal. □

Proof of Equation (10).

Observe that if

f (x) \geq 0

then for all i

μ_{i}^{'} \geq 0

, and we have

\begin{matrix} {(\sum_{i} {μ^{'}}_{i})}^{2} \geq \sum_{i} {μ^{'}}_{i}^{2}, \end{matrix}

(A3)

and thus

\begin{matrix} μ^{2} = {(\sum_{i} \frac{1}{n} {μ^{'}}_{i})}^{2} \geq \frac{1}{n} (\sum_{i} \frac{1}{n} {μ^{'}}_{i}^{2}), \end{matrix}

(A4)

where equality would be approached when there is an index k such that for all

i \neq k

,

μ_{k}^{'} > > μ_{i}^{'}

. Thus when

α_{i} = 1 / n

,

\begin{matrix} V [Z^{1}] - V [Z^{1}] \leq (n - 1) μ^{2} . \end{matrix}

(A5)

□

In general, the more different the

μ_{i}^{'}

are, the higher the difference between the variances.

Appendix B. Proof of Theorem 2: Optimal Variance of F

Proof.

The

{α_{i}}_{i = 1}^{n}

values for the optimal variance of F estimator can be obtained using Lagrange multipliers with the target function

Λ ({α_{i}}_{i = 1}^{n}, λ) = \sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2} + λ (\sum_{i = 1}^{n} α_{i} = 1) .

Taking partial derivatives with respect to

α_{j}

,

\begin{matrix} \frac{\partial Λ ({α_{i}}_{i = 1}^{n}, λ)}{\partial_{α_{j}}} & = & \frac{\partial (\sum_{i = 1}^{n} α_{i} {σ^{'}}_{i}^{2})}{\partial_{α_{j}}} + \frac{\partial (λ (\sum_{i = 1}^{n} α_{i} - 1))}{\partial_{α_{j}}} \\ = & \sum_{i = 1}^{n} \frac{\partial (α_{i} {σ^{'}}_{i}^{2})}{\partial_{α_{j}}} + λ = 0 . \end{matrix}

(A6)

The partial derivatives are equal to

\begin{matrix} \frac{\partial (α_{i} {σ^{'}}_{i}^{2})}{\partial_{α_{j}}} = χ_{i} (j) {σ^{'}}_{j}^{2} + α_{i} \frac{\partial ({σ^{'}}_{i}^{2})}{\partial_{α_{j}}}, \end{matrix}

(A7)

where

χ_{i} (j)

is the characteristic function,

χ_{i} (j) = 1

if

i = j

and 0 if

i \neq j

.

\begin{matrix} \frac{\partial ({σ^{'}}_{i}^{2})}{\partial_{α_{j}}} & = & \frac{\partial (\int \frac{f^{2} (x) p_{i} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x - {(μ_{i}^{'})}^{2})}{\partial_{α_{j}}} \\ = & \frac{\partial (\int \frac{f^{2} (x) p_{i} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x)}{\partial_{α_{j}}} - 2 μ_{i}^{'} \frac{\partial μ_{i}^{'}}{\partial_{α_{j}}} \\ = & - 2 \int \frac{f^{2} (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{3}} d x - 2 μ_{i}^{'} \frac{\partial μ_{i}^{'}}{\partial_{α_{j}}} . \end{matrix}

(A8)

Since we can write

\begin{matrix} \frac{\partial μ_{i}^{'}}{\partial_{α_{j}}} & = & \frac{\partial (\int \frac{f (x) p_{i} (x)}{\sum_{k = 1}^{n} α_{k} p_{k} (x)} d x)}{\partial_{α_{j}}} \\ = & - \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x, \end{matrix}

(A9)

thus Equation (A8) reads

\begin{matrix} \frac{\partial ({σ^{'}}_{i}^{2})}{\partial_{α_{j}}} & = & - 2 \int \frac{f^{2} (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{3}} d x \\ + & 2 μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x . \end{matrix}

(A10)

Then,

\begin{matrix} λ = - \sum_{i = 1}^{n} \frac{\partial (α_{i} {σ^{'}}_{i}^{2})}{\partial_{α_{j}}} = - {σ^{'}}_{j}^{2} + 2 \sum_{i = 1}^{n} α_{i} \times \\ (\int \frac{f^{2} (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{3}} d x - μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x) \\ = & - {σ^{'}}_{j}^{2} + 2 \int \frac{f^{2} (x) p_{j} (x) (\sum_{i = 1}^{n} α_{i} p_{i} (x))}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{3}} d x \\ - & 2 \sum_{i = 1}^{n} α_{i} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x \\ = & - {σ^{'}}_{j}^{2} + 2 ({σ^{'}}_{j}^{2} + {μ_{j}^{'}}^{2}) - 2 \sum_{i = 1}^{n} α_{i} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x \\ = & {σ^{'}}_{j}^{2} + 2 {μ_{j}^{'}}^{2} - 2 \sum_{i = 1}^{n} α_{i} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x \end{matrix}

(A11)

This is, for all j, the following values have to be equal

\begin{matrix} {σ^{'}}_{j}^{2} + 2 {μ_{j}^{'}}^{2} - 2 \sum_{i = 1}^{n} α_{i} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x \end{matrix}

(A12)

Multiplying by

α_{j}

and adding over all j in Equation (A11),

\begin{matrix} λ & = & \sum_{j = 1}^{n} α_{j} λ = \sum_{j = 1}^{n} α_{j} {σ^{'}}_{j}^{2} + 2 \sum_{j = 1}^{n} α_{j} {μ_{j}^{'}}^{2} \\ - & 2 \sum_{i = 1}^{n} α_{i} μ_{i}^{'} \int \frac{f (x) p_{i} (x) (\sum_{j = 1}^{n} α_{j} p_{j} (x))}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x \\ = & \sum_{j = 1}^{n} α_{j} {σ^{'}}_{j}^{2} + 2 \sum_{j = 1}^{n} α_{j} {μ_{j}^{'}}^{2} - 2 \sum_{i = 1}^{n} α_{i} {μ_{i}^{'}}^{2} \\ = & \sum_{j = 1}^{n} α_{j} {σ^{'}}_{j}^{2} \end{matrix}

(A13)

which is the optimal variance of estimator F. Equation (A12) becomes, for all j

\begin{matrix} \sum_{j = 1}^{n} α_{j} {σ^{'}}_{j}^{2} = {σ^{'}}_{j}^{2} + 2 {μ_{j}^{'}}^{2} - 2 \sum_{i = 1}^{n} α_{i} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x . \end{matrix}

(A14)

Observe that all derivatives in (A6) are negative for the optimal

{α_{i}^{⋆}}

values and equal to

- \sum_{j = 1}^{n} α_{j}^{⋆} {σ^{^{'} ⋆}}_{j}^{2}

. □

Appendix C. Derivation of Case 3

Proof of Theorem 6.

We have to optimize the target function

Λ ({α_{i}}_{i = 1}^{n}, λ) = \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} + λ (\sum_{i = 1}^{n} α_{i} - 1) .

Taking partial derivatives with respect to

α_{i}

, as the

β_{i}

values are constant,

\begin{matrix} \frac{\partial Λ ({α_{i}}_{i = 1}^{n}, λ)}{\partial_{α_{j}}} = \sum_{i = 1}^{n} \frac{1}{β_{i}} \frac{\partial (α_{i}^{2} {σ^{'}}_{i}^{2})}{\partial_{α_{j}}} + λ = 0 . \end{matrix}

(A15)

The partial derivatives are equal to

\begin{matrix} \frac{\partial (α_{i}^{2} {σ^{'}}_{i}^{2})}{\partial_{α_{j}}} = 2 α_{j} χ_{i} (j) {σ^{'}}_{j}^{2} + α_{i}^{2} \frac{\partial ({σ^{'}}_{i}^{2})}{\partial_{α_{j}}}, \end{matrix}

(A16)

where

χ_{i} (j)

is the characteristic function. Using the result in Equation (A10), we obtain

\begin{matrix} \sum_{i = 1}^{n} \frac{1}{β_{i}} \frac{\partial (α_{i}^{2} {σ^{'}}_{i}^{2})}{\partial_{α_{j}}} = 2 \frac{α_{j} {σ^{'}}_{j}^{2}}{β_{j}} - 2 \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} \\ \times & (\int \frac{f^{2} (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{3}} d x - μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x) \\ = & - λ . \end{matrix}

(A17)

In Equation (A17), we multiply by

α_{j}

, and add over all indexes j, obtaining

\begin{matrix} λ & = & - 2 \sum_{j = 1}^{n} \frac{α_{j}^{2} {σ^{'}}_{j}^{2}}{β_{j}} + 2 \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} \\ \times & (\int \frac{f^{2} (x) p_{i} (x) (\sum_{j = 1}^{n} α_{j} p_{j} (x))}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{3}} d x \\ - & μ_{i}^{'} \int \frac{f (x) p_{i} (x) (\sum_{j = 1}^{n} α_{j} p_{j} (x))}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x) \\ = & - 2 \sum_{j = 1}^{n} \frac{α_{j}^{2} {σ^{'}}_{j}^{2}}{β_{j}} + 2 \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} \\ \times & (\int \frac{f^{2} (x) p_{i} (x)}{{(\sum_{k = 1}^{n} α_{k} p_{k} (x))}^{2}} d x - μ_{i}^{'} \int \frac{f (x) p_{i} (x)}{(\sum_{k = 1}^{n} α_{k} p_{k} (x))} d x) \\ = & - 2 \sum_{j = 1}^{n} \frac{α_{j}^{2} {σ^{'}}_{j}^{2}}{β_{j}} + 2 \sum_{i = 1}^{n} \frac{α_{i}^{2} {σ^{'}}_{i}^{2}}{β_{i}} = 0 . \end{matrix}

(A18)

We remind that

\sum_{j = 1}^{n} α_{j} = 1

which disappears in the left-hand side and the second term of the right-hand side equation.

From Equation (A17), the optimal

{α_{j}^{⋆}}_{j = 1}^{n}

obey

\begin{matrix} \frac{α_{j}^{⋆} {σ^{'}}_{j}^{2}}{β_{j}} = \sum_{i = 1}^{n} \frac{{α^{⋆}}_{i}^{2}}{β_{i}} \\ \times & (\int \frac{f^{2} (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{⋆} p_{k} (x))}^{3}} d x - μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{⋆} p_{k} (x))}^{2}} d x) . \end{matrix}

(A19)

□

Proof of Equation (39).

For the optimal solution to be, for all i,

α_{i}^{⋆} = β_{i}

then

\begin{matrix} {σ^{'}}_{j}^{2} & = & \int \frac{f^{2} (x) (\sum_{i = 1}^{n} α_{i}^{⋆} p_{i} (x)) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{⋆} p_{k} (x))}^{3}} d x \\ - & \sum_{i = 1}^{n} α_{i}^{⋆} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{⋆} p_{k} (x))}^{2}} d x \\ = & {σ^{'}}_{j}^{2} + {μ^{'}}_{j}^{2} - \sum_{i = 1}^{n} α_{i}^{⋆} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{⋆} p_{k} (x))}^{2}} d x, \end{matrix}

and thus the following equation has to hold for all j

\begin{matrix} {μ^{'}}_{j}^{2} = \sum_{i = 1}^{n} α_{i}^{⋆} μ_{i}^{'} \int \frac{f (x) p_{i} (x) p_{j} (x)}{{(\sum_{k = 1}^{n} α_{k}^{⋆} p_{k} (x))}^{2}} d x . \end{matrix}

(A20)

□

Appendix D. An Alternative Perspective Based on the χ 2 Divergence

Let us define

ψ_{α} = \sum_{k = 1}^{n} α_{k} p_{k} (x)

and

ψ_{β} = \sum_{k = 1}^{n} β p_{k} (x)

. Further, let us define

Z_{\frac{f β}{α}} = ψ_{β} [\frac{f (x)}{ψ_{α (x)}}]

. We also recall that the

χ^{2}

divergence between the pdf

\tilde{f} (x) = \frac{f (x)}{μ}

and

ψ

, is given by

\begin{matrix} χ^{2} (\tilde{f}, ψ) & = \int \frac{{(\tilde{f} (x) - ψ (x))}^{2}}{ψ (x)} d x . \end{matrix}

(A21)

Since in the randomized generalized balance heuristic estimator,

G

, all samples are simulated i.i.d. from

α_{β}

, the variance can be expressed as

\begin{array}{l} (A22) & V_{ψ_{β}} [G] & = \frac{1}{N^{2}} \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}^{2}} \sum_{j = 1}^{n_{i}} V_{ψ_{β}} [\frac{f (X_{i, j})}{ψ_{α} (X_{i, j})}] \\ (A23) & = \frac{1}{N} \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} V_{ψ_{β}} [\frac{f (X)}{ψ_{α} (X)}] \\ (A24) & = \frac{1}{N} \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} E_{ψ_{β}} [{(\frac{f (X)}{ψ_{α} (X)} - Z_{\frac{f β}{α}})}^{2}] \\ (A25) & = \frac{1}{N} \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} \int {(\frac{f (x)}{ψ_{α} (x)} - Z_{\frac{f β}{α}})}^{2} ψ_{β} (x) d x \\ (A26) & = \frac{1}{N} \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} \int \frac{{(f (x) - Z_{\frac{f β}{α}} ψ_{α} (x))}^{2}}{ψ_{α} (x)} \frac{ψ_{β} (x)}{ψ_{α} (x)} d x \\ (A27) & = \frac{1}{N} \sum_{i = 1}^{n} \frac{α_{i}^{2}}{β_{i}} \int \frac{{(μ \tilde{f} (x) - Z_{\frac{f β}{α}} ψ_{α} (x))}^{2}}{ψ_{α} (x)} \frac{ψ_{β} (x)}{ψ_{α} (x)} d x, \end{array}

where the integral can be seen as a modified

χ^{2}

divergence with two differences: (a) the normalizing constant

Z_{\frac{f β}{α}}

is modified with regard to

μ

when

ψ_{α} \neq ψ_{β}

, and (b) there is the ratio

\frac{ψ_{β}}{ψ_{α}}

that appears multiplying (evoking an importance weight between the denominator we have used,

ψ_{α}

, and the proposal used to simulate,

ψ_{β}

). Note that, when

ψ_{α} = ψ_{β}

,

Z_{\frac{f β}{α}} = μ

, the ratio of mixtures is equal to one, and

V_{ψ_{β}} [G] = \frac{1}{N} μ^{2} χ^{2} (f, ψ_{β})

.

References

Veach, E.; Guibas, L. Optimally Combining Sampling Techniques for Monte Carlo Rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 6–11 August 1995; pp. 419–428. [Google Scholar] [CrossRef] [Green Version]
Elvira, V.; Martino, L.; Luengo, D.; Bugallo, M.F. Generalized Multiple Importance Sampling. Stat. Sci. 2019, 34, 129–155. [Google Scholar] [CrossRef] [Green Version]
Owen, A.; Zhou, Y. Safe and Effective Importance Sampling. J. Am. Stat. Assoc. 2000, 95, 135–143. [Google Scholar] [CrossRef]
Elvira, V.; Martino, L.; Luengo, D.; Bugallo, M.F. Efficient Multiple Importance Sampling Estimators. IEEE Signal Process. Lett. 2015, 22, 1757–1761. [Google Scholar] [CrossRef] [Green Version]
Elvira, V.; Martino, L.; Luengo, D.; Bugallo, M.F. Multiple importance sampling with overlapping sets of proposals. In Proceedings of the Statistical Signal Processing Workshop (SSP), Palma de Mallorca, Spain, 26–29 June 2016; pp. 1–5. [Google Scholar]
Elvira, V.; Martino, L.; Luengo, D.; Bugallo, M.F. Heretical Multiple Importance Sampling. IEEE Signal Process. Lett. 2016, 23, 1474–1478. [Google Scholar] [CrossRef]
Sbert, M.; Havran, V.; Szirmay-Kalos, L. Variance Analysis of Multi-sample and One-sample Multiple Importance Sampling. Comput. Graph. Forum 2016, 35, 451–460. [Google Scholar] [CrossRef]
Havran, V.; Sbert, M. Optimal Combination of Techniques in Multiple Importance Sampling. In Proceedings of the 13th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry, VRCAI ’14, Shenzhen, China, 30 November–2 December 2014; ACM: New York, NY, USA, 2014; pp. 141–150. [Google Scholar] [CrossRef]
Sbert, M.; Havran, V. Adaptive multiple importance sampling for general functions. Vis. Comput. 2017, 33, 1–11. [Google Scholar] [CrossRef]
Sbert, M.; Havran, V.; Szirmay-Kalos, L.; Elvira, V. Multiple importance sampling characterization by weighted mean invariance. Vis. Comput. 2018, 34, 843–852. [Google Scholar] [CrossRef]
Cappé, O.; Guillin, A.; Marin, J.M.; Robert, C.P. Population Monte Carlo. J. Comput. Graph. Stat. 2004, 13, 907–929. [Google Scholar] [CrossRef]
Cornuet, J.M.; Marin, J.M.; Mira, A.; Robert, C.P. Adaptive Multiple Importance Sampling. Scand. J. Stat. 2012, 39, 798–812. [Google Scholar] [CrossRef] [Green Version]
Martino, L.; Elvira, V.; Luengo, D.; Corander, J. Layered adaptive importance sampling. Stat. Comput. 2017, 27, 599–623. [Google Scholar] [CrossRef] [Green Version]
Elvira, V.; Martino, L.; Luengo, D.; Bugallo, M.F. Improving Population Monte Carlo: Alternative Weighting and Resampling Schemes. Signal Process. 2017, 131, 77–91. [Google Scholar] [CrossRef] [Green Version]
Bugallo, M.F.; Elvira, V.; Martino, L.; Luengo, D.; Míguez, J.; Djuric, P.M. Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Process. Mag. 2017, 34, 60–79. [Google Scholar] [CrossRef]
Kondapaneni, I.; Vevoda, P.; Grittmann, P.; Skřivan, T.; Slusallek, P.; Křivánek, J. Optimal Multiple Importance Sampling. ACM Trans. Graph. 2019, 38, 37. [Google Scholar] [CrossRef]
Veach, E. Robust Monte Carlo Methods for Light Transport Simulation. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 1997. [Google Scholar]
Karlík, O.; Šik, M.; Vévoda, P.; Skřivan, T.; Křivánek, J. MIS Compensation: Optimizing Sampling Techniques in Multiple Importance Sampling. ACM Trans. Graph. 2019, 38, 151. [Google Scholar] [CrossRef] [Green Version]
Grittmann, P.; Georgiev, I.; Slusallek, P.; Křivánek, J. Variance-Aware Multiple Importance Sampling. ACM Trans. Graph. 2019, 38, 152. [Google Scholar] [CrossRef] [Green Version]
West, R.; Georgiev, I.; Gruson, A.; Hachisuka, T. Continuous Multiple Importance Sampling. ACM Trans. Graph. 2020, 39, 136. [Google Scholar] [CrossRef]
Kajiya, J.T. The Rendering Equation. In Proceedings of the Computer Graphics (SIGGRAPH ’86 Proceedings), Dallas, TX, USA, 18–22 August 1986; Evans, D.C., Athay, R.J., Eds.; Volume 20, pp. 143–150. [Google Scholar]
Robert, C.P. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation; Springer: Berlin/Heidelberg, Germany, 2007; Volume 2. [Google Scholar]
Elvira, V.; Martino, L.; Closas, P. Importance Gaussian Quadrature. IEEE Trans. Signal Process. 2020, 69, 474–488. [Google Scholar] [CrossRef]
Elvira, V.; Chouzenoux, E. Langevin-based strategy for efficient proposal adaptation in population Monte Carlo. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5077–5081. [Google Scholar]
Rubinstein, R.; Kroese, D. Simulation and the Monte Carlo Method; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2008. [Google Scholar]
Sbert, M.; Yoshida, Y. Stochastic Orders on Two-Dimensional Space: Application to Cross Entropy. In Proceedings of the Modeling Decisions for Artificial Intelligence—17th International Conference, MDAI 2020, Sant Cugat, Spain, 2–4 September 2020; Lecture Notes in Computer Science. Torra, V., Narukawa, Y., Nin, J., Agell, N., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; Volume 12256, pp. 28–40. [Google Scholar]
Sbert, M.; Poch, J. A necessary and sufficient condition for the inequality of generalized weighted means. J. Inequalities Appl. 2016, 2016, 292. [Google Scholar] [CrossRef]
Sbert, M.; Havran, V.; Szirmay-Kalos, L. Multiple importance sampling revisited: Breaking the bounds. EURASIP J. Adv. Signal Process. 2018, 2018, 15. [Google Scholar] [CrossRef] [Green Version]
Cornebise, J.; Moulines, E.; Olsson, J. Adaptive methods for sequential importance sampling with application to state space models. Stat. Comput. 2008, 18, 461–480. [Google Scholar] [CrossRef] [Green Version]
Míguez, J. On the performance of nonlinear importance samplers and population Monte Carlo schemes. In Proceedings of the 2017 22nd International Conference on Digital Signal Processing (DSP), London, UK, 23–25 August 2017; pp. 1–5. [Google Scholar]
Nielsen, F.; Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process. Lett. 2014, 21, 10–13. [Google Scholar] [CrossRef] [Green Version]
Geyer, C.J. Reweighting Monte Carlo Mixtures; Technical Report; University of Minnesota: Minneapolis, MN, USA, 1991. [Google Scholar]

Table 1. Naming convention for the multiple importance sampling estimators in this paper. We drop the superindex

^{1}

from primary estimators when not strictly necessary.

Table 1. Naming convention for the multiple importance sampling estimators in this paper. We drop the superindex

^{1}

from primary estimators when not strictly necessary.

Z	Generic deterministic (multi-sample) MIS estimator
$Z^{1}$	Generic deterministic (multi-sample) MIS estimator normalized to one sample
$Z$	Generic randomized (one-sample) MIS estimator
$Z^{1}$	Generic randomized (one-sample) MIS estimator, for number of samples equal to 1
F	Balance heuristic multi-sample MIS estimator
$F^{1}$	Balance heuristic multi-sample MIS estimator normalized to one sample
$F$	Balance heuristic one-sample MIS estimator
$F^{1}$	Balance heuristic one-sample MIS estimator, for number of samples equal to 1
G	Generalized balance heuristic multi-sample MIS estimator
$G^{1}$	Generalized balance heuristic multi-sample MIS estimator normalized to one sample
$G$	Generalized balance heuristic one-sample MIS estimator

Table 2. Singular parameter values.

{α_{i}^{⋆}}_{i = 1}^{n}, {β_{i}^{⋆}}_{i = 1}^{n}

are solutions of the equations in first column and same row.

Table 2. Singular parameter values.

{α_{i}^{⋆}}_{i = 1}^{n}, {β_{i}^{⋆}}_{i = 1}^{n}

are solutions of the equations in first column and same row.

for all i, ${σ_{i}^{'}}^{2} + {μ^{'}}_{i}^{2}$ equal	$V [F ({α_{i}^{⋆}})]$ global minimum
for all i, ${σ_{i}^{'}}^{2}$ equal and ${μ^{'}}_{i}$ equal	$V [F ({α_{i}^{⋆}})] = V [F ({α_{i}^{⋆}})] = V [G ({α_{i}^{⋆}}, {β_{i}} = {α_{i}^{⋆}})]$ global minimum, $F \equiv G$ , but $F ≢ F$
for all i, $μ_{i}^{'}$ equal	$V [F ({α_{i}^{⋆}})] = V [F ({α_{i}^{⋆}})]$ ; optimal of $V [G ({α_{i}}, {β_{i}} = {α_{i}^{⋆}})]$ is for ${α_{i}} = {α_{i}^{⋆}}$
for all i, $σ_{i}^{'}$ equal	optimal of $V [G ({α_{i}^{⋆}}, {β_{i}})]$ is for ${β_{i}} = {α_{i}^{⋆}}$ , $F \equiv G$
for all i, $\frac{α_{i}}{β_{i}} {σ^{'}}_{i}^{2}$ equal, or $β_{i} \propto α_{i} {σ_{i}^{'}}^{2}$	$V [G ({α_{i}^{⋆}}, {β_{i}^{⋆}})] = V [F ({α_{i}^{⋆}})]$
for all i, $\frac{α_{i}^{2}}{β_{i}^{2}} {σ^{'}}_{i}^{2}$ , equal, or $β_{i} \propto α_{i} σ_{i}^{'}$	optimal of $V [G ({α_{i}}, {β_{i}})]$ when ${α_{i}}$ are fixed, $\forall {α_{i}}, V [G ({α_{i}}, {β_{i}^{⋆}})] \leq V [F ({α_{i}})]$

Table 3. We show the metric

E_{F}^{- 1} = V [F] \cdot C o s t [F]

and

E_{G}^{- 1} = V [G] \cdot C o s t [G]

, i.e., the inverse of efficiency or product of variance and cost for the F estimator and for the optimal G estimator, using for both the same

{α_{k}}

values. The optimal efficiencies of G estimator, for each set of

{α_{k}}

values, are computed using the left-hand side of Equation (37), while the efficiencies of F estimator use the right-hand side of Equation (37). We consider for the five numerical examples using an equal count of samples, count inversely proportional to the variances of independent estimators [8,9], and for the three estimators defined in [28]. The sampling costs are (1, 6.24, 3.28). In Example 5, we present the case with equal costs (1, 1), and different costs (1, 5).

Table 3. We show the metric

E_{F}^{- 1} = V [F] \cdot C o s t [F]

and

E_{G}^{- 1} = V [G] \cdot C o s t [G]

, i.e., the inverse of efficiency or product of variance and cost for the F estimator and for the optimal G estimator, using for both the same

{α_{k}}

values. The optimal efficiencies of G estimator, for each set of

{α_{k}}

values, are computed using the left-hand side of Equation (37), while the efficiencies of F estimator use the right-hand side of Equation (37). We consider for the five numerical examples using an equal count of samples, count inversely proportional to the variances of independent estimators [8,9], and for the three estimators defined in [28]. The sampling costs are (1, 6.24, 3.28). In Example 5, we present the case with equal costs (1, 1), and different costs (1, 5).

	Ex. 1		Ex. 2		Ex. 3		Ex. 4		Ex. 5		Ex. 5
	Ex. 1		Ex. 2		Ex. 3		Ex. 4		costs = (1,1)	costs = (1,1)	costs = (1,5)	costs = (1,5)
Estimator	F	G	F	G	F	G	F	G	F	G	F	G
$α_{k} \propto \frac{1}{n}$	102.26	89.40	17.24	15.44	37.47	31.80	98.68	83.78	0.28	0.23	0.83	0.40
$α_{k} \propto \frac{1}{c_{k} v_{k}}$ [8]	49.53	41.29	9.28	8.10	4.03	3.85	300.12	294.85	0.31	0.26	2.76	2.33
$α_{k} \propto \frac{1}{c_{k} m_{k}^{2}}$ [28]	54.36	46.2	9.82	8.49	3.12	2.49	534.37	449.33	0.20	0.15	1.51	1.00
$α_{k} \propto \frac{σ_{k, e q}}{\sqrt{c_{k}}}$ [28]	81.43	69.88	13.54	11.67	28.68	23.17	91.01	73.54	1.00	0.98	2.90	2.50
$α_{k} \propto \frac{M_{k, e q}}{\sqrt{c_{k}}}$ [28]	79.73	67.77	13.08	11.35	25.74	20.76	31.77	25.90	0.29	0.24	2.72	2.28

Table 4. Variances of Examples 1–5 for some notable cases. The variables

{μ^{'}}_{i, 1 / n}, {σ^{'}}_{i, 1 / n}

correspond to

{α_{i} = \frac{1}{n}}

. The variables

m_{i}^{2}, v_{i}

are the second moment and variance of independent technique i,

v_{i} = m_{i}^{2} - μ^{2}

.

Table 4. Variances of Examples 1–5 for some notable cases. The variables

{μ^{'}}_{i, 1 / n}, {σ^{'}}_{i, 1 / n}

correspond to

{α_{i} = \frac{1}{n}}

. The variables

m_{i}^{2}, v_{i}

are the second moment and variance of independent technique i,

v_{i} = m_{i}^{2} - μ^{2}

.

	Estimator	Example 1	Example 2	Example 3	Example 4	Example 5
1	$O p t i m a l_{α} V [F]$	22.7122	4.1949	0	0	0.0910
2	$O p t i m a l_{α} V [F]$	22.7122	4.1944	0	0	0.0903
3	$O p t i m a l_{α, β} V [G]$	22.7122	4.1932	0	0	0.0601
4	$V [F (α_{i} = 1 / n)]$	29.1634	4.9175	10.6877	28.1431	0.2771
5	$O p t i m a l_{β} V [G (α_{i} = 1 / n)]$	29.0908	4.9069	10.6125	27.9412	0.2264
6	$O p t i m a l_{α} V [G (β_{i} = 1 / n)]$	27.1603	4.7265	0	0	0.0653
7	$V [F (α_{i} = a r g \forall i, μ_{i}^{'} \approx μ)]$	22.8216	4.1980	0	0	0.0926
8	$V [G (α_{i} = a r g \forall i, α_{i} σ_{i}^{'} \approx equal, β_{i} = 1 / n)]$	28.4089	4.8313	12.4705	11.2072	0.0734
9	$V [G (α_{i} \propto \frac{m_{i}^{2}}{v_{i}}, β_{i} = 1 / n)]$ ,19]	28.0224	4.8513	3.7955	172.192	0.2832
10	$V [G (α_{i} \propto \frac{1}{v_{i}}, β_{i} = 1 / n)]$	27.4126	4.7897	0.5466	1256.48	0.2943
11	$V [G (α_{i} \propto \frac{1}{σ_{i, 1 / n}^{'}}, β_{i} = 1 / n)]$	28.3977	4.8291	11.95	13.7926	0.0724
12	$V [G (α_{i} \propto \frac{{μ^{'}}_{i, 1 / n}}{{σ^{'}}_{i, 1 / n}}, β_{i} = 1 / n))]$	27.4426	4.7373	7.8615	6.7895	0.1112
13	$V [G (α_{i} = a r g \forall i, μ_{i}^{'} \approx μ, β_{i} = 1 / n)]$	41.8791	8.8798	0.0014	0	0.0739

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sbert, M.; Elvira, V. Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling. Entropy 2022, 24, 191. https://doi.org/10.3390/e24020191

AMA Style

Sbert M, Elvira V. Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling. Entropy. 2022; 24(2):191. https://doi.org/10.3390/e24020191

Chicago/Turabian Style

Sbert, Mateu, and Víctor Elvira. 2022. "Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling" Entropy 24, no. 2: 191. https://doi.org/10.3390/e24020191

APA Style

Sbert, M., & Elvira, V. (2022). Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling. Entropy, 24(2), 191. https://doi.org/10.3390/e24020191

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Generalizing the Balance Heuristic Estimator in Multiple Importance Sampling

Abstract

1. Introduction

2. Balance Heuristic Estimator

2.1. Interpretation of F and General Notation of the Paper

2.2. Rationale

2.3. Realistic Applications

2.3.1. Global Illumination

2.3.2. Bayesian Inference

3. Generalized Multiple Importance Sampling Balance Heuristic Estimator

3.1. Case 1: α i = β i , ∀ i

3.2. Case 2: Fixed { α i } i = 1 n

Optimal Efficiency

3.3. Case 3: Fixed { β i } i = 1 n

3.4. Case 4: β i = 1 / n , ∀ i

Comparing V [ G 1 ( { α i } , { β i } ) ] , V [ G 1 ( { α i } , { β i = 1 / n } ) ] and V [ F 1 ( { α i } ) ]

3.5. Case 5: General Case

4. Singular Solutions

5. Relationship with χ 2 Divergence: A Necessary and Sufficient Condition for V [ G ] ≤ V [ F ]

6. Numerical Examples

6.1. Efficiency Comparison between F and G Estimators

6.1.1. Example 1

6.1.2. Example 2

6.1.3. Example 3

6.1.4. Example 4

6.1.5. Example 5

6.2. Variances of Examples 1–5 for Some Notable Cases

7. Conclusions

Author Contributions

Funding

Conflicts of Interest

Appendix A. Difference between the Variances of Deterministic and Randomized Multiple Importance Sampling Estimators

Appendix B. Proof of Theorem 2: Optimal Variance of F

Appendix C. Derivation of Case 3

Appendix D. An Alternative Perspective Based on the χ 2 Divergence

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.1. Case 1: $α_{i} = β_{i}$ , $\forall i$

3.2. Case 2: Fixed ${α_{i}}_{i = 1}^{n}$

3.3. Case 3: Fixed ${β_{i}}_{i = 1}^{n}$

3.4. Case 4: $β_{i} = 1 / n, \forall i$

Comparing $V [G^{1} ({α_{i}}, {β_{i}})]$ , $V [G^{1} ({α_{i}}, {β_{i} = 1 / n})]$ and $V [F^{1} ({α_{i}})]$

5. Relationship with $χ^{2}$ Divergence: A Necessary and Sufficient Condition for $V [G] \leq V [F]$