Ensemble Estimation of Information Divergence †

Kevin R. Moon; Kumar Sricharan; Kristjan Greenewald; Alfred O. Hero III

doi:10.3390/e20080560

,

and

¹

Genetics Department and Applied Math Program, Yale University, New Haven, CT 06520, USA

²

Intuit Inc., Mountain View, CA 94043, USA

³

IBM Research, Cambridge, MA 02142, USA

⁴

Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109, USA

Entropy2018, 20(8), 560;https://doi.org/10.3390/e20080560

This article belongs to the Special Issue Information Theory in Machine Learning and Data Science

Version Notes

Order Reprints

Abstract

Recent work has focused on the problem of nonparametric estimation of information divergence functionals between two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at the support set boundary which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived where knowledge of the support boundary, and therefore, the boundary correction is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for the tuning parameter selection and the asymptotic distribution of this estimator are provided. Based on the theory, an empirical estimator of Rényi-

α

divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of mean squared error, especially in high dimensions. The estimator is shown to be robust to the choice of tuning parameters. We show extensive simulation results that verify the theoretical results of our paper. Finally, we apply the proposed estimator to estimate the bounds on the Bayes error rate of a cell classification problem.

Keywords:

divergence; differential entropy; nonparametric estimation; central limit theorem; convergence rates; bayes error rate

1. Introduction

Information divergences are integral functionals of two probability distributions and have many applications in the fields of information theory, statistics, signal processing, and machine learning. Some applications of divergences include estimating the decay rates of error probabilities [1], estimating bounds on the Bayes error [2,3,4,5,6,7,8] or the minimax error [9] for a classification problem, extending machine learning algorithms to distributional features [10,11,12,13], testing the hypothesis that two sets of samples come from the same probability distribution [14], clustering [15,16,17], feature selection and classification [18,19,20], blind source separation [21,22], image segmentation [23,24,25], and steganography [26]. For many more applications of divergence measures, see reference [27]. There are many information divergence families including Alpha- and Beta-divergences [28] as well as f-divergences [29,30]. In particular, the f-divergence family includes the well-known Kullback–Leibler (KL) divergence [31], the Rényi-

α

divergence integral [32], the Hellinger–Bhattacharyya distance [33,34], the Chernoff-

α

divergence [5], the total variation distance, and the Henze–Penrose divergence [6].

Despite the many applications of divergences between continuous random variables, there are no nonparametric estimators of these functionals that achieve the parametric mean squared error (MSE) convergence rate, are simple to implement, do not require knowledge of the boundary of the density support set, and apply to a large set of divergence functionals. In this paper, we present the first information divergence estimator that achieves all of the above. Specifically, we address the problem of estimating divergence functionals when only a finite population of independent and identically distributed (i.i.d.) samples is available from the two d-dimensional distributions that are unknown, nonparametric, and smooth. Our contributions are as follows:

We propose the first information divergence estimator, referred to as EnDive, that is based on ensemble methods. The ensemble estimator takes a weighted average of an ensemble of weak kernel density plug-in estimators of divergence where the weights are chosen to improve the MSE convergence rate. This ensemble construction makes it very easy to implement EnDive.
We prove that the proposed ensemble divergence estimator achieves the optimal parametric MSE rate of $O (\frac{1}{N})$ , where N is the sample size when the densities are sufficiently smooth. In particular, EnDive achieves these rates without explicitly performing boundary correction which is required for most other estimators. Furthermore, we show that the convergence rates are uniform.
We prove that EnDive obeys a central limit theorem and thus, can be used to perform inference tasks on the divergence such as testing that two populations have identical distributions or constructing confidence intervals.

1.1. Related Work

Much work has focused on the problem of estimating the entropy and the information divergence of discrete random variables [1,29,35,36,37,38,39,40,41,42,43]. However, the estimation problem for discrete random variables differs significantly from the continuous case and thus employs different tools for both estimation and analysis.

One approach to estimating the differential entropy and information divergence of continuous random variables is to assume a parametric model for the underlying probability distributions [44,45,46]. However, these methods perform poorly when the parametric model does not fit the data well. Unfortunately, the structure of the underlying data distribution is unknown for many applications, and thus the chance for model misspecification is high. Thus, in many of these applications, parametric methods are insufficient, and nonparametric estimators must be used.

While several nonparametric estimators of divergence functionals between continuous random variables have been previously defined, the convergence rates are known for only a few of them. Furthermore, the asymptotic distributions of these estimators are unknown for nearly all of them. For example, Póczos and Schneider [10] established a weak consistency for a bias-corrected k-nearest neighbor (nn) estimator for Rényi-

α

and other divergences of a similar form where k was fixed. Li et al. [47] examined k-nn estimators of entropy and the KL divergence using hyperspherical data. Wang et al. [48] provided a k-nn based estimator for KL divergence. Plug-in histogram estimators of mutual information and divergence have been proven to be consistent [49,50,51,52]. Hero et al. [53] provided a consistent estimator for Rényi-

α

divergence when one of the densities is known. However none of these works studied the convergence rates or the asymptotic distribution of their estimators.

There has been recent interest in deriving convergence rates for divergence estimators for continuous data [54,55,56,57,58,59,60]. The rates are typically derived in terms of a smoothness condition on the densities, such as the Hölder condition [61]:

Definition 1 (Hölder Class).

Let

X \subset R^{d}

be a compact space. For

r = (r_{1}, \dots, r_{d}),

r_{i} \in N,

define

| r | = \sum_{i = 1}^{d} r_{i}

and

D^{r} = \frac{\partial^{| r |}}{\partial x_{1}^{r_{1}} \dots \partial x_{d}^{r_{d}}}

. The Hölder class

Σ (s, K_{H})

of functions on

L_{2} (X)

consists of the functions (f) that satisfy

|D^{r} f (x) - D^{r} f (y)| \leq K_{H} {∥x - y∥}^{min (s - | r |, 1)},

for all

x, y \in X

and for all r s.t.

| r | \leq ⌊s⌋

.

From Definition 1, it is clear that if a function (f) belongs to

Σ (s, K_{H})

, then f is continuously differentiable up to order

⌊s⌋

. In this work, we show that EnDive achieves a parametric MSE convergence rate of

O (1 / N)

when

s \geq d

and

s > \frac{d}{2}

, depending on the specific form of the divergence function.

Nguyen et al. [56] proposed an f-divergence estimator that estimates the likelihood ratio of the two densities by solving a convex optimization problem and then plugging it into the divergence formulas. The authors proved that the minimax MSE convergence rate is parametric when the likelihood ratio is a member of the bounded Hölder class

Σ (s, K_{H})

with

s \geq d / 2

. However, this estimator is restricted to true f-divergences and may not apply to the broader class of divergence functionals that we consider here (as an example, the

L_{2}^{2}

divergence is not an f-divergence). Additionally, solving the convex problem of [56] has similar computational complexity to that of training a support vector machine (SVM) (between

O (N^{2})

and

O (N^{3})

), which can be demanding when N is large. In contrast, the EnDive estimator that we propose requires only the construction of simple density plug-in estimates and the solution of an offline convex optimization problem. Therefore, the most computationally demanding step in the EnDive estimator is the calculation of the density estimates, which has a computational complexity no greater than

O (N^{2})

.

Singh and Póczos [58,59] provided an estimator for Rényi-

α

divergences as well as general density functionals that use a “mirror image” kernel density estimator. They proved that these estimators obtain an MSE convergence rate of

O (\frac{1}{N})

when

s \geq d

for each of the densities. However their approach requires several computations at each boundary of the support of the densities which is difficult to implement as d gets large. Also, this computation requires knowledge of the support (specifically, the boundaries) of the densities which is unknown in most practical settings. In contrast, while our assumptions require the density support sets to be bounded and the boundaries to be smooth, knowledge of the support is not required to implement EnDive.

The “linear” and “quadratic” estimators presented by Krishnamurthy et al. [57] estimate divergence functionals that include the form

\int f_{1}^{α} (x) f_{2}^{β} (x) d μ (x)

for given

α

and

β

where

f_{1}

and

f_{2}

are probability densities. These estimators achieve the parametric rate when

s \geq d / 2

and

s \geq d / 4

for the linear and quadratic estimators, respectively. However, the latter estimator is computationally infeasible for most functionals, and the former requires numerical integration for some divergence functionals which can be computationally difficult. Additionally, while a suitable

α

-

β

indexed sequence of divergence functionals of this form can be constructed that converge to the KL divergence, this does not guarantee convergence of the corresponding sequence of divergence estimators, as shown in reference [57]. In contrast, EnDive can be used to estimate the KL divergence directly. Other important f-divergence functionals are also excluded from this form including some that bound the Bayes error [2,4,6]. In contrast, our method applies to a large class of divergence functionals and avoids numerical integration.

Finally, Kandasamy et al. [60] proposed influence function-based estimators of distributional functionals including divergences that achieve the parametric rate when

s \geq d / 2

. While this method can be applied to general functionals, the estimator requires numerical integration for some functionals. Additionally, the estimators in both Kandasamy et al. [60] and Krishnamurthy et al. [57] require an optimal kernel density estimator. This is difficult to construct when the density support is bounded as it requires difficult computations at the density support set boundary and therefore, knowledge of the density support set. In contrast, Endive does not require knowledge of the support boundary.

In addition to the MSE convergence rates, the asymptotic distribution of divergence estimators is of interest. Asymptotic normality has been established for certain divergences between a specific density estimator and the true density [62,63,64]. This differs from the problem we consider where we assume that both densities are unknown. The asymptotic distributions of the estimators in references [56,57,58,59] are currently unknown. Thus, it is difficult to use these estimators for hypothesis testing which is crucial in many scientific applications. Kandasamy et al. [60] derived the asymptotic distribution of their data-splitting estimator but did not prove similar results for their leave-one-out estimator. We establish a central limit theorem for EnDive which greatly enhances its applicability in scientific settings.

Our ensemble divergence estimator reduces to an ensemble entropy estimator as a special case when data from only one distribution is considered and the other density is set to a uniform measure (see reference [28] for more on the relationship between entropy and information divergence). The resultant entropy estimator differs from the ensemble entropy estimator proposed by Sricharan et al. [65] in several important ways. First, the density support set must be known for the estimator in reference [65] to perform the explicit boundary correction. In contrast, the EnDive estimator does not require any boundary correction. To show this requires a significantly different approach to prove the bias and variance rates of the EnDive estimator. Furthermore, the EnDive results apply under more general assumptions for the densities and the kernel used in the weak estimators. Finally, the central limit theorem applies to the EnDive estimator which is currently unknown for the estimator in reference [65].

We also note that Berrett et al. [66] proposed a modification of the Kozachenko and Leonenko estimator of entropy [67] that takes a weighted ensemble estimation approach. While their results require stronger assumptions for the smoothness of the densities than ours do, they did obtain the asymptotic distribution of their weighted estimator and they also showed that the asymptotic variance of the estimator is not increased by taking a weighted average. This latter point is an important selling point of the ensemble framework—we can improve the asymptotic bias of an estimator without increasing the asymptotic variance.

1.2. Organization and Notation

The paper is organized as follows. We first derive the MSE convergence rates in Section 2 for a weak divergence estimator, which is a kernel density plug-in divergence estimator. We then generalize the theory of optimally weighted ensemble entropy estimation developed in reference [65] to obtain the ensemble divergence estimator EnDive from an ensemble of weak estimators in Section 3. A central limit theorem and uniform convergence rate for the ensemble estimator are also presented in Section 3. In Section 4, we provide guidelines for selecting the tuning parameters based on experiments and the theory derived in the previous sections. We then perform experiments in Section 4 that validate the theory and establish the robustness of the proposed estimators to the tuning parameters.

Bold face type is used for random variables and random vectors. The conditional expectation given a random variable

Z

is denoted as

E_{Z}

. The variance of a random variable is denoted as

V

, and the bias of an estimator is denoted as

B

.

2. The Divergence Functional Weak Estimator

This paper focuses on estimating functionals of the form

G (f_{1}, f_{2}) = \int g (f_{1} (x), f_{2} (x)) f_{2} (x) d x,,

(1)

where

g (x, y)

is a smooth functional, and

f_{1}

and

f_{2}

are smooth d-dimensional probability densities. If

g (f_{1} (x), f_{2} (x)) = g (\frac{f_{1} (x)}{f_{2} (x)}),

g is convex, and

g (1) = 0

, then

G (f_{1}, f_{2})

defines the family of f-divergences. Some common divergences that belong to this family include the KL divergence (

g (t) = - ln t

) and the total variation distance (

g (t) = | t - 1 |

). In this work, we consider a broader class of functionals than the f-divergences, since g is allowed to be very general.

To estimate

G (f_{1}, f_{2})

, we first define a weak plug-in estimator based on kernel density estimators (KDEs), that is, a simple estimator that converges slowly to the true value

G (f_{1}, f_{2})

in terms of MSE. We then derive the bias and variance expressions for this weak estimator as a function of sample size and bandwidth. We then use the resulting bias and variance expressions to derive an ensemble estimator that takes a weighted average of weak estimators with different bandwidths and achieves superior MSE performance.

2.1. The Kernel Density Plug-in Estimator

We use a kernel density plug-in estimator of the divergence functional in (1) as the weak estimator. Assume that

N_{1}

i.i.d. realizations

\{Y_{1}, \dots, Y_{N_{1}}\}

are available from

f_{1}

and

N_{2}

i.i.d. realizations

\{X_{1}, \dots, X_{N_{2}}\}

are available from

f_{2}

. Let

h_{i} > 0

be the kernel bandwidth for the density estimator of

f_{i}

. For simplicity of presentation, assume that

N_{1} = N_{2} = N

and

h_{1} = h_{2} = h

. The results for the more general case of differing sample sizes and bandwidths are given in Appendix C. Let

K (\cdot)

be a kernel function with

\int K (x) d x = 1

and

{| | K | |}_{\infty} < \infty

where

{∥ K ∥}_{\infty}

is the

ℓ_{\infty}

norm of the kernel (K). The KDEs for

f_{1}

and

f_{2}

are, respectively,

\begin{matrix} {\tilde{f}}_{1, h} (X_{j}) & = & \frac{1}{N h^{d}} \sum_{i = 1}^{N} K (\frac{X_{j} - Y_{i}}{h}), \\ {\tilde{f}}_{2, h} (X_{j}) & = & \frac{1}{M h^{d}} \sum_{\begin{matrix} i = 1 \\ i \neq j \end{matrix}}^{N} K (\frac{X_{j} - X_{i}}{h}), \end{matrix}

where

M = N - 1

.

G (f_{1}, f_{2})

is then approximated as

{\tilde{G}}_{h} = \frac{1}{N} \sum_{i = 1}^{N} g ({\tilde{f}}_{1, h} (X_{i}), {\tilde{f}}_{2, h} (X_{i})) .

(2)

2.2. Convergence Rates

For many estimators, MSE convergence rates are typically provided in the form of upper (or sometimes lower) bounds on the bias and the variance. Therefore, only the slowest converging terms (as a function of the sample size (N)) are presented in these cases. However, to apply our generalized ensemble theory to obtain estimators that guarantee the parametric MSE rate, we required explicit expressions for the bias of the weak estimators in terms of the sample size (N) and the kernel bandwidth (h). Thus, an upper bound was insufficient for our work. Furthermore, to guarantee the parametric rate, we required explicit expressions of all bias terms that converge to zero slower than

O (1 / \sqrt{N})

.

To obtain bias expressions, we required multiple assumptions on the densities

f_{1}

and

f_{2}

, the functional g, and the kernel K. Similar to reference [7,54,65], the principal assumptions we make were that (1)

f_{1}

,

f_{2},

and g are smooth; (2)

f_{1}

and

f_{2}

have common bounded support sets

S

; and (3)

f_{1}

and

f_{2}

are strictly lower bounded on

S

. We also assume (4) that the density support set is smooth with respect to the kernel (

K (u)

). The full technical assumptions and a discussion of them are contained in Appendix A. Given these assumptions, we have the following result on the bias of

{\tilde{G}}_{h}

:

Theorem 1.

For a general g, the bias of the plug-in estimator

{\tilde{G}}_{h}

is given by

B [{\tilde{G}}_{h}] = \sum_{j = 1}^{⌊s⌋} c_{10, j} h^{j} + c_{11} \frac{1}{N h^{d}} + O (h^{s} + \frac{1}{N h^{d}}) .

(3)

To apply our generalized ensemble theory to the KDE plug-in estimator (

{\tilde{G}}_{h}

), we required only an upper bound on its variance. The following variance result required much less strict assumptions than the bias results in Theorem 1:

Theorem 2.

Assume that the functional g in (1) is Lipschitz continuous in both of its arguments with the Lipschitz constant (

C_{g}

). Then, the variance of the plug-in estimator (

{\tilde{G}}_{h}

) is bounded by

V [{\tilde{G}}_{h}] \leq C_{g}^{2} {| | K | |}_{\infty}^{2} \frac{11}{N} .

From Theorems 1 and 2, we observe that

h \to 0

and

N h^{d} \to \infty

are required for

{\tilde{G}}_{h}

to be unbiased, while the variance of the plug-in estimator depends primarily on the sample size (N). Note that the constants depend on the densities

f_{1}

and

f_{2}

and their derivatives which are often unknown.

2.3. Optimal MSE Rate

From Theorem 1, the dominating terms in the bias are observed to be

Θ (h)

and

Θ (\frac{1}{N h^{d}})

. If no bias correction is performed, the optimal choice of h that minimizes MSE is

h^{*} = Θ (N^{\frac{- 1}{d + 1}}) .

This results in a dominant bias term of order

Θ (N^{\frac{- 1}{d + 1}})

. Note that this differs from the standard result for the optimal KDE bandwidth for minimum MSE density estimation which is

Θ (N^{- 1 / (d + 4)})

for a symmetric uniform kernel when the boundary bias is ignored [68].

Figure 1 shows a heatmap showing the leading bias term

O (h)

as a function of d and N when

h = N^{\frac{- 1}{d + 1}}

. The heatmap indicates that the bias of the plug-in estimator in (2) is small only for relatively small values of d. This is consistent with the empirical results in reference [69] which examined the MSE of multiple plug-in KDE and k-nn estimators. In the next section, we propose an ensemble estimator that achieves a superior convergence rate regardless of the dimensions (d) as long as the density is sufficiently smooth.

Figure 1. Heat map showing the predicted bias of the divergence functional plug-in estimator

{\tilde{G}}_{h}

based on Theorem 1 as a function of the dimensions (d) and sample size (N) when

h = N^{\frac{- 1}{d + 1}}

. Note that the phase transition in the bias as the dimensions (d) increase for a fixed sample size (N); the bias remains small only for relatively small values of

d

. The proposed weighted ensemble estimator EnDive eliminates this phase transition when the densities and the function g are sufficiently smooth.

2.4. Proof Sketches of Theorems 1 and 2

To prove the bias expressions in Theorem 1, the bias is first decomposed into two parts by adding and subtracting

g (E_{Z} {\tilde{f}}_{1, h} (Z), E_{Z} {\tilde{f}}_{2, h} (Z))

within the expectation creating a “bias” term and a “variance” term. Applying a Taylor series expansion on the bias and variance terms results in expressions that depend on powers of

B_{Z} [{\tilde{f}}_{i, h} (Z)] : = E_{Z} {\tilde{f}}_{i, h} (Z) - f_{i} (Z)

and

{\tilde{e}}_{i, h} (Z) : = {\tilde{f}}_{i, h} (Z) - E_{Z} {\tilde{f}}_{i, h} (Z)

, respectively. Within the interior of the support, moment bounds can be derived from properties of the KDEs and a Taylor series expansion of the densities. Near the boundary of the support, the smoothness assumption on the boundary

A . 5

is required to obtain an expression of the bias in terms of the KDE bandwidth (h) and the sample size (N). The full proof of Theorem 1 is given in Appendix E.

The proof of the variance result takes a different approach. The proof uses the Efron–Stein inequality [70] which bounds the variance by analyzing the expected squared difference between the plug-in estimator when one sample is allowed to differ. This approach provides a bound on the variance under much less strict assumptions on the densities and the functional g than is required for Theorem 1. The full proof of Theorem 2 is given in Appendix F.

3. Weighted Ensemble Estimation

From Theorem 1 and Figure 1, we can observe that the bias of the MSE-optimal plug-in estimator

{\tilde{G}}_{h}

decreases very slowly as a function of the sample size (N) when the data dimensions (d) are not small, resulting in a large MSE. However, by applying the theory of optimally weighted ensemble estimation, we can obtain an estimator with improved performance by taking a weighted sum of an ensemble of weak estimators where the weights are chosen to significantly reduce the bias.

The ensemble of weak estimators is formed by choosing different values of the bandwidth parameter h as follows. Set

L = \{l_{1}, \dots, l_{L}\}

to be real positive numbers that index

h (l_{i})

. Thus, the parameter l indexes over different neighborhood sizes for the KDEs. Define the weight

w : = \{w (l_{1}), \dots, w (l_{L})\}

and

{\tilde{G}}_{w} : = \sum_{l \in L} w (l) {\tilde{G}}_{h (l)} .

That is, for each estimator

{\tilde{G}}_{h (l)}

there is a corresponding weight value (

w (l)

). The key to reducing the MSE is to choose the weight vector (w) to reduce the lower order terms in the bias while minimizing the impact of the weighted average on the variance.

3.1. Finding the Optimal Weight

The theory of optimally weighted ensemble estimation is a general theory that is applicable to any estimation problem as long as the bias and variance of the estimator can be expressed in a specific way. An early version of this theory was presented in reference [65]. We now generalize this theory so that it can be applied to a wider variety of estimation problems. Let N be the number of available samples and let

L = \{l_{1}, \dots, l_{L}\}

be a set of index values. Given an indexed ensemble of estimators

{\{{\hat{E}}_{l}\}}_{l \in L}

of some parameter (E), the weighted ensemble estimator with weights

w = \{w (l_{1}), \dots, w (l_{L})\}

satisfying

\sum_{l \in L} w (l) = 1

is defined as

{\hat{E}}_{w} = \sum_{l \in L} w (l) {\hat{E}}_{l} .

{\hat{E}}_{w}

is asymptotically unbiased as long as the estimators

{\{{\hat{E}}_{l}\}}_{l \in L}

are asymptotically unbiased. Consider the following conditions on

{\{{\hat{E}}_{l}\}}_{l \in L}

:

$C . 1$ The bias is expressible as

$B [{\hat{E}}_{l}] = \sum_{i \in J} c_{i} ψ_{i} (l) ϕ_{i, d} (N) + O (\frac{1}{\sqrt{N}}),$

where $c_{i}$ are constants that depend on the underlying density and are independent of N and l, $J = \{i_{1}, \dots, i_{I}\}$ is a finite index set with $I < L$ , and $ψ_{i} (l)$ are basis functions depending only on parameter l and not on the sample size (N).
$C . 2$ The variance is expressible as

$V [{\hat{E}}_{l}] = c_{v} (\frac{1}{N}) + o (\frac{1}{N}) .$

Theorem 3.

Assume conditions

C . 1

and

C . 2

hold for an ensemble of estimators

{\{{\hat{E}}_{l}\}}_{l \in L}

. Then, there exists a weight vector (

w_{0}

) such that the MSE of the weighted ensemble estimator attains the parametric rate of convergence:

E [{({\hat{E}}_{w_{0}} - E)}^{2}] = O (\frac{1}{N}) .

The weight vector (

w_{0}

) is the solution to the following convex optimization problem:

\begin{matrix} min_{w} & {| | w | |}_{2} \\ s u b j e c t t o & \sum_{l \in L} w (l) = 1, \\ γ_{w} (i) = \sum_{l \in L} w (l) ψ_{i} (l) = 0, i \in J . \end{matrix}

(4)

Proof.

From condition

C . 1

, we can write the bias of the weighted estimator as

B [{\hat{E}}_{w}] = \sum_{i \in J} c_{i} γ_{w} (i) ϕ_{i, d} (N) + O (\frac{\sqrt{L} {| | w | |}_{2}}{\sqrt{N}}) .

The variance of the weighted estimator is bounded as

V [{\hat{E}}_{w}] \leq \frac{{L | | w | |}_{2}^{2}}{N} .

(5)

The optimization problem in (4) zeroes out the lower-order bias terms and limits the

ℓ_{2}

norm of the weight vector (w) to prevent the variance from exploding. This results in an MSE rate of

O (1 / N)

when the dimensions (d) are fixed and when L is fixed independently of the sample size (N). Furthermore, a solution to (4) is guaranteed to exist if

L > I

and the vectors

a_{i} = [ψ_{i} (l_{1}), \dots, ψ_{i} (l_{L})]

are linearly independent. This completes our sketch of the proof of Theorem 3. ☐

3.2. The EnDive Estimator

The parametric rate of

O (1 / N)

in MSE convergence can be achieved without requiring

γ_{w} (i) = 0, i \in J

. This can be accomplished by solving the following convex optimization problem in place of the optimization problem in Theorem 3:

\begin{matrix} min_{w} & ϵ \\ s u b j e c t t o & \sum_{l \in L} w (l) = 1, \\ |γ_{w} (i) N^{\frac{1}{2}} ϕ_{i, d} (N)| \leq ϵ, i \in J, \\ {∥w∥}_{2}^{2} \leq η ϵ, \end{matrix}

(6)

where the parameter

η

is chosen to achieve a trade-off between bias and variance. Instead of forcing

γ_{w} (i) = 0

, the relaxed optimization problem uses the weights to decrease the bias terms at a rate of

O (1 / \sqrt{N})

, yielding an MSE convergence rate of

O (1 / N) .

In fact, it was shown in reference [71] that the optimization problem in (6) guarantees the parametric MSE rate as long as the conditions of Theorem 3 are satisfied and a solution to the optimization problem in (4) exists (the conditions for this existence are given in the proof of Theorem 3).

We now construct a divergence ensemble estimator from an ensemble of plug-in KDE divergence estimators. Consider first the bias result in (3) where g is general, and assume that

s \geq d

. In this case, the bias contains a

O (\frac{1}{h^{d} N})

term. To guarantee the parametric MSE rate, any remaining lower-order bias terms in the ensemble estimator must be no slower than

O (1 / \sqrt{N})

. Let

h (l) = l N^{- 1 / (2 d)}

where

l \in L

. Then

O (\frac{1}{h {(l)}^{d} N}) = O (\frac{1}{l^{d} \sqrt{N}})

. We therefore obtain an ensemble of plug-in estimators

{\{{\tilde{G}}_{h (l)}\}}_{l \in L}

and a weighted ensemble estimator

{\tilde{G}}_{w} = \sum_{l \in L} w (l) {\tilde{G}}_{h (l)}

. The bias of each estimator in the ensemble satisfies the condition

C . 1

with

ψ_{i} (l) = l^{i}

and

ϕ_{i, d} (N) = N^{- i / (2 d)}

for

i = 1, \dots, d

. To obtain a uniform bound on the bias with respect to w and

L

, we also include the function

ψ_{d + 1} (l) = l^{- d}

with corresponding

ϕ_{d + 1, d} (N) = N^{- 1 / 2}

. The variance also satisfies the condition

C . 2

. The optimal weight (

w_{0}

) is found by using (6) to obtain an optimally weighted plug-in divergence functional estimator

{\tilde{G}}_{w_{0}}

with an MSE convergence rate of

O (\frac{1}{N})

as long as

s \geq d

and

L \geq d

. Otherwise, if

s < d

, we can only guarantee the MSE rate up to

O (\frac{1}{N^{s / d}})

. We refer to this estimator as the Ensemble Divergence (EnDive) estimator and denote it as

{\tilde{G}}_{EnDive}

.

We note that for some functionals (g) (including the KL divergence and the Renyi-

α

divergence integral), we can modify the EnDive estimator to obtain the parametric rate under the less strict assumption that

s > d / 2

. For details on this approach, see Appendix B.

3.3. Central Limit Theorem

The following theorem shows that an appropriately normalized ensemble estimator

{\tilde{G}}_{w}

converges in distribution to a normal random variable under rather general conditions. Thus, the same result applies to the EnDive estimator

{\tilde{G}}_{EnDive}

. This enables us to perform hypothesis testing on the divergence functional which is very useful in many scientific applications. The proof is based on the Efron–Stein inequality and an application of Slutsky’s Theorem (Appendix G).

Theorem 4.

Assume that the functional g is Lipschitz in both arguments with the Lipschitz constant

C_{g}

. Further assume that

h (l) = o (1)

,

N \to \infty

, and

N h {(l)}^{d} \to \infty

for each

l \in L

. Then, for a fixed

L

, the asymptotic distribution of the weighted ensemble estimator

{\tilde{G}}_{w}

is

P r (({\tilde{G}}_{w} - E [{\tilde{G}}_{w}]) / \sqrt{V [{\tilde{G}}_{w}]} \leq t) \to P r (S \leq t),

where

S

is a standard normal random variable.

3.4. Uniform Convergence Rates

Here, we show that the optimally weighted ensemble estimators achieve the parametric MSE convergence rate uniformly. Denote the subset of

Σ (s, K_{H})

with densities bounded between

ϵ_{0}

and

ϵ_{\infty}

as

Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})

.

Theorem 5.

Let

{\tilde{G}}_{EnDive}

be the EnDive estimator of the functional

G (p, q) = \int g (p (x), q (x)) q (x) d x,

where p and q are d-dimensional probability densities. Additionally, let

r = d

and assume that

s > r

. Then,

sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} E [{({\tilde{G}}_{w_{0}} - G (p, q))}^{2}] \leq \frac{C}{N},

(7)

where C is a constant.

The proof decomposes the MSE into the variance plus the square of the bias. The variance is bounded easily by using Theorem 2. To bound the bias, we show that the constants in the bias terms are continuous with respect to the densities p and q under an appropriate norm. We then show that

Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})

is compact with respect to this norm and then apply an extreme value theorem. Details are given in Appendix H.

4. Experimental Results

In this section, we discuss the choice of tuning parameters and validate the EnDive estimator’s convergence rates and the central limit theorem. We then use the EnDive estimator to estimate bounds on the Bayes error for a single-cell bone marrow data classification problem.

4.1. Tuning Parameter Selection

The optimization problem in (6) has parameters

η

, L, and

L

. By applying (6), and the resulting MSE of the ensemble estimator is

O (ϵ^{2} / N) + O (L η^{2} ϵ^{2} / N),

(8)

where each term in the sum comes from the bias and variance, respectively. From this expression and (6), we see that the parameter

η

provides a tradeoff between bias and variance. Increasing

η

enables the norm of the weight vector to be larger. This means the feasible region for the variable w increases in size as

η

increases which can result in decreased bias. However, as

η

contributes to the variance term, increasing

η

may result in increased variance.

If all of the constants in (3) and an exact expression for the variance of the ensemble estimator were known, then

η

could be chosen to optimize this tradeoff in bias and variance and thus minimize the MSE. Since these constants are unknown, we can only choose

η

based on the asymptotic results. From (8), this would suggest setting

η = 1 / \sqrt{L}

. In practice, we find that for finite sample sizes, the variance in the ensemble estimator is less than the upper bound of

L η^{2} ϵ^{2} / N

. Thus, setting

η = 1 / \sqrt{L}

is unnecessarily restrictive. We find that, in practice, setting

η = 1

works well.

Upon first glance, it appears that for fixed L, the set

L

that parameterizes the kernel widths can, in theory, be chosen by minimizing

ϵ

in (6) over

L

in addition to w. However, adding this constraint results in a non-convex optimization problem since w does not lie in the non-negative orthant. A parameter search over possible values for

L

is another possibility. However, this may not be practical as

ϵ

generally decreases as the size and spread of

L

increases. In addition, for finite sample sizes, decreasing

ϵ

does not always directly correspond to a decrease in MSE, as very high or very low values of

h (l)

can lead to inaccurate density estimates, resulting in a larger MSE.

Given these limitations, we provide the following recommendations for

L

. Denote the value of the minimum value of l such that

{\tilde{f}}_{i, h (l_{m i n})} (X_{j}) > 0

\forall i = 1, 2

as

l_{m i n}

and the diameter of the support

S

as D. To ensure the KDEs are bounded away from zero, we require that

min (L) \geq l_{m i n}

. As demonstrated in Figure 2, the weights in

w_{0}

are generally largest for the smallest values of

L

. This indicates that

min (L)

should also be sufficiently larger than

l_{m i n}

to render an adequate density estimate. Similarly,

max (L)

should be sufficiently smaller than the diamter (D) as high bandwidth values can lead to high bias in the KDEs. Once these values are chosen, all other

L

values can then be chosen to be equally spaced between

min (L)

and

max (L)

.

Figure 2. The optimal weights from (6) when

d = 4

,

N = 3100

,

L = 50

, and l are uniformly spaced between 1.5 and 3. The lowest values of l are given the highest weight. Thus, the minimum value of bandwidth parameters

L

should be sufficiently large to render an adequate estimate.

An efficient way to choose

l_{m i n}

and

l_{m a x}

is to select the integers

k_{m i n}

and

k_{m a x}

and compute the

k_{m i n}

and

k_{m a x}

nearest neighbor distances of all the data points. The bandwidths

h (l_{m i n})

and

h (l_{m a x})

can then be chosen to be the maximums of these corresponding distances. The parameters

l_{m i n}

and

l_{m a x}

can then be computed from the expression

h (l) = l N^{- \frac{1}{2 d}}

. This choice ensures that a minimum of

k_{m i n}

points are within the kernel bandwidth for the density estimates at all points and that a maximum of

k_{m a x}

points are within the kernel bandwidth for the density estimates at one of the points.

Once

min (L)

and

max (L)

have been chosen, the similarity of bandwidth values

h (l)

and basis functions

ψ_{i, d} (l)

increases as L increases, resulting in a negligible decrease in the bias. Hence, L should be chosen to be large enough for sufficient bias but small enough so that the bandwidth values

h (l)

are sufficiently distinct. In our experiments, we found

30 \leq L \leq 60

to be sufficient.

4.2. Convergence Rates Validation: Rényi- $α$ Divergence

To validate our theoretical convergence rate results, we estimated the Rényi-

α

divergence integral between two truncated multivariate Gaussian distributions with varying dimension and sample sizes. The densities had means of

{\bar{μ}}_{1} = 0.7 \times {\bar{1}}_{d}

,

{\bar{μ}}_{2} = 0.3 \times {\bar{1}}_{d}

and covariance matrices of

0.4 \times I_{d}

, where

{\bar{1}}_{d}

is a d-dimensional vector of ones, and

I_{d}

is a

d \times d

identity matrix. We restricted the Gaussians to the unit cube and used

α = 0.5

.

The left plots in Figure 3 show the MSE (200 trials) of the standard plug-in estimator implemented with a uniform kernel and the proposed optimally weighted estimator EnDive for various dimensions and sample sizes. The parameter set

L

was selected based on a range of k-nearest neighbor distances. The bandwidth used for the standard plug-in estimator was selected by setting

h_{f i x e d} (l^{*}) = l^{*} N^{- \frac{1}{d + 1}}

, where

l^{*}

was chosen from

L

to minimize the MSE of the plug-in estimator. For all dimensions and sample sizes, EnDive outperformed the plug-in estimator in terms of MSE. EnDive was also less biased than the plug-in estimator and even had lower variance at smaller sample sizes (e.g.,

N = 100

). This reflects the strength of ensemble estimators—the weighted sum of a set of relatively poor estimators can result in a very good estimator. Note also that for the larger values of N, the ensemble estimator MSE rates approached the theoretical rate based on the estimated log–log slope given in Table 1.

Figure 3. (Left) Log–log plot of MSE of the uniform kernel plug-in (“Kernel”) and the optimally weighted EnDive estimator for various dimensions and sample sizes. (Right) Plot of the true values being estimated compared to the average values of the same estimators with standard error bars. The proposed weighted ensemble estimator approaches the theoretical rate (see Table 1), performed better than the plug-in estimator in terms of MSE and was less biased.

Table 1. Negative log–log slope of the EnDive mean squared error (MSE) as a function of the sample size for various dimensions. The slope was calculated beginning at

N_{s t a r t}

. The negative slope was closer to 1 with

N_{s t a r t} = 10^{2.375}

than for

N_{s t a r t} = 10^{2}

indicating that the asymptotic rate had not yet taken effect at

N_{s t a r t} = 10^{2}

.

To illustrate the difference between the problems of density estimation and divergence functional estimation, we estimated the average pointwise squared error between the KDE

{\tilde{f}}_{1, h}

and

f_{1}

in the previous experiment. We used exactly the same bandwidth and kernel as the standard plug-in estimators in Figure 3 and calculated the pointwise error at 10,000 points sampled from

f_{1}

. The results are shown in Figure 4. From these results, we see that the KDEs performed worse as the dimension of the densities increased. Additionally, we observe by comparing Figure 3 and Figure 4, the average pointwise squared error decreased at a much slower rate as a function of the sample size (N) than the MSE of the plug-in divergence estimators, especially for larger dimensions (d).

Figure 4. Log–log plot of the average pointwise squared error between the KDE

{\tilde{f}}_{1, h}

and

f_{1}

for various dimensions and sample sizes using the same bandwidth and kernel as the standard plug-in estimators in Figure 3. The KDE and the density were compared at 10,000 points sampled from

f_{1}

.

Our experiments indicated that the proposed ensemble estimator is not sensitive to the tuning parameters. See reference [72] for more details.

4.3. Central Limit Theorem Validation: KL Divergence

To verify the central limit theorem of the EnDive estimator, we estimated the KL divergence between two truncated Gaussian densities, again restricted to the unit cube. We conducted two experiments where (1) the densities were different with means of

{\bar{μ}}_{1} = 0.7 \times {\bar{1}}_{d}

,

{\bar{μ}}_{2} = 0.3 \times {\bar{1}}_{d}

and covariances of matrices

σ_{i} \times I_{d}

,

σ_{1} = 0.1

,

σ_{2} = 0.3

; and where (2) the densities were the same with means of

0.3 \times {\bar{1}}_{d}

and covariance matrices of

0.3 \times I_{d}

. For both experiments, we chose

d = 6

and four different sample sizes (N). We found that the correspondence between the quantiles of the standard normal distribution and the quantiles of the centered and scaled EnDive estimator was very high under all settings (see Table 2 and Figure 5) which validates Theorem 4.

Table 2. Comparison between quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the KL divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. The parameter

ρ

gives the correlation coefficient between the quantiles, while

β

is the estimated slope between the quantiles. The correspondence between quantiles was very high for all cases.

Figure 5. QQ-plots comparing the quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the Kullback–Leibler (KL) divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. These plots correspond to the same experiments as in Table 2 when N = 100 and N = 1000. The correspondence between quantiles is high for all cases.

4.4. Bayes Error Rate Estimation on Single-Cell Data

Using the EnDive estimator, we estimated bounds on the Bayes error rate (BER) of a classification problem involving MARS-seq single-cell RNA-sequencing (scRNA-seq) data measured from developing mouse bone marrow cells enriched for the myeloid and erythroid lineages [73]. However, we first demonstrated the ability of EnDive to estimate the bounds on the BER of a simulated problem. In this simulation, the data were drawn from two classes where each class distribution was a

d = 10

dimensional Gaussian distribution with different means and the identity covariance matrix. We considered two cases, namely, the distance between the means was 1 or 3. The BER was calculated in both cases. We then estimated upper and lower bounds on the BER by estimating the Henze–Penrose (HP) divergence [4,6]. Figure 6 shows the average estimated upper and lower bounds on the BER with standard error bars for both cases. For all tested sample sizes, the BER was within one standard deviation of the estimated lower bound. The lower bound was also closer, on average, to the BER for most of the tested sample sizes (lower sample sizes with smaller distances between means were the exceptions). Generally, these resuls indicate that the true BER is relatively close to the estimated lower bound, on average.

Figure 6. Estimated upper (UB) and lower bounds (LB) on the Bayes error rate (BER) based on estimating the HP divergence between two 10-dimensional Gaussian distributions with identity covariance matrices and distances between means of 1 (left) and 3 (right), respectively. Estimates were calculated using EnDive, with error bars indicating the standard deviation from 400 trials. The upper bound was closer, on average, to the true BER when N was small (≈100–300) and the distance between the means was small. The lower bound was closer, on average, in all other cases.

We then estimated similar bounds on the scRNA-seq classification problem using EnDive. We considered the three most common cell types within the data: erythrocytes (eryth.), monocytes (mono.), and basophils (baso.) (

N = 1095, 559, 300,

respectively). We estimated the upper and lower bounds on the pairwise BER between these classes using different combinations of genes selected from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the hematopoietic cell lineage [74,75,76]. Each collection of genes contained 11–14 genes. The upper and lower bounds on the BER were estimated using the Henze–Penrose divergence [4,6]. The standard deviations of the bounds for the KEGG-based genes were estimated via 1000 bootstrap iterations. The KEGG-based bounds were compared to BER bounds obtained from 1000 random selections of 12 genes. In all cases, we compared the bounds to the performance of a quadratic discriminant analysis classifier (QDA) with 10-fold cross validation. Note that to correct for undersampling in scRNA-seq data, we first imputed the undersampled data using MAGIC [77].

All results are given in Table 3. From these results, we note that erythrocytes are relatively easy to distinguish from the other two cell types as the BER lower bounds were within nearly two standard deviations of zero when using genes associated with platelet, erythrocyte, and neutrophil development as well as a random selection of 12 genes. This is corroborated by the QDA cross-validated results which were all within two standard deviations of either the upper or lower bound for these gene sets. In contrast, the macrophage-associated genes seem to be less useful for distinguishing erythrocytes than the other gene sets.

Table 3. Misclassification rate of a quadratic discriminant analysis classifier (QDA) classifier and estimated upper bounds (UB) and lower bounds (LB) of the pairwise BER between mouse bone marrow cell types using the Henze–Penrose divergence applied to different combinations of genes selected from the KEGG pathways associated with the hematopoietic cell lineage. Results are presented as percentages in the form of mean ± standard deviation. Based on these results, erythrocytes are relatively easy to distinguish from the other two cell types using these gene sets.

We also found that basophils are difficult to distinguish from monocytes using these gene sets. Assuming the relative abundance of each cell type is representative of the population, a trivial upper bound on the BER is

300 / (300 + 559) \approx 0.35

which is between all of the estimated lower and upper bounds. The QDA results were also relatively high (and may be overfitting the data in some cases based on the estimated BER bounds), suggesting that different genes should be explored for this classification problem.

5. Conclusions

We derived the MSE convergence rates for a kernel density plug-in estimator for a large class of divergence functionals. We generalized the theory of optimally weighted ensemble estimation and derived an ensemble divergence estimator EnDive that achieves the parametric rate when the densities are more than d times differentiable. The estimator we derived can be applied to general bounded density support sets and can be implemented without knowledge of the support, which is a distinct advantage over other competing estimators. We also derived the asymptotic distribution of the estimator, provided some guidelines for tuning parameter selection, and experimentally validated the theoretical convergence rates for the case of empirical estimation of the Rényi-

α

divergence integral. We then performed experiments to examine the estimator’s robustness to the choice of tuning parameters, validated the central limit theorem for KL divergence estimation, and estimated bounds on the Bayes error rate for a single cell classification problem.

We note that based on the proof techniques employed in our work, our weighted ensemble estimators are easily extended beyond divergence estimation to more general distributional functionals which may be integral functionals of any number of probability distributions. We also show in Appendix B that EnDive can be easily modified to obtain an estimator that achieves the parametric rate when the densities are more than

d / 2

times differentiable and the functional g has a specific form that includes the Rényi and KL divergences. Future work includes extending this modification to functionals with more general forms. An important divergence of interest in this context is the Henze–Penrose divergence that we used to bound the Bayes error. Further future work will focus on extending this work on divergence estimation to k-nn based estimators where knowledge of the support is, again, not required. This will improve the computational burden, as k-nn estimators require fewer computations than standard KDEs.

Author Contributions

K.M. wrote this article primarily as part of his PhD dissertation under the supervision of A.H. and in collaboration with K.S. A.H, K.M., and K.S. edited the paper. K.S. provided the primary contribution to the proof of Theorem A1 and assisted with all other proofs. K.M. provided the primary contributions for the proofs of all other theorems and performed all other experiments. K.G. contributed to the bias proof.

Funding

This research was funded by Army Research Office (ARO) Multidisciplinary University Research Initiative (MURI) grant number W911NF-15-1-0479, National Science Foundation (NSF) grant number CCF-1217880, and a National Science Foundation (NSF) Graduate Research Fellowship to the first author under grant number F031543.

Conflicts of Interest

The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

KL	Kullback–Leibler
MSE	Mean squared error
SVM	Support vector machine
KDE	kernel density estimator
EnDive	Ensemble Divergence
BER	Bayes error rate
scRNA-seq	Single-cell RNA-sequencing
HP	Henze–Penrose

Appendix A. Bias Assumptions

Our full assumptions to prove the bias expressions for the estimator

{\tilde{G}}_{h}

were as follows:

$(A . 0)$ : Assume that the kernel K is symmetric, is a product kernel, and has bounded support in each dimension.
$(A . 1)$ : Assume there exist constants $ϵ_{0}, ϵ_{\infty}$ , such that $0 < ϵ_{0} \leq f_{i} (x) \leq ϵ_{\infty} < \infty, \forall x \in S .$
$(A . 2)$ : Assume that the densities are $f_{i} \in Σ (s, K_{H})$ in the interior of $S$ with $s \geq 2$ .
$(A . 3)$ : Assume that g has an infinite number of mixed derivatives.
$(A . 4$ ): Assume that $|\frac{\partial^{k + l} g (x, y)}{\partial x^{k} \partial y^{l}}|$ , $k, l = 0, 1, \dots$ are strictly upper bounded for $ϵ_{0} \leq x, y \leq ϵ_{\infty}$ .
$(A . 5)$ : Assume the following boundary smoothness condition: Let $p_{x} (u) : R^{d} \to R$ be a polynomial in u of order $q \leq r = ⌊s⌋$ whose coefficients are a function of x and are $r - q$ times differentiable. Then, assume that

$\int_{x \in S} {(\int_{u : K (u) > 0, x + u h \notin S} K (u) p_{x} (u) d u)}^{t} d x = v_{t} (h),$

where $v_{t} (h)$ admits the expansion

$v_{t} (h) = \sum_{i = 1}^{r - q} e_{i, q, t} h^{i} + o (h^{r - q}),$

for some constants $e_{i, q, t}$ .

We focused on finite support kernels for simplicity in the proofs, although it is likely that our results extend to some infinitely supported kernels as well. We assumed relatively strong conditions on the smoothness of g in

A . 3

to enable us to obtain an estimator that achieves good convergence rates without knowledge of the boundary of the support set. While this smoothness condition may seem restrictive, in practice, nearly all divergence and entropy functionals of interest satisfy this condition. Functionals of interest that do not satisfy this assumption (e.g., the total variation distance) typically have at least one point that is not differentiable which violates the assumptions of all competing estimators [54,57,58,59,60,65]. We also note that to obtain simply an upper bound on the bias for the plug-in estimator, much less restrictive assumptions on the functional g are sufficient.

Assumption

A . 5

requires the boundary of the density support set to be smooth with respect to the kernel (

K (u)

) in the sense that the expectation of the area outside of

S

with respect to any random variable u with smooth distribution is a smooth function of the bandwidth (h). Note that we do not require knowledge of the support of the unknown densities to actually implement the estimator (

{\tilde{G}}_{h}

). As long as assumptions

A . 0

–

A . 5

are satisfied, then the bias results we obtain are valid, and therefore, we can obtain the parametric rate with the EnDive estimator. This is in contrast to many other estimators of information theoretic measures such as those presented in references [59,60,65]. In these cases, the boundary of the support set must be known precisely to perform boundary correction to obtain the parametric rate, since the boundary correction is an explicit step in these estimators. In contrast, we do not need to explicitly perform a boundary correction.

It is not necessary for the boundary of

S

to have smooth contours with no edges or corners as assumption

A . 5

is satisfied by the following case:

Theorem A1.

Assumption

A . 5

is satisfied when

S = {[- 1, 1]}^{d}

and when K is the uniform rectangular kernel; that is,

K (x) = 1

for all

{x : | | x | |}_{1} \leq 1 / 2

.

The proof is given in Appendix D. The methods used to prove this can be easily extended to show that

A . 5

is satisfied with the uniform rectangular kernel and other similar supports with flat surfaces and corners. Furthermore, we showed in reference [78] that

A . 5

is satisfied using the uniform spherical kernel with a density support set equal to the unit cube. Note that assumption

A . 0

is trivially satisfied by the uniform rectangular kernel as well. Again, this is easily extended to more complicated density support sets that have boundaries that contain flat surfaces and corners. Determining other combinations of kernels and density support sets that satisfy

A . 5

is left for future work.

Densities for which assumptions

A

.1–

A . 2

hold include the truncated Gaussian distribution and the beta distribution on the unit cube. Functions for which assumptions

A

.3–

A . 4

hold include

g (x, y) = - ln (\frac{x}{y})

and

g (x, y) = {(\frac{x}{y})}^{α} .

Appendix B. Modified EnDive

If the functional g has a specific form, we can modify the EnDive estimator to obtain an estimator that achieves the parametric rate when

s > d / 2

. Specifically, we have the following theorem:

Theorem A2.

Assume that assumptions

A

.0–

A . 5

hold. Furthermore, if

g (x, y)

has

k, l

-th order mixed derivatives

\frac{\partial^{k + l} g (x, y)}{\partial x^{k} \partial y^{l}}

that depend on

x, y

only through

x^{α} y^{β}

for some

α, β \in R

, then for any positive integer

λ \geq 2

, the bias of

{\tilde{G}}_{h}

is

\begin{matrix} B [{\tilde{G}}_{h}] & = & \sum_{j = 1}^{⌊s⌋} c_{10, j} h^{j} + \sum_{q = 1}^{λ / 2} \sum_{j = 0}^{⌊s⌋} c_{11, q, j} \frac{h^{j}}{{(N h^{d})}^{q}} \\ + O (h^{s} + \frac{1}{{(N h^{d})}^{\frac{λ}{2}}}) . \end{matrix}

(A1)

Divergence functionals that satisfy the mixed derivatives condition required for (A1) include the KL divergence and the Rényi-

α

divergence. Obtaining similar terms for other divergence functionals requires us to separate the dependence on h of the derivatives of g evaluated at

E_{Z} {\tilde{f}}_{i, h} (Z)

. This is left for future work. See Appendix E for details.

As compared to (3), there are many more terms in (A1). These terms enable us to modify the EnDive estimator to achieve the parametric MSE convergence rate when

s > d / 2

for an appropriate choice of bandwidths, whereas the terms in (3) requires

s \geq d

to achieve the same rate. This is accomplished by letting

h (l)

decrease at a faster rate, as follows.

Let

δ > 0

and

h (l) = l N^{\frac{- 1}{d + δ}}

where

l \in L

. The bias of each estimator in the resulting ensemble has terms proportional to

l^{j - d q} N^{- \frac{j + q}{d + δ}}

, where

j, q \geq 0

and

j + q > 0

. Then, the bias of

{\tilde{G}}_{h (l)}

satisfies condition

C . 1

if

ϕ_{j, q, d} (N) = N^{- \frac{j + q}{d + δ}}

,

ψ_{j, q} (l) = l^{j - d q}

, and

J = \{{j, q} : 0 < j + q < (d + 1) / 2, q \in {0, 1, 2, \dots, λ / 2}, j \in {0, 1, 2, \dots, ⌊s⌋}\},

(A2)

as long as

L > | J | = I

. The variance also satisfies condition

C . 2

. The optimal weight (

w_{0}

) is found by using (6) to obtain an optimally weighted plug-in divergence functional estimator

{\tilde{G}}_{w_{0}}

that achieves the parametric convergence rate if

λ \geq d / δ + 1

and if

s \geq (d + δ) / 2

. Otherwise, if

s < (d + δ) / 2

, we can only guarantee the MSE rate up to

O (\frac{1}{N^{2 s / (d + δ)}})

. We refer to this estimator as the modified EnDive estimator and denote it as

{\tilde{G}}_{Mod}

. The ensemble estimator

{\tilde{G}}_{Mod}

is summarized in Algorithm 1 when

δ = 1

.

Algorithm A1: The Modified EnDive Estimator

Input:: $η$ , L positive real numbers $L$ , samples $\{Y_{1}, \dots, Y_{N}\}$ from $f_{1}$ , samples $\{X_{1}, \dots, X_{N}\}$ from $f_{2}$ , dimension d, function g, kernel K
Output:: The modified EnDive estimator ${\tilde{G}}_{Mod}$
1:: Solve for $w_{0}$ using (6) with $ϕ_{j, q, d} (N) = N^{- \frac{j + q}{d + 1}}$ and basis functions $ψ_{j, q} (l) = l^{j - d q}$ , $l \in \bar{l}$ , and ${i, j} \in J$ defined in (A2)
2:: for all $l \in \bar{l}$ do
3:: $h (l) \leftarrow l N^{\frac{- 1}{d + 1}}$
4:: for $i = 1$ to N do
5:: ${\tilde{f}}_{1, h (l)} (X_{i}) \leftarrow \frac{1}{N h {(l)}^{d}} \sum_{j = 1}^{N} K (\frac{X_{i} - Y_{j}}{h (l)})$ , ${\tilde{f}}_{2, h (l)} (X_{i}) \leftarrow \frac{1}{(N - 1) h {(l)}^{d}} \sum_{\begin{matrix} j = 1 j \neq i \end{matrix}}^{N} K (\frac{X_{i} - X_{j}}{h (l)})$
6:: end for
7:: ${\tilde{G}}_{h (l)} \leftarrow \frac{1}{N} \sum_{i = 1}^{N} g ({\tilde{f}}_{1, h (l)} (X_{i}), {\tilde{f}}_{2, h (l)} (X_{i}))$
8:: end for
9:: ${\tilde{G}}_{Mod} \leftarrow \sum_{l \in L} w_{0} (l) {\tilde{G}}_{h (l)}$

The parametric rate can be achieved with

{\tilde{G}}_{Mod}

under less strict assumptions on the smoothness of the densities than those required for

{\tilde{G}}_{EnDive}

. Since

δ > 0

can be arbitrary, it is theoretically possible to achieve the parametric rate with the modified estimator as long as

s > d / 2

. This is consistent with the rate achieved by the more complex estimators proposed in reference [57]. We also note that the central limit theorem applies and that the convergence is uniform as Theorem 5 applies for

s > ⌊(d + δ) / 2⌋

and

s \geq (d + δ) / 2

.

These rate improvements come at a cost for the number of parameters (L) required to implement the weighted ensemble estimator. If

s \geq \frac{d + δ}{2}

, then the size of J for

{\tilde{G}}_{Mod}

is in the order of

\frac{d^{2}}{8 δ}

. This may lead to increased variance in the ensemble estimator as indicated by (5).

So far,

{\tilde{G}}_{Mod}

can only be applied to functionals (

g (x, y)

) with mixed derivatives of the form of

x^{α} y^{β}

. Future work is required to extend this estimator to other functionals of interest.

Appendix C. General Results

Here we present the generalized forms of Theorems 1 and 2 where the sample sizes and bandwidths of the two datasets are allowed to differ. In this case, the KDEs are

\begin{matrix} {\tilde{f}}_{1, h_{1}} (X_{j}) & = & \frac{1}{N_{1} h_{1}^{d}} \sum_{i = 1}^{N_{1}} K (\frac{X_{j} - Y_{i}}{h_{1}}), \\ {\tilde{f}}_{2, h_{2}} (X_{j}) & = & \frac{1}{M_{2} h_{2}^{d}} \sum_{\begin{matrix} i = 1 \\ i \neq j \end{matrix}}^{N_{2}} K (\frac{X_{j} - X_{i}}{h_{2}}), \end{matrix}

where

M_{2} = N_{2} - 1

.

G (f_{1}, f_{2})

is then approximated as

{\tilde{G}}_{h_{1}, h_{2}} = \frac{1}{N_{2}} \sum_{i = 1}^{N_{2}} g ({\tilde{f}}_{1, h_{1}} (X_{i}), {\tilde{f}}_{2, h_{2}} (X_{i})) .

(A3)

We also generalize the bias result to the case where the kernel (K) has the order

ν

which means that the j-th moment of the kernel

K_{i}

defined as

\int t^{j} K_{i} (t) d t

is zero for all

j = 1, \dots, ν - 1

and

i = 1, \dots, d

where

K_{i}

is the kernel in the i-th coordinate. Note that symmetric product kernels have the order

ν \geq 2

. The following theorem on the bias follows under assumptions

A

.0–

A . 5

:

Theorem A3.

For general g, the bias of the plug-in estimator (

{\tilde{G}}_{h_{1}, h_{2}}

) is of the form

\begin{matrix} B [{\tilde{G}}_{h_{1}, h_{2}}] & = & \sum_{j = 1}^{r} (c_{4, 1, j} h_{1}^{j} + c_{4, 2, j} h_{2}^{j}) + \sum_{j = 1}^{r} \sum_{i = 1}^{r} c_{5, i, j} h_{1}^{j} h_{2}^{i} + O (h_{1}^{s} + h_{2}^{s}) \\ + c_{9, 1} \frac{1}{N_{1} h_{1}^{d}} + c_{9, 2} \frac{1}{N_{2} h_{2}^{d}} + o (\frac{1}{N_{1} h_{1}^{d}} + \frac{1}{N_{2} h_{2}^{d}}) . \end{matrix}

(A4)

Furthermore, if

g (x, y)

has

k, l

-th order mixed derivatives

\frac{\partial^{k + l} g (x, y)}{\partial x^{k} \partial y^{l}}

that depend on

x, y

only through

x^{α} y^{β}

for some

α, β \in R

, then for any positive integer

λ \geq 2

, the bias is of the form

\begin{matrix} B [{\tilde{G}}_{h_{1}, h_{2}}] & = & \sum_{j = 1}^{r} (c_{4, 1, j} h_{1}^{j} + c_{4, 2, j} h_{2}^{j}) + \sum_{j = 1}^{r} \sum_{i = 1}^{r} c_{5, i, j} h_{1}^{j} h_{2}^{i} + O (h_{1}^{s} + h_{2}^{s}) \\ \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{r} (c_{9, 1, j, m} \frac{h_{1}^{m}}{{(N_{1} h_{1}^{d})}^{j}} + c_{9, 2, j, m} \frac{h_{2}^{m}}{{(N_{2} h_{2}^{d})}^{j}}) \\ + \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{r} \sum_{i = 1}^{λ / 2} \sum_{n = 0}^{r} c_{9, j, i, m, n} \frac{h_{1}^{m} h_{2}^{n}}{{(N_{1} h_{1}^{d})}^{j} {(N_{2} h_{2}^{d})}^{i}} \\ + O (\frac{1}{{(N_{1} h_{1}^{d})}^{\frac{λ}{2}}} + \frac{1}{{(N_{2} h_{2}^{d})}^{\frac{λ}{2}}}) . \end{matrix}

(A5)

Note that the bandwidth and sample size terms do not depend on the order of the kernel (

ν

). Thus, using a higher-order kernel does not provide any benefit to the convergence rates. This lack of improvement is due to the bias of the density estimators at the boundary of the density support sets. To obtain better convergence rates using higher-order kernels, boundary correction would be necessary [57,60]. In contrast, we improve the convergence rates by using a weighted ensemble that does not require boundary correction.

The variance result requires much less strict assumptions than the bias results:

Theorem A4.

Assume that the functionalg in (1) is Lipschitz continuous in both of its arguments with the Lipschitz constant

C_{g}

. Then, the variance of the plug-in estimator (

{\tilde{G}}_{h_{1}, h_{2}}

) is bounded by

V [{\tilde{G}}_{h_{1}, h_{2}}] \leq C_{g}^{2} {| | K | |}_{\infty}^{2} (\frac{10}{N_{2}} + \frac{N_{1}}{N_{2}^{2}}) .

The proofs of these theorems are in Appendix E and Appendix F. Theorems 1 and 2 then follow.

Appendix D. Proof of Theorem A1 (Boundary Conditions)

Consider a uniform rectangular kernel

K (x)

that satisfies

K (x) = 1

for all x, such that

{| | x | |}_{1} \leq 1 / 2

. Also, consider the family of probability densities (f) with rectangular support

S = {[- 1, 1]}^{d}

. We prove Theorem A1 which is that that

S

satisfies the following smoothness condition

(A . 5

): for any polynomial

p_{x} (u) : R^{d} \to R

of order

q \leq r = ⌊s⌋

with coefficients that are

r - q

times differentiable wrt x,

\int_{x \in S} {(\int_{{u : | | u | |}_{1} \leq \frac{1}{2}, x + u h \notin S} p_{x} (u) d u)}^{t} d x = v_{t} (h),

(A6)

where

v_{t} (h)

has the expansion

v_{t} (h) = \sum_{i = 1}^{r - q} e_{i, q, t} h^{i} + o (h^{r - q}) .

Note that the inner integral forces the xs under consideration to be boundary points via the constraint

x + u h \notin S

.

Appendix D.1. Single Coordinate Boundary Point

We begin by focusing on points x that are boundary points by virtue of a single coordinate

x_{i}

, such that

x_{i} + u_{i} h \notin S

. Without loss of generality, assume that

x_{i} + u_{i} h > 1

. The inner integral in (A6) can then be evaluated first with respect to (wrt) all coordinates other than i. Since all of these coordinates lie within the support, the inner integral over these coordinates will amount to integration of the polynomial

p_{x} (u)

over a symmetric

d - 1

dimensional rectangular region

|u_{j}| \leq \frac{1}{2}

for all

j \neq i

. This yields a function

\sum_{m = 1}^{q} {\tilde{p}}_{m} (x) u_{i}^{m}

where the coefficients

{\tilde{p}}_{m} (x)

are each

r - q

times differentiable wrt x.

With respect to the

u_{i}

coordinate, the inner integral will have limits from

\frac{1 - x_{i}}{h}

to

\frac{1}{2}

for some

1 > x_{i} > 1 - \frac{h}{2}

. Consider the

{\tilde{p}}_{q} (x) u_{i}^{q}

monomial term. The inner integral wrt this term yields

\sum_{m = 1}^{q} {\tilde{p}}_{m} (x) \int_{\frac{1 - x_{i}}{h}}^{\frac{1}{2}} u_{i}^{m} d u_{i} = \sum_{m = 1}^{q} {\tilde{p}}_{m} (x) \frac{1}{m + 1} (\frac{1}{2^{m + 1}} - {(\frac{1 - x_{i}}{h})}^{m + 1}) .

(A7)

Raising the right-hand-side of (A7) to the power of t results in an expression of the form

\sum_{j = 0}^{q t} {\overset{ˇ}{p}}_{j} (x) {(\frac{1 - x_{i}}{h})}^{j},

(A8)

where the coefficients

{\overset{ˇ}{p}}_{j} (x)

are

r - q

times differentiable wrt x. Integrating (A8) over all the coordinates in x other than

x_{i}

results in an expression of the form

\sum_{j = 0}^{q t} {\bar{p}}_{j} (x_{i}) {(\frac{1 - x_{i}}{h})}^{j},

(A9)

where, again, the coefficients

{\bar{p}}_{j} (x_{i})

are

r - q

times differentiable wrt

x_{i}

. Note that since the other coordinates of x other than

x_{i}

are far away from the boundary, the coefficients

{\bar{p}}_{j} (x_{i})

are independent of h. To evaluate the integral of (A9), consider the

r - q

term Taylor series expansion of

{\bar{p}}_{j} (x_{i})

around

x_{i} = 1

. This will yield terms of the form

\begin{matrix} \int_{1 - h / 2}^{1} \frac{{(1 - x_{i})}^{j + k}}{h^{k}} d x_{i} & = & {- \frac{{(1 - x_{i})}^{j + k + 1}}{h^{k} (j + k + 1)}|}_{x_{i} = 1 - h / 2}^{x_{i} = 1} \\ = & \frac{h^{j + 1}}{(j + k + 1) 2^{j + k + 1}}, \end{matrix}

for

0 \leq j \leq r - q

, and

0 \leq k \leq q t

. Combining terms results in the expansion

v_{t} (h) = \sum_{i = 1}^{r - q} e_{i, q, t} h^{i} + o (h^{r - q})

.

Appendix D.2. Multiple Coordinate Boundary Point

The case where multiple coordinates of point x are near the boundary is a straightforward extension of the single boundary point case, so we only sketch the main ideas here. As an example, consider the case where two of the coordinates are near the boundary. Assume for notational ease that they are

x_{1}

and

x_{2}

and that

x_{1} + u_{1} h > 1

and

x_{2} + u_{2} h > 1

. The inner integral in (A6) can again be evaluated first wrt all coordinates other than 1 and 2. This yields a function

\sum_{m, j = 1}^{q} {\tilde{p}}_{m, j} (x) u_{1}^{m} u_{2}^{j}

where the coefficients

{\tilde{p}}_{m, j} (x)

are each

r - q

times differentiable wrt x. Integrating this wrt

x_{1}

and

x_{2}

and then raising the result to the power of t yields a double sum similar to (A8). Integrating this over all the coordinates in x other than

x_{1}

and

x_{2}

gives a double sum similar to (A9). Then, a Taylor series expansion of the coefficients and integration over

x_{1}

and

x_{2}

yields the result.

Appendix E. Proof of Theorem A3 (Bias)

In this appendix, we prove the bias results in Theorem A3. The bias of the base kernel density plug-in estimator

{\tilde{G}}_{h_{1}, h_{2}}

can be expressed as

\begin{matrix} B [{\tilde{G}}_{h_{1}, h_{2}}] & = & E [g ({\tilde{f}}_{1, h_{1}} (Z), {\tilde{f}}_{2, h_{2}} (Z)) - g (f_{1} (Z), f_{2} (Z))] \\ = & E [g ({\tilde{f}}_{1, h_{1}} (Z), {\tilde{f}}_{2, h_{2}} (Z)) - g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z))] \\ + E [g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z)) - g (f_{1} (Z), f_{2} (Z))], \end{matrix}

(A10)

where

Z

is drawn from

f_{2}

. The first term is the “variance” term, while the second is the “bias” term. We bound these terms using Taylor series expansions under the assumption that g is infinitely differentiable. The Taylor series expansion of the variance term in (A11) will depend on variance-like terms of the KDEs, while the Taylor series expansion of the bias term in (A11) will depend on the bias of the KDEs.

The Taylor series expansion of

g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z))

around

f_{1} (Z)

and

f_{2} (Z)

is

\begin{matrix} g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z)) = \sum_{i = 0}^{\infty} \sum_{j = 0}^{\infty} ({\frac{\partial^{i + j} g (x, y)}{\partial x^{i} \partial y^{j}}|}_{\begin{matrix} x = f_{1} (Z) \\ y = f_{2} (Z) \end{matrix}}) \frac{B_{Z}^{i} [{\tilde{f}}_{1, h_{1}} (Z)] B_{Z}^{j} [{\tilde{f}}_{2, h_{2}} (Z)]}{i! j!}, \end{matrix}

(A11)

where

B_{Z}^{j} [{\tilde{f}}_{i, h_{i}} (Z)] = {(E_{Z} {\tilde{f}}_{i, h_{i}} (Z) - f_{i} (Z))}^{j}

is the bias of

{\tilde{f}}_{i, h_{i}}

at the point

Z

raised to the power of j. This expansion can be used to control the second term (the bias term) in (A11). To accomplish this, we require an expression for

E_{Z} {\tilde{f}}_{i, h_{i}} (Z) - f_{i} (Z) = B_{Z} [{\tilde{f}}_{i, h_{i}} (Z)]

.

To obtain an expression for

B_{Z} [{\tilde{f}}_{i, h_{i}} (Z)]

, we separately consider the cases when

Z

is in the interior of the support

S

or when

Z

is near the boundary of the support. A point

X \in S

is defined to be in the interior of

S

if for all

Y \notin S

,

K (\frac{X - Y}{h_{i}}) = 0

. A point

X \in S

is near the boundary of the support if it is not in the interior. Denote the region in the interior and near the boundary wrt

h_{i}

as

S_{I_{i}}

and

S_{B_{i}}

, respectively. We will need the following:

Lemma A1.

Let

Z

be a realization of the density

f_{2}

independent of

{\tilde{f}}_{i, h_{i}}

for

i = 1, 2

. Assume that the densities

f_{1}

and

f_{2}

belong to

Σ (s, L)

. Then, for

Z \in S_{I_{i}}

,

E_{Z} [{\tilde{f}}_{i, h_{i}} (Z)] = f_{i} (Z) + \sum_{j = ν / 2}^{⌊s / 2⌋} c_{i, j} (Z) h_{i}^{2 j} + O (h_{i}^{s}) .

(A12)

Proof.

Obtaining the lower order terms in (A12) is a common result in kernel density estimation. However, since we also require the higher order terms, we present the proof here. Additionally, some of the results in this proof will be useful later. From the linearity of the KDE, we have that if

X

is drawn from

f_{i}

and is independent of

Z

, then

\begin{matrix} E_{Z} {\tilde{f}}_{i, h_{i}} (Z) & = & E_{Z} [\frac{1}{h_{i}^{d}} K (\frac{X - Z}{h_{i}})] \\ = & \int \frac{1}{h_{i}^{d}} K (\frac{x - Z}{h_{i}}) f_{i} (x) d x \\ = & \int K (t) f_{i} (t h_{i} + Z) d t, \end{matrix}

(A13)

where the last step follows on from the substitution

t = \frac{x - Z}{h_{i}}

. Since the density (

f_{i}

) belongs to

Σ (s, K)

, by using multi-index notation we can expand it to

f_{i} (t h_{i} + Z) = f_{i} (Z) + \sum_{0 < | α | \leq ⌊s⌋} \frac{D^{α} f_{i} (Z)}{α!} {(t h_{i})}^{α} + O ({∥t h_{i}∥}^{s}),

(A14)

where

α! = α_{1}! α_{2}! \dots α_{d}!

and

Y^{α} = Y_{1}^{α_{1}} Y_{2}^{α_{2}} \dots Y_{d}^{α_{d}}

. Combining (A13) and (A14) gives

\begin{matrix} E_{Z} {\tilde{f}}_{i, h_{i}} (Z) & = & f_{i} (Z) + \sum_{0 < | α | \leq ⌊s⌋} \frac{D^{α} f_{i} (Z)}{α!} h_{i}^{| α |} \int t^{α} K (t) d t + O (h_{i}^{s}) \\ = & f_{i} (Z) + \sum_{j = ν / 2}^{⌊s / 2⌋} c_{i, j} (Z) h_{i}^{2 j} + O (h_{i}^{s}), \end{matrix}

where the last step follows from the fact that K is symmetric and of order

ν

. ☐

To obtain a similar result for the case when

Z

is near the boundary of

S

, we use the assumption

A . 5

.

Lemma A2.

Let

γ (x, y)

be an arbitrary function satisfying

{sup}_{x, y} | γ (x, y) | < \infty

. Let

S

satisfy the boundary smoothness conditions of Assumption

A . 5

. Assume that the densities

f_{1}

and

f_{2}

belong to

Σ (s, L)

, and let

Z

be a realization of the

f_{2}

density independently of

{\tilde{f}}_{i, h_{i}}

for

i = 1, 2

. Let

h^{^{'}} = min (h_{1}, h_{2})

. Then,

\begin{matrix} E [1_{\{Z \in S_{B_{i}}\}} γ (f_{1} (Z), f_{2} (Z)) B_{Z}^{t} [{\tilde{f}}_{i, h_{i}} (Z)]] & = & \sum_{j = 1}^{r} c_{4, i, j, t} h_{i}^{j} + o (h_{i}^{r}) \end{matrix}

(A15)

\begin{matrix} E [1_{\{Z \in S_{B_{1}} \cap S_{B_{2}}\}} γ (f_{1} (Z), f_{2} (Z)) B_{Z}^{t} [{\tilde{f}}_{1, h_{1}} (Z)] B_{Z}^{q} [{\tilde{f}}_{2, h_{2}} (Z)]] & = & \sum_{j = 0}^{r - 1} \sum_{i = 0}^{r - 1} c_{4, j, i, q, t} h_{1}^{j} h_{2}^{i} h^{^{'}} + o ({(h^{^{'}})}^{r}) \end{matrix}

(A16)

Proof.

For a fixed X near the boundary of

S

, we have

\begin{matrix} E [{\tilde{f}}_{i, h_{i}} (X)] - f_{i} (X) & = & \frac{1}{h_{i}^{d}} \int_{Y : Y \in S} K (\frac{X - Y}{h_{i}}) f_{i} (Y) d Y - f_{i} (X) \\ = & [\frac{1}{h_{i}^{d}} \int_{Y : K (\frac{X - Y}{h_{i}}) > 0} K (\frac{X - Y}{h_{i}}) f_{i} (Y) d Y - f_{i} (X)] \\ - [\frac{1}{h_{i}^{d}} \int_{Y : Y \notin S} K (\frac{X - Y}{h_{i}}) f_{i} (Y) d Y] \\ = & T_{1, i} (X) - T_{2, i} (X) . \end{matrix}

Note that, in

T_{1, i} (X)

, we are extending the integral beyond the support of the

f_{i}

density. However, by using the same Taylor series expansion method as in the proof of Lemma A1, we always evaluate

f_{i}

and its derivatives at point X which is within the support of

f_{i}

. Thus, it does not matter how we define an extension of

f_{i}

since the Taylor series will remain the same. Thus,

T_{1, i} (X)

results in an identical expression to that obtained from (A12).

For the

T_{2, i} (X)

term, we expand it using multi-index notation as

\begin{matrix} T_{2, i} (X) & = & \frac{1}{h_{i}^{d}} \int_{Y : Y \notin S} K (\frac{X - Y}{h_{i}}) f_{i} (Y) d Y \\ = & \int_{u : h_{i} u + X \notin S, K (u) > 0} K (u) f_{i} (X + h_{i} u) d u \\ = & \sum_{| α | \leq r} \frac{h_{i}^{| α |}}{α!} \int_{u : h_{i} u + X \notin S, K (u) > 0} K (u) D^{α} f_{i} (X) u^{α} d u + o (h_{i}^{r}) . \end{matrix}

Recognizing that the

| α |

th derivative of

f_{i}

is

r - | α |

times differentiable, we can apply assumption

A . 5

to obtain the expectation of

T_{2, i} (X)

wrt X:

\begin{matrix} E [T_{2, i} (X)] & = & \frac{1}{h_{i}^{d}} \int_{X} \int_{Y : Y \notin S} K (\frac{X - Y}{h_{i}}) f_{i} (Y) d Y f_{2} (X) d x \\ = & \sum_{| α | \leq r} \frac{h_{i}^{| α |}}{α!} \int_{X} \int_{u : h_{i} u + X \notin S, K (u) > 0} K (u) D^{α} f_{i} (X) u^{α} d u f_{2} (X) d X + o (h_{i}^{r}) \\ = & \sum_{| α | \leq r} \frac{h_{i}^{| α |}}{α!} [\sum_{1 \leq | β | \leq r - | α |} e_{β, r - | α |} h_{i}^{| β |} + o (h_{i}^{r - | α |})] + o (h_{i}^{r}) \\ = & \sum_{j = 1}^{r} e_{j} h_{i}^{j} + o (h_{i}^{r}) . \end{matrix}

Similarly, we find that

\begin{matrix} E [{(T_{2, i} (X))}^{t}] & = & \frac{1}{h_{i}^{d t}} \int_{X} {(\int_{Y : Y \notin S} K (\frac{X - Y}{h_{i}}) f_{i} (Y) d Y)}^{t} f_{2} (X) d x \\ = & \int_{X} {(\sum_{| α | \leq r} \frac{h_{i}^{| α |}}{α!} \int_{u : h_{i} u + X \notin S, K (u) > 0} K (u) D^{α} f_{i} (X) u^{α} d u)}^{t} f_{2} (X) d X \\ = & \sum_{j = 1}^{r} e_{j, t} h_{i}^{j} + o (h_{i}^{r}) . \end{matrix}

Combining these results gives

\begin{matrix} E [1_{\{Z \in S_{B}\}} γ (f_{1} (Z), f_{2} (Z)) {(E_{Z} [{\tilde{f}}_{i, h_{i}} (Z)] - f_{i} (Z))}^{t}] & = & E [γ (f_{1} (Z), f_{2} (Z)) {(T_{1, i} (Z) - T_{2, i} (Z))}^{t}] \\ = & E [γ (f_{1} (Z), f_{2} (Z)) \sum_{j = 0}^{t} (\binom{t}{j}) {(T_{1, i} (Z))}^{j} {(- T_{2, i} (Z))}^{t - j}] \\ = & \sum_{j = 1}^{r} c_{4, i, j, t} h_{i}^{j} + o (h_{i}^{r}), \end{matrix}

where the constants are functionals of the kernel

γ

and the densities.

The expression in (A16) can be proved in a similar manner. ☐

Applying Lemmas A1 and A2 to (A11) gives

E [g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z)) - g (f_{1} (Z), f_{2} (Z))] = \sum_{j = 1}^{r} (c_{4, 1, j} h_{1}^{j} + c_{4, 2, j} h_{2}^{j}) + \sum_{j = 0}^{r - 1} \sum_{i = 0}^{r - 1} c_{5, i, j} h_{1}^{j} h_{2}^{i} h^{^{'}} + o (h_{1}^{r} + h_{2}^{r}) .

(A17)

For the variance term (the first term) in (A10), the truncated Taylor series expansion of

g ({\tilde{f}}_{1, h_{1}} (Z), {\tilde{f}}_{2, h_{2}} (Z))

around

E_{Z} {\tilde{f}}_{1, h_{1}} (Z)

and

E_{Z} {\tilde{f}}_{2, h_{2}} (Z)

gives

\begin{matrix} g ({\tilde{f}}_{1, h_{1}} (Z), {\tilde{f}}_{2, h_{2}} (Z)) & = & \sum_{i = 0}^{λ} \sum_{j = 0}^{λ} ({\frac{\partial^{i + j} g (x, y)}{\partial x^{i} \partial y^{j}}|}_{\begin{matrix} x = E_{Z} {\tilde{f}}_{1, h_{1}} (Z) \\ y = E_{Z} {\tilde{f}}_{2, h_{2}} (Z) \end{matrix}}) \frac{{\tilde{e}}_{1, h_{1}}^{i} (Z) {\tilde{e}}_{2, h_{2}}^{j} (Z)}{i! j!} + o ({\tilde{e}}_{1, h_{1}}^{λ} (Z) + {\tilde{e}}_{2, h_{2}}^{λ} (Z)) \end{matrix}

(A18)

where

{\tilde{e}}_{i, h_{i}} (Z) : = {\tilde{f}}_{i, h_{i}} (Z) - E_{Z} {\tilde{f}}_{i, h_{i}} (Z)

. To control the variance term in (A11), we thus require expressions for

E_{Z} [{\tilde{e}}_{i, h_{i}}^{j} (Z)]

.

Lemma A3.

Let

Z

be a realization of the

f_{2}

density that is in the interior of the support and is independent of

{\tilde{f}}_{i, h_{i}}

for

i = 1, 2

. Let

n (q)

be the set of integer divisors of q including 1 but excluding q. Then,

\begin{matrix} E_{Z} [{\tilde{e}}_{i, h_{i}}^{q} (Z)] & = & \{\begin{matrix} \sum_{j \in n (q)} \frac{1}{{(N_{2} h_{2}^{d})}^{q - j}} \sum_{m = 0}^{⌊s / 2⌋} c_{6, i, q, j, m} (Z) h_{i}^{2 m} + O (\frac{1}{N_{i}}), & q \geq 2 \\ 0, & q = 1, \end{matrix} \end{matrix}

(A19)

\begin{matrix} E_{Z} [{\tilde{e}}_{1, h_{1}}^{q} (Z) {\tilde{e}}_{2, h_{2}}^{l} (Z)] & = & \{\begin{matrix} (\sum_{i \in n (q)} \frac{1}{{(N_{1} h_{1}^{d})}^{q - i}} \sum_{m = 0}^{⌊s / 2⌋} c_{6, 1, q, i, m} (Z) h_{1}^{2 m}) \times q, & l \geq 2 \\ (\sum_{j \in n (l)} \frac{1}{{(N_{2} h_{2}^{d})}^{l - j}} \sum_{t = 0}^{⌊s / 2⌋} c_{6, 2, l, j, t} (Z) h_{2}^{2 t}) + O (\frac{1}{N_{1}} + \frac{1}{N_{2}}), \\ 0, & q = 1 or l = 1 \end{matrix} \end{matrix}

(A20)

where

c_{6, i, q, j, m}

is a functional of

f_{1}

and

f_{2} .

Proof.

Define the random variable

V_{i} (Z) = K (\frac{X_{i} - Z}{h_{2}}) - E_{Z} K (\frac{X_{i} - Z}{h_{2}})

. This gives

\begin{matrix} {\tilde{e}}_{2, h_{2}} (Z) & = & {\tilde{f}}_{2, h_{2}} (Z) - E_{Z} {\tilde{f}}_{2, h_{2}} (Z) \\ = & \frac{1}{N_{2} h_{2}^{d}} \sum_{i = 1}^{N_{2}} V_{i} (Z) . \end{matrix}

Clearly,

E_{Z} V_{i} (Z) = 0

. From (A13), we have for integer

j \geq 1

\begin{matrix} E_{Z} [K^{j} (\frac{X_{i} - Z}{h_{2}})] & = & \int K^{j} (t) f_{2} (t h_{2} + Z) d t \\ = & h_{2}^{d} \sum_{m = 0}^{⌊s / 2⌋} c_{3, 2, j, m} (Z) h_{2}^{2 m}, \end{matrix}

where the constants

c_{3, 2, j, m}

depend on density

f_{2}

, its derivatives, and the moments of kernel

K^{j}

. Note that since K is symmetric, the odd moments of

K^{j}

are zero for

Z

in the interior of the support. However, all even moments may now be non-zero since

K^{j}

may now be non-negative. In accordance with the binomial theorem,

\begin{matrix} E_{Z} [V_{i}^{j} (Z)] & = & \sum_{k = 0}^{j} (\binom{j}{k}) E_{Z} [K^{k} (\frac{X_{i} - Z}{h_{2}})] E_{Z} {[K (\frac{X_{i} - Z}{h_{2}})]}^{j - k} \\ = & \sum_{k = 0}^{j} (\binom{j}{k}) h_{2}^{d} O (h_{2}^{d (j - k)}) \sum_{m = 0}^{⌊s / 2⌋} c_{3, 2, k, m} (Z) h_{2}^{2 m} \\ = & h_{2}^{d} \sum_{m = 0}^{⌊s / 2⌋} c_{3, 2, j, m} (Z) h_{2}^{2 m} + O (h^{2 d}) . \end{matrix}

We can use these expressions to simplify

E_{Z} [{\tilde{e}}_{2, h_{2}}^{q} (Z)]

. As an example, let

q = 2

. Then, since the

X_{i} s

are independent,

\begin{matrix} E_{Z} [{\tilde{e}}_{2, h_{2}}^{2} (Z)] & = & \frac{1}{N_{2} h_{2}^{2 d}} E_{Z} V_{i}^{2} (Z) \\ = & \frac{1}{N_{2} h_{2}^{d}} \sum_{m = 0}^{⌊s / 2⌋} c_{3, 2, 2, m} (Z) h_{2}^{2 m} + O (\frac{1}{N_{2}}) . \end{matrix}

Similarly, we find that

\begin{matrix} E_{Z} [{\tilde{e}}_{2, h_{2}}^{3} (Z)] & = & \frac{1}{N_{2}^{2} h_{2}^{3 d}} E_{Z} V_{i}^{3} (Z) \\ = & \frac{1}{{(N_{2} h_{2}^{d})}^{2}} \sum_{m = 0}^{⌊s / 2⌋} c_{3, 2, 3, m} (Z) h_{2}^{2 m} + o (\frac{1}{N_{2}}) . \end{matrix}

For

q = 4

, we have

\begin{matrix} E_{Z} [{\tilde{e}}_{2, h_{2}}^{4} (Z)] & = & \frac{1}{N_{2}^{3} h_{2}^{4 d}} E_{Z} V_{i}^{4} (Z) + \frac{N_{2} - 1}{N_{2}^{3} h_{2}^{4 d}} {(E_{Z} V_{i}^{2} (Z))}^{2} \\ = & \frac{1}{{(N_{2} h_{2}^{d})}^{3}} \sum_{m = 0}^{⌊s / 2⌋} c_{3, 2, 4, m} (Z) h_{2}^{2 m} + \frac{1}{{(N_{2} h_{2}^{d})}^{2}} \sum_{m = 0}^{⌊s / 2⌋} c_{6, 2, 2, m} (Z) h_{2}^{2 m} + o (\frac{1}{N_{2}}) . \end{matrix}

The pattern for

q \geq 2

is then,

E_{Z} [{\tilde{e}}_{2, h_{2}}^{q} (Z)] = \sum_{i \in n (q)} \frac{1}{{(N_{2} h_{2}^{d})}^{q - i}} \sum_{m = 0}^{⌊s / 2⌋} c_{6, 2, q, i, m} (Z) h_{2}^{2 m} + O (\frac{1}{N_{2}}) .

For any integer (q), the largest possible factor is

q / 2

. Thus, for a given q, the smallest possible exponent on the

N_{2} h_{2}^{d}

term is

q / 2

. This increases as q increases. A similar expression holds for

E_{Z} [{\tilde{e}}_{1, h_{1}}^{q} (Z)]

, except the

X_{i}

s are replaced with

Y_{i}

,

f_{2}

is replaced with

f_{1}

, and

N_{2}

and

h_{2}

are replaced with

N_{1}

and

h_{1}

, respectively, all resulting in different constants. Then, since

{\tilde{e}}_{1, h_{1}} (Z)

and

{\tilde{e}}_{2, h_{2}} (Z)

are conditionally independent given

Z

,

\begin{matrix} E_{Z} [{\tilde{e}}_{1, h_{1}}^{q} (Z) {\tilde{e}}_{2, h_{2}}^{l} (Z)] & = & (\sum_{i \in n (q)} \frac{1}{{(N_{1} h_{1}^{d})}^{q - i}} \sum_{m = 0}^{⌊s / 2⌋} c_{6, 1, q, i, m} (Z) h_{1}^{2 m}) (\sum_{j \in n (l)} \frac{1}{{(N_{2} h_{2}^{d})}^{l - j}} \sum_{t = 0}^{⌊s / 2⌋} c_{6, 2, l, j, t} (Z) h_{2}^{2 t}) \\ + O (\frac{1}{N_{1}} + \frac{1}{N_{2}}) . \end{matrix}

☐

Applying Lemma A3 to (A18) when taking the conditional expectation given

Z

in the interior gives an expression of the form

\begin{matrix} \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{⌊s / 2⌋} (c_{7, 1, j, m} (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z)) \frac{h_{1}^{2 m}}{{(N_{1} h_{1}^{d})}^{j}} + c_{7, 2, j, m} (E_{Z} {\tilde{f}}_{2, h_{2}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z)) \frac{h_{2}^{2 m}}{{(N_{2} h_{2}^{d})}^{j}}) \\ + \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{⌊s / 2⌋} \sum_{i = 1}^{λ / 2} \sum_{n = 0}^{⌊s / 2⌋} c_{7, j, i, m, n} (E_{Z} {\tilde{f}}_{2, h_{2}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z)) \frac{h_{1}^{2 m} h_{2}^{2 n}}{{(N_{1} h_{1}^{d})}^{j} {(N_{2} h_{2}^{d})}^{i}} \\ + O (\frac{1}{{(N_{1} h_{1}^{d})}^{\frac{λ}{2}}} + \frac{1}{{(N_{2} h_{2}^{d})}^{\frac{λ}{2}}}) . \end{matrix}

(A21)

Note that the functionals

c_{7, i, j, m}

and

c_{7, j, i, m, n}

depend on the derivatives of g and

E_{Z} {\tilde{f}}_{i, h_{i}} (Z)

which depend on

h_{i}

. To apply an ensemble estimation, we need to separate the dependence on

h_{i}

from the constants. If we use ODin1, then it is sufficient to note that in the interior of the support,

E_{Z} {\tilde{f}}_{i, h_{i}} (Z) = f_{i} (Z) + o (1)

and therefore,

c (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z)) = c (f_{1} (Z), f_{2} (Z)) + o (1)

for some functional c. The terms in (A22) reduce to

c_{7, 1, 1, 0} (f_{1} (Z), f_{2} (Z)) \frac{1}{N_{1} h_{1}^{d}} + c_{7, 2, 1, 0} (f_{1} (Z), f_{2} (Z)) \frac{1}{N_{2} h_{2}^{d}} + o (\frac{1}{N_{1} h_{1}^{d}} + \frac{1}{N_{2} h_{2}^{d}}) .

For ODin2, we need the higher order terms. To separate the dependence on

h_{i}

from the constants, we need more information about the functional g and its derivatives. Consider a special case where the functional

g (x, y)

has derivatives of the form of

x^{α} y^{β}

with

α, β < 0 .

This includes the important cases of the KL divergence and the Renyi divergence. The generalized binomial theorem states that if

(\binom{α}{m}) : = \frac{α (α - 1) \dots (α - m + 1)}{m!}

and if q and t are real numbers with

| q | > | t |

, then for any complex number (

α

),

{(q + t)}^{α} = \sum_{m = 0}^{\infty} (\binom{α}{m}) q^{α - m} t^{m} .

(A22)

Since the densities are bounded away from zero, for sufficiently small

h_{i}

, we have

f_{i} (Z) > |\sum_{j = ν / 2}^{⌊s / 2⌋} c_{i, j} (Z) h_{i}^{2 j} + O (h_{i}^{s})| .

Applying the generalized binomial theorem and Lemma A1 gives

{(E_{Z} {\tilde{f}}_{1, h_{1}} (Z))}^{α} = \sum_{m = 0}^{\infty} (\binom{α}{m}) f_{i}^{α - m} (Z) {(\sum_{j = ν / 2}^{⌊s / 2⌋} c_{i, j} (Z) h_{i}^{2 j} + O (h_{i}^{s}))}^{m} .

Since m is an integer, the exponents of the

h_{i}

terms are also integers. Thus, (A22) gives, in this case,

\begin{matrix} E_{Z} [g ({\tilde{f}}_{1, h_{1}} (Z), {\tilde{f}}_{2, h_{2}} (Z)) - g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z))] & = & \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{⌊s / 2⌋} (c_{8, 1, j, m} (Z) \frac{h_{1}^{2 m}}{{(N_{1} h_{1}^{d})}^{j}} + c_{8, 2, j, m} (Z) \frac{h_{2}^{2 m}}{{(N_{2} h_{2}^{d})}^{j}}) \\ + \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{⌊s / 2⌋} \sum_{i = 1}^{λ / 2} \sum_{n = 0}^{⌊s / 2⌋} c_{8, j, i, m, n} (Z) \frac{h_{1}^{2 m} h_{2}^{2 n}}{{(N_{1} h_{1}^{d})}^{j} {(N_{2} h_{2}^{d})}^{i}} \\ + O (\frac{1}{{(N_{1} h_{1}^{d})}^{\frac{λ}{2}}} + \frac{1}{{(N_{2} h_{2}^{d})}^{\frac{λ}{2}}} + h_{1}^{s} + h_{2}^{s}) . \end{matrix}

(A23)

As before, the case for

Z

close to the boundary of the support is more complicated. However, by using a similar technique to the proof of Lemma A2 for

Z

at the boundary and combining with previous results, we find that for general g,

E [g ({\tilde{f}}_{1, h_{1}} (Z), {\tilde{f}}_{2, h_{2}} (Z)) - g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z))] = c_{9, 1} \frac{1}{N_{1} h_{1}^{d}} + c_{9, 2} \frac{1}{N_{2} h_{2}^{d}} + o (\frac{1}{N_{1} h_{1}^{d}} + \frac{1}{N_{2} h_{2}^{d}}) .

(A24)

If

g (x, y)

has derivatives of the form of

x^{α} y^{β}

with

α, β < 0

, then we can similarly obtain

\begin{matrix} E [g ({\tilde{f}}_{1, h_{1}} (Z), {\tilde{f}}_{2, h_{2}} (Z)) - g (E_{Z} {\tilde{f}}_{1, h_{1}} (Z), E_{Z} {\tilde{f}}_{2, h_{2}} (Z))] & = & \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{r} (c_{9, 1, j, m} \frac{h_{1}^{m}}{{(N_{1} h_{1}^{d})}^{j}} + c_{9, 2, j, m} \frac{h_{2}^{m}}{{(N_{2} h_{2}^{d})}^{j}}) \\ + \sum_{j = 1}^{λ / 2} \sum_{m = 0}^{r} \sum_{i = 1}^{λ / 2} \sum_{n = 0}^{r} c_{9, j, i, m, n} \frac{h_{1}^{m} h_{2}^{n}}{{(N_{1} h_{1}^{d})}^{j} {(N_{2} h_{2}^{d})}^{i}} \\ + O (\frac{1}{{(N_{1} h_{1}^{d})}^{\frac{λ}{2}}} + \frac{1}{{(N_{2} h_{2}^{d})}^{\frac{λ}{2}}} + h_{1}^{s} + h_{2}^{s}) . \end{matrix}

(A25)

Combining (A17) with either (A24) or (A26) completes the proof.

Appendix F. Proof of Theorem A4 (Variance)

To bound the variance of the plug-in estimator

{\tilde{G}}_{h_{1}, h_{2}}

, we use the Efron–Stein inequality [70]:

Lemma A4 (Efron–Stein Inequality).

Let

X_{1}, \dots, X_{n}, X_{1}^{^{'}}, \dots, X_{n}^{^{'}}

be independent random variables on the space

S

. Then, if

f : S \times \dots \times S \to R

, we have

V [f (X_{1}, \dots, X_{n})] \leq \frac{1}{2} \sum_{i = 1}^{n} E [{(f (X_{1}, \dots, X_{n}) - f (X_{1}, \dots, X_{i}^{^{'}}, \dots, X_{n}))}^{2}] .

Suppose we have samples

\{X_{1}, \dots, X_{N_{2}}, Y_{1}, \dots, Y_{N_{1}}\}

and

\{X_{1}^{^{'}}, \dots, X_{N_{2}}, Y_{1}, \dots, Y_{N_{1}}\}

and denote the respective estimators as

{\tilde{G}}_{h_{1}, h_{2}}

and

{\tilde{G}}_{h_{1}, h_{2}}^{^{'}}

. We have

\begin{matrix} |{\tilde{G}}_{h_{1}, h_{2}} - {\tilde{G}}_{h_{1}, h_{2}}^{^{'}}| & \leq & \frac{1}{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}}))| \\ + \frac{1}{N_{2}} \sum_{j = 2}^{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))| . \end{matrix}

(A26)

Since g is Lipschitz continuous with the constant

C_{g}

, we have

\begin{matrix} |g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}}))| & \leq & C_{g} (|{\tilde{f}}_{1, h_{1}} (X_{1}) - {\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}})| + |{\tilde{f}}_{2, h_{2}} (X_{1}) - {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}})|), \end{matrix}

(A27)

\begin{matrix} |{\tilde{f}}_{1, h_{1}} (X_{1}) - {\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}})| & = & \frac{1}{N_{1} h_{1}^{d}} |\sum_{i = 1}^{N_{1}} (K (\frac{X_{1} - Y_{i}}{h_{1}}) - K (\frac{X_{1}^{^{'}} - Y_{i}}{h_{1}}))| \\ \leq & \frac{1}{N_{1} h_{1}^{d}} \sum_{i = 1}^{N_{1}} |K (\frac{X_{1} - Y_{i}}{h_{1}}) - K (\frac{X_{1}^{^{'}} - Y_{i}}{h_{1}})| \\ \Rightarrow E [{|{\tilde{f}}_{1, h_{1}} (X_{1}) - {\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}})|}^{2}] & \leq & \frac{1}{N_{1} h_{1}^{2 d}} \sum_{i = 1}^{N_{1}} E [{(K (\frac{X_{1} - Y_{i}}{h_{1}}) - K (\frac{X_{1}^{^{'}} - Y_{i}}{h_{1}}))}^{2}], \end{matrix}

(A28)

where the last step follows from Jensen’s inequality. By making the substitutions

u_{i} = \frac{X_{1} - Y_{i}}{h_{1}}

and

u_{i}^{^{'}} = \frac{X_{1}^{^{'}} - Y_{i}}{h_{1}}

, this gives

\begin{matrix} \frac{1}{h_{1}^{2 d}} E [{(K (\frac{X_{1} - Y_{i}}{h_{1}}) - K (\frac{X_{1}^{^{'}} - Y_{i}}{h_{1}}))}^{2}] & = & \frac{1}{h^{2 d}} \int {(K (\frac{X_{1} - Y_{i}}{h_{1}}) - K (\frac{X_{1}^{^{'}} - Y_{i}}{h_{1}}))}^{2} \times \\ f_{2} (X_{1}) f_{2} (X_{1}^{^{'}}) f_{1} (Y_{i}) d X_{1} d X_{1}^{^{'}} d Y_{i} \\ \leq & {2 | | K | |}_{\infty}^{2} . \end{matrix}

Combining this with (A29) gives

E [{|{\tilde{f}}_{1, h_{1}} (X_{1}) - {\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}})|}^{2}] \leq {2 | | K | |}_{\infty}^{2} .

Similarly,

E [{|{\tilde{f}}_{2, h_{2}} (X_{1}) - {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}})|}^{2}] \leq {2 | | K | |}_{\infty}^{2} .

Combining these results with (A27) gives

E [{(g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}})))}^{2}] \leq 8 C_{g}^{2} {| | K | |}_{\infty}^{2} .

(A29)

The second term in (A26) is controlled in a similar way. From the Lipschitz condition,

\begin{matrix} {|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|}^{2} & \leq & C_{g}^{2} {|{\tilde{f}}_{2, h_{2}} (X_{j}) - {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j})|}^{2} \\ = & \frac{C_{g}^{2}}{M_{2}^{2} h_{2}^{2 d}} {(K (\frac{X_{j} - X_{1}}{h}) - K (\frac{X_{j} - X_{1}^{^{'}}}{h}))}^{2} . \end{matrix}

The

h_{2}^{2 d}

terms are eliminated by making the substitutions of

u_{j} = \frac{X_{j} - X_{1}}{h_{2}}

and

u_{j}^{^{'}} = \frac{X_{j} - X_{1}^{^{'}}}{h_{2}}

within the expectation to obtain

\begin{matrix} E [{|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|}^{2}] & \leq & \frac{2 C_{g}^{2} {| | K | |}_{\infty}^{2}}{M_{2}^{2}} \end{matrix}

(A30)

\Rightarrow E [{(\sum_{j = 2}^{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|)}^{2}]

\begin{matrix} = & \sum_{j = 2}^{N_{2}} \sum_{i = 2}^{N_{2}} E [|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))| |g ({\tilde{f}}_{1, h_{1}} (X_{i}), {\tilde{f}}_{2, h_{2}} (X_{i})) - g ({\tilde{f}}_{1, h_{1}} (X_{i}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{i}))|] \\ \leq & M_{2}^{2} E [{|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|}^{2}] \\ \leq & 2 C_{g}^{2} {| | K | |}_{\infty}^{2}, \end{matrix}

(A31)

where we use the Cauchy Schwarz inequality to bound the expectation within each summand. Finally, applying Jensen’s inequality and (A29) and (A32) gives

\begin{matrix} E [{|{\tilde{G}}_{h_{1}, h_{2}} - {\tilde{G}}_{h_{1}, h_{2}}^{^{'}}|}^{2}] & \leq & \frac{2}{N_{2}^{2}} E [{|g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}}))|}^{2}] \\ + \frac{2}{N_{2}^{2}} E [{(\sum_{j = 2}^{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|)}^{2}] \\ \leq & \frac{20 C_{g}^{2} {| | K | |}_{\infty}^{2}}{N_{2}^{2}} . \end{matrix}

Now, suppose we have samples

\{X_{1}, \dots, X_{N_{2}}, Y_{1}, \dots, Y_{N_{1}}\}

and

\{X_{1}, \dots, X_{N_{2}}, Y_{1}^{^{'}}, \dots, Y_{N_{1}}\}

and denote the respective estimators as

{\tilde{G}}_{h_{1}, h_{2}}

and

{\tilde{G}}_{h_{1}, h_{2}}^{^{'}}

. Then,

\begin{matrix} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}}^{^{'}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))| & \leq & C_{g} |{\tilde{f}}_{1, h_{1}} (X_{j}) - {\tilde{f}}_{1, h_{1}}^{^{'}} (X_{j})| \\ = & \frac{C_{g}}{N_{1} h_{1}^{d}} |K (\frac{X_{j} - Y_{1}}{h_{1}}) - K (\frac{X_{j} - Y_{1}^{^{'}}}{h_{1}})| \\ \Rightarrow E [{|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}}^{^{'}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))|}^{2}] & \leq & \frac{2 C_{g}^{2} {| | K | |}_{\infty}^{2}}{N_{1}^{2}} . \end{matrix}

Thus, using a similar argument as was used to obtain (A32),

\begin{matrix} E [{|{\tilde{G}}_{h_{1}, h_{2}} - {\tilde{G}}_{h_{1}, h_{2}}^{^{'}}|}^{2}] & \leq & \frac{1}{N_{2}^{2}} E [{(\sum_{j = 1}^{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}}^{^{'}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))|)}^{2}] \\ \leq & \frac{2 C_{g}^{2} {| | K | |}_{\infty}^{2}}{N_{2}^{2}} . \end{matrix}

Applying the Efron–Stein inequality gives

V [{\tilde{G}}_{h_{1}, h_{2}}] \leq \frac{10 C_{g}^{2} {| | K | |}_{\infty}^{2}}{N_{2}} + \frac{C_{g}^{2} {| | K | |}_{\infty}^{2} N_{1}}{N_{2}^{2}} .

Appendix G. Proof of Theorem 4 (CLT)

We are interested in the asymptotic distribution of

\begin{matrix} \sqrt{N_{2}} ({\tilde{G}}_{h_{1}, h_{2}} - E [{\tilde{G}}_{h_{1}, h_{2}}]) & = & \frac{1}{\sqrt{N_{2}}} \sum_{j = 1}^{N_{2}} (g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - E_{X_{j}} [g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))]) \\ + \frac{1}{\sqrt{N_{2}}} \sum_{j = 1}^{N_{2}} (E_{X_{j}} [g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))] - E [g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))]) . \end{matrix}

Note that in the standard central limit theorem [79], the second term converges in distribution to a Gaussian random variable. If the first term converges in probability to a constant (specifically, 0), then we can use Slutsky’s theorem [80] to find the asymptotic distribution. So, now, we focus on the first term which we denote as

V_{N_{2}}

.

To prove convergence in probability, we use Chebyshev’s inequality. Note that

E [V_{N_{2}}] = 0

. To bound the variance of

V_{N_{2}}

, we again use the Efron–Stein inequality. Let

X_{1}^{^{'}}

be drawn from

f_{2}

and denote

V_{N_{2}}

and

V_{N_{2}}^{^{'}}

as the sequences using

X_{1}

and

X_{1}^{^{'}}

, respectively. Then,

\begin{matrix} V_{N_{2}} - V_{N_{2}}^{^{'}} & = & \frac{1}{\sqrt{N_{2}}} (g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - E_{X_{1}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1}))]) \\ - \frac{1}{\sqrt{N_{2}}} (g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}})) - E_{X_{1}^{^{'}}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}}))]) \\ + \frac{1}{\sqrt{N_{2}}} \sum_{j = 2}^{N_{2}} (g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))) . \end{matrix}

(A32)

Note that

E [{(g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - E_{X_{1}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1}))])}^{2}] = E [V_{X_{1}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1}))]] .

We use the Efron–Stein inequality to bound

V_{X_{1}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1}))]

.

We do this by bounding the conditional expectation of the term

|g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{1}))|,

where

X_{i}

is replaced with

X_{i}^{^{'}}

in the KDEs for some

i \neq 1

. Using similar steps as in Appendix F, we have

E_{X_{1}} [{|g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{1}))|}^{2}] = O (\frac{1}{N_{2}^{2}}) .

A similar result is obtained when

Y_{i}

is replaced with

Y_{i}^{^{'}}

. Then, based on the Efron–Stein inequality,

V_{X_{1}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1}))] = O (\frac{1}{N_{2}} + \frac{1}{N_{1}})

.

Therefore,

E [\frac{1}{N_{2}} {(g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - E_{X_{1}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1}))])}^{2}] = O (\frac{1}{N_{2}^{2}} + \frac{1}{N_{1} N_{2}}) .

A similar result holds for the

g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}}))

terms in (A33).

For the third term in (A33),

E [{(\sum_{j = 2}^{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|)}^{2}]

= \sum_{j = 2}^{N_{2}} \sum_{i = 2}^{N_{2}} E [|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))| |g ({\tilde{f}}_{1, h_{1}} (X_{i}), {\tilde{f}}_{2, h_{2}} (X_{i})) - g ({\tilde{f}}_{1, h_{1}} (X_{i}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{i}))|] .

There are

M_{2}

terms where

i = j

, and we have from Appendix F (see (A30)) that

E [{|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|}^{2}] \leq \frac{2 C_{g}^{2} {| | K | |}_{\infty}^{2}}{M_{2}^{2}} .

Thus, these terms are

O (\frac{1}{M_{2}})

. There are

M_{2}^{2} - M_{2}

terms when

i \neq j

. In this case, we can do four substitutions of the form

u_{j} = \frac{X_{j} - X_{1}}{h_{2}}

to obtain

E [|g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))| |g ({\tilde{f}}_{1, h_{1}} (X_{i}), {\tilde{f}}_{2, h_{2}} (X_{i})) - g ({\tilde{f}}_{1, h_{1}} (X_{i}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{i}))|] \leq \frac{4 C_{g}^{2} {| | K | |}_{\infty}^{2} h_{2}^{2 d}}{M_{2}^{2}} .

Then, since

h_{2}^{d} = o (1)

, we get

E [{(\sum_{j = 2}^{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))|)}^{2}] = o (1),

(A33)

\begin{matrix} \Rightarrow E [{(V_{N_{2}} - V_{N_{2}}^{^{'}})}^{2}] & \leq & \frac{3}{N_{2}} E [{(g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1})) - E_{X_{1}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}), {\tilde{f}}_{2, h_{2}} (X_{1}))])}^{2}] \\ + \frac{3}{N_{2}} E [{(g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}})) - E_{X_{1}^{^{'}}} [g ({\tilde{f}}_{1, h_{1}} (X_{1}^{^{'}}), {\tilde{f}}_{2, h_{2}} (X_{1}^{^{'}}))])}^{2}] \\ + \frac{3}{N_{2}} E [{(\sum_{j = 2}^{N_{2}} (g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}}^{^{'}} (X_{j}), {\tilde{f}}_{2, h_{2}}^{^{'}} (X_{j}))))}^{2}] \\ = & o (\frac{1}{N_{2}}) . \end{matrix}

Now, consider samples

\{X_{1}, \dots, X_{N_{2}}, Y_{1}, \dots, Y_{N_{1}}\}

and

\{X_{1}, \dots, X_{N_{2}}, Y_{1}^{^{'}}, \dots, Y_{N_{1}}\}

and the respective sequences

V_{N_{2}}

and

V_{N_{2}}^{^{'}}

. Then,

\begin{matrix} V_{N_{2}} - V_{N_{2}}^{^{'}} & = & \frac{1}{\sqrt{N_{2}}} \sum_{j = 1}^{N_{2}} (g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}}^{^{'}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))) . \end{matrix}

Using a similar argument as that used to obtain (A33), we have that if

h_{1}^{d} = o (1)

and

N_{1} \to \infty

, then

E [{(\sum_{j = 2}^{N_{2}} |g ({\tilde{f}}_{1, h_{1}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j})) - g ({\tilde{f}}_{1, h_{1}}^{^{'}} (X_{j}), {\tilde{f}}_{2, h_{2}} (X_{j}))|)}^{2}] = o (1)

\Rightarrow E [{(V_{N_{2}} - V_{N_{2}}^{^{'}})}^{2}] = o (\frac{1}{N_{2}}) .

Applying the Efron–Stein inequality gives

V [V_{N_{2}}] = o (\frac{N_{2} + N_{1}}{N_{2}}) = o (1) .

Thus, based on Chebyshev’s inequality,

Pr (|V_{N_{2}}| > ϵ) \leq \frac{V [V_{N_{2}}]}{ϵ^{2}} = o (1),

and therefore,

V_{N_{2}}

converges to zero in probability. Based on Slutsky’s theorem,

\sqrt{N_{2}} ({\tilde{G}}_{h_{1}, h_{2}} - E [{\tilde{G}}_{h_{1}, h_{2}}])

converges in distribution to a zero mean Gaussian random variable with variance

V [E_{X} [g ({\tilde{f}}_{1, h_{1}} (X), {\tilde{f}}_{2, h_{2}} (X))]],

where

X

is drawn from

f_{2}

.

For the weighted ensemble estimator, we wish to know the asymptotic distribution of

\sqrt{N_{2}} ({\tilde{G}}_{w} - E [{\tilde{G}}_{w}])

where

{\tilde{G}}_{w} = \sum_{l \in \bar{l}} w (l) {\tilde{G}}_{h (l)}

. We have

\begin{matrix} \sqrt{N_{2}} ({\tilde{G}}_{w} - E [{\tilde{G}}_{w}]) & = & \frac{1}{\sqrt{N_{2}}} \sum_{j = 1}^{N_{2}} \sum_{l \in \bar{l}} w (l) (g ({\tilde{f}}_{1, h (l)} (X_{j}), {\tilde{f}}_{2, h (l)} (X_{j})) - E_{X_{j}} [g ({\tilde{f}}_{1, h (l)} (X_{j}), {\tilde{f}}_{2, h (l)} (X_{j}))]) \\ + \frac{1}{\sqrt{N_{2}}} \sum_{j = 1}^{N_{2}} (E_{X_{j}} [\sum_{l \in \bar{l}} w (l) g ({\tilde{f}}_{1, h (l)} (X_{j}), {\tilde{f}}_{2, h (l)} (X_{j}))] - E [\sum_{l \in \bar{l}} w (l) g ({\tilde{f}}_{1, h (l)} (X_{j}), {\tilde{f}}_{2, h (l)} (X_{j}))]) . \end{matrix}

The second term again converges in distribution to a Gaussian random variable by the central limit theorem. The mean and variance are, respectively, zero and

V [\sum_{l \in \bar{l}} w (l) E_{X} [g ({\tilde{f}}_{1, h (l)} (X), {\tilde{f}}_{2, h (l)} (X))]] .

The first term is equal to

\begin{matrix} \sum_{l \in \bar{l}} w (l) (\frac{1}{\sqrt{N_{2}}} \sum_{j = 1}^{N_{2}} (g ({\tilde{f}}_{1, h (l)} (X_{j}), {\tilde{f}}_{2, h (l)} (X_{j})) - E_{X_{j}} [g ({\tilde{f}}_{1, h (l)} (X_{j}), {\tilde{f}}_{2, h (l)} (X_{j}))])) & = & \sum_{l \in \bar{l}} w (l) o_{P} (1) \\ = & o_{P} (1), \end{matrix}

where

o_{P} (1)

denotes the convergence to zero as a probability. In the last step, we use the fact that if two random variables converge in probability to constants, then their linear combination converges in probability to the linear combination of the constants. Combining this result with Slutsky’s theorem completes the proof.

Appendix H. Proof of Theorem 5 (Uniform MSE)

Since the MSE is equal to the square of the bias plus the variance, we can upper bound the left hand side of (7) with

\begin{matrix} sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} E [{({\tilde{G}}_{w_{0}} - G (p, q))}^{2}] & = sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} (Bias {({\tilde{G}}_{w_{0}})}^{2} + Var ({\tilde{G}}_{w_{0}})) \\ \leq sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} Bias {({\tilde{G}}_{w_{0}})}^{2} + sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} Var ({\tilde{G}}_{w_{0}}) . \end{matrix}

From the assumptions (Lipschitz, kernel bounded, weight calculated from the relaxed optimization problem), we have

\begin{matrix} sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} Var ({\tilde{G}}_{w_{0}}) & \leq & sup_{p, q \in Σ (s, K, ϵ_{0}, ϵ_{\infty})} \frac{11 C_{g}^{2} | | w_{0} {| |}_{2}^{2} {| | K | |}_{\infty}}{N} \\ = & \frac{11 C_{g}^{2} | | w_{0} {| |}_{2}^{2} {| | K | |}_{\infty}}{N}, \end{matrix}

where the last step follows on from the fact that all of the terms are independent of p and q.

For the bias, recall that if g is infinitely differentiable and if the optimal weight (

w_{0}

) is calculated using the relaxed convex optimization problem, then

\begin{matrix} Bias ({\tilde{G}}_{w_{0}}) & = & \sum_{i \in J} c_{i} (p, q) ϵ N^{- 1 / 2}, \\ \Rightarrow Bias {({\tilde{G}}_{w_{0}})}^{2} & = & \frac{ϵ^{2}}{N} {(\sum_{i \in J} c_{i} (p, q))}^{2} . \end{matrix}

(A34)

We use a topology argument to bound the supremum of this term. We use the Extreme Value Theorem [81]:

Theorem A5 (Extreme Value Theorem)

Let

f : X \to R

be continuous. If X is compact, then points

c, d \in X

s.t.

f (c) \leq f (x) \leq f (d)

exist for every

x \in X

.

Based on this theorem, f achieves its minimum and maximum on X. Our approach is to first show that the functionals

c_{i} (p, q)

are continuous wrt p and q in some appropriate norm. We then show that the space

Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})

is compact wrt this norm. The Extreme Value Theorem can then be applied to bound the supremum of (A34).

We first define the norm. Let

α = s - r > 0

. We use the standard norm on the space

Σ (s, K_{H})

[82]:

\begin{matrix} | | f | | & = & {| | f | |}_{Σ (s, K_{H})} \\ = & {| | f | |}_{C^{r}} + max_{| β | = r} {| D^{β} f |}_{C^{0, α}} \end{matrix}

where

\begin{matrix} {| | f | |}_{C^{r}} & = & max_{| β | \leq r} sup_{x \in S} | D^{β} f (x) |, \\ {| f |}_{C^{0, α}} & = & sup_{x \neq y \in S} \frac{| f (x) - f (y) |}{{| x - y |}^{α}} . \end{matrix}

Lemma A5.

The functionals

c_{i} (p, q)

are continuous wrt the norm

max (| | p | |_{C^{r}}, | | q | |_{C^{r}})

.

Proof.

The functionals

c_{i} (p, q)

depend on terms of the form

c (p, q) = \int ({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p (x) \\ t_{2} = q (x) \end{matrix}}) D^{β} p (x) D^{γ} q (x) q (x) d x .

(A35)

It is sufficient to show that this is continuous. Let

ϵ > 0

and

max (| | p - p_{0} {| |}_{C^{r}}, | | q - q_{0} {| |}_{C^{r}}) < δ

where

δ > 0

will be chosen later. Then, by applying the triangle inequality for integration and adding and subtracting terms, we have

| c (p, q) - c (p_{0}, q_{0}) |

\begin{matrix} \leq & \int |({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p (x) \\ t_{2} = q (x) \end{matrix}}) D^{β} p (x) D^{γ} q (x) (q (x) - q_{0} (x))| d x \\ + \int |({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p (x) \\ t_{2} = q (x) \end{matrix}}) D^{β} p (x) q_{0} (x) (D^{γ} q (x) - D^{γ} q_{0} (x))| d x \\ + \int |D^{β} p_{0} (x) D^{γ} q_{0} (x) q_{0} (x) (({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p (x) \\ t_{2} = q (x) \end{matrix}}) \\ - ({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p_{0} (x) \\ t_{2} = q_{0} (x) \end{matrix}}))| d x \\ + \int |({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p (x) \\ t_{2} = q (x) \end{matrix}}) D^{γ} q_{0} (x) q_{0} (x) (D^{β} p (x) - D^{β} p_{0} (x))| d x . \end{matrix}

(A36)

Based on Assumption

A . 4

, the absolute value of the mixed derivatives of g is bounded on the range defined for p and q by some constant (

C_{i, j}

). Also,

q_{0} (x) \leq ϵ_{\infty}

. Furthermore, since

D^{γ} q_{0}

and

D^{β} p

are continuous, and since

S \subset R^{d}

is compact, then the absolute value of derivatives

D^{γ} q_{0}

and

D^{β} p

is also bounded by a constant (

ϵ_{\infty}^{^{'}}

). Let

δ_{0} > 0

. Then, since the mixed derivatives of g are continuous on the interval

[ϵ_{0}, ϵ_{\infty}]

, they are uniformly continuous. Therefore, we can choose a small enough

δ

such that (s.t.)

|({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p (x) \\ t_{2} = q (x) \end{matrix}}) - ({\frac{\partial^{i + j} g (t_{1}, t_{2})}{\partial t_{1}^{i} \partial t_{2}^{j}}|}_{\begin{matrix} t_{1} = p_{0} (x) \\ t_{2} = q_{0} (x) \end{matrix}})| < δ_{0} .

(A37)

Combining all of these results with (A36) gives

\begin{matrix} | c (p, q) - c (p_{0}, q_{0}) | & \leq & λ (S) δ C_{i j} ϵ_{\infty}^{^{'}} (2 + ϵ_{\infty}) \\ + λ (S) ϵ_{\infty}^{^{'}} ϵ_{\infty} (2 δ_{0} + C_{i j} δ), \end{matrix}

where

λ (S)

is the Lebesgue measure of

S

. This is bounded since

S

is compact. Let

δ_{0}^{^{'}} > 0

be s.t. if

max (| | p - p_{0} {| |}_{C^{r}}, | | q - q_{0} {| |}_{C^{r}}) < δ_{0}^{^{'}}

, then (A37) is less than

\frac{ϵ}{4 λ (S) ϵ_{\infty}^{^{'}} ϵ_{\infty}}

. Let

δ_{1} = \frac{ϵ}{4 λ (S) C_{i j} ϵ_{\infty}^{^{'}} (1 + ϵ_{\infty})}

. Then, if

δ < min (δ_{0}^{^{'}}, δ_{1})

,

| c (p, q) - c (p_{0}, q_{0}) | < ϵ .

☐

Given that each

c_{i} (p, q)

is continuous, then

{(\sum_{i \in J} c_{i} (p, q))}^{2}

is also continuous wrt p and q.

We now argue that

Σ (s, K_{H})

is compact. First, a set is relatively compact if its closure is compact. Based on the Arzela–Ascoli theorem [83], the space

Σ (s, K_{H})

is relatively compact in the topology induced by the

{∥\cdot∥}_{Σ (t, K_{H})}

norm for any

t < s

. We choose

t = r

. It can then be shown that under the

{∥\cdot∥}_{Σ (r, K_{H})}

norm,

Σ (s, K_{H})

is complete [82]. Since

Σ (s, K_{H})

is contained in a metric space, it is also closed and therefore, equal to its closure. Thus,

Σ (s, K_{H})

is compact. Then, since

Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})

is closed in

Σ (s, K_{H})

, it is also compact. Therefore, since for each

p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})

,

{(\sum_{i \in J} c_{i} (p, q))}^{2} < \infty

, based on the Extreme Value Theorem, we have

\begin{matrix} sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} Bias {({\tilde{G}}_{w_{0}})}^{2} & = & sup_{p, q \in Σ (s, K_{H}, ϵ_{0}, ϵ_{\infty})} \frac{ϵ^{2}}{N} {(\sum_{i \in J} c_{i} (p, q))}^{2} \\ = & \frac{ϵ^{2}}{N} C, \end{matrix}

where we use the fact that J is finite (see Section 3.2 or Appendix B for the set J).

References

Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2012. [Google Scholar]
Avi-Itzhak, H.; Diep, T. Arbitrarily tight upper and lower bounds on the Bayesian probability of error. IEEE Trans. Pattern Anal. Mach. Intell. 1996, 18, 89–91. [Google Scholar] [CrossRef]
Hashlamoun, W.A.; Varshney, P.K.; Samarasooriya, V. A tight upper bound on the Bayesian probability of error. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 220–224. [Google Scholar] [CrossRef]
Moon, K.; Delouille, V.; Hero, A.O., III. Meta learning of bounds on the Bayes classifier error. In Proceedings of the 2015 IEEE Signal Processing and Signal Processing Education Workshop (SP/SPE), Salt Lake City, UT, USA, 9–12 August 2015; pp. 13–18. [Google Scholar]
Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
Berisha, V.; Wisler, A.; Hero, A.O., III; Spanias, A. Empirically Estimable Classification Bounds Based on a New Divergence Measure. IEEE Trans. Signal Process. 2016, 64, 580–591. [Google Scholar] [CrossRef] [PubMed]
Moon, K.R.; Hero, A.O., III. Multivariate f-Divergence Estimation With Confidence. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2420–2428. [Google Scholar]
Gliske, S.V.; Moon, K.R.; Stacey, W.C.; Hero, A.O., III. The intrinsic value of HFO features as a biomarker of epileptic activity. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016. [Google Scholar]
Loh, P.-L. On Lower Bounds for Statistical Learning Theory. Entropy 2017, 19, 617. [Google Scholar] [CrossRef]
Póczos, B.; Schneider, J.G. On the estimation of alpha-divergences. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 609–617. [Google Scholar]
Oliva, J.; Póczos, B.; Schneider, J. Distribution to distribution regression. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1049–1057. [Google Scholar]
Szabó, Z.; Gretton, A.; Póczos, B.; Sriperumbudur, B. Two-stage sampled learning theory on distributions. In Proceeding of The 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015. [Google Scholar]
Moon, K.R.; Delouille, V.; Li, J.J.; De Visscher, R.; Watson, F.; Hero, A.O., III. Image patch analysis of sunspots and active regions. II. Clustering via matrix factorization. J. Space Weather Space Clim. 2016, 6, A3. [Google Scholar] [CrossRef]
Moon, K.R.; Li, J.J.; Delouille, V.; De Visscher, R.; Watson, F.; Hero, A.O., III. Image patch analysis of sunspots and active regions. I. Intrinsic dimension and correlation analysis. J. Space Weather Space Clim. 2016, 6, A2. [Google Scholar] [CrossRef]
Dhillon, I.S.; Mallela, S.; Kumar, R. A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 2003, 3, 1265–1287. [Google Scholar]
Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
Lewi, J.; Butera, R.; Paninski, L. Real-time adaptive information-theoretic optimization of neurophysiology experiments. In Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS 2006), Vancouver, BC, Canada, 4–9 December 2006; pp. 857–864. [Google Scholar]
Bruzzone, L.; Roli, F.; Serpico, S.B. An extension of the Jeffreys-Matusita distance to multiclass cases for feature selection. IEEE Trans. Geosci. Remote Sens. 1995, 33, 1318–1321. [Google Scholar] [CrossRef]
Guorong, X.; Peiqi, C.; Minhui, W. Bhattacharyya distance feature selection. In Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, 25–29 August 1996; Volume 2, pp. 195–199. [Google Scholar]
Sakate, D.M.; Kashid, D.N. Variable selection via penalized minimum φ-divergence estimation in logistic regression. J. Appl. Stat. 2014, 41, 1233–1246. [Google Scholar] [CrossRef]
Hild, K.E.; Erdogmus, D.; Principe, J.C. Blind source separation using Renyi’s mutual information. IEEE Signal Process. Lett. 2001, 8, 174–176. [Google Scholar] [CrossRef]
Mihoko, M.; Eguchi, S. Robust blind source separation by beta divergence. Neural Comput. 2002, 14, 1859–1886. [Google Scholar] [CrossRef] [PubMed]
Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imaging 2011, 30, 475–483. [Google Scholar] [CrossRef] [PubMed]
Hamza, A.B.; Krim, H. Image registration and segmentation by maximizing the Jensen-Rényi divergence. In Proceedings of the 4th International Workshop Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR 2003), Lisbon, Portugal, 7–9 July 2003; pp. 147–163. [Google Scholar]
Liu, G.; Xia, G.; Yang, W.; Xue, N. SAR image segmentation via non-local active contours. In Proceedings of the 2014 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Quebec City, QC, Canada, 13–18 July 2014; pp. 3730–3733. [Google Scholar]
Korzhik, V.; Fedyanin, I. Steganographic applications of the nearest-neighbor approach to Kullback-Leibler divergence estimation. In Proceedings of the 2015 Third International Conference on Digital Information, Networking, and Wireless Communications (DINWC), Moscow, Russia, 3–5 February 2015; pp. 133–138. [Google Scholar]
Basseville, M. Divergence measures for statistical data processing–An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
Cichocki, A.; Amari, S. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
Csiszar, I. Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hungar. 1967, 2, 299–318. [Google Scholar]
Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B Stat. Methodol. 1966, 28, 131–142. [Google Scholar]
Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Rein. Angew. Math. 1909, 136, 210–271. (In German) [Google Scholar]
Bhattacharyya, A. On a measure of divergence between two multinomial populations. Indian J. Stat. 1946, 7, 401–406. [Google Scholar]
Silva, J.F.; Parada, P.A. Shannon entropy convergence results in the countable infinite case. In Proceedings of the 2012 IEEE International Symposium on Information Theory Proceedings (ISIT), Cambridge, MA, USA, 1–6 July 2012; pp. 155–159. [Google Scholar]
Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193. [Google Scholar] [CrossRef]
Valiant, G.; Valiant, P. Estimating the unseen: An n/log (n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 685–694. [Google Scholar]
Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885. [Google Scholar] [CrossRef] [PubMed]
Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Maximum likelihood estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2017, 63, 6774–6798. [Google Scholar] [CrossRef]
Valiant, G.; Valiant, P. The power of linear estimators. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), Palm Springs, CA, USA, 22–25 October 2011; pp. 403–412. [Google Scholar]
Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253. [Google Scholar] [CrossRef]
Paninski, L. Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory 2004, 50, 2200–2203. [Google Scholar] [CrossRef]
Alba-Fernández, M.V.; Jiménez-Gamero, M.D.; Ariza-López, F.J. Minimum Penalized ϕ-Divergence Estimation under Model Misspecification. Entropy 2018, 20, 329. [Google Scholar] [CrossRef]
Ahmed, N.A.; Gokhale, D. Entropy expressions and their estimators for multivariate distributions. IEEE Trans. Inf. Theory 1989, 35, 688–692. [Google Scholar] [CrossRef]
Misra, N.; Singh, H.; Demchuk, E. Estimation of the entropy of a multivariate normal distribution. J. Multivar. Anal. 2005, 92, 324–342. [Google Scholar] [CrossRef]
Gupta, M.; Srivastava, S. Parametric Bayesian estimation of differential entropy and relative entropy. Entropy 2010, 12, 818–843. [Google Scholar] [CrossRef]
Li, S.; Mnatsakanov, R.M.; Andrew, M.E. K-nearest neighbor based consistent entropy estimation for hyperspherical distributions. Entropy 2011, 13, 650–667. [Google Scholar] [CrossRef]
Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Trans. Inf. Theory 2009, 55, 2392–2405. [Google Scholar] [CrossRef]
Darbellay, G.A.; Vajda, I. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Inf. Theory 1999, 45, 1315–1321. [Google Scholar] [CrossRef]
Silva, J.; Narayanan, S.S. Information divergence estimation based on data-dependent partitions. J. Stat. Plan. Inference 2010, 140, 3180–3198. [Google Scholar] [CrossRef]
Le, T.K. Information dependency: Strong consistency of Darbellay–Vajda partition estimators. J. Stat. Plan. Inference 2013, 143, 2089–2100. [Google Scholar] [CrossRef]
Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Trans. Inf. Theory 2005, 51, 3064–3074. [Google Scholar] [CrossRef]
Hero, A.O., III; Ma, B.; Michel, O.; Gorman, J. Applications of entropic spanning graphs. IEEE Signal Process. Mag. 2002, 19, 85–95. [Google Scholar] [CrossRef]
Moon, K.R.; Hero, A.O., III. Ensemble estimation of multivariate f-divergence. In Proceedings of the 2014 IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014; pp. 356–360. [Google Scholar]
Moon, K.R.; Sricharan, K.; Greenewald, K.; Hero, A.O., III. Improving convergence of divergence functional ensemble estimators. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1133–1137. [Google Scholar]
Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861. [Google Scholar] [CrossRef]
Krishnamurthy, A.; Kandasamy, K.; Poczos, B.; Wasserman, L. Nonparametric Estimation of Renyi Divergence and Friends. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 919–927. [Google Scholar]
Singh, S.; Póczos, B. Generalized exponential concentration inequality for Rényi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 21–26 June 2014; pp. 333–341. [Google Scholar]
Singh, S.; Póczos, B. Exponential Concentration of a Density Functional Estimator. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 3032–3040. [Google Scholar]
Kandasamy, K.; Krishnamurthy, A.; Poczos, B.; Wasserman, L.; Robins, J. Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 397–405. [Google Scholar]
Härdle, W. Applied Nonparametric Regression; Cambridge University Press: Cambridge, UK, 1990. [Google Scholar]
Berlinet, A.; Devroye, L.; Györfi, L. Asymptotic normality of L1-error in density estimation. Statistics 1995, 26, 329–343. [Google Scholar] [CrossRef]
Berlinet, A.; Györfi, L.; Dénes, I. Asymptotic normality of relative entropy in multivariate density estimation. Publ. l’Inst. Stat. l’Univ. Paris 1997, 41, 3–27. [Google Scholar]
Bickel, P.J.; Rosenblatt, M. On some global measures of the deviations of density function estimates. Ann. Stat. 1973, 1, 1071–1095. [Google Scholar] [CrossRef]
Sricharan, K.; Wei, D.; Hero, A.O., III. Ensemble estimators for multivariate entropy estimation. IEEE Trans. Inf. Theory 2013, 59, 4374–4388. [Google Scholar] [CrossRef] [PubMed]
Berrett, T.B.; Samworth, R.J.; Yuan, M. Efficient multivariate entropy estimation via k-nearest neighbour distances. arXiv 2017, arXiv:1606.00304. [Google Scholar]
Kozachenko, L.; Leonenko, N.N. Sample estimate of the entropy of a random vector. Probl. Peredachi Inf. 1987, 23, 9–16. [Google Scholar]
Hansen, B.E.; (University of Wisconsin, Madison, WI, USA). Lecture Notes on Nonparametrics. 2009. [Google Scholar]
Budka, M.; Gabrys, B.; Musial, K. On accuracy of PDF divergence estimators and their applicability to representative data sampling. Entropy 2011, 13, 1229–1266. [Google Scholar] [CrossRef]
Efron, B.; Stein, C. The jackknife estimate of variance. Ann. Stat. 1981, 9, 586–596. [Google Scholar] [CrossRef]
Wisler, A.; Moon, K.; Berisha, V. Direct ensemble estimation of density functionals. In Proceedings of the 2018 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
Moon, K.R.; Sricharan, K.; Greenewald, K.; Hero, A.O., III. Nonparametric Ensemble Estimation of Distributional Functionals. arXiv 2016, arXiv:1601.06884v2. [Google Scholar]
Paul, F.; Arkin, Y.; Giladi, A.; Jaitin, D.A.; Kenigsberg, E.; Keren-Shaul, H.; Winter, D.; Lara-Astiaso, D.; Gury, M.; Weiner, A.; et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 2015, 163, 1663–1677. [Google Scholar] [CrossRef] [PubMed]
Kanehisa, M.; Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28, 27–30. [Google Scholar] [CrossRef] [PubMed]
Kanehisa, M.; Sato, Y.; Kawashima, M.; Furumichi, M.; Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2015, 44, D457–D462. [Google Scholar] [CrossRef] [PubMed]
Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2016, 45, D353–D361. [Google Scholar] [CrossRef] [PubMed]
Van Dijk, D.; Sharma, R.; Nainys, J.; Yim, K.; Kathail, P.; Carr, A.J.; Burdsiak, C.; Moon, K.R.; Chaffer, C.; Pattabiraman, D.; et al. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell 2018, 174, 716–729. [Google Scholar] [CrossRef] [PubMed]
Moon, K.R.; Sricharan, K.; Hero, A.O., III. Ensemble Estimation of Distributional Functionals via k-Nearest Neighbors. arXiv 2017, arXiv:1707.03083. [Google Scholar]
Durrett, R. Probability: Theory and Examples; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
Gut, A. Probability: A Graduate Course; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Munkres, J. Topology; Prentice Hall: Englewood Cliffs, NJ, USA, 2000. [Google Scholar]
Evans, L.C. Partial Differential Equations; American Mathematical Society: Providence, RI, USA, 2010. [Google Scholar]
Gilbarg, D.; Trudinger, N.S. Elliptic Partial Differential Equations of Second Order; Springer: Berlin/Heidelberg, Germany, 2001. [Google Scholar]

Figure 1. Heat map showing the predicted bias of the divergence functional plug-in estimator

{\tilde{G}}_{h}

based on Theorem 1 as a function of the dimensions (d) and sample size (N) when

h = N^{\frac{- 1}{d + 1}}

. Note that the phase transition in the bias as the dimensions (d) increase for a fixed sample size (N); the bias remains small only for relatively small values of

d

. The proposed weighted ensemble estimator EnDive eliminates this phase transition when the densities and the function g are sufficiently smooth.

Figure 2. The optimal weights from (6) when

d = 4

,

N = 3100

,

L = 50

, and l are uniformly spaced between 1.5 and 3. The lowest values of l are given the highest weight. Thus, the minimum value of bandwidth parameters

L

should be sufficiently large to render an adequate estimate.

Figure 3. (Left) Log–log plot of MSE of the uniform kernel plug-in (“Kernel”) and the optimally weighted EnDive estimator for various dimensions and sample sizes. (Right) Plot of the true values being estimated compared to the average values of the same estimators with standard error bars. The proposed weighted ensemble estimator approaches the theoretical rate (see Table 1), performed better than the plug-in estimator in terms of MSE and was less biased.

Figure 4. Log–log plot of the average pointwise squared error between the KDE

{\tilde{f}}_{1, h}

and

f_{1}

for various dimensions and sample sizes using the same bandwidth and kernel as the standard plug-in estimators in Figure 3. The KDE and the density were compared at 10,000 points sampled from

f_{1}

.

Figure 5. QQ-plots comparing the quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the Kullback–Leibler (KL) divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. These plots correspond to the same experiments as in Table 2 when N = 100 and N = 1000. The correspondence between quantiles is high for all cases.

Figure 6. Estimated upper (UB) and lower bounds (LB) on the Bayes error rate (BER) based on estimating the HP divergence between two 10-dimensional Gaussian distributions with identity covariance matrices and distances between means of 1 (left) and 3 (right), respectively. Estimates were calculated using EnDive, with error bars indicating the standard deviation from 400 trials. The upper bound was closer, on average, to the true BER when N was small (≈100–300) and the distance between the means was small. The lower bound was closer, on average, in all other cases.

Table 1. Negative log–log slope of the EnDive mean squared error (MSE) as a function of the sample size for various dimensions. The slope was calculated beginning at

N_{s t a r t}

. The negative slope was closer to 1 with

N_{s t a r t} = 10^{2.375}

than for

N_{s t a r t} = 10^{2}

indicating that the asymptotic rate had not yet taken effect at

N_{s t a r t} = 10^{2}

.

Table 1. Negative log–log slope of the EnDive mean squared error (MSE) as a function of the sample size for various dimensions. The slope was calculated beginning at

N_{s t a r t}

. The negative slope was closer to 1 with

N_{s t a r t} = 10^{2.375}

than for

N_{s t a r t} = 10^{2}

indicating that the asymptotic rate had not yet taken effect at

N_{s t a r t} = 10^{2}

.

Estimator	$d = 5$	$d = 10$	$d = 15$
$N_{s t a r t} = 10^{2}$	$0.85$	$0.84$	$0.80$
$N_{s t a r t} = 10^{2.375}$	$0.96$	$0.96$	$0.95$

Table 2. Comparison between quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the KL divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. The parameter

ρ

gives the correlation coefficient between the quantiles, while

β

is the estimated slope between the quantiles. The correspondence between quantiles was very high for all cases.

Table 2. Comparison between quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the KL divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. The parameter

ρ

gives the correlation coefficient between the quantiles, while

β

is the estimated slope between the quantiles. The correspondence between quantiles was very high for all cases.

N	Same		Different
N	$1 - ρ$	$β$	$1 - ρ$	$β$
100	$2.35 \times 10^{- 4}$	$1.014$	$9.97 \times 10^{- 4}$	$0.993$
500	$9.48 \times 10^{- 5}$	$1.007$	$5.06 \times 10^{- 4}$	$0.999$
1000	$8.27 \times 10^{- 5}$	$0.996$	$4.30 \times 10^{- 4}$	$0.988$
5000	$8.59 \times 10^{- 5}$	$0.995$	$4.47 \times 10^{- 4}$	$1.005$

Table 3. Misclassification rate of a quadratic discriminant analysis classifier (QDA) classifier and estimated upper bounds (UB) and lower bounds (LB) of the pairwise BER between mouse bone marrow cell types using the Henze–Penrose divergence applied to different combinations of genes selected from the KEGG pathways associated with the hematopoietic cell lineage. Results are presented as percentages in the form of mean ± standard deviation. Based on these results, erythrocytes are relatively easy to distinguish from the other two cell types using these gene sets.

	Platelets	Erythrocytes	Neutrophils	Macrophages	Random
Eryth. vs. Mono., LB	$2.8 \pm 1.5$	$1.2 \pm 0.6$	$0.6 \pm 0.6$	$8.5 \pm 1.2$	$14.4 \pm 8.4$
Eryth. vs. Mono., UB	$5.3 \pm 2.9$	$2.4 \pm 1.3$	$1.2 \pm 1.3$	$15.5 \pm 1.9$	$23.2 \pm 12.3$
Eryth. vs. Mono., Prob. Error	$0.9$	$0.4$	$1.3$	$3.4$	$7.2 \pm 5.4$
Eryth. vs. Baso., LB	$0.5 \pm 0.6$	$0.05 \pm 0.12$	$0.6 \pm 0.5$	$5.1 \pm 0.9$	$11.9 \pm 5.5$
Eryth. vs. Baso., UB	$1.0 \pm 1.1$	$0.1 \pm 0.2$	$1.1 \pm 0.9$	$9.6 \pm 1.6$	$20.3 \pm 8.8$
Eryth. vs. Baso., Prob. Error	$1.2$	$0.3$	$1.9$	$3.6$	$6.8 \pm 5.0$
Baso. vs. Mono., LB	$31.1 \pm 1.8$	$27.8 \pm 3.1$	$27.1 \pm 2.6$	$31.6 \pm 1.3$	$32.1 \pm 2.6$
Baso. vs. Mono., UB	$42.8 \pm 1.4$	$39.9 \pm 2.8$	$39.4 \pm 2.4$	$43.2 \pm 1.0$	$43.5 \pm 1.2$
Baso. vs. Mono., Prob. Error	$28.8$	$30.9$	$23.9$	$22.4$	$29.7 \pm 5.7$

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Ensemble Estimation of Information Divergence ^†

Abstract

1. Introduction

1.1. Related Work

1.2. Organization and Notation

2. The Divergence Functional Weak Estimator

2.1. The Kernel Density Plug-in Estimator

2.2. Convergence Rates

2.3. Optimal MSE Rate

2.4. Proof Sketches of Theorems 1 and 2

3. Weighted Ensemble Estimation

3.1. Finding the Optimal Weight

3.2. The EnDive Estimator

3.3. Central Limit Theorem

3.4. Uniform Convergence Rates

4. Experimental Results

4.1. Tuning Parameter Selection

4.2. Convergence Rates Validation: Rényi- $α$ Divergence

4.3. Central Limit Theorem Validation: KL Divergence

4.4. Bayes Error Rate Estimation on Single-Cell Data

5. Conclusions

Author Contributions

Funding

Conflicts of Interest

Abbreviations

Appendix A. Bias Assumptions

Appendix B. Modified EnDive

Appendix C. General Results

Appendix D. Proof of Theorem A1 (Boundary Conditions)

Appendix D.1. Single Coordinate Boundary Point

Appendix D.2. Multiple Coordinate Boundary Point

Appendix E. Proof of Theorem A3 (Bias)

Appendix F. Proof of Theorem A4 (Variance)

Appendix G. Proof of Theorem 4 (CLT)

Appendix H. Proof of Theorem 5 (Uniform MSE)

References

Article Metrics

Citations

Article Access Statistics

Ensemble Estimation of Information Divergence †

Abstract

1. Introduction

1.1. Related Work

1.2. Organization and Notation

2. The Divergence Functional Weak Estimator

2.1. The Kernel Density Plug-in Estimator

2.2. Convergence Rates

2.3. Optimal MSE Rate

2.4. Proof Sketches of Theorems 1 and 2

3. Weighted Ensemble Estimation

3.1. Finding the Optimal Weight

3.2. The EnDive Estimator

3.3. Central Limit Theorem

3.4. Uniform Convergence Rates

4. Experimental Results

4.1. Tuning Parameter Selection

4.2. Convergence Rates Validation: Rényi- α Divergence

4.3. Central Limit Theorem Validation: KL Divergence

4.4. Bayes Error Rate Estimation on Single-Cell Data

5. Conclusions

Author Contributions

Funding

Conflicts of Interest

Abbreviations

Appendix A. Bias Assumptions

Appendix B. Modified EnDive

Appendix C. General Results

Appendix D. Proof of Theorem A1 (Boundary Conditions)

Appendix D.1. Single Coordinate Boundary Point

Appendix D.2. Multiple Coordinate Boundary Point

Appendix E. Proof of Theorem A3 (Bias)

Appendix F. Proof of Theorem A4 (Variance)

Appendix G. Proof of Theorem 4 (CLT)

Appendix H. Proof of Theorem 5 (Uniform MSE)

References

Article Metrics

Citations

Article Access Statistics

Ensemble Estimation of Information Divergence ^†

4.2. Convergence Rates Validation: Rényi- $α$ Divergence