Ensemble Estimation of Information Divergence

Recent work has focused on the problem of nonparametric estimation of information divergence functionals between two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at the support set boundary, which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator is derived for general bounded density support sets, where knowledge of the support boundary, and hence boundary correction, is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for tuning parameter selection and the asymptotic distribution of this estimator are provided. Based on the theory, an empirical estimator of Rényi-α divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of mean squared error, especially in high dimensions. The estimator is shown to be robust to the choice of tuning parameters. We present extensive simulation results that verify the theoretical results of our paper. Finally, we apply the proposed estimator to estimate bounds on the Bayes error rate of a cell classification problem.

Despite the many applications of divergences, no nonparametric estimators of these functionals exist that achieve the parametric mean squared error (MSE) convergence rate, are simple to implement, do not require knowledge of the boundary of the densities' support set, and apply to a large set of divergence functionals. In this paper, we present the first information divergence estimator that achieves all of the above. Specifically, we consider the problem of estimating divergence functionals when only a finite population of independent and identically distributed (i.i.d.) samples is available from the two d-dimensional distributions, which are unknown, nonparametric, and smooth. Our contributions are: 1) We propose the first information divergence estimator, referred to as EnDive, that is based on ensemble methods. The ensemble estimator takes a weighted average of an ensemble of weak kernel density plug-in estimators of divergence, where the weights are chosen to improve the MSE convergence rate. This ensemble construction makes EnDive very easy to implement. 2) We prove that the proposed ensemble divergence estimator achieves the optimal parametric MSE rate of O(1/N), where N is the sample size, when the densities are sufficiently smooth. In particular, EnDive achieves these rates without explicitly performing boundary correction, which is required for most other estimators. Furthermore, we show that the convergence rates are uniform.
3) We prove that EnDive obeys a central limit theorem and thus can be used to perform inference tasks on the divergence such as testing that two populations have identical distributions or constructing confidence intervals.

A. Related Work
While several estimators of divergence functionals have been previously defined, the convergence rates are known for only a few of them. Furthermore, the asymptotic distributions of these estimators are unknown for nearly all of them. For example, Póczos and Schneider [10] established weak consistency of a bias-corrected k-nn estimator for Rényi-α and other divergences of a similar form, where k is fixed. Wang et al. [34] provided a k-nn based estimator for the KL divergence. Mutual information and divergence estimators based on plug-in histogram schemes have been proven to be consistent [35]-[38]. Hero et al. [39] provided an estimator for Rényi-α divergence but assumed that one of the densities was known. However, none of these works study the convergence rates or the asymptotic distribution of their estimators.
There has been recent interest in deriving convergence rates for divergence estimators [40]-[45]. The rates are typically derived in terms of a smoothness condition on the densities, such as the Hölder condition [46]:
Definition 1 (Hölder Class): Let X ⊂ R^d be a compact space. For r = (r_1, . . ., r_d), r_i ∈ N, define |r| = Σ_{i=1}^d r_i and the differential operator D^r = ∂^{|r|} / (∂x_1^{r_1} · · · ∂x_d^{r_d}). The Hölder class Σ(s, K_H) of functions on L_2(X) consists of the functions f that satisfy
|D^r f(x) − D^r f(y)| ≤ K_H ‖x − y‖^{s−|r|},
for all x, y ∈ X and for all r s.t. |r| ≤ ⌊s⌋.
From Definition 1, it is clear that if a function f belongs to Σ(s, K_H), then f is continuously differentiable up to order ⌊s⌋. In this work, we show that EnDive achieves the parametric MSE convergence rate of O(1/N) when s ≥ d or s > d/2, depending on the specific form of the divergence functional.
Nguyen et al. [41] proposed a method for estimating f-divergences by estimating the likelihood ratio of the two densities by solving a convex optimization problem and then plugging it into the divergence formulas. For this method the authors prove that the minimax MSE convergence rate is parametric when the likelihood ratio is in the bounded Hölder class Σ(s, K_H) with s ≥ d/2. However, this estimator is restricted to true f-divergences and may not apply to the broader class of divergence functionals that we consider here (as an example, the L_2^2 divergence is not an f-divergence). Additionally, solving the convex problem of [41] has similar computational complexity to that of training a support vector machine (SVM) (between O(N^2) and O(N^3)), which can be demanding when N is large. In contrast, EnDive depends only on simple density plug-in estimates and the solution of an offline convex optimization problem. Thus the most computationally demanding step in our approach is the calculation of the density estimates, which has complexity no greater than O(N^2).
Singh and Póczos [43], [44] provided an estimator for Rényi-α divergences as well as general density functionals that uses a "mirror image" kernel density estimator. They prove a convergence rate of O(1/N) when s ≥ d for each of the densities. However, this method requires several computations at each boundary of the support of the densities, which is difficult to implement as d gets large. Also, this method requires knowledge of the support of the densities, which is unknown in most practical settings. In contrast, while our assumptions require the density support sets to be bounded, knowledge of the support is not required for implementation.
The "linear" and "quadratic" estimators presented by Krishnamurthy et al. [42] estimate divergence functionals that include the form ∫ f_1^α(x) f_2^β(x) dμ(x) for given α and β, where f_1 and f_2 are probability densities. These estimators achieve the parametric rate when s ≥ d/2 and s ≥ d/4 for the linear and quadratic estimators, respectively. However, the latter estimator is computationally infeasible for most functionals, and the former requires numerical integration for some divergence functionals, which can be computationally difficult. Additionally, while a suitable α-β indexed sequence of divergence functionals of this form can be made to converge to the KL divergence, this does not guarantee convergence of the corresponding sequence of divergence estimators in [42], whereas our estimator can be used to directly estimate the KL divergence. Other important f-divergence functionals are also excluded from this form, including some that bound the Bayes error [3], [5], [7]. In contrast, our method applies to a large class of divergence functionals and avoids numerical integration. Finally, Kandasamy et al. [45] propose influence function based estimators of distributional functionals, including divergences, that achieve the parametric rate when s ≥ d/2. While this method can be applied to general functionals, the estimator requires numerical integration for some functionals. Additionally, the estimators in both Kandasamy et al. [45] and Krishnamurthy et al. [42] require an optimal kernel density estimator. This is difficult to construct when the density support is bounded, as it requires knowledge of the densities' support boundary and difficult computations at the boundary, whereas our method does not require knowledge of the support boundary.
Asymptotic normality has been established for certain appropriately normalized divergences between a specific density estimator and the true density [47]-[49]. This differs from our setting where we assume that both densities are unknown. The asymptotic distributions of the estimators in [41]-[44] are currently unknown. Thus it is difficult to use these estimators for hypothesis testing, which is crucial in many scientific applications. Kandasamy et al. [45] prove a central limit theorem for their data-splitting estimator but do not prove similar results for their leave-one-out estimator. We establish a central limit theorem for EnDive, which greatly enhances its applicability in scientific settings.
Our ensemble divergence estimator reduces to an ensemble entropy estimator as a special case when data from only one distribution is considered and the other density is set to a uniform measure. The resulting entropy estimator differs from the ensemble entropy estimator proposed by Sricharan et al. [50] in several important ways. First, the densities' support set must be known for the estimator in [50] to perform explicit boundary correction. In contrast, the EnDive estimator does not require any boundary correction. Showing this requires a significantly different approach in proving the bias and variance rates of the EnDive estimator. Furthermore, the EnDive results apply under more general assumptions on the densities and on the kernel used in the weak estimators. Finally, the central limit theorem applies to the EnDive estimator, while the corresponding result is currently unknown for the estimator in [50].

B. Organization and Notation
The paper is organized as follows. We first derive MSE convergence rates in Section II for a weak divergence estimator, which is a kernel density plug-in divergence estimator. We then generalize the theory of optimally weighted ensemble entropy estimation developed in [50] to obtain the ensemble divergence estimator EnDive from an ensemble of weak estimators in Section III. A central limit theorem and a uniform convergence rate for the ensemble estimator are also presented in Section III. In Section IV, we provide guidelines for selecting the tuning parameters based on experiments and the theory derived in the previous sections. We then perform experiments in Section IV that validate the theory and establish the robustness of the proposed estimators to the tuning parameters.
Bold face type is used for random variables and random vectors.The conditional expectation given a random variable Z is denoted E Z .The variance of a random variable is denoted V and the bias of an estimator is denoted B.

II. THE DIVERGENCE FUNCTIONAL WEAK ESTIMATOR
This paper focuses on estimating functionals of the form
G(f_1, f_2) = ∫ g(f_1(x), f_2(x)) f_2(x) dx, (1)
where g(x, y) is a smooth functional and f_1 and f_2 are smooth d-dimensional probability densities. If g(f_1(x), f_2(x)) has the form g(f_1(x)/f_2(x)), g is convex, and g(1) = 0, then G(f_1, f_2) defines the family of f-divergences. Some common divergences that belong to this family include the KL divergence (g(t) = − ln t) and the total variation distance (g(t) = |t − 1|). In this work, we consider a broader class of functionals than the f-divergences since g is allowed to be very general.
To estimate G(f_1, f_2), we will first define a weak plug-in estimator based on kernel density estimators (KDEs); that is, a simple estimator that converges slowly to the true value G(f_1, f_2) in terms of MSE. We will then derive the bias and variance expressions for this weak estimator as a function of sample size and bandwidth. We then use the resulting bias and variance expressions to derive an ensemble estimator that takes a weighted average of weak estimators with different bandwidths and achieves superior MSE performance.

A. The Kernel Density Plug-in Estimator
We use a kernel density plug-in estimator of the divergence functional in (1) as the weak estimator. Assume that N_1 i.i.d. realizations {Y_1, . . ., Y_{N_1}} are available from f_1 and N_2 i.i.d. realizations {X_1, . . ., X_{N_2}} are available from f_2. Let h_i > 0 be the kernel bandwidth for the density estimator of f_i. For simplicity of presentation, assume that N_1 = N_2 = N and h_1 = h_2 = h. The results for the more general case of differing sample sizes and bandwidths are given in Appendix B. Let K(·) be a kernel function with ∫K(x)dx = 1 and ‖K‖_∞ < ∞, where ‖K‖_∞ is the ℓ_∞ norm of the kernel K. The KDEs for f_1 and f_2 at the points X_j are, respectively,
f̃_{1,h}(X_j) = (1/(N h^d)) Σ_{i=1}^{N} K((X_j − Y_i)/h),
f̃_{2,h}(X_j) = (1/((N − 1) h^d)) Σ_{i=1, i≠j}^{N} K((X_j − X_i)/h),
where the estimate of f_2 is leave-one-out. G(f_1, f_2) is then approximated by the plug-in estimator
G̃_h = (1/N) Σ_{j=1}^{N} g(f̃_{1,h}(X_j), f̃_{2,h}(X_j)). (2)
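As a concrete illustration, the weak estimator above can be sketched in a few lines. This is a minimal sketch, not the paper's reference implementation: it assumes a uniform product kernel on [−1/2, 1/2]^d, a KL-type integrand g(x, y) = −ln(x/y), and a small clipping safeguard (our own addition) to keep the logarithm finite when a density estimate is zero.

```python
import numpy as np

def kde_plugin_divergence(X, Y, h, g):
    """Weak KDE plug-in estimator of G(f1, f2) = E_{f2}[g(f1(X), f2(X))].

    X ~ f2 (N x d) and Y ~ f1 (N x d); f2 is estimated leave-one-out,
    matching the construction of the weak estimator in the text.
    """
    N, d = X.shape

    def kde(points, data, exclude_self=False):
        # Uniform product kernel: K = 1 when every coordinate lies within h/2.
        near = np.all(np.abs(points[:, None, :] - data[None, :, :]) <= h / 2.0,
                      axis=2)
        counts = near.sum(axis=1).astype(float)
        m = data.shape[0]
        if exclude_self:          # each point counts itself once; remove it
            counts -= 1.0
            m -= 1
        return counts / (m * h**d)

    f1_hat = kde(X, Y)                     # estimate f1 at the points X_j
    f2_hat = kde(X, X, exclude_self=True)  # leave-one-out estimate of f2
    return np.mean(g(f1_hat, f2_hat))

# Illustrative integrand g(x, y) = -log(x / y), clipped for numerical safety.
g_kl = lambda x, y: -np.log(np.clip(x, 1e-12, None) / np.clip(y, 1e-12, None))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))   # f2: uniform on the unit square
Y = rng.uniform(0, 1, size=(500, 2))   # f1: the same distribution
est = kde_plugin_divergence(X, Y, h=0.3, g=g_kl)
print(est)   # near zero, since f1 = f2 here
```

Because both densities are identical, the estimate should be close to zero; the residual error illustrates the bias and variance that the ensemble construction of Section III is designed to suppress.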

B. Convergence Rates
For many estimators, MSE convergence rates are typically provided in the form of upper (or sometimes lower) bounds on the bias and the variance. Therefore, only the slowest converging terms (as a function of sample size N) are presented in these cases. However, to apply our generalized ensemble theory to obtain estimators that guarantee the parametric MSE rate, we require explicit expressions for the bias of the weak estimators in terms of the sample size N and the kernel bandwidth h. Thus an upper bound is insufficient for our work. Furthermore, to guarantee the parametric rate, we require explicit expressions for all bias terms that converge to zero slower than O(1/√N). To obtain such expressions, we require multiple assumptions on the densities f_1 and f_2, the functional g, and the kernel K. Similar to [8], [40], [50], the principal assumptions we make are that: 1) f_1, f_2, and g are smooth; 2) f_1 and f_2 have a common bounded support set S; 3) f_1 and f_2 are strictly lower bounded on S. We also assume 4) that the densities' support set is smooth with respect to the kernel K(u). Our full assumptions are:
• (A.0): Assume that the kernel K is symmetric, is a product kernel, and has bounded support in each dimension.
• (A.1): Assume that there exist constants ε_0, ε_∞ such that 0 < ε_0 ≤ f_i(x) ≤ ε_∞ < ∞ for all x ∈ S and i = 1, 2.
• (A.2): Assume that the densities f_i ∈ Σ(s, K_H) in the interior of S with s ≥ 2.
• (A.3): Assume that g has an infinite number of mixed derivatives.
• (A.4): Assume that the mixed derivatives of g(x, y) are strictly upper bounded for ε_0 ≤ x, y ≤ ε_∞.
• (A.5): Assume the following boundary smoothness condition: Let p_x(u) : R^d → R be a polynomial in u of order q ≤ r = ⌊s⌋ whose coefficients are a function of x and are r − q times differentiable. Then assume that
∫_{x∈S} ( (1/h^d) ∫_{u: K(u)>0, x+uh∉S} K(u) p_x(u) du )^t dx = v_t(h),
where v_t(h) admits the expansion
v_t(h) = Σ_{i=1}^{r−q} e_{i,q,t} h^i + o(h^{r−q}),
for some constants e_{i,q,t}.
We focus on finite support kernels for simplicity in the proofs, although it is likely that our results extend to some infinitely supported kernels as well. We assume relatively strong conditions on the smoothness of g in A.3 to enable us to obtain an estimator that achieves good convergence rates without knowledge of the boundary of the support set. While this smoothness condition may seem restrictive, in practice nearly all divergence and entropy functionals of interest satisfy it. Functionals of interest that do not satisfy this assumption (e.g. the total variation distance) typically have at least one point that is not differentiable, which violates the assumptions of all competing estimators [40], [42]-[45], [50]. We also note that to obtain simply an upper bound on the bias of the plug-in estimator, much less restrictive assumptions on the functional g are sufficient.
Assumption A.5 requires the boundary of the density support set to be smooth with respect to the kernel K(u) in the sense that the expectation of the area outside of S with respect to any random variable u with smooth distribution is a smooth function of the bandwidth h. It is not necessary for the boundary of S to have smooth contours with no edges or corners, as this assumption is satisfied by the following case:
Theorem 1: Assumption A.5 is satisfied when S = [−1, 1]^d and when K is the uniform rectangular kernel, that is, K(x) = 1 for all x with ‖x‖_∞ ≤ 1/2.
The proof is given in Appendix C. Given the simple nature of this density support set and kernel, it is likely that other kernels and supports satisfy A.5 as well. This is left for future work. Note that assumption A.0 is trivially satisfied by the uniform rectangular kernel as well.
Densities for which assumptions A.1-A.2 hold include the truncated Gaussian distribution and the Beta distribution on the unit cube. Functions for which assumptions A.3-A.4 hold include g(x, y) = − ln(x/y) and g(x, y) = (x/y)^α. The following theorem on the bias follows under assumptions A.0-A.5:
Theorem 2: For general g, the bias of the plug-in estimator G̃_h is given by
B[G̃_h] = Σ_{j=1}^{⌊s⌋} c_{10,j} h^j + c_{11} (1/(N h^d)) + o(h^s + 1/(N h^d)). (3)
To apply our generalized ensemble theory to the KDE plug-in estimator G̃_h, we require only an upper bound on its variance. The following variance result requires much less strict assumptions than the bias results in Theorem 2:
Theorem 3: Assume that the functional g in (1) is Lipschitz continuous in both of its arguments with Lipschitz constant C_g. Then the variance of the plug-in estimator G̃_h is bounded by
V[G̃_h] ≤ C_v / N,
where the constant C_v depends only on C_g and ‖K‖_∞.
From Theorems 2 and 3, it is clear that we require h → 0 and N h^d → ∞ for G̃_h to be asymptotically unbiased, while the variance of the plug-in estimator depends primarily on the number of samples. Note that the constants depend on the densities f_1 and f_2 and their derivatives, which are often unknown.

C. Optimal MSE Rate
From Theorem 2, the dominating terms in the bias are Θ(h) and Θ(1/(N h^d)). If no attempt is made to correct the bias, the optimal choice of h in terms of minimizing the MSE is h* = Θ(N^{−1/(d+1)}). This results in a dominant bias term of order Θ(N^{−1/(d+1)}). Note that this differs from the standard result for the optimal KDE bandwidth for minimum MSE density estimation, which is Θ(N^{−1/(d+4)}) for a symmetric uniform kernel [51].
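The bandwidth calculation above follows from balancing the two dominant bias terms. The sketch below checks it numerically: setting the derivative of c_1 h + c_2/(N h^d) to zero gives h* = (d c_2 / (c_1 N))^{1/(d+1)} = Θ(N^{−1/(d+1)}). The constants c_1, c_2 and the grid of sample sizes are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Dominant bias terms of the weak estimator: c1*h + c2/(N*h^d).
# Setting the derivative to zero gives h* = (d*c2/(c1*N))**(1/(d+1)).
c1, c2, d = 1.0, 1.0, 4

def dominant_bias(h, N):
    return c1 * h + c2 / (N * h**d)

def h_star(N):
    return (d * c2 / (c1 * N)) ** (1.0 / (d + 1))

# Direct numerical minimization agrees with the closed form.
N = 10_000
res = minimize_scalar(dominant_bias, bounds=(1e-6, 1.0), args=(N,),
                      method="bounded")
print(res.x, h_star(N))   # the two bandwidths agree

# The log-log slope of h*(N) recovers the -1/(d+1) exponent.
Ns = np.array([1e3, 1e4, 1e5, 1e6])
slope = np.polyfit(np.log(Ns), np.log([h_star(n) for n in Ns]), 1)[0]
print(slope)              # approximately -1/(d+1)
```

The recovered slope of −1/(d+1) is exactly the rate quoted in the text; substituting h* back into either bias term gives the dominant bias of order Θ(N^{−1/(d+1)}).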
Figure 1 shows a heatmap of the leading bias term O(h) as a function of d and N when h = N^{−1/(d+1)}. The heatmap indicates that the bias of the plug-in estimator in (2) is small only for relatively small values of d. In the next section, we propose an ensemble estimator that achieves a superior convergence rate regardless of the dimension d, as long as the densities are sufficiently smooth.

D. Proof Sketches of Theorems 2 and 3
To prove the expressions for the bias, the bias is first decomposed into two parts by adding and subtracting g(E_Z[f̃_{1,h}(Z)], E_Z[f̃_{2,h}(Z)]) within the expectation, creating a "bias" term and a "variance" term. Applying a Taylor series expansion to the bias and variance terms results in expressions that depend on powers of B_Z[f̃_{i,h}(Z)] := E_Z[f̃_{i,h}(Z)] − f_i(Z) and ẽ_{i,h}(Z) := f̃_{i,h}(Z) − E_Z[f̃_{i,h}(Z)], respectively. Within the interior of the support, moment bounds can be derived from properties of the KDEs and a Taylor series expansion of the densities. Near the boundary of the support, the boundary smoothness assumption A.5 is required to obtain an expression for the bias in terms of the KDE bandwidth h and the sample size N. The full proof of Thm. 2 is given in Appendix D.
The proof of the variance result takes a different approach. It uses the Efron-Stein inequality [52], which bounds the variance by analyzing the expected squared difference between the plug-in estimator applied to two sets of samples that differ in only one point. This approach provides a bound on the variance under much less strict assumptions on the densities and the functional g than are required for Theorem 2. The full proof of Thm. 3 is given in Appendix E.

III. WEIGHTED ENSEMBLE ESTIMATION
Theorem 2 shows that when the dimension of the data is not small, the bias of the MSE-optimal plug-in estimator G̃_h decreases very slowly as a function of sample size, resulting in large MSE. However, by applying the theory of optimally weighted ensemble estimation, we can modify the minimum MSE estimator by taking a weighted sum of an ensemble of estimators where the weights are chosen to significantly reduce the bias.
We form an ensemble of estimators by choosing different values of h. Choose L = {l_1, . . ., l_L} to be a set of positive real numbers that index the bandwidths h(l). Thus the parameter l indexes over different neighborhood sizes for the KDEs. Define w := {w(l_1), . . ., w(l_L)} and G̃_w := Σ_{l∈L} w(l) G̃_{h(l)}. The key to reducing the MSE is to choose the weight vector w to reduce the lower-order terms in the bias without substantially increasing the variance.

A. Finding the Optimal Weight
The theory of optimally weighted ensemble estimation is a general theory, originally presented by Sricharan et al. [50], that can be applied to estimation problems as long as the bias and variance of the estimator can be expressed in a specific way. We now generalize this theory so that it can be applied to a wider variety of estimation problems. Let L = {l_1, . . ., l_L} be a set of index values and let N be the number of samples available. For an indexed ensemble of estimators {Ê_l}_{l∈L} of a parameter E, the weighted ensemble estimator with weights w = {w(l_1), . . ., w(l_L)} satisfying Σ_{l∈L} w(l) = 1 is defined as
Ê_w = Σ_{l∈L} w(l) Ê_l.
Ê_w is asymptotically unbiased if the estimators {Ê_l}_{l∈L} are asymptotically unbiased. Consider the following conditions on {Ê_l}_{l∈L}:
• C.1 The bias is expressible as
B[Ê_l] = Σ_{i∈J} c_i ψ_i(l) φ_{i,d}(N) + O(1/√N),
where c_i are constants depending on the underlying density and are independent of N and l, J = {i_1, . . ., i_I} is a finite index set with I < L, and ψ_i(l) are basis functions depending only on the parameter l and not on the sample size N.
• C.2 The variance is expressible as
V[Ê_l] = c_v (1/N) + o(1/N).
Theorem 4: Assume conditions C.1 and C.2 hold for an ensemble of estimators {Ê_l}_{l∈L}. Then there exists a weight vector w_0 such that the MSE of the weighted ensemble estimator attains the parametric rate of convergence:
E[(Ê_{w_0} − E)^2] = O(1/N).
The weight vector w_0 is the solution to the following convex optimization problem:
min_w ‖w‖_2
subject to Σ_{l∈L} w(l) = 1,
γ_w(i) := Σ_{l∈L} w(l) ψ_i(l) = 0, i ∈ J. (4)
Proof: From condition C.1, the bias of the weighted estimator is
B[Ê_w] = Σ_{i∈J} c_i γ_w(i) φ_{i,d}(N) + O(√L ‖w‖_2 / √N).
The variance of the weighted estimator is bounded as
V[Ê_w] ≤ L ‖w‖_2^2 (c_v / N).
The optimization problem in (4) zeroes out the lower-order bias terms and limits the ℓ_2 norm of the weight vector w to limit the variance contribution. This results in an MSE rate of O(1/N) when the dimension d is fixed and when L is fixed independently of the sample size N. Furthermore, a solution to (4) is guaranteed to exist as long as L > I and the vectors a_i = [ψ_i(l_1), . . ., ψ_i(l_L)] are linearly independent. This completes our sketch of the proof of Thm. 4.
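The optimization in Theorem 4 has a convenient closed form: with linear equality constraints A w = b, the minimum-norm solution is w_0 = A^T (A A^T)^{-1} b, which the Moore-Penrose pseudoinverse computes directly. The sketch below demonstrates this, using ψ_i(l) = l^i as an illustrative basis (the one EnDive uses later); the grid of l values and the number of zeroed terms are our own example choices.

```python
import numpy as np

def optimal_weights(ells, I):
    """Minimum-norm weights: min ||w||_2 s.t. sum(w) = 1 and
    gamma_w(i) = sum_l w(l) * psi_i(l) = 0 for i = 1..I, with psi_i(l) = l^i.
    """
    L = len(ells)
    # Stack the sum-to-one constraint on top of the bias constraints.
    A = np.vstack([np.ones(L)] +
                  [np.asarray(ells, float) ** i for i in range(1, I + 1)])
    b = np.zeros(I + 1)
    b[0] = 1.0
    # For full-row-rank A, pinv(A) @ b equals A^T (A A^T)^{-1} b.
    return np.linalg.pinv(A) @ b

ells = np.linspace(1.0, 3.0, 10)     # L = 10 estimators in the ensemble
w0 = optimal_weights(ells, I=4)      # zero out the first 4 bias terms

print(w0.sum())                                       # 1.0
print(max(abs(w0 @ ells**i) for i in range(1, 5)))    # ~0
```

Note that some entries of w_0 are negative: the weighted ensemble is an extrapolation, not an average, which is exactly what allows it to cancel the lower-order bias terms.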

B. The EnDive Estimator
To achieve the parametric rate O(1/N) in MSE convergence, it is not necessary that γ_w(i) = 0 for i ∈ J. Solving the following convex optimization problem in place of the optimization problem in Theorem 4 retains the O(1/N) rate:
min_w ε
subject to Σ_{l∈L} w(l) = 1,
|γ_w(i) N^{1/2} φ_{i,d}(N)| ≤ ε, i ∈ J,
‖w‖_2^2 ≤ η, (6)
where the parameter η is chosen to achieve a trade-off between bias and variance. Instead of forcing γ_w(i) = 0, the relaxed optimization problem uses the weights to decrease the bias terms at the rate of O(1/√N).
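The relaxed problem (6) can be handed to a generic constrained solver by treating (w, ε) as the decision variables. The following is a sketch with SLSQP, using ψ_i(l) = l^i and φ_{i,d}(N) = N^{−i/(2d)} as in the EnDive construction; the values of η, N, d, and the grid of l values are illustrative, and the absolute-value constraints are split into two smooth inequalities for the solver's benefit.

```python
import numpy as np
from scipy.optimize import minimize

N, d, eta = 10_000, 3, 3.0
ells = np.linspace(1.0, 3.0, 30)
L = len(ells)

def scaled_bias(w, i):
    # gamma_w(i) * sqrt(N) * phi_{i,d}(N), with psi_i(l) = l^i.
    return (w @ ells**i) * np.sqrt(N) * N ** (-i / (2.0 * d))

# Decision variables x = (w_1..w_L, eps); start from uniform weights
# and an eps large enough to make the starting point feasible.
x0 = np.concatenate([np.full(L, 1.0 / L), [50.0]])
cons = [{"type": "eq",   "fun": lambda x: x[:L].sum() - 1.0},
        {"type": "ineq", "fun": lambda x: eta - x[:L] @ x[:L]}]
for i in range(1, d + 1):
    # |scaled_bias| <= eps, written as two smooth inequalities.
    cons.append({"type": "ineq",
                 "fun": lambda x, i=i: x[L] - scaled_bias(x[:L], i)})
    cons.append({"type": "ineq",
                 "fun": lambda x, i=i: x[L] + scaled_bias(x[:L], i)})

res = minimize(lambda x: x[L], x0, constraints=cons, method="SLSQP")
w, eps = res.x[:L], res.x[L]
print(eps)   # achieved bias level under the norm budget eta
```

Since the problem is solved offline (it involves only N, d, and the l grid, not the data), this step adds negligible cost to the overall estimator.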
We now construct a divergence ensemble estimator from an ensemble of plug-in KDE divergence estimators. Consider first the bias result in (3), where g is general, and assume that s ≥ d. In this case, the bias contains a O(1/(h^d N)) term. To guarantee the parametric MSE rate, any remaining lower-order bias terms in the ensemble estimator must converge no slower than O(1/√N). This is accomplished by setting h(l) = l N^{−1/(2d)}. We therefore obtain an ensemble of plug-in estimators {G̃_{h(l)}}_{l∈L} and a weighted ensemble estimator G̃_w = Σ_{l∈L} w(l) G̃_{h(l)}. The bias of each estimator in the ensemble satisfies condition C.1 with ψ_i(l) = l^i and φ_{i,d}(N) = N^{−i/(2d)} for i = 1, . . ., d. To obtain a uniform bound on the bias with respect to w and L, we also include the function ψ_{d+1}(l) = l^{−d} with corresponding φ_{d+1,d}(N) = N^{−1/2}. The variance also satisfies condition C.2. The optimal weight w_0 is found by using (6) to obtain an optimally weighted plug-in divergence functional estimator G̃_{w_0} with an MSE convergence rate of O(1/N), as long as s ≥ d and L ≥ d. Otherwise, if s < d, we can only guarantee an MSE rate of O(N^{−s/d}). We refer to this estimator as the Ensemble Divergence (EnDive) estimator and denote it as G̃_EnDive.
We note that for some functionals g (including the KL divergence and the Rényi-α divergence integral) we can modify the EnDive estimator to obtain the parametric rate under the less strict assumption that s > d/2. For details on this approach, see Appendix A.

C. Central Limit Theorem
The following theorem shows that an appropriately normalized ensemble estimator G̃_w converges in distribution to a normal random variable under rather general conditions. Thus the same result applies to the EnDive estimator G̃_EnDive. This enables us to perform hypothesis testing on the divergence functional. The proof is based on the Efron-Stein inequality and an application of Slutsky's Theorem (Appendix F).
Theorem 5: Assume that the functional g is Lipschitz in both arguments with Lipschitz constant C_g. Further assume that h(l) = o(1), N → ∞, and N h(l)^d → ∞ for each l ∈ L. Then for fixed L, the asymptotic distribution of the weighted ensemble estimator G̃_w is given by
Pr( (G̃_w − E[G̃_w]) / √(V[G̃_w]) ≤ t ) → Pr(S ≤ t),
where S is a standard normal random variable.
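Theorem 5 is what makes standard large-sample inference available: given a divergence estimate and an estimate of its standard deviation (e.g. from bootstrap resamples), z-tests and confidence intervals follow immediately. The sketch below shows the mechanics; the numbers G_hat and sd_hat are illustrative placeholders, not output of the estimator, and the null G = 0 corresponds to identical distributions for divergences that vanish iff f_1 = f_2.

```python
from scipy.stats import norm

# Placeholder values standing in for an EnDive estimate and its
# estimated standard deviation (e.g. obtained via the bootstrap).
G_hat, sd_hat = 0.12, 0.05

z = G_hat / sd_hat                   # standardized statistic under H0: G = 0
p_value = 2 * norm.sf(abs(z))        # two-sided p-value via the CLT

# 95% confidence interval for the divergence.
half = norm.ppf(0.975) * sd_hat
lo, hi = G_hat - half, G_hat + half
print(p_value, (lo, hi))
```

Here the interval excludes zero and the p-value is below 0.05, so the two populations would be declared different at the usual level; with a smaller estimate or larger standard deviation the same recipe would fail to reject.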

D. Uniform Convergence Rates
Here we show that the optimally weighted ensemble estimator achieves the parametric MSE convergence rate uniformly. Denote by Σ(s, K_H, ε_0, ε_∞) the subset of Σ(s, K_H) consisting of densities bounded between ε_0 and ε_∞.
Theorem 6: Let G̃_EnDive be the EnDive estimator of the functional G(p, q) = ∫ g(p(x), q(x)) q(x) dx, where p and q are d-dimensional probability densities. Additionally, let r = d and assume that s > r. Then
sup_{p, q ∈ Σ(s, K_H, ε_0, ε_∞)} E[(G̃_EnDive − G(p, q))^2] ≤ C/N,
where C is a constant.
The proof decomposes the MSE into the variance plus the square of the bias.The variance is bounded easily by using Theorem 3. To bound the bias, we show that the constants in the bias terms are continuous with respect to the densities p and q under an appropriate norm.We then show that Σ (s, K H , 0 , ∞ ) is compact with respect to this norm and then apply an extreme value theorem.Details are given in Appendix G.

IV. EXPERIMENTAL RESULTS
In this section, we discuss the choice of tuning parameters and validate the EnDive estimator's convergence rates and the central limit theorem.We then use the EnDive estimator to estimate bounds on the Bayes error for a single-cell bone marrow data classification problem.

A. Tuning Parameter Selection
The optimization problem in (6) has parameters η, the number of estimators L, and the set L. The parameter η provides an upper bound on the norm of the weight vector, which in turn bounds the constant in the variance of the ensemble estimator. If all the constants in (3) and an exact expression for the variance of the ensemble estimator were known, then η could be chosen to minimize the MSE. Since the constants are unknown, applying (6) yields an ensemble estimator MSE of O(ε^2/N) + O(Lη^2/N), where the two terms come from the bias and the variance, respectively. Since there is a tradeoff between η and ε, in principle setting η = ε/√L would minimize these terms asymptotically. In practice, we find that for finite sample sizes the variance of the ensemble estimator is less than the upper bound of Lη^2/N, and setting η = ε/√L is therefore overly restrictive. Setting η = ε instead works well in practice.
For fixed L, the set of kernel widths L can in theory be chosen by minimizing ε in (6) over L in addition to w. However, this results in a nonconvex optimization problem since w does not lie in the non-negative orthant. A parameter search may not be practical since ε generally decreases as the size and spread of L increase. However, a decrease in ε does not always correspond to a decrease in MSE for finite samples, as very high and very low values of h(l) can lead to inaccurate density estimates. Thus we provide the following recommendations for L. Denote by l_min the minimum value of l such that f̃_{i,h(l_min)}(X_j) > 0 for all i = 1, 2 and all j, and denote the diameter of the support S by D. To ensure the density estimates are bounded away from zero, we require that min(L) ≥ l_min. The weights in w_0 are generally largest for the smallest values of L (see Fig. 2), so min(L) should also be sufficiently larger than l_min to render an adequate estimate. Similarly, max(L) should be sufficiently smaller than D, as high bandwidth values lead to high bias. The remaining L values are chosen to be equally spaced between min(L) and max(L).
An efficient way to choose l_min and l_max is to select integers k_min and k_max and compute the k_min and k_max nearest neighbor distances of all the data points. The bandwidths h(l_min) and h(l_max) can then be chosen to be the maxima of these corresponding distances. The parameters l_min and l_max can then be computed from the relation h(l) = l N^{−1/(2d)}. This choice ensures that a minimum of k_min points are within the kernel bandwidth for the density estimates at all points and that a maximum of k_max points are within the kernel bandwidth for the density estimates at one of the points.
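The k-NN rule above can be sketched directly with a k-d tree. This is an illustrative implementation of the described recipe, assuming a pooled sample and our own example values for k_min, k_max, and L; the inversion l = h · N^{1/(2d)} follows from the relation h(l) = l N^{−1/(2d)}.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
N, d = 500, 2
Z = rng.uniform(0, 1, size=(N, d))      # pooled sample (illustrative data)

k_min, k_max = 3, 30
tree = cKDTree(Z)
# query returns distances to the k_max+1 nearest points; column 0 is the
# point itself (distance 0), so column k is the k-th nearest neighbor.
dists, _ = tree.query(Z, k=k_max + 1)
h_min = dists[:, k_min].max()           # max k_min-NN distance over points
h_max = dists[:, k_max].max()           # max k_max-NN distance over points

scale = N ** (1.0 / (2 * d))            # invert h(l) = l * N**(-1/(2d))
l_min, l_max = h_min * scale, h_max * scale
L_set = np.linspace(l_min, l_max, 50)   # remaining l values equally spaced
print(l_min, l_max)
```

Taking the maximum over all points guarantees that every kernel window at bandwidth h(l_min) contains at least k_min sample points, which is what keeps the density estimates bounded away from zero.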
As L increases, the similarity of the bandwidth values h(l) and of the basis functions ψ_i(l) increases, resulting in a negligible decrease in the bias. Hence L should be chosen large enough to decrease the bias but small enough so that the h(l) values are sufficiently distinct (typically 30 ≤ L ≤ 60).

B. Convergence Rates Validation: Rényi-α Divergence
To validate our theory, we estimated the Rényi-α divergence integral between two truncated multivariate Gaussian distributions with varying dimension and sample sizes. The densities have means μ_1 = 0.7·1_d and μ_2 = 0.3·1_d and covariance matrices 0.4·I_d, where 1_d is the d-dimensional vector of ones and I_d is the d × d identity matrix. We used α = 0.5 and restricted the Gaussians to the unit cube.
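The truncated Gaussian samples for this experiment can be generated by simple rejection sampling: draw from the untruncated Gaussian and keep only the draws that land in the unit cube. This is a sketch of that setup with an illustrative dimension and sample size; the function name and batch-size heuristic are our own.

```python
import numpy as np

def truncated_gaussian(n, mu, var, d, rng):
    """Draw n samples from N(mu*1_d, var*I_d) restricted to [0, 1]^d."""
    kept = []
    total = 0
    while total < n:
        batch = rng.normal(mu, np.sqrt(var), size=(4 * n, d))
        inside = np.all((batch >= 0.0) & (batch <= 1.0), axis=1)
        kept.append(batch[inside])      # rejection step: keep in-cube draws
        total += inside.sum()
    return np.concatenate(kept)[:n]

rng = np.random.default_rng(2)
d = 3                                   # illustrative dimension
Y = truncated_gaussian(1000, 0.7, 0.4, d, rng)   # samples from f_1
X = truncated_gaussian(1000, 0.3, 0.4, d, rng)   # samples from f_2
print(X.shape, X.min(), X.max())
```

With variance 0.4 the acceptance rate per coordinate is roughly one half, so rejection sampling remains cheap for the moderate dimensions used in the experiments; in much higher dimensions a dedicated truncated-normal sampler would be preferable.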
The left plots in Figure 3 show the MSE (200 trials) of the standard plug-in estimator implemented with a uniform kernel and of the proposed optimally weighted estimator EnDive for various dimensions and sample sizes. The parameter set L was selected based on a range of k-nearest neighbor distances. The bandwidth for the standard plug-in estimator was set to h(l*), where l* was chosen from L to minimize the MSE of the plug-in estimator. For all dimensions and sample sizes, EnDive outperforms the plug-in estimator in terms of MSE. EnDive is also less biased than the plug-in estimator and even has a lower variance at smaller sample sizes (e.g. N = 100). This reflects the strength of ensemble estimators: the weighted sum of a set of relatively poor estimators can result in a very good estimator. Note also that for the larger values of N, the ensemble estimator MSE rates approach the theoretical rate, based on the estimated negative log-log slope of the EnDive MSE as a function of sample size given in Table I.
Our experiments indicated that the proposed ensemble estimator is not sensitive to the tuning parameters.See [53] for more details.

C. Central Limit Theorem Validation: KL Divergence
To verify the central limit theorem of the EnDive estimator, we estimated the KL divergence between two truncated Gaussian densities, again restricted to the unit cube. We conducted two experiments where 1) the densities are different with means μ_1 = 0.7·1_d, μ_2 = 0.3·1_d and covariance matrices σ_i·I_d, σ_1 = 0.1, σ_2 = 0.3; and 2) the densities are identical with means 0.3·1_d and covariance matrices 0.3·I_d. For both experiments, we chose d = 6 and four different sample sizes N. We found that the correspondence between the quantiles of the standard normal distribution and the quantiles of the centered and scaled EnDive estimator is very high under all settings (see Table II), which validates Theorem 5.
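The quantile comparison behind Table II amounts to a QQ check: center and scale the repeated estimates, then compare their empirical quantiles with standard normal quantiles. The sketch below shows the mechanics with simulated stand-in values (in the actual experiment these would be EnDive KL estimates over independent trials).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
# Stand-in for repeated divergence estimates across independent trials;
# the location 0.8 and spread 0.1 are arbitrary placeholders.
estimates = 0.8 + 0.1 * rng.standard_normal(1000)

z = (estimates - estimates.mean()) / estimates.std()  # center and scale
probs = np.linspace(0.05, 0.95, 19)
emp_q = np.quantile(z, probs)                         # empirical quantiles
thr_q = norm.ppf(probs)                               # normal quantiles
corr = np.corrcoef(emp_q, thr_q)[0, 1]
print(corr)   # near 1 when the normal approximation holds
```

A QQ correlation near 1 indicates agreement with the CLT; systematic departures in the tails would show up as curvature in the QQ plot and a visibly lower correlation.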

D. Bayes Error Rate Estimation on Single-Cell Data
Using the EnDive estimator, we estimate bounds on the Bayes error rate (BER) of a classification problem involving MARS-seq single-cell RNA-sequencing (scRNA-seq) data measured from developing mouse bone marrow cells enriched for the myeloid and erythroid lineages [54]. However, we first demonstrate the ability of EnDive to estimate bounds on the BER of a simulated problem. In this simulation, the data are drawn from two classes, where each class distribution is a d = 10 dimensional Gaussian distribution with the identity covariance matrix and a different mean. We considered two cases, in which the distance between the means is 1 and 3, respectively. The BER can be calculated exactly in both cases. We then estimated upper and lower bounds on the BER by estimating the Henze-Penrose (HP) divergence [5], [7]. Figure 4 shows the average estimated upper and lower bounds on the BER with standard error bars for both cases. For all tested sample sizes, the BER is within one standard deviation of the estimated lower bound. The lower bound is also closer on average to the BER for most of the tested sample sizes (lower sample sizes with a smaller distance between means provide the exceptions). Generally, these results indicate that the true BER is relatively close to the estimated lower bound on average.
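The exact BER in this simulation has a simple closed form. Assuming equal class priors (as the symmetric treatment of the two cases suggests), the Bayes rule for two Gaussians with identity covariance thresholds the projection onto μ_1 − μ_2, giving BER = Φ(−δ/2), where δ is the distance between the means, independently of the dimension d:

```python
import math

def gaussian_ber(delta):
    """Bayes error rate for two equally likely Gaussian classes with
    identity covariance and means a distance delta apart:
    BER = Phi(-delta/2), written via the complementary error function."""
    return 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))

print(gaussian_ber(1.0))   # ~0.3085 for means a distance 1 apart
print(gaussian_ber(3.0))   # ~0.0668 for means a distance 3 apart
```

These are the ground-truth values against which the estimated HP-divergence bounds in Figure 4 are compared.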
We now estimate similar bounds on the scRNA-seq classification problem using EnDive. We consider the three most common cell types within the data: erythrocytes, monocytes, and basophils (N = 1095, 559, 300, respectively). We estimate upper and lower bounds on the pairwise BER between these classes using different combinations of genes selected from the KEGG pathways associated with the hematopoietic cell lineage [55]-[57]. Each collection of genes contains 11-14 genes. The upper and lower bounds on the BER are estimated using the Henze-Penrose divergence [5], [7]. The standard deviation of the bounds for the KEGG-based genes is estimated via 1000 bootstrap iterations. The KEGG-based bounds are compared to BER bounds obtained from 1000 random selections of 12 genes. In all cases, we compare the bounds to the performance of a quadratic discriminant analysis (QDA) classifier with 10-fold cross validation. Note that to correct for undersampling in scRNA-seq data, we first impute the undersampled data using MAGIC [58].
All results are given in Table III. From these results, we note that erythrocytes are relatively easy to distinguish from the other two cell types, as the BER lower bounds are within nearly two standard deviations of zero when using genes associated with platelet, erythrocyte, and neutrophil development, as well as a random selection of 12 genes. This is corroborated by the QDA cross-validated results, which are all within two standard deviations of either the upper or lower bound for these gene sets. In contrast, the macrophage-associated genes seem to be less useful for distinguishing erythrocytes than the other gene sets.
We also find that basophils are difficult to distinguish from monocytes using these gene sets. Assuming the relative abundance of each cell type is representative of the population, a trivial upper bound on the BER is 300/(300 + 559) ≈ 0.35, which lies between all of the estimated lower and upper bounds. The QDA results are also relatively high (and may be overfitting the data in some cases, based on the estimated BER bounds), suggesting that different genes should be explored for this classification problem.

V. CONCLUSION
We derived convergence rates for a kernel density plug-in estimator of divergence functionals. We generalized the theory of optimally weighted ensemble estimation and derived an ensemble divergence estimator, EnDive, that achieves the parametric rate when the densities are more than d times differentiable. The estimator we derive applies to general bounded density support sets and does not require knowledge of the support, which is a distinct advantage over other estimators. We also derived the asymptotic distribution of the estimator, provided some guidelines for tuning parameter selection, and validated the convergence rates for the case of empirical estimation of the Rényi-α divergence. We then performed experiments to examine the estimator's robustness to the choice of tuning parameters, validated the central limit theorem for KL divergence estimation, and estimated bounds on the Bayes error rate for a single-cell classification problem.
We note that, based on the proof techniques employed in our work, our weighted ensemble estimators are easily extended beyond divergence estimation to more general distributional functionals, which may be integral functionals of any number of probability distributions. We also show in Appendix A that EnDive can be easily modified to obtain an estimator that achieves the parametric rate when the densities are more than d/2 times differentiable and the functional g has a specific form that includes the Rényi and KL divergences. Future work includes extending this modification to functionals with more general forms. An important divergence of interest in this context is the Henze-Penrose divergence we use to bound the Bayes error. Further future work will focus on extending this work on divergence estimation to k-nn based estimators, where knowledge of the support is again not required. This will improve the computational burden, as k-nn estimators require fewer computations than standard KDEs.

APPENDIX A MODIFIED ENDIVE
If the functional g has a specific form, we can modify the EnDive estimator to obtain an estimator that achieves the parametric rate when s > d/2. Specifically, we have the following:

Algorithm 1 The modified EnDive estimator
Input: η, L positive real numbers ℓ = {l_1, ..., l_L}, samples {Y_1, ..., Y_N} from f_1, samples {X_1, ..., X_N} from f_2, dimension d, function g, kernel K
Output: The modified EnDive estimator G_Mod
1: Solve for w_0 using (6) with φ_{j,q,d}(N) = N^{-(j+q)/(d+1)} and basis functions ψ_{j,q}(l) = l^{j-dq}, l ∈ ℓ, and {i, j} ∈ J defined in (8)
2: for all l ∈ ℓ do
   ⋮
7:   G_{h(l)} ← (1/N) Σ_{i=1}^{N} g( f_{1,h(l)}(X_i), f_{2,h(l)}(X_i) )
8: end for
9: G_Mod ← Σ_{l∈ℓ} w_0(l) G_{h(l)}

Theorem 7: Assume that assumptions A.0-A.5 hold. Furthermore, if g(x, y) has k, l-th order mixed derivatives ∂^{k+l} g(x, y)/∂x^k ∂y^l that depend on x, y only through x^α y^β for some α, β ∈ R, then for any positive integer λ ≥ 2, the bias of G_h is given by (7). Divergence functionals that satisfy the mixed-derivative condition required for (7) include the KL divergence and the Rényi-α divergence. Obtaining similar terms for other divergence functionals requires us to separate the dependence on h of the derivatives of g evaluated at E_Z[f_{i,h}(Z)]. This is left for future work. See Appendix D for details.
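A minimal one-dimensional sketch of the ensemble construction follows (illustrative only: a plain Gaussian KDE replaces the leave-one-out estimator, and a simplified minimum-norm weight solve stands in for the optimization (6); all function names are hypothetical):

```python
import numpy as np

def ensemble_weights(l_values, J):
    """Simplified stand-in for the weight optimization (6): find the
    minimum-norm w with sum_l w(l) = 1 and sum_l w(l) * l**j = 0 for
    j = 1..J, so the corresponding lower-order bias terms cancel in the
    ensemble. With more bandwidth parameters than constraints,
    np.linalg.lstsq returns the minimum-norm solution satisfying the
    constraints exactly."""
    l = np.asarray(l_values, dtype=float)
    A = np.vstack([l ** j for j in range(J + 1)])  # row j: constraint on l**j
    b = np.zeros(J + 1)
    b[0] = 1.0  # the weights must sum to one
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def gaussian_kde(samples, x, h):
    """Plain Gaussian KDE of `samples` evaluated at points x (the paper
    uses a leave-one-out version; this sketch omits that refinement)."""
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def endive_sketch(Y, X, g, l_values, J=2):
    """Weighted ensemble of plug-in estimates: for each l, form the
    plug-in estimate at bandwidth h(l) = l * N**(-1/(d+1)) (d = 1 here)
    and combine with the precomputed weights."""
    N = len(X)
    w = ensemble_weights(l_values, J)
    estimates = []
    for l in l_values:
        h = l * N ** (-1.0 / 2.0)  # d = 1, so N^{-1/(d+1)} = N^{-1/2}
        f1 = gaussian_kde(Y, X, h)   # estimate of f_1 at the X points
        f2 = gaussian_kde(X, X, h)   # estimate of f_2 at the X points
        estimates.append(np.mean(g(f1, f2)))
    return float(np.dot(w, estimates)), w

# KL-divergence functional g(x, y) = (x/y) log(x/y); with f1 = f2 = N(0, 1)
# the true divergence is 0.
rng = np.random.default_rng(1)
g = lambda x, y: (x / y) * np.log(np.maximum(x, 1e-300) / np.maximum(y, 1e-300))
est, w = endive_sketch(rng.normal(size=1000), rng.normal(size=1000), g,
                       np.linspace(1.5, 3.0, 10), J=3)
```

The key design point is that the constraints on w kill the slowly decaying bias terms of the individual plug-in estimates while the sum-to-one constraint preserves the limit.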
As compared to (3), there are many more terms in (7). These terms enable us to modify the EnDive estimator to achieve the parametric MSE convergence rate when s > d/2 for an appropriate choice of bandwidths, whereas the terms in (3) require s ≥ d to achieve the same rate. This is accomplished by letting h(l) decrease at a faster rate, as follows.
The parametric rate can be achieved with G_Mod under less strict assumptions on the smoothness of the densities than those required for G_EnDive. Since δ > 0 can be arbitrary, it is theoretically possible to achieve the parametric rate with the modified estimator as long as s > d/2. This is consistent with the rate achieved by the more complex estimators proposed in [42]. We also note that the central limit theorem applies and that the convergence is uniform, as Theorem 6 applies for s > (d + δ)/2 and s ≥ (d + δ)/2, respectively.
These rate improvements come at a cost in the number of parameters L required to implement the weighted ensemble estimator. If s ≥ (d + δ)/2, then the size of J for G_Mod is on the order of d²/(8δ). This may lead to increased variance of the ensemble estimator, as indicated by (5).
So far, G_Mod can only be applied to functionals g(x, y) with mixed derivatives of the form x^α y^β. Future work is required to extend this estimator to other functionals of interest.

APPENDIX B GENERAL RESULTS
Here we present the generalized forms of Theorems 2 and 3, where the sample sizes and bandwidths of the two datasets are allowed to differ; the KDEs are defined as before, with dataset-specific sample sizes N_1, N_2 and bandwidths h_1, h_2. We also generalize the bias result to the case where the kernel K has order ν, which means that the jth moment of the kernel K_i, defined as ∫ t^j K_i(t) dt, is zero for all j = 1, ..., ν − 1 and i = 1, ..., d, where K_i is the kernel in the ith coordinate. Note that symmetric product kernels have order ν ≥ 2. The following theorem on the bias follows under assumptions A.0-A.5. Theorem 8: For general g, the bias of the plug-in estimator G_{h1,h2} is of the same form as in Theorem 2, with N and h replaced by N_i and h_i. Furthermore, if g(x, y) has k, l-th order mixed derivatives ∂^{k+l} g(x, y)/∂x^k ∂y^l that depend on x, y only through x^α y^β for some α, β ∈ R, then for any positive integer λ ≥ 2, the bias takes the analogous generalized form. Note that the bandwidth and sample size terms do not depend on the order of the kernel ν. Thus, using a higher-order kernel does not provide any benefit in the convergence rates. This lack of improvement is due to the bias of the density estimators at the boundary of the densities' support sets. To obtain better convergence rates using higher-order kernels, boundary correction would be necessary [42], [45]. In contrast, we improve the convergence rates by using a weighted ensemble that does not require boundary correction.
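The kernel-order definition can be checked numerically. For instance, the Epanechnikov kernel (used here purely as an illustration) is symmetric, so its first moment vanishes while its second moment does not, making it a kernel of order ν = 2:

```python
import numpy as np

def kernel_moment(K, j, n=200_000):
    """j-th moment of a kernel supported on [-1, 1], via the midpoint rule."""
    t = -1.0 + (np.arange(n) + 0.5) * (2.0 / n)  # midpoints of [-1, 1]
    return float(np.sum(t ** j * K(t)) * (2.0 / n))

# Epanechnikov kernel: K(t) = 0.75 * (1 - t^2) on [-1, 1].
epan = lambda t: 0.75 * (1.0 - t ** 2) * (np.abs(t) <= 1.0)

m0 = kernel_moment(epan, 0)  # integrates to 1: a valid density
m1 = kernel_moment(epan, 1)  # symmetry: the first moment vanishes
m2 = kernel_moment(epan, 2)  # 0.75 * (2/3 - 2/5) = 0.2, nonzero -> order 2
```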
The variance result requires much less strict assumptions than the bias results. Theorem 9: Assume that the functional g in (1) is Lipschitz continuous in both of its arguments with Lipschitz constant C_g. Then the variance of the plug-in estimator G_{h1,h2} is bounded by the corresponding generalization of the bound in Theorem 3. The proofs of these theorems are in Appendices D and E. Theorems 2 and 3 then follow.
APPENDIX C PROOF OF THEOREM 1 (BOUNDARY CONDITIONS) Consider a uniform rectangular kernel K(x) that satisfies K(x) = 1 for all x such that ||x||_1 ≤ 1/2. Also consider the family of probability densities f with rectangular support S = [−1, 1]^d. We will prove Theorem 1, i.e., that S satisfies the following smoothness condition (A.5): for any polynomial p_x(u) : R^d → R of order q ≤ r = ⌊s⌋ with coefficients that are r − q times differentiable wrt x, the quantity v_t(h) has the expansion v_t(h) = Σ_{i=1}^{r−q} e_{i,q,t} h^i + o(h^{r−q}). Note that the inner integral forces the x's under consideration to be boundary points via the constraint x + uh ∉ S.

A. Single Coordinate Boundary Point
We begin by focusing on points x that are boundary points by virtue of a single coordinate x_i such that x_i + u_i h ∉ S. Without loss of generality, assume that x_i + u_i h > 1. The inner integral in (12) can then be evaluated first wrt all coordinates other than i. Since all of these coordinates lie within the support, the inner integral over these coordinates amounts to integration of the polynomial p_x(u) over a symmetric (d − 1)-dimensional rectangular region |u_j| ≤ 1/2 for all j ≠ i. This yields a function Σ_{m=1}^{q} p̄_m(x) u_i^m, where the coefficients p̄_m(x) are each r − q times differentiable wrt x. With respect to the u_i coordinate, the inner integral has limits from (1 − x_i)/h to 1/2, for some 1 > x_i > 1 − h/2. Consider the p̄_q(x) u_i^q monomial term; the inner integral wrt this term yields a polynomial in (1 − x_i)/h. Raising the right hand side of (13) to the power of t results in an expression of the same polynomial form, where the coefficients p̄_j(x) are r − q times differentiable wrt x. Integrating (14) over all the coordinates of x other than x_i results in an expression of the same form, where again the coefficients p̄_j(x_i) are r − q times differentiable wrt x_i. Note that since the coordinates of x other than x_i are far away from the boundary, the coefficients p̄_j(x_i) are independent of h. To evaluate the integral of (15), consider the (r − q)-term Taylor series expansion of p̄_j(x_i) around x_i = 1. This yields terms involving (1 − x_i)^j h^{−k} for 0 ≤ j ≤ r − q and 0 ≤ k ≤ qt. Combining terms results in the expansion v_t(h) = Σ_{i=1}^{r−q} e_{i,q,t} h^i + o(h^{r−q}).
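To make the bookkeeping concrete, consider an illustrative one-dimensional monomial term (a worked special case, not taken verbatim from the text): the inner integral produces powers of (1 − x)/h, and integrating a typical Taylor term over the boundary strip converts them into pure powers of h.

```latex
% Inner integral over the coordinate that exits the support:
\int_{\frac{1-x}{h}}^{1/2} u^{q}\,du
  \;=\; \frac{1}{q+1}\left(\frac{1}{2^{q+1}}
        - \Big(\frac{1-x}{h}\Big)^{q+1}\right),
  \qquad 1-\tfrac{h}{2} < x < 1 .
% After Taylor-expanding the coefficient \bar p(x) about x = 1, a typical
% term integrated over the boundary strip yields a pure power of h:
\frac{1}{h^{k}}\int_{1-\frac{h}{2}}^{1} (1-x)^{\,j+k}\,dx
  \;=\; \frac{h^{\,j+1}}{(j+k+1)\,2^{\,j+k+1}} ,
% which is how the expansion v_t(h) = \sum_i e_{i,q,t} h^i + o(h^{r-q}) arises.
```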

B. Multiple Coordinate Boundary Point
The case where multiple coordinates of the point x are near the boundary is a straightforward extension of the single boundary point case, so we only sketch the main ideas here. As an example, consider the case where two of the coordinates are near the boundary. Assume for notational ease that they are x_1 and x_2 and that x_1 + u_1 h > 1 and x_2 + u_2 h > 1. The inner integral in (12) can again be evaluated first wrt all coordinates other than 1 and 2. This yields a function Σ_{m,j=1}^{q} p̄_{m,j}(x) u_1^m u_2^j, where the coefficients p̄_{m,j}(x) are each r − q times differentiable wrt x. Integrating this wrt u_1 and u_2 and then raising the result to the power of t yields a double sum similar to (14). Integrating this over all the coordinates of x other than x_1 and x_2 gives a double sum similar to (15). Then a Taylor series expansion of the coefficients and integration over x_1 and x_2 yields the result.

APPENDIX D PROOF OF THEOREM 8 (BIAS)
In this appendix, we prove the bias results in Thm. 8. The bias of the base kernel density plug-in estimator G_{h1,h2} can be decomposed into two terms, where Z is drawn from f_2. The first term is the "variance" term, while the second is the "bias" term. We bound these terms using Taylor series expansions under the assumption that g is infinitely differentiable. The Taylor series expansion of the variance term in (16) will depend on variance-like terms of the KDEs, while the Taylor series expansion of the bias term in (16) will depend on the bias of the KDEs.
The Taylor series expansion of g involves powers B_Z^j[f_{i,hi}(Z)], the bias of f_{i,hi} at the point Z raised to the power of j. This expansion can be used to control the second term (the bias term) in (16). To accomplish this, we require an expression for B_Z[f_{i,hi}(Z)]. To obtain this expression, we consider separately the cases when Z is in the interior of the support S or when Z is near the boundary of the support. A point X ∈ S is defined to be in the interior of S if for all Y ∉ S, K((X − Y)/h_i) = 0. A point X ∈ S is near the boundary of the support if it is not in the interior. Denote the regions in the interior and near the boundary wrt h_i as S_{Ii} and S_{Bi}, respectively. We will need the following.
Lemma 1: Let Z be a realization of the density f_2 independent of f_{i,hi} for i = 1, 2. Assume that the densities f_1 and f_2 belong to Σ(s, L). Then for Z ∈ S_{Ii}, the bias B_Z[f_{i,hi}(Z)] admits the expansion (18). Proof: Obtaining the lower order terms in (18) is a common result in kernel density estimation. However, since we also require the higher order terms, we present the proof here; additionally, some of the results in this proof will be useful later. From the linearity of the KDE, we have that if X is drawn from f_i and is independent of Z, then E_Z[f_{i,hi}(Z)] can be written as an integral of the kernel against f_i, where the last step follows from the substitution t = (x − Z)/h_i. Since the density f_i belongs to Σ(s, L), using multi-index notation we can expand it in a Taylor series with remainder. Combining (19) and (20) gives the result, where the last step follows from the fact that K is symmetric and of order ν.
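The leading bias term in (18) can be checked in closed form for a Gaussian kernel and a standard normal density, since the kernel-smoothed density is again Gaussian (an illustrative sanity check, not part of the proof):

```python
from math import sqrt, pi

def expected_kde_at_zero(h):
    """For a Gaussian kernel and standard normal density, E[f_h(0)] is the
    N(0, 1 + h^2) density at 0 (convolution of two Gaussians), giving a
    closed form to compare against the Taylor expansion of the bias."""
    return 1.0 / sqrt(2.0 * pi * (1.0 + h ** 2))

f0 = 1.0 / sqrt(2.0 * pi)  # true density value at 0
h = 0.05
bias = expected_kde_at_zero(h) - f0
# Leading term of the expansion: (h^2 / 2) * f''(0) * int t^2 K(t) dt,
# with f''(0) = -1/sqrt(2*pi) and unit kernel variance.
leading = -(h ** 2) / (2.0 * sqrt(2.0 * pi))
ratio = bias / leading  # -> 1 as h -> 0
```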
To obtain a similar result for the case when Z is near the boundary of S, we use assumption A.5. Lemma 2: Let γ(x, y) be an arbitrary function satisfying sup_{x,y} |γ(x, y)| < ∞. Let S satisfy the boundary smoothness conditions of Assumption A.5. Assume that the densities f_1 and f_2 belong to Σ(s, L), and let Z be a realization of the density f_2. Proof: For fixed X near the boundary of S, we decompose the expectation into the terms T_{1,i}(X) and T_{2,i}(X). Note that in T_{1,i}(X), we are extending the integral beyond the support of the density f_i. However, by using the same Taylor series expansion method as in the proof of Lemma 1, we always evaluate f_i and its derivatives at the point X, which is within the support of f_i. Thus it does not matter how we define an extension of f_i, since the Taylor series will remain the same. Thus T_{1,i}(X) results in an identical expression to that obtained from (18).
For the T_{2,i}(X) term, we expand it using multi-index notation. Recognizing that the |α|th derivative of f_i is r − |α| times differentiable, we can apply assumption A.5 to obtain the expectation of T_{2,i}(X) wrt X; similar expressions hold for the higher-order terms. Combining these results gives an expansion in powers of h_i, where the constants are functionals of the kernel, γ, and the densities. The expression in (22) can be proved in a similar manner. Applying Lemmas 1 and 2 to (17) gives (23). For the variance term (the first term) in (16), we use the truncated Taylor series expansion of g(f_{1,h1}(Z), f_{2,h2}(Z)) around E_Z[f_{1,h1}(Z)] and E_Z[f_{2,h2}(Z)], where ẽ_{i,hi}(Z) := f_{i,hi}(Z) − E_Z[f_{i,hi}(Z)]. To control the variance term in (16), we thus require expressions for E_Z[ẽ_{i,hi}^j(Z)]. Lemma 3: Let Z be a realization of the density f_2 that is in the interior of the support and is independent of f_{i,hi} for i = 1, 2. Let n(q) be the set of integer divisors of q including 1 but excluding q. Then E_Z[ẽ_{i,hi}^q(Z)] admits an expansion whose coefficients c_{6,i,q,j,m} are functionals of f_1 and f_2.
Proof: Define the random variable V_i(Z) as the centered kernel evaluation; clearly, E_Z[V_i(Z)] = 0. From (19), we have for integer j ≥ 1 an expansion of E_Z[V_i^j(Z)] whose constants c_{3,2,j,m} depend on the density f_2, its derivatives, and the moments of the kernel K^j. Note that since K is symmetric, the odd moments of K^j are zero for Z in the interior of the support; however, all even moments may now be nonzero, since K^j may be nonnegative. By the binomial theorem, we can use these expressions to simplify E_Z[ẽ_{2,h2}^q(Z)]. As an example, let q = 2; then, since the X_i's are independent, the expectation reduces to terms in inverse powers of N_2 h_2^d. Analogous expressions hold for q = 3 and q = 4, and the pattern for q ≥ 2 is as follows: for any integer q, the largest possible factor is q/2, so for given q, the smallest possible exponent on the N_2 h_2^d term is q/2, which increases as q increases. A similar expression holds for E_Z[ẽ_{1,h1}^q(Z)], except that the X_i's are replaced with Y_i, f_2 is replaced with f_1, and N_2 and h_2 are replaced with N_1 and h_1, respectively, all resulting in different constants. Then, since ẽ_{1,h1}(Z) and ẽ_{2,h2}(Z) are conditionally independent given Z, the cross terms factor. Applying Lemma 3 to (24) when taking the conditional expectation given Z in the interior gives an expression of the form (27). Note that the functionals c_{7,i,j,m} and c_{7,j,i,m,n} depend on the derivatives of g and on E_Z[f_{i,hi}(Z)], which depends on h_i. To apply ensemble estimation, we need to separate the dependence on h_i from the constants. If we use ODin1, then it is sufficient to note that, in the interior of the support, E_Z[f_{i,hi}(Z)] equals f_i(Z) plus terms in positive powers of h_i with functional coefficients, and the terms in (27) reduce accordingly. For ODin2, we need the higher order terms. To separate the dependence on h_i from the constants, we need more information about the functional g and its derivatives. Consider the special case where the functional g(x, y) has derivatives of the form x^α y^β with α, β < 0. This includes the important cases of the KL divergence and the Rényi divergence. The generalized binomial theorem states that if \binom{α}{m} := α(α−1)···(α−m+1)/m!
and if q and t are real numbers with |q| > |t|, then for any complex number α, (q + t)^α expands as the convergent series Σ_{m=0}^∞ \binom{α}{m} q^{α−m} t^m. Since the densities are bounded away from zero, for sufficiently small h_i we may apply this expansion with q = f_i(Z). Applying the generalized binomial theorem and Lemma 1 then gives an expansion of E_Z[f_{i,hi}(Z)]^α, and since m is an integer, the exponents of the h_i terms are also integers. Thus (27) yields, in this case, an expansion with integer powers of h_i. As before, the case for Z close to the boundary of the support is more complicated. However, by using a similar technique to the proof of Lemma 2 for Z at the boundary, and combining with the previous results, we find that for general g the expansion (30) holds. If g(x, y) has derivatives of the form x^α y^β with α, β < 0, then we can similarly obtain (31). Combining (23) with either (30) or (31) completes the proof.
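Written out, the generalized binomial step takes the following form (the KDE-bias notation follows Lemma 1; this display is a reconstruction consistent with the surrounding text):

```latex
% Generalized binomial series (|t| < |q|, any real \alpha):
(q+t)^{\alpha} \;=\; \sum_{m=0}^{\infty} \binom{\alpha}{m}\, q^{\alpha-m}\, t^{m},
\qquad \binom{\alpha}{m} := \frac{\alpha(\alpha-1)\cdots(\alpha-m+1)}{m!} .
% Applied with q = f_i(Z) and t = B_Z[\tilde f_{i,h_i}(Z)], the KDE bias,
% itself a power series in h_i by Lemma 1:
\mathbb{E}_Z\!\big[\tilde f_{i,h_i}(Z)\big]^{\alpha}
  \;=\; \sum_{m=0}^{\infty} \binom{\alpha}{m}\, f_i(Z)^{\alpha-m}\,
        \Big(B_Z\big[\tilde f_{i,h_i}(Z)\big]\Big)^{m},
% so every term contributes integer powers of h_i with coefficients
% independent of h_i, as required for ensemble estimation.
```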
APPENDIX F PROOF OF THEOREM 5 (CENTRAL LIMIT THEOREM)
Consider the decomposition of √N_2 (G_{h1,h2} − E[G_{h1,h2}]) into two terms. Note that by the standard central limit theorem [59], the second term converges in distribution to a Gaussian random variable.
If the first term converges in probability to a constant (specifically, 0), then we can use Slutsky's theorem [60] to find the asymptotic distribution. So we now focus on the first term, which we denote as V_{N2}.
For the third term in (38), there are M_2 terms where i = j, and we have from Appendix E (see (36)) that these terms are O(1/M_2). There are M_2² − M_2 terms where i ≠ j. In this case, we can make four substitutions of the form u_j = (X_j − X_1)/h_2; then, since h_2^d = o(1), the resulting terms vanish, leaving differences of the form g(f_{1,h1}(X_j), f_{2,h2}(X_j)) − g(f'_{1,h1}(X_j), f_{2,h2}(X_j)), where f'_{1,h1} denotes the KDE with one sample replaced.
Using a similar argument to that used to obtain (39), applying the Efron-Stein inequality gives a bound on the variance of V_{N2}. Thus, by Chebyshev's inequality, V_{N2} converges to zero in probability. By Slutsky's theorem, √N_2 (G_{h1,h2} − E[G_{h1,h2}]) converges in distribution to a zero-mean Gaussian random variable with variance V[E_X[g(f_{1,h1}(X), f_{2,h2}(X))]], where X is drawn from f_2.
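For reference, the Efron-Stein inequality used in this step states:

```latex
% Efron-Stein inequality: X_1, \dots, X_n independent, X_i' an independent
% copy of X_i, and f a square-integrable function of the sample:
\operatorname{Var}\big(f(X_1,\dots,X_n)\big)
  \;\le\; \frac{1}{2}\sum_{i=1}^{n}
  \mathbb{E}\Big[\big(f(X_1,\dots,X_i,\dots,X_n)
                 - f(X_1,\dots,X_i',\dots,X_n)\big)^{2}\Big].
```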
For the weighted ensemble estimator, we wish to know the asymptotic distribution of √N_2 (G_w − E[G_w]), where G_w = Σ_{l∈ℓ} w(l) G_{h(l)}. The corresponding decomposition has second term (1/√N_2) Σ_j Σ_{l∈ℓ} w(l) ( g(f_{1,h(l)}(X_j), f_{2,h(l)}(X_j)) − E_{X_j}[g(f_{1,h(l)}(X_j), f_{2,h(l)}(X_j))] ), which again converges in distribution to a Gaussian random variable by the central limit theorem. The mean and variance are, respectively, zero and V[Σ_{l∈ℓ} w(l) E_X[g(f_{1,h(l)}(X), f_{2,h(l)}(X))]]. The first term is equal to Σ_{l∈ℓ} w(l) o_P(1) = o_P(1), where o_P(1) denotes convergence to zero in probability. In the last step, we used the fact that if two random variables converge in probability to constants, then their linear combination converges in probability to the linear combination of the constants. Combining this result with Slutsky's theorem completes the proof.

APPENDIX G PROOF OF THEOREM 6 (UNIFORM MSE)
Since the MSE is equal to the square of the bias plus the variance, we can bound the MSE of G_{w0} by bounding the squared bias and Var[G_{w0}] separately.
From the assumptions (Lipschitz continuity of g, a bounded kernel, and weights calculated from the relaxed optimization problem), we have that sup_{p,q∈Σ(s,K_H,ε_0,ε_∞)} Var[G_{w0}] is bounded by the variance bound of Theorem 3, where the last step follows from the fact that all of the terms in that bound are independent of p and q.
For the bias, recall that if g is infinitely differentiable and if the optimal weight w_0 is calculated using the relaxed convex optimization problem, then the bias admits the expansion in (40). We use a topological argument to bound the supremum of this term. We will use the Extreme Value Theorem [61]. Theorem 10 (Extreme Value Theorem): Let f : X → R be continuous. If X is compact, then there exist points c, d ∈ X s.t. f(c) ≤ f(x) ≤ f(d) for every x ∈ X.
By this theorem, f achieves its minimum and maximum on X. Our approach is to first show that the functionals c_i(p, q) are continuous wrt p and q in an appropriate norm. We will then show that the space Σ(s, K_H, ε_0, ε_∞) is compact wrt this norm. The Extreme Value Theorem can then be applied to bound the supremum of (40).
We first define the norm. Let α = s − r > 0. We use the standard norm on the space Σ(s, K_H) [62], and it is sufficient to show that each c_i(p, q) is continuous wrt this norm. Let ε > 0 and max(||p − p_0||_{C^r}, ||q − q_0||_{C^r}) < δ, where δ > 0 will be chosen later. By applying the triangle inequality for integration and adding and subtracting terms, the difference c_i(p, q) − c_i(p_0, q_0) can be bounded by a sum of integrals, each involving a mixed derivative ∂^{i+j} g(t_1, t_2)/∂t_1^i ∂t_2^j evaluated at t_1 = p(x), t_2 = q(x), multiplied by factors such as D^β p(x) D^γ q(x) (q(x) − q_0(x)) or D^β p(x) q_0(x) (D^γ q(x) − D^γ q_0(x)). By Assumption A.4, the absolute value of the mixed derivatives of g is bounded on the range defined for p and q by some constant C_{i,j}. Also, q_0(x) ≤ ε_∞. Furthermore, since D^γ q_0 and D^β p are continuous, and since S ⊂ R^d is compact, the absolute values of the derivatives D^γ q_0 and D^β p are also bounded. Let δ_0 > 0. Then, since the mixed derivatives of g are continuous on the interval [ε_0, ε_∞], they are uniformly continuous. Therefore, we can choose δ small enough s.t. each of these terms is smaller than ε.
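The standard norm on the Hölder class Σ(s, K_H) referred to above is the following (a standard form from the literature; the notation here may differ slightly from [62]):

```latex
% Holder-class norm with r = \lfloor s \rfloor and \alpha = s - r:
\|f\|_{\Sigma(s)}
  \;=\; \max_{|\beta| \le r}\, \sup_{x}\, \big|D^{\beta} f(x)\big|
  \;+\; \max_{|\beta| = r}\, \sup_{x \ne y}
        \frac{\big|D^{\beta} f(x) - D^{\beta} f(y)\big|}{\|x-y\|^{\alpha}} .
```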
Given that each c_i(p, q) is continuous, Σ_{i∈J} c_i(p, q)² is also continuous wrt p and q.
We now argue that Σ(s, K_H) is compact. First, a set is relatively compact if its closure is compact. By the Arzelà-Ascoli theorem [63], the space Σ(s, K_H) is relatively compact in the topology induced by the ||·||_{Σ(t,K_H)} norm for any t < s. We choose t = r. It can then be shown that under the ||·||_{Σ(r,K_H)} norm, Σ(s, K_H) is complete [62]. Since Σ(s, K_H) is contained in a metric space, it is also closed and therefore equal to its closure. Thus Σ(s, K_H) is compact. Then, since Σ(s, K_H, ε_0, ε_∞) is closed in Σ(s, K_H), it is also compact. Therefore, since for each p, q ∈ Σ(s, K_H, ε_0, ε_∞) we have Σ_{i∈J} c_i(p, q)² < ∞, the Extreme Value Theorem gives sup_{p,q∈Σ(s,K_H,ε_0,ε_∞)} Bias(G_{w0})² < ∞, where we use the fact that J is finite (see Sections III-B or A for the set J).

Figure 1. Heat map of the predicted bias of the divergence functional plug-in estimator based on Theorem 2, as a function of dimension and sample size when h = N^{−1/(d+1)}. Note the phase transition in the bias as the dimension d increases for fixed sample size N: the bias remains small only for relatively small values of d. The proposed weighted ensemble estimator removes this phase transition when the densities are sufficiently smooth.

Consider the following conditions on {Ê_l}_{l∈ℓ}: • C.1: The estimators {Ê_l}_{l∈ℓ} are asymptotically unbiased, and the bias is expressible as

Figure 2. The optimal weights from (6) when d = 4, N = 3100, L = 50, and ℓ is uniformly spaced between 1.5 and 3. The lowest values of l are given the highest weight. Thus, the minimum value of the bandwidth parameters ℓ should be sufficiently large to render an adequate estimate.

Figure 3. (Left) Log-log plot of the MSE of the uniform kernel plug-in estimator ("Kernel") and the optimally weighted EnDive estimator for various dimensions and sample sizes. (Right) Plot of the average value of the same estimators, with standard error bars, compared to the true values being estimated. The proposed weighted ensemble estimator approaches the theoretical rate (see Table I), performs better than the plug-in estimator in terms of MSE, and is less biased.
Table I
NEGATIVE LOG-LOG SLOPE OF THE MSE AS A FUNCTION OF SAMPLE SIZE FOR VARIOUS DIMENSIONS. THE SLOPE IS CALCULATED BEGINNING AT N_start. THE NEGATIVE SLOPE IS CLOSER TO 1 WITH N_start = 10^2.375 THAN FOR N_start = 10^2, INDICATING THAT THE ASYMPTOTIC RATE HAS NOT YET TAKEN EFFECT AT N_start = 10^2.

Figure 4. Estimated upper (UB) and lower (LB) bounds on the BER based on estimating the HP divergence between two 10-dimensional Gaussian distributions with identity covariance matrix and a distance between the means of 1 (left) and 3 (right), respectively. Estimates are calculated using EnDive, with error bars indicating the standard deviation from 400 trials. The upper bound is closer on average to the true BER when N is small (≈ 100-300) and the distance between the means is small. The lower bound is closer on average in all other cases.

Table II
COMPARISON BETWEEN QUANTILES OF A STANDARD NORMAL RANDOM VARIABLE AND THE QUANTILES OF THE CENTERED AND SCALED ENDIVE ESTIMATOR APPLIED TO THE KL DIVERGENCE WHEN THE DISTRIBUTIONS ARE THE SAME AND DIFFERENT. QUANTILES WERE COMPUTED FROM 10,000 TRIALS. THE PARAMETER ρ GIVES THE CORRELATION COEFFICIENT BETWEEN THE QUANTILES, WHILE β IS THE ESTIMATED SLOPE BETWEEN THE QUANTILES. THE CORRESPONDENCE BETWEEN QUANTILES IS VERY HIGH FOR ALL CASES.
Table III
MISCLASSIFICATION RATE OF A QDA CLASSIFIER AND ESTIMATED UPPER BOUNDS (UB) AND LOWER BOUNDS (LB) ON THE PAIRWISE BER BETWEEN MOUSE BONE MARROW CELL TYPES, USING THE HENZE-PENROSE DIVERGENCE APPLIED TO DIFFERENT COMBINATIONS OF GENES SELECTED FROM THE KEGG PATHWAYS ASSOCIATED WITH THE HEMATOPOIETIC CELL LINEAGE. RESULTS ARE PRESENTED AS PERCENTAGES IN THE FORM MEAN±STANDARD DEVIATION. BASED ON THESE RESULTS, ERYTHROCYTES ARE RELATIVELY EASY TO DISTINGUISH FROM THE OTHER TWO CELL TYPES USING THESE GENE SETS.