Entropy 2018, 20(8), 560; doi:10.3390/e20080560
Ensemble Estimation of Information Divergence †
Genetics Department and Applied Math Program, Yale University, New Haven, CT 06520, USA
Intuit Inc., Mountain View, CA 94043, USA
IBM Research, Cambridge, MA 02142, USA
Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109, USA
Correspondence: [email protected]; Tel.: +1-734-764-0564
This paper is an extended version of our paper published in the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1133–1137.
Current address: Department of Mathematics and Statistics, Utah State University, Logan, UT 84322, USA; [email protected].
Received: 29 June 2018 / Accepted: 26 July 2018 / Published: 27 July 2018
Recent work has focused on the problem of nonparametric estimation of information divergence functionals between two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at the support set boundary, which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived, where knowledge of the support boundary, and therefore boundary correction, is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for the tuning parameter selection and the asymptotic distribution of this estimator are provided. Based on the theory, an empirical estimator of Rényi-α divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of mean squared error, especially in high dimensions. The estimator is shown to be robust to the choice of tuning parameters. We show extensive simulation results that verify the theoretical results of our paper. Finally, we apply the proposed estimator to estimate the bounds on the Bayes error rate of a cell classification problem.
Keywords: divergence; differential entropy; nonparametric estimation; central limit theorem; convergence rates; Bayes error rate
Information divergences are integral functionals of two probability distributions and have many applications in the fields of information theory, statistics, signal processing, and machine learning. Some applications of divergences include estimating the decay rates of error probabilities , estimating bounds on the Bayes error [2,3,4,5,6,7,8] or the minimax error  for a classification problem, extending machine learning algorithms to distributional features [10,11,12,13], testing the hypothesis that two sets of samples come from the same probability distribution , clustering [15,16,17], feature selection and classification [18,19,20], blind source separation [21,22], image segmentation [23,24,25], and steganography . For many more applications of divergence measures, see reference . There are many information divergence families including Alpha- and Beta-divergences  as well as f-divergences [29,30]. In particular, the f-divergence family includes the well-known Kullback–Leibler (KL) divergence , the Rényi-α divergence integral , the Hellinger–Bhattacharyya distance [33,34], the Chernoff-α divergence , the total variation distance, and the Henze–Penrose divergence .
Despite the many applications of divergences between continuous random variables, there are no nonparametric estimators of these functionals that achieve the parametric mean squared error (MSE) convergence rate, are simple to implement, do not require knowledge of the boundary of the density support set, and apply to a large set of divergence functionals. In this paper, we present the first information divergence estimator that achieves all of the above. Specifically, we address the problem of estimating divergence functionals when only a finite population of independent and identically distributed (i.i.d.) samples is available from the two d-dimensional distributions that are unknown, nonparametric, and smooth. Our contributions are as follows:
- We propose the first information divergence estimator, referred to as EnDive, that is based on ensemble methods. The ensemble estimator takes a weighted average of an ensemble of weak kernel density plug-in estimators of divergence where the weights are chosen to improve the MSE convergence rate. This ensemble construction makes it very easy to implement EnDive.
- We prove that the proposed ensemble divergence estimator achieves the optimal parametric MSE rate of $O(1/N)$, where N is the sample size, when the densities are sufficiently smooth. In particular, EnDive achieves these rates without explicitly performing boundary correction, which is required for most other estimators. Furthermore, we show that the convergence rates are uniform.
- We prove that EnDive obeys a central limit theorem and thus, can be used to perform inference tasks on the divergence such as testing that two populations have identical distributions or constructing confidence intervals.
1.1. Related Work
Much work has focused on the problem of estimating the entropy and the information divergence of discrete random variables [1,29,35,36,37,38,39,40,41,42,43]. However, the estimation problem for discrete random variables differs significantly from the continuous case and thus employs different tools for both estimation and analysis.
One approach to estimating the differential entropy and information divergence of continuous random variables is to assume a parametric model for the underlying probability distributions [44,45,46]. However, these methods perform poorly when the parametric model does not fit the data well. Unfortunately, the structure of the underlying data distribution is unknown for many applications, and thus the chance for model misspecification is high. Thus, in many of these applications, parametric methods are insufficient, and nonparametric estimators must be used.
While several nonparametric estimators of divergence functionals between continuous random variables have been previously defined, the convergence rates are known for only a few of them. Furthermore, the asymptotic distributions of these estimators are unknown for nearly all of them. For example, Póczos and Schneider  established weak consistency for a bias-corrected k-nearest neighbor (k-nn) estimator for Rényi-α and other divergences of a similar form where k was fixed. Li et al.  examined k-nn estimators of entropy and the KL divergence using hyperspherical data. Wang et al.  provided a k-nn based estimator for KL divergence. Plug-in histogram estimators of mutual information and divergence have been proven to be consistent [49,50,51,52]. Hero et al.  provided a consistent estimator for Rényi-α divergence when one of the densities is known. However, none of these works studied the convergence rates or the asymptotic distribution of their estimators.
There has been recent interest in deriving convergence rates for divergence estimators for continuous data [54,55,56,57,58,59,60]. The rates are typically derived in terms of a smoothness condition on the densities, such as the Hölder condition :
Definition 1 (Hölder Class).
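Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact space. For $r = (r_1, \ldots, r_d)$, $r_i \in \mathbb{N}$, define $|r| = \sum_{i=1}^{d} r_i$ and $D^r = \frac{\partial^{|r|}}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}}$. The Hölder class $\Sigma(s, K_H)$ of functions on $L_2(\mathcal{X})$ consists of the functions f that satisfy
$$\left| D^r f(x) - D^r f(y) \right| \le K_H \left\| x - y \right\|^{\min(s - |r|,\, 1)}$$
for all $x, y \in \mathcal{X}$ and for all r s.t. $|r| \le \lfloor s \rfloor$.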
From Definition 1, it is clear that if a function (f) belongs to , then f is continuously differentiable up to order . In this work, we show that EnDive achieves a parametric MSE convergence rate of when and , depending on the specific form of the divergence function.
Nguyen et al.  proposed an f-divergence estimator that estimates the likelihood ratio of the two densities by solving a convex optimization problem and then plugging it into the divergence formulas. The authors proved that the minimax MSE convergence rate is parametric when the likelihood ratio is a member of the bounded Hölder class with . However, this estimator is restricted to true f-divergences and may not apply to the broader class of divergence functionals that we consider here (as an example, the divergence is not an f-divergence). Additionally, solving the convex problem of  has similar computational complexity to that of training a support vector machine (SVM) (between and ), which can be demanding when N is large. In contrast, the EnDive estimator that we propose requires only the construction of simple density plug-in estimates and the solution of an offline convex optimization problem. Therefore, the most computationally demanding step in the EnDive estimator is the calculation of the density estimates, which has a computational complexity no greater than .
Singh and Póczos [58,59] provided an estimator for Rényi- divergences as well as general density functionals that use a “mirror image” kernel density estimator. They proved that these estimators obtain an MSE convergence rate of when for each of the densities. However their approach requires several computations at each boundary of the support of the densities which is difficult to implement as d gets large. Also, this computation requires knowledge of the support (specifically, the boundaries) of the densities which is unknown in most practical settings. In contrast, while our assumptions require the density support sets to be bounded and the boundaries to be smooth, knowledge of the support is not required to implement EnDive.
The “linear” and “quadratic” estimators presented by Krishnamurthy et al.  estimate divergence functionals that include the form for given and where and are probability densities. These estimators achieve the parametric rate when and for the linear and quadratic estimators, respectively. However, the latter estimator is computationally infeasible for most functionals, and the former requires numerical integration for some divergence functionals which can be computationally difficult. Additionally, while a suitable - indexed sequence of divergence functionals of this form can be constructed that converge to the KL divergence, this does not guarantee convergence of the corresponding sequence of divergence estimators, as shown in reference . In contrast, EnDive can be used to estimate the KL divergence directly. Other important f-divergence functionals are also excluded from this form including some that bound the Bayes error [2,4,6]. In contrast, our method applies to a large class of divergence functionals and avoids numerical integration.
Finally, Kandasamy et al.  proposed influence function-based estimators of distributional functionals including divergences that achieve the parametric rate when . While this method can be applied to general functionals, the estimator requires numerical integration for some functionals. Additionally, the estimators in both Kandasamy et al.  and Krishnamurthy et al.  require an optimal kernel density estimator. This is difficult to construct when the density support is bounded, as it requires difficult computations at the density support set boundary and, therefore, knowledge of the density support set. In contrast, EnDive does not require knowledge of the support boundary.
In addition to the MSE convergence rates, the asymptotic distribution of divergence estimators is of interest. Asymptotic normality has been established for certain divergences between a specific density estimator and the true density [62,63,64]. This differs from the problem we consider where we assume that both densities are unknown. The asymptotic distributions of the estimators in references [56,57,58,59] are currently unknown. Thus, it is difficult to use these estimators for hypothesis testing which is crucial in many scientific applications. Kandasamy et al.  derived the asymptotic distribution of their data-splitting estimator but did not prove similar results for their leave-one-out estimator. We establish a central limit theorem for EnDive which greatly enhances its applicability in scientific settings.
Our ensemble divergence estimator reduces to an ensemble entropy estimator as a special case when data from only one distribution is considered and the other density is set to a uniform measure (see reference  for more on the relationship between entropy and information divergence). The resultant entropy estimator differs from the ensemble entropy estimator proposed by Sricharan et al.  in several important ways. First, the density support set must be known for the estimator in reference  to perform the explicit boundary correction. In contrast, the EnDive estimator does not require any boundary correction. To show this requires a significantly different approach to prove the bias and variance rates of the EnDive estimator. Furthermore, the EnDive results apply under more general assumptions for the densities and the kernel used in the weak estimators. Finally, the central limit theorem applies to the EnDive estimator which is currently unknown for the estimator in reference .
We also note that Berrett et al.  proposed a modification of the Kozachenko and Leonenko estimator of entropy  that takes a weighted ensemble estimation approach. While their results require stronger assumptions for the smoothness of the densities than ours do, they did obtain the asymptotic distribution of their weighted estimator and they also showed that the asymptotic variance of the estimator is not increased by taking a weighted average. This latter point is an important selling point of the ensemble framework—we can improve the asymptotic bias of an estimator without increasing the asymptotic variance.
1.2. Organization and Notation
The paper is organized as follows. We first derive the MSE convergence rates in Section 2 for a weak divergence estimator, which is a kernel density plug-in divergence estimator. We then generalize the theory of optimally weighted ensemble entropy estimation developed in reference  to obtain the ensemble divergence estimator EnDive from an ensemble of weak estimators in Section 3. A central limit theorem and uniform convergence rate for the ensemble estimator are also presented in Section 3. In Section 4, we provide guidelines for selecting the tuning parameters based on experiments and the theory derived in the previous sections. We then perform experiments in Section 4 that validate the theory and establish the robustness of the proposed estimators to the tuning parameters.
Bold face type is used for random variables and random vectors. The conditional expectation given a random variable is denoted as . The variance of a random variable is denoted as , and the bias of an estimator is denoted as .
2. The Divergence Functional Weak Estimator
This paper focuses on estimating functionals of the form
$$G(f_1, f_2) = \int g\left(f_1(x), f_2(x)\right) f_2(x)\,dx, \tag{1}$$
where $g(x, y)$ is a smooth functional, and $f_1$ and $f_2$ are smooth d-dimensional probability densities. If g is convex, $g(f_1(x), f_2(x)) = g(f_1(x)/f_2(x))$, and $g(1) = 0$, then $G(f_1, f_2)$ defines the family of f-divergences. Some common divergences that belong to this family include the KL divergence ($g(t) = -\ln t$) and the total variation distance ($g(t) = |t - 1|$). In this work, we consider a broader class of functionals than the f-divergences, since g is allowed to be very general.
To estimate , we first define a weak plug-in estimator based on kernel density estimators (KDEs), that is, a simple estimator that converges slowly to the true value in terms of MSE. We then derive the bias and variance expressions for this weak estimator as a function of sample size and bandwidth. We then use the resulting bias and variance expressions to derive an ensemble estimator that takes a weighted average of weak estimators with different bandwidths and achieves superior MSE performance.
2.1. The Kernel Density Plug-in Estimator
We use a kernel density plug-in estimator of the divergence functional in (1) as the weak estimator. Assume that i.i.d. realizations are available from and i.i.d. realizations are available from . Let be the kernel bandwidth for the density estimator of . For simplicity of presentation, assume that and . The results for the more general case of differing sample sizes and bandwidths are given in Appendix C. Let be a kernel function with and where is the norm of the kernel (K). The KDEs for and are, respectively,where . is then approximated as
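A minimal Python sketch of such a plug-in estimator follows. It is an illustration under stated assumptions, not the authors' implementation: it assumes the estimate is the average of g evaluated at the two estimated densities over the samples drawn from the second distribution, it uses a uniform (cube) kernel, and it leaves each point out of its own density estimate, as the abstract describes. The names `uniform_kde`, `plugin_divergence`, and `kl_g` are hypothetical.

```python
import math


def uniform_kde(x, samples, h, skip=None):
    """Uniform-kernel KDE at x: fraction of samples inside the cube of side h
    centred at x, divided by the cube volume h**d. `skip` leaves one sample out."""
    d = len(x)
    count = 0
    n = 0
    for i, s in enumerate(samples):
        if i == skip:
            continue
        n += 1
        if all(abs(a - b) <= h / 2 for a, b in zip(x, s)):
            count += 1
    return count / (n * h ** d)


def plugin_divergence(g, ys, xs, h):
    """Average g(f1_hat(X_i), f2_hat(X_i)) over the samples X_i in xs.
    ys are samples from the first density, xs from the second (assumed layout)."""
    total = 0.0
    for i, x in enumerate(xs):
        f1 = uniform_kde(x, ys, h)          # estimate of the first density at X_i
        f2 = uniform_kde(x, xs, h, skip=i)  # leave-one-out estimate of the second
        total += g(f1, f2)
    return total / len(xs)


def kl_g(f1, f2):
    """Illustrative choice of g for a KL-type functional.
    Raises an error if either density estimate is zero (h too small)."""
    return -math.log(f1 / f2)
```

With a single bandwidth this is the slowly converging weak estimator; the bandwidth must be large enough that every density estimate is strictly positive, otherwise a choice such as `kl_g` is undefined.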
2.2. Convergence Rates
For many estimators, MSE convergence rates are typically provided in the form of upper (or sometimes lower) bounds on the bias and the variance. Therefore, only the slowest converging terms (as a function of the sample size (N)) are presented in these cases. However, to apply our generalized ensemble theory to obtain estimators that guarantee the parametric MSE rate, we required explicit expressions for the bias of the weak estimators in terms of the sample size (N) and the kernel bandwidth (h). Thus, an upper bound was insufficient for our work. Furthermore, to guarantee the parametric rate, we required explicit expressions of all bias terms that converge to zero slower than .
To obtain bias expressions, we required multiple assumptions on the densities and , the functional g, and the kernel K. Similar to references [7,54,65], the principal assumptions we make are that (1) , and g are smooth; (2) and have common bounded support sets ; and (3) and are strictly lower bounded on . We also assume (4) that the density support set is smooth with respect to the kernel (). The full technical assumptions and a discussion of them are contained in Appendix A. Given these assumptions, we have the following result on the bias of :
Theorem 1.
For a general g, the bias of the plug-in estimator is given by
To apply our generalized ensemble theory to the KDE plug-in estimator (), we required only an upper bound on its variance. The following variance result required much less strict assumptions than the bias results in Theorem 1:
Theorem 2.
Assume that the functional g in (1) is Lipschitz continuous in both of its arguments with the Lipschitz constant (). Then, the variance of the plug-in estimator () is bounded by
From Theorems 1 and 2, we observe that and are required for to be unbiased, while the variance of the plug-in estimator depends primarily on the sample size (N). Note that the constants depend on the densities and and their derivatives which are often unknown.
2.3. Optimal MSE Rate
From Theorem 1, the dominating terms in the bias are observed to be and . If no bias correction is performed, the optimal choice of h that minimizes MSE is
This results in a dominant bias term of order . Note that this differs from the standard result for the optimal KDE bandwidth for minimum MSE density estimation which is for a symmetric uniform kernel when the boundary bias is ignored .
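The bandwidth trade-off can be made explicit with a short calculation. This is a sketch under an assumption that is not spelled out in the text above: that the two dominant bias terms scale as $c_1 h$ and $c_2/(N h^d)$, where $c_1, c_2 > 0$ are hypothetical constants introduced only for illustration.

```latex
\begin{align*}
b(h) &= c_1 h + \frac{c_2}{N h^{d}}, \\
b'(h) &= c_1 - \frac{d\,c_2}{N h^{d+1}} = 0
  \;\Longrightarrow\;
  h^{*} = \left(\frac{d\,c_2}{c_1 N}\right)^{\frac{1}{d+1}}
        = O\!\left(N^{-\frac{1}{d+1}}\right), \\
b(h^{*}) &= O\!\left(N^{-\frac{1}{d+1}}\right).
\end{align*}
```

Under that assumption the squared bias is of order $N^{-2/(d+1)}$, which decays more slowly than a variance of order $1/N$ whenever $d > 1$. This would explain why the uncorrected plug-in estimator degrades quickly as the dimension grows.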
Figure 1 is a heatmap of the leading bias term as a function of d and N when . The heatmap indicates that the bias of the plug-in estimator in (2) is small only for relatively small values of d. This is consistent with the empirical results in reference  which examined the MSE of multiple plug-in KDE and k-nn estimators. In the next section, we propose an ensemble estimator that achieves a superior convergence rate regardless of the dimension (d) as long as the density is sufficiently smooth.
2.4. Proof Sketches of Theorems 1 and 2
To prove the bias expressions in Theorem 1, the bias is first decomposed into two parts by adding and subtracting within the expectation creating a “bias” term and a “variance” term. Applying a Taylor series expansion on the bias and variance terms results in expressions that depend on powers of and , respectively. Within the interior of the support, moment bounds can be derived from properties of the KDEs and a Taylor series expansion of the densities. Near the boundary of the support, the smoothness assumption on the boundary is required to obtain an expression of the bias in terms of the KDE bandwidth (h) and the sample size (N). The full proof of Theorem 1 is given in Appendix E.
The proof of the variance result takes a different approach. The proof uses the Efron–Stein inequality  which bounds the variance by analyzing the expected squared difference between the plug-in estimator when one sample is allowed to differ. This approach provides a bound on the variance under much less strict assumptions on the densities and the functional g than is required for Theorem 1. The full proof of Theorem 2 is given in Appendix F.
3. Weighted Ensemble Estimation
From Theorem 1 and Figure 1, we can observe that the bias of the MSE-optimal plug-in estimator decreases very slowly as a function of the sample size (N) when the data dimensions (d) are not small, resulting in a large MSE. However, by applying the theory of optimally weighted ensemble estimation, we can obtain an estimator with improved performance by taking a weighted sum of an ensemble of weak estimators where the weights are chosen to significantly reduce the bias.
The ensemble of weak estimators is formed by choosing different values of the bandwidth parameter h as follows. Set to be real positive numbers that index . Thus, the parameter l indexes over different neighborhood sizes for the KDEs. Define the weight and That is, for each estimator there is a corresponding weight value (). The key to reducing the MSE is to choose the weight vector (w) to reduce the lower order terms in the bias while minimizing the impact of the weighted average on the variance.
3.1. Finding the Optimal Weight
The theory of optimally weighted ensemble estimation is a general theory that is applicable to any estimation problem as long as the bias and variance of the estimator can be expressed in a specific way. An early version of this theory was presented in reference . We now generalize this theory so that it can be applied to a wider variety of estimation problems. Let N be the number of available samples and let $\bar{l} = \{l_1, \ldots, l_L\}$ be a set of index values. Given an indexed ensemble of estimators $\{\hat{E}_l\}_{l \in \bar{l}}$ of some parameter E, the weighted ensemble estimator with weights $w = \{w(l_1), \ldots, w(l_L)\}$ satisfying $\sum_{l \in \bar{l}} w(l) = 1$ is defined as
$$\hat{E}_w = \sum_{l \in \bar{l}} w(l)\,\hat{E}_l.$$
The estimator $\hat{E}_w$ is asymptotically unbiased as long as the estimators $\{\hat{E}_l\}_{l \in \bar{l}}$ are asymptotically unbiased. Consider the following conditions on $\{\hat{E}_l\}_{l \in \bar{l}}$:
- The bias is expressible as
- The variance is expressible as
Theorem 3.
Assume conditions and hold for an ensemble of estimators . Then, there exists a weight vector () such that the MSE of the weighted ensemble estimator attains the parametric rate of convergence:
The weight vector () is the solution to the following convex optimization problem:
From condition , we can write the bias of the weighted estimator as
The variance of the weighted estimator is bounded as
The optimization problem in (4) zeroes out the lower-order bias terms and limits the norm of the weight vector (w) to prevent the variance from exploding. This results in an MSE rate of $O(1/N)$ when the dimension (d) is fixed and when L is fixed independently of the sample size (N). Furthermore, a solution to (4) is guaranteed to exist if and the vectors are linearly independent. This completes our sketch of the proof of Theorem 3. ☐
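The weight computation described in this proof sketch can be illustrated as a minimum-norm problem with linear equality constraints. The Python sketch below is an assumption-laden illustration, not the paper's code: it assumes the bias basis functions are the powers `l**i` of the index value (a hypothetical choice), and it uses the closed-form solution $w = A^{T}(AA^{T})^{-1}b$ of "minimize $\|w\|_2$ subject to $Aw = b$". The helper names are hypothetical.

```python
def solve(M, b):
    """Solve the square system M y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    y = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y


def min_norm_weights(ls, exponents):
    """Smallest-L2-norm weights with sum(w) = 1 and sum_l w(l) * l**i = 0
    for every i in `exponents` (assumed form of the bias basis functions)."""
    # Constraint matrix: a row of ones, then one row per basis function.
    A = [[1.0] * len(ls)] + [[l ** i for l in ls] for i in exponents]
    b = [1.0] + [0.0] * len(exponents)
    # Gram matrix A A^T; invertible when the rows of A are linearly independent.
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in A] for r1 in A]
    y = solve(gram, b)
    # w = A^T y
    return [sum(y[j] * A[j][k] for j in range(len(A))) for k in range(len(ls))]
```

For example, `min_norm_weights([1.0, 1.5, 2.0, 2.5, 3.0], [1, 2])` returns five weights, some of them negative, that sum to one and cancel the two assumed bias terms. The need for more index values than constraints and for linearly independent constraint rows mirrors the existence condition mentioned in the proof sketch.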
3.2. The EnDive Estimator
The parametric rate of $O(1/N)$ in MSE convergence can be achieved without requiring . This can be accomplished by solving the following convex optimization problem in place of the optimization problem in Theorem 3:
where the parameter is chosen to achieve a trade-off between bias and variance. Instead of forcing , the relaxed optimization problem uses the weights to decrease the bias terms at a rate of $O(1/\sqrt{N})$, yielding an MSE convergence rate of $O(1/N)$. In fact, it was shown in reference  that the optimization problem in (6) guarantees the parametric MSE rate as long as the conditions of Theorem 3 are satisfied and a solution to the optimization problem in (4) exists (the conditions for this existence are given in the proof of Theorem 3).
We now construct a divergence ensemble estimator from an ensemble of plug-in KDE divergence estimators. Consider first the bias result in (3) where g is general, and assume that . In this case, the bias contains a term. To guarantee the parametric MSE rate, any remaining lower-order bias terms in the ensemble estimator must be no slower than . Let where . Then . We therefore obtain an ensemble of plug-in estimators and a weighted ensemble estimator . The bias of each estimator in the ensemble satisfies the condition with and for . To obtain a uniform bound on the bias with respect to w and , we also include the function with corresponding . The variance also satisfies the condition . The optimal weight () is found by using (6) to obtain an optimally weighted plug-in divergence functional estimator with an MSE convergence rate of as long as and . Otherwise, if , we can only guarantee the MSE rate up to . We refer to this estimator as the Ensemble Divergence (EnDive) estimator and denote it as .
We note that for some functionals (g) (including the KL divergence and the Rényi-α divergence integral), we can modify the EnDive estimator to obtain the parametric rate under the less strict assumption that . For details on this approach, see Appendix B.
3.3. Central Limit Theorem
The following theorem shows that an appropriately normalized ensemble estimator converges in distribution to a normal random variable under rather general conditions. Thus, the same result applies to the EnDive estimator . This enables us to perform hypothesis testing on the divergence functional which is very useful in many scientific applications. The proof is based on the Efron–Stein inequality and an application of Slutsky’s Theorem (Appendix G).
Theorem 4.
Assume that the functional g is Lipschitz in both arguments with the Lipschitz constant . Further assume that , , and for each . Then, for a fixed , the asymptotic distribution of the weighted ensemble estimator is
where is a standard normal random variable.
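As an illustration of how asymptotic normality can be used for inference, the Python sketch below builds a normal-approximation confidence interval and a two-sided p-value from a divergence estimate. It assumes a standard error for the estimate is already available (for example from repeated subsampling); the paper does not prescribe how to obtain it here, and the function names are hypothetical.

```python
from statistics import NormalDist


def normal_confidence_interval(estimate, std_error, level=0.95):
    """Two-sided interval estimate +/- z * std_error under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def two_sided_p_value(estimate, std_error, null_value=0.0):
    """p-value for H0: divergence == null_value, using the standardized estimate."""
    z = (estimate - null_value) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Whether the normal approximation is adequate at a given sample size is an empirical question, which is the kind of check the quantile comparison in Section 4.3 performs.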
3.4. Uniform Convergence Rates
Here, we show that the optimally weighted ensemble estimators achieve the parametric MSE convergence rate uniformly. Denote the subset of with densities bounded between and as .
Theorem 5.
Let be the EnDive estimator of the functional
where p and q are d-dimensional probability densities. Additionally, let and assume that . Then,
where C is a constant.
The proof decomposes the MSE into the variance plus the square of the bias. The variance is bounded easily by using Theorem 2. To bound the bias, we show that the constants in the bias terms are continuous with respect to the densities p and q under an appropriate norm. We then show that is compact with respect to this norm and then apply an extreme value theorem. Details are given in Appendix H.
4. Experimental Results
In this section, we discuss the choice of tuning parameters and validate the EnDive estimator’s convergence rates and the central limit theorem. We then use the EnDive estimator to estimate bounds on the Bayes error for a single-cell bone marrow data classification problem.
4.1. Tuning Parameter Selection
The optimization problem in (6) has parameters , L, and . By applying (6), and the resulting MSE of the ensemble estimator iswhere each term in the sum comes from the bias and variance, respectively. From this expression and (6), we see that the parameter provides a tradeoff between bias and variance. Increasing enables the norm of the weight vector to be larger. This means the feasible region for the variable w increases in size as increases which can result in decreased bias. However, as contributes to the variance term, increasing may result in increased variance.
If all of the constants in (3) and an exact expression for the variance of the ensemble estimator were known, then could be chosen to optimize this tradeoff in bias and variance and thus minimize the MSE. Since these constants are unknown, we can only choose based on the asymptotic results. From (8), this would suggest setting . In practice, we find that for finite sample sizes, the variance in the ensemble estimator is less than the upper bound of . Thus, setting is unnecessarily restrictive. We find that, in practice, setting works well.
Upon first glance, it appears that for fixed L, the set that parameterizes the kernel widths can, in theory, be chosen by minimizing in (6) over in addition to w. However, adding this constraint results in a non-convex optimization problem since w does not lie in the non-negative orthant. A parameter search over possible values for is another possibility. However, this may not be practical as generally decreases as the size and spread of increases. In addition, for finite sample sizes, decreasing does not always directly correspond to a decrease in MSE, as very high or very low values of can lead to inaccurate density estimates, resulting in a larger MSE.
Given these limitations, we provide the following recommendations for . Denote the minimum value of l such that as and the diameter of the support as D. To ensure the KDEs are bounded away from zero, we require that . As demonstrated in Figure 2, the weights in are generally largest for the smallest values of . This indicates that should also be sufficiently larger than to render an adequate density estimate. Similarly, should be sufficiently smaller than the diameter (D) as high bandwidth values can lead to high bias in the KDEs. Once these values are chosen, all other values can then be chosen to be equally spaced between and .
An efficient way to choose and is to select the integers and and compute the and nearest neighbor distances of all the data points. The bandwidths and can then be chosen to be the maximums of these corresponding distances. The parameters and can then be computed from the expression . This choice ensures that a minimum of points are within the kernel bandwidth for the density estimates at all points and that a maximum of points are within the kernel bandwidth for the density estimates at one of the points.
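The nearest-neighbor recipe above can be sketched in a few lines of Python. This is a hedged illustration: it assumes Euclidean distances, works directly with bandwidths rather than with the index values (the expression linking the two is not reproduced here), and spaces the intermediate bandwidths equally. The names `kth_nn_distance`, `bandwidth_grid`, `k_min`, and `k_max` are hypothetical.

```python
import math


def kth_nn_distance(points, i, k):
    """Euclidean distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]


def bandwidth_grid(points, k_min, k_max, num_bandwidths):
    """Equally spaced bandwidths between the largest k_min-NN distance
    and the largest k_max-NN distance over all data points."""
    n = len(points)
    h_min = max(kth_nn_distance(points, i, k_min) for i in range(n))
    h_max = max(kth_nn_distance(points, i, k_max) for i in range(n))
    step = (h_max - h_min) / (num_bandwidths - 1)
    return [h_min + j * step for j in range(num_bandwidths)]
```

Taking the maximum over all points is what guarantees that every point has at least `k_min` neighbors within the smallest bandwidth. The brute-force distance computation costs on the order of N squared operations; a spatial index would be the natural replacement for large samples.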
Once and have been chosen, the similarity of bandwidth values and basis functions increases as L increases, resulting in a negligible decrease in the bias. Hence, L should be chosen to be large enough for sufficient bias reduction but small enough so that the bandwidth values are sufficiently distinct. In our experiments, we found to be sufficient.
4.2. Convergence Rates Validation: Rényi-α Divergence
To validate our theoretical convergence rate results, we estimated the Rényi-α divergence integral between two truncated multivariate Gaussian distributions with varying dimension and sample sizes. The densities had means of , and covariance matrices of , where is a d-dimensional vector of ones, and is a identity matrix. We restricted the Gaussians to the unit cube and used .
The left plots in Figure 3 show the MSE (200 trials) of the standard plug-in estimator implemented with a uniform kernel and the proposed optimally weighted estimator EnDive for various dimensions and sample sizes. The parameter set was selected based on a range of k-nearest neighbor distances. The bandwidth used for the standard plug-in estimator was selected by setting , where was chosen from to minimize the MSE of the plug-in estimator. For all dimensions and sample sizes, EnDive outperformed the plug-in estimator in terms of MSE. EnDive was also less biased than the plug-in estimator and even had lower variance at smaller sample sizes (e.g., ). This reflects the strength of ensemble estimators—the weighted sum of a set of relatively poor estimators can result in a very good estimator. Note also that for the larger values of N, the ensemble estimator MSE rates approached the theoretical rate based on the estimated log–log slope given in Table 1.
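An empirical convergence rate of this kind is typically read off as the slope of a straight-line fit to log MSE against log N. The Python sketch below shows one way to compute such a slope by ordinary least squares; it is an assumed procedure for illustration, since the exact fitting method behind Table 1 is not described here, and `loglog_slope` is a hypothetical name.

```python
import math


def loglog_slope(sample_sizes, mses):
    """Least-squares slope of log(MSE) versus log(N)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(m) for m in mses]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den
```

An estimator whose MSE decays at the parametric rate, proportional to 1/N, gives a slope of -1. Slopes closer to zero indicate slower convergence.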
To illustrate the difference between the problems of density estimation and divergence functional estimation, we estimated the average pointwise squared error between the KDE and in the previous experiment. We used exactly the same bandwidth and kernel as the standard plug-in estimators in Figure 3 and calculated the pointwise error at 10,000 points sampled from . The results are shown in Figure 4. From these results, we see that the KDEs performed worse as the dimension of the densities increased. Additionally, by comparing Figure 3 and Figure 4, we observe that the average pointwise squared error decreased at a much slower rate as a function of the sample size (N) than the MSE of the plug-in divergence estimators, especially for larger dimensions (d).
Our experiments indicated that the proposed ensemble estimator is not sensitive to the tuning parameters. See reference  for more details.
4.3. Central Limit Theorem Validation: KL Divergence
To verify the central limit theorem of the EnDive estimator, we estimated the KL divergence between two truncated Gaussian densities, again restricted to the unit cube. We conducted two experiments where (1) the densities were different with means of , and covariances of matrices , , ; and where (2) the densities were the same with means of and covariance matrices of . For both experiments, we chose and four different sample sizes (N). We found that the correspondence between the quantiles of the standard normal distribution and the quantiles of the centered and scaled EnDive estimator was very high under all settings (see Table 2 and Figure 5) which validates Theorem 4.
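The quantile comparison described above can be sketched as a correlation between the sorted, centered, and scaled estimates and the matching standard normal quantiles. The data below are synthetic stand-ins rather than EnDive outputs, and the plotting positions (i + 0.5)/n are an assumed convention.

```python
import random
from statistics import NormalDist, mean, pstdev


def qq_correlation(estimates):
    """Pearson correlation between standardized sample quantiles and normal quantiles."""
    n = len(estimates)
    mu, sd = mean(estimates), pstdev(estimates)
    sample_q = sorted((e - mu) / sd for e in estimates)
    normal_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    ms, mn = mean(sample_q), mean(normal_q)
    cov = sum((s - ms) * (t - mn) for s, t in zip(sample_q, normal_q))
    var_s = sum((s - ms) ** 2 for s in sample_q)
    var_n = sum((t - mn) ** 2 for t in normal_q)
    return cov / (var_s * var_n) ** 0.5


rng = random.Random(0)
gaussian_like = [rng.gauss(0.3, 2.0) for _ in range(500)]  # stand-in for repeated estimates
skewed = [rng.expovariate(1.0) for _ in range(500)]        # a clearly non-Gaussian contrast
print(qq_correlation(gaussian_like), qq_correlation(skewed))
```

A value close to one indicates that the empirical quantiles fall nearly on a straight line against the normal quantiles, which is the pattern a valid central limit theorem predicts.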
4.4. Bayes Error Rate Estimation on Single-Cell Data
Using the EnDive estimator, we estimated bounds on the Bayes error rate (BER) of a classification problem involving MARS-seq single-cell RNA-sequencing (scRNA-seq) data measured from developing mouse bone marrow cells enriched for the myeloid and erythroid lineages . However, we first demonstrated the ability of EnDive to estimate the bounds on the BER of a simulated problem. In this simulation, the data were drawn from two classes where each class distribution was a dimensional Gaussian distribution with different means and the identity covariance matrix. We considered two cases, namely, the distance between the means was 1 or 3. The BER was calculated in both cases. We then estimated upper and lower bounds on the BER by estimating the Henze–Penrose (HP) divergence [4,6]. Figure 6 shows the average estimated upper and lower bounds on the BER with standard error bars for both cases. For all tested sample sizes, the BER was within one standard deviation of the estimated lower bound. The lower bound was also closer, on average, to the BER for most of the tested sample sizes (lower sample sizes with smaller distances between means were the exceptions). Generally, these results indicate that the true BER is relatively close to the estimated lower bound, on average.
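The two quantities compared in this simulation can be sketched as follows. The closed-form BER assumes equal class priors, which the text does not state explicitly. The mapping from the HP divergence to BER bounds follows the form we understand to be given in reference [6], namely u = 4pq D + (p - q)^2 with lower bound 1/2 - sqrt(u)/2 and upper bound 1/2 - u/2; it should be checked against that reference before use. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def gaussian_ber(mean_distance):
    """BER of two equally likely Gaussians with identity covariance.

    With equal priors, the optimal rule thresholds at the midpoint between the
    means, so the error is the normal tail probability beyond half the distance.
    """
    return NormalDist().cdf(-mean_distance / 2.0)


def ber_bounds_from_hp(hp_divergence, prior1=0.5):
    """Lower and upper BER bounds from an HP divergence value in [0, 1]."""
    p, q = prior1, 1.0 - prior1
    u = 4.0 * p * q * hp_divergence + (p - q) ** 2
    u = min(max(u, 0.0), 1.0)  # guard against estimates slightly outside [0, 1]
    return 0.5 - 0.5 * math.sqrt(u), 0.5 - 0.5 * u


for distance in (1.0, 3.0):
    print(distance, round(gaussian_ber(distance), 4))
print(ber_bounds_from_hp(0.4))
```

Two sanity checks follow directly from the formula: a divergence of zero with equal priors gives both bounds equal to 1/2, and a divergence of one gives both bounds equal to zero.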
We then estimated similar bounds on the scRNA-seq classification problem using EnDive. We considered the three most common cell types within the data: erythrocytes (eryth.), monocytes (mono.), and basophils (baso.) ( respectively). We estimated the upper and lower bounds on the pairwise BER between these classes using different combinations of genes selected from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the hematopoietic cell lineage [74,75,76]. Each collection of genes contained 11–14 genes. The upper and lower bounds on the BER were estimated using the Henze–Penrose divergence [4,6]. The standard deviations of the bounds for the KEGG-based genes were estimated via 1000 bootstrap iterations. The KEGG-based bounds were compared to BER bounds obtained from 1000 random selections of 12 genes. In all cases, we compared the bounds to the performance of a quadratic discriminant analysis classifier (QDA) with 10-fold cross validation. Note that to correct for undersampling in scRNA-seq data, we first imputed the undersampled data using MAGIC .
All results are given in Table 3. From these results, we note that erythrocytes are relatively easy to distinguish from the other two cell types as the BER lower bounds were within nearly two standard deviations of zero when using genes associated with platelet, erythrocyte, and neutrophil development as well as a random selection of 12 genes. This is corroborated by the QDA cross-validated results which were all within two standard deviations of either the upper or lower bound for these gene sets. In contrast, the macrophage-associated genes seem to be less useful for distinguishing erythrocytes than the other gene sets.
We also found that basophils are difficult to distinguish from monocytes using these gene sets. Assuming the relative abundance of each cell type is representative of the population, a trivial upper bound on the BER is which is between all of the estimated lower and upper bounds. The QDA results were also relatively high (and may be overfitting the data in some cases based on the estimated BER bounds), suggesting that different genes should be explored for this classification problem.
We derived the MSE convergence rates for a kernel density plug-in estimator for a large class of divergence functionals. We generalized the theory of optimally weighted ensemble estimation and derived an ensemble divergence estimator EnDive that achieves the parametric rate when the densities are more than d times differentiable. The estimator we derived can be applied to general bounded density support sets and can be implemented without knowledge of the support, which is a distinct advantage over other competing estimators. We also derived the asymptotic distribution of the estimator, provided some guidelines for tuning parameter selection, and experimentally validated the theoretical convergence rates for the case of empirical estimation of the Rényi- divergence integral. We then performed experiments to examine the estimator’s robustness to the choice of tuning parameters, validated the central limit theorem for KL divergence estimation, and estimated bounds on the Bayes error rate for a single cell classification problem.
We note that based on the proof techniques employed in our work, our weighted ensemble estimators are easily extended beyond divergence estimation to more general distributional functionals which may be integral functionals of any number of probability distributions. We also show in Appendix B that EnDive can be easily modified to obtain an estimator that achieves the parametric rate when the densities are more than times differentiable and the functional g has a specific form that includes the Rényi and KL divergences. Future work includes extending this modification to functionals with more general forms. An important divergence of interest in this context is the Henze–Penrose divergence that we used to bound the Bayes error. Further future work will focus on extending this work on divergence estimation to k-nn based estimators where knowledge of the support is, again, not required. This will improve the computational burden, as k-nn estimators require fewer computations than standard KDEs.
K.M. wrote this article primarily as part of his PhD dissertation under the supervision of A.H. and in collaboration with K.S. A.H., K.M., and K.S. edited the paper. K.S. provided the primary contribution to the proof of Theorem A1 and assisted with all other proofs. K.M. provided the primary contributions for the proofs of all other theorems and performed all other experiments. K.G. contributed to the bias proof.
This research was funded by Army Research Office (ARO) Multidisciplinary University Research Initiative (MURI) grant number W911NF-15-1-0479, National Science Foundation (NSF) grant number CCF-1217880, and a National Science Foundation (NSF) Graduate Research Fellowship to the first author under grant number F031543.
Conflicts of Interest
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.
The following abbreviations are used in this manuscript:
| MSE | Mean squared error |
| SVM | Support vector machine |
| KDE | Kernel density estimator |
| BER | Bayes error rate |
Appendix A. Bias Assumptions
Our full assumptions to prove the bias expressions for the estimator were as follows:
- : Assume that the kernel K is symmetric, is a product kernel, and has bounded support in each dimension.
- : Assume there exist constants , such that
- : Assume that the densities are in the interior of with .
- : Assume that g has an infinite number of mixed derivatives.
- ): Assume that , are strictly upper bounded for .
- : Assume the following boundary smoothness condition: Let be a polynomial in u of order whose coefficients are a function of x and are times differentiable. Then, assume that
We focused on finite support kernels for simplicity in the proofs, although it is likely that our results extend to some infinitely supported kernels as well. We assumed relatively strong conditions on the smoothness of g in to enable us to obtain an estimator that achieves good convergence rates without knowledge of the boundary of the support set. While this smoothness condition may seem restrictive, in practice, nearly all divergence and entropy functionals of interest satisfy this condition. Functionals of interest that do not satisfy this assumption (e.g., the total variation distance) typically have at least one point that is not differentiable which violates the assumptions of all competing estimators [54,57,58,59,60,65]. We also note that to obtain simply an upper bound on the bias for the plug-in estimator, much less restrictive assumptions on the functional g are sufficient.
Assumption requires the boundary of the density support set to be smooth with respect to the kernel () in the sense that the expectation of the area outside of with respect to any random variable u with smooth distribution is a smooth function of the bandwidth (h). Note that we do not require knowledge of the support of the unknown densities to actually implement the estimator (). As long as assumptions – are satisfied, then the bias results we obtain are valid, and therefore, we can obtain the parametric rate with the EnDive estimator. This is in contrast to many other estimators of information theoretic measures such as those presented in references [59,60,65]. In these cases, the boundary of the support set must be known precisely to perform boundary correction to obtain the parametric rate, since the boundary correction is an explicit step in these estimators. In contrast, we do not need to explicitly perform a boundary correction.
It is not necessary for the boundary of to have smooth contours with no edges or corners as assumption is satisfied by the following case:
Assumption is satisfied when and when K is the uniform rectangular kernel; that is, for all .
The proof is given in Appendix D. The methods used to prove this can be easily extended to show that is satisfied with the uniform rectangular kernel and other similar supports with flat surfaces and corners. Furthermore, we showed in reference  that is satisfied using the uniform spherical kernel with a density support set equal to the unit cube. Note that assumption is trivially satisfied by the uniform rectangular kernel as well. Again, this is easily extended to more complicated density support sets that have boundaries that contain flat surfaces and corners. Determining other combinations of kernels and density support sets that satisfy is left for future work.
Densities for which assumptions .1– hold include the truncated Gaussian distribution and the beta distribution on the unit cube. Functions for which assumptions .3– hold include and
Appendix B. Modified EnDive
If the functional g has a specific form, we can modify the EnDive estimator to obtain an estimator that achieves the parametric rate when . Specifically, we have the following theorem:
Assume that assumptions .0– hold. Furthermore, if has -th order mixed derivatives that depend on only through for some , then for any positive integer , the bias of is
Divergence functionals that satisfy the mixed derivatives condition required for (A1) include the KL divergence and the Rényi- divergence. Obtaining similar terms for other divergence functionals requires us to separate the dependence on h of the derivatives of g evaluated at . This is left for future work. See Appendix E for details.
Compared to (3), there are many more terms in (A1). These terms enable us to modify the EnDive estimator to achieve the parametric MSE convergence rate when for an appropriate choice of bandwidths, whereas the terms in (3) require to achieve the same rate. This is accomplished by letting decrease at a faster rate, as follows.
Let and where . The bias of each estimator in the resulting ensemble has terms proportional to , where and . Then, the bias of satisfies condition if , , and as long as . The variance also satisfies condition . The optimal weight () is found by using (6) to obtain an optimally weighted plug-in divergence functional estimator that achieves the parametric convergence rate if and if . Otherwise, if , we can only guarantee the MSE rate up to . We refer to this estimator as the modified EnDive estimator and denote it as . The ensemble estimator is summarized in Algorithm A1 when .
|Algorithm A1: The Modified EnDive Estimator|
The parametric rate can be achieved with under less strict assumptions on the smoothness of the densities than those required for . Since can be arbitrary, it is theoretically possible to achieve the parametric rate with the modified estimator as long as . This is consistent with the rate achieved by the more complex estimators proposed in reference . We also note that the central limit theorem applies and that the convergence is uniform as Theorem 5 applies for and .
These rate improvements come at a cost for the number of parameters (L) required to implement the weighted ensemble estimator. If , then the size of J for is in the order of . This may lead to increased variance in the ensemble estimator as indicated by (5).
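The weighting idea behind the ensemble can be illustrated with a minimal sketch. It computes the minimum-norm weights that sum to one and cancel bias terms proportional to l^i for each i in a small index set. This is the plain equality-constrained form, not the relaxed optimization problem (6), and the bandwidth parameters, exponents, and mock bias coefficients are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def min_norm_weights(ls, exponents):
    """Minimum Euclidean-norm w with sum(w) = 1 and sum(w * l**i) = 0 for i in exponents.

    Uses w = A^T (A A^T)^{-1} b, where the rows of A are the constraints.
    Assumes distinct values in `ls` and more values than constraints.
    """
    rows = [[1.0] * len(ls)] + [[l ** i for l in ls] for i in exponents]
    rhs = [1.0] + [0.0] * len(exponents)
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in rows] for r1 in rows]
    lam = solve_linear(gram, rhs)
    return [sum(lam[k] * rows[k][j] for k in range(len(rows)))
            for j in range(len(ls))]


ls = [1.0, 1.5, 2.0, 2.5, 3.0]  # hypothetical bandwidth parameters
w = min_norm_weights(ls, [1, 2])
# Mock base estimators: true value 2.0 plus bias terms in l and l**2.
base = [2.0 + 0.3 * l - 0.1 * l ** 2 for l in ls]
print(sum(wi * bi for wi, bi in zip(w, base)))
```

Each additional bias term to cancel adds a constraint, so a larger index set needs more base estimators and tends to produce larger weights, which is consistent with the variance cost noted above.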
So far, can only be applied to functionals () with mixed derivatives of the form of . Future work is required to extend this estimator to other functionals of interest.
Appendix C. General Results
Here we present the generalized forms of Theorems 1 and 2 where the sample sizes and bandwidths of the two datasets are allowed to differ. In this case, the KDEs are where . is then approximated as
We also generalize the bias result to the case where the kernel (K) has the order which means that the j-th moment of the kernel defined as is zero for all and where is the kernel in the i-th coordinate. Note that symmetric product kernels have the order . The following theorem on the bias follows under assumptions .0–:
For general g, the bias of the plug-in estimator () is of the form
Furthermore, if has -th order mixed derivatives that depend on only through for some , then for any positive integer , the bias is of the form
Note that the bandwidth and sample size terms do not depend on the order of the kernel (). Thus, using a higher-order kernel does not provide any benefit to the convergence rates. This lack of improvement is due to the bias of the density estimators at the boundary of the density support sets. To obtain better convergence rates using higher-order kernels, boundary correction would be necessary [57,60]. In contrast, we improve the convergence rates by using a weighted ensemble that does not require boundary correction.
The variance result requires much less strict assumptions than the bias results:
Assume that the functional g in (1) is Lipschitz continuous in both of its arguments with the Lipschitz constant . Then, the variance of the plug-in estimator () is bounded by
Appendix D. Proof of Theorem A1 (Boundary Conditions)
Consider a uniform rectangular kernel that satisfies for all x, such that . Also, consider the family of probability densities (f) with rectangular support . We prove Theorem A1, which states that satisfies the following smoothness condition ): for any polynomial of order with coefficients that are times differentiable wrt x, where has the expansion
Note that the inner integral forces the xs under consideration to be boundary points via the constraint .
Appendix D.1. Single Coordinate Boundary Point
We begin by focusing on points x that are boundary points by virtue of a single coordinate , such that . Without loss of generality, assume that . The inner integral in (A6) can then be evaluated first with respect to (wrt) all coordinates other than i. Since all of these coordinates lie within the support, the inner integral over these coordinates will amount to integration of the polynomial over a symmetric dimensional rectangular region for all . This yields a function where the coefficients are each times differentiable wrt x.
With respect to the coordinate, the inner integral will have limits from to for some . Consider the monomial term. The inner integral wrt this term yields
Raising the right-hand side of (A7) to the power of t results in an expression of the form where the coefficients are times differentiable wrt x. Integrating (A8) over all the coordinates in x other than results in an expression of the form where, again, the coefficients are times differentiable wrt . Note that since the coordinates of x other than are far away from the boundary, the coefficients are independent of h. To evaluate the integral of (A9), consider the term Taylor series expansion of around . This will yield terms of the form for , and . Combining terms results in the expansion .
Appendix D.2. Multiple Coordinate Boundary Point
The case where multiple coordinates of point x are near the boundary is a straightforward extension of the single boundary point case, so we only sketch the main ideas here. As an example, consider the case where two of the coordinates are near the boundary. Assume for notational ease that they are and and that and . The inner integral in (A6) can again be evaluated first wrt all coordinates other than 1 and 2. This yields a function where the coefficients are each times differentiable wrt x. Integrating this wrt and and then raising the result to the power of t yields a double sum similar to (A8). Integrating this over all the coordinates in x other than and gives a double sum similar to (A9). Then, a Taylor series expansion of the coefficients and integration over and yields the result.
Appendix E. Proof of Theorem A3 (Bias)
In this appendix, we prove the bias results in Theorem A3. The bias of the base kernel density plug-in estimator can be expressed as where is drawn from . The first term is the “variance” term, while the second is the “bias” term. We bound these terms using Taylor series expansions under the assumption that g is infinitely differentiable. The Taylor series expansion of the variance term in (A11) will depend on variance-like terms of the KDEs, while the Taylor series expansion of the bias term in (A11) will depend on the bias of the KDEs.
The Taylor series expansion of around and is where is the bias of at the point raised to the power of j. This expansion can be used to control the second term (the bias term) in (A11). To accomplish this, we require an expression for .
To obtain an expression for , we separately consider the cases when is in the interior of the support or when is near the boundary of the support. A point is defined to be in the interior of if for all , . A point is near the boundary of the support if it is not in the interior. Denote the region in the interior and near the boundary wrt as and , respectively. We will need the following:
Let be a realization of the density independent of for . Assume that the densities and belong to . Then, for ,
Obtaining the lower order terms in (A12) is a common result in kernel density estimation. However, since we also require the higher order terms, we present the proof here. Additionally, some of the results in this proof will be useful later. From the linearity of the KDE, we have that if is drawn from and is independent of , then where the last step follows from the substitution . Since the density () belongs to , by using multi-index notation we can expand it to where and . Combining (A13) and (A14) gives where the last step follows from the fact that K is symmetric and of order . ☐
To obtain a similar result for the case when is near the boundary of , we use the assumption .
Let be an arbitrary function satisfying . Let satisfy the boundary smoothness conditions of Assumption . Assume that the densities and belong to , and let be a realization of the density independently of for . Let . Then,
For a fixed X near the boundary of , we have
Note that, in , we are extending the integral beyond the support of the density. However, by using the same Taylor series expansion method as in the proof of Lemma A1, we always evaluate and its derivatives at point X which is within the support of . Thus, it does not matter how we define an extension of since the Taylor series will remain the same. Thus, results in an identical expression to that obtained from (A12).
For the term, we expand it using multi-index notation as
Recognizing that the th derivative of is times differentiable, we can apply assumption to obtain the expectation of wrt X:
Similarly, we find that
Combining these results gives where the constants are functionals of the kernel and the densities.
The expression in (A16) can be proved in a similar manner. ☐
Applying Lemmas A1 and A2 to (A11) gives
For the variance term (the first term) in (A11), the truncated Taylor series expansion of around and gives where . To control the variance term in (A11), we thus require expressions for .
Let be a realization of the density that is in the interior of the support and is independent of for . Let be the set of integer divisors of q including 1 but excluding q. Then, where is a functional of and
Define the random variable . This gives
Clearly, . From (A13), we have for integer where the constants depend on density , its derivatives, and the moments of kernel . Note that since K is symmetric, the odd moments of are zero for in the interior of the support. However, all even moments may now be non-zero since may now be non-negative. By the binomial theorem,
We can use these expressions to simplify . As an example, let . Then, since the are independent,
Similarly, we find that
For , we have
The pattern for is then,
For any integer (q), the largest possible factor is . Thus, for a given q, the smallest possible exponent on the term is . This increases as q increases. A similar expression holds for , except the s are replaced with , is replaced with , and and are replaced with and , respectively, all resulting in different constants. Then, since and are conditionally independent given ,☐
Applying Lemma A3 to (A18) when taking the conditional expectation given in the interior gives an expression of the form
Note that the functionals and depend on the derivatives of g and which depend on . To apply an ensemble estimation, we need to separate the dependence on from the constants. If we use ODin1, then it is sufficient to note that in the interior of the support, and therefore, for some functional c. The terms in (A22) reduce to
For ODin2, we need the higher order terms. To separate the dependence on from the constants, we need more information about the functional g and its derivatives. Consider a special case where the functional has derivatives of the form of with This includes the important cases of the KL divergence and the Rényi divergence. The generalized binomial theorem states that if and if q and t are real numbers with , then for any complex number (),
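For reference, the generalized binomial theorem is usually stated as below. The symbols q, t, and α are assumed to play the roles named in the preceding sentence, with the condition on q and t taken to be |q| > |t|; this is the standard textbook form rather than a reproduction of the paper's own display.

```latex
(q + t)^{\alpha}
  = \sum_{i=0}^{\infty} \binom{\alpha}{i}\, q^{\alpha - i}\, t^{i},
\qquad
\binom{\alpha}{i} = \frac{\alpha(\alpha - 1)\cdots(\alpha - i + 1)}{i!},
\qquad |q| > |t| .
```

As a simple instance, taking α = -1 gives 1/(q + t) = \sum_{i \ge 0} (-1)^{i} q^{-1-i} t^{i}, a geometric series in t/q. In the present setting, one plausible reading is that q stands for the density value and t for the KDE bias at that point, so that negative powers of the expected KDE expand into powers of the bandwidth.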
Since the densities are bounded away from zero, for sufficiently small , we have Applying the generalized binomial theorem and Lemma A1 gives
Since m is an integer, the exponents of the terms are also integers. Thus, (A22) gives, in this case,
As before, the case for close to the boundary of the support is more complicated. However, by using a similar technique to the proof of Lemma A2 for at the boundary and combining with previous results, we find that for general g,
If has derivatives of the form of with , then we can similarly obtain
Appendix F. Proof of Theorem A4 (Variance)
To bound the variance of the plug-in estimator , we use the Efron–Stein inequality :
Lemma A4 (Efron–Stein Inequality).
Let be independent random variables on the space . Then, if , we have
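For reference, the Efron–Stein inequality is commonly written as follows, where X_1', ..., X_n' are independent copies of X_1, ..., X_n. The notation is the standard textbook form and is assumed to match the lemma's statement.

```latex
\operatorname{Var}\bigl[f(X_1,\dots,X_n)\bigr]
  \le \frac{1}{2} \sum_{i=1}^{n}
  \mathbb{E}\Bigl[\bigl(f(X_1,\dots,X_i,\dots,X_n)
      - f(X_1,\dots,X_i',\dots,X_n)\bigr)^{2}\Bigr].
```

A quick check with the sample mean f = \frac{1}{n}\sum_i X_i of i.i.d. variables with variance \sigma^2: replacing X_i by X_i' changes f by (X_i - X_i')/n, whose second moment is 2\sigma^2/n^2. Summing over i and halving gives \sigma^2/n, which equals the variance exactly, so the bound is tight in this case. The proof in this appendix follows the same pattern: bound how much the estimator can change when a single sample is resampled.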
Suppose we have samples and and denote the respective estimators as and . We have
Since g is Lipschitz continuous with the constant , we have where the last step follows from Jensen’s inequality. By making the substitutions and , this gives
Combining this with (A29) gives
Combining these results with (A27) gives
The second term in (A26) is controlled in a similar way. From the Lipschitz condition,
The terms are eliminated by making the substitutions of and within the expectation to obtain where we use the Cauchy–Schwarz inequality to bound the expectation within each summand. Finally, applying Jensen’s inequality and (A29) and (A32) gives
Now, suppose we have samples and and denote the respective estimators as and . Then,
Thus, using a similar argument as was used to obtain (A32),
Applying the Efron–Stein inequality gives
Appendix G. Proof of Theorem 4 (CLT)
We are interested in the asymptotic distribution of
Note that, by the standard central limit theorem, the second term converges in distribution to a Gaussian random variable. If the first term converges in probability to a constant (specifically, 0), then we can use Slutsky’s theorem to find the asymptotic distribution. So, now, we focus on the first term which we denote as .
To prove convergence in probability, we use Chebyshev’s inequality. Note that . To bound the variance of , we again use the Efron–Stein inequality. Let be drawn from and denote and as the sequences using and , respectively. Then,
We use the Efron–Stein inequality to bound .
We do this by bounding the conditional expectation of the term where is replaced with in the KDEs for some . Using similar steps as in Appendix F, we have
A similar result is obtained when is replaced with . Then, based on the Efron–Stein inequality, .
A similar result holds for the terms in (A33).
For the third term in (A33),
There are terms where , and we have from Appendix F (see (A30)) that
Thus, these terms are . There are terms when . In this case, we can do four substitutions of the form to obtain
Then, since , we get
Now, consider samples and and the respective sequences and . Then,
Using a similar argument as that used to obtain (A33), we have that if and , then
Applying the Efron–Stein inequality gives
Thus, based on Chebyshev’s inequality, and therefore, converges to zero in probability. Based on Slutsky’s theorem, converges in distribution to a zero mean Gaussian random variable with variance where is drawn from .
For the weighted ensemble estimator, we wish to know the asymptotic distribution of where . We have
The second term again converges in distribution to a Gaussian random variable by the central limit theorem. The mean and variance are, respectively, zero and
The first term is equal to where denotes convergence to zero in probability. In the last step, we use the fact that if two random variables converge in probability to constants, then their linear combination converges in probability to the linear combination of the constants. Combining this result with Slutsky’s theorem completes the proof.
Appendix H. Proof of Theorem 5 (Uniform MSE)
Since the MSE is equal to the square of the bias plus the variance, we can upper bound the left hand side of (7) with
From the assumptions (Lipschitz, kernel bounded, weight calculated from the relaxed optimization problem), we have where the last step follows from the fact that all of the terms are independent of p and q.
For the bias, recall that if g is infinitely differentiable and if the optimal weight () is calculated using the relaxed convex optimization problem, then We use a topological argument to bound the supremum of this term. We use the Extreme Value Theorem :
Theorem A5 (Extreme Value Theorem)
Let be continuous. If X is compact, then points s.t. exist for every .
Based on this theorem, f achieves its minimum and maximum on X. Our approach is to first show that the functionals are continuous wrt p and q in some appropriate norm. We then show that the space is compact wrt this norm. The Extreme Value Theorem can then be applied to bound the supremum of (A34).
We first define the norm. Let . We use the standard norm on the space : where
The functionals are continuous wrt the norm .
The functionals depend on terms of the form
It is sufficient to show that this is continuous. Let and where will be chosen later. Then, by applying the triangle inequality for integration and adding and subtracting terms, we have
Based on Assumption , the absolute value of the mixed derivatives of g is bounded on the range defined for p and q by some constant (). Also, . Furthermore, since and are continuous, and since is compact, then the absolute value of derivatives and is also bounded by a constant (). Let . Then, since the mixed derivatives of g are continuous on the interval , they are uniformly continuous. Therefore, we can choose a small enough such that (s.t.)
Combining all of these results with (A36) gives where is the Lebesgue measure of . This is bounded since is compact. Let be s.t. if , then (A37) is less than . Let . Then, if , ☐
Given that each is continuous, then is also continuous wrt p and q.
We now argue that is compact. First, a set is relatively compact if its closure is compact. Based on the Arzelà–Ascoli theorem , the space is relatively compact in the topology induced by the norm for any . We choose . It can then be shown that under the norm, is complete . Since is contained in a metric space, it is also closed and therefore, equal to its closure. Thus, is compact. Then, since is closed in , it is also compact. Therefore, since for each , , based on the Extreme Value Theorem, we have where we use the fact that J is finite (see Section 3.2 or Appendix B for the set J).
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2012.
- Avi-Itzhak, H.; Diep, T. Arbitrarily tight upper and lower bounds on the Bayesian probability of error. IEEE Trans. Pattern Anal. Mach. Intell. 1996, 18, 89–91.
- Hashlamoun, W.A.; Varshney, P.K.; Samarasooriya, V. A tight upper bound on the Bayesian probability of error. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 220–224.
- Moon, K.; Delouille, V.; Hero, A.O., III. Meta learning of bounds on the Bayes classifier error. In Proceedings of the 2015 IEEE Signal Processing and Signal Processing Education Workshop (SP/SPE), Salt Lake City, UT, USA, 9–12 August 2015; pp. 13–18.
- Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507.
- Berisha, V.; Wisler, A.; Hero, A.O., III; Spanias, A. Empirically Estimable Classification Bounds Based on a New Divergence Measure. IEEE Trans. Signal Process. 2016, 64, 580–591.
- Moon, K.R.; Hero, A.O., III. Multivariate f-Divergence Estimation With Confidence. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2420–2428.
- Gliske, S.V.; Moon, K.R.; Stacey, W.C.; Hero, A.O., III. The intrinsic value of HFO features as a biomarker of epileptic activity. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016.
- Loh, P.-L. On Lower Bounds for Statistical Learning Theory. Entropy 2017, 19, 617.
- Póczos, B.; Schneider, J.G. On the estimation of alpha-divergences. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 609–617.
- Oliva, J.; Póczos, B.; Schneider, J. Distribution to distribution regression. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1049–1057.
- Szabó, Z.; Gretton, A.; Póczos, B.; Sriperumbudur, B. Two-stage sampled learning theory on distributions. In Proceeding of The 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015. [Google Scholar]
- Moon, K.R.; Delouille, V.; Li, J.J.; De Visscher, R.; Watson, F.; Hero, A.O., III. Image patch analysis of sunspots and active regions. II. Clustering via matrix factorization. J. Space Weather Space Clim. 2016, 6, A3. [Google Scholar] [CrossRef]
- Moon, K.R.; Li, J.J.; Delouille, V.; De Visscher, R.; Watson, F.; Hero, A.O., III. Image patch analysis of sunspots and active regions. I. Intrinsic dimension and correlation analysis. J. Space Weather Space Clim. 2016, 6, A2. [Google Scholar] [CrossRef]
- Dhillon, I.S.; Mallela, S.; Kumar, R. A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 2003, 3, 1265–1287. [Google Scholar]
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
- Lewi, J.; Butera, R.; Paninski, L. Real-time adaptive information-theoretic optimization of neurophysiology experiments. In Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS 2006), Vancouver, BC, Canada, 4–9 December 2006; pp. 857–864. [Google Scholar]
- Bruzzone, L.; Roli, F.; Serpico, S.B. An extension of the Jeffreys-Matusita distance to multiclass cases for feature selection. IEEE Trans. Geosci. Remote Sens. 1995, 33, 1318–1321. [Google Scholar] [CrossRef]
- Guorong, X.; Peiqi, C.; Minhui, W. Bhattacharyya distance feature selection. In Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, 25–29 August 1996; Volume 2, pp. 195–199. [Google Scholar]
- Sakate, D.M.; Kashid, D.N. Variable selection via penalized minimum φ-divergence estimation in logistic regression. J. Appl. Stat. 2014, 41, 1233–1246. [Google Scholar] [CrossRef]
- Hild, K.E.; Erdogmus, D.; Principe, J.C. Blind source separation using Renyi’s mutual information. IEEE Signal Process. Lett. 2001, 8, 174–176. [Google Scholar] [CrossRef]
- Mihoko, M.; Eguchi, S. Robust blind source separation by beta divergence. Neural Comput. 2002, 14, 1859–1886.
- Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imaging 2011, 30, 475–483.
- Hamza, A.B.; Krim, H. Image registration and segmentation by maximizing the Jensen-Rényi divergence. In Proceedings of the 4th International Workshop Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR 2003), Lisbon, Portugal, 7–9 July 2003; pp. 147–163.
- Liu, G.; Xia, G.; Yang, W.; Xue, N. SAR image segmentation via non-local active contours. In Proceedings of the 2014 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Quebec City, QC, Canada, 13–18 July 2014; pp. 3730–3733.
- Korzhik, V.; Fedyanin, I. Steganographic applications of the nearest-neighbor approach to Kullback-Leibler divergence estimation. In Proceedings of the 2015 Third International Conference on Digital Information, Networking, and Wireless Communications (DINWC), Moscow, Russia, 3–5 February 2015; pp. 133–138.
- Basseville, M. Divergence measures for statistical data processing–An annotated bibliography. Signal Process. 2013, 93, 621–633.
- Cichocki, A.; Amari, S. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
- Csiszar, I. Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hungar. 1967, 2, 299–318.
- Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B Stat. Methodol. 1966, 28, 131–142.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
- Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Rein. Angew. Math. 1909, 136, 210–271. (In German)
- Bhattacharyya, A. On a measure of divergence between two multinomial populations. Indian J. Stat. 1946, 7, 401–406.
- Silva, J.F.; Parada, P.A. Shannon entropy convergence results in the countable infinite case. In Proceedings of the 2012 IEEE International Symposium on Information Theory Proceedings (ISIT), Cambridge, MA, USA, 1–6 July 2012; pp. 155–159.
- Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193.
- Valiant, G.; Valiant, P. Estimating the unseen: An n/log (n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 685–694.
- Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885.
- Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Maximum likelihood estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2017, 63, 6774–6798.
- Valiant, G.; Valiant, P. The power of linear estimators. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), Palm Springs, CA, USA, 22–25 October 2011; pp. 403–412.
- Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253.
- Paninski, L. Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory 2004, 50, 2200–2203.
- Alba-Fernández, M.V.; Jiménez-Gamero, M.D.; Ariza-López, F.J. Minimum Penalized ϕ-Divergence Estimation under Model Misspecification. Entropy 2018, 20, 329.
- Ahmed, N.A.; Gokhale, D. Entropy expressions and their estimators for multivariate distributions. IEEE Trans. Inf. Theory 1989, 35, 688–692.
- Misra, N.; Singh, H.; Demchuk, E. Estimation of the entropy of a multivariate normal distribution. J. Multivar. Anal. 2005, 92, 324–342.
- Gupta, M.; Srivastava, S. Parametric Bayesian estimation of differential entropy and relative entropy. Entropy 2010, 12, 818–843.
- Li, S.; Mnatsakanov, R.M.; Andrew, M.E. K-nearest neighbor based consistent entropy estimation for hyperspherical distributions. Entropy 2011, 13, 650–667.
- Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Trans. Inf. Theory 2009, 55, 2392–2405.
- Darbellay, G.A.; Vajda, I. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Inf. Theory 1999, 45, 1315–1321.
- Silva, J.; Narayanan, S.S. Information divergence estimation based on data-dependent partitions. J. Stat. Plan. Inference 2010, 140, 3180–3198.
- Le, T.K. Information dependency: Strong consistency of Darbellay–Vajda partition estimators. J. Stat. Plan. Inference 2013, 143, 2089–2100.
- Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Trans. Inf. Theory 2005, 51, 3064–3074.
- Hero, A.O., III; Ma, B.; Michel, O.; Gorman, J. Applications of entropic spanning graphs. IEEE Signal Process. Mag. 2002, 19, 85–95.
- Moon, K.R.; Hero, A.O., III. Ensemble estimation of multivariate f-divergence. In Proceedings of the 2014 IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014; pp. 356–360.
- Moon, K.R.; Sricharan, K.; Greenewald, K.; Hero, A.O., III. Improving convergence of divergence functional ensemble estimators. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1133–1137.
- Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861.
- Krishnamurthy, A.; Kandasamy, K.; Poczos, B.; Wasserman, L. Nonparametric Estimation of Renyi Divergence and Friends. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 919–927.
- Singh, S.; Póczos, B. Generalized exponential concentration inequality for Rényi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 21–26 June 2014; pp. 333–341.
- Singh, S.; Póczos, B. Exponential Concentration of a Density Functional Estimator. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 3032–3040.
- Kandasamy, K.; Krishnamurthy, A.; Poczos, B.; Wasserman, L.; Robins, J. Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 397–405.
- Härdle, W. Applied Nonparametric Regression; Cambridge University Press: Cambridge, UK, 1990.
- Berlinet, A.; Devroye, L.; Györfi, L. Asymptotic normality of L1-error in density estimation. Statistics 1995, 26, 329–343.
- Berlinet, A.; Györfi, L.; Dénes, I. Asymptotic normality of relative entropy in multivariate density estimation. Publ. l’Inst. Stat. l’Univ. Paris 1997, 41, 3–27.
- Bickel, P.J.; Rosenblatt, M. On some global measures of the deviations of density function estimates. Ann. Stat. 1973, 1, 1071–1095.
- Sricharan, K.; Wei, D.; Hero, A.O., III. Ensemble estimators for multivariate entropy estimation. IEEE Trans. Inf. Theory 2013, 59, 4374–4388.
- Berrett, T.B.; Samworth, R.J.; Yuan, M. Efficient multivariate entropy estimation via k-nearest neighbour distances. arXiv 2017.
- Kozachenko, L.; Leonenko, N.N. Sample estimate of the entropy of a random vector. Probl. Peredachi Inf. 1987, 23, 9–16.
- Hansen, B.E.; (University of Wisconsin, Madison, WI, USA). Lecture Notes on Nonparametrics. 2009.
- Budka, M.; Gabrys, B.; Musial, K. On accuracy of PDF divergence estimators and their applicability to representative data sampling. Entropy 2011, 13, 1229–1266.
- Efron, B.; Stein, C. The jackknife estimate of variance. Ann. Stat. 1981, 9, 586–596.
- Wisler, A.; Moon, K.; Berisha, V. Direct ensemble estimation of density functionals. In Proceedings of the 2018 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
- Moon, K.R.; Sricharan, K.; Greenewald, K.; Hero, A.O., III. Nonparametric Ensemble Estimation of Distributional Functionals. arXiv 2016.
- Paul, F.; Arkin, Y.; Giladi, A.; Jaitin, D.A.; Kenigsberg, E.; Keren-Shaul, H.; Winter, D.; Lara-Astiaso, D.; Gury, M.; Weiner, A.; et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 2015, 163, 1663–1677.
- Kanehisa, M.; Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28, 27–30.
- Kanehisa, M.; Sato, Y.; Kawashima, M.; Furumichi, M.; Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2015, 44, D457–D462.
- Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2016, 45, D353–D361.
- Van Dijk, D.; Sharma, R.; Nainys, J.; Yim, K.; Kathail, P.; Carr, A.J.; Burdsiak, C.; Moon, K.R.; Chaffer, C.; Pattabiraman, D.; et al. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell 2018, 174, 716–729.
- Moon, K.R.; Sricharan, K.; Hero, A.O., III. Ensemble Estimation of Distributional Functionals via k-Nearest Neighbors. arXiv 2017.
- Durrett, R. Probability: Theory and Examples; Cambridge University Press: Cambridge, UK, 2010.
- Gut, A. Probability: A Graduate Course; Springer: Berlin/Heidelberg, Germany, 2012.
- Munkres, J. Topology; Prentice Hall: Englewood Cliffs, NJ, USA, 2000.
- Evans, L.C. Partial Differential Equations; American Mathematical Society: Providence, RI, USA, 2010.
- Gilbarg, D.; Trudinger, N.S. Elliptic Partial Differential Equations of Second Order; Springer: Berlin/Heidelberg, Germany, 2001.
Figure 1. Heat map showing the predicted bias of the divergence functional plug-in estimator based on Theorem 1 as a function of the dimension (d) and sample size (N) when . Note the phase transition in the bias as the dimension (d) increases for a fixed sample size (N); the bias remains small only for relatively small values of . The proposed weighted ensemble estimator EnDive eliminates this phase transition when the densities and the function g are sufficiently smooth.
Figure 2. The optimal weights from (6) when , , , and l are uniformly spaced between 1.5 and 3. The lowest values of l are given the highest weight. Thus, the minimum value of bandwidth parameters should be sufficiently large to render an adequate estimate.
Figure 3. (Left) Log–log plot of the MSE of the uniform kernel plug-in (“Kernel”) and the optimally weighted EnDive estimator for various dimensions and sample sizes. (Right) Plot of the true values being estimated compared to the average values of the same estimators with standard error bars. The proposed weighted ensemble estimator approached the theoretical rate (see Table 1), performed better than the plug-in estimator in terms of MSE, and was less biased.
Figure 4. Log–log plot of the average pointwise squared error between the KDE and for various dimensions and sample sizes using the same bandwidth and kernel as the standard plug-in estimators in Figure 3. The KDE and the density were compared at 10,000 points sampled from .
Figure 5. QQ-plots comparing the quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the Kullback–Leibler (KL) divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. These plots correspond to the same experiments as in Table 2 when N = 100 and N = 1000. The correspondence between quantiles is high for all cases.
Figure 6. Estimated upper (UB) and lower bounds (LB) on the Bayes error rate (BER) based on estimating the HP divergence between two 10-dimensional Gaussian distributions with identity covariance matrices and distances between means of 1 (left) and 3 (right), respectively. Estimates were calculated using EnDive, with error bars indicating the standard deviation from 400 trials. The upper bound was closer, on average, to the true BER when N was small (≈100–300) and the distance between the means was small. The lower bound was closer, on average, in all other cases.
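The reference quantities behind Figure 6 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes equal class priors (the caption does not state them), uses the closed-form Bayes error for two Gaussians with identity covariance, and uses the equal-prior form of the HP-divergence bounds from Berisha et al. The function names `gaussian_bayes_error` and `hp_ber_bounds` are hypothetical.

```python
import math


def gaussian_bayes_error(mean_distance):
    """True BER for two equal-prior Gaussians with identity covariance.

    The optimal rule thresholds the projection onto the line joining the
    two means, so the error equals Phi(-mean_distance / 2).
    """
    return 0.5 * (1.0 + math.erf(-mean_distance / (2.0 * math.sqrt(2.0))))


def hp_ber_bounds(hp_divergence):
    """Lower and upper BER bounds from an HP divergence value.

    Assumes equal priors: 1/2 - sqrt(D)/2 <= BER <= 1/2 - D/2.
    """
    # Clip to [0, 1], since an estimated divergence may leave this range.
    d = min(max(hp_divergence, 0.0), 1.0)
    return 0.5 - 0.5 * math.sqrt(d), 0.5 - 0.5 * d


# Distances between means used in the two panels of Figure 6.
for delta in (1.0, 3.0):
    print(f"distance {delta}: true BER = {gaussian_bayes_error(delta):.4f}")
```

Passing an estimated HP divergence to `hp_ber_bounds` would give the plotted LB and UB values under the equal-prior assumption; the divergence estimate itself would come from EnDive, which is not reproduced here.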
Table 1. Negative log–log slope of the EnDive mean squared error (MSE) as a function of the sample size for various dimensions. The slope was calculated beginning at . The negative slope was closer to 1 with than for indicating that the asymptotic rate had not yet taken effect at .
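A negative log–log slope of the kind reported in Table 1 can be computed with an ordinary least-squares fit of log(MSE) against log(N). The sketch below is only illustrative: the MSE values are synthetic rather than taken from the paper, and the helper name `negative_loglog_slope` is hypothetical.

```python
import math


def negative_loglog_slope(sample_sizes, mse_values):
    """Least-squares slope of log(MSE) against log(N), with the sign flipped."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(m) for m in mse_values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic (not from the paper) MSE values that decay exactly as 5 / N.
sizes = [100, 200, 400, 800, 1600]
mse = [5.0 / n for n in sizes]
print(negative_loglog_slope(sizes, mse))
```

A value near 1 corresponds to the parametric MSE rate of 1/N, which is why the caption compares the slopes with 1.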
Table 2. Comparison between quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the KL divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. The parameter gives the correlation coefficient between the quantiles, while is the estimated slope between the quantiles. The correspondence between quantiles was very high for all cases.
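The two summary statistics in Table 2 can be sketched as below. This assumes the estimates are centered and scaled by their own sample mean and standard deviation, that normal quantiles are taken at the plotting positions (i + 0.5)/n, and that the slope comes from regressing the estimator quantiles on the normal quantiles; the paper may use different conventions. The function name `qq_correlation_and_slope` is hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev


def qq_correlation_and_slope(estimates):
    """Correlation and least-squares slope of a normal QQ-plot."""
    n = len(estimates)
    mu, sigma = mean(estimates), pstdev(estimates)
    # Centered and scaled estimator values, sorted into empirical quantiles.
    sample_q = sorted((x - mu) / sigma for x in estimates)
    # Standard normal quantiles at the matching plotting positions.
    normal_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    nq_bar, sq_bar = mean(normal_q), mean(sample_q)
    sxy = sum((a - nq_bar) * (b - sq_bar) for a, b in zip(normal_q, sample_q))
    sxx = sum((a - nq_bar) ** 2 for a in normal_q)
    syy = sum((b - sq_bar) ** 2 for b in sample_q)
    return sxy / math.sqrt(sxx * syy), sxy / sxx


# Synthetic check: values that are an exact affine map of normal quantiles.
synthetic = [2.0 + 3.0 * NormalDist().inv_cdf((i + 0.5) / 1000) for i in range(1000)]
corr, slope = qq_correlation_and_slope(synthetic)
print(corr, slope)
```

For data that follow a normal distribution, both values should be close to 1, which is the sense in which the caption describes the correspondence between quantiles as very high.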
Table 3. Misclassification rate of a quadratic discriminant analysis (QDA) classifier and estimated upper bounds (UB) and lower bounds (LB) of the pairwise BER between mouse bone marrow cell types using the Henze–Penrose divergence applied to different combinations of genes selected from the KEGG pathways associated with the hematopoietic cell lineage. Results are presented as percentages in the form of mean ± standard deviation. Based on these results, erythrocytes are relatively easy to distinguish from the other two cell types using these gene sets.
|Cell type pair|Rows reported|
|Eryth. vs. Mono.|LB, UB, Prob. Error|
|Eryth. vs. Baso.|LB, UB, Prob. Error|
|Baso. vs. Mono.|LB, UB, Prob. Error|
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).