Entropy 2013, 15(5), 1609–1623; doi:10.3390/e15051609
Article
A Novel Nonparametric Distance Estimator for Densities with Error Bounds
^{1}
Instituto de Engenharia Mecânica e Gestão Industrial, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
^{2}
Computational Neuro Engineering Laboratory, University of Florida, EB451 Engineering Building, University of Florida, Gainesville, FL 32611, USA
^{*}
Author to whom correspondence should be addressed.
Received: 19 December 2012; in revised form: 25 April 2013 / Accepted: 28 April 2013 / Published: 6 May 2013
Abstract
The use of a metric to assess the distance between probability densities is an important practical problem. In this work, a particular metric induced by an α-divergence is studied. The Hellinger metric can be interpreted as a particular case within the framework of generalized Tsallis divergences and entropies. The nonparametric Parzen density estimator emerges as a natural candidate to estimate the underlying probability density function, since it may account for data from different groups, or experiments with distinct instrumental precisions, i.e., non-independent and identically distributed (non-i.i.d.) data. However, the information-theoretic metric derived from the nonparametric Parzen density estimator displays infinite variance, limiting the direct use of resampling estimators. Based on measure theory, we present a change of measure to build a finite-variance density, allowing the use of resampling estimators. In order to counteract the poor scaling with dimension, we propose a new nonparametric two-stage robust resampling estimator of Hellinger's metric error bounds for heteroscedastic data. The approach presents very promising results, allowing the use of different covariances for different clusters with impact on the distance evaluation.
Keywords:
generalized differential entropies; generalized differential divergences; Tsallis entropy; Hellinger metric; nonparametric estimators; heteroscedastic data
PACS Codes:
02.50.-r; 02.50.Cw; 89.70.-a; 89.70.Cf
1. Introduction
Distance measures between two probability densities have been extensively studied over the last century [1]. These measures address two main objectives: to quantify how difficult it is to distinguish one pair of densities in the context of others, and to assess the closeness of two densities compared to others [2]. In learning scenarios essentially associated with the test of a single hypothesis, the use of a divergence to represent the notion of distance is efficient. However, in scenarios involving multiple hypotheses, such as clustering, image retrieval, pattern recognition or signal detection, the non-symmetric and non-metric nature of divergences becomes problematic [3]. When deciding the closest or the farthest among three or more clusters, the use of a metric is important. In this work, a novel nonparametric metric estimator for densities with error bounds is presented. Shannon's entropy has a central role in information-theoretic studies. However, the concept of information is so rich that perhaps no single definition will be able to quantify it properly [4]. The idea of using information-theoretic functionals, such as entropies or divergences, in statistical inference is not new. In fact, the so-called statistical information theory has been the subject of much research over the last half century [5]. How to measure the distance between two densities is an open problem with several proposals: the work of Hellinger in 1909 with Hellinger's distance [1], Kullback and Leibler (1951) with the Kullback–Leibler divergence [6], Bregman (1967) with the Bregman divergence [7], Jeffreys (1974) with the J-distance [8], Rao (1985) and Jianhua Lin (1991) with the Jensen–Shannon divergence [9,10], Menéndez et al. (1997) with the (h, Φ)-entropy differential metric [11], and Seth and Principe (2008) with correntropy [12], among others. This work looks into Hellinger's metric, which is the preferred [13,14], or natural, model metric [15].
In 2007, Puga showed that Hellinger's metric is a particular α-divergence [16]. Here, we propose a new change of measure to solve the nonparametric metric estimation problem, together with a two-stage robust estimator with error bounds.
2. Theory Background
Following Hartley's (1928) and Shannon's (1948) works [17,18], Alfréd Rényi introduced in 1960 the generalized α-entropy [19] of a probability density function $f\left(x\right)$:
$${R}_{\alpha}\left(f\right)=\frac{1}{1-\alpha}\mathrm{ln}\int f{\left(x\right)}^{\alpha}dx,\quad \alpha >0$$
The corresponding generalized differential divergence between two densities
${f}_{1}\left(x\right)$ and ${f}_{2}\left(x\right)$
is:
$${D}_{\alpha}^{R}\left({f}_{1},{f}_{2}\right)=\frac{1}{\alpha -1}\mathrm{ln}\int \frac{{f}_{1}{\left(x\right)}^{\alpha}}{{f}_{2}{\left(x\right)}^{\alpha -1}}dx$$
Gell-Mann and Tsallis considered another family of α-entropies [20]:
$${T}_{\alpha}\left(f\right)=\frac{1}{\alpha -1}\left[1-\int f{\left(x\right)}^{\alpha}dx\right]$$
the corresponding α-divergences being given as:
$${D}_{\alpha}^{T}\left({f}_{1},{f}_{2}\right)=\frac{1}{1-\alpha}\left[1-\int \frac{{f}_{1}{\left(x\right)}^{\alpha}}{{f}_{2}{\left(x\right)}^{\alpha -1}}dx\right]$$
Making $\alpha \to 1$, one can easily conclude that:
$$\underset{\alpha \to 1}{\mathrm{lim}}{R}_{\alpha}\left(f\right)=\underset{\alpha \to 1}{\mathrm{lim}}{T}_{\alpha}\left(f\right)={H}_{S}\left(f\right)$$
and:
$$\underset{\alpha \to 1}{\mathrm{lim}}{D}_{\alpha}^{R}\left({f}_{1},{f}_{2}\right)=\underset{\alpha \to 1}{\mathrm{lim}}{D}_{\alpha}^{T}\left({f}_{1},{f}_{2}\right)={D}_{KL}\left({f}_{1},{f}_{2}\right)$$
where:
$${H}_{S}\left(f\right)=-\int f\left(x\right)\mathrm{ln}f\left(x\right)dx$$
is Shannon's differential entropy and:
$${D}_{KL}\left({f}_{1},{f}_{2}\right)=\int {f}_{1}\left(x\right)\mathrm{ln}\frac{{f}_{1}\left(x\right)}{{f}_{2}\left(x\right)}dx$$
is the Kullback–Leibler divergence.
Another member of these families is Rényi's quadratic entropy (α = 2), defined as:
$${R}_{2}\left(f\right)=-\mathrm{ln}\int f{\left(x\right)}^{2}dx$$
while the respective divergence is:
$${D}_{2}^{R}\left({f}_{1},{f}_{2}\right)=\mathrm{ln}\int \frac{{f}_{1}{\left(x\right)}^{2}}{{f}_{2}\left(x\right)}dx$$
Rényi's quadratic entropy, given by Equation (9), is particularly interesting because it admits a closed-form nonparametric estimator, saving computational time compared to numerical integration or resampling [21,22].
The α-entropy families given by Equations (1) and (3) are monotonically coupled (Ramshaw [23]) through:
$${T}_{\alpha}=\left({e}^{\left(1-\alpha \right){R}_{\alpha}}-1\right)/\left(1-\alpha \right)$$
Therefore, an optimization in one family has equivalence in the other.
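The coupling above can be verified numerically on a discrete distribution. The following is an illustrative sketch (the probability vector and function names are ours, not from the paper; entropies are in nats):

```python
import numpy as np

# Illustrative check of the Renyi-Tsallis coupling on a discrete distribution.
p = np.array([0.5, 0.25, 0.125, 0.125])

def renyi(p, alpha):
    """Renyi alpha-entropy: ln(sum p^alpha) / (1 - alpha)."""
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis(p, alpha):
    """Tsallis alpha-entropy: (1 - sum p^alpha) / (alpha - 1)."""
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

for alpha in (0.5, 2.0, 3.0):
    # T_alpha = (e^{(1 - alpha) R_alpha} - 1) / (1 - alpha)
    coupled = (np.exp((1.0 - alpha) * renyi(p, alpha)) - 1.0) / (1.0 - alpha)
    assert np.isclose(tsallis(p, alpha), coupled)
```

The same check passes for any α > 0, α ≠ 1, since the coupling is an exact algebraic identity between the two families.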
2.1. Square-Root Entropy
Let us consider
$\alpha =1/2$
in Equations (1)–(4). Then, the square-root entropy in the Tsallis form is:
$${T}_{1/2}\left(f\right)=2\int \sqrt{f\left(x\right)}dx-2$$
with the corresponding divergence given as:
$${D}_{1/2}^{T}\left({f}_{1},{f}_{2}\right)=2-2\int \sqrt{{f}_{1}\left(x\right){f}_{2}\left(x\right)}dx$$
In Rényi's form one finds, respectively, the entropy:
$${R}_{1/2}\left(f\right)=2\mathrm{ln}\int \sqrt{f\left(x\right)}dx,$$
and the divergence:
$${D}_{1/2}^{R}\left({f}_{1},{f}_{2}\right)=-2\mathrm{ln}\int \sqrt{{f}_{1}\left(x\right){f}_{2}\left(x\right)}dx$$
It should be noted that, from Equation (13), one obtains:
$$\sqrt{{D}_{1/2}^{T}\left({f}_{1},{f}_{2}\right)}=\sqrt{\int {\left(\sqrt{{f}_{1}\left(x\right)}-\sqrt{{f}_{2}\left(x\right)}\right)}^{2}dx}=M\left({f}_{1},{f}_{2}\right)$$
where $M\left({f}_{1},{f}_{2}\right)$ is an information-theoretic metric that, among other properties, verifies the triangle inequality. This particular α-divergence, by means of a monotonous transformation, induces Hellinger's distance, which is a metric [13,14,24]:
$$M\left({f}_{1},{f}_{2}\right)=\sqrt{2-2I({f}_{1},{f}_{2})}$$
On the other hand, the information-theoretic metrics given by Equations (15) and (16) are also related to Hellinger's affinity, or Bhattacharyya's coefficient ($0\le I\left({f}_{1},{f}_{2}\right)\le 1$):
$$I\left({f}_{1},{f}_{2}\right)={\displaystyle \int \sqrt{{f}_{1}\left(x\right){f}_{2}\left(x\right)}dx}.$$
Considering the expected cross-value $C\left({f}_{1},{f}_{2}\right)$ of two probability density functions:
$$C\left({f}_{1},{f}_{2}\right)={E}_{{f}_{1}}\left({f}_{2}\right)={E}_{{f}_{2}}\left({f}_{1}\right)=\int {f}_{1}\left(x\right){f}_{2}\left(x\right)dx$$
the Hellinger's affinity given by Equation (18) can then be written as:
$$I\left({f}_{1},{f}_{2}\right)=\sqrt{C\left({f}_{1},{f}_{2}\right)}\int \sqrt{{f}_{\mathrm{\Omega}}\left(x\right)}dx=\sqrt{C\left({f}_{1},{f}_{2}\right)}\,H\left({f}_{\mathrm{\Omega}}\right)$$
where ${f}_{\mathrm{\Omega}}\left(x\right)$ is the normalized product density:
$${f}_{\mathrm{\Omega}}\left(x\right)=\frac{{f}_{1}\left(x\right){f}_{2}\left(x\right)}{C\left({f}_{1},{f}_{2}\right)}$$
and $H\left({f}_{\mathrm{\Omega}}\right)$ is the corresponding entropy term of the information-theoretic metric.
This metric has bounds that can be directly computed from the samples, as shown by Puga [16]. These bounds often present overlapping hypothesis intervals, and resampling estimation is a necessary tool to remove ambiguities and assess distances between densities.
2.2. Nonparametric Hellinger’s Affinity Estimation
Let us focus on the application of the previous measures to two Parzen nonparametric densities [25] built from two data clusters $C{l}^{(1)}=\left\{{x}_{1}^{\left(1\right)},{x}_{2}^{\left(1\right)},\dots,{x}_{{N}_{1}}^{\left(1\right)}\right\}$ and $C{l}^{(2)}=\left\{{x}_{1}^{\left(2\right)},{x}_{2}^{\left(2\right)},\dots,{x}_{{N}_{2}}^{\left(2\right)}\right\}$:
$${f}_{1}\left(x\right)=\frac{1}{{N}_{1}}\sum _{j=1}^{{N}_{1}}G\left(x,{\sigma}_{1},{x}_{j}^{\left(1\right)}\right)$$
and:
$${f}_{2}\left(x\right)=\frac{1}{{N}_{2}}\sum _{j=1}^{{N}_{2}}G\left(x,{\sigma}_{2},{x}_{j}^{\left(2\right)}\right)$$
where $G\left(x,\sigma ,\mu \right)$ is the Parzen Gaussian kernel, with bandwidth $\sigma$ (covariance approximated by ${\sigma}^{2}I$) and mean $\mu$, given as:
$$G\left(x,\sigma ,\mu \right)=\prod _{i=1}^{\aleph}\frac{1}{\sqrt{2\pi {\sigma}^{2}}}{e}^{-\frac{{\left(x\left(i\right)-\mu (i)\right)}^{2}}{2{\sigma}^{2}}}$$
where $\aleph$ is the dimension. Notice that the two clusters in Equations (22) and (23) may have two different Gaussian kernel covariances. The kernel covariance may be obtained directly from a priori knowledge of the instruments used to produce the data; for instance, two different instruments with different precisions may produce the same data, but the densities should reflect the measurement errors through the bandwidths (covariances). Without instrumental a priori knowledge, the kernel bandwidth can be estimated with a suitable method, such as k-Nearest Neighbor (k-NN), Silverman's [25] or Scott's [26].
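A minimal sketch of this Parzen estimator follows (the function names are ours; the kernel is the isotropic Gaussian of Equation (24)):

```python
import numpy as np

# Sketch of the Parzen density of Equations (22)-(24); the isotropic Gaussian
# kernel G(x, sigma, mu) has covariance sigma^2 * I over the aleph dimensions.
def gaussian_kernel(x, sigma, mu):
    x, mu = np.atleast_1d(x), np.atleast_1d(mu)
    norm = (2.0 * np.pi * sigma ** 2) ** (-len(x) / 2.0)
    return norm * np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

def parzen_density(x, cluster, sigma):
    # f(x) = (1/N) * sum_j G(x, sigma, x_j) over the cluster samples
    return float(np.mean([gaussian_kernel(x, sigma, xj) for xj in cluster]))
```

Each cluster may carry its own `sigma`, reflecting the precision of the instrument that produced it, which is exactly the heteroscedastic setting discussed above.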
Now, let us adopt the summing convention $\sum _{i,j}\equiv \sum _{i=1}^{{N}_{1}}\sum _{j=1}^{{N}_{2}}$ and define the following auxiliary variables:
$${\sigma}^{2}=\left({\sigma}_{1}^{2}+{\sigma}_{2}^{2}\right)/2$$
$${\sigma}_{*}^{2}={\sigma}_{1}^{2}{\sigma}_{2}^{2}/2{\sigma}^{2}$$
$${s}_{i,j}=\left({\sigma}_{2}^{2}{x}_{i}^{\left(1\right)}+{\sigma}_{1}^{2}{x}_{j}^{\left(2\right)}\right)/2{\sigma}^{2}$$
$${d}_{i,j}=\left({x}_{i}^{\left(1\right)}-{x}_{j}^{\left(2\right)}\right)/2$$
and:
$${F}_{D}\left({d}_{i,j}\right)={e}^{-\frac{{\Vert {d}_{i,j}\Vert}^{2}}{{\sigma}^{2}}}\Big/\sum _{k,l}{e}^{-\frac{{\Vert {d}_{k,l}\Vert}^{2}}{{\sigma}^{2}}}$$
The nonparametric estimator ${\widehat{f}}_{\mathrm{\Omega}}\left(\omega \right)$ results in:
$${\widehat{f}}_{\mathrm{\Omega}}\left(\omega \right)=\sum _{i,j}{F}_{D}\left({d}_{i,j}\right)G\left(\omega ,{\sigma}_{*},{s}_{i,j}\right)$$
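The auxiliary variables above can be sketched as follows, assuming clusters stored as `(N, dim)` arrays (an illustrative implementation; the names are ours):

```python
import numpy as np

# Sketch of the auxiliary variables: pairwise means s_ij, half-differences d_ij
# and discrete weights F_D used by the normalized-product estimator f_Omega.
def product_density_components(cl1, cl2, sigma1, sigma2):
    sigma_sq = (sigma1**2 + sigma2**2) / 2.0                  # mean variance
    sigma_star_sq = sigma1**2 * sigma2**2 / (2.0 * sigma_sq)  # product bandwidth
    s = (sigma2**2 * cl1[:, None, :] + sigma1**2 * cl2[None, :, :]) / (2.0 * sigma_sq)
    d = (cl1[:, None, :] - cl2[None, :, :]) / 2.0             # half-differences
    w = np.exp(-np.sum(d**2, axis=-1) / sigma_sq)             # unnormalized F_D
    w /= w.sum()                                              # normalize weights
    return sigma_star_sq, s.reshape(-1, cl1.shape[1]), w.ravel()
```

The estimator ${\widehat{f}}_{\mathrm{\Omega}}$ is then simply the Gaussian mixture with means $s_{i,j}$, weights $F_D(d_{i,j})$ and bandwidth $\sigma_*$.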
2.3. The Resampling Estimator
The bootstrap resampling is achieved through the probability distribution given by Equation (30), combined with the random generation of samples $\left({\omega}_{k}\right)$ from the nonparametric Parzen density with diagonal covariance, which is a well-established and computationally efficient procedure [27]. The synthesized samples are then directly usable in the estimator:
$$\begin{array}{ll}H\left({f}_{\mathrm{\Omega}}\right)& =\int \sqrt{{f}_{\mathrm{\Omega}}\left(\omega \right)}\,d\omega =\int \frac{{f}_{\mathrm{\Omega}}\left(\omega \right)}{\sqrt{{f}_{\mathrm{\Omega}}\left(\omega \right)}}\,d\omega \\ & ={E}_{{f}_{\mathrm{\Omega}}}\left[\frac{1}{\sqrt{{f}_{\mathrm{\Omega}}\left(\omega \right)}}\right]=\underset{K\to \infty}{\mathrm{lim}}{\tilde{H}}_{K}\left({f}_{\mathrm{\Omega}}\right)=\underset{K\to \infty}{\mathrm{lim}}\frac{1}{K}\sum _{k=1}^{K}\frac{1}{\sqrt{{\widehat{f}}_{\mathrm{\Omega}}\left({\omega}_{k}\right)}}\end{array}$$
with ${\omega}_{k}\overset{i.i.d.}{\sim}{f}_{\mathrm{\Omega}}$.
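For intuition, the estimator ${\tilde{H}}_{K}$ above can be sketched for the simplest case where $f_\Omega$ reduces to a single Gaussian, so sampling is direct (illustrative values; the mean and bandwidth are ours):

```python
import numpy as np

# Naive bootstrap estimate of H(f_Omega) = E[1 / sqrt(f_Omega)] for an
# illustrative single-Gaussian f_Omega (mean s, variance sigma_star^2).
rng = np.random.default_rng(0)
s, sigma_star = 0.5, np.sqrt(0.5)

def f_omega(w):
    return np.exp(-(w - s)**2 / (2.0 * sigma_star**2)) / np.sqrt(2.0 * np.pi * sigma_star**2)

K = 100_000
omega = rng.normal(s, sigma_star, size=K)          # omega_k ~ f_Omega (i.i.d.)
H_tilde = np.mean(1.0 / np.sqrt(f_omega(omega)))   # converges slowly: heavy tails
```

For this Gaussian the exact value is $(8\pi\sigma_*^2)^{1/4}\approx 1.88$; the heavy-tailed summand $1/\sqrt{f_\Omega}$ is precisely the infinite-variance problem addressed next.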
However, the use of Equation (31) is associated with serious practical difficulties, because the second moment is infinite:
$$\int d\omega -H{\left({f}_{\mathrm{\Omega}}\right)}^{2}=\infty $$
so the estimator has infinite variance, a condition under which the central limit theorem is not valid. In this work, we use measure theory and propose the following change of measure:
$$z={f}_{\mathrm{\Omega}}\left(\omega \right)$$
with the associated density ${f}_{Z}(z)$:
$${E}_{{f}_{\mathrm{\Omega}}}\left[\frac{1}{\sqrt{{f}_{\mathrm{\Omega}}\left(\omega \right)}}\right]={E}_{{f}_{Z}}\left[\frac{1}{\sqrt{z}}\right]=\underset{K\to \infty}{\mathrm{lim}}\sum _{0}^{{z}_{k}^{\mathrm{max}}}\frac{1}{\sqrt{{z}_{k}}}{\widehat{f}}_{Z}({z}_{k})\,\Delta z,\quad {z}_{k}\overset{i.i.d.}{\sim}{f}_{Z}$$
This new density presents a finite second moment:
$$\underset{0}{\overset{{z}^{\mathrm{max}}}{\int}}\frac{1}{z}{f}_{Z}(z)\,dz-H{\left({f}_{\mathrm{\Omega}}\right)}^{2}<\infty $$
since ${f}_{Z}$ has limited support between 0 (zero) and ${z}^{\mathrm{max}}$, with an abrupt jump at the ${z}^{\mathrm{max}}$ end. However, the approximation properties of a histogram are not affected by a simple jump at the end of the density [26]; hence, the histogram estimator was used to estimate ${\widehat{f}}_{Z}({z}_{k})$ with ${z}_{k}={\widehat{f}}_{\mathrm{\Omega}}({\omega}_{k})$.
The product probability density function $\left({f}_{Z}\right)$ must be estimated from the random variable $z={f}_{\mathrm{\Omega}}\left(\omega \right)$, but it ensures finite variance, which is a requisite of the central limit theorem, so the t-student confidence interval of Equation (39) may be used.
Figure 1.
Hellinger's metric between the two respective densities $\left({f}_{1},{f}_{2}\right)$ as the ${\sigma}_{1}/{\sigma}_{2}$ coefficient varies (logarithmic scale).
To test the algorithm, we consider the simplest case of Hellinger’s metric (17) associated with the nonparametric densities of Equations (22) and (23). In this particular case, we have access to the analytical value of Hellinger’s metric:
$$\tilde{M}\left({f}_{1},{f}_{2}\right)=\sqrt{2-2{\left(\frac{2{\sigma}_{1}{\sigma}_{2}}{{\sigma}_{1}^{2}+{\sigma}_{2}^{2}}\right)}^{\frac{\aleph}{2}}{e}^{-\frac{{\Vert {d}_{1,2}\Vert}^{2}}{{\sigma}^{2}}}}$$
Using Equations (34) and (36), we can quantify the computational behavior of the resampling estimator. Let us first consider the behavior of the Parzen density estimator with two distinct kernel sizes, ${\sigma}_{1}$ and ${\sigma}_{2}$. In the simplest case, with only two kernels located at the same coordinates, different Parzen windows in Equation (36) provide different distances despite the common location, as can be observed in Figure 1. It is possible to verify the symmetric behavior of the distance estimator and to realize that the bandwidth of the Parzen kernel is important to assess the distance between clusters. This is a relevant characteristic, especially when the experimental data have different instrumental origins with different measurement precisions; the use of different bandwidths in the Parzen kernels may reflect this important feature of the density, which implies that the data are heteroscedastic.
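For two co-located kernels ($d_{1,2}=0$), the exponential factor in Equation (36) is 1 and the closed form depends only on the bandwidth ratio; a sketch (function name is ours):

```python
import numpy as np

# Equation (36) at d = 0: Hellinger's metric between two co-located Gaussian
# kernels, driven only by the bandwidth ratio sigma1/sigma2 (Figure 1 behavior).
def hellinger_same_location(sigma1, sigma2, dim=1):
    affinity = (2.0 * sigma1 * sigma2 / (sigma1**2 + sigma2**2)) ** (dim / 2.0)
    return float(np.sqrt(2.0 - 2.0 * affinity))
```

The metric is symmetric in the two bandwidths, zero when they match, and grows with their ratio, reproducing the behavior shown in Figure 1.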
To quantify the error-bound estimation performance, we propose the generation of N_{1} distance samples ${\tilde{M}}_{m}$ from resampling the density of Equation (34); to estimate ${\stackrel{~}{f}}_{Z}$, we use a discrete histogram with N_{1} bins, obtaining the ordered ${z}_{k}$ and ${\stackrel{~}{f}}_{{Z}_{k}}$. As such, the metric $\tilde{M}$ estimator becomes:
$${\tilde{M}}_{m}={\left\{\sqrt{2-2\sqrt{C\left({f}_{1},{f}_{2}\right)}\frac{1}{\sqrt{{z}_{k}}}{\stackrel{~}{f}}_{Z}({z}_{k})\Delta z}\right\}}_{m=1\dots {N}_{1}}$$
which can be written as:
$$\tilde{M}=\sqrt{2-2\sqrt{C\left({f}_{1},{f}_{2}\right)}\sum _{0}^{{z}_{k}^{\mathrm{max}}}\frac{1}{\sqrt{{z}_{k}}}{\stackrel{~}{f}}_{Z}({z}_{k})\Delta z}$$
To assess the error-bound estimation, we use the t-student 95% confidence interval of Equation (39), which is based on a maximum entropy distribution [28,29], provides a parametric approach to robust statistics [30], and allows the following calculation of the confidence limits:
$$\left[L,U\right]=\tilde{M}\pm {t}_{{N}_{1}-1,0.5+0.95/2}\sqrt{\frac{1}{{N}_{1}({N}_{1}-1)}\sum _{m=1}^{{N}_{1}}{\left({\tilde{M}}_{m}-\tilde{M}\right)}^{2}}$$
We calculate the 95% confidence limits, the upper ($U$) and the lower ($L$) for the respective density resampling. The variance of this new estimator is well controlled in one dimension. The unexpected drawback of this estimator is its poor scaling performance with increased dimension, as depicted in Figure 2.
The new variable $z={f}_{\mathrm{\Omega}}\left(\omega \right)$ may be seen as a projection of the multidimensional Parzen kernels onto a one-dimensional function. This insight allowed the design of a two-stage estimator for ${f}_{\mathrm{\Omega}}\left(\omega \right)$ that circumvents both problems: infinite variance and poor scalability with dimensionality.
2.4. The Two-Stage Resampling Estimator
We propose the generation of N_{1} distance samples
${{\tilde{M}}_{k}}^{(n)}$
from resampling the density
${f}_{\mathrm{\Omega}}({\omega}_{k})$, which constitutes one trial
$(n)$:
$${{\tilde{M}}_{k}}^{(n)}={\left\{\sqrt{2-2\frac{\sqrt{C\left({f}_{1},{f}_{2}\right)}}{\sqrt{{f}_{\mathrm{\Omega}}\left({\omega}_{k}^{(n)}\right)}}}\right\}}_{k=1\dots {N}_{1}}$$
It is possible to estimate
${\tilde{M}}^{(n)}$
as:
$${\tilde{M}}^{(n)}=\sqrt{2-\frac{2}{{N}_{1}}\sum _{k=1}^{{N}_{1}}\sqrt{\frac{C\left({f}_{1},{f}_{2}\right)}{{f}_{\mathrm{\Omega}}\left({\omega}_{k}^{(n)}\right)}}}$$
For each trial
$(n)$, the 95% confidence limits, the upper
${U}^{(n)}$ and the lower
${L}^{(n)}$
for the respective density resampling, can be calculated:
$$\left[{L}^{(n)},{U}^{(n)}\right]={\tilde{M}}^{(n)}\pm {t}_{{N}_{1}-1,0.5+0.95/2}\sqrt{\frac{1}{{N}_{1}({N}_{1}-1)}\sum _{k=1}^{{N}_{1}}{\left({{\tilde{M}}_{k}}^{(n)}-{\tilde{M}}^{(n)}\right)}^{2}}$$
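One first-stage trial, Equations (40)–(42), can be sketched for the single-kernel case, where $C(f_1,f_2)$ and $f_\Omega$ are known in closed form (illustrative values; the clip guards the square root against heavy-tail samples, which is exactly the pathology the second stage addresses, and the t-quantile is an approximation in place of `scipy.stats.t.ppf`):

```python
import numpy as np

# One trial of the first-stage estimator for two 1-D kernels at x1 = 0, x2 = 1
# with equal bandwidth; C and f_Omega are computed in closed form here.
rng = np.random.default_rng(2)
x1, x2, sig, N1 = 0.0, 1.0, 1.0, 100
sigma_sq = (sig**2 + sig**2) / 2.0
sigma_star = np.sqrt(sig**4 / (2.0 * sigma_sq))
s = (x1 + x2) / 2.0
C = np.exp(-((x1 - x2) / 2.0)**2 / sigma_sq) / np.sqrt(4.0 * np.pi * sigma_sq)

def f_omega(w):
    return np.exp(-(w - s)**2 / (2.0 * sigma_star**2)) / np.sqrt(2.0 * np.pi * sigma_star**2)

omega = rng.normal(s, sigma_star, size=N1)
ratio = np.sqrt(C / f_omega(omega))
M_k = np.sqrt(np.clip(2.0 - 2.0 * ratio, 0.0, None))   # per-sample values, Eq (40)
M_n = np.sqrt(max(2.0 - 2.0 * ratio.mean(), 0.0))      # trial estimate, Eq (41)
t975 = 1.984                    # ~ t_{99, 0.975}; scipy.stats.t.ppf gives the exact value
half = t975 * np.sqrt(np.sum((M_k - M_n)**2) / (N1 * (N1 - 1)))
L_n, U_n = M_n - half, M_n + half                      # confidence limits, Eq (42)
```

The occasional very small $f_\Omega(\omega_k)$ values inflate `half`, producing the large intervals visible in Figure 3.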
It may seem that this step is enough to estimate the metric $\tilde{M}\left({f}_{1},{f}_{2}\right)=\sqrt{2-2\tilde{I}({f}_{1},{f}_{2})}$, but the theoretically predicted undesired behavior associated with Equation (32), with large confidence intervals, is present in this estimator. To demonstrate this drawback, we have simulated 100 trials of the simplest case of the nonparametric Hellinger metric, Equation (36), with Euclidean distance $d=\Vert {x}_{1}-{x}_{2}\Vert =1$. As can be observed in Figure 3, large confidence intervals do occur; hence the motivation for the two-stage error-bound estimator.
Figure 3.
t-Student 95% confidence intervals for Hellinger's metric, marked by dots; the exact value is represented by a continuous line; the predicted large intervals are marked with triangles; and the mis-estimated intervals are marked with circles.
To achieve a robust error-bound estimator $\left[{\tilde{L}}_{R},{\tilde{U}}_{R}\right]$ for the expected value of $M({f}_{1},{f}_{2})$, with results similar to those of ${f}_{Z}(z)$ in one dimension, we propose a new two-stage method. Comparing the results of the two density resamplings, we found that 31 selected trials out of 33 from ${\widehat{f}}_{\mathrm{\Omega}}({\omega}_{k})$ were in good agreement with ${\tilde{f}}_{Z}(z)$. With 33 trials $(n)$, each generated with N_{1} random samples as:
$${\left\{{\left\{\sqrt{2-2\frac{\sqrt{C\left({f}_{1},{f}_{2}\right)}}{\sqrt{{\widehat{f}}_{\mathrm{\Omega}}\left({\omega}_{k}^{(n)}\right)}}}\right\}}_{k=1\dots {N}_{1}}\right\}}_{n=1\dots 33}$$
sorting the amplitudes $\left|{U}^{(n)}-{L}^{(n)}\right|$ and keeping the 31 smallest intervals with the corresponding metric estimates $({{\tilde{M}}_{s}}^{(n)})$, we obtain the second-stage estimator ${\tilde{M}}_{s}({f}_{1},{f}_{2})$:
$${\tilde{M}}_{s}({f}_{1},{f}_{2})=\frac{1}{31}\sum _{n=1}^{31}{{\tilde{M}}_{s}}^{(n)}$$
Then, we calculated the respective t-student 95% confidence interval $\left[{\tilde{L}}_{s},{\tilde{U}}_{s}\right]$ with the selected trials ${{\tilde{M}}_{s}}^{(n)}$. To overcome the mis-estimated intervals, we defined a second estimator for the lower limit of the interval $({\tilde{L}}_{2})$ and a second estimator for the upper limit of the interval $({\tilde{U}}_{2})$:
$$\left[{\tilde{L}}_{2},{\tilde{U}}_{2}\right]=\left[\frac{1}{31}\sum _{n=1}^{31}{L}_{s}^{(n)},\frac{1}{31}\sum _{n=1}^{31}{U}_{s}^{(n)}\right]$$
which is a potentially asymmetric interval, guided by the selected first-stage interval limits.
The robust estimator for the lower limit of the interval (${\tilde{L}}_{R}$) and the robust estimator for the upper limit of the interval (${\tilde{U}}_{R}$) were defined as:
$$\left[{\tilde{L}}_{R},{\tilde{U}}_{R}\right]=\left[\mathrm{min}\left({\tilde{L}}_{s},{\tilde{L}}_{2}\right),\mathrm{max}\left({\tilde{U}}_{s},{\tilde{U}}_{2}\right)\right]$$
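Putting the pieces together, the two-stage procedure of Equations (43)–(46) can be sketched as follows (an illustrative single-kernel setup; 33 trials with the 31 smallest interval amplitudes kept, and approximate t-quantiles in place of `scipy.stats.t.ppf`):

```python
import numpy as np

# Two-stage robust interval for an illustrative 1-D single-kernel f_Omega;
# the exact Hellinger value in this setup is sqrt(2 - 2 * exp(-1/8)).
rng = np.random.default_rng(3)
x1, x2, sig, N1 = 0.0, 1.0, 1.0, 200
sigma_sq = (sig**2 + sig**2) / 2.0
sigma_star = np.sqrt(sig**4 / (2.0 * sigma_sq))
s = (x1 + x2) / 2.0
C = np.exp(-((x1 - x2) / 2.0)**2 / sigma_sq) / np.sqrt(4.0 * np.pi * sigma_sq)

def f_omega(w):
    return np.exp(-(w - s)**2 / (2.0 * sigma_star**2)) / np.sqrt(2.0 * np.pi * sigma_star**2)

t1, t2 = 1.972, 2.042            # ~ t_{199,0.975} and t_{30,0.975} (approximate)
trials = []
for n in range(33):                                    # first stage, Equation (43)
    ratio = np.sqrt(C / f_omega(rng.normal(s, sigma_star, size=N1)))
    M_k = np.sqrt(np.clip(2.0 - 2.0 * ratio, 0.0, None))
    M_n = np.sqrt(max(2.0 - 2.0 * ratio.mean(), 0.0))
    half = t1 * np.sqrt(np.sum((M_k - M_n)**2) / (N1 * (N1 - 1)))
    trials.append((M_n - half, M_n + half, M_n))

trials.sort(key=lambda tr: tr[1] - tr[0])              # sort by amplitude U - L
Ls, Us, Ms = map(np.array, zip(*trials[:31]))          # keep the 31 smallest
M_s = Ms.mean()                                        # Equation (44)
half_s = t2 * np.sqrt(np.sum((Ms - M_s)**2) / (31 * 30))
L2, U2 = Ls.mean(), Us.mean()                          # Equation (45)
L_R = min(M_s - half_s, L2)                            # Equation (46)
U_R = max(M_s + half_s, U2)
```

Discarding the two widest first-stage intervals removes the heavy-tail outliers, and taking the outer envelope of the two second-stage intervals yields the robust, possibly asymmetric bound $[{\tilde{L}}_{R},{\tilde{U}}_{R}]$.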
In Figure 4, we can find the intervals defined by Equation (46), which confirm the robustness of the interval estimator for Hellinger's affinity.
Figure 4.
With the new robust two-stage resampling interval estimator, the exact Hellinger distance is more likely to be found within the interval; it should be noted that, from dimension 1 to dimension 20, the exact value of the metric always lies within the interval.
The detailed process of the two-stage estimator is presented in Algorithm 1. Notice that we studied dimensions up to 20 with promising results. k-NN is a good alternative [31,32,33,34,35,36], but it may present several difficulties, such as the choice of k [37], the choice of distance measure [38] and the curse of dimensionality [39].
Algorithm 1—Twostage resampling estimator 

The implemented algorithm is available upon request.
3. Results and Discussion
To study the behavior of the proposed resampling estimator, we addressed several dimensions $\left(\aleph \right)$, different Parzen coefficients $\left(\sigma ={\sigma}_{1}={\sigma}_{2}\right)$, as well as distinct Euclidean distances $\left(d=\Vert {x}_{1}-{x}_{2}\Vert \right)$.
Firstly, we studied the estimator from dimension 1 (one) to dimension 20 and obtained the results shown in Figure 5, which let us verify that the exact value was always within the estimated interval.
Figure 5.
Behavior of the new robust twostage resampling interval estimator regarding dimensions from 1 to 20. (The exact value of the nonparametric Hellinger’s metric is represented by a continuous line and is always contained in the estimated interval.)
If the attained precision is not enough to generate disjoint intervals in competitive scenarios composed of multiple hypotheses, then the two-stage resampling can be repeated using a higher N_{1}; see Figure 6.
One can see in Figure 6 that the interval decreases with the increase of random samples, and that the exact value of nonparametric Hellinger’s metric, which is represented by a continuous line, is always contained in the estimated interval.
To verify the behavior of the resampling estimator with the variation of the Parzen window ${\sigma}^{2}$, we studied the results for values from 0.1 to 2 in increments of 0.1 (Figure 7). In all cases, the exact value is within the estimated error bound. Hence, the error-bound estimator proposed here leads to robust interval estimation.
Figure 6.
Asymptotic study of the novel robust two-stage resampling interval estimator: upper and lower interval limits for different numbers of random samples. (A detailed view of the Euclidean distances between 9 and 10 was added to the 1,000-sample graphic so that the behavior of the estimator can be easily confirmed.)
Figure 7.
Behavior of the new robust two-stage resampling interval estimator with the Parzen window varying from 0.1 to 2; the graphics cover Euclidean distances from 0 (zero) to 5; the upper and lower interval limits always contain the exact value, which is represented by a continuous line.
4. Conclusions
Hellinger's metric was obtained from the generalized differential entropies and divergences. A nonparametric metric estimator based on the Parzen window was introduced. We proposed a change of measure to allow a resampling method, and with it we designed a new two-stage resampling error-bound estimator. The resampling error-bound estimator also has the advantage of resampling just one density (the sum of normalized product densities), given by a nonparametric Parzen density with diagonal covariance, with asymptotic behavior. The new algorithm presented robust behavior and very promising results. The asymptotic behavior allows the use of this metric in competitive scenarios with three or more densities, such as clustering and image retrieval, to obtain disjoint intervals simply by increasing the number of resampling samples. As future work, two paths seem interesting: to evaluate the behavior of Hellinger's metric in medical image processing and analysis, as in [40,41], and to assess the k-NN entropy estimation capability with the metric and the heteroscedastic data addressed here.
Acknowledgments
This paper is dedicated to André T. Puga, who initiated and supervised this work prior to his death. This work has been financially supported by Fundação para a Ciência e a Tecnologia (FCT), in Portugal, in the scope of the research project with reference PTDC/EEACRO/103320/2008.
Conflict of Interest
The authors declare no conflict of interest.
References
 Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. (Crelle) 1909, 136, 210–271. [Google Scholar]
 Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Series B. 1966, 28, 131–142. [Google Scholar]
 Ullah, A. Entropy, divergence and distance measures with econometric applications. J. Stat. Plan. Inference 1996, 49, 137–162. [Google Scholar] [CrossRef]
 Principe, J.C. Information Theoretic Learning: Rényi's Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
 Pardo, L. Statistical Inference Based on Divergence Measures; Chapman and Hall/CRC: Boca Raton, FL, USA, 2005; p. 483. [Google Scholar]
 Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
 Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
 Jeffreys, H. Fisher and inverse probability. Int. Stat. Rev. 1974, 42, 1–3. [Google Scholar] [CrossRef]
 Rao, C.R.; Nayak, T.K. Cross entropy, dissimilarity measures, and characterizations of quadratic entropy. IEEE Trans. Inf. Theory 1985, 31, 589–593. [Google Scholar] [CrossRef]
 Lin, J.H. Divergence measures based on the Shannon Entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
 Menéndez, M.L.; Morales, D.; Pardo, L.; Salicrú, M. (h, Φ)entropy differential metric. Appl. Math. 1997, 42, 81–98. [Google Scholar] [CrossRef]
 Seth, S.; Principe, J.C. Compressed signal reconstruction using the correntropy induced metric. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, March 31–April 4, 2008; pp. 3845–3848.
 Topsoe, F. Some inequalities for information divergence and Related measures of discrimination. IEEE Trans. Inf. Theory 2000, 46, 1602–1609. [Google Scholar] [CrossRef]
 Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
 Cubedo, M.; Oller, J.M. Hypothesis testing: A model selection approach. J. Stat. Plan. Inference 2002, 108, 3–21. [Google Scholar] [CrossRef]
 Puga, A.T. Nonparametric Hellinger’s Metric. In Proceedings of CMNE/CILANCE 2007, Porto, Portugal, 13–15 June 2007.
 Hartley, R.V.L. Transmission of information. Bell Syst. Tech. J. 1928, 7, 535–563. [Google Scholar] [CrossRef]
 Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
 Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961; Volume 1, pp. 547–561. [Google Scholar]
 Nonextensive Entropy: Interdisciplinary Applications; GellMann, M.; Tsallis, C. (Eds.) Oxford University Press: New York, NY, USA, 2004.
 Wolf, C. Twostate paramagnetism induced by Tsallis and Renyi statistics. Int. J. Theor. Phys. 1998, 37, 2433–2438. [Google Scholar] [CrossRef]
 Gokcay, E.; Principe, J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 158–171. [Google Scholar] [CrossRef]
 Ramshaw, J.D. Thermodynamic stability conditions for the Tsallis and Renyi entropies. Phys. Lett. A 1995, 198, 119–121. [Google Scholar] [CrossRef]
 Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef]
 Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, UK, 1986. [Google Scholar]
 Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; Wiley: New York, NY, USA, 1992. [Google Scholar]
 Devroye, L. NonUniform Random Variate Generation; SpringerVerlag: New York, NY, USA, 1986. [Google Scholar]
 Preda, V.C. The student distribution and the principle of maximumentropy. Ann. Inst. Stat. Math. 1982, 34, 335–338. [Google Scholar] [CrossRef]
 Kapur, J.N. MaximumEntropy Models in Science and Engineering; Wiley: New York, NY, USA, 1989. [Google Scholar]
 The Probable Error of a Mean. Available online: http://www.jstor.org/discover/10.2307/2331554?uid=2&uid=4&sid=21102107492741/ (accessed on 28 April 2013).
 Leonenko, N.; Pronzato, L.; Savani, V. A class of Renyi information estimators for multidimensional densities. Ann. Stat. 2008, 36, 2153–2182. [Google Scholar] [CrossRef]
 Li, S.; Mnatsakanov, R.M.; Andrew, M.E. knearest neighbor based consistent entropy estimation for hyperspherical distributions. Entropy 2011, 13, 650–667. [Google Scholar] [CrossRef]
 Penrose, M.D.; Yukich, J.E. Laws of large numbers and nearest neighbor distances. In Advances in Directional and Linear Statistics; Wells, M.T., SenGupta, A., Eds.; PhysicaVerlag: Heidelberg, Germany, 2011; pp. 189–199. [Google Scholar]
 Misra, N.; Singh, H.; Hnizdo, V. Nearest neighbor estimates of entropy for multivariate circular distributions. Entropy 2010, 12, 1125–1144. [Google Scholar] [CrossRef]
 Mnatsakanov, R.; Misra, N.; Li, S.; Harner, E. kNearest neighbor estimators of entropy. Math. Method. Stat. 2008, 17, 261–277. [Google Scholar] [CrossRef]
 Wang, Q.; Kulkarni, S.R.; Verdu, S. Divergence estimation for multidimensional densities via knearestneighbor distances. IEEE Trans. Inf. Theory 2009, 55, 2392–2405. [Google Scholar] [CrossRef]
 Hall, P.; Park, B.U.; Samworth, R.J. Choice of neighbor order in nearestneighbor classification. Ann. Stat. 2008, 36, 2135–2152. [Google Scholar] [CrossRef]
 Nigsch, F.; Bender, A.; van Buuren, B.; Tissen, J.; Nigsch, E.; Mitchell, J.B.O. Melting point prediction employing knearest neighbor algorithms and genetic parameter optimization. J. Chem. Inf. Model. 2006, 46, 2412–2422. [Google Scholar] [CrossRef] [PubMed]
 Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. When Is "Nearest Neighbor" Meaningful? In Proceedings of the 7th International Conference on Database Theory, Jerusalem, Israel, 12 January 1999; pp. 217–235.
 Vemuri, B.C.; Liu, M.; Amari, S.I.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imag. 2011, 30, 475–483. [Google Scholar] [CrossRef] [PubMed]
 Liu, M.; Vemuri, B.; Amari, S.I.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2407–2419. [Google Scholar]
© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).