Empirical Squared Hellinger Distance Estimator and Generalizations to a Family of α-Divergence Estimators

We present an empirical estimator for the squared Hellinger distance between two continuous distributions that converges almost surely. We show that the divergence estimation problem can be solved directly from the empirical CDFs and does not need the intermediate step of estimating the densities. We illustrate the proposed estimator on several one-dimensional probability distributions. Finally, we extend the estimator to a family of estimators for the family of α-divergences, which also converge almost surely, and discuss the uniqueness of this result. We demonstrate applications of the proposed Hellinger affinity estimators to approximately bounding Neyman–Pearson regions.


Introduction
We present an empirical estimator for the squared Hellinger distance between two continuous distributions. The work is a direct extension of Perez-Cruz [1], who provided an empirical KL divergence estimator. Their work builds on previous works on divergence estimators such as [2][3][4][5][6]. Similar to their estimator, given two samples from two distributions, our estimator does not need to estimate the probability density functions explicitly before estimating the squared Hellinger distance between the two distributions, which makes it simple and fast. We show that the estimator converges to the true squared Hellinger distance almost surely as the sample size increases. We then extend our estimator to the family of α-divergences, to which the squared Hellinger distance belongs. For each of the estimators, we can obtain a reverse estimator using the other direction of the two data samples, and we can also obtain a symmetric estimator by averaging the two one-sided estimators. We present several numerical examples to show the convergence of our estimators. Our newly proposed estimators can be used to efficiently approximate the affinity of two data samples, leading to various applications in many fields of research.

Preliminaries on Divergences between Probability Distributions
Recall that the definition of the squared Hellinger distance [7] is (for univariate continuous distributions with densities p and q):

$H^2(P, Q) = 1 - \int \sqrt{p(x)\,q(x)}\, dx.$

It is symmetric and always bounded between 0 and 1. Additionally, recall the definition of the Kullback-Leibler divergence [8] (for univariate continuous distributions):

$D_{\mathrm{KL}}(P\|Q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx,$

where the total variation distance (TVD) is defined as:

$\delta(P, Q) = \frac{1}{2}\int |p(x) - q(x)|\, dx.$

The squared Hellinger distance is also closely related to the KL divergence and the TVD and can be bounded by:

$H^2(P, Q) \le \frac{1}{2} D_{\mathrm{KL}}(P\|Q), \qquad H^2(P, Q) \le \delta(P, Q) \le \sqrt{2}\, H(P, Q).$

It is also a known result that KL divergence is stronger than Hellinger distance in the sense that convergence in KL divergence implies convergence in Hellinger distance, which further implies convergence in total variation distance. Therefore, the Hellinger distance represents a middle ground between the KL divergence and the total variation distance; it is weaker than KL divergence but stronger than total variation distance in terms of convergence. As shown before, the Hellinger distance has close connections to the total variation distance, which is exactly what inference depends on (KL divergence does not admit a useful lower bound on the TVD). It has another attractive property compared with the KL divergence: the squared Hellinger distance is always bounded between zero and one for probability distributions that may or may not have the same support, whereas the KL divergence becomes infinite for probability distributions of different supports. In fact, the KL divergence can be unbounded even for probability distributions supported on the whole real line. For example, if P is the standard Cauchy distribution and Q is the standard normal distribution, then $D_{\mathrm{KL}}(P\|Q)$ diverges to infinity. Hence, an empirical estimator for the KL divergence does not provide meaningful estimates in such a case, while the squared Hellinger distance is always bounded. Due to these desirable properties, we focus mainly on the squared Hellinger distance in this work.
The squared Hellinger distance is a member of the family of α-divergences (up to a scaling factor), which are defined in Cichocki and Amari [9] for α ∈ (0, 1) as

$D_\alpha(P\|Q) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx\right).$

The α-divergence can also be related to the TVD through inequalities similar to those for the squared Hellinger distance, up to a scaling factor (see, for example, [10][11][12]).

Review of Empirical Sample-Based Kullback-Leibler Divergence Estimator of Continuous Distributions
Let $\mathcal{X} = \{x_i\}_{i=1}^{n}$ and $\mathcal{X}' = \{x'_j\}_{j=1}^{m}$ be iid samples from P and Q, respectively, sorted in increasing order. Recall that the empirical CDFs of P and Q are, respectively,

$P_e(x) = \frac{1}{n}\sum_{i=1}^{n} U(x - x_i), \qquad Q_e(x) = \frac{1}{m}\sum_{j=1}^{m} U(x - x'_j),$

where U(x) is a unit-step function with U(0) = 0.5. The continuous piecewise-linear interpolation of the empirical CDF of P is denoted $P_c(x)$. It is zero for any point smaller than a joint lower bound $x_0 < \inf\{\mathcal{X}, \mathcal{X}'\}$ of the data samples from P and Q, and is one for anything greater than or equal to a joint upper bound $x_{n+1} > \sup\{\mathcal{X}, \mathcal{X}'\}$; everywhere in the middle, it is defined as

$P_c(x) = a_i x + b_i, \qquad x \in (x_{i-1}, x_i],$

where the coefficients $a_i, b_i$ are set so that $P_c(x)$ matches the values of $P_e(x)$ at the sampled values $x_i$, $i = 1, \dots, n$. Similarly, we can define the interpolated empirical CDF for Q, denoted $Q_c(x)$. These empirical CDFs converge uniformly to the true CDFs, at a rate that does not depend on the underlying distributions. Perez-Cruz [1] proposed the empirical KL estimator

$\hat{D}(P\|Q) = \frac{1}{n}\sum_{i=1}^{n} \log\frac{\delta P_c(x_i)}{\delta Q_c(x_i)},$

where $\delta P_c(x_i)$ denotes the left slope of $P_c$ at $x_i$ and $\delta Q_c(x_i)$ denotes the left slope of $Q_c$ at $x_i$. Here, $n = |\mathcal{X}|$ and the $x_i$ are the samples from the distribution P. Ref. [1] showed that $\hat{D}(P\|Q) - 1 \to D(P\|Q)$ almost surely. For this 1D data setting, an experiment showing the convergence of their estimator is shown in Figure 1, where we plotted estimated values against increasing sample sizes, with P and Q taken to be the normal distributions N(0, 1) and N(1, 1), respectively. It is worth mentioning that the major innovation and strength of these types of empirical estimators is that no convergent density estimator is required in the process of estimating the desired divergence. In fact, only the empirical CDF is used, and the density model underlying the estimator is based entirely on the slopes of the piecewise-linear interpolation of the empirical CDF.
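As a concrete illustration, the 1D KL estimator above can be sketched in a few lines of NumPy. This is a minimal sketch, not the reference implementation of [1]: the helper name `left_slope` and the unit offsets used for the joint bounds $x_0$ and $x_{n+1}$ are our own choices.

```python
import numpy as np

def left_slope(t, sample, lo, hi):
    """Left slope, at the points t, of the piecewise-linear interpolation of the
    empirical CDF of `sample` (value (i - 0.5)/n at the i-th sorted sample,
    anchored at (lo, 0) and (hi, 1))."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    knots = np.concatenate(([lo], s, [hi]))
    vals = np.concatenate(([0.0], (np.arange(1, n + 1) - 0.5) / n, [1.0]))
    # side="left" selects the interval (knots[j-1], knots[j]] containing t,
    # which yields the LEFT slope at sample points.
    j = np.clip(np.searchsorted(knots, t, side="left"), 1, n + 1)
    return (vals[j] - vals[j - 1]) / (knots[j] - knots[j - 1])

def kl_hat(x, y):
    """Empirical KL estimator: mean log-ratio of left CDF slopes at the P
    samples, minus the almost-sure bias of 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo = min(x.min(), y.min()) - 1.0   # joint lower bound x_0 (offset is arbitrary)
    hi = max(x.max(), y.max()) + 1.0   # joint upper bound x_{n+1}
    dp = left_slope(x, x, lo, hi)
    dq = left_slope(x, y, lo, hi)
    return np.mean(np.log(dp / dq)) - 1.0
```

For example, with samples from N(0, 1) and N(1, 1), the estimate should approach the true value $D_{\mathrm{KL}} = 0.5$ as the sample size grows.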
This empirical density model is far from convergent, as we can see in Figure 2, which shows the calculated slopes (in blue) for N = 10, 100, 1000, 10,000 data samples from a normal distribution against the ground-truth normal densities (in red), plotted in log scale. Clearly, the empirical density model does not converge to the true density. Perez-Cruz [1] also provided an empirical KL estimator for multivariate distribution samples, based on a nearest-neighbor approach. For each sample $x_i$ in $\mathcal{X}$, where the dimension of the sample is d, let

$\hat{p}_k(x_i) = \frac{k}{n-1}\,\frac{\Gamma(d/2+1)}{\pi^{d/2}}\,\frac{1}{r_k^d(x_i)}, \qquad \hat{q}_k(x_i) = \frac{k}{m}\,\frac{\Gamma(d/2+1)}{\pi^{d/2}}\,\frac{1}{s_k^d(x_i)},$

where $r_k(x_i)$ and $s_k(x_i)$ are, respectively, the Euclidean distances to the k-th nearest neighbor of $x_i$ in $\mathcal{X} \setminus \{x_i\}$ and in $\mathcal{X}'$, and $\pi^{d/2}/\Gamma(d/2+1)$ is the volume of the unit ball in $\mathbb{R}^d$. Ref. [1] showed that the random variable $p(x)/\hat{p}_k(x)$ converges to an independent Gamma(k, k) random variable, which has mean 1 and variance $\frac{1}{k}$, for each selected k = 1, 2, 3, . . ., where x is sampled from P. Therefore, they proposed the following estimator:

$\hat{D}_k(P\|Q) = \frac{d}{n}\sum_{i=1}^{n} \log\frac{s_k(x_i)}{r_k(x_i)} + \log\frac{m}{n-1}.$

It was shown that, since the Gamma(k, k) limit variables for the two densities are identically distributed and their logarithms cancel in expectation, $\hat{D}_k(P\|Q) \to D(P\|Q)$ almost surely.
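The nearest-neighbor construction above can be sketched as follows. This is a brute-force illustration (O(n·m) distance computation); the function names are our own, and a tree-based neighbor search would be used for large samples.

```python
import numpy as np

def knn_dist(t, data, k):
    """Euclidean distance from each row of t to its k-th nearest neighbor in data."""
    d2 = ((t[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(np.sort(d2, axis=1)[:, k - 1])

def kl_hat_knn(x, y, k=5):
    """kNN-based empirical KL estimator for d-dimensional samples:
    (d/n) * sum log(s_k / r_k) + log(m / (n - 1))."""
    n, d = x.shape
    m = y.shape[0]
    r = knn_dist(x, x, k + 1)  # k-th neighbor in X \ {x_i}: skip the zero self-distance
    s = knn_dist(x, y, k)
    return d * np.mean(np.log(s / r)) + np.log(m / (n - 1))
```

For two 2D Gaussians N(0, I) and N((1, 1), I), the true KL divergence is $\frac{1}{2}\|\mu\|^2 = 1$, which the estimate should approach.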

Estimator for 1D Data
Following Perez-Cruz [1], we define a similar estimator for the Hellinger affinity using empirical CDFs. Let $\mathcal{X} = \{x_i\}_{i=1}^{n}$ and $\mathcal{X}' = \{x'_j\}_{j=1}^{m}$ be iid samples from P and Q, respectively, sorted in increasing order. Recall that the empirical CDFs of P and Q are, respectively,

$P_e(x) = \frac{1}{n}\sum_{i=1}^{n} U(x - x_i), \qquad Q_e(x) = \frac{1}{m}\sum_{j=1}^{m} U(x - x'_j),$

where U(x) is a unit-step function with U(0) = 0.5. The continuous piecewise-linear interpolation of the empirical CDF of P (and Q) is denoted $P_c(x)$ (and $Q_c(x)$, respectively). It is zero for anything smaller than a joint lower bound $x_0 < \inf\{\mathcal{X}, \mathcal{X}'\}$ of the data samples from P and Q, and one for anything greater than or equal to a joint upper bound $x_{n+1} > \sup\{\mathcal{X}, \mathcal{X}'\}$; everywhere in the middle, it is defined as

$P_c(x) = a_i x + b_i, \qquad x \in (x_{i-1}, x_i],$

where the coefficients $a_i, b_i$ are set so that $P_c(x)$ matches the values of $P_e(x)$ at the sampled values $x_i$, $i = 1, \dots, n$; $Q_c(x)$ is defined similarly. These empirical CDFs converge uniformly to the true CDFs, at a rate that does not depend on the underlying distributions. Our estimator for the squared Hellinger distance is based on estimating the Hellinger affinity, which is directly related to the quantity of interest by

$A(P, Q) = \int \sqrt{p(x)\,q(x)}\, dx, \qquad H^2(P, Q) = 1 - A(P, Q).$

The new estimator for the Hellinger affinity is

$\hat{A}(P, Q) = \frac{1}{n}\sum_{i=1}^{n} \sqrt{\frac{\delta Q_c(x_i)}{\delta P_c(x_i)}},$

where $\delta P_c(x_i) = (P_c(x_i) - P_c(x_i - \epsilon))/\epsilon$, for any $\epsilon < \min_i\{x_i - x_{i-1}\}$, denotes the left slope of $P_c$ at $x_i$ and, similarly, $\delta Q_c(x_i)$ denotes the left slope of $Q_c$ at $x_i$. We next claim and prove that $\hat{A}$ converges to a scalar multiple of the true Hellinger affinity: $\hat{A} \to \frac{\pi}{4} A$. To justify this bias-correction constant, we rewrite the estimator as

$\hat{A}(P, Q) = \frac{1}{n}\sum_{i=1}^{n} \sqrt{\frac{q(x_i)}{p(x_i)}} \cdot \sqrt{\frac{\delta Q_c(x_i)/q(x_i)}{\delta P_c(x_i)/p(x_i)}}.$

Notice that the first (square-root) term in the sum is the true density ratio $\sqrt{q(x_i)/p(x_i)}$. We need to show that the above empirical sum converges almost surely to $C \int \sqrt{p(x)\,q(x)}\, dx$, where the constant $C = \frac{\pi}{4}$ is derived from the second term, using arguments similar to those of Perez-Cruz [1] through waiting-time distributions between two consecutive samples from a uniform distribution between 0 and 1.
We outline the proof for the constant term below. Similar to Perez-Cruz [1], we know that, given $\{x_i\}_{i=1}^{n} \sim P$, the normalized waiting times $n\Delta P(x_i)$ are asymptotically $\mathrm{Exp}(1)$ and independent of P (and similarly for Q). With this argument, the last expression for $\hat{A}(P, Q)$ can be rewritten (where $z_i = n\Delta P(x_i)$ and $w_i = m\Delta Q(x_i)$) as

$\hat{A}(P, Q) = \frac{1}{n}\sum_{i=1}^{n} \sqrt{\frac{q(x_i)}{p(x_i)}} \cdot \sqrt{\frac{z_i}{w_i}}.$
The first random factor converges almost surely in empirical mean to $\mathbb{E}[\sqrt{z}] = \Gamma(3/2) = \frac{\sqrt{\pi}}{2}$, since $z_i = n\Delta P(x_i) \sim \mathrm{Exp}(1)$. For the second factor, $x_i$ falls into an interpolation interval of $Q_c$ with probability proportional to that interval's mass, so $w_i = m\Delta Q(x_i)$ follows the size-biased $\mathrm{Gamma}(2, 1)$ distribution, and the corresponding expression converges almost surely to $\mathbb{E}[w^{-1/2}] = \Gamma(3/2)/\Gamma(2) = \frac{\sqrt{\pi}}{2}$. Notice here that the implied density model $p_e(x)$ does not need to converge to $p(x)$ for the above expressions to converge to the desired constant. Combining all previous results, we have shown that $\hat{A}(P, Q)$ converges almost surely to

$\mathbb{E}[\sqrt{z}]\,\mathbb{E}[w^{-1/2}]\int \sqrt{p(x)\,q(x)}\, dx = \frac{\pi}{4}\, A(P, Q).$

Hence, we obtain the desired constant $C = \frac{\pi}{4} \approx 0.785$. The final estimator for the squared Hellinger distance is

$\hat{H}^2(P, Q) = 1 - \frac{4}{\pi}\,\hat{A}(P, Q).$

Notice that the Hellinger distance is a symmetric distance metric for any distributions P and Q; hence, the estimator above is only one side of the story. Following exactly the same arguments, we can show that the opposite-direction estimator $\hat{A}(Q, P)$ also converges almost surely to $\frac{\pi}{4}A(Q, P)$, and since $A(P, Q) = A(Q, P)$, we can obtain a symmetric estimator of the Hellinger affinity that converges almost surely to $\frac{\pi}{4}A(P, Q)$:

$\hat{A}_S(P, Q) = \frac{1}{2}\left(\hat{A}(P, Q) + \hat{A}(Q, P)\right).$

Therefore, we can construct a corresponding estimator for the squared Hellinger distance as

$\hat{H}^2_S(P, Q) = 1 - \frac{4}{\pi}\,\hat{A}_S(P, Q),$

which enjoys all of the properties shown above for the two one-sided estimators. Since the symmetric version uses more information from the two samples, it can be expected to provide better estimates than the two one-sided estimators in terms of the rate of convergence.
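The π/4-corrected affinity estimator and its symmetric version can be sketched as follows. This is a minimal NumPy sketch under our own implementation choices (helper name `left_slope`, unit offsets for the joint bounds); it is not the authors' reference code.

```python
import numpy as np

def left_slope(t, sample, lo, hi):
    """Left slope at points t of the piecewise-linear empirical CDF of `sample`."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    knots = np.concatenate(([lo], s, [hi]))
    vals = np.concatenate(([0.0], (np.arange(1, n + 1) - 0.5) / n, [1.0]))
    j = np.clip(np.searchsorted(knots, t, side="left"), 1, n + 1)
    return (vals[j] - vals[j - 1]) / (knots[j] - knots[j - 1])

def affinity_hat(x, y):
    """One-sided affinity estimator; converges a.s. to (pi/4) * A(P, Q)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo = min(x.min(), y.min()) - 1.0
    hi = max(x.max(), y.max()) + 1.0
    return np.mean(np.sqrt(left_slope(x, y, lo, hi) / left_slope(x, x, lo, hi)))

def h2_hat_sym(x, y):
    """Symmetric squared Hellinger estimator 1 - (4/pi) * (A_hat(P,Q) + A_hat(Q,P)) / 2."""
    return 1.0 - (2.0 / np.pi) * (affinity_hat(x, y) + affinity_hat(y, x))
```

For samples from N(0, 1) and N(2, 1), the estimate should approach the analytic value $1 - e^{-1/2} \approx 0.393$.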

Numerical Experiments
We show the asymptotic convergence of the new estimator $\hat{H}^2(P, Q) = 1 - \frac{4}{\pi}\hat{A}(P, Q)$, and its symmetric version, to the true $H^2$ value as the data sample size grows in the experiments below. In each experiment, we took two distributions from the same family and compared the estimated squared Hellinger distance against the ground-truth value. We plotted mean estimated values for sample sizes N = M = 10, 32, 100, 316, 1000, 3162, 10,000 (x-axis) over 100 instances for each pair of distributions, together with the 95% confidence interval of the estimates. For each experiment, the squared Hellinger distance estimators $\hat{H}^2(P, Q)$ and $\hat{H}^2(Q, P)$ are plotted in red and blue, and the symmetric estimator $\hat{H}^2_S(P, Q)$ is plotted in purple. We also recall that when P and Q are normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, the squared Hellinger distance has the analytic form

$H^2(P, Q) = 1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}}\, \exp\left(-\frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)}\right).$

In the first experiment (Figure 3), P, Q are taken to be the normal distributions N(0, 4) and N(1, 1), respectively. In the second experiment, P, Q are taken to be the normal distributions N(0, 1) and N(2, 1), respectively. In the third experiment (Figure 4), P, Q are taken to be the normal distributions N(0, 1) and N(0.01, 1), respectively. In the fourth experiment, P, Q are taken to be the exponential distributions Exp(1) and Exp(2), respectively.
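The closed-form Gaussian value used as ground truth in these experiments is a one-liner; the function name below is our own.

```python
import numpy as np

def h2_gaussian(mu1, s1, mu2, s2):
    """Closed-form squared Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2)."""
    v = s1 ** 2 + s2 ** 2
    return 1.0 - np.sqrt(2.0 * s1 * s2 / v) * np.exp(-((mu1 - mu2) ** 2) / (4.0 * v))
```

For example, `h2_gaussian(0, 1, 2, 1)` equals $1 - e^{-1/2}$, and the distance vanishes for identical parameters.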
In the fifth experiment (Figure 5), P, Q are taken to be the uniform distributions U(0, 1) and U(0, 2), respectively. In the sixth experiment, P, Q are taken to be the uniform distributions U(0, 1) and U(0.5, 1.5), respectively. Notice that the squared Hellinger distance is well-defined for distributions of different support. In the last two experiments (Figure 6), we considered two distributions from different distribution families. Here, P = Cauchy(0, 1) is the standard Cauchy distribution. In the seventh experiment, Q = N(1, 1), and in the last experiment, Q = N(0, 1). The true squared Hellinger distances are computed using numerical integration.
We can observe from the previous experiments that, depending on the distributions, either the estimator $\hat{H}^2(P, Q)$ or the reverse-direction estimator $\hat{H}^2(Q, P)$ can turn out to be better; this is a consequence of our choice to take the left slope of the empirical CDF, so the relative location of the two distributions determines which estimator is more accurate. The symmetric squared Hellinger estimator provides a middle ground between the two one-sided estimators, and it also exhibits smaller variance. As mentioned before, the proposed estimator does not use information about the underlying distribution family and does not need to estimate the density before estimating the squared Hellinger distance. As a comparison with an estimator that knows the distribution family, we performed experiments with Gaussian distributions, where we can use the sample mean and sample variance to estimate the distributions and then compute the squared Hellinger distance analytically from the estimated parameters. The estimator is constructed by plugging the sample estimates into the analytic formula:

$\hat{H}^2_{\mathrm{Gauss}} = 1 - \sqrt{\frac{2\hat\sigma_1\hat\sigma_2}{\hat\sigma_1^2 + \hat\sigma_2^2}}\, \exp\left(-\frac{(\hat\mu_1 - \hat\mu_2)^2}{4(\hat\sigma_1^2 + \hat\sigma_2^2)}\right),$

where $\hat\mu_1, \hat\sigma_1, \hat\mu_2, \hat\sigma_2$ are sample estimates of the means and standard deviations from the two datasets. This estimator uses extra information, namely that the data come from Gaussian distributions. However, as we can see from the plots in Figure 7, the proposed squared Hellinger distance estimator performs similarly to the estimator that knows the distribution family. In the first experiment, the two distributions are N(0, 4) and N(2, 4). In the second experiment, the two distributions are N(0, 1) and N(2, 1). For both plots, we plotted the proposed symmetric squared Hellinger distance estimator in red and the naive estimator using sample parameters in blue. The upper and lower bounds of each estimator over 100 iterations are plotted in dashed lines.
Finally, we consider the setting of the eighth experiment, where P is a standard Cauchy distribution and Q is a standard normal distribution, and we compare the behavior of the empirical squared Hellinger estimator $\hat{H}^2(P, Q)$ with the empirical KL divergence estimator $\hat{D}(P\|Q)$ of [1]. As mentioned in the discussion in Section 2, in this case the KL divergence diverges to infinity while the squared Hellinger distance is bounded. With the same experimental setup, we plotted the resulting divergence estimates and confidence intervals for both the KL divergence and the squared Hellinger distance in Figure 8, where the ground-truth $H^2(P, Q)$ value (approximated by numerical integration) is plotted in black and the ground-truth $D_{\mathrm{KL}}(P\|Q)$ value is infinite. As we can observe from Figure 8, while the empirical squared Hellinger estimator converges to the ground-truth value quickly, the empirical KL divergence estimator fails to converge, since the ground-truth value is infinite. This justifies the desirability of considering the squared Hellinger distance, which is always bounded.

Estimator for Vectorial Data
Utilizing the results proved for the vectorial data case in [1], we propose the following estimator for the squared Hellinger distance in multivariate cases (for a chosen k). Similar to the definitions in [1], let the kNN density estimators be defined as

$\hat{p}_k(x_i) = \frac{k}{n-1}\,\frac{\Gamma(d/2+1)}{\pi^{d/2}}\,\frac{1}{r_k^d(x_i)}, \qquad \hat{q}_k(x_i) = \frac{k}{m}\,\frac{\Gamma(d/2+1)}{\pi^{d/2}}\,\frac{1}{s_k^d(x_i)},$

where $r_k(x_i)$ and $s_k(x_i)$ are, respectively, the Euclidean distances to the k-th nearest neighbor of $x_i$ in $\mathcal{X} \setminus \{x_i\}$ and in $\mathcal{X}'$. Let

$\hat{A}_k(P, Q) = \frac{1}{n}\sum_{i=1}^{n} \sqrt{\frac{\hat{q}_k(x_i)}{\hat{p}_k(x_i)}}.$

Since $p(x_i)/\hat{p}_k(x_i)$ and $q(x_i)/\hat{q}_k(x_i)$ converge to independent Gamma(k, k) random variables that are also independent of P and Q, we conclude that $\hat{A}_k(P, Q)$ converges almost surely to

$\frac{\Gamma(k+1/2)\,\Gamma(k-1/2)}{\Gamma(k)^2}\, A(P, Q).$

So $\hat{A}_k(P, Q)$ converges almost surely to the true Hellinger affinity up to a constant multiplier, similar to the 1D case. Therefore, we propose the following estimator for the squared Hellinger distance, which converges almost surely to the true squared Hellinger distance:

$\hat{H}^2_k(P, Q) = 1 - \frac{\Gamma(k)^2}{\Gamma(k+1/2)\,\Gamma(k-1/2)}\, \hat{A}_k(P, Q).$

Similar to the 1D case, we can extend this estimator to a symmetric version that also shares the desired convergence properties:

$\hat{H}^2_{k,S}(P, Q) = 1 - \frac{\Gamma(k)^2}{\Gamma(k+1/2)\,\Gamma(k-1/2)}\cdot\frac{\hat{A}_k(P, Q) + \hat{A}_k(Q, P)}{2}.$
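A minimal brute-force sketch of the kNN-based estimator follows. The Gamma-ratio constant used here, $\Gamma(k+1/2)\Gamma(k-1/2)/\Gamma(k)^2$, is our reading of the Gamma(k, k) bias argument (it reduces to the moments $\mathbb{E}[\sqrt{\gamma}]\,\mathbb{E}[\eta^{-1/2}]$ of two independent Gamma(k, k) variables); the function names are our own.

```python
import numpy as np
from math import gamma

def knn_dist(t, data, k):
    """Euclidean distance from each row of t to its k-th nearest neighbor in data."""
    d2 = ((t[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(np.sort(d2, axis=1)[:, k - 1])

def h2_hat_knn(x, y, k=5):
    """kNN-based squared Hellinger estimator for d-dimensional samples, with the
    Gamma-ratio constant correcting the multiplicative bias of the affinity."""
    n, d = x.shape
    m = y.shape[0]
    r = knn_dist(x, x, k + 1)              # k-th neighbor excluding the point itself
    s = knn_dist(x, y, k)
    ratio = ((n - 1) / m) * (r / s) ** d   # \hat q_k(x_i) / \hat p_k(x_i); unit-ball volumes cancel
    c = gamma(k + 0.5) * gamma(k - 0.5) / gamma(k) ** 2
    return 1.0 - np.mean(np.sqrt(ratio)) / c
```

For 2D Gaussians N(0, I) and N((1, 1), I), the true value is $1 - e^{-1/4} \approx 0.221$.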

Numerical Experiments for Vectorial Data
Similar to the experimental setting in Section 4.2, we show the convergence of the estimators proposed in Section 4.3. Sample sizes N = M = 10, 32, 100, 316, 1000, 3162 (plotted on the x-axis) are used for each pair of distributions. The analytic formula for the squared Hellinger distance between two multivariate Gaussians $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ is

$H^2(P, Q) = 1 - \frac{\det(\Sigma_1)^{1/4}\,\det(\Sigma_2)^{1/4}}{\det\!\left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{1/2}}\, \exp\left(-\frac{1}{8}(\mu_1 - \mu_2)^{\top}\left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{-1}(\mu_1 - \mu_2)\right).$

In the experiment in Figure 9, we picked 2D normal distributions P and Q with $\mu_1 = (0, 0)^{\top}$, $\mu_2 = (1, 1)^{\top}$, $\Sigma_1 = \Sigma_2 = I_2$. For the proposed k-nearest-neighbor estimator, we picked k = 5. The performance of the proposed estimators and a comparison with the naive estimator are plotted below. The naive estimator estimates the mean and covariance from the data samples and computes the squared Hellinger distance from the analytic formula using these estimated parameters. From these results, we can observe that, similar to the 1D cases, the symmetric estimator seems to perform best and is comparable to the naive estimator in terms of convergence. In general, a larger k leads to a smaller variance in the proposed estimator for multivariate data. To balance convergence rate with computational cost, we can select k to be around 4 to 6, which converges faster than a smaller k and is still easy to compute. This behavior is shown in Figure 10, where we compare the performance of the proposed estimator using k = 2, 3, 4, 5, 6 in the same experimental setting as above. Another test we conducted was to check whether the non-asymptotic behavior of the squared Hellinger distance estimate is similar for two pairs of concentric Gaussians that have the same squared Hellinger distance. For this experiment, we picked the first pair of Gaussians to be N(0, I) and N(0, 4I), and the second pair of Gaussians to be $N(0, \frac{1}{2}I)$ and N(0, 2I).
The squared Hellinger distance between each pair of Gaussians is 0.2 and, since these two pairs are related by a single coordinate transformation on the sample space, we expect similar convergence behavior of the estimator on both pairs. The result is shown in Figure 11. As expected, the empirical estimator for vectorial data has very similar convergence behavior for each of the two pairs of Gaussians toward the same ground-truth value. Figure 11. Two pairs of concentric Gaussians with invariant squared Hellinger distance.
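The multivariate closed form used as ground truth above can be sketched as follows (the function name is our own).

```python
import numpy as np

def h2_gauss_mv(mu1, S1, mu2, S2):
    """Closed-form squared Hellinger distance between N(mu1, S1) and N(mu2, S2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    Sbar = 0.5 * (S1 + S2)
    coef = (np.linalg.det(S1) * np.linalg.det(S2)) ** 0.25 / np.sqrt(np.linalg.det(Sbar))
    dm = mu2 - mu1
    return 1.0 - coef * np.exp(-0.125 * dm @ np.linalg.solve(Sbar, dm))
```

Both concentric pairs above, (I, 4I) and (½I, 2I) in 2D, yield exactly 0.2.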

Estimator for 1D Data
We generalize the results obtained above to the family of α-divergences, to which the squared Hellinger distance belongs. Following Cichocki and Amari [9], we define the α-divergence between two probability distributions as

$D^A_\alpha(P\|Q) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx\right).$

We want to obtain an empirical estimator, similar to that in Section 3, that uses only the empirical CDFs of P and Q and estimates this quantity directly for any α ∈ (0, 1). Notice that for α = 0.5, $D^A_{0.5}(P\|Q) = 4H^2(P, Q)$, which corresponds to the squared Hellinger distance. Notice also that we can rewrite the α-divergence above in terms of the α-affinity

$A_\alpha(P\|Q) = \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx.$

Clearly, we are interested in this last quantity, so we only need an estimator for that term that converges almost surely.
For this purpose, let us define the estimator

$\hat{A}_\alpha(P\|Q) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{\delta Q_c(x_i)}{\delta P_c(x_i)}\right)^{1-\alpha}.$

Notice that, in general, the α-divergence is not symmetric. Following similar procedures as in Section 3, we can rewrite the estimator as

$\hat{A}_\alpha(P\|Q) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{q(x_i)}{p(x_i)}\right)^{1-\alpha} \left(\frac{z_i}{w_i}\right)^{1-\alpha},$

where $z_i = n\Delta P(x_i)$ and $w_i = m\Delta Q(x_i)$ as before. This sum converges almost surely (since the waiting-time distributions are independent of the data distribution) to

$\mathbb{E}\!\left[z^{1-\alpha}\right]\mathbb{E}\!\left[w^{\alpha-1}\right] \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx.$
Following the same arguments as in Section 3, we can show that the proposed estimator converges almost surely to

$C_{\alpha}\, C_{1-\alpha}\, A_\alpha(P\|Q),$

where we define the constants $C_a = \Gamma(1+a)$, so that $\mathbb{E}[z^{1-\alpha}] = \Gamma(2-\alpha) = C_{1-\alpha}$ and $\mathbb{E}[w^{\alpha-1}] = \Gamma(1+\alpha)/\Gamma(2) = C_{\alpha}$. Therefore, the bias-corrected estimator

$\hat{D}^A_\alpha(P\|Q) = \frac{1}{\alpha(1-\alpha)}\left(1 - \frac{\hat{A}_\alpha(P\|Q)}{C_{\alpha}\, C_{1-\alpha}}\right)$

converges almost surely to the true α-divergence value $D^A_\alpha(P\|Q)$. Although the α-divergence is not symmetric, it has the property that $D^A_\alpha(P\|Q) = D^A_{1-\alpha}(Q\|P)$. So, given the same two sample datasets, we can get another estimator for the same quantity, based on the sampling distribution of Q instead of P. Since $\hat{D}^A_\alpha(P\|Q)$ and $\hat{D}^A_{1-\alpha}(Q\|P)$ converge to the same divergence value, we can again create a symmetric estimator by averaging these two, and it is expected to perform similarly if not better. Lastly, notice that, when α = 0.5, we obtain $C_\alpha = C_{1-\alpha} = \frac{\sqrt{\pi}}{2}$ and $\hat{D}^A_{0.5}(P\|Q) = 4\left(1 - \frac{4}{\pi}\hat{A}_{0.5}(P\|Q)\right)$, which corresponds to the squared Hellinger estimator we have seen in Section 3, scaled by 4.
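A minimal sketch of the 1D α-divergence estimator follows. The bias constant $\Gamma(2-\alpha)\,\Gamma(1+\alpha)$ used here is our reconstruction of the product $C_\alpha C_{1-\alpha}$ (it equals $\pi/4$ at α = 0.5, matching the squared Hellinger case); helper and function names are our own.

```python
import numpy as np
from math import gamma

def left_slope(t, sample, lo, hi):
    """Left slope at points t of the piecewise-linear empirical CDF of `sample`."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    knots = np.concatenate(([lo], s, [hi]))
    vals = np.concatenate(([0.0], (np.arange(1, n + 1) - 0.5) / n, [1.0]))
    j = np.clip(np.searchsorted(knots, t, side="left"), 1, n + 1)
    return (vals[j] - vals[j - 1]) / (knots[j] - knots[j - 1])

def alpha_div_hat(x, y, alpha):
    """Empirical alpha-divergence estimator; alpha = 0.5 recovers 4 * H^2_hat."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo = min(x.min(), y.min()) - 1.0
    hi = max(x.max(), y.max()) + 1.0
    ratio = left_slope(x, y, lo, hi) / left_slope(x, x, lo, hi)
    a_hat = np.mean(ratio ** (1.0 - alpha))
    c = gamma(2.0 - alpha) * gamma(1.0 + alpha)   # bias constant C_alpha * C_{1-alpha}
    return (1.0 - a_hat / c) / (alpha * (1.0 - alpha))
```

For N(0, 1) and N(2, 1) at α = 0.5, the true value is $4(1 - e^{-1/2}) \approx 1.574$.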

Numerical Experiments
We show the asymptotic convergence of the new estimator $\hat{D}^A_\alpha(P\|Q)$, and its symmetric version, to the true α-divergence value as the data sample size grows in the experiments below. In each experiment, we took two distributions from the same family and compared the estimated α-divergence against the ground-truth value. Mean estimated values for sample sizes N = M = 10, 32, 100, 316, 1000, 3162, 10,000, 31,623 (plotted on the x-axis) are shown for each pair of distributions over 100 instances, together with the 95% confidence interval of the estimates. For each experiment, the α-divergence estimators $\hat{D}^A_\alpha(P\|Q)$ and $\hat{D}^A_{1-\alpha}(Q\|P)$ are plotted in red and blue, and the symmetric α-divergence estimator $\hat{D}^A_{\alpha,S}(P\|Q)$ is plotted in purple. In the first experiment, P, Q are taken to be the normal distributions N(0, 4) and N(1, 1), respectively, and α = 0.6. In the second experiment, P, Q are taken to be the normal distributions N(0, 1) and N(2, 1), respectively, and α = 0.4. The results are plotted in Figure 12. Notice that for two normal distributions $P \sim N(\mu_1, \sigma_1^2)$ and $Q \sim N(\mu_2, \sigma_2^2)$, we have an analytic formula for the α-divergence through the affinity:

$A_\alpha(P\|Q) = \frac{\sigma_1^{1-\alpha}\,\sigma_2^{\alpha}}{\sqrt{\alpha\sigma_2^2 + (1-\alpha)\sigma_1^2}}\, \exp\left(-\frac{\alpha(1-\alpha)(\mu_1 - \mu_2)^2}{2\left(\alpha\sigma_2^2 + (1-\alpha)\sigma_1^2\right)}\right), \qquad D^A_\alpha(P\|Q) = \frac{1 - A_\alpha(P\|Q)}{\alpha(1-\alpha)}.$

Again, we provide a comparison with an estimator that knows the distribution family. We performed experiments with Gaussian distributions, where we can use the sample mean and sample variance to estimate the distributions and then compute the α-divergence analytically using the estimated parameters. The estimator is constructed by plugging the sample estimates $\hat\mu_1, \hat\sigma_1, \hat\mu_2, \hat\sigma_2$ of the means and standard deviations of the two datasets into the analytic formula above. This estimator uses extra information, namely that the data come from Gaussian distributions. However, as we can see from the plots in Figure 13, the proposed α-divergence estimator performs similarly to the estimator that knows the distribution family.
In the first experiment, the two distributions are N(0, 4), N(1, 1) and α = 0.6. In the second experiment, the two distributions are N(0, 1), N(2, 1) and α = 0.4. For both plots, we plotted the proposed symmetric α-divergence estimator in red and the naive estimator using sampled parameters in blue. The upper bound and lower bound of each estimator, performed over 100 iterations, are plotted in dashed lines.

Estimator for Vectorial Data
Similarly to Section 4.3, we propose α-divergence estimators for samples from multivariate distributions. For this purpose, let us define

$\hat{A}_{\alpha,k}(P\|Q) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{\hat{q}_k(x_i)}{\hat{p}_k(x_i)}\right)^{1-\alpha}.$

Using similar arguments, we can show that this estimator converges almost surely to

$\frac{\Gamma(k+1-\alpha)\,\Gamma(k-1+\alpha)}{\Gamma(k)^2}\, A_\alpha(P\|Q).$

Therefore, we propose the following estimator for the α-divergence, which converges almost surely:

$\hat{D}^A_{\alpha,k}(P\|Q) = \frac{1}{\alpha(1-\alpha)}\left(1 - \frac{\Gamma(k)^2}{\Gamma(k+1-\alpha)\,\Gamma(k-1+\alpha)}\, \hat{A}_{\alpha,k}(P\|Q)\right).$

Similarly, we can extend this estimator to a symmetric version, for any fixed k, by averaging $\hat{D}^A_{\alpha,k}(P\|Q)$ and $\hat{D}^A_{1-\alpha,k}(Q\|P)$. As a remark, for the vectorial case, the above kNN-density-based empirical estimators for the α-divergences (and the squared Hellinger distance in Section 4.3 as a special case) agree with the estimators proposed in [13], although the proofs of convergence differ. Nonetheless, the univariate estimators we proposed in Sections 4.1 and 5.1 are different from trivial reductions of the kNN-based estimators in Sections 4.3 and 5.3 when taking d = 1 and k = 1.
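A brute-force sketch of the vectorial α-divergence estimator follows. The Gamma-ratio constant $\Gamma(k+1-\alpha)\Gamma(k-1+\alpha)/\Gamma(k)^2$ is our reading of the Gamma(k, k) moment argument (it reduces to the Hellinger constant at α = 0.5); the function names are our own.

```python
import numpy as np
from math import gamma

def knn_dist(t, data, k):
    """Euclidean distance from each row of t to its k-th nearest neighbor in data."""
    d2 = ((t[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(np.sort(d2, axis=1)[:, k - 1])

def alpha_div_hat_knn(x, y, alpha, k=5):
    """kNN-based alpha-divergence estimator for d-dimensional samples."""
    n, d = x.shape
    m = y.shape[0]
    r = knn_dist(x, x, k + 1)              # skip the zero self-distance
    s = knn_dist(x, y, k)
    ratio = ((n - 1) / m) * (r / s) ** d   # \hat q_k / \hat p_k at the P samples
    a_hat = np.mean(ratio ** (1.0 - alpha))
    c = gamma(k + 1 - alpha) * gamma(k - 1 + alpha) / gamma(k) ** 2
    return (1.0 - a_hat / c) / (alpha * (1.0 - alpha))
```

At α = 0.5 this reduces to four times the kNN squared Hellinger estimator; for 2D Gaussians N(0, I) and N((1, 1), I), the true value is $4(1 - e^{-1/4}) \approx 0.885$.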

Failure of a Similar Estimator for Total Variation Distance
As we have shown so far, by using the trick of waiting-time distributions, we can bias-correct an empirical-mean-type estimator to produce an almost surely convergent estimator for the KL divergence, the squared Hellinger distance, and, in general, the α-divergences. However, the same kind of trick does not work for other f-divergences whose f-function lacks certain desired properties, such as f(ab) = f(a) + f(b) for the KL divergence or f(ab) = f(a)f(b) for the Hellinger affinity, which we shall discuss in more detail later. As a simple demonstration, consider the total variation distance (TVD), which, for two continuous distributions P and Q, is defined as

$\delta(P, Q) = \frac{1}{2}\int |p(x) - q(x)|\, dx.$

Notice that the TVD is always bounded between 0 and 1. We considered paired distributions from two different families in 1D, namely normal distributions and exponential distributions. For different choices of parameters, we plotted the performance of a biased estimator using the empirical CDFs against the true TVD value. For every parameter setting, we looked at the case N = M = 10,000 and averaged over 100 instances. The estimator is defined as

$\hat{\delta}(P, Q) = \frac{1}{2n}\sum_{i=1}^{n}\left|1 - \frac{\delta Q_c(x_i)}{\delta P_c(x_i)}\right|.$

Specifically, for the normal distributions, we fixed $\mu_1 = 0$, $\sigma_1 = 1$, $\sigma_2 = 1$ and varied $\mu_2$ from 0 to 5. For the exponential distributions, we fixed $\lambda_1 = 0.1$ and varied $\lambda_2$ from 0.1 to 7. This generated a range of true TVD values spaced between 0 and 1 for each distribution family. Figure 14 plots the biased estimator values (y-axis) against the true TVD values (x-axis) for pairs of normal distributions P, Q in blue and pairs of exponential distributions P, Q in red. The confidence intervals are also plotted. We observe that, for the same true TVD value, the biased estimator produces different values for different distribution families, and the relationship looks nonlinear and depends on the distribution family itself.
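The failure mode is easy to reproduce. The sketch below is one natural reading of the biased empirical-mean TVD estimator described above (helper and function names are our own); even for two samples from the same distribution, where the true TVD is zero, the estimate stays far from zero, since no constant correction can remove the waiting-time noise inside the absolute value.

```python
import numpy as np

def left_slope(t, sample, lo, hi):
    """Left slope at points t of the piecewise-linear empirical CDF of `sample`."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    knots = np.concatenate(([lo], s, [hi]))
    vals = np.concatenate(([0.0], (np.arange(1, n + 1) - 0.5) / n, [1.0]))
    j = np.clip(np.searchsorted(knots, t, side="left"), 1, n + 1)
    return (vals[j] - vals[j - 1]) / (knots[j] - knots[j - 1])

def tvd_hat_biased(x, y):
    """Uncorrected empirical-mean TVD estimator 1/(2n) * sum |1 - dQc/dPc|."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo = min(x.min(), y.min()) - 1.0
    hi = max(x.max(), y.max()) + 1.0
    ratio = left_slope(x, y, lo, hi) / left_slope(x, x, lo, hi)
    return 0.5 * np.mean(np.abs(1.0 - ratio))
```

With P = Q, the slope ratio behaves like an Exp(1)/Gamma(2, 1) ratio, whose mean absolute deviation from 1 is not zero, so the estimate hovers near a distribution-free positive constant instead of the true value 0.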
This indicates that the proposed estimator cannot be uniformly corrected with a simple additive and/or multiplicative constant, as was done for the squared Hellinger distance (and the α-divergences in general) and in [1] for the KL divergence. Therefore, we conclude that, so far, the proposed methodology works for the KL divergence, the squared Hellinger distance, and the α-divergences in general, but cannot be extended to general f-divergences in a straightforward way.

Uniqueness of α-Divergences
We provide a more detailed explanation of why the α-divergences are the unique family of f-divergences that can be estimated using our type of estimator based on waiting-time random-variable transformations. Take the vectorial case, for example, where we construct kNN empirical density estimates $\hat{p}_k, \hat{q}_k$ of the probability densities. For an estimator based on these estimates to work for an f-divergence, we would require the f-divergence to be computable through an affinity term, an integral of the form

$\int f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, dx$

up to some constant terms, and we require that the affinity-generating function f satisfy a functional form that can be separated as either f(ab) = g(a) + h(b) or f(ab) = g(a)h(b) for some functions g and h. This restriction is made because, as we have seen for the KL and α-divergence estimators, we rely on the independence property of the waiting-time random variables; this lets us separate the empirical sums into three terms that converge separately, showing that the estimator converges asymptotically up to additive or multiplicative bias constants. Let us examine these two types of restrictions on f.
For f to satisfy f(xy) = g(x) + h(y) for all x, y > 0, we can see that f is equivalent to g and h in the sense that they differ only by constants. Differentiating the previous equation with respect to x and setting x = 1, we get

$y\, f'(y) = g'(1) = c,$

where $c = g'(1)$ is a constant. The unique family of solutions to this condition is $f(y) = c \log y$, up to an additive constant. This obviously corresponds to the KL divergence and the reverse KL divergence when integrated against P and Q, respectively. For the other case, where f(xy) = g(x)h(y) for all x, y > 0, let us consider differentiating both sides with respect to x; this gives

$f'(xy)\, y = g'(x)\, h(y).$
Taking logs on both sides and letting $l = \log f'$, $m = \log g'$:

$l(xy) + \log y = m(x) + \log h(y).$

Now take the derivative with respect to x again and set x = 1; we get

$y\, l'(y) = m'(1) = c,$

where $c = m'(1)$ is a constant. The unique family of solutions satisfying this last condition is $l(y) = c \log y + C$, and hence $f(y) = a y^{b}$ is the general solution up to an additive constant. Without loss of generality, this corresponds uniquely to the affinity term of interest of the family of α-divergences, where $f(y) = y^{\alpha}$, α ∈ (0, 1), up to some constant terms.
Since the KL and reverse KL divergences are limits of the α-divergences at the two endpoints, we can conclude that the unique family of f-divergences that can be estimated with the proposed estimators based on waiting-time random variables is the family of α-divergences. There is an interesting connection, pointed out by Amari [14], stating that the α-divergences are the unique intersection of the f-divergences and the decomposable Bregman divergences on the space of positive measures. Notice that, restricted to the space of probability measures, the intersection reduces to only the KL divergences. Although this result does not directly connect to the uniqueness of the α-divergences being estimable through our proposed methodology, the proof technique, which shows that the α-divergence is the unique f-divergence allowing a decomposition into the Bregman divergence dual functions up to a nonlinear coordinate transformation, is very similar to what we carried out above and reaches the same conclusion: the function f must take a power-function form, corresponding to an α-divergence, which in the limits of α becomes a logarithm, corresponding to the KL and reverse KL divergences.

Applications
The proposed estimator finds interesting applications in statistical estimation theory, clustering algorithms, visualization/embedding algorithms, and possibly online learning algorithms. We next describe a few such examples.

Bounding the Neyman-Pearson Region by Hellinger Affinity
We show that the Neyman-Pearson region contains one convex region determined by the Hellinger affinity and is contained in another. These inclusion relations generalize the classical inequalities between the total variation distance and the Hellinger distance. Deploying our estimator $\hat{A}_S(P, Q)$ for the Hellinger affinity, we can approximately bound the Neyman-Pearson region.
Our results (see Appendix A for more details) show that, for two distributions p, q and weights s, t > 0 (which can be chosen so that s + t = 2 in the standard case), the Neyman-Pearson region for the type I error α(E) and type II error β(E) satisfies, for the optimal choice of event E,

$\min_E \{ s\,\alpha(E) + t\,\beta(E) \} = \int \min\big(s\,p(x),\, t\,q(x)\big)\, dx = \frac{s+t}{2} - \frac{1}{2}\int |s\,p(x) - t\,q(x)|\, dx,$

and can hence be bounded, in terms of the Hellinger affinity $\rho(p, q) = \int \sqrt{p(x)\,q(x)}\, dx$ (in the standard case s = t = 1), by the inequalities

$\frac{1}{2}\rho(p, q)^2 \le \min_E \{\alpha(E) + \beta(E)\} = 1 - \delta(p, q) \le \rho(p, q).$

Hence, by substituting our bias-corrected symmetric estimator for the Hellinger affinity term, $\frac{4}{\pi}\hat{A}_S(p, q) \approx \rho(p, q)$, we can approximately bound the Neyman-Pearson region given two samples from the distributions p and q. If we are dealing with multivariate distributions, then the appropriate multivariate Hellinger affinity estimator from Section 4.3 can be used to approximately bound the Neyman-Pearson region. As a remark, we observe that there is no comparable general relationship between the Neyman-Pearson region and the KL divergence or the other α-divergences (besides the Hellinger distance).
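Combining the affinity estimator with the standard Le Cam-style sandwich gives computable bounds on the minimal total error. The sketch below is a minimal illustration under our own implementation choices (helper and function names are ours).

```python
import numpy as np

def left_slope(t, sample, lo, hi):
    """Left slope at points t of the piecewise-linear empirical CDF of `sample`."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    knots = np.concatenate(([lo], s, [hi]))
    vals = np.concatenate(([0.0], (np.arange(1, n + 1) - 0.5) / n, [1.0]))
    j = np.clip(np.searchsorted(knots, t, side="left"), 1, n + 1)
    return (vals[j] - vals[j - 1]) / (knots[j] - knots[j - 1])

def affinity_hat_sym(x, y):
    """Bias-corrected symmetric Hellinger affinity estimate (4/pi) * A_hat_S."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo = min(x.min(), y.min()) - 1.0
    hi = max(x.max(), y.max()) + 1.0
    a_pq = np.mean(np.sqrt(left_slope(x, y, lo, hi) / left_slope(x, x, lo, hi)))
    a_qp = np.mean(np.sqrt(left_slope(y, x, lo, hi) / left_slope(y, y, lo, hi)))
    return (2.0 / np.pi) * (a_pq + a_qp)

def np_region_bounds(rho):
    """Bounds rho^2/2 <= min_E {alpha(E) + beta(E)} = 1 - TV <= rho."""
    return 0.5 * rho ** 2, rho
```

For N(0, 1) versus N(1, 1), the minimal total error is $2(1 - \Phi(1/2)) \approx 0.617$, which should land between the two estimated bounds.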

Estimating Eigenvalues of the Matrix Pencil for Inference in the Family of Concentric Gaussians
Consider two multivariate distributions from the concentric Gaussian family $P = N(0, C_1^2)$, $Q = N(0, C_2^2)$, where $C_1^2, C_2^2 \in \mathbb{R}^{d\times d}$. It can be shown that any meaningful statistical inference function of the two covariance matrices should satisfy $\phi(C_1^2, C_2^2) = \phi(I, \Lambda)$, where Λ is the diagonal matrix whose diagonal entries $\lambda_1, \dots, \lambda_d$ are the eigenvalues of the matrix $C_1^{-1} C_2^2 (C_1^{-1})^{*}$; see Appendix C for more details. Since Λ is diagonal and I is simply the identity matrix, we can write $\phi(C_1^2, C_2^2) = h(\lambda_1, \dots, \lambda_d)$. Hence, any inference we can make about the two concentric Gaussians depends only on the sufficient statistics, namely the eigenvalues $\lambda_1, \dots, \lambda_d$. In the case of the Hellinger affinity (and, in general, the affinities for α-divergences), we can write it as the product

$A_\alpha(P\|Q) = \prod_{i=1}^{d} h(1, \lambda_i),$

where $h(1, \lambda_i)$, for i = 1, . . . , d, is the affinity calculated for the two univariate Gaussian distributions N(0, 1) and $N(0, \lambda_i)$. For example, we have an analytic formula for the affinity term of the α-divergence family between such univariate Gaussian distributions:

$h(1, \lambda) = \frac{\lambda^{\alpha/2}}{\sqrt{\alpha\lambda + 1 - \alpha}}.$
Then, for the d-dimensional multivariate concentric Gaussians, we have

A_α(P||Q) = ∏_{i=1}^d λ_i^{(1−α)/2} / √((1−α) λ_i + α).

Now, given d distinct values α_1, ..., α_d, the affinity values A_{α_1}(P||Q), ..., A_{α_d}(P||Q) can be used to determine the eigenvalues λ_1, ..., λ_d. Since our proposed estimator for vectorial α-affinities Â_{α_k}(P||Q) converges up to a multiplicative constant, we can use the estimated affinities at d different values α = α_1, ..., α_d to estimate the eigenvalues λ_1, ..., λ_d by solving a system of d equations. The estimated values λ̂_1, ..., λ̂_d can then be used for any inference problem on these two probability distributions, for which they are sufficient. This significantly reduces the noise relative to estimating the entire covariance matrices C_1^2, C_2^2, which can suffer from over-parametrization when the data come from high dimensions.
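The system of d equations can be solved with a standard root finder. The sketch below (our own illustration, assuming the closed form h_α(1, λ) = λ^{(1−α)/2}/√((1−α)λ + α) in the convention A_α = ∫ p^{1−α} q^α) uses Newton's method in u = log λ, which keeps the eigenvalue iterates positive; in practice log_A would be the log of the estimated affinities Â_{α_k}(P||Q):

```python
import numpy as np

def log_affinity(alphas, lams):
    # log A_alpha(P||Q) = sum_i log h_alpha(1, lam_i)
    a = np.asarray(alphas, dtype=float)[:, None]   # shape (d, 1)
    lam = np.asarray(lams, dtype=float)[None, :]   # shape (1, d)
    logh = 0.5 * (1 - a) * np.log(lam) - 0.5 * np.log((1 - a) * lam + a)
    return logh.sum(axis=1)                        # one value per alpha

def solve_eigenvalues(alphas, log_A, u0, iters=50, eps=1e-7):
    # Newton's method in u = log(lam) with a finite-difference Jacobian.
    u = np.asarray(u0, dtype=float)
    for _ in range(iters):
        F = log_affinity(alphas, np.exp(u)) - log_A
        if np.max(np.abs(F)) < 1e-12:
            break
        J = np.empty((len(alphas), len(u)))
        for j in range(len(u)):
            du = np.zeros_like(u); du[j] = eps
            J[:, j] = (log_affinity(alphas, np.exp(u + du))
                       - log_affinity(alphas, np.exp(u - du))) / (2 * eps)
        u = u - np.linalg.solve(J, F)
    return np.exp(u)

# Hypothetical example: recover lam = (0.5, 3.0) from two alpha-affinities.
alphas = [0.3, 0.7]
true_lams = np.array([0.5, 3.0])
log_A = log_affinity(alphas, true_lams)   # in practice: log of estimated affinities
lams_hat = solve_eigenvalues(alphas, log_A, u0=[np.log(0.7), np.log(2.5)])
```

Since the product over i is symmetric in the eigenvalues, the solution is determined only up to permutation, and the initial guess should avoid λ_i = 1, where the Jacobian degenerates.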

Stock Clustering and Visualization
We next describe a simple application to stock segmentation in a portfolio allocation setting. Consider N stocks with T historical dates. Let r_{i,t}, i ∈ [N], t ∈ [T], denote the return of stock i on date t, and let R_i denote the random variable for the returns of stock i, observed through the data {r_{i,t}}_{t∈[T]}. To cluster this universe of stocks into K distinct groups, we first apply the symmetric Hellinger distance estimator Ĥ_S(R_i, R_j) to every pair of stocks i ≠ j ∈ [N]. Since the estimator is symmetric, this yields a symmetric distance matrix, denoted D_H. It is also possible to combine the Hellinger distance with a correlation-based distance through suitable transformations. After obtaining the distance matrix (or an affinity matrix, by subtracting it from 1), we can deploy any desired clustering algorithm on it; the result is K clusters of stocks grouped by similarity in the chosen distance sense. We can also add another step, namely repairing the distance matrix before clustering: the matrix estimated using the proposed estimator need not correspond exactly to a metric, meaning some triples of stocks may violate the triangle inequality. A simple sparse metric repair algorithm can be applied; see, for example, [15]. The resulting clustering can be helpful for portfolio allocation strategies, since we can build sub-strategies inside each cluster and merge them together.
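The pipeline up to the clustering step can be sketched as follows. This is our own illustration on synthetic returns; the histogram plug-in estimate below is a simple stand-in for the paper's CDF-based estimator Ĥ_S, not the paper's method:

```python
import numpy as np

def hellinger_distance(x, y, bins=30):
    # Plug-in Hellinger distance via histograms on shared bin edges
    # (a crude stand-in for the empirical-CDF estimator H_S).
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum(); q = q / q.sum()
    rho = np.sum(np.sqrt(p * q))               # Hellinger affinity in [0, 1]
    return np.sqrt(max(1.0 - rho, 0.0))        # H in [0, 1]

rng = np.random.default_rng(0)
# Hypothetical returns: stocks 0 and 1 share a distribution, stock 2 differs.
returns = [rng.normal(0, 1, 2000), rng.normal(0, 1, 2000), rng.normal(0, 2, 2000)]
N = len(returns)
D_H = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1, N):
        D_H[i, j] = D_H[j, i] = hellinger_distance(returns[i], returns[j])
# D_H (or 1 - D_H as an affinity matrix) can now be fed to any clustering
# algorithm, optionally after a metric repair step.
```

On this toy data the entry D_H[0, 1] (same distribution) comes out well below D_H[0, 2] and D_H[1, 2] (different variances), which is the separation the clustering step exploits.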
Another application of the same sample-based distance matrix is in visualization algorithms such as FATE [16], which accept a precomputed distance/affinity matrix as the specification of the dataset. The visualization algorithm uses the input distances to compute low-dimensional embeddings that preserve the local/global structure of the dataset, which can be useful in many downstream applications. Here, our estimator can serve to compute the input distance matrix for sample data from N entities, using the Hellinger distance or an α-divergence as the distance. This can again be used in conjunction with a metric repair algorithm to adjust for the biases and errors of the empirical estimators.

Other Applications
Lastly, we suspect that the proposed estimators can find interesting applications in UCB-type algorithms in multi-armed bandit frameworks, where the estimated pairwise Hellinger distances/α-divergences between the sample distributions of different arms can be used to eliminate arms that fall outside the confidence balls around the historically best arms. We leave these open problems as future work.

Conclusions
We have proposed an estimator for the Hellinger affinity, and hence the squared Hellinger distance, between samples from two distributions, based solely on the empirical CDF and without the need to estimate the densities themselves. We have proven its almost-sure convergence to the true squared Hellinger distance and have constructed a symmetric version of this estimator. Our experiments illustrate the convergence behavior and show that the symmetric estimator, obtained by averaging the two one-sided estimators of the squared Hellinger distance, is generally a favorable choice due to its accuracy and smaller variance. We then extended the estimator to the family of α-divergences, where similar properties hold up to small modifications; for each choice of α, we also showed how to construct a symmetric version of the estimator. We further extended the respective estimators to multivariate data in higher dimensions using k-nearest-neighbor-based estimators, and gave numerical examples showing the convergence of the proposed estimators. We conclude that the α-divergence family is the unique subfamily of f-divergences that can be estimated consistently using the proposed methodologies. Our proposed estimators can be applied to approximately bounding the Neyman-Pearson region of a statistical test, among many other applications.

Since (1/Γ(k)) ∫_0^∞ z^{k−1} e^{−z} log z dz = ψ(k) = −γ + ∑_{j=1}^{k−1} 1/j, where γ ≈ 0.5772 is the Euler-Mascheroni constant, we conclude that Ĥ_k(P) + log k − ∑_{j=1}^{k−1} 1/j + γ converges almost surely to H(P). On a related note, a class of estimators of the Rényi and Tsallis entropies for multidimensional densities, also based on computing k-nearest-neighbor distances from empirical samples, has been studied in [17].

By the Cauchy-Schwarz inequality, since |s p − t q| = |√(s p) − √(t q)| · (√(s p) + √(t q)),

( (1/2) ∫ |s p(x) − t q(x)| dx )^2 ≤ (1/4) ∫ (√(s p) − √(t q))^2 dx · ∫ (√(s p) + √(t q))^2 dx = (1/4) (s + t − 2√(st) ρ(p, q)) (s + t + 2√(st) ρ(p, q)) = ((s + t)/2)^2 − s t ρ(p, q)^2.

Applying this to the chain of inequalities, we obtain for the Neyman-Pearson boundary the upper and lower bounds in terms of the Hellinger affinity:

(s + t)/2 − √( ((s + t)/2)^2 − s t ρ(p, q)^2 ) ≤ inf_E [s α(E) + t β(E)] ≤ √(s t) ρ(p, q).

This bound has the significant advantage of also bounding the Neyman-Pearson region for joint distributions, such as those arising from i.i.d. samples, because the Hellinger affinity tensorizes: ρ(p^⊗n, q^⊗n) = ρ(p, q)^n. There is no analogous general relationship between the Kullback-Leibler divergence and Neyman-Pearson regions since, say, in the Bernoulli family we can have a sequence of pairs {(p_k, q_k)} such that H^2(p_k, q_k) → 0 while D_KL(p_k||q_k) + D_KL(q_k||p_k) → ∞.
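The Bernoulli counterexample can be checked numerically. The particular sequence p_k = Bern(1/k), q_k = Bern(e^{−k²}) is our own choice of witness, not necessarily the one in the paper:

```python
import math

def hellinger_sq(p, q):
    # Squared Hellinger distance between Bernoulli(p) and Bernoulli(q):
    # H^2 = 1 - (sqrt(p q) + sqrt((1-p)(1-q))).
    return 1.0 - (math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q)))

def kl(p, q):
    # KL divergence D(Bern(p) || Bern(q)), with the convention 0 log 0 = 0.
    out = 0.0
    if p > 0: out += p * math.log(p / q)
    if p < 1: out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

# p_k = Bern(1/k), q_k = Bern(exp(-k^2)): H^2 -> 0 while KL -> infinity.
for k in (5, 10, 20):
    p, q = 1.0 / k, math.exp(-k * k)
    print(k, hellinger_sq(p, q), kl(p, q))
```

Intuitively, the Hellinger affinity term √((1−p)(1−q)) → 1 forces H² → 0, while the p log(p/q) term in the KL divergence grows like k²/k = k without bound.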