Finite-Sample Bounds on the Accuracy of Plug-In Estimators of Fisher Information

Finite-sample bounds on the accuracy of Bhattacharya’s plug-in estimator for Fisher information are derived. These bounds are further improved by introducing a clipping step that allows for better control over the score function. This leads to superior upper bounds on the rates of convergence, albeit under slightly different regularity conditions. The performance bounds on both estimators are evaluated for the practically relevant case of a random variable contaminated by Gaussian noise. Moreover, using Brown’s identity, two corresponding estimators of the minimum mean-square error are proposed.


Introduction
This work considers the problem of estimating the Fisher information for the location of a univariate probability density function (PDF) f based on n random samples Y_1, ..., Y_n independently drawn from f. To clarify, the Fisher information of a differentiable density function f is given by

I(f) = ∫_R (f'(t))² / f(t) dt,   (1)

where f' is the derivative of f. For the remainder of the paper, it is assumed that {t : f(t) > 0} = R, but an extension to the general case is not difficult. The paper considers plug-in estimators based on kernel density estimates of f. That is, the Fisher information is estimated by plugging a kernel density estimate of f into the right-hand side of (1). Estimation of the Fisher information in (1) via a plug-in estimator based on kernel density estimates was first considered by Bhattacharya in [1]. Bhattacharya showed that, under mild conditions on f, the plug-in estimator is consistent for a large class of kernels, and he provided bounds on its accuracy in the large (asymptotic) sample regime. These bounds were later revised and improved by Dmitriev and Tarasenko in [2]. However, to the best of our knowledge, no finite-sample bounds on the accuracy of Bhattacharya's estimator can be found in the literature. This paper aims at closing that gap.
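For illustration, the definition in (1) can be verified numerically for a Gaussian density, whose location Fisher information equals 1/σ²; the density, grid, and step size below are illustrative choices and not part of the analysis.

```python
# Numerical check of (1) for a Gaussian density with standard deviation sigma:
# the location Fisher information should be 1 / sigma**2 (here 0.25).
import numpy as np

sigma = 2.0
t = np.linspace(-20.0, 20.0, 20001)           # integration grid (illustrative)
f = np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
df = -t / sigma**2 * f                        # f'(t)

fisher = np.sum(df**2 / f) * (t[1] - t[0])    # Riemann-sum approximation of (1)
print(fisher)                                 # approximately 0.25
```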
Bounds on the accuracy of plug-in estimators rely on bounds on the accuracy of the underlying density estimators. For kernel-based density estimators, such bounds have received considerable attention in the literature. For example, Schuster [3] showed that, under mild regularity conditions, the estimation error for higher-order derivatives can be controlled by the estimation error for the corresponding cumulative distribution function (CDF). The interested reader is referred to [4–8] and the references therein.

In this paper, as a preliminary result for the analysis of Bhattacharya's estimator, the bounds in [3] are further tightened by replacing some suboptimal constants with the optimal ones. Section 2 shows that the sample complexity of Bhattacharya's estimator is considerable and that the potentially unbounded score function is a critical bottleneck for tighter bounds. Section 3 proposes a "harmless" modification of Bhattacharya's estimator, namely, a clipping of the estimated score function, which is shown to be sufficient to remedy its large sample complexity. In particular, Theorem 3 shows that the clipped estimator has significantly better bounds on rates of convergence, albeit with slightly different assumptions on the PDF. Section 4 evaluates the convergence rates of the two estimators for the practically relevant case of a random variable contaminated by additive Gaussian noise. Moreover, using Brown's identity, which relates the Fisher information and the MMSE, consistent estimators for the MMSE are proposed and their rates of convergence are evaluated in Proposition 1. Section 5 concludes the paper.

Notation
The expected value and variance of a random variable X are denoted by E[X] and Var(X), respectively. The gamma function is denoted by Γ(·). Estimators of a PDF f based on n samples are denoted by f_n. No notational distinction is made between an estimator, which is a random variable, and its realizations (estimates), which are deterministic. However, the difference will be clear from the context or will be highlighted explicitly otherwise. The nth derivative of a function F : R → R is denoted by F^(n); the first-order derivative is also denoted by F' to improve readability.

Bhattacharya's Estimator
In this section, we revisit the asymptotically consistent estimator proposed by Bhattacharya in [1] and produce explicit and non-asymptotic bounds on its accuracy.
Bhattacharya's estimator is given by

I_n(f_n) = ∫_{-k_n}^{k_n} (f_n'(t))² / f_n(t) dt,   (2)

where k_n ≥ 0 determines the integration interval as a function of the sample size n and the unknown functions f and f' are replaced by their kernel estimates, that is,

f_n(t) = 1/(n a_0) Σ_{i=1}^{n} K((t − Y_i)/a_0),   (3)

f_n'(t) = 1/(n a_1²) Σ_{i=1}^{n} K'((t − Y_i)/a_1).   (4)

Here, a_0, a_1 > 0 are bandwidth parameters, and K : R → R denotes the kernel, which is assumed to satisfy certain regularity conditions that will be discussed later in this section.
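The following sketch illustrates (2)–(4) with a Gaussian kernel; the bandwidths a_0, a_1, the truncation k_n, and the integration grid are illustrative choices rather than the ones prescribed by the theory developed below.

```python
# A minimal sketch of the plug-in estimator (2) with the kernel estimates (3)
# and (4); parameter values are illustrative, not theoretically optimized.
import numpy as np

SQRT2PI = np.sqrt(2.0 * np.pi)

def kde(t, samples, a):
    """Gaussian-kernel estimate f_n(t) with bandwidth a, cf. (3)."""
    u = (t[:, None] - samples[None, :]) / a
    return np.exp(-u**2 / 2).sum(axis=1) / (len(samples) * a * SQRT2PI)

def kde_deriv(t, samples, a):
    """Estimate of the derivative f_n'(t) with bandwidth a, cf. (4)."""
    u = (t[:, None] - samples[None, :]) / a
    return (-u * np.exp(-u**2 / 2)).sum(axis=1) / (len(samples) * a**2 * SQRT2PI)

def bhattacharya_estimate(samples, a0, a1, k_n, m=801):
    """Numerically integrate (f_n')^2 / f_n over [-k_n, k_n], cf. (2)."""
    t = np.linspace(-k_n, k_n, m)
    f, df = kde(t, samples, a0), kde_deriv(t, samples, a1)
    return np.sum(df**2 / f) * (t[1] - t[0])

rng = np.random.default_rng(0)
y = rng.normal(size=2000)                    # samples from a standard normal
print(bhattacharya_estimate(y, a0=0.3, a1=0.3, k_n=3.0))   # roughly 1
```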

Estimating a Density and Its Derivatives
In order to analyze plug-in estimators, it is necessary to obtain rates of convergence for f n and f n , that is, the kernel estimators of the density and its derivative. The following result, which is largely based on the proof by Schuster in [3], provides such rates. The proof in [3] makes use of the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality for the empirical CDF. The next lemma refines the result in [3] by using the best possible constant for the DKW inequality shown in [20].

Then, for any ε > δ_{r,a_r} and any n ≥ 1, the following bound holds:

Proof. See Appendix A.

Analysis of Bhattacharya's Estimator
The following theorem is a non-asymptotic refinement of the result obtained by Bhattacharya in Theorem 3 of [1] and Dmitriev and Tarasenko in Theorem 1 of [2].
Then, provided that the conditions in (9) and (10) hold, the following bound holds:

where the quantities ρ_max and c appearing in the bound are given by (12) and (13), respectively.

Proof. See Appendix B.
The bound in (11) is an improvement over the original bound in [1,2], which contains terms of the form φ⁴(k_n).
Note that φ(k_n) in (8) can increase rapidly with k_n. For example, as will be shown later, φ(k_n) increases super-exponentially with k_n for a random variable contaminated by Gaussian noise. This implies that, while Bhattacharya's estimator converges, the rate of convergence guaranteed by the bound in (11) is extremely slow. A modified bound is proposed in the subsequent theorem.
for some f_0 ∈ R. If the assumptions in (8), (9), and (10) hold, then the bound in (15) holds, where ρ_max and c are given by (12) and (13), respectively, and d_g(k_n) denotes the number of zeros of the derivative of the function g on the interval [−k_n, k_n].

Proof. See Appendix C.

Remark 1. Note that ψ in (15) is on the order of log(φ(k_n)), which typically increases much more slowly with k_n than φ in (11). As a result, the bound in Theorem 2 can lead to a better bound on the convergence rate than that in Theorem 1, given appropriate upper bounds on d_f and d_{f_n}.

Since Gaussian blurring of a univariate density function never creates new maxima, we have that d_{f_Y} ≤ d_{f_X}, which is a constant. However, to the best of our knowledge, the only known upper bound on d_{f_n} is d_{f_n} ≤ n [21] (Theorem 2), which is not useful in practice. Despite this drawback, we include Theorem 2 for the sake of completeness and in the hope that tighter bounds on d_{f_n} might be established in the future.
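For intuition only (this is not a procedure used in the analysis), d_{f_n}(k_n) can be gauged numerically by counting sign changes of the estimated derivative on a fine grid; the kernel, bandwidth, and sample size below are illustrative.

```python
# Rough numerical proxy for d_{f_n}(k_n): count sign changes of f_n' on a grid.
import numpy as np

def count_derivative_zeros(samples, a, k_n, m=2001):
    t = np.linspace(-k_n, k_n, m)
    u = (t[:, None] - samples[None, :]) / a
    df = (-u * np.exp(-u**2 / 2)).sum(axis=1) / (len(samples) * a**2 * np.sqrt(2 * np.pi))
    return int(np.sum(np.diff(np.sign(df)) != 0))

rng = np.random.default_rng(1)
y = rng.normal(size=2000)                         # samples from a unimodal density
print(count_derivative_zeros(y, a=0.3, k_n=3.0))  # typically 1 (one maximum)
```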
The main problem in the convergence analysis of the estimator in (2) is the need to control 1/f_n(t), which requires the density to be bounded away from zero on the integration interval. For distributions with sub-Gaussian tails, this implies that the interval [−k_n, k_n] on which this is guaranteed to be the case grows sub-logarithmically (compare Theorem 4), causing the required number of samples to grow super-exponentially. In the next section, we propose an estimator that has better guaranteed rates of convergence.

The Clipped Bhattacharya Estimator
In order to remedy the slow guaranteed convergence rates of Bhattacharya's estimator, we dispense with the tail assumption in (8), but introduce the new assumption that the unknown true score function ρ(t) = f'(t)/f(t) is bounded (in absolute value) by a known function ρ̄. This allows us to clip f_n'(t)/f_n(t) and, in turn, 1/f_n(t) without affecting the consistency of the estimator.
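A minimal sketch of one way to realize this clipping is given below; it assumes the estimator integrates the squared clipped score against f_n, which is one natural reading of the construction, and the bound ρ̄ and all parameter values are illustrative.

```python
# One way to implement the clipping idea: clip the estimated score f_n'/f_n at
# +/- rho_bar(t) before forming the plug-in integral. This is a sketch; the
# exact definition of the clipped estimator may differ in detail.
import numpy as np

def clipped_estimate(samples, a0, a1, k_n, rho_bar, m=801):
    t = np.linspace(-k_n, k_n, m)
    n = len(samples)
    u0 = (t[:, None] - samples[None, :]) / a0
    u1 = (t[:, None] - samples[None, :]) / a1
    f = np.exp(-u0**2 / 2).sum(axis=1) / (n * a0 * np.sqrt(2 * np.pi))
    df = (-u1 * np.exp(-u1**2 / 2)).sum(axis=1) / (n * a1**2 * np.sqrt(2 * np.pi))
    score = np.clip(df / f, -rho_bar(t), rho_bar(t))   # clipped score estimate
    return np.sum(score**2 * f) * (t[1] - t[0])

rng = np.random.default_rng(2)
y = rng.normal(size=2000)
# For a standard normal, the true score is -t, so |t| + 1 is a valid bound.
print(clipped_estimate(y, 0.3, 0.3, 3.0, rho_bar=lambda t: np.abs(t) + 1.0))
```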

In addition, if f(t) is bounded as in (14), then the following bound holds:

where ψ and d_f are defined in (16) and (17), respectively.
Proof. See Appendix D.
For the upper-bound function ρ̄(t) in assumption (18), in practice, we can set ρ̄(k_n) = ρ_max(k_n) if the latter is available. Although ρ_max(k_n) also increases with k_n, it usually increases much more slowly than φ(k_n). For example, as shown later, ρ_max(k_n) is linear in k_n in the Gaussian noise case. As a result, better bounds on the convergence rate can be shown for the clipped estimator.

Estimation of the Fisher Information of a Random Variable in Gaussian Noise
This section evaluates the results of Sections 2 and 3 for the important special case of a random variable contaminated by additive Gaussian noise. To this end, we let f_Y denote the PDF of a random variable

Y = √snr · X + Z,   (26)

where snr > 0 is a signal-to-noise-ratio parameter, X is an arbitrary random variable, Z is a standard Gaussian random variable, and X and Z are independent. We are interested in estimating the Fisher information of f_Y. We only make the very mild assumption that X has a finite second moment, but otherwise, it is allowed to be an arbitrary random variable. We further assume that snr is known and that Gaussian kernels are used in the density estimators, i.e.,

K(t) = (1/√(2π)) e^{−t²/2}.   (27)

The following lemma provides explicit expressions for the quantities appearing in Sections 2 and 3 that are needed to evaluate the error bounds for the Bhattacharya and the clipped estimator.

Lemma 2. Let K be as in (27). Then,

Proof. See Appendix F.
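To make the model in (26) concrete, consider the illustrative special case of a binary X; then f_Y is a two-component Gaussian mixture and I(f_Y) can be evaluated by numerically integrating (1). The choice of X below is an assumption made only for this illustration.

```python
# For X uniform on {-1, +1} and Z ~ N(0,1), f_Y is a two-component Gaussian
# mixture; I(f_Y) is then computed by numerical integration of (1) and is
# strictly smaller than 1 (the Fisher information of the noise alone).
import numpy as np

snr = 1.0
means = np.sqrt(snr) * np.array([-1.0, 1.0])

def f_Y(t):
    comps = [np.exp(-(t - m)**2 / 2) / np.sqrt(2 * np.pi) for m in means]
    return 0.5 * (comps[0] + comps[1])

def df_Y(t):
    comps = [-(t - m) * np.exp(-(t - m)**2 / 2) / np.sqrt(2 * np.pi) for m in means]
    return 0.5 * (comps[0] + comps[1])

t = np.linspace(-12.0, 12.0, 24001)
print(np.sum(df_Y(t)**2 / f_Y(t)) * (t[1] - t[0]))   # I(f_Y) < 1
```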
We now bound c(k_n). To this end, we need the notion of sub-Gaussian random variables: a random variable X is said to be α-sub-Gaussian if the condition in (34) holds. In addition, if |X| is α-sub-Gaussian, then

Proof. See Appendix G.
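For reference, a standard moment-generating-function formulation of the α-sub-Gaussian property is displayed below; the precise form of the condition in (34) may differ from it.

```latex
% Standard sub-Gaussian condition (stated for reference; the exact form of
% (34) may differ): X is \alpha-sub-Gaussian if
\mathbb{E}\!\left[ e^{\lambda X} \right] \le e^{\alpha^2 \lambda^2 / 2}
\quad \text{for all } \lambda \in \mathbb{R}.
```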

Theorem 4. Let K be as in (27). Then,

where

ε_n ≤ n^{−w_1} u log(n) 4c_3 + 12 u log(n) + 2c_5 n^{u−w_1} + c_5 n^{u−w_0}

and where the constants are given by

In addition, if |X| is α-sub-Gaussian, then

ε_n ≤ n^{−w_1} u log(n) 4c_3 + 12 u log(n) + 2c_5 n^{u−w_1} + c_5 n^{u−w_0}

where c_6 = 2

Note that the parameters k_n, a_0, and a_1 are chosen so as to guarantee the convergence of I_n(f_n) to I(f_Y) with probability 1. For details, please refer to the proof in Appendix H. The parameters u and w in the above theorem are auxiliary variables that couple the bandwidth of the kernel density estimators in (3) and (4) with the integration range of the Fisher information estimator in (2). Choosing them according to Theorem 4 results in a trade-off between precision, ε_n, and confidence, i.e., the probability of the estimation error exceeding ε_n. On the one hand, small values of u and large values of w result in better precision (i.e., small ε_n) at the cost of lower confidence (i.e., a large probability of exceeding ε_n). On the other hand, large values of u and small values of w improve the confidence but deteriorate the precision. In turn, this also affects the convergence rates, meaning that faster convergence of the precision can be achieved at the expense of a slower convergence of the confidence and vice versa.

Convergence of the Clipped Estimator
From the evaluation of Bhattacharya's estimator in Theorem 4, it is apparent that the bottleneck term is the truncation parameter k_n = u log(n), which results in slow precision decay of the order ε_n = O(1/√(u log(n))). Next, it is shown that the clipped estimator results in improved precision over Bhattacharya's estimator. Specifically, the precision will be shown to decay polynomially in n instead of logarithmically. Another benefit of the clipped estimator is that its error analysis holds for every n ≥ 1. By utilizing the results in Lemma 1, Lemma 2, and Lemma 3, we specialize the result in Theorem 2 to the Gaussian noise case.

Theorem 5. Let K be as in (27). Choose the parameters of the clipped estimator as follows: 1/6, and k_n = n^u, where u ∈ (0, min{w_0/3, w_1/2}). Then, for n ≥ 1,

Proof. See Appendix I.
Again, the parameters k_n, a_0, and a_1 are chosen to guarantee the consistency of the estimator. For further details, please refer to Appendix I.

Applications to the Estimation of the MMSE
As discussed in the introduction, the Fisher information is often merely a proxy for the actual quantity of interest. One accuracy measure that is typically of interest is the MMSE, which is defined as

mmse(X, snr) = E[(X − E[X | Y])²].

In additive Gaussian noise, the MMSE can not only be bounded by the Fisher information, but the two are related via Brown's identity:

I(f_Y) = 1 − snr · mmse(X, snr).

The results for the estimators of the Fisher information in Theorem 4 and Theorem 5 can be immediately extended to MMSE estimators as follows.

Proposition 1. Let K be as in (27), and let w_0, w_1, and n be such that they satisfy the conditions in Theorem 4. It then holds that

where ε_n, c_1, and c_2 are given in Theorem 4.

Proposition 2. Let K be as in (27), and let w_0 and w_1 be such that they satisfy the conditions in Theorem 5. It then holds that

where ε_n, c_1, and c_2 are given in Theorem 5.
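Given Brown's identity in the form above, any estimate of I(f_Y) converts directly into an MMSE estimate; the following sketch (with an illustrative input value) makes this explicit.

```python
# MMSE estimate obtained by inverting Brown's identity I(f_Y) = 1 - snr * mmse:
# any estimate of I(f_Y) (Bhattacharya's or the clipped one) can be plugged in.
def mmse_from_fisher(fisher_estimate: float, snr: float) -> float:
    return (1.0 - fisher_estimate) / snr

# Example: for standard normal X and snr = 1, I(f_Y) = 1/(1 + snr) = 0.5,
# and the identity recovers the true MMSE 1/(1 + snr) = 0.5.
print(mmse_from_fisher(0.5, snr=1.0))
```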

Sample Complexity
Finally, we demonstrate the difference in the bounds on the convergence rates between Bhattacharya's estimator and its clipped version by comparing the sample complexity of the two estimators, that is, the required number of samples to guarantee a given accuracy with a given confidence. MATLAB implementations of both estimators, as well as the code used to generate the figures below, can be found in [22].
To this end, we consider the simple example of estimating the density of a Gaussian random variable in additive Gaussian noise. More precisely, we assume that X and Z in (26) are independent and identically distributed according to the standard normal distribution N(0, 1), and that snr = 1. This trivially implies that X is α-sub-Gaussian with α = 1. In order to make the comparison as fair as possible, the parameters of the kernel estimators, a_0, a_1, and k_n, are not chosen according to Theorem 4 or Theorem 5, but are calculated by numerically minimizing the required number of samples; see [22] for details.
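For a quick numerical illustration of this setup (with ad hoc parameter choices rather than the numerically optimized ones from [22]): here Y ~ N(0, 2), so I(f_Y) = 1/(1 + snr) = 0.5 and, by Brown's identity, mmse = 0.5, which a plug-in estimate should roughly reproduce.

```python
# Sanity check for the Gaussian example: X, Z ~ N(0,1), snr = 1, so Y ~ N(0,2),
# I(f_Y) = 0.5 and mmse = 0.5. Parameter values below are illustrative only.
import numpy as np

rng = np.random.default_rng(3)
n, snr = 2000, 1.0
y = np.sqrt(snr) * rng.normal(size=n) + rng.normal(size=n)

a, k_n = 0.35, 4.0                               # illustrative bandwidth/truncation
t = np.linspace(-k_n, k_n, 801)
u = (t[:, None] - y[None, :]) / a
f = np.exp(-u**2 / 2).sum(axis=1) / (n * a * np.sqrt(2 * np.pi))
df = (-u * np.exp(-u**2 / 2)).sum(axis=1) / (n * a**2 * np.sqrt(2 * np.pi))

I_hat = np.sum(df**2 / f) * (t[1] - t[0])        # plug-in estimate of I(f_Y)
print(I_hat, (1.0 - I_hat) / snr)                # both should be near 0.5
```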
Let P_err = P[|I_n − I(f_Y)| ≥ ε_n]. The left-hand plot in Figure 1 shows the corresponding bounds on the sample complexities of the two estimators with P_err = 0.2 and ε_n varying from 0.1 to 0.9. Results for larger ε_n are not shown because I(f_Y) ≤ 1, as shown in Lemma 2. Moreover, the right-hand plot in Figure 1 shows the sample complexities for ε_n = 0.5 with P_err varying from 0.1 to 0.9. By inspection, the clipped estimator reduces the sample complexity by several orders of magnitude; note that the y-axes are logarithmic. As discussed before, this does not imply that the clipped estimator is more accurate in general. However, it does imply that the clipped estimator provides significantly better worst-case performance, i.e., it requires significantly fewer samples to guarantee a certain precision or confidence. Finally, note that this improvement comes at a low cost in terms of complexity and regularity assumptions. The complexity of both algorithms is almost identical, with the clipped estimator only requiring an additional evaluation of ρ̄. The regularity conditions are identical for bounded density functions, and slightly stronger for the clipped estimator for unbounded density functions.

Conclusion
This work focused on the estimation of the Fisher information for the location of a univariate random variable using plug-in estimators based on estimators of the PDF and its derivative. Two estimators of the Fisher information were considered. The first estimator is the estimator due to Bhattacharya, for which new, sharper convergence results were shown. The paper also proposed a second estimator, termed a clipped estimator, which provides better bounds on the convergence rates. The accuracy bounds on both estimators were specialized to the practically relevant case of a random variable contaminated by additive Gaussian noise. Moreover, using special properties of the Gaussian noise case, two estimators for the MMSE were proposed, and their convergence rates were analyzed. This was done by using Brown's identity, which connects the Fisher information and the MMSE. Finally, using a numerical example, it was demonstrated that the proposed clipped estimator can achieve a significantly lower sample complexity at little or no additional cost.


Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. A Proof of Lemma 1
Our starting point is the following bound due to [3] (p. 1188):

where F is the CDF of f, F_n is the empirical CDF, and v_r is defined in (5). Now, let δ_{r,a_r} be as in (6), and consider the following sequence of bounds:

where (A2) follows by using the triangle inequality; (A3) follows by using the bound in (A1); and (A4) follows by using the sharp DKW inequality [20]:

P( sup_{t∈R} |F_n(t) − F(t)| > ε ) ≤ 2 e^{−2nε²}.

This concludes the proof.
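As an aside, the sharp DKW bound with Massart's constant can be checked by simulation; the sketch below (an illustration, not part of the proof) uses uniform samples since the supremum statistic is distribution-free.

```python
# Empirical illustration of the sharp DKW inequality: compare the observed
# frequency of {sup_t |F_n(t) - F(t)| > eps} with the bound 2*exp(-2*n*eps^2).
import numpy as np

rng = np.random.default_rng(4)
n, eps, trials = 200, 0.1, 5000

exceed = 0
for _ in range(trials):
    x = np.sort(rng.uniform(size=n))
    upper = np.arange(1, n + 1) / n - x       # F_n - F just after each jump
    lower = x - np.arange(0, n) / n           # F - F_n just before each jump
    if max(upper.max(), lower.max()) > eps:
        exceed += 1

print(exceed / trials, 2 * np.exp(-2 * n * eps**2))
```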

Appendix B. A Proof of Theorem 1
First, using the triangle inequality, we have that

Next, we bound the first term in (A6):

where the last bound follows from the assumptions in (9). Now, consider the first term in (A11):

where the bound in (A14) follows from the assumptions in (9) and the properties of φ, which imply

εφ(k_n) < 1 ⇒ 0 < f(t) for all |t| ≤ k_n;   (A17)

and the bound in (A16) follows from the definition of φ in (8). Now, consider the second term in (A11):

where (A20) follows by using similar steps, leading to the bound in (A14), and (A21) follows from the definition of φ.
Finally, the function φ is obtained by observing that

where we used Jensen's inequality and the fact that (a + b)² ≤ 2(a² + b²). This concludes the proof.

Appendix G. A Proof of Lemma 3
Choose some v > 0. Then,

where (A99) follows from Hölder's inequality, and (A102) follows by using the identity