
Estimation of an Entropy-based Functional

7576 Dale Ave., Saint Louis, Missouri 63117, USA
Entropy 2010, 12(3), 338-374; https://doi.org/10.3390/e12030338
Submission received: 30 December 2009 / Revised: 8 February 2010 / Accepted: 24 February 2010 / Published: 3 March 2010

Abstract:
Given a function f from $[0,1]$ to the real line, we consider the (nonlinear) functional h obtained by evaluating the continuous entropy of the “density function” of f. Motivated by an application in signal processing, we wish to estimate $h(f)$. Our main tool is a decomposition of h into two terms, each of which has favorable scaling properties. We show that, if functions f and g satisfy a regularity condition, then the smallness of $\|f-g\|_\infty$ and $\|f'-g'\|_\infty$, along with some basic control on the derivatives of f and g, is sufficient to imply that $h(f)$ and $h(g)$ are close.

1. Introduction

We define the continuous entropy of a probability density ρ to be
$h(\rho) = -\int_{u \in \mathbb{R}} \rho(u)\,\log\rho(u)\,du$
(where the base of the logarithm is $e = 2.71828\ldots$). Stripped of its interpretation, continuous entropy is just a particular way of measuring how “spread out” a probability density is. In this paper, we are motivated by a novel application of this measurement in signal processing.
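For instance (a standard illustration, not specific to this paper), the uniform density $a^{-1}\chi_{[0,a]}$ on $[0,a]$ has
$h\big(a^{-1}\chi_{[0,a]}\big) = -\int_0^a a^{-1}\log\big(a^{-1}\big)\,du = \log a,$
which grows as the density spreads out, and which is negative when $a < 1$ (continuous entropy, unlike discrete entropy, can be negative).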
An ultrasound probe generates a short acoustic pulse which travels through the medium of interest (in medical applications, this medium is the tissue of the patient). As the pulse travels through the medium, features within the medium cause some of the pulse to be reflected back towards the probe, and the strength of the reflection contains information about the feature at that location. The signal of interest is the intensity of this reflected pulse that arrives back at the probe over time. This signal can be divided into many short windows of time; after re-scaling the time axis, one of these time windows can be represented by the interval [ 0 , 1 ] , and the signal over that time window can be represented by a real-valued function f on the interval [ 0 , 1 ] . So, f ( t ) represents the intensity of the reflected pulse that arrives back at the probe at the time t [ 0 , 1 ] .
The ultrasound probe can only measure f(t) at a discrete set of values of t, and this measurement can be corrupted in various ways. Therefore, one step in the processing of an ultrasound signal is to reconstruct f from the measurements of the ultrasound probe. Then, a functional is applied to f to obtain a single number, the idea being that this number contains all the relevant information about the reflected signal over that particular window in time. A standard functional in the industry is the “energy” in the signal, $\int_0^1 |f|^2$, or (more often) its logarithm. However, there is a series of papers by Hughes and others ([1,2,3,4,5,6,7]) demonstrating the utility of using the continuous entropy of the “density function” of f as the functional, instead of the log-energy.
Suppose that f is a real-valued function on $[0,1]$. By “density function” of f, we mean the following: Define the measure $\mu_f$ on $\mathbb{R}$ by $\mu_f(E) = |f^{-1}(E)|$, where $|A|$ denotes the Lebesgue measure of the set $A \subseteq [0,1]$. Suppose that $\mu_f$ is given by the density $\rho_f$, i.e.,
$\mu_f(E) = \int_E \rho_f$
This density ρ f is the “density function” of f. We can think of [ 0 , 1 ] as a probability space and f as a function on that probability space (i.e., a random variable) and then ρ f is literally the density of the “random variable” f.
Another way to think of ρ f is as follows: Suppose that we consider the uniform distribution U ( 0 , 1 ) . Suppose we choose a point t at random from [ 0 , 1 ] according to U ( 0 , 1 ) , and we calculate f ( t ) . The density ρ f gives the relative frequency of different outcomes of this experiment. Also, intuitively we can say that
$\rho_f(u) = \sum_{\{t : f(t) = u\}} \frac{1}{|f'(t)|}$
(although we will not attempt to formalize this definition for general f). Let us abuse notation and write
h ( f ) = h ( ρ f )
The functional of interest is $f \mapsto h(f)$.
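To make the objects $\rho_f$ and $h(f)$ concrete, the following is a minimal numerical sketch (ours, purely for illustration; as discussed below, naive histogram estimates of $\rho_f$ are problematic near its asymptotes). It approximates $\rho_f$ by a histogram of values $f(t)$ over a fine grid of t, and compares the resulting naive entropy estimate with the log-energy for the test function $f(t) = \sin(2\pi t)$; the grid size and bin count are arbitrary illustrative choices.

```python
import numpy as np

f = lambda t: np.sin(2 * np.pi * t)              # illustrative test signal

t = np.linspace(0.0, 1.0, 1_000_001)             # stand-in for the uniform measure on [0,1]
values = f(t)

log_energy = np.log(np.trapz(values ** 2, t))    # log of int_0^1 |f|^2 (exact value: log(1/2))

# naive histogram approximation of rho_f and of h(f) = -int rho_f log rho_f
counts, edges = np.histogram(values, bins=200, density=True)
widths = np.diff(edges)
mask = counts > 0
h_naive = -np.sum(counts[mask] * np.log(counts[mask]) * widths[mask])

print("log-energy            :", log_energy)
print("naive estimate of h(f):", h_naive)
print("exact h(f) for this f :", np.log(np.pi / 2))
```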
The effectiveness of using the h functional has been shown in several different settings in ultrasound signal processing, ranging from defect detection in plexiglass [1] to detection of contrast agents in the tissue of live animals ([5,7]). In some situations, the h functional produced images of objects which, because the material was engineered by the experimenter, were known to be present but were not detected by the log-energy functional.
The interpretation of what is being measured by h ( f ) in this context is not yet clear. However, given the utility of the technique, we believe the area merits investigation. We would like to answer the question
How well can we estimate h ( f ) ?
In a real-world setting we will perhaps receive some samples of f corrupted by noise, and we would like to estimate h ( f ) (as we described, the processing of an ultrasound signal is an example of this). Therefore, we seek a solution to the following problem, where the “regularity conditions” and the form of the “partial information” is not specified:
Estimation Problem 1. 
Let f be a function on $[0,1]$ that satisfies some regularity conditions, and suppose we are given some “partial information” on f, such as samples corrupted by noise. We wish to estimate $h(f)$ with an estimator $\hat{h}$, and give a quantitative bound on the error $|h(f) - \hat{h}|$, such that the bound is invariant under the scaling $f \mapsto \lambda f$ for $\lambda \neq 0$.
We seek this scale invariance because, if $\hat{h}$ is any reasonable estimator of $h(f)$, the true difference $h(f) - \hat{h}$ does not change under the scaling $f \mapsto \lambda f$ for $\lambda \neq 0$, so we do not want our bound on that difference to change under this scaling either. (This scaling issue is discussed more in Section 5.)
Our main result will not give a specific prescription for solving this problem. Instead, we reduce it to a much more tractable problem, namely the problem of taking noisy samples from f and producing an approximating function g which is “close” to f in some quantifiable way.
However, our concern at the moment is to relate problem 1 to the existing literature, and we notice that problem 1 is similar to the standard entropy estimation problem (for an overview of this problem, see Beirlant et al. [8]):
Estimation Problem 2 (Entropy Estimation).
Let X be a random variable with unknown density function ρ. We make N independent observations of X. From these, we wish to estimate the continuous entropy of ρ.
There is a large literature concerning the entropy estimation problem, so one might hope to solve problem 1 by applying some of the standard methods in the field to the density ρ f . However, this is not what we will do. Since we are departing from traditional methods, we would like to provide some indication that new methods may in fact be necessary; before working to develop new methods to solve a problem, one would like to know whether the problem is obviously solved with readily available standard methods. In section 2., we review the current literature with regard to how it applies to densities such as ρ f . This review is not comprehensive; the purpose is only to show that the density ρ f presents some difficulty for current methods of estimation of h ( ρ ) .
Some of the difficulties that we find, stated in general terms, are the following. First, the methods we have seen often involve intermediate parameters that truncate the tails of integrals, truncate the unbounded pieces of ρ, or perform other types of operations that enable a splitting into “good” and “bad” regions. In view of the scale-invariance that we seek, these intermediate parameters will all become involved and their rates of growth will need to be related in a way that produces the desired result and remains scale-invariant. This task, for a density such as ρ f , seems to be non-trivial. Second, many of the results which might apply to ρ f use methods which are non-quantitative: theorems from measure theory such as Lebesgue Differentiation, the Lebesgue Dominated Convergence Theorem, Egoroff’s Theorem, and the Borel-Cantelli Lemma. Replacing these theorems with quantitative estimates will introduce more difficulties. Finally, and perhaps most importantly from an intuitive point of view, there seems to be a general progression from “easy” to “hard” as the assumptions on ρ become less stringent, moving from differentiable to continuous to bounded to L 2 , and so on. The density ρ f seems to fall on the “hard” end of this continuum. This does not correspond to the progression of difficulty that a practitioner of signal processing would expect to see when they attempt to solve problem 1. Even an extremely well behaved function (such as f ( t ) = sin ( 2 π t ) ) will produce a density ρ f that will be ranked as “badly behaved.” A signal processor would expect that the problem would be “easy” for well behaved functions f, and get harder as the function f becomes more badly behaved.
In short, the readily available methods that we are aware of do not seem appropriate for the problem we are presented with. Fortunately, we believe that estimating $h(f)$ is easier than solving the entropy estimation problem for $\rho_f$, for the simple reason that we are able to take advantage of the “good behavior” of f in the time domain, instead of only having access to the “bad behavior” of the density $\rho_f$. For example, suppose that in a real-world setting we receive sampled values of f at specific values of t, say $\{t_j\}$ (not to be confused with the values of t where f actually has critical points, which later will be referred to as $\{t_i\}$). Neglecting noise (purely for the simplicity of this explanation), we receive the data $\{(t_j, f(t_j))\}$. We intend to use all the information available to us, i.e., the time series $\{(t_j, f(t_j))\}$, not just the (unordered) values $\{f(t_j)\}$. For instance, we may want to estimate $f'$ or $f''$, which will certainly make use of the indicated time series. We will not attempt to apply a standard entropy estimation technique directly to the values $\{f(t_j)\}$.
One strategy would be to first construct an approximating function g (using the observations { ( t j , f ( t j ) ) } ), and then calculate h ( g ) and use this as our estimate of h ( f ) . In Hughes et al. [7], this method is applied to real-time imaging calculations in a laboratory setting; instead of using just the values { f ( t j ) } directly, they first construct an approximating function g and then calculate the desired entropy value using g. In [7], they are estimating a variant of the Rényi entropy of ρ f , not h ( f ) as we have defined it, but the same authors have demonstrated the utility of h ( f ) . Incidentally, the quantity that is eventually computed in [7] is
$\sum_i \frac{1}{|g''(s_i)|}$
where the { s i } are the critical points of g, and the authors comment [7] that “while this involves use of the second derivative of f ( t ) at its critical points, which can be expected to increase noise in the processing chain output, surprisingly the resulting signal processing scheme does not sacrifice sensitivity.” This gives some assurance that our result, which requires control on multiple derivatives of f, is not without practical application.
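As an illustration of how such a quantity might be computed in practice (our sketch, not the processing chain of [7]): fit a smooth approximant g to noisy samples of f, locate the critical points of g, and sum $1/|g''|$ there. The test signal, noise level, and smoothing parameter below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
t_samples = np.linspace(0.0, 1.0, 201)
f = lambda t: np.sin(2 * np.pi * t) + 0.3 * np.sin(6 * np.pi * t)
y_samples = f(t_samples) + 0.01 * rng.standard_normal(t_samples.size)

# smoothing spline approximant g; s is chosen near (number of samples) * noise variance
g = UnivariateSpline(t_samples, y_samples, k=4, s=t_samples.size * 1e-4)
crit = g.derivative(1).roots()        # critical points s_i of g (derivative of a k=4 spline is cubic)
crit = crit[(crit > 0.0) & (crit < 1.0)]
second = g.derivative(2)(crit)        # g''(s_i)

print("critical points of g :", crit)
print("sum of 1/|g''(s_i)|  :", np.sum(1.0 / np.abs(second)))
```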
Our goal in this paper is to estimate the difference h ( f ) h ( g ) in terms of quantities involving f and g (not ρ f or ρ g ). As we mentioned, this reduces the problem at hand, problem 1, to a much more tractable question of function approximation given noisy data. We note, for completeness, that a specific prescription for solving problem 1, using methods similar to our methods here, was given in Maurizi [9], but we believe that the approach we present now will be clearer than the approach given in [9].
As mentioned, in Section 2 we review the current literature with regard to how it applies to densities such as $\rho_f$. After some definitions and notation in Section 3, we will highlight a useful identity in Section 4 which may be of independent interest, and we discuss some issues regarding the scaling $f \mapsto \lambda f$ for $\lambda \neq 0$ in Section 5. In Section 6 we present a proposition showing why the regularity assumption (Definition 6) we introduce is necessary at least in some form, if not the specific form given here. Our regularity assumption on f and g, defined precisely in Section 7, essentially prevents f and g from becoming “too flat” by preventing their first and second derivatives from simultaneously vanishing; this is quantified with a parameter δ.
Section 7 contains our main result, Theorem 1, which states that if f and g satisfy the regularity assumption with parameter δ, then the difference $h(f) - h(g)$ can be bounded by the metrics $\|f-g\|_\infty$, $\|f'-g'\|_\infty$, and the value δ, along with factors that control the overall sizes of f and g and their derivatives.
The proof of the main result is outlined in Section 8 and carried out in the subsequent sections.

2. Background

We now look to the current literature, seeking a method to estimate $h(f)$ with quantitative bounds on the error, in a manner that is invariant under the scaling $f \mapsto \lambda f$ for $\lambda \neq 0$.
The first sign that new methods may be necessary is that, from the viewpoint of entropy estimation, the densities $\rho_f$ are very badly behaved. A non-degenerate critical point in f will induce an asymptote of the form $1/\sqrt{x}$ in $\rho_f$, so we are in general dealing with density functions that are unbounded, not in $L^2$, and have discontinuities in the interior of their support. These properties, and other features of the density $\rho_f$, are discussed in [6,7].
Many of the standard results in entropy estimation do not apply to such badly behaved densities. Several theorems in entropy estimation (such as Goria et al. [10], van Es [11], Joe [12], Levit [13], and more recently Leonenko et al. [14]) use the assumption that ρ is bounded. In Tsybakov and van der Meulen [15] and Eggermont and LaRiccia [16], the assumption that ρ is twice differentiable was needed. In Dmitriev and Tarasenko [17] and Ahmad and Lin [18], ρ was assumed to have a bounded derivative. In Hall and Morton [19] ρ was assumed to have a continuous derivative. In Mokkadem [20], a “distributional” equivalent of ρ L 1 was required.
There are methods of solving problem 2 that could shed light on densities such as ρ f . In Kozachenko and Leonenko [21], Vasicek [22], and Györfi and van der Meulen [23], methods are developed which solve the entropy estimation problem 2 defined above and impose only mild restrictions on ρ.
In Kozachenko and Leonenko [21], they consider densities on $\mathbb{R}^m$; for simplicity we will consider the case $m = 1$. Note that they use different notation; we will continue to use ρ to stand for the density function, whereas they use f. A nearest-neighbor estimator is used, and only fairly mild conditions are imposed on the density ρ: For some $\epsilon > 0$, the following two conditions must hold:
$\int \big|\log\rho(x)\big|^{1+\epsilon}\,\rho(x)\,dx < \infty$
$\int\!\!\int \big|\log|x-y|\big|^{1+\epsilon}\,\rho(x)\rho(y)\,dx\,dy < \infty$
(these are equations numbered (3) and (4) in [21]). Densities such as our $\rho_f$ will “typically” satisfy these constraints (for instance, the function $f(t) = t^2$ on $[0,1]$ has density $\rho_f(u) = (1/2)u^{-1/2}$, $u \in (0,1]$, and this density satisfies these constraints). The authors then prove that the estimator $h_N$ computed from N independent samples of ρ satisfies $E(h_N) \to h(\rho)$ as $N \to \infty$. We would like to discuss some features of their proof, so for convenience of our reader we attempt to sketch it here: They first show that
$h_N = (1/N)\sum_{i=1}^{N} \zeta_i$
Due to their dependence on N, the $\zeta_i$ could also be written $\zeta_{i,N}$. The $\zeta_i$ are identically distributed random variables, so $E(h_N) = E(\zeta_i)$ and we focus just on $\zeta_i$. They consider the cumulative distribution function $F_{N,x}(u)$ of $e^{\zeta_i}$ conditioned on the case when the i-th sample equals x, i.e., $X_i = x$. Let $\nu(y, r) = \{x \in \mathbb{R} : |x - y| < r\}$; one computes that
$F_{N,x}(u) = 1 - \left(1 - \int_{\nu\left(x,\, \frac{u}{2\gamma(N-1)}\right)} \rho(y)\,dy\right)^{N-1}$
where $\log\gamma = 0.5772\ldots$ is Euler's constant (this is their equation (8)). We have
$\left|\nu\!\left(x, \frac{u}{2\gamma(N-1)}\right)\right| = \frac{u}{\gamma(N-1)}$
so we can write
$\int_{\nu\left(x,\, \frac{u}{2\gamma(N-1)}\right)} \rho(y)\,dy = \frac{u}{\gamma(N-1)} \cdot \frac{1}{\left|\nu\!\left(x, \frac{u}{2\gamma(N-1)}\right)\right|} \int_{\nu\left(x,\, \frac{u}{2\gamma(N-1)}\right)} \rho(y)\,dy$
This means that, by the Lebesgue Differentiation Theorem, since $\nu\!\left(x, \frac{u}{2\gamma(N-1)}\right)$ shrinks to $\{x\}$ as $N \to \infty$, we have
$\frac{1}{\left|\nu\!\left(x, \frac{u}{2\gamma(N-1)}\right)\right|} \int_{\nu\left(x,\, \frac{u}{2\gamma(N-1)}\right)} \rho(y)\,dy \;\to\; \rho(x)$
and so $F_{N,x}(u) \to 1 - e^{-\rho(x) u / \gamma}$ (this is their equation (8)). They define
$F_x(u) = 1 - e^{-\rho(x) u / \gamma}$
Let the random variable ξ N , x have the cumulative distribution function F N , x and the random variable ξ x have the cumulative distribution function F x . We can compute
$E(\log \xi_x) = -\log\rho(x)$
and so one might hope that, since $F_{N,x} \to F_x$, we have $E(\log \xi_{N,x}) \to E(\log \xi_x) = -\log\rho(x)$, because that would mean that $E(\zeta_i(N) \,|\, X_i = x) \to -\log\rho(x)$ (this is their equation (9)). The bulk of their proof is to show that this is in fact true. The proof is then completed by taking the pointwise result
$E(\zeta_i(N) \,|\, X_i = x) \;\to\; -\log\rho(x)$
and extending it to a convergence of integrals,
$E\,h_N = E(\zeta_i(N)) = \int E(\zeta_i(N) \,|\, X_i = x)\,\rho(x)\,dx \;\to\; \int \big(-\log\rho(x)\big)\rho(x)\,dx = h(\rho)$
This is done in their equation (21).
There are several rates of convergence that would have to be quantified: First, a quantitative bound on
$\left|\frac{1}{\left|\nu\!\left(x, \frac{u}{2\gamma(N-1)}\right)\right|} \int_{\nu\left(x,\, \frac{u}{2\gamma(N-1)}\right)} \rho(y)\,dy \;-\; \rho(x)\right|$
would be needed. Next, we would need to quantify how the rate of convergence of $F_{N,x} \to F_x$ translates into the convergence $E(\log \xi_{N,x}) \to E(\log \xi_x)$. This rate would need to be somehow uniform in x; their constants $C_1$ (from their equation (20)) and therefore $C_2$ (at the bottom of [21, p. 99]) depend on x in an unspecified way. Some sort of uniformity is needed because the convergence $E(\log \xi_{N,x}) \to E(\log \xi_x)$ must be translated into the convergence
$\int E(\log \xi_{N,x})\,\rho(x)\,dx \;\to\; \int E(\log \xi_x)\,\rho(x)\,dx$
Quantifying these various rates of convergence appears to be nontrivial. We also note that subsequent research which has worked with estimators analogous to the one used in [21], such as [15], [10], and [14], has required more assumptions on the density in question.
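For the reader who wants to experiment, the following is a minimal sketch (ours, not code from [21]) of the one-dimensional nearest-neighbor estimator discussed above, written using the radius convention $u/(2\gamma(N-1))$; other presentations of the estimator differ by terms that vanish as $N \to \infty$. The test function $f(t) = \sin(2\pi t)$ and its entropy value $\log(\pi/2)$ are our illustrative choices.

```python
import numpy as np

def kl_entropy_1d(samples):
    """Nearest-neighbor entropy estimate for i.i.d. 1-D samples from an unknown density."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    gaps = np.diff(x)                              # spacings between consecutive order statistics
    # nearest-neighbor distance of each sample: min of its left and right spacing
    eps = np.minimum(np.r_[gaps[0], gaps], np.r_[gaps, gaps[-1]])
    eps = np.maximum(eps, 1e-12)                   # guard against tied samples
    return np.mean(np.log(2.0 * (n - 1) * eps)) + np.euler_gamma

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=200_000)
samples = np.sin(2 * np.pi * t)                    # i.i.d. samples from rho_f
print("estimate               :", kl_entropy_1d(samples))
print("exact h(f) = log(pi/2) :", np.log(np.pi / 2))
```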
In Györfi and van der Meulen [23], histogram-based density estimators ρ N (computed from the sampled values { X i } ) are used. The stated results make no assumptions on ρ other than the finiteness of h ( ρ ) . They do not give a quantitative rate of convergence, and non-quantitative methods in the proof (such as the Borel-Cantelli Lemma) suggest that a quantitative result would not necessarily follow trivially from their methods. Also, choosing the proper bin size and grid placement for a histogram is a continual problem. For unbounded densities such as ρ f , any bin size which can capture the very tight grouping of sample points near an asymptote of ρ f will then be too narrow and cause a “choppy” estimate in areas where ρ f is not particularly large. One way of addressing this general drawback of histograms has been to adjust the “coarseness” of the estimate depending on how tightly the sample points are grouped, such as nearest-neighbor or sample-spacing-type estimators. We could also accept inefficiency near the critical points, but since we seek quantitative bounds this is not desirable. In practice, we believe that estimating the entropy of a density such as ρ f with a histogram estimator will mean very large inefficiencies near the critical points.
In Vasicek [22], no assumptions are made on ρ other than $\int u^2 \rho(u)\,du < \infty$, which is satisfied by $\rho_f$. Let $\{X_{(i)}\}$ be the order statistics of the sample $\{X_i\}$, and let F be the cumulative distribution function of ρ. They use a sample-spacing estimator with a parameter m specifying the number of “neighbors” that will be considered when estimating the value of the density ρ near $X_{(i)}$. Their conclusion is that their estimator $h_{m,N}$ converges in probability to $h(\rho)$ as $N \to \infty$, as long as $m \to \infty$, $m/N \to 0$. One key question, of course, is how m is chosen. They state [22] that “an optimal choice of m for a given [N], however, depends on the (unknown) [ρ]. In general, the smoother the density [ρ], the larger is the optimal value of m.” Therefore, one would expect that we would need to choose relatively small values of m since $\rho_f$ is very far from smooth, while still having $m \to \infty$. A key step in the proof is that
$\frac{F(X_{(i+m)}) - F(X_{(i-m)})}{X_{(i+m)} - X_{(i-m)}}, \quad \text{which is equal to} \quad \frac{1}{X_{(i+m)} - X_{(i-m)}} \int_{X_{(i-m)}}^{X_{(i+m)}} \rho,$
will equal $\rho(x)$ for some $x \in (X_{(i-m)}, X_{(i+m)})$, whenever ρ is positive and continuous over the interval $(X_{(i-m)}, X_{(i+m)})$. This step is essentially using the Lebesgue Differentiation Theorem, so the accuracy of
$\frac{1}{X_{(i+m)} - X_{(i-m)}} \int_{X_{(i-m)}}^{X_{(i+m)}} \rho \;\approx\; \rho(x),$
simultaneously over the entire domain of ρ, will need to be quantified. The tradeoff between needing m to be small in order to capture the “fine-scale” behavior of ρ f near an asymptote, while needing m to be large in order to carry through the rest of the proof, appears non-trivial. A discussion of results which use a fixed m, versus m as in [22], is in Tsybakov and van der Meulen [15]. We note also that results using similar techniques, which have achieved conclusions stronger than the “convergence in probability” shown here (such as van Es [11]) have required more restrictions on ρ.
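For concreteness, here is a minimal sketch (ours, not code from [22]) of Vasicek's spacing estimator in its standard form, $h_{m,N} = (1/N)\sum_i \log\!\big(\tfrac{N}{2m}(X_{(i+m)} - X_{(i-m)})\big)$, with the usual convention $X_{(i)} = X_{(1)}$ for $i < 1$ and $X_{(i)} = X_{(N)}$ for $i > N$, applied to samples from $\rho_f$ for $f(t) = \sin(2\pi t)$; the values of m are arbitrary illustrative choices.

```python
import numpy as np

def vasicek_entropy(samples, m):
    """Vasicek's sample-spacing entropy estimator with window parameter m."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    idx = np.arange(n)
    upper = x[np.minimum(idx + m, n - 1)]      # X_(i+m), clamped to X_(N)
    lower = x[np.maximum(idx - m, 0)]          # X_(i-m), clamped to X_(1)
    spacings = np.maximum(upper - lower, 1e-12)
    return np.mean(np.log(n / (2.0 * m) * spacings))

rng = np.random.default_rng(1)
t = rng.uniform(size=100_000)
samples = np.sin(2 * np.pi * t)                # i.i.d. samples from rho_f
for m in (2, 10, 50):
    print("m =", m, " estimate =", vasicek_entropy(samples, m))
print("exact h(f) = log(pi/2) =", np.log(np.pi / 2))
```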
There are other results, such as those in Godavarti and Hero [24], Csiszár [25], and Rényi [26], which do not specifically solve problem 2, but which give insight into how one might produce a value that is “close” to h ( ρ ) given only some information about ρ. These results are not “stochastic”; the approximations for h ( ρ ) that will be considered are not constructed by sampling from ρ.
In Csiszár [25], the only assumption is that the entropy integral exists; there are no other assumptions on the density ρ. Their results are explained in terms of a general measurable space X, and in their presentation the density ρ is not the central object; it simply arises as the Radon-Nikodym derivative of a probability measure on X (in their notation this probability measure is μ), with respect to a σ-finite measure on X (in their notation this σ-finite measure is λ). We will assume that the measurable space is $\mathbb{R}$, the σ-finite measure is Lebesgue measure on $\mathbb{R}$, and the probability measure is given by the density ρ. In their Theorem 1, they show that $h(\rho)$ equals the infimum value of a set of “approximate entropies,” where the density ρ is replaced with its average value on each set in a partition $\{A_i\}$ of $\mathbb{R}$, and then the entropy of the resulting “approximate density” is calculated. If we define the characteristic function of a set S, $\chi_S$, by
$\chi_S(u) = \begin{cases} 1 & \text{if } u \in S \\ 0 & \text{if } u \notin S \end{cases}$
then the “approximate density” is:
$\rho_{\{A_i\}}(u) = \sum_i \left( \frac{1}{|A_i|} \int_{A_i} \rho \right) \chi_{A_i}(u)$
and the “approximate entropy” is:
$h\big(\rho_{\{A_i\}}\big) = -\int \rho_{\{A_i\}} \log \rho_{\{A_i\}} = -\sum_i \left(\int_{A_i} \rho\right) \log\!\left( \frac{1}{|A_i|} \int_{A_i} \rho \right)$
They prove this result by choosing a specific set of approximate entropy values $h_\epsilon$, where in fact $h_\epsilon$ is easily seen to be within ϵ of the true value $h(\rho)$. This is certainly quantitative in the sense that the approximating value, $h_\epsilon$, is known to be within a certain explicit amount of the true value. The question then becomes whether we can calculate the value $h_\epsilon$. The approximating values $h_\epsilon$ are obtained, essentially, by the definition of the Lebesgue integral of $\log\rho$ (with respect to the measure given by ρ) as a limit of integrals of simple functions (for this definition, see for example [27]). For a probability measure μ on $\mathbb{R}$ and a function $g \in L^1(\mu)$, if we choose ϵ and we let
$A_i = \{x \in \mathbb{R} : g(x) \in [\epsilon i, \epsilon(i+1))\}, \qquad i \in \mathbb{Z}$
then it is obvious that $\sum_i (\epsilon i)\,\mu(A_i)$ is within ϵ of $\int g\,d\mu$. In [25], $\log\rho$ plays the role of g, the measure μ is given by ρ, and we have
$h_\epsilon = -\sum_i (\epsilon i)\,\mu(A_i), \quad \text{which is within } \epsilon \text{ of } -\int \log\rho\,d\mu = -\int_{\mathbb{R}} \rho\log\rho$
In order to calculate the value of $\sum_i (\epsilon i)\,\mu(A_i)$, we need to calculate $\mu(A_i)$, which is
$P\big[\log\rho \in [\epsilon i, \epsilon(i+1))\big] = \int_{\{\log\rho \,\in\, [\epsilon i,\, \epsilon(i+1))\}} \rho$
However, we do not know ρ, so we do not know the exact locations on R when log ρ is going to fall in a certain range. Of course, as we mentioned, the purpose of the h ϵ construction is an existence proof, not necessarily as a calculational or estimation tool.
We turn now to Rényi [26]. Rényi considers density functions ρ, the only requirement on ρ being that the discrete entropy of the “integer compression” of ρ is finite, by which we mean (suppose X is a random variable having density ρ):
$-\sum_{j \in \mathbb{Z}} P\big[X \in [j, j+1]\big] \log P\big[X \in [j, j+1]\big] < \infty \qquad (3)$
This is not equivalent to the existence of the entropy integral, as he shows in [26]. This admits a wide class of densities, including ρ f for the functions f we will consider. The key tool will be to consider successively finer step-function approximations of ρ, the partition set being the grid with “step size” 1 / N . Let us define ρ [ 1 / N ] by
$\rho^{[1/N]} = \sum_{j \in \mathbb{Z}} \chi_{[j/N,\,(j+1)/N)} \cdot \frac{1}{1/N} \int_{j/N}^{(j+1)/N} \rho$
Note that $\rho^{[1/N]}$ is essentially the “true” histogram with the grid $\{j/N\}_{j=-\infty}^{\infty}$ (an example of the approximations $\rho_{\{A_i\}}$ considered in Csiszár [25]). Rényi proves that, if the “integer compression” of ρ has finite discrete entropy, then
$\lim_{N \to \infty} h\big(\rho^{[1/N]}\big) = -\int \rho \log \rho$
One sign that this result will not be optimal for a density such as ρ f is that Rényi first proves this convergence for bounded ρ. He introduces an intermediate parameter L (which specifies the point at which the tails of the sums and integrals will be cut off), proves the pointwise convergence result
$\rho^{[1/N]}(x) \log \rho^{[1/N]}(x) \;\to\; \rho(x)\log\rho(x) \quad \text{for a.e. } x$
(see his equation (35)) and then applies the “Theorem of Lebesgue” [26] (we believe he is referring to the Dominated Convergence Theorem) over the interval $[-L, L]$ to obtain
$\int_{-L}^{L} \rho^{[1/N]} \log \rho^{[1/N]} \;\to\; \int_{-L}^{L} \rho \log \rho \quad \text{as } N \to \infty$
The tails of the integral are bounded using our numbered equation (3), which completes the proof in the case of bounded ρ. For unbounded ρ he introduces another intermediate parameter A and considers the truncated function ρ A (which is bounded):
$\rho_A(u) = \begin{cases} \rho(u) & \text{if } \rho(u) \le A \\ 0 & \text{if } \rho(u) > A \end{cases}$
He applies the result for bounded ρ to $\rho_A$, and then he bounds the difference
$\int \rho^{[1/N]} \log \rho^{[1/N]} \;-\; \int (\rho_A)^{[1/N]} \log (\rho_A)^{[1/N]}$
using expressions involving A and L (but, importantly, not involving N, see his equations (48), (50) and (53) ). Therefore, his equations (56), (57), (58) and (59) show how the parameters L, A and N interact to produce the result. To make this result quantitative would entail balancing these various rates of convergence, which seems to be non-trivial. Finally, being essentially a histogram-type method (the bin sizes depend only on N), it would face the difficulties that come along with histograms that we covered in our discussion of Györfi and van der Meulen [23].
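To make the object $\rho^{[1/N]}$ concrete, here is a small sketch (ours, not from [26]) that computes $h(\rho^{[1/N]})$ exactly for the density $\rho_f(u) = (1/2)u^{-1/2}$ of $f(t) = t^2$ mentioned earlier, using the closed-form bin masses $p_j = \sqrt{(j+1)/N} - \sqrt{j/N}$; for this density the limiting value is $h(\rho_f) = \log 2 - 1$.

```python
import numpy as np

def renyi_grid_entropy(N):
    """h(rho^[1/N]) for rho_f(u) = 0.5 * u**(-0.5) on (0, 1], from exact bin masses."""
    j = np.arange(N)
    p = np.sqrt((j + 1) / N) - np.sqrt(j / N)   # p_j = integral of rho_f over [j/N, (j+1)/N)
    # rho^[1/N] equals N * p_j on the j-th bin, so
    # h(rho^[1/N]) = -sum_j (1/N) * (N p_j) * log(N p_j) = -sum_j p_j * log(N p_j)
    return -np.sum(p * np.log(N * p))

for N in (10, 100, 1000, 10000):
    print(N, renyi_grid_entropy(N))
print("limit h(rho_f) = log(2) - 1 =", np.log(2) - 1)
```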
We turn now to Godavarti and Hero [24]. In Theorem 4 of [24], the convergence $h(\rho_N) \to h(\rho)$ is proved under fairly general conditions. The only assumptions are that $\rho_N \to \rho$ pointwise and that there exist a constant L and some $\kappa > 1$ such that the following statements are true:
$\int \rho\, |\log \rho|^{\kappa} \le L \qquad (4)$
and for all N,
$\int \rho_N\, |\log \rho_N|^{\kappa} \le L \qquad (5)$
Densities such as ρ f typically satisfy equation (4). Furthermore, if we require f to satisfy the regularity assumption we introduce later (Definition 6), and if an approximation method is used to produce g N given N samples from f, it is reasonable to expect that ρ f and { ρ g N } would satisfy equations (4) and (5) for some L. The necessary pointwise convergence is assured by any reasonable method of producing g N , and then Theorem 4 of [24] in fact proves the result h ( g N ) h ( f ) (although we will not attempt to formalize the above argument). This is essentially what we prove in our Theorem 1, with the key difference that our result is quantitative. This certainly makes one wonder if the result in Theorem 4 of [24] could in fact be easily adapted to obtain a quantitative proof of our Theorem 1, thus not needing our methods. Currently, we believe the answer is no. Or, looked at another way, we believe that a quantitative version of their result for the problem we are examining would end up confronting the same issues we faced and might end up resembling our methods.
Since [24] does not present a quantitative result, we look to their proof to see if such a result can be found in their methods. Note that in [24], “N” has a meaning different from how we have been using it, and “i” is the variable which corresponds to our “N.” We will use the notation from [24] for the moment, for the convenience of the reader who might want to refer back to [24] during our short discussion below; we hope confusion can be avoided. The main ingredients of the proof in [24] are several variables ( N , K , i ) which must become “sufficiently large” and the application of theorems such as Egoroff’s Theorem and Lebesgue Dominated Convergence. One of the chief advantages of these powerful theorems from measure theory is that they can bootstrap pointwise convergence into stronger conclusions and bypass potentially forbidding quantitative relationships. Also, the “essence of the proof” (as they state in [24]) is to find a set A ϵ such that everything happening on A ϵ c (the “bad” set) is negligible, and the desired convergence does in fact take place on A ϵ . The set A ϵ , by construction, is a set on which ρ and all but finitely many of the ρ i are bounded above by N and bounded away from zero by 1 / N . There is a tradeoff between the size of N (larger N means a smaller “bad” set, but less control on the “good” set) and the other parameters K and i. Therefore, to obtain a quantitative result, the rates of growth of these parameters must be balanced, and the use of the measure-theoretic theorems mentioned above must be replaced with concrete estimates; this would appear to be non-trivial.
There is another issue of concern in [24], a scaling issue. As mentioned in problem 1, the scaling $f \mapsto \lambda f$ for $\lambda \neq 0$ should essentially have no effect on the problem. Note that this scaling induces the standard $L^1$ scaling
$\rho(u) \;\mapsto\; |\lambda|^{-1} \rho(u/\lambda)$
so a method which proves a bound analogous to ours should respect this scaling. The assumption by Godavarti and Hero (in equation (4) of the present article) involves a constant, L, and the specific value of L is used in the proof (for example, we need N, K large enough that (essentially) $L/|\log K|^{\kappa - 1}$ and $L/|\log N|^{\kappa - 1}$ are sufficiently small). The scaling $\rho(u) \mapsto |\lambda|^{-1}\rho(u/\lambda)$ transforms the integral in (4) into
$\int \rho\, \big|\log|\lambda|^{-1} + \log\rho\big|^{\kappa}$
which means in turn that a different value of L (and therefore different values of K and N) will be needed. Therefore, in order to prove a result similar to our Theorem 1, this scaling issue will need to be resolved as well.

3. Definitions and Notation

We will use the following notation:
Definition 1. 
For real numbers a, b, we define
$\langle a, b\rangle = [\min\{a,b\}, \max\{a,b\}] = \begin{cases} [a,b] & \text{if } a \le b \\ [b,a] & \text{if } a > b \end{cases}$
For subsets $S, T \subseteq \mathbb{R}$, we define
$S + T = \{s + t : s \in S,\, t \in T\}$
So, for example,
$\langle a, b\rangle + [-2, 3] = [\min\{a,b\} - 2,\; \max\{a,b\} + 3]$
As we defined in the Introduction,
Definition 2. 
For a measurable function $h : [0,1] \to \mathbb{R}$, we define the measure $\mu_h$ on $\mathbb{R}$ by
$\mu_h(E) = |h^{-1}(E)|$
We will have occasion to refer to the “monotone pieces” of a function, so we make the following definition. We do not want to consider functions which have critical points at either 0 or 1, for technical reasons and to avoid more cumbersome notation, so we define:
Definition 3. 
For a differentiable function h with a finite set of critical points $t_1 < \cdots < t_k$, and with $h'(0), h'(1) \neq 0$, we define $t_0 = 0$, $t_{k+1} = 1$, and for $j = 0, \ldots, k$ we define $h_j = h|_{[t_j, t_{j+1}]}$.
For any function h, we will abbreviate the domain and range of h:
Definition 4. 
We denote the domain of h by $\mathrm{Dom}(h)$, and the range of h by $\mathrm{Range}(h)$.
Recall the definition of the (discrete) Shannon entropy, H, of a finite probability mass distribution $p = \{p_i\}_{i=1}^{n}$:
$H(p) = -\sum_i p_i \log p_i$
Let h be a differentiable function with finitely many critical points and $h'(0), h'(1) \neq 0$. Suppose that $\mu_h$ is absolutely continuous. Then we know that the measure given by $\mu_{h_i}(E) = |h_i^{-1}(E)|$ is absolutely continuous; let its density be $\rho_{h_i}$. For a fixed u in the range of h, note that u is in the range of at least one, possibly several, of the $h_i$. For those i such that u is in the range of $h_i$ we see that $\rho_{h_i}(u)$ is non-zero, and we know that
$\sum_i \rho_{h_i}(u) = \rho_h(u)$
So, we define the probability mass distribution
$p_h(u) = \left\{ \frac{\rho_{h_i}(u)}{\rho_h(u)} \right\}_{i=0}^{k}$
Note that the number of non-zero entries of p h ( u ) equals the number of i such that u is in the range of h i .
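As a concrete example (ours, not from the text): let $h(t) = \sin(2\pi t)$, which has critical points $t_1 = 1/4$, $t_2 = 3/4$ and monotone laps $h_0, h_1, h_2$. Each value $u \in (0,1)$ is attained once on $h_0$ and once on $h_1$, with $|h'| = 2\pi\sqrt{1 - u^2}$ at both preimages, so
$p_h(u) = \big(\tfrac{1}{2}, \tfrac{1}{2}, 0\big) \;\text{ for } u \in (0,1), \qquad p_h(u) = \big(0, \tfrac{1}{2}, \tfrac{1}{2}\big) \;\text{ for } u \in (-1, 0),$
and $H(p_h(u)) = \log 2$ for a.e. u in the range of h.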

4. A Useful Identity

Assume that f is a differentiable function with a finite set of critical points $t_1 < \cdots < t_k$ and $f'(0), f'(1) \neq 0$. The following identity is the centerpiece of our method for proving our main result (a helpful comment [28] led us to this identity):
$h(f) = \int_{t=0}^{1} \log|f'(t)|\,dt \;-\; \int_{t=0}^{1} H\big(p_f(f(t))\big)\,dt \qquad (7)$
In the subsequent sections of this article, to simplify a discussion we will sometimes refer to the terms on the right hand side by ( L ) and ( E ) , corresponding to the “logarithm” term and the “entropy” term. To prove (7), note the following:
$\left[\sum_i \rho_{f_i} \log \rho_{f_i}\right] - \rho_f \log \rho_f = \sum_i \left[\rho_{f_i}\log\rho_{f_i} - \rho_{f_i}\log\rho_f\right] = \sum_i \rho_{f_i} \log(\rho_{f_i}/\rho_f) = \rho_f \sum_i (\rho_{f_i}/\rho_f)\log(\rho_{f_i}/\rho_f) = -\rho_f\, H\big(p_f(u)\big)$
Integrating both sides, we have
$\int \left[\sum_i \rho_{f_i} \log \rho_{f_i}\right] - \int \rho_f \log \rho_f = -\int \rho_f\, H\big(p_f(u)\big)$
or
$\sum_i \int_{u \in \mathrm{Range}(f_i)} \rho_{f_i}(u) \log \rho_{f_i}(u)\,du \;+\; h(f) = -\int \rho_f\, H\big(p_f(u)\big)$
The identity $\int_u \rho_{f_i}(u)\,g(u)\,du = \int_{t \in \mathrm{Dom}(f_i)} g(f_i(t))\,dt$, and the formula
$\rho_{f_i}(u) = \frac{1}{\big|f_i'\big(f_i^{-1}(u)\big)\big|}$
are used on the first term to obtain
$\sum_i \int_{u \in \mathrm{Range}(f_i)} \rho_{f_i}(u)\log\rho_{f_i}(u)\,du = \sum_i \int_{t \in \mathrm{Dom}(f_i)} \log \rho_{f_i}(f_i(t))\,dt = \int_{t=0}^{1} \log \frac{1}{|f'(t)|}\,dt$
The identity $\int_u \rho_f(u)\,g(u)\,du = \int_{t=0}^{1} g(f(t))\,dt$ is used on the last term to obtain
$\int_u \rho_f\, H\big(p_f(u)\big)\,du = \int_{t=0}^{1} H\big(p_f(f(t))\big)\,dt$
and so (7) is proved.
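As a sanity check of (7) (our computation): take $f(t) = \sin(2\pi t)$. Then
$\int_0^1 \log|f'(t)|\,dt = \log(2\pi) + \int_0^1 \log|\cos(2\pi t)|\,dt = \log(2\pi) - \log 2 = \log\pi,$
while $H\big(p_f(f(t))\big) = \log 2$ for a.e. t, since each value of f is attained on exactly two of its three monotone laps with equal weights. So (7) gives $h(f) = \log\pi - \log 2 = \log(\pi/2)$, which agrees with the continuous entropy of the arcsine density $\rho_f(u) = 1/(\pi\sqrt{1 - u^2})$ computed directly.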
This identity is equivalent to a standard identity for Shannon entropy. Suppose we have a probability space Ω, a finite set S, and a random variable $X : \Omega \to S$. Furthermore, suppose we have an “indicator” function
$\Delta : \Omega \to \mathbb{N}$
which takes a finite number of values on Ω. We think of Δ as specifying a partition of Ω, i.e.,
$\Omega = \Delta^{-1}(1) \cup \Delta^{-1}(2) \cup \cdots = \Omega_1 \cup \Omega_2 \cup \cdots$
If X maps the $\Omega_i$ to disjoint parts of S, i.e.,
$X(\Omega_i) \cap X(\Omega_j) = \emptyset \quad \text{for } i \neq j$
then we would have
$H(X) = -\sum_{s \in S} P[X=s]\log P[X=s] = -\sum_i \sum_{s \in X(\Omega_i)} P[X=s]\log P[X=s] = -\sum_i \sum_{s \in X(\Omega_i)} \big(P(\Omega_i)\,P[X=s\,|\,\Omega_i]\big)\log\big(P(\Omega_i)\,P[X=s\,|\,\Omega_i]\big) = -\sum_i P(\Omega_i) \sum_{s \in X(\Omega_i)} P[X=s\,|\,\Omega_i]\big(\log P(\Omega_i) + \log P[X=s\,|\,\Omega_i]\big) = -\sum_i P(\Omega_i)\log P(\Omega_i) \;-\; \sum_i P(\Omega_i)\sum_{s \in X(\Omega_i)} P[X=s\,|\,\Omega_i]\log P[X=s\,|\,\Omega_i] = H(\Delta) + H(X\,|\,\Delta)$
where H ( X | Δ ) is the conditional entropy of X given Δ (see [29]). In other words, H ( Δ ) + H ( X | Δ ) would be the Shannon entropy of X if the images of the Ω i under X did not overlap. If they did overlap, this would be an overestimation of the entropy of X, and in general the true value is given by subtracting a term, H ( Δ | X ) , which represents the amount of “overlap”:
$H(X) = H(\Delta) + H(X\,|\,\Delta) - H(\Delta\,|\,X) \qquad (8)$
This identity follows from the “chain rule” for entropy [29].
In our case, the probability space is $[0,1]$, the set is $\mathbb{R}$, the random variable is $f : [0,1] \to \mathbb{R}$, and the indicator function on the probability space is the function
$\Delta(t) = i \quad \text{if } t \in [t_i, t_{i+1}]$
which (essentially) partitions $[0,1]$: $[0,1] = [0, t_1] \cup [t_1, t_2] \cup \cdots \cup [t_k, 1]$. As above,
$H(\Delta) + h(f\,|\,\Delta) = \int_{t=0}^{1} \log|f'(t)|\,dt$
measures the entropy of f as if the different “monotone laps” of f did not overlap. If they do overlap then this is an overestimate, and the true value is given by subtracting the term representing the amount of “overlap”:
$H(\Delta\,|\,f) = \int_{t=0}^{1} H\big(p_f(f(t))\big)\,dt$
In this way, (7) is analogous to (8).

5. Scaling

For simplicity, we consider λ > 0 (the case λ < 0 is no different). For λ > 0, we have $\rho_{\lambda f}(u) = \lambda^{-1}\rho_f(u/\lambda)$, and so $h(\lambda f) = h(f) + \log\lambda$. This means
$h(\lambda f) - h(\lambda g) = h(f) - h(g)$
So, the difference $h(f) - h(g)$ is invariant under the scaling $f \mapsto \lambda f$. However, if we take $[a,b] \subseteq \mathbb{R}$, and we consider
$\int_a^b \big({-\rho_f\log\rho_f}\big) - \big({-\rho_g\log\rho_g}\big)$
then, scaling by the factor λ, we have
$\int_{\lambda a}^{\lambda b} \big({-\rho_{\lambda f}\log\rho_{\lambda f}}\big) - \big({-\rho_{\lambda g}\log\rho_{\lambda g}}\big) = \int_a^b \big({-\rho_f\log\rho_f}\big) - \big({-\rho_g\log\rho_g}\big) + (\log\lambda)\int_a^b \rho_f - \rho_g$
Since the second term on the right hand side of this equation is not necessarily zero, we see that an integral over just a part of the domain may not be scale-invariant.
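For completeness, the change of variables behind the claim $h(\lambda f) = h(f) + \log\lambda$ made at the start of this section (a one-line computation, ours):
$h(\lambda f) = -\int \lambda^{-1}\rho_f(u/\lambda)\,\log\!\big(\lambda^{-1}\rho_f(u/\lambda)\big)\,du = -\int \rho_f(v)\big(\log\rho_f(v) - \log\lambda\big)\,dv = h(f) + \log\lambda.$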
Turning our attention to the identity (7), we see that for λ > 0 we have:
$\int_{t=0}^{1} \log\big|(\lambda f)'(t)\big|\,dt = \int_{t=0}^{1} \log|f'(t)|\,dt + \log\lambda, \qquad \int_{t=0}^{1} H\big(p_{\lambda f}\big((\lambda f)(t)\big)\big)\,dt = \int_{t=0}^{1} H\big(p_f(f(t))\big)\,dt$
Perhaps more importantly, each term can be divided along its respective axis, and each individual piece scales similarly:
$\int_{t_a}^{t_b} \log\big|(\lambda f)'(t)\big|\,dt = \int_{t_a}^{t_b} \log|f'(t)|\,dt + (t_b - t_a)\log\lambda, \qquad \int_{t_a}^{t_b} H\big(p_{\lambda f}\big((\lambda f)(t)\big)\big)\,dt = \int_{t_a}^{t_b} H\big(p_f(f(t))\big)\,dt$
Our main strategy takes advantage of precisely this. Recall the definitions for (L) and (E) that followed (7); both (L) and (E) will be subdivided along the t axis.

6. A Counter-Example

We will denote the $i$th derivative of f (or g) by $f^{(i)}$ (or $g^{(i)}$). One might hope that a theorem along the following lines could be proved, C being an absolute constant:
$|h(f) - h(g)| \;\le\; C \sum_{i=0}^{K} \|f^{(i)} - g^{(i)}\|_\infty$
This would show that the $h$-functional is a continuous map in the norm topology of some Banach space of differentiable functions. One can see immediately that this result is not possible, just due to scaling: If we let $f_n = (1/n)f$, $g_n = (1/n)g$ then $|h(f_n) - h(g_n)| = |h(f) - h(g)|$ while the right hand side converges to zero as $n \to \infty$. However, we might hope that we could prove a result such as
$|h(f) - h(g)| \;\le\; C \sum_{i=0}^{K} \|f^{(i)} - g^{(i)}\|_\infty \Big/ \min_{0 \le i \le K}\big\{\|f^{(i)}\|_\infty, \|g^{(i)}\|_\infty\big\} \qquad (9)$
In other words, perhaps this scaling issue is the only problem and the $h$-functional is continuous when restricted to points f and g which are bounded away from zero in this Banach space. This is not the case. In fact, suppose that we ask even less of the result we seek. Suppose that we seek some $r > 0$, $c > 0$, and some function Γ of $2K+2$ variables which is bounded on compact subsets of the open positive orthant $(\mathbb{R}^+)^{2K+2}$, and we just want the following bound to hold: Whenever
$\sum_{i=0}^{K} \|f^{(i)} - g^{(i)}\|_\infty^{\,r}\; \Gamma\big(\|f^{(0)}\|_\infty, \ldots, \|f^{(K)}\|_\infty, \|g^{(0)}\|_\infty, \ldots, \|g^{(K)}\|_\infty\big) \;\le\; c$
is true, we have
$|h(f) - h(g)| \;\le\; \sum_{i=0}^{K} \|f^{(i)} - g^{(i)}\|_\infty^{\,r}\; \Gamma\big(\|f^{(0)}\|_\infty, \ldots, \|f^{(K)}\|_\infty, \|g^{(0)}\|_\infty, \ldots, \|g^{(K)}\|_\infty\big) \qquad (10)$
In other words, whenever the differences $\{\|f^{(i)} - g^{(i)}\|_\infty^{\,r}\}_{i=0}^{K}$, after being multiplied by the (large) value Γ, are sufficiently small, then we have a quantitative estimate on $h(f) - h(g)$.
This is still not possible. Note that (9) is a special case of (10). Note also that our main result, Theorem 1, is almost a special case of (10) with K = 3 , the only additional element in Theorem 1 being the regularity condition (Definition 6) and the involvement of the parameter from that condition, δ.
The impossibility of a result such as (10) can be seen just by looking at monotone functions, in which case the $h$ functional is identical to the functional $f \mapsto \int \log|f'|$. The reason is quite straightforward: The functional $f \mapsto \int \log|f'|$ will become large in magnitude if a function is extremely flat even for a short distance, whereas any $\|\cdot\|_\infty$-type norm will not necessarily measure a large difference between a function that is extremely flat and a function that is merely somewhat flat. We formalize this in the following proposition:
Proposition 5. 
Fix K and ϵ. There exists a constant $C_K$ depending only on K, and monotone functions f, g, such that
$\forall i \le K, \quad f^{(i)}, g^{(i)} \text{ are continuous}$
$\forall i \le K, \quad \|f^{(i)} - g^{(i)}\|_\infty \le \epsilon$
$\forall i \le K, \quad \|f^{(i)}\|_\infty,\ \|g^{(i)}\|_\infty \ge 1$
$\forall i \le K, \quad \|f^{(i)}\|_\infty,\ \|g^{(i)}\|_\infty \le C_K$
but
$\left| \int \log|f'| - \int \log|g'| \right| \;\ge\; 1$
Proof. 
Let $f_0(t) = t^N$ and $g_0(t) = t^M$; we are free to choose M and N independently, and the idea is to let $M \gg N \gg K$. Define
$f(t) = f_0(t) \text{ on } [0, 1/2], \qquad f(t) = 2^{K+1}(t - 1/2)^{K+1} + \sum_{i=0}^{K} \frac{f^{(i)}(1/2)}{i!}(t - 1/2)^i \text{ on } [1/2, 1]$
$g(t) = g_0(t) \text{ on } [0, 1/2], \qquad g(t) = 2^{K+1}(t - 1/2)^{K+1} + \sum_{i=0}^{K} \frac{g^{(i)}(1/2)}{i!}(t - 1/2)^i \text{ on } [1/2, 1]$
We see that f and g were chosen to have $f^{(i)}, g^{(i)}$ continuous for $i \le K$. If we choose N and M so that $K < N/2$, $N < M/2$, we see that $\forall i \le K$, $\forall \alpha \le 1/2$,
$g_0^{(i)}(\alpha) \le M^K \alpha^{M/2} \quad \text{and} \quad f_0^{(i)}(\alpha) \ge \alpha^N$
so once we choose M sufficiently large that $M^K \alpha^{M/2} \le \alpha^N$ for all $\alpha \le 1/2$, we have $g_0^{(i)}(\alpha) \le f_0^{(i)}(\alpha)$ $\forall i \le K$, $\alpha \le 1/2$, and therefore $g^{(i)}(t) \le f^{(i)}(t)$ $\forall i \le K$, $t \in [0,1]$.
Note that, on $[0, 1/2]$ we have $f^{(i)}(t) \le f^{(i)}(1/2) \le N^K (1/2)^{N/2}$. Choose N sufficiently large that
$N^K (1/2)^{N/2} \;\le\; \epsilon \big/ \big((K+1) K^K\big)$
This certainly means $|f^{(i)} - g^{(i)}| \le \epsilon$ on $[0, 1/2]$. On $[1/2, 1]$ we just need $\sum_{i=0}^{K} \frac{f^{(i)}(1/2)}{i!}(t - 1/2)^i$ and all of its derivatives to be smaller than ϵ, and then we will have $\|f^{(i)} - g^{(i)}\|_\infty \le \epsilon$ $\forall i \le K$. We see that
$\left| \left( \sum_{i=0}^{K} \frac{f^{(i)}(1/2)}{i!}(t - 1/2)^i \right)^{(j)}(t) \right| \;\le\; (K+1)\, \max_{i \le K} f^{(i)}(1/2)\, K^K \;\le\; \epsilon$
and so we have $\|f^{(i)} - g^{(i)}\|_\infty \le \epsilon$ $\forall i \le K$.
By the above calculation, we have the immediate bound
$f^{(i)}(t) \;\le\; f^{(i)}(1) \;\le\; 2^{K+1}(K+1)! + (K+1) K^K \max_{i \le K} f^{(i)}(1/2)$
and we see that the following is true, with a constant $C_K$ depending only on K:
$\max_{i \le K} f^{(i)}(1/2) \;\le\; N^K\, 2^{-(N-K)} \;\le\; C_K$
which means $\|f^{(i)}\|_\infty \le C_K$ (and the same is true for $\|g^{(i)}\|_\infty$ since $g^{(i)} \le f^{(i)}$).
Finally, observing that $f^{(i)}(1) \ge 2^{K+1}(1/2)^{K+1} = 1$, and noting that the same is true for g, we have $\forall i \le K$, $\|f^{(i)}\|_\infty, \|g^{(i)}\|_\infty \ge 1$.
So, f and g meet the criteria of the proposition. At this point, let us “fix” the value of N that we have arrived at. Since $f', g' > 0$ and $g' \le f'$, we have
$\int \log|f'| - \int \log|g'| = \int_0^1 \log f' - \log g' \;\ge\; \int_0^{1/2} \log f' - \log g' = (1/2)\big[\log N - \log M\big] + \big[(N-1) - (M-1)\big] \int_0^{1/2} \log t$
and as $M \to \infty$ this expression grows without bound, proving the proposition. ☐

7. Main Result

Suppose that we have functions $f, g : [0,1] \to \mathbb{R}$. We will assume that each function has three continuous derivatives; we do not attempt to determine the precise smoothness required. The quantities $\|f^{(3)}\|_\infty$, $\|g^{(3)}\|_\infty$ will be used frequently, and the reader might be worried, glancing at Theorem 1, that the estimate becomes “worse” as $\|f^{(3)}\|_\infty$, $\|g^{(3)}\|_\infty$ become small, certainly a puzzling property. In fact we only need upper bounds on $|f^{(3)}|$, $|g^{(3)}|$, not least upper bounds, so the reader may substitute any upper bounds on $|f^{(3)}|$, $|g^{(3)}|$ in place of $\|f^{(3)}\|_\infty$, $\|g^{(3)}\|_\infty$ as long as it is done consistently. Notice that this is still scale invariant (if A is an upper bound on $|f^{(3)}|$ then $\lambda A$ is an upper bound on $|(\lambda f)^{(3)}|$). We will use $\|f^{(3)}\|_\infty$, $\|g^{(3)}\|_\infty$ for simplicity.
The quantities which actually bring $h(f)$ and $h(g)$ close together are $\|f - g\|_\infty$ and $\|f' - g'\|_\infty$; the “smallness” of these quantities, relative to the size of the third derivatives of f and g, will be quantified by a parameter ϵ.
However, Proposition 5 means that the “smallness” of $\|f - g\|_\infty$ and $\|f' - g'\|_\infty$ alone cannot imply the result we are seeking. We will impose an additional regularity assumption.
Definition 6 
(The Set $A(\delta)$). Suppose that $h : [0,1] \to \mathbb{R}$. For $0 < \delta \le 1$, we say $h \in A(\delta)$ if h, $h'$, $h''$ and $h^{(3)}$ are continuous, and the following two conditions hold:
$|h'(t)| + |h''(t)| \;\ge\; \delta \|h^{(3)}\|_\infty \quad \forall t \in [0, 1] \qquad (11)$
$|h'(t)| \;\ge\; \delta \|h^{(3)}\|_\infty \quad \forall t \in [0, \delta] \cup [1 - \delta, 1] \qquad (12)$
We will require that f and g are in the set A ( δ ) . Intuitively, the requirement for a function to be in A ( δ ) is a quantitative statement that its first and second derivatives do not simultaneously vanish, i.e., it does not have any “flat spots”. The second half of the condition, equation (12), is not as natural; its purpose is simply to eliminate some uninteresting technical difficulties at the endpoints of [ 0 , 1 ] .
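For a concrete instance (our computation, not from the text): $h(t) = \sin(2\pi t)$ satisfies $|h'(t)| + |h''(t)| = 2\pi|\cos 2\pi t| + 4\pi^2|\sin 2\pi t| \ge 2\pi$ while $\|h^{(3)}\|_\infty = 8\pi^3$, so condition (11) holds for any $\delta \le 1/(4\pi^2)$; since $|h'(t)| = 2\pi|\cos 2\pi t|$ stays close to $2\pi$ near the endpoints, condition (12) is also easily checked, and one finds $h \in A(\delta)$ for, e.g., $\delta = 1/50$.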
Lastly, we will need bounds on the ratios
$\frac{\max\{\|f'\|_\infty, \|g'\|_\infty\}}{\min\{\|f^{(3)}\|_\infty, \|g^{(3)}\|_\infty\}} \quad \text{and} \quad \frac{\max\{\|f''\|_\infty, \|g''\|_\infty\}}{\min\{\|f^{(3)}\|_\infty, \|g^{(3)}\|_\infty\}}$
With these assumptions, we can bound the difference between h ( f ) and h ( g ) :
Theorem 1 (Main Result).
Suppose that $f, g : [0,1] \to \mathbb{R}$ are each members of $A(\delta)$, and for some $\epsilon \le 2^{-20}\delta^{8}$ we have
$\|f - g\|_\infty \;\le\; \epsilon\, \min\big\{\|f^{(3)}\|_\infty, \|g^{(3)}\|_\infty\big\} \qquad (13)$
$\|f' - g'\|_\infty \;\le\; \epsilon^{3/4}\, \min\big\{\|f^{(3)}\|_\infty, \|g^{(3)}\|_\infty\big\} \qquad (14)$
Furthermore, let $C_1, C_2$ be constants such that $C_1, C_2 \ge 3$ and
$C_1^{-1} \;\le\; \frac{\max\{\|f'\|_\infty, \|g'\|_\infty\}}{\min\{\|f^{(3)}\|_\infty, \|g^{(3)}\|_\infty\}} \;\le\; C_1, \qquad \frac{\max\{\|f''\|_\infty, \|g''\|_\infty\}}{\min\{\|f^{(3)}\|_\infty, \|g^{(3)}\|_\infty\}} \;\le\; C_2$
Then
$|h(f) - h(g)| \;\le\; C\, C_1 C_2\, \log\!\big(C_1\, \delta^{-1} \epsilon^{-1/4}\big)\, \epsilon^{1/4} / \delta^{4}$
where C is an absolute constant.
Our main result can be thought of as a converse to Proposition 5: If the functions f and g do not have any extreme “flat spots” then (essentially) the smallness of $\|f - g\|_\infty$ and $\|f' - g'\|_\infty$ is sufficient to imply that $h(f)$ and $h(g)$ are close.
Notice that the result is scale invariant: the assumptions and conclusion are not altered in the least if we multiply both f and g by the same non-zero constant.

8. Overview of Proof

The proof begins by using the “useful identity” from Section 4 to split the difference $h(f) - h(g)$ into two terms:
$h(f) - h(g) = \int_{t=0}^{1} \Big[\log|f'(t)| - \log|g'(t)|\Big] \;-\; \int_{t=0}^{1} \Big[H\big(p_f(f(t))\big) - H\big(p_g(g(t))\big)\Big]$
As in Section 4, we abbreviate the first term by (L) and the second term by (E). Next, we wish to split the domain of integration into a “good” set and a “bad” set. However, this is done differently for (L) and (E).
The regions of $[0,1]$ that are causing the problems are the places where the derivatives of f and g are very small (or zero). Let us denote this region by $B^*$. When we are dealing with just $\log|f'|$, $\log|g'|$, it is just the set $B^*$ that will constitute the “bad” set. Therefore, we write
$\int \Big[\log|f'(t)| - \log|g'(t)|\Big] = \int_{B^*} \Big[\log|f'(t)| - \log|g'(t)|\Big] + \int_{(B^*)^c} \Big[\log|f'(t)| - \log|g'(t)|\Big]$
or ( L ) = ( L B ) + ( L G ) .
When dealing with the expression $H\big(p_f(f(t))\big)$, it is of course the same set $B^*$ that causes the trouble, but the affected areas spread beyond just those values of t which are in $B^*$. If we have a value of t such that the function value $f(t)$ is in the image (by f) of some other point $t'$ with f having a small derivative at $t'$, then the point t will be a “problem” for $H\big(p_f(f(t))\big)$, since the finite probability mass distribution $p_f(f(t))$ involves all points $t'$ such that $f(t') = f(t)$. In other words, the behavior of $H\big(p_f(f(t))\big)$ is not determined by just a neighborhood of t. This makes sense if we recall that this term is trying to capture the “overlap” in function values that might happen between widely separated regions of the domain of f.
We can think of this in the following way: the “bad” set $B^*$ is going to “infect” a region of the u-axis (the set I from Definition 16) and then any points in $[0,1]$ which map to that “infected” area will also be “infected”. The set $G^*$ (defined precisely in Definition 16) is the portion of $[0,1]$ that is not “infected” at all, i.e., it is “good”. If $t \in G^*$, this means that there is no point $t' \in B^*$ for which $f(t') = f(t)$ or $g(t') = g(t)$ (the statement is slightly stronger than this, but that is the idea).
To summarize: We divide $[0,1]$ into a “bad” region $B^*$, a “good” region $G^*$, and a “neutral” region $(B^*)^c \cap (G^*)^c$. For (L), it is $B^*$ that is “bad” and everything else is “good,” while for (E) it is only $G^*$ that is “good” and everything else is “bad”.
Therefore, for purposes of $H\big(p_f(f(t))\big)$ the bad set will be $(G^*)^c$ and so we write
$\int \Big[H\big(p_f(f(t))\big) - H\big(p_g(g(t))\big)\Big] = \int_{(G^*)^c} \Big[H\big(p_f(f(t))\big) - H\big(p_g(g(t))\big)\Big] + \int_{G^*} \Big[H\big(p_f(f(t))\big) - H\big(p_g(g(t))\big)\Big]$
or ( E ) = ( E B ) + ( E G ) .
Note that each piece (LG), (LB), (EG), (EB) by itself is scale invariant. As discussed in Section 5, some care must be taken to split $h(f) - h(g)$ into scale invariant pieces.
In preparation for estimating the size of each of the four pieces, in Section 9 and Section 10 we develop some results pursuing the implications of the regularity assumption. These implications are both quantitative and qualitative. The quantitative implications are pursued in Section 9. In Section 10, we define the space $E(\tau)$ (see Definition 11), which is meant to abstract just some “qualitative” or “incidence” properties which follow from the quantitative results, and we prove some results about $E(\tau)$.
In Section 11, we apply the results from Section 9 and Section 10 to f and g. In brief, the bound on each piece is as follows.
Looking first at (L), the bound for (LG) is immediate, so we concentrate on (LB). The regularity assumption (Definition 6) will imply that the critical points of f and g roughly coincide; we consider the portion of $B^*$ near just one critical point, call this $B_j^*$. As $\epsilon \to 0$, certainly $|B_j^*| \to 0$, but $B_j^*$ is where $|f'|$ or $|g'|$ are small (or zero), so $\log|f'|$, $\log|g'|$ can be unbounded and we cannot say that (LB) is negligible just because $|B_j^*|$ is small. However, we observe that f can be approximated by a parabola $p_f$ near a critical point, and the same is true for g. We have
$\int_{B_j^*} \log|f'| = \int_{B_j^*} \log|p_f'| + \int_{B_j^*} \log|f'/p_f'|$
and it is only $\log|p_f'|$ which is unbounded; the estimate of (LB) is completed by proving a bound on
$\int_{B_j^*} \log|p_f'| \;-\; \int_{B_j^*} \log|p_g'|$
We turn next to ( E B ) . This piece is bounded by “crude” size estimates, the main point being that the set ( G * ) c is small. The integrand,
$H\big(p_f(f(t))\big) - H\big(p_g(g(t))\big)$
is bounded in size by the logarithm of one plus the number of critical points of f (the regularity assumption will imply that f and g have the same number of critical points). The size of the set ( G * ) c is bounded just by the number of components of B * , the size of B * , and some basic data about the functions f and g.
Finally, we have ( E G ) . We in fact only need some “incidence” data about the functions f and g to prove the result here (this data is summarized in Definition 11). The bound on this piece proceeds essentially as follows: Once we consider only t G * , we can say
$p_f(f(t)) = \{p_i\}_{i=0}^{k}, \qquad p_g(g(t)) = \{q_i\}_{i=0}^{k}$
and we need to bound $\left|\sum_{i=0}^{k} p_i \log p_i - q_i \log q_i\right|$. On $G^*$, we will see that $p_i$, $q_i$ are bounded below, so we use the inequality
$|p_i \log p_i - q_i \log q_i| \;\le\; \big[1 + \max\{|\log p_i|, |\log q_i|\}\big]\, |p_i - q_i|$
and the proof is completed by estimating the difference
$|p_i - q_i| = \left| \frac{\rho_{f_i}}{\rho_f} - \frac{\rho_{g_i}}{\rho_g} \right|$
which ultimately reduces to the fact that $f'\big(f_j^{-1}(f(t))\big)$ and $g'\big(g_j^{-1}(g(t))\big)$ are close together: $|f(t) - g(t)| \le \|f - g\|_\infty$, the inverses $f_j^{-1}$, $g_j^{-1}$ are going to be close together when evaluated at points that are close, and finally $f'$, $g'$ differ by at most $\|f' - g'\|_\infty$ when evaluated at the same point, so they are going to be close together when evaluated at points that are close.
In the following sections, the functions “f” and “g” will always refer to the two specific functions assumed by Theorem 1. We will use h and l to refer to generic functions.

9. The Space A ( δ )

In the following, we assume $h, l \in A(\delta)$.
Lemma 7. 
If $|h''(t_0)| \ge A > 0$ and $|t - t_0| \le A / \big(2\|h^{(3)}\|_\infty\big)$, then $|h''(t)| \ge A/2$.
$|h''(t) - h''(t_0)| \;\le\; \|h^{(3)}\|_\infty\, |t - t_0|.$
Lemma 8. 
Distinct zeros of $h'$ are separated by a distance of at least $\delta/2$. This also means $h'$ has at most $2/\delta$ zeros on $[0,1]$.
$h'(t_0) = 0$ implies that $|h''(t_0)| \ge \delta \|h^{(3)}\|_\infty$, and so
$|t - t_0| \le \delta/2 \;\Rightarrow\; |t - t_0| \le \big(\delta \|h^{(3)}\|_\infty\big) \big/ \big(2\|h^{(3)}\|_\infty\big)$
and then Lemma 7 implies $|h''(t)| > 0$. Since $h''$ is continuous, $h''$ does not change sign between t and $t_0$, so a zero of $h'$ must be further than $\delta/2$ away from $t_0$.
 
Lemma 9. If $\gamma \le \delta^2/16$ and $|h'(t_0)| \le \gamma \|h^{(3)}\|_\infty$, then $\exists\, t$ with $|t - t_0| \le 4\gamma/\delta$ and $h'(t) = 0$.
By definition of $A(\delta)$, we know that $t_0 \in (\delta, 1 - \delta)$ and $|h''(t_0)| \ge (\delta/2)\|h^{(3)}\|_\infty$. By Lemma 7 we know that $h''$ has the same sign on
$[t_0 - \delta/4,\; t_0 + \delta/4]$
and $|h''|$ is greater than $(\delta/4)\|h^{(3)}\|_\infty$ on that interval. Without loss of generality, suppose that
$0 \le h'(t_0) \le \gamma \|h^{(3)}\|_\infty, \qquad h''(t_0) < 0$
Let $t = t_0 + 4\gamma/\delta$. We see that $4\gamma/\delta \le \delta/4$, which means $t \in [0,1]$ and $t \in [t_0 - \delta/4, t_0 + \delta/4]$. So, on $[t_0, t]$, we have $h'' \le -(\delta/4)\|h^{(3)}\|_\infty$. Therefore,
$h'(t) = h'(t_0) + \int_{t_0}^{t} h'' \;\le\; \gamma\|h^{(3)}\|_\infty + \int_{t_0}^{t} \big(-(\delta/4)\|h^{(3)}\|_\infty\big) = \gamma\|h^{(3)}\|_\infty - (4\gamma/\delta)(\delta/4)\|h^{(3)}\|_\infty = 0$
So $h'(t_0) \ge 0$ and $h'(t) \le 0$ and it is proved.
Lemma 10. 
If
$\|h' - l'\|_\infty \;<\; \big(\delta^2/16\big) \min\big\{\|h^{(3)}\|_\infty, \|l^{(3)}\|_\infty\big\}$
then
$h'(t_0) = 0 \;\Rightarrow\; \exists\, t' \text{ with } l'(t') = 0, \quad |t' - t_0| \;\le\; \frac{4\|h' - l'\|_\infty}{\delta\, \min\{\|h^{(3)}\|_\infty, \|l^{(3)}\|_\infty\}}$
$l'(s_0) = 0 \;\Rightarrow\; \exists\, s' \text{ with } h'(s') = 0, \quad |s' - s_0| \;\le\; \frac{4\|h' - l'\|_\infty}{\delta\, \min\{\|h^{(3)}\|_\infty, \|l^{(3)}\|_\infty\}}$
This is a corollary of Lemma 9; if $h'(t_0) = 0$ then
$|l'(t_0)| \;<\; \big(\delta^2/16\big) \min\big\{\|h^{(3)}\|_\infty, \|l^{(3)}\|_\infty\big\}$
and we apply Lemma 9 to l. We proceed analogously if we assume $l'(t_0) = 0$.

10. The Space E ( τ )

Definition 11. 
We say that $h : [0,1] \to \mathbb{R}$ is a member of $E(\tau)$ if $h, h', h'', h^{(3)}$ are continuous and
$t_1 \neq t_2 \;\text{ and }\; h'(t_1) = h'(t_2) = 0 \;\Rightarrow\; |t_1 - t_2| > 2\tau$
$t \in [0, \tau] \cup [1 - \tau, 1] \;\Rightarrow\; h'(t) \neq 0$
Definition 12. 
For $h, l \in E(\tau)$, we say $h \sim l$ if the following two statements are true:
$h'(t) = 0 \;\Rightarrow\; \exists\, t',\; |t - t'| < \tau,\; l'(t') = 0$
$l'(s) = 0 \;\Rightarrow\; \exists\, s',\; |s - s'| < \tau,\; h'(s') = 0$
The set E ( τ ) just isolates some of the “incidence” properties of the critical points of h and l; we will make use of it because of the following two lemmas which connect it to A ( δ ) :
Lemma 13. 
If $h \in A(\delta)$ then there exists τ such that $h \in E(\tau)$.
This follows directly from the definition of A ( δ ) and Lemma 8.
Lemma 14. 
If $h, l \in A(\delta)$ and
$\|h' - l'\|_\infty \;<\; \big(\delta^2/16\big) \min\big\{\|h^{(3)}\|_\infty, \|l^{(3)}\|_\infty\big\}$
then there exists τ such that $h, l \in E(\tau)$ and $h \sim l$ in $E(\tau)$.
This follows directly from Lemma 10 and Lemma 13.
Lemma 15. 
If $h, l \in E(\tau)$ and $h \sim l$, then h and l have the same number of critical points, and if $h' = 0$ at $t_1 < \cdots < t_k$ and $l' = 0$ at $s_1 < \cdots < s_k$ then
$\forall\, i = 1, \ldots, k-1, \qquad \max\{t_i, s_i\} < \min\{t_{i+1}, s_{i+1}\}$
Define the function Φ,
$\Phi : \{\text{critical points of } h\} \;\to\; \{\text{critical points of } l\}$
as follows: For a critical point t of h, let
$\Phi(t) = t' \quad \text{where } |t - t'| < \tau,\; l'(t') = 0$
It is easy to check that Φ is well defined and is, in fact, a bijection. It follows that $s_i = \Phi(t_i)$, and so the second conclusion follows from the fact that $|t_{i+1} - t_i| > 2\tau$, $|t_i - s_i| < \tau$, $|t_{i+1} - s_{i+1}| < \tau$.
Definition 16. 
For functions h and l, and $B \subseteq [0,1]$, let
$I_{h,l}(B) = \bigcup_{t \in B} \Big( \langle h(t), l(t)\rangle + \big[ -\|h - l\|_\infty,\; \|h - l\|_\infty \big] \Big), \qquad G_{h,l}(B) = [0,1] \setminus \big( h^{-1}(I) \cup l^{-1}(I) \big)$
(recall the notation from Definition 1).
Lemma 17. 
Suppose $h, l \in E(\tau)$, and $h \sim l$. By Lemma 15 suppose they have critical points $t_1 < \cdots < t_k$ and $s_1 < \cdots < s_k$ respectively, and define $h_i$, $l_i$, $t_0$, $s_0$, $t_{k+1}$, $s_{k+1}$ as in Definition 3. Suppose we have a set B such that
$B \;\supseteq\; \{0\} \cup \{1\} \cup \bigcup_i \langle t_i, s_i\rangle$
Let $I = I_{h,l}(B)$ and $G = G_{h,l}(B)$ be defined as in Definition 16. Fix $i \in \{0, \ldots, k\}$ and $t \in G$. Then the following four statements are equivalent:
$1)\; h(t) \in \mathrm{Range}(h_i) \qquad 2)\; h(t) \in \mathrm{Range}(l_i) \qquad 3)\; l(t) \in \mathrm{Range}(h_i) \qquad 4)\; l(t) \in \mathrm{Range}(l_i)$
Suppose that 1) holds and $h(t) \in \mathrm{Range}(h_i)$. Without loss of generality, suppose that $h_i' > 0$. We know
$h(t) \;\notin\; \Big[ \min_{\langle t_i, s_i\rangle}\{h, l\} - \|h - l\|_\infty,\; \max_{\langle t_i, s_i\rangle}\{h, l\} + \|h - l\|_\infty \Big]$
since that set is in I and $t \in G$. Since $h(t) \ge h(t_i)$, we know $h(t) \ge \max_{\langle t_i, s_i\rangle}\{h, l\} + \|h - l\|_\infty$. Also,
$l(t) \;\notin\; \Big[ \min_{\langle t_i, s_i\rangle}\{h, l\} - \|h - l\|_\infty,\; \max_{\langle t_i, s_i\rangle}\{h, l\} + \|h - l\|_\infty \Big]$
for the same reason, and we cannot have $l(t) < \min_{\langle t_i, s_i\rangle}\{h, l\} - \|h - l\|_\infty$ since that would imply
$|h(t) - l(t)| > \|h - l\|_\infty$
So, $l(t) \ge \max_{\langle t_i, s_i\rangle}\{h, l\} + \|h - l\|_\infty$. By similar logic,
$h(t),\, l(t) \;\le\; \min_{\langle t_{i+1}, s_{i+1}\rangle}\{h, l\} - \|h - l\|_\infty$
So, we see that
$\max_{\langle t_i, s_i\rangle}\{h, l\} + \|h - l\|_\infty \;\le\; h(t),\, l(t) \;\le\; \min_{\langle t_{i+1}, s_{i+1}\rangle}\{h, l\} - \|h - l\|_\infty$
So, the left hand side is less than or equal to the right hand side and we have $h(t), l(t) \in \mathrm{Range}(h_i) \cap \mathrm{Range}(l_i)$. This proves
$1) \Rightarrow 2),\; 3),\; 4)$
The other implications are seen in the same manner.
Lemma 18. 
Under the same assumptions as Lemma 17, for t ∈ G we have
[h(t), l(t)] ⊆ I^c
and so
h^{-1}([h(t), l(t)]), l^{-1}([h(t), l(t)]) ⊆ B^c
This follows from the observation that I contains no intervals of length less than ‖h − l‖, and h(t), l(t) ∉ I.
Lemma 19. 
Under the same assumptions as Lemma 17, if we have t ∈ G, u ∈ [h(t), l(t)], and h(t) ∈ Range(h_j), then
[h_j^{-1}(u), l_j^{-1}(u)] ⊆ Dom(h_j) ∩ Dom(l_j) ∩ B^c
and therefore neither h′ nor l′ changes sign on [h_j^{-1}(u), l_j^{-1}(u)].
We know by Lemma 17 that u ∈ Range(h_j) ∩ Range(l_j). Suppose h_j(t_1) = u. Then t_1 ∉ B by Lemma 18. From Lemma 15 we see that Dom(h_j) ∩ B^c = Dom(l_j) ∩ B^c. So,
t_1 ∈ Dom(h_j) ∩ B^c ⟹ t_1 ∈ Dom(l_j) ⟹ h_j^{-1}(u) ∈ Dom(h_j) ∩ Dom(l_j) ∩ B^c
The same is true for l_j^{-1}(u), and so [h_j^{-1}(u), l_j^{-1}(u)] ⊆ Dom(h_j) ∩ Dom(l_j) ∩ B^c.

11. Applying the Results to f and g

We now return to the two specific functions f and g, which have the properties stated in the assumptions of Theorem 1. We introduce a variable, μ, which will allow us to divide [0, 1] into areas where f′ or g′ is "small" according to μ, specifically
|f′(t)| ≤ μ‖f^{(3)}‖ or |g′(t)| ≤ μ‖g^{(3)}‖
and areas where both f′ and g′ are "large" according to μ, specifically
|f′(t)| ≥ μ‖f^{(3)}‖ and |g′(t)| ≥ μ‖g^{(3)}‖
At the end of the proof, we will optimize over μ; it will turn out that μ = ε^{1/4} is optimal.
We state here some restrictions on the values of μ that we will consider:
ε^{3/4} < μ ≤ (1/32)δ²
Notice that, since we will set μ = ε^{1/4}, we will require ε ≤ 2^{-20} δ^8.
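Indeed, a direct check of the two constraints with μ = ε^{1/4} (assuming, as we may, that ε < 1, so that ε^{3/4} < ε^{1/4}) gives
ε^{1/4} ≤ (1/32)δ²  ⟺  ε ≤ (δ²/32)^4 = 2^{-20} δ^8.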
We know that f, g ∈ A(δ), and we also know that ‖f′ − g′‖ < (δ²/16) min{‖f^{(3)}‖, ‖g^{(3)}‖} by (14) and (19). This means that we can apply the results of Section 9 to f and g, and with Lemmas 13 and 14 we see that we can also apply the results of Section 10.
With this in mind, by Lemma 15 let us suppose that f′ = 0 at t_1^* < ⋯ < t_k^* and g′ = 0 at s_1^* < ⋯ < s_k^*. Next, we want to construct a region covering the critical points of f and g, and extending beyond the critical points far enough to cover the entire area where f′ or g′ is "small" (according to μ).
Definition 20. 
For i = 1, …, k, let J_i^* = [ min{t_i^*, s_i^*} − 4μ/δ, max{t_i^*, s_i^*} + 4μ/δ ].
We see that the critical points of f and g are contained in the J_i^*. Furthermore, suppose that |f′(t_0)| ≤ μ‖f^{(3)}‖. By Lemma 9, we see that the distance from t_0 to a critical point of f is no more than 4μ/δ, which means that t_0 is contained in some J_i^*. The same is true for g, and therefore we have
{ t : |f′(t)| ≤ μ‖f^{(3)}‖ or |g′(t)| ≤ μ‖g^{(3)}‖ } ⊆ ⋃_{i=1}^{k} J_i^*
For purposes of the expression H p f f ( t ) , the “bad” set will also include the endpoints of the interval, so we define the sets B * and G * as follows.
Definition 21. 
B^* = {0} ∪ {1} ∪ ⋃_{i=1}^{k} J_i^*,   G^* = G_{f,g}(B^*)
Notice that G^* ∩ B^* = ∅, but we do not necessarily have G^* ∪ B^* = [0, 1].
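As a purely illustrative aside (not part of the proof), the covering construction of Definitions 20 and 21 is easy to carry out explicitly; the critical points, μ, and δ below are hypothetical placeholder values.

import numpy as np  # not required; shown only for consistency with later sketches

def build_bad_set(t_star, s_star, mu, delta):
    # J_i* = [min{t_i*, s_i*} - 4*mu/delta, max{t_i*, s_i*} + 4*mu/delta]
    pad = 4 * mu / delta
    J = [(min(t, s) - pad, max(t, s) + pad) for t, s in zip(t_star, s_star)]
    # B* also contains the endpoints {0} and {1}; represent them as degenerate intervals.
    return [(0.0, 0.0), (1.0, 1.0)] + J

# Hypothetical critical points of f' and g', and placeholder values of mu and delta.
print(build_bad_set([0.25, 0.75], [0.26, 0.74], mu=0.01, delta=0.2))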
We are now ready to state how we will decompose the expression h(f) − h(g) in order to estimate it. First, recall by equation (7) that
h(f) − h(g) = ∫_0^1 ( log|f′(t)| − log|g′(t)| ) dt − ∫_0^1 ( H_{p_f f}(t) − H_{p_g g}(t) ) dt
We will divide [ 0 , 1 ] in different ways to estimate the different integrals:
For ∫_0^1 ( log|f′(t)| − log|g′(t)| ) dt we write [0, 1] = ( ⋃ J_i^* ) ∪ ( ⋃ J_i^* )^c.   For ∫_0^1 ( H_{p_f f}(t) − H_{p_g g}(t) ) dt we write [0, 1] = G^* ∪ (G^*)^c.
This allows us to decompose our main expression, h ( f ) h ( g ) , into the following four pieces:
Definition 22. 
(LB) = ∫_{⋃ J_i^*} ( log|f′(t)| − log|g′(t)| ) dt    (LG) = ∫_{(⋃ J_i^*)^c} ( log|f′(t)| − log|g′(t)| ) dt
(EB) = ∫_{(G^*)^c} ( H_{p_f f}(t) − H_{p_g g}(t) ) dt    (EG) = ∫_{G^*} ( H_{p_f f}(t) − H_{p_g g}(t) ) dt
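As an aside only (this is neither the decomposition above nor the estimator analyzed in this paper), one can check numerically on toy examples that h(f) and h(g) are close when f and g are close, by crudely histogramming sampled values; the functions f and g in the sketch below are hypothetical stand-ins.

import numpy as np

def entropy_of_values(values, bins=200):
    # Crude estimate of the continuous entropy of the value distribution:
    # -sum p(u) log p(u) du, with the density p estimated by a normalized histogram.
    p, edges = np.histogram(values, bins=bins, density=True)
    widths = np.diff(edges)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]) * widths[mask])

t = np.linspace(0.0, 1.0, 200_000)
f = np.sin(3 * np.pi * t)             # hypothetical f with a few critical points
g = f + 1e-3 * np.cos(7 * np.pi * t)  # nearby g
print(abs(entropy_of_values(f) - entropy_of_values(g)))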
Next, we have four propositions, each estimating one of the above expressions in a certain setting. Each proposition will be applicable to the functions f and g, and the sets J i * , B * and G * , but it will be helpful to state and prove the propositions using only the relevant information in each case.
In order to not interrupt our exposition, we have placed the proofs of these propositions in the appendix.
Proposition 23 
(Used to bound (LB)). Suppose that h, l ∈ A(δ) and
‖h′ − l′‖ < (δ²/16) min{‖h^{(3)}‖, ‖l^{(3)}‖}
By Lemma 15, let their critical points be t_1 < ⋯ < t_k and s_1 < ⋯ < s_k respectively. Let
J_i = [ min{t_i, s_i} − a, max{t_i, s_i} + a ]
and suppose that we know, for all i, |t_i − s_i| + a ≤ δ/4. Then
Σ_i ∫_{J_i} | log|h′(t)| − log|l′(t)| | dt ≤ Σ_i ( |t_i − s_i| + 2a ) [ ‖h″ − l″‖ / ( δ min{‖h^{(3)}‖, ‖l^{(3)}‖} ) + (8/δ)( |t_i − s_i| + 2a ) ]
Proposition 24 
(Used to bound ( L G ) ). Suppose that h and l have three continuous derivatives on [ 0 , 1 ] . Let
X = { t : |h′(t)| ≥ μ‖h^{(3)}‖ and |l′(t)| ≥ μ‖l^{(3)}‖ }
Then
∫_X | log|h′(t)| − log|l′(t)| | dt ≤ ‖h′ − l′‖ / ( μ min{‖h^{(3)}‖, ‖l^{(3)}‖} )
This follows immediately from Lemma 27.
Proposition 25 
(Used to bound (EB)). Suppose that h, l have three continuous derivatives and have finite numbers, k_h and k_l, of critical points respectively, and h′(0), l′(0), h′(1), l′(1) ≠ 0. Suppose that we have closed intervals {L_α}_{α ∈ A} and we define B = ⋃_{α ∈ A} L_α and G = G_{h,l}(B) as in Definition 16. Then
| G^c | ≤ ( k_h + k_l + 2 ) [ 4|A| ‖h − l‖ + Σ_α max_{L_α}{ |h′|, |l′| } |L_α| ] [ min_{B^c}{ |h′|, |l′| } ]^{-1} + |B|
Proposition 26 
(Used to bound (EG)). Suppose that h, l ∈ E(τ) and h ∼ l in E(τ). By Lemma 15, let their critical points be t_1 < ⋯ < t_k and s_1 < ⋯ < s_k respectively. Suppose we have a set B such that
B ⊇ {0} ∪ {1} ∪ ⋃_i [t_i, s_i]
and define G = G_{h,l}(B) as in Definition 16. Then
∫_G | H_{p_h h}(t) − H_{p_l l}(t) | dt ≤ (k + 1)(k + 2) [ 1 + | log( min_{B^c}{ |h′|, |l′| } / ( (k + 1) max{‖h′‖, ‖l′‖} ) ) | ] ( max{‖h′‖, ‖l′‖} ) [ min_{B^c}{ |h′|, |l′| } ]^{-2} × [ ‖h′ − l′‖ + 2‖h − l‖ [ min_{B^c}{ |h′|, |l′| } ]^{-1} ( ‖h″‖ + ‖l″‖ )/2 ]
Now, we apply these propositions to f and g. We will abbreviate
Γ_1 = max{‖f′‖, ‖g′‖} / min{‖f^{(3)}‖, ‖g^{(3)}‖},   Γ_2 = max{‖f″‖, ‖g″‖} / min{‖f^{(3)}‖, ‖g^{(3)}‖}
The assumptions of Theorem 1 mean that
C_1^{-1} ≤ Γ_1 ≤ C_1,   Γ_2 ≤ C_2
and without loss of generality we may assume C_1, C_2 ≥ 3.
We know that ‖f′ − g′‖ < (δ²/16) min{‖f^{(3)}‖, ‖g^{(3)}‖} by (14) and (19), and therefore by Lemma 10 and equation (14) we have
for all i,   | t_i^* − s_i^* | ≤ 4 ε^{3/4} / δ
To apply Proposition 23 (with a = 4 μ / δ ), we must have
4 ε^{3/4}/δ + 4μ/δ ≤ δ/4
and this is implied by (19). So, with some simplifying to arrive at the second equation, we have
| (LB) | ≤ Σ_i ( |t_i^* − s_i^*| + 2·4μ/δ ) [ ‖f″ − g″‖ / ( δ min{‖f^{(3)}‖, ‖g^{(3)}‖} ) + (8/δ)( |t_i^* − s_i^*| + 2·4μ/δ ) ]
 ≤ [ k ( 4ε^{3/4} + 8μ ) / δ² ] [ ‖f″ − g″‖ / min{‖f^{(3)}‖, ‖g^{(3)}‖} + (8/δ)( 4ε^{3/4} + 8μ ) ]
 ≤ ( 12kμ/δ² ) [ Γ_2 + 96μ/δ ]
 ≤ 12 k C_2 · 97 · μ/δ²
Next, looking to Proposition 24, recalling equation (20) we have
( ⋃_{i=1}^{k} J_i^* )^c ⊆ { t : |f′(t)| ≥ μ‖f^{(3)}‖ and |g′(t)| ≥ μ‖g^{(3)}‖ }
and so we can apply Proposition 24 to f and g, which gives
| (LG) | ≤ ‖f′ − g′‖ / ( μ min{‖f^{(3)}‖, ‖g^{(3)}‖} ) ≤ ε^{3/4} / μ
Looking to Proposition 25, we have B^* = {0} ∪ {1} ∪ ⋃_{i=1}^{k} J_i^* = ⋃_α L_α. Equation (20) means that
B^* ⊇ { t : |f′(t)| ≤ μ‖f^{(3)}‖ or |g′(t)| ≤ μ‖g^{(3)}‖ }
which means
[ min_{(B^*)^c}{ |f′|, |g′| } ]^{-1} ≤ [ μ min{‖f^{(3)}‖, ‖g^{(3)}‖} ]^{-1}
In the notation of Proposition 25, k_f = k_g = k and |A| = k + 2, and aside from {0} and {1}, which have measure zero, |J_i^*| ≤ ( 4ε^{3/4}/δ ) + 4μ/δ ≤ 8μ/δ. Also, note that
( max_{J_i^*}{ |f′|, |g′| } ) ≤ ( max{‖f″‖, ‖g″‖} ) · |J_i^*| ≤ ( 8μ/δ ) ( max{‖f″‖, ‖g″‖} )
which means
| (G^*)^c | ≤ ( k_f + k_g + 2 ) [ 4|A| ‖f − g‖ + Σ_α max_{L_α}{ |f′|, |g′| } |L_α| ] [ min_{(B^*)^c}{ |f′|, |g′| } ]^{-1} + |B^*|
 ≤ 2( k + 1 ) [ 4( k + 2 ) ε min{‖f^{(3)}‖, ‖g^{(3)}‖} + k ( 8μ/δ )² max{‖f″‖, ‖g″‖} ] [ μ min{‖f^{(3)}‖, ‖g^{(3)}‖} ]^{-1} + k ( 8μ/δ )
 ≤ 8( k + 2 )² ε/μ + 2^7 k( k + 1 ) ( μ/δ² ) Γ_2 + 8kμ/δ
 ≤ 2^7 ( k + 2 )² C_2 ( ε/μ + μ/δ² + μ/δ )
 ≤ 2^8 ( k + 2 )² C_2 ( ε/μ + μ/δ )
This means
| (EB) | ≤ 2^8 ( k + 2 )² C_2 ( ε/μ + μ/δ ) log( k + 1 )
Looking to Proposition 26, by Lemma 14 and Definition 21 we can apply Proposition 26 to f and g, so we have
∫_{G^*} | H_{p_f f}(t) − H_{p_g g}(t) | dt ≤ ( k + 1 )( k + 2 ) [ 1 + | log( min_{(B^*)^c}{ |f′|, |g′| } / ( ( k + 1 ) max{‖f′‖, ‖g′‖} ) ) | ] ( max{‖f′‖, ‖g′‖} ) [ min_{(B^*)^c}{ |f′|, |g′| } ]^{-2} × [ ‖f′ − g′‖ + 2 ‖f − g‖ [ min_{(B^*)^c}{ |f′|, |g′| } ]^{-1} ( ‖f″‖ + ‖g″‖ )/2 ]
 ≤ ( k + 2 )² [ 1 + | log( μ min{‖f^{(3)}‖, ‖g^{(3)}‖} / ( ( k + 1 ) max{‖f′‖, ‖g′‖} ) ) | ] ( max{‖f′‖, ‖g′‖} ) [ μ min{‖f^{(3)}‖, ‖g^{(3)}‖} ]^{-2} × [ ε^{3/4} min{‖f^{(3)}‖, ‖g^{(3)}‖} + 2 ε max{‖f″‖, ‖g″‖} ]
 ≤ ( k + 2 )² [ Γ_1 ε^{3/4}/μ² + 2 ε Γ_2 ] [ 1 + log( μ^{-1} ( k + 1 ) Γ_1 ) ]
 ≤ 6 C_1 C_2 ( k + 2 )² log( C_1 ( k + 1 ) μ^{-1} ) ε^{3/4}/μ²
Simplifying the logarithm term in the last line was the reason for assuming C_1 ≥ 3.
Combining equations (22), (23), (24), (25), and letting C denote an absolute constant, we have
| h(f) − h(g) | ≤ C C_1 C_2 ( k + 2 )² log( C_1 ( k + 1 ) μ^{-1} ) [ μ/δ² + ε^{3/4}/μ + ε/μ + μ/δ + ε^{3/4}/μ² ]
Examining just the last expression in the product, again letting C be an absolute constant, we have
μ/δ² + ε^{3/4}/μ + ε/μ + μ/δ + ε^{3/4}/μ² ≤ C ( μ/δ² + ε^{3/4}/μ² )
Treating δ as a fixed parameter, this expression is optimized by μ = ε^{1/4} (which balances μ/δ² against ε^{3/4}/μ² up to δ-dependent constants), and using Lemma 8 we see that k ≤ 2/δ. Therefore, with C being another absolute constant, we have
| h(f) − h(g) | ≤ C C_1 C_2 log( C_1 δ^{-1} ε^{-1/4} ) ε^{1/4}/δ^4
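For completeness, here is the balancing calculation behind the choice of μ (leaving the logarithm factor aside, and assuming δ ≤ 1): the map μ ↦ μ/δ² + ε^{3/4}/μ² is minimized at μ = (2δ²)^{1/3} ε^{1/4}, which scales as ε^{1/4}. Taking μ = ε^{1/4} gives
μ/δ² + ε^{3/4}/μ² = ε^{1/4}/δ² + ε^{1/4} ≤ 2 ε^{1/4}/δ²,
and, since k ≤ 2/δ implies (k + 2)² ≤ 16/δ², this is how the factor ε^{1/4}/δ^4 in the final bound arises.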

Acknowledgements

We would like to thank the referees, and Michael S. Hughes, for many helpful comments.

References

  1. Hughes, M. A comparison of shannon entropy versus signal energy for acoustic detection of artificially induced defects in plexiglass. J. Acoust. Soc. Am. 1992, 91, 2272–2275. [Google Scholar] [CrossRef]
  2. Hughes, M. Analysis of digitized waveforms using shannon entropy. J. Acoust. Soc. Am. 1993, 93, 892–906. [Google Scholar] [CrossRef]
  3. Hughes, M. Analysis of digitized waveforms using shannon entropy II. High-speed algorithms based on Green’s functions. J. Acoust. Soc. Am. 1994, 95, 2582–2588. [Google Scholar] [CrossRef]
  4. Hughes, M.; Marsh, J.; Hall, C.; Savy, D.; Scott, M.; Allen, J.; Lacy, E.; Carradine, C.; Lanza, G.; Wickline, S. Characterization of digital waveforms using thermodynamic analogs: Applications to detection of material defects. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 2005, 52, 555–1564. [Google Scholar] [CrossRef]
  5. Hughes, M.; Marsh, J.; Zhang, H.; Woodson, A.; Allen, J.; Lacy, E.; Carradine, C.; Lanza, G.; Wickline, S. Characterization of digital waveforms using thermodynamic analogs: Detection of contrast-targeted tissue in vivo. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 2006, 53, 1609–1616. [Google Scholar] [CrossRef] [PubMed]
  6. Hughes, M.; Mc Carthy, J.; Marsh, J.; Arbeit, J.; Neumann, R.; Fuhrhop, R.; Wallace, K.; Znidersic, D.; Maurizi, B.; Baldwin, S.; Lanza, G.; Wickline, S. Properties of an entropy-based signal receiver with an application to ultrasonic molecular imaging. J. Acoust. Soc. Am. 2007, 121, 3542–3557. [Google Scholar] [CrossRef] [PubMed]
  7. Hughes, M.; Mc Carthy, J.; Wickerhauser, M.V.; Marsh, J.; Arbeit, J.; Fuhrhop, R.; Wallace, K.; Thomas, T.; Smith, J.; Agyem, K.; Lanza, G.; Wickline, S. Real-time calculation of the limiting form of the Renyi entropy applied to detection of subtle changes in scattering architecture. J. Acoust. Soc. Am. 2009, 126, 2350–2358. [Google Scholar] [CrossRef] [PubMed]
  8. Beirlant, J.; Dudewicz, E.; Gyorfi, L.; van der Meulen, E. Nonparametric entropy estimation: An overview. Int. J. Math. Stat. Sci. 1997, 6, 17–39. [Google Scholar]
  9. Maurizi, B. Noise Sensitivity of An Entropy-Based Signal Receiver. Ph.D. thesis, Washington University in Saint Louis, May 2008. [Google Scholar]
  10. Goria, M.; Leonenko, N.; Mergel, V.; Inverardi, P. A new class of random vector entropy estimators and its applications in Testing Statistical Hypotheses. J. Nonparametr. Statist. 2005, 17, 277–297. [Google Scholar] [CrossRef]
  11. van Es, B. Estimating functionals related to a density by a class of statistics based on spacings. Scand. J. Stat. 1992, 19, 61–72. [Google Scholar]
  12. Joe, H. Estimation of entropy and other functionals of a multivariate density. Ann. Inst. Stat. Math. 1989, 41, 683–697. [Google Scholar] [CrossRef]
  13. Levit, B. Asymptotically efficient estimation of nonlinear functionals. Probl. Inform. Transm. 1978, 14, 204–209. [Google Scholar]
  14. Leonenko, N.; Pronzato, L.; Savani, V. A class of renyi information estimators for multidimensional densities. Ann. Statist 2008, 36, 2153–2182. [Google Scholar] [CrossRef]
  15. Tsybakov, A.; van der Meulen, E. Root-n consistent estimators of entropy for densities with unbounded support. Scand. J. Stat. 1996, 23, 75–83. [Google Scholar]
  16. Eggermont, P.; LaRiccia, V. Best asymptotic normality of the kernel density entropy estimator for smooth densities. IEEE Trans. Inf. Theory 1999, 45, 1321–1326. [Google Scholar] [CrossRef]
  17. Dmitriev, Y.; Tarasenko, F. On the estimation of functionals of the probability density and its derivatives. Theory Probab. Appl. 1973, 18, 628–633. [Google Scholar] [CrossRef]
  18. Ahmad, I.; Lin, P. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Inf. Theory 1976, 22, 372–375. [Google Scholar] [CrossRef]
  19. Hall, P.; Morton, S. On the estimation of entropy. Ann. Inst. Stat. Math. 1993, 45, 69–88. [Google Scholar] [CrossRef]
  20. Mokkadem, A. Estimation of the entropy and information for absolutely continuous random variables. IEEE Trans. Inf. Theory 1989, 35, 193–196. [Google Scholar] [CrossRef]
  21. Joe, H. Sample estimate of the entropy of a random vector. Ann. Inst. Stat. Math. 1989, 41, 83–697. [Google Scholar]
  22. Vasicek, O. A test for normality based on sample entropy. J. Roy. Statist. Soc. Ser. B. 1976, 38, 54–59. [Google Scholar]
  23. Gyorfi, L.; van der Meulen, E. Density-Free convergence properties of various estimators of entropy. Comput. Stat. Data Anal. 1987, 5, 425–436. [Google Scholar] [CrossRef]
  24. Godavarti, M.; Hero, A. Convergence of differential entropies. IEEE Trans. Inf. Theory 2004, 50, 171–176. [Google Scholar] [CrossRef]
  25. Csiszár, I. On generalized entropy. Studia Sci. Math. Hungar. 1969, 4, 401–419. [Google Scholar]
  26. Rényi, A. On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hungar. 1959, 10, 193–215. [Google Scholar] [CrossRef]
  27. Rudin, W. Real and Complex Analysis, 3rd Edition ed; McGraw-Hill Book Company: New York, NY, USA, 1987. [Google Scholar]
  28. O’Sullivan, J.A.; Washington University in Saint Louis School of Engineering and Applied Science, Saint Louis, MS, USA. Personal Communication, 2007.
  29. Cover, T.; Thomas, J. Elements of Information Theory, 2nd Edition ed; Wiley-Interscience [John Wiley and Sons]: Hoboken, NJ, USA, 2006. [Google Scholar]

12. Appendix

12.1. Miscellaneous Technical Lemmas

Lemma 27. 
If a ≥ ā > 0 and b ≥ b̄ > 0, then
1/( 1 + |a − b|/ā ) ≤ a/b ≤ 1 + |a − b|/b̄
and therefore
| log(a/b) | ≤ log( 1 + |a − b| / min{ā, b̄} ) ≤ |a − b| / min{ā, b̄}
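A short justification under the stated hypotheses: since b ≥ b̄ and a ≥ ā,
a/b = 1 + (a − b)/b ≤ 1 + |a − b|/b̄   and   b/a = 1 + (b − a)/a ≤ 1 + |a − b|/ā,
and the second bound, inverted, gives a/b ≥ [1 + |a − b|/ā]^{-1}. Taking logarithms and using log(1 + x) ≤ x for x ≥ 0 then gives |log(a/b)| ≤ log(1 + |a − b|/min{ā, b̄}) ≤ |a − b|/min{ā, b̄}.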
Lemma 28. 
If a_i, b_i ≥ 0 and Σ_{i=1}^{n} a_i = a > 0, Σ_{i=1}^{n} b_i = b > 0, then
| a_i/a − b_i/b | ≤ [ min(a, b) ]^{-1} ( n + 1 ) max_i | a_i − b_i |
| a_i/a − b_i/b | = (1/a) | a_i − b_i (a/b) | ≤ [ min(a, b) ]^{-1} [ |a_i − b_i| + |b_i − b_i (a/b)| ] = [ min(a, b) ]^{-1} [ |a_i − b_i| + (b_i/b)|b − a| ] ≤ [ min(a, b) ]^{-1} [ |a_i − b_i| + Σ_j |a_j − b_j| ] ≤ [ min(a, b) ]^{-1} ( n + 1 ) max_i | a_i − b_i |
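As a quick numerical sanity check of this inequality (illustrative only, on random hypothetical inputs):

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    n = int(rng.integers(2, 10))
    a_i, b_i = rng.random(n), rng.random(n)
    a, b = a_i.sum(), b_i.sum()
    lhs = np.max(np.abs(a_i / a - b_i / b))
    rhs = (n + 1) * np.max(np.abs(a_i - b_i)) / min(a, b)
    assert lhs <= rhs + 1e-12  # the bound of Lemma 28 holds on these samples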

12.2. Proofs of Main Propositions

Proof of Proposition 23 

Proposition 23 states: Suppose that h, l ∈ A(δ) and
‖h′ − l′‖ < (δ²/16) min{‖h^{(3)}‖, ‖l^{(3)}‖}
By Lemma 15, let their critical points be t_1 < ⋯ < t_k and s_1 < ⋯ < s_k respectively. Let
J_i = [ min{t_i, s_i} − a, max{t_i, s_i} + a ]
and suppose that we know, for all i, |t_i − s_i| + a ≤ δ/4. Then
Σ_i ∫_{J_i} | log|h′(t)| − log|l′(t)| | dt ≤ Σ_i ( |t_i − s_i| + 2a ) [ ‖h″ − l″‖ / ( δ min{‖h^{(3)}‖, ‖l^{(3)}‖} ) + (8/δ)( |t_i − s_i| + 2a ) ]
Let us fix i and define the Taylor polynomials
p_h(t) = h(t_i) + ( h″(t_i)/2 )( t − t_i )²,   p_l(s) = l(s_i) + ( l″(s_i)/2 )( s − s_i )²
We see that
h′(t)/p_h′(t) − 1 = ( h′(t) − p_h′(t) ) / p_h′(t) = [ h^{(3)}(c) ( t − t_i )² / 2 ] · ( 1/p_h′(t) )
and we know |p_h′(t)| = |h″(t_i)| |t − t_i| ≥ δ ‖h^{(3)}‖ |t − t_i|, so the above expression is
≤ [ ‖h^{(3)}‖ ( t − t_i )² / 2 ] [ δ ‖h^{(3)}‖ |t − t_i| ]^{-1} = |t − t_i| / (2δ)
Note that, since |t − t_i| ≤ |t_i − s_i| + a, by assumption the above expression is ≤ 1/2. Therefore,
| log( h′(t)/p_h′(t) ) | ≤ |t − t_i| / δ
and so
∫_{J_i} | log|h′(t)| − log|p_h′(t)| | dt ≤ ( |t_i − s_i| + 2a )² / δ
This is also true for ∫_{J_i} | log|l′(t)| − log|p_l′(t)| | dt, so it remains to bound ∫_{J_i} | log|p_h′(t)| − log|p_l′(t)| | dt.
We will make some abbreviations to aid our exposition in this section: Let
α = h″(t_i),   β = l″(s_i)
so we have p_h′(t) = α( t − t_i ) and p_l′(s) = β( s − s_i ). Without loss of generality, suppose t_i < s_i and abbreviate γ = ( s_i − t_i )/2. Define t_c = ( t_i + s_i )/2, so that we have
J_i = [ t_c − η, t_c + η ]
where η = |t_i − s_i|/2 + a. Now, we have
∫_{J_i} ( log|p_h′(t)| − log|p_l′(t)| ) dt = ∫_{t_c − η}^{t_c + η} log|α( t − t_i )| dt − ∫_{t_c − η}^{t_c + η} log|β( t − s_i )| dt
 = ∫_{t_c − t_i − η}^{t_c − t_i + η} log|αt| dt − ∫_{t_c − s_i − η}^{t_c − s_i + η} log|βt| dt
 = ∫_{γ − η}^{γ + η} log|αt| dt − ∫_{−γ − η}^{−γ + η} log|βt| dt
 = ∫_{γ − η}^{−γ + η} ( log|αt| − log|βt| ) dt + ∫_{−γ + η}^{γ + η} log|αt| dt − ∫_{−γ − η}^{γ − η} log|βt| dt
 = log( |α|/|β| )( 2η − 2γ ) + 2γ log|α| − 2γ log|β| + ∫_{η − γ}^{η + γ} log|t| dt − ∫_{η − γ}^{η + γ} log|t| dt
 = 2η log( |h″(t_i)| / |l″(s_i)| )
Now, |h″(t_i)| ≥ δ‖h^{(3)}‖ and |l″(s_i)| ≥ δ‖l^{(3)}‖, so by Lemma 27 we have
| 2η log( |h″(t_i)| / |l″(s_i)| ) | ≤ 2η ‖h″ − l″‖ / ( δ min{‖h^{(3)}‖, ‖l^{(3)}‖} )
From this, we conclude
∫_{J_i} | log|p_h′(t)| − log|p_l′(t)| | dt ≤ ( |t_i − s_i| + 2a ) ‖h″ − l″‖ / ( δ min{‖h^{(3)}‖, ‖l^{(3)}‖} )
and therefore
Σ_i ∫_{J_i} | log|h′(t)| − log|l′(t)| | dt ≤ Σ_i [ ∫_{J_i} | log|h′(t)| − log|p_h′(t)| | + ∫_{J_i} | log|p_h′(t)| − log|p_l′(t)| | + ∫_{J_i} | log|p_l′(t)| − log|l′(t)| | ] ≤ Σ_i [ ( |t_i − s_i| + 2a ) ‖h″ − l″‖ / ( δ min{‖h^{(3)}‖, ‖l^{(3)}‖} ) + 2( |t_i − s_i| + 2a )² / δ ]

Proof of Proposition 25 

Proposition 25 states: Suppose that h, l have three continuous derivatives and have finite numbers, k_h and k_l, of critical points respectively, and h′(0), l′(0), h′(1), l′(1) ≠ 0. Suppose that we have closed intervals {L_α}_{α ∈ A} and we define B = ⋃_{α ∈ A} L_α and G = G_{h,l}(B) as in Definition 16. Then
| G^c | ≤ ( k_h + k_l + 2 ) [ 4|A| ‖h − l‖ + Σ_α max_{L_α}{ |h′|, |l′| } |L_α| ] [ min_{B^c}{ |h′|, |l′| } ]^{-1} + |B|
We begin by noting that
G^c = h^{-1}(I) ∪ l^{-1}(I) ⊆ ( h^{-1}(I) ∩ B^c ) ∪ ( l^{-1}(I) ∩ B^c ) ∪ B = ( ⋃_j h_j^{-1}(I) ∩ B^c ) ∪ ( ⋃_j l_j^{-1}(I) ∩ B^c ) ∪ B
(recall Definition 3). For a monotone function λ, we can measure λ^{-1}(A) by looking at ∫_A | (λ^{-1})′ |. In this case, we see that
h_j^{-1}(I) ∩ B^c ⊆ h_j^{-1}( I ∩ h_j(B^c) )
and therefore
| h_j^{-1}(I) ∩ B^c | ≤ ∫_{I ∩ h_j(B^c)} | (h_j^{-1})′(u) | du = ∫_{I ∩ h_j(B^c)} | h′( h_j^{-1}(u) ) |^{-1} du ≤ ∫_{I ∩ h_j(B^c)} [ min_{B^c}{ |h′|, |l′| } ]^{-1} du ≤ |I| [ min_{B^c}{ |h′|, |l′| } ]^{-1}
The same is true for l_j. By (26), this means
| G^c | ≤ ( k_h + k_l + 2 ) |I| [ min_{B^c}{ |h′|, |l′| } ]^{-1} + |B|
So, it remains to estimate | I | . We have
I = ⋃_{t ∈ B} ( [h(t), l(t)] + [−‖h − l‖, ‖h − l‖] ) = ⋃_α ⋃_{t ∈ L_α} ( [h(t), l(t)] + [−‖h − l‖, ‖h − l‖] )
We now make the following observation:
⋃_{t ∈ L_α} ( [h(t), l(t)] + [−‖h − l‖, ‖h − l‖] ) = [ min_{L_α}{h, l} − ‖h − l‖, max_{L_α}{h, l} + ‖h − l‖ ]
This is a straightforward verification, the only non-trivial observation being that
⋃_{t ∈ L_α} ( [h(t), l(t)] + [−‖h − l‖, ‖h − l‖] )
is connected. We see that the size of this interval is bounded by
max_{L_α}{h, l} − min_{L_α}{h, l} + 2‖h − l‖ ≤ min{ max_{L_α} h − min_{L_α} h, max_{L_α} l − min_{L_α} l } + 2‖h − l‖ + 2‖h − l‖ ≤ |L_α| max_{L_α}{ |h′|, |l′| } + 4‖h − l‖
and so
|I| ≤ Σ_α [ |L_α| max_{L_α}{ |h′|, |l′| } + 4‖h − l‖ ]
This means
| G^c | ≤ ( k_h + k_l + 2 ) |I| [ min_{B^c}{ |h′|, |l′| } ]^{-1} + |B| ≤ ( k_h + k_l + 2 ) [ 4|A| ‖h − l‖ + Σ_α max_{L_α}{ |h′|, |l′| } |L_α| ] [ min_{B^c}{ |h′|, |l′| } ]^{-1} + |B|
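The change-of-variables step used at the start of this proof (measuring λ^{-1}(A) via ∫_A |(λ^{-1})′|) can be illustrated numerically; the monotone function below is a hypothetical example, not one appearing in the paper.

# For λ(t) = t^2 on [0, 1] and A = [0.25, 1], λ^{-1}(A) = [0.5, 1] has length 0.5,
# and the same value is recovered as ∫_A |(λ^{-1})'(u)| du with (λ^{-1})'(u) = 1/(2 sqrt(u)).
N = 100_000
total = 0.0
for i in range(N):
    u = 0.25 + 0.75 * (i + 0.5) / N          # midpoint of the i-th subinterval of A
    total += (1.0 / (2.0 * u ** 0.5)) * (0.75 / N)
print(total)  # ≈ 0.5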

Proof of Proposition 26 

Proposition 26 states: Suppose that h, l ∈ E(τ) and h ∼ l in E(τ). By Lemma 15, let their critical points be t_1 < ⋯ < t_k and s_1 < ⋯ < s_k respectively. Suppose we have a set B such that
B ⊇ {0} ∪ {1} ∪ ⋃_i [t_i, s_i]
and define G = G_{h,l}(B) as in Definition 16. Then
∫_G | H_{p_h h}(t) − H_{p_l l}(t) | dt ≤ (k + 1)(k + 2) [ 1 + | log( min_{B^c}{ |h′|, |l′| } / ( (k + 1) max{‖h′‖, ‖l′‖} ) ) | ] ( max{‖h′‖, ‖l′‖} ) [ min_{B^c}{ |h′|, |l′| } ]^{-2} × [ ‖h′ − l′‖ + 2‖h − l‖ [ min_{B^c}{ |h′|, |l′| } ]^{-1} ( ‖h″‖ + ‖l″‖ )/2 ]
Let us abbreviate
A_1 = ∫_G ( H_{p_h h}(t) − H_{p_l l}(t) ) dt
We have
A_1 = ∫_G Σ_{i=0}^{k} [ ( ρ_{h_i}(h(t))/ρ_h(h(t)) ) log( ρ_{h_i}(h(t))/ρ_h(h(t)) ) − ( ρ_{l_i}(l(t))/ρ_l(l(t)) ) log( ρ_{l_i}(l(t))/ρ_l(l(t)) ) ] dt
Lemma 17 shows that, for t ∈ G, we have [ h(t) ∈ Range(h_i) ] ⟺ [ l(t) ∈ Range(l_i) ], and therefore
A_1 = Σ_{i=0}^{k} ∫_{t ∈ G : h(t) ∈ Range(h_i)} [ ( ρ_{h_i}(h(t))/ρ_h(h(t)) ) log( ρ_{h_i}(h(t))/ρ_h(h(t)) ) − ( ρ_{l_i}(l(t))/ρ_l(l(t)) ) log( ρ_{l_i}(l(t))/ρ_l(l(t)) ) ] dt
Then, estimating | x_2 log x_2 − x_1 log x_1 | ≤ [ 1 + max{ |log x_2|, |log x_1| } ] | x_2 − x_1 |, we see that
| A_1 | ≤ Σ_{i=0}^{k} ∫_{t ∈ G : h(t) ∈ Range(h_i)} [ 1 + max{ | log( ρ_{h_i}(h(t))/ρ_h(h(t)) ) |, | log( ρ_{l_i}(l(t))/ρ_l(l(t)) ) | } ] · | ρ_{h_i}(h(t))/ρ_h(h(t)) − ρ_{l_i}(l(t))/ρ_l(l(t)) | dt
We observe that
min_{t ∈ [0, 1]} { ρ_{h_i}(h(t)), ρ_{l_i}(l(t)) } ≥ [ max{‖h′‖, ‖l′‖} ]^{-1}
Also note that, for t ∈ G, Lemma 18 means that, for all i, h_i^{-1}(h(t)), l_i^{-1}(l(t)) ∉ B, and therefore we have
max_{t ∈ G} { ρ_h(h(t)), ρ_l(l(t)) } ≤ ( k + 1 ) [ min_{B^c}{ |h′|, |l′| } ]^{-1}
This means
max_{i, t ∈ G} max{ | log( ρ_{h_i}(h(t))/ρ_h(h(t)) ) |, | log( ρ_{l_i}(l(t))/ρ_l(l(t)) ) | } ≤ | log( min_{B^c}{ |h′|, |l′| } / ( ( k + 1 ) max{‖h′‖, ‖l′‖} ) ) |
Abbreviating
A_2 = ρ_{h_i}(h(t))/ρ_h(h(t)) − ρ_{l_i}(l(t))/ρ_l(l(t)),   t ∈ G, h(t) ∈ Range(h_i)
we have
| A_1 | ≤ Σ_{i=0}^{k} ∫_{t ∈ G : h(t) ∈ Range(h_i)} [ 1 + | log( min_{B^c}{ |h′|, |l′| } / ( ( k + 1 ) max{‖h′‖, ‖l′‖} ) ) | ] | A_2 | dt
By Lemma 28,
| A_2 | ≤ ( k + 2 ) [ max{‖h′‖, ‖l′‖} ] max_{t ∈ G, j : h(t) ∈ Range(h_j)} | ρ_{h_j}(h(t)) − ρ_{l_j}(l(t)) |
Then, using the observations
x, y > 0 ⟹ | 1/x − 1/y | ≤ [ min{x, y} ]^{-2} | x − y |,   | |a| − |b| | ≤ | a − b |
and noting that t ∈ G, h(t) ∈ Range(h_j) ⟹ h_j^{-1}(h(t)), l_j^{-1}(l(t)) ∉ B, we have
| A_2 | ≤ ( k + 2 ) [ max{‖h′‖, ‖l′‖} ] max_{t ∈ G, j : h(t) ∈ Range(h_j)} | 1/h′( h_j^{-1}(h(t)) ) − 1/l′( l_j^{-1}(l(t)) ) |
 ≤ ( k + 2 ) [ max{‖h′‖, ‖l′‖} ] [ min_{t ∉ B} { |h′(t)|, |l′(t)| } ]^{-2} max_{t ∈ G, j : h(t) ∈ Range(h_j)} | h′( h_j^{-1}(h(t)) ) − l′( l_j^{-1}(l(t)) ) |
 = ( k + 2 ) ( max{‖h′‖, ‖l′‖} ) [ min_{B^c}{ |h′|, |l′| } ]^{-2} max_{t ∈ G, j : h(t) ∈ Range(h_j)} | h′( h_j^{-1}(h(t)) ) − l′( l_j^{-1}(l(t)) ) |
Using the triangle inequality, we have
| h′( h_j^{-1}(h(t)) ) − l′( l_j^{-1}(l(t)) ) | ≤ ‖h′ − l′‖ + ( ( ‖h″‖ + ‖l″‖ )/2 ) | h_j^{-1}(h(t)) − l_j^{-1}(l(t)) |
Abbreviating
A_3 = h_j^{-1}(h(t)) − l_j^{-1}(l(t)),   t ∈ G, h(t) ∈ Range(h_j),
we have
| A_2 | ≤ ( k + 2 ) ( max{‖h′‖, ‖l′‖} ) [ min_{B^c}{ |h′|, |l′| } ]^{-2} max_{t ∈ G, j : h(t) ∈ Range(h_j)} [ ‖h′ − l′‖ + ( ( ‖h″‖ + ‖l″‖ )/2 ) | A_3 | ]
Next, recalling Lemma 17 (which verifies that the expressions in the first inequality below are well-defined), we have:
| A_3 | ≤ | h_j^{-1}(h(t)) − h_j^{-1}(l(t)) | + | h_j^{-1}(l(t)) − l_j^{-1}(l(t)) | ≤ [ max_{t ∈ G, j : h(t) ∈ Range(h_j)} max_{u ∈ [h(t), l(t)]} | (h_j^{-1})′(u) | ] ‖h − l‖ + max_{t ∈ G} max_{u ∈ [h(t), l(t)]} | h_j^{-1}(u) − l_j^{-1}(u) |
We see that
max_{t ∈ G, j : h(t) ∈ Range(h_j)} max_{u ∈ [h(t), l(t)]} | (h_j^{-1})′(u) | = max_{t ∈ G, j : h(t) ∈ Range(h_j)} max_{u ∈ [h(t), l(t)]} | h′( h_j^{-1}(u) ) |^{-1} = [ min_{t ∈ G, j : h(t) ∈ Range(h_j)} min_{u ∈ [h(t), l(t)]} | h′( h_j^{-1}(u) ) | ]^{-1}
Now, we can apply Lemma 18 to conclude that this is
≤ [ min_{B^c}{ |h′|, |l′| } ]^{-1}
Abbreviating
A_4 = h_j^{-1}(u) − l_j^{-1}(u),   t ∈ G, u ∈ [h(t), l(t)]
we have
| A_3 | ≤ [ min_{B^c}{ |h′|, |l′| } ]^{-1} ‖h − l‖ + max_{t ∈ G} max_{u ∈ [h(t), l(t)]} | A_4 |
Turning to A_4, we observe that
∫_{[h_j^{-1}(u), l_j^{-1}(u)]} |h′| ≥ [ min_{[h_j^{-1}(u), l_j^{-1}(u)]} |h′| ] · | h_j^{-1}(u) − l_j^{-1}(u) |
By Lemma 19, since t ∈ G, u ∈ [h(t), l(t)], we have
[ h_j^{-1}(u), l_j^{-1}(u) ] ⊆ B^c
and therefore
∫_{[h_j^{-1}(u), l_j^{-1}(u)]} |h′| ≥ [ min_{t ∉ B} { |h′(t)|, |l′(t)| } ] | h_j^{-1}(u) − l_j^{-1}(u) | = [ min_{B^c}{ |h′|, |l′| } ] | h_j^{-1}(u) − l_j^{-1}(u) |
Lemma 19 also tells us that h′ does not change sign on [ h_j^{-1}(u), l_j^{-1}(u) ], which means
∫_{[h_j^{-1}(u), l_j^{-1}(u)]} |h′| = | ∫_{[h_j^{-1}(u), l_j^{-1}(u)]} h′ | = | h( h_j^{-1}(u) ) − h( l_j^{-1}(u) ) | = | u − h( l_j^{-1}(u) ) | = | l( l_j^{-1}(u) ) − h( l_j^{-1}(u) ) | ≤ ‖h − l‖
Therefore,
| A_4 | = | h_j^{-1}(u) − l_j^{-1}(u) | ≤ [ min_{B^c}{ |h′|, |l′| } ]^{-1} ∫_{[h_j^{-1}(u), l_j^{-1}(u)]} |h′| ≤ [ min_{B^c}{ |h′|, |l′| } ]^{-1} ‖h − l‖
Combining equations (27), (28), (29), (30), (31), (32), (33), (34), the proposition is proved.
