Estimation of an Entropy-based Functional

Given a function f from [0, 1] to the real line, we consider the (nonlinear) functional h obtained by evaluating the continuous entropy of the “density function” of f. Motivated by an application in signal processing, we wish to estimate h(f). Our main tool is a decomposition of h into two terms, each of which has favorable scaling properties. We show that, if functions f and g satisfy a regularity condition, then the smallness of ∥f − g∥∞ and ∥f′ − g′∥∞, along with some basic control on derivatives of f and g, is sufficient to imply that h(f) and h(g) are close.


Introduction
We define the continuous entropy of a probability density ρ to be $h(\rho) = -\int_{\mathbb{R}} \rho(u) \log \rho(u)\,du$ (where the base of the logarithm is e = 2.71828...). Stripped of its interpretation, continuous entropy is just a particular way of measuring how "spread out" a probability density is. In this paper, we are motivated by a novel application of this measurement in signal processing.
An ultrasound probe generates a short acoustic pulse which travels through the medium of interest (in medical applications, this medium is the tissue of the patient). As the pulse travels through the medium, features within the medium cause some of the pulse to be reflected back towards the probe, and the strength of the reflection contains information about the feature at that location. The signal of interest is the intensity of this reflected pulse that arrives back at the probe over time. This signal can be divided into many short windows of time; after re-scaling the time axis, one of these time windows can be represented by the interval [0, 1], and the signal over that time window can be represented by a real-valued function f on the interval [0, 1]. So, f(t) represents the intensity of the reflected pulse that arrives back at the probe at the time t ∈ [0, 1].
The ultrasound probe can only measure f(t) at a discrete set of values of t, and this measurement can be corrupted in various ways. Therefore, one step in the processing of an ultrasound signal is to reconstruct f from the measurements of the ultrasound probe. Then, a functional is applied to f to obtain a single number, the idea being that this number contains all the relevant information about the reflected signal over that particular window in time. A standard functional in the industry is the "energy" in the signal, $\int_0^1 |f|^2$, or (more often) its logarithm. However, there is a series of papers by Hughes and others ([1-7]) demonstrating the utility of using the continuous entropy of the "density function" of f as the functional, instead of the log-energy.
Suppose that f is a real-valued function on [0, 1]. By "density function" of f, we mean the following: Define the measure µ_f on R by µ_f(E) = |f^{-1}(E)|, where |A| denotes the Lebesgue measure of the set A ⊂ [0, 1]. Suppose that µ_f is given by the density ρ_f, i.e., $\mu_f(E) = \int_E \rho_f(u)\,du$ for measurable E ⊂ R. This density ρ_f is the "density function" of f. We can think of [0, 1] as a probability space and f as a function on that probability space (i.e., a random variable), and then ρ_f is literally the density of the "random variable" f.
Another way to think of ρ_f is as follows: Suppose that we consider the uniform distribution U(0, 1). Suppose we choose a point t at random from [0, 1] according to U(0, 1), and we calculate f(t). The density ρ_f gives the relative frequency of different outcomes of this experiment. Also, intuitively we can say that $\rho_f(u) = \sum_{t \,:\, f(t) = u} \frac{1}{|f'(t)|}$ (although we will not attempt to formalize this definition for general f). Let us abuse notation and write $h(f) = h(\rho_f)$. The functional of interest is f → h(f).
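As an illustration of the experiment just described, the following minimal sketch takes the hypothetical example f(t) = t^2 and compares two numbers: the fraction of a uniform grid of points t that f maps into an interval [a, b], and the integral over [a, b] of the density 1/(2√u). That density is what the sum of 1/|f′(t)| over the preimages of u gives when f has the single preimage t = √u. The function names, the interval, and the grid sizes are our own choices for the illustration.

```python
import math

def f(t):
    # Hypothetical example function on [0, 1].
    return t * t

def pushforward_mass(a, b, n=200_000):
    """Fraction of a uniform midpoint grid on [0, 1] that f maps into [a, b]."""
    hits = 0
    for k in range(n):
        t = (k + 0.5) / n
        if a <= f(t) <= b:
            hits += 1
    return hits / n

def rho_f(u):
    """Density from the preimage sum: one preimage t = sqrt(u), with f'(t) = 2t."""
    return 1.0 / (2.0 * math.sqrt(u))

def mass_from_density(a, b, m=10_000):
    """Midpoint-rule integral of rho_f over [a, b]."""
    step = (b - a) / m
    return sum(rho_f(a + (k + 0.5) * step) for k in range(m)) * step

a, b = 0.04, 0.25
empirical = pushforward_mass(a, b)
from_density = mass_from_density(a, b)
closed_form = math.sqrt(b) - math.sqrt(a)  # exact integral of rho_f over [a, b]
print(empirical, from_density, closed_form)
```

The three printed values should agree up to the grid resolution, which is the sense in which ρ_f records the relative frequency of the outcomes f(t).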
The effectiveness of using the h functional has been shown in several different settings in ultrasound signal processing, ranging from defect detection in plexiglass [1] to detection of contrast agents in the tissue of live animals ([5,7]). In some situations, the h functional produced images of objects which, because the material was engineered by the experimenter, were known to be present but were not detected by the log-energy functional.
The interpretation of what is being measured by h(f) in this context is not yet clear. However, given the utility of the technique, we believe the area merits investigation. We would like to answer the question "How well can we estimate h(f)?" In a real-world setting we will perhaps receive some samples of f corrupted by noise, and we would like to estimate h(f) (as we described, the processing of an ultrasound signal is an example of this). Therefore, we seek a solution to the following problem, where the "regularity conditions" and the form of the "partial information" are not specified:

Estimation Problem 1. Let f be a function on [0, 1] that satisfies some regularity conditions, and suppose we are given some "partial information" on f, such as samples corrupted by noise. We wish to estimate h(f) with an estimator ĥ, and give a quantitative bound on the error |h(f) − ĥ|, such that the bound is invariant under the scaling f → λf for λ ≠ 0.
We seek this scale invariance because, if ĥ is any reasonable estimator of h(f), the true difference h(f) − ĥ does not change under the scaling f → λf for λ ≠ 0, so we do not want our bound on that difference to change under this scaling either. (This scaling issue is discussed further in Section 5.)
Our main result will not give a specific prescription for solving this problem. Instead, we reduce it to a much more tractable problem, namely the problem of taking noisy samples from f and producing an approximating function g which is "close" to f in some quantifiable way.
However, our concern at the moment is to relate problem 1 to the existing literature, and we notice that problem 1 is similar to the standard entropy estimation problem (for an overview of this problem, see Beirlant et al. [8]):

Estimation Problem 2 (Entropy Estimation). Let X be a random variable with unknown density function ρ. We make N independent observations of X. From these, we wish to estimate the continuous entropy of ρ.
There is a large literature concerning the entropy estimation problem, so one might hope to solve problem 1 by applying some of the standard methods in the field to the density ρ_f. However, this is not what we will do. Since we are departing from traditional methods, we would like to provide some indication that new methods may in fact be necessary; before working to develop new methods to solve a problem, one would like to know whether the problem is obviously solved with readily available standard methods. In Section 2, we review the current literature with regard to how it applies to densities such as ρ_f. This review is not comprehensive; the purpose is only to show that the density ρ_f presents some difficulty for current methods of estimation of h(ρ).
Some of the difficulties that we find, stated in general terms, are the following. First, the methods we have seen often involve intermediate parameters that truncate the tails of integrals, truncate the unbounded pieces of ρ, or perform other types of operations that enable a splitting into "good" and "bad" regions. In view of the scale-invariance that we seek, these intermediate parameters will all become involved and their rates of growth will need to be related in a way that produces the desired result and remains scale-invariant. This task, for a density such as ρ_f, seems to be non-trivial. Second, many of the results which might apply to ρ_f use methods which are non-quantitative: theorems from measure theory such as Lebesgue Differentiation, the Lebesgue Dominated Convergence Theorem, Egoroff's Theorem, and the Borel-Cantelli Lemma. Replacing these theorems with quantitative estimates will introduce more difficulties. Finally, and perhaps most importantly from an intuitive point of view, there seems to be a general progression from "easy" to "hard" as the assumptions on ρ become less stringent, moving from differentiable to continuous to bounded to L^2, and so on. The density ρ_f seems to fall on the "hard" end of this continuum. This does not correspond to the progression of difficulty that a practitioner of signal processing would expect to see when they attempt to solve problem 1. Even an extremely well behaved function (such as f(t) = sin(2πt)) will produce a density ρ_f that will be ranked as "badly behaved." A signal processor would expect that the problem would be "easy" for well behaved functions f, and get harder as the function f becomes more badly behaved.
In short, the readily available methods that we are aware of do not seem appropriate for the problem we are presented with. Fortunately, we believe that estimating h(f) is easier than solving the entropy estimation problem for ρ_f, for the simple reason that we are able to take advantage of the "good behavior" of f in the time domain, instead of only having access to the "bad behavior" of the density ρ_f. For example, suppose that in a real-world setting we receive sampled values of f at specific values of t, say {t_j} (not to be confused with the values of t where f actually has critical points, which later will be referred to as {t_i}). Neglecting noise (purely for the simplicity of this explanation), we receive the data {(t_j, f(t_j))}. We intend to use all the information available to us, i.e. the time series {(t_j, f(t_j))}, not just the (unordered) values {f(t_j)}. For instance, we may want to estimate f′ or f′′, which will certainly make use of the indicated time series. We will not attempt to apply a standard entropy estimation technique directly to the values {f(t_j)}.
One strategy would be to first construct an approximating function g (using the observations {(t_j, f(t_j))}), and then calculate h(g) and use this as our estimate of h(f). In Hughes et al. [7], this method is applied to real-time imaging calculations in a laboratory setting; instead of using just the values {f(t_j)} directly, they first construct an approximating function g and then calculate the desired entropy value using g. In [7], they are estimating a variant of the Rényi entropy of ρ_f, not h(f) as we have defined it, but the same authors have demonstrated the utility of h(f). Incidentally, the quantity that is eventually computed in [7] is expressed in terms of the critical points {s_i} of g, and the authors comment [7] that "while this involves use of the second derivative of f(t) at its critical points, which can be expected to increase noise in the processing chain output, surprisingly the resulting signal processing scheme does not sacrifice sensitivity." This gives some assurance that our result, which requires control on multiple derivatives of f, is not without practical application.
Our goal in this paper is to estimate the difference h(f ) − h(g) in terms of quantities involving f and g (not ρ f or ρ g ).As we mentioned, this reduces the problem at hand, problem 1, to a much more tractable question of function approximation given noisy data.We note, for completeness, that a specific prescription for solving problem 1, using methods similar to our methods here, was given in Maurizi [9], but we believe that the approach we present now will be clearer than the approach given in [9].
As mentioned, in Section 2 we review the current literature with regard to how it applies to densities such as ρ_f. After some definitions and notation in Section 3, we highlight a useful identity in Section 4, which may be of independent interest, and we discuss some issues regarding the scaling f → λf for λ ≠ 0 in Section 5. In Section 6 we present a proposition showing why the regularity assumption (Definition 6) we introduce is necessary at least in some form, if not the specific form given here. Our regularity assumption on f and g, defined precisely in Section 7, essentially prevents f and g from becoming "too flat" by preventing their first and second derivatives from simultaneously vanishing; this is quantified with a parameter δ.
Section 7 contains our main result, Theorem 1, which states that if f and g satisfy the regularity assumption with parameter δ, then the difference h(f) − h(g) can be bounded in terms of ∥f − g∥∞, ∥f′ − g′∥∞, and the value δ, along with factors that control the overall sizes of f and g and their derivatives.
The proof of the main result is outlined in Section 8 and carried out in the subsequent sections.

Background
We now look to the current literature, seeking a method to estimate h(f) with quantitative bounds on the error, in a manner that is invariant under the scaling f → λf for λ ≠ 0.
The first sign that new methods may be necessary is that, from the viewpoint of entropy estimation, the densities ρ_f are very badly behaved. A non-degenerate critical point in f will induce an asymptote of the form 1/√x in ρ_f, so we are in general dealing with density functions that are unbounded, not L^2, and have discontinuities in the interior of their support. These properties, and other features of the density ρ_f, are discussed in [6,7].
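The 1/√x behavior can be seen in a small worked example. The sketch below, which uses the hypothetical function f(t) = sin(πt) (our choice, with a non-degenerate critical point at t = 1/2), computes the exact measure of the set where f exceeds 1 − ε. If ρ_f behaves like a constant times (1 − u)^{-1/2} near u = 1, then this measure should behave like a constant times √ε, and the ratio printed below should settle near 2√2/π.

```python
import math

def mass_near_max(eps):
    """Lebesgue measure of {t in [0, 1] : sin(pi t) >= 1 - eps}."""
    # sin(pi t) >= 1 - eps  <=>  |t - 1/2| <= acos(1 - eps) / pi
    return 2.0 * math.acos(1.0 - eps) / math.pi

# Limiting value of mass / sqrt(eps), from acos(1 - eps) ~ sqrt(2 * eps).
limit = 2.0 * math.sqrt(2.0) / math.pi

ratios = {}
for eps in (1e-2, 1e-4, 1e-6):
    ratios[eps] = mass_near_max(eps) / math.sqrt(eps)
    print(eps, ratios[eps], limit)
```

A mass of order √ε on an interval of length ε corresponds to a density of order (1 − u)^{-1/2}, whose square is not integrable near u = 1; this is the sense in which such a density is unbounded and not in L^2.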
Many of the standard results in entropy estimation do not apply to such badly behaved densities. Several theorems in entropy estimation (such as Goria et al. [10], van Es [11], Joe [12], Levit [13], and more recently Leonenko et al. [14]) use the assumption that ρ is bounded. In Tsybakov and van der Meulen [15] and Eggermont and LaRiccia [16], the assumption that ρ is twice differentiable was needed. In Dmitriev and Tarasenko [17] and Ahmad and Lin [18], ρ was assumed to have a bounded derivative. In Hall and Morton [19], ρ was assumed to have a continuous derivative. In Mokkadem [20], a "distributional" equivalent of ρ′′ ∈ L^1 was required.
There are methods of solving problem 2 that could shed light on densities such as ρ f .In Kozachenko and Leonenko [21], Vasicek [22], and Györfi and van der Meulen [23], methods are developed which solve the entropy estimation problem 2 defined above and impose only mild restrictions on ρ.
In Kozachenko and Leonenko [21], they consider densities on R m , for simplicity we will consider the case m = 1.Note that they use different notation; we will continue to use ρ to stand for the density function, whereas they use f .A nearest-neighbor estimator is used, and only fairly mild conditions are imposed on the density ρ: For some ϵ > 0, the following two equations must hold: (these are equations numbered (3) and ( 4) in [21]).Densities such as our ρ f will "typically" satisfy these constraints (for instance, the function f (t) = t 2 on [0, 1] has density ρ f (u) = (1/2)u −1/2 , u ∈ (0, 1] and this density satisfies these constraints).The authors then prove that the estimator h N computed from N independent samples of ρ satisfies E(h N ) → h(ρ) as N → ∞.We would like to discuss some features of their proof, so for convenience of our reader we attempt to sketch it here: They first show that Due to their dependence on N , the ζ i could also be written ζ i,N .The ζ i are identically distributed random variables, so E(h N ) = E(ζ i ) and we focus just on ζ i .They consider the cumulative distribution function F N,x (u) of e ζ i conditioned on the case when the i-th sample equals x, i.e., X i = x.Let ν(y, r) = {x ∈ R : |x − y| < r}; one computes that ρ(y)dy where log γ = .5772 . . . is the Euler constant (this is their equation ( 8) ).We have

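$$(N-1)\int_{\nu\left(x,\,\frac{u}{2\gamma(N-1)}\right)} \rho(y)\,dy \;=\; \frac{u}{\gamma}\cdot\frac{1}{\left|\nu\left(x,\,\frac{u}{2\gamma(N-1)}\right)\right|}\int_{\nu\left(x,\,\frac{u}{2\gamma(N-1)}\right)} \rho(y)\,dy.$$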
This means that, by the Lebesgue Differentiation Theorem, since ν(x, u 2γ(N −1) ) shrinks to x as N → ∞, we have and so F N,x (u) → 1 − e −ρ(x)u/γ (this is their equation ( 8) ).They define Let the random variable ξ N,x have the cumulative distribution function F N,x and the random variable ξ x have the cumulative distribution function F x .We can compute and so one might hope that, since ).The bulk of their proof is to show that this is in fact true.The proof is then completed by taking the pointwise result and extending it to a convergence of integrals, This is done in their equation (21).
There are several rates of convergence that would have to be quantified: First, a quantitative bound on would be needed.Next, we would need to quantify how the rate of convergence of F N,x → F x translates into the convergence E(log ξ N,x ) → E(log ξ x ).This rate would need to be somehow uniform in x; their constants C 1 (from their equation ( 20) ) and therefore C 2 (at the bottom of [21, p. 99]) depend on x in an unspecified way.Some sort of uniformity is needed because the convergence E(log ξ N,x ) → E(log ξ x ) must be translated into the convergence Quantifying these various rates of convergence appears to be nontrivial.We also note that subsequent research which has worked with estimators analogous to the one used in [21], such as [15], [10], and [14], has required more assumptions on the density in question.
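To make the object of this discussion concrete, the following is a minimal sketch of a one-dimensional nearest-neighbor estimator. We assume, based on the ball radius u/(2γ(N − 1)) quoted above, that the estimator is the average of ζ_i = log(2γ(N − 1)d_i), where d_i is the distance from the i-th sample to its nearest neighbor; this form is our reading and is not copied from [21]. The test density is that of the hypothetical function f(t) = sin(πt), whose entropy is log(π/4) by a direct computation. The seed and sample size are arbitrary.

```python
import math
import random

EULER = 0.5772156649015329  # log(gamma) in the notation used above

def nearest_neighbor_entropy(xs):
    """Average of zeta_i = log(2 * gamma * (N - 1) * d_i) over the sample."""
    xs = sorted(xs)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        left = x - xs[i - 1] if i > 0 else math.inf
        right = xs[i + 1] - x if i < n - 1 else math.inf
        d = min(left, right)  # nearest-neighbor distance of the i-th point
        total += math.log(2.0 * (n - 1) * d) + EULER
    return total / n

rng = random.Random(0)
sample = [math.sin(math.pi * rng.random()) for _ in range(10_000)]
estimate = nearest_neighbor_entropy(sample)
exact = math.log(math.pi / 4.0)  # entropy of the density of sin(pi t)
print(estimate, exact)
```

The sketch only shows that the estimator can be evaluated on such a density. It says nothing about the rate at which the estimate approaches the exact value, which is the issue raised above.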
In Györfi and van der Meulen [23], histogram-based density estimators ρ_N (computed from the sampled values {X_i}) are used. The stated results make no assumptions on ρ other than the finiteness of h(ρ). They do not give a quantitative rate of convergence, and non-quantitative methods in the proof (such as the Borel-Cantelli Lemma) suggest that a quantitative result would not necessarily follow trivially from their methods. Also, choosing the proper bin size and grid placement for a histogram is a continual problem. For unbounded densities such as ρ_f, any bin size which can capture the very tight grouping of sample points near an asymptote of ρ_f will then be too narrow and cause a "choppy" estimate in areas where ρ_f is not particularly large. One way of addressing this general drawback of histograms has been to adjust the "coarseness" of the estimate depending on how tightly the sample points are grouped, as in nearest-neighbor or sample-spacing-type estimators. We could also accept inefficiency near the critical points, but since we seek quantitative bounds this is not desirable. In practice, we believe that estimating the entropy of a density such as ρ_f with a histogram estimator will mean very large inefficiencies near the critical points.
In Vasicek [22], no assumptions are made on ρ other than ∫ u 2 ρ(u) < ∞, which is satisfied by ρ f .Let {X (i) } be the order statistics of the sample {X i }, and let F be the cumulative distribution function of ρ.They use a sample-spacing estimator with a parameter m specifying the number of "neighbors" that will be considered when estimating the value of the density ρ near X (i) .Their conclusion is that their estimator h ′ m,N converges in probability to h(ρ) as N → ∞, as long as m → ∞, m/N → 0. One key question, of course, is how m is chosen.They state [22] that "an optimal choice of m for a given [N ], however, depends on the (unknown) [ρ].In general, the smoother the density [ρ], the larger is the optimal value of m." Therefore, one would expect that we would need to choose relatively small values of m since ρ f is very far from smooth, while still having m → ∞.A key step in the proof is that ), whenever ρ is positive and continuous over the interval (X (i−m) , X (i+m) ).This step is essentially using the Lebesgue Differentiation Theorem, so the accuracy of 1 simultaneously over the entire domain of ρ, will need to be quantified.The tradeoff between needing m to be small in order to capture the "fine-scale" behavior of ρ f near an asymptote, while needing m to be large in order to carry through the rest of the proof, appears non-trivial.A discussion of results which use a fixed m, versus m → ∞ as in [22], is in Tsybakov and van der Meulen [15].We note also that results using similar techniques, which have achieved conclusions stronger than the "convergence in probability" shown here (such as van Es [11]) have required more restrictions on ρ.
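The role of the parameter m can be explored with a minimal sketch of a sample-spacing estimator. We assume the commonly stated form of the estimator, the average over i of log((N/(2m))(X_(i+m) − X_(i−m))), with order statistics beyond the ends of the sample replaced by the smallest or largest observation; the reader should consult [22] for the authoritative definition. The test density is again that of the hypothetical function f(t) = sin(πt), with entropy log(π/4), and the values of m and the seed are arbitrary.

```python
import math
import random

def spacing_entropy(xs, m):
    """Mean of log((N / (2m)) * (X_(i+m) - X_(i-m))) over the order statistics."""
    xs = sorted(xs)
    n = len(xs)
    total = 0.0
    for i in range(n):
        upper = xs[min(i + m, n - 1)]  # X_(i+m), clamped to the largest value
        lower = xs[max(i - m, 0)]      # X_(i-m), clamped to the smallest value
        total += math.log(n / (2.0 * m) * (upper - lower))
    return total / n

rng = random.Random(0)
sample = [math.sin(math.pi * rng.random()) for _ in range(10_000)]
exact = math.log(math.pi / 4.0)  # entropy of the density of sin(pi t)

estimates = {m: spacing_entropy(sample, m) for m in (5, 50, 500)}
for m, value in estimates.items():
    print(m, value, value - exact)
```

Running the sketch for several values of m gives a feel for the tradeoff discussed above, although a single sample cannot substitute for a quantitative bound.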
There are other results, such as those in Godavarti and Hero [24], Csiszár [25], and Rényi [26], which do not specifically solve problem 2, but which give insight into how one might produce a value that is "close" to h(ρ) given only some information about ρ.These results are not "stochastic"; the approximations for h(ρ) that will be considered are not constructed by sampling from ρ.
In Csiszár [25], the only assumption is that the entropy integral exists, there are no other assumptions on the density ρ.Their results are explained in terms of a general measurable space X, and in their presentation the density ρ is not the central object, it simply arises as the Radon-Nikodym derivative of a probability measure on X (in their notation this probability measure is µ), with respect to a σ-finite measure on X (in their notation this σ-finite measure is λ).We will assume that the measurable space is R, the σ-finite measure is Lebesgue measure on R, and the probability measure is given by the density ρ.In their Theorem 1, they show that h(ρ) equals the infimum value of a set of "approximate entropies," where the density ρ is replaced with its average value on each set in a partition {A i } of R, and then the entropy of the resulting "approximate density" is calculated.If we define the characteristic function of a set S, χ S , by then the "approximate density" is: and the "approximate entropy" is: They prove this result by choosing a specific set of approximate entropy values h ϵ , where in fact h ϵ is easily seen to be within ϵ of the true value h(ρ).This is certainly quantitative in the sense that the approximating value, h ϵ , is known to be within a certain explicit amount of the true value.The question then becomes whether we can calculate the value h ϵ .The approximating values h ϵ are obtained, essentially, by the definition of the Lebesgue integral of log ρ (with respect to the measure given by ρ) as a limit of integrals of simple functions (for this definition, see for example [27].For a probability measure µ on R and a function g ∈ L 1 (µ), if we choose ϵ and we let In [25], log ρ plays the role of g, the measure µ is given by ρ, and we have In order to calculate the value of ρ However, we do not know ρ, so we do not know the exact locations on R when log ρ is going to fall in a certain range.Of course, as we mentioned, the purpose of the h ϵ 
construction is an existence proof, not necessarily as a calculational or estimation tool.
We turn now to Rényi [26].Rényi considers density functions ρ, the only requirement on ρ being that the discrete entropy of the "integer compression" of ρ is finite, by which we mean (suppose X is a random variable having density ρ): This is not equivalent to the existence of the entropy integral, as he shows in [26].This admits a wide class of densities, including ρ f for the functions f we will consider.The key tool will be to consider successively finer step-function approximations of ρ, the partition set being the grid with "step size" is essentially the "true" histogram with the grid {j/N } ∞ j=−∞ (an example of the approximations ρ {A i } considered in Csiszár [25]).Rényi proves that, if the "integer compression" of ρ has finite discrete entropy, then One sign that this result will not be optimal for a density such as ρ f is that Rényi first proves this convergence for bounded ρ.He introduces an intermediate parameter L (which specifies the point at which the tails of the sums and integrals will be cut off), proves the pointwise convergence result (see his equation ( 35) ) and then applies the "Theorem of Lebesgue" [26] (we believe he is referring to the Dominated Convergence Theorem) over the interval The tails of the integral are bounded using our numbered equation ( 3), which completes the proof in the case of bounded ρ.For unbounded ρ he introduces another intermediate parameter A and considers the truncated function ρ A (which is bounded): He applies the result for bounded ρ to ρ A , and then he bounds the difference using expressions involving A and L (but, importantly, not involving N , see his equations ( 48), ( 50) and ( 53) ).Therefore, his equations ( 56), ( 57), ( 58) and (59) show how the parameters L, A and N interact to produce the result.To make this result quantitative would entail balancing these various rates of convergence, which seems to be non-trivial.Finally, being essentially a histogram-type method (the bin sizes depend only on N ), 
it would face the difficulties that come along with histograms that we covered in our discussion of Györfi and van der Meulen [23].
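The step-function approximations discussed above can be evaluated exactly when the cumulative distribution function is known. The sketch below uses the hypothetical function f(t) = sin(πt), for which the measure of {t : f(t) ≤ u} is (2/π) arcsin(u) and h(f) = log(π/4). For the grid {j/N} it computes the entropy of the density obtained by averaging ρ_f over each bin, namely −Σ_j P_j log(N P_j), where P_j is the mass of the j-th bin. By Jensen's inequality this quantity is never smaller than h(f), and it does not increase when the grid is refined.

```python
import math

def cdf(u):
    """Measure of {t in [0, 1] : sin(pi t) <= u} for u in [0, 1]."""
    return 2.0 * math.asin(u) / math.pi

def grid_entropy(n):
    """Entropy of the density averaged over the bins [j/n, (j+1)/n]."""
    total = 0.0
    for j in range(n):
        p = cdf((j + 1) / n) - cdf(j / n)  # mass of the j-th bin
        total -= p * math.log(n * p)       # averaged density on the bin is n * p
    return total

exact = math.log(math.pi / 4.0)
gaps = [grid_entropy(n) - exact for n in (10, 100, 1000, 10_000)]
for n, gap in zip((10, 100, 1000, 10_000), gaps):
    print(n, gap)
```

The printed gaps shrink slowly as N grows. A heuristic calculation suggests that the bin touching the asymptote at u = 1 alone contributes an error of order N^{-1/2}, which is consistent with the remark that a fixed grid is poorly adapted to a density such as ρ_f.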
We turn now to Godavarti and Hero [24].In Theorem 4 of [24], the convergence h(ρ N ) → h(ρ) is proved under fairly general conditions.The only assumptions are that ρ N → ρ pointwise and that there exist a constant L and some κ > 1 such that the following statements are true: and for all N , Densities such as ρ f typically satisfy equation ( 4).Furthermore, if we require f to satisfy the regularity assumption we introduce later (Definition 6), and if an approximation method is used to produce g N given N samples from f , it is reasonable to expect that ρ f and {ρ g N } would satisfy equations ( 4) and ( 5) for some L. The necessary pointwise convergence is assured by any reasonable method of producing g N , and then Theorem 4 of [24] in fact proves the result h(g N ) → h(f ) (although we will not attempt to formalize the above argument).This is essentially what we prove in our Theorem 1, with the key difference that our result is quantitative.This certainly makes one wonder if the result in Theorem 4 of [24] could in fact be easily adapted to obtain a quantitative proof of our Theorem 1, thus not needing our methods.Currently, we believe the answer is no.Or, looked at another way, we believe that a quantitative version of their result for the problem we are examining would end up confronting the same issues we faced and might end up resembling our methods.Since [24] does not present a quantitative result, we look to their proof to see if such a result can be found in their methods.Note that in [24], "N " has a meaning different from how we have been using it, and "i" is the variable which corresponds to our "N ."We will use the notation from [24] for the moment, for the convenience of the reader who might want to refer back to [24] during our short discussion below; we hope confusion can be avoided.The main ingredients of the proof in [24] are several variables (N, K, i) which must become "sufficiently large" and the application of theorems such as Egoroff's Theorem 
and Lebesgue Dominated Convergence.One of the chief advantages of these powerful theorems from measure theory is that they can bootstrap pointwise convergence into stronger conclusions and bypass potentially forbidding quantitative relationships.Also, the "essence of the proof" (as they state in [24]) is to find a set A ϵ such that everything happening on A c ϵ (the "bad" set) is negligible, and the desired convergence does in fact take place on A ϵ .The set A ϵ , by construction, is a set on which ρ and all but finitely many of the ρ i are bounded above by N and bounded away from zero by 1/N .There is a tradeoff between the size of N (larger N means a smaller "bad" set, but less control on the "good" set) and the other parameters K and i.Therefore, to obtain a quantitative result, the rates of growth of these parameters must be balanced, and the use of the measure-theoretic theorems mentioned above must be replaced with concrete estimates; this would appear to be non-trivial.
There is another issue of concern in [24], a scaling issue.As mentioned in problem 1, the scaling f → λf for λ ̸ = 0 should essentially have no effect on the problem.Note that this scaling induces the standard which means in turn that a different value of L (and therefore different values of K and N ) will be needed.Therefore, in order to prove a result similar to our Theorem 1, this scaling issue will need to be resolved as well.

Definitions and Notation
We will use the following notation: For real numbers a, b, we define For subsets S, T ⊂ R, we define As we defined in the Introduction, Definition 2. For a measurable function h : [0, 1] → R, we define the measure µ h on R by We will have occasion to refer to the "monotone pieces" of a function, so we make the following definition.We do not want to consider functions which have critical points at either 0 or 1, for technical reasons and to avoid more cumbersome notation, so we define: Definition 3.For a differentiable function h with a finite set of critical points t 1 < • • • < t k , and with h ′ (0), h ′ (1) ̸ = 0, we define t 0 = 0, t k+1 = 1 and for j = 0, . . ., k we define For any function h, we will abbreviate the domain and range of h:

Definition 4. We denote the domain of h by Dom(h), and the range of h by Range(h).
Recall the definition of the (discrete) Shannon entropy, H, of a finite probability mass distribution Let h be a differentiable function with finitely many critical points and h ′ (0), h ′ (1) ̸ = 0. Suppose that µ h is absolutely continuous.Then we know that the measure given by µ h i (E) = |h −1 i (E)| is absolutely continuous; let its density be ρ h i .For a fixed u in the range of h, note that u is in the range of at least one, possibly several, of the h i .For those i such that u is in the range of h i we see that ρ h i is non-zero, and we know that So, we define the probability mass distribution Note that the number of non-zero entries of ⃗ p h (u) equals the number of i such that u is in the range of h i .

A Useful Identity
Assume that f is a differentiable function with a finite set of critical points t_1 < ⋯ < t_k. The following identity is the centerpiece of our method for proving our main result (a helpful comment [28] led us to this identity): $$h(f) = \int_0^1 \log|f'(t)|\,dt \;-\; \int_0^1 H\big(\vec{p}_f(f(t))\big)\,dt. \tag{7}$$ In the subsequent sections of this article, to simplify the discussion we will sometimes refer to the terms on the right hand side by (L) and (E), corresponding to the "logarithm" term and the "entropy" term.
To prove (7), note the following: Integrating both sides, we have or The identity )dt, and the formula ) are used on the first term to obtain The identity ) dt and so (7) is proved.This identity is equivalent to a standard identity for Shannon entropy.Suppose we have a probability space Ω, a finite set S, and a random variable X : Ω → S. Furthermore, suppose we have an "indicator" function ∆ : Ω −→ N which takes a finite number of values on Ω.We think of ∆ as specifying a partition of Ω, i.e., where H(X|∆) is the conditional entropy of X given ∆ (see [29]).In other words, H(∆) + H(X|∆) would be the Shannon entropy of X if the images of the Ω i under X did not overlap.If they did overlap, this would be an overestimation of the entropy of X, and in general the true value is given by subtracting a term, H(∆|X), which represents the amount of "overlap": This identity follows from the "chain rule" for entropy [29].
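The Shannon entropy identity referred to above, H(X) = H(∆) + H(X|∆) − H(∆|X), can be checked numerically on a small joint distribution. The sketch below uses a hypothetical joint law of (∆, X) in which the images of the two cells of the partition overlap at the value "b"; the conditional entropies are computed directly from conditional probabilities rather than from the chain rule, so the final comparison is a genuine check.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (natural logarithm) of a finite mass distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conditional_entropy(joint, given):
    """H(other | given), where given = 0 conditions on Delta and given = 1 on X."""
    marginal = defaultdict(float)
    for key, p in joint.items():
        marginal[key[given]] += p
    return -sum(p * math.log(p / marginal[key[given]])
                for key, p in joint.items() if p > 0)

# Hypothetical joint law: joint[(delta, x)] = P(Delta = delta, X = x).
joint = {(0, "a"): 0.2, (0, "b"): 0.3, (1, "b"): 0.1, (1, "c"): 0.4}

p_delta = defaultdict(float)
p_x = defaultdict(float)
for (delta, x), p in joint.items():
    p_delta[delta] += p
    p_x[x] += p

lhs = entropy(p_x.values())
rhs = (entropy(p_delta.values())
       + conditional_entropy(joint, given=0)   # H(X | Delta)
       - conditional_entropy(joint, given=1))  # H(Delta | X)
print(lhs, rhs)
```

If the overlap at "b" is removed, H(∆|X) vanishes and the entropy of X is simply H(∆) + H(X|∆), which is the non-overlapping case described in the text.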
In our case, the probability space is [0, 1], the set is R, the random variable is f : [0, 1] → R, and the indicator function on the probability space is the function that assigns to each t the index i of the monotone piece f_i whose domain contains t. The term $\int_0^1 \log|f'(t)|\,dt$ measures the entropy of f as if the different "monotone laps" of f did not overlap. If they do overlap then this is an overestimate, and the true value is given by subtracting the term representing the amount of "overlap": $\int_0^1 H\big(\vec{p}_f(f(t))\big)\,dt$. In this way, (7) is analogous to (8).
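A worked example may help fix ideas. We assume here that identity (7) reads h(f) = ∫ log|f′(t)| dt − ∫ H(p_f(f(t))) dt over [0, 1], which is what the description of the "logarithm" and "entropy" terms suggests. For the hypothetical function f(t) = sin(πt), every level u in (0, 1) is attained on both monotone laps with the same value of |f′|, so the mass distribution p_f(u) is (1/2, 1/2) and the entropy term equals log 2. The sketch below evaluates the logarithm term numerically and compares the result with log(π/4), the entropy of ρ_f obtained by direct integration.

```python
import math

def log_term(n=200_000):
    """Midpoint rule for the integral over [0, 1] of log|f'(t)|, f(t) = sin(pi t)."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n  # with n even, no midpoint hits the critical point 1/2
        total += math.log(abs(math.pi * math.cos(math.pi * t)))
    return total / n

def entropy_term():
    """Integral of H(p_f(f(t))): both laps carry weight 1/2 at every level."""
    return math.log(2.0)

h_from_identity = log_term() - entropy_term()
h_exact = math.log(math.pi / 4.0)
print(h_from_identity, h_exact)
```

The logarithm term alone equals log(π/2), the entropy f would have if its two laps did not overlap; subtracting log 2 accounts for the complete overlap of the laps in this symmetric example.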

Scaling
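For simplicity, we consider λ > 0 (the case λ < 0 is no different). For λ > 0, we have ρ_{λf}(u) = λ^{-1} ρ_f(u/λ), and so h(λf) = h(f) + log λ. This means $h(\lambda f) - h(\lambda g) = h(f) - h(g)$. So, the difference h(f) − h(g) is invariant under the scaling f → λf. However, if we take [a, b] ⊂ R, and we consider $$-\int_a^b \rho_f \log \rho_f\,du + \int_a^b \rho_g \log \rho_g\,du,$$ then, scaling by the factor λ, we have $$-\int_{\lambda a}^{\lambda b} \rho_{\lambda f} \log \rho_{\lambda f}\,du + \int_{\lambda a}^{\lambda b} \rho_{\lambda g} \log \rho_{\lambda g}\,du = \left(-\int_a^b \rho_f \log \rho_f\,du + \int_a^b \rho_g \log \rho_g\,du\right) + \log\lambda \left(\int_a^b \rho_f\,du - \int_a^b \rho_g\,du\right).$$ Since the second term on the right hand side of this equation is not necessarily zero, we see that an integral over just a part of the domain may not be scale-invariant.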
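Turning our attention to the identity (7), we see that for λ > 0 we have: $$\int_0^1 \log|\lambda f'(t)|\,dt = \int_0^1 \log|f'(t)|\,dt + \log\lambda, \qquad \int_0^1 H\big(\vec{p}_{\lambda f}(\lambda f(t))\big)\,dt = \int_0^1 H\big(\vec{p}_f(f(t))\big)\,dt.$$ Perhaps more importantly, each term can be divided along its respective axis, and each individual piece scales similarly: for [a, b] ⊂ [0, 1], $$\int_a^b \log|\lambda f'(t)|\,dt = \int_a^b \log|f'(t)|\,dt + (b-a)\log\lambda, \qquad \int_a^b H\big(\vec{p}_{\lambda f}(\lambda f(t))\big)\,dt = \int_a^b H\big(\vec{p}_f(f(t))\big)\,dt.$$ Our main strategy takes advantage of precisely this. Recall the definitions for (L) and (E) that followed (7); both (L) and (E) will be subdivided along the t axis.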
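The scaling of the individual pieces of the logarithm term can be checked numerically. The sketch below uses two hypothetical monotone functions, f(t) = t^2 + t and g(t) = t^2 + 1.1t, together with an arbitrary factor λ = 3. On each sub-interval [a, b] of the t axis it verifies that the integral of log|λf′| exceeds the integral of log|f′| by exactly (b − a) log λ, and therefore that the difference between the pieces for f and g is unchanged by the scaling.

```python
import math

def log_integral(derivative, a, b, n=100_000):
    """Midpoint rule for the integral over [a, b] of log|derivative(t)|."""
    step = (b - a) / n
    total = 0.0
    for k in range(n):
        total += math.log(abs(derivative(a + (k + 0.5) * step)))
    return total * step

def f_prime(t):
    return 2.0 * t + 1.0   # derivative of the hypothetical f(t) = t**2 + t

def g_prime(t):
    return 2.0 * t + 1.1   # derivative of the hypothetical g(t) = t**2 + 1.1 * t

lam = 3.0
results = []
for a, b in ((0.0, 0.25), (0.25, 1.0), (0.0, 1.0)):
    piece_f = log_integral(f_prime, a, b)
    piece_g = log_integral(g_prime, a, b)
    scaled_f = log_integral(lambda t: lam * f_prime(t), a, b)
    scaled_g = log_integral(lambda t: lam * g_prime(t), a, b)
    shift = scaled_f - piece_f                      # should equal (b - a) * log(lam)
    difference_change = (scaled_f - scaled_g) - (piece_f - piece_g)
    results.append((a, b, shift, difference_change))
    print(a, b, shift, (b - a) * math.log(lam), difference_change)
```

Because the shift depends only on the length of the sub-interval and not on the function, it cancels in the difference of the two pieces. This is the property that makes a subdivision along the t axis compatible with scale invariance.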

A Counter-Example
We will denote the i-th derivative of f (or g) by $f^{(i)}$ (or $g^{(i)}$). One might hope that a theorem along the following lines could be proved, C being an absolute constant: This would show that the h-functional is a continuous map in the norm topology of some Banach space of differentiable functions. One can see immediately that this result is not possible, just due to scaling: If we let while the right hand side converges to zero as n → ∞. However, we might hope that we could prove a result such as In other words, perhaps this scaling issue is the only problem and the h-functional is continuous when restricted to points f and g which are bounded away from zero in this Banach space. This is not the case.
In fact, suppose that we ask even less of the result we seek. Suppose that we seek some r > 0, c > 0, and some function Γ of 2K + 2 variables which is bounded on compact subsets of the open positive orthant $(\mathbb{R}^+)^{2K+2}$, and we just want the following bound to hold: Whenever is true, we have In other words, whenever the differences $\{\|f^{(i)} - g^{(i)}\|_\infty^r\}_{i=0}^{K}$, after being multiplied by the (large) value Γ, are sufficiently small, then we have a quantitative estimate on h(f) − h(g). This is still not possible. Note that (9) is a special case of (10). Note also that our main result, Theorem 1, is almost a special case of (10) with K = 3, the only additional element in Theorem 1 being the regularity condition (Definition 6) and the involvement of the parameter from that condition, δ.
The impossibility of a result such as (10) can be seen just looking at monotone functions, in which case the h functional is identical to the functional f → ∫ log |f′|. The reason is quite straightforward: The functional f → ∫ log |f′| will become large if a function is extremely flat even for a short distance, whereas any $\|\cdot\|_\infty$-type norm will not necessarily measure a large difference between a function that is extremely flat and a function that is merely somewhat flat. We formalize this in the following proposition:

Proposition 5. Fix K and ϵ. There exists a constant $C_K$ depending only on K, and monotone functions f, g, such that ∀i ≤ K, $f^{(i)}$, $g^{(i)}$ are continuous

Proof. Let $f_0(t) = t^N$ and $g_0(t) = t^M$; we are free to choose M and N independently, and the idea is to let M ≫ N ≫ K. Define We see that f and g were chosen to have $f^{(i)}$, $g^{(i)}$ continuous for i ≤ K. If we choose N and M so that so once we choose sufficiently large M so that $M^K \alpha^{M/2} \le \alpha^N$ for all α ≤ 1/2, we have g ) i and all of its derivatives to be smaller than ϵ and then we will have $\|f^{(i)} - g^{(i)}\|_\infty \le \epsilon$ ∀i ≤ K. We see that and so we have By the above calculation, we have the immediate bound and we see that the following is true, with a constant $C'_K$ depending only on K: , and noting that the same is true for g, we have ∀i ≤ K, So, f and g meet the criteria of the proposition. At this point, let us "fix" the value of N that we have arrived at. Since f′, g′ > 0 and g′ ≤ f′, we have and as M → ∞ this expression grows without bound, proving the proposition.
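The mechanism behind Proposition 5 can be illustrated with a simplified pair of functions. This is not the construction used in the proof; it is a hypothetical variant chosen because everything has a closed form. Take f(t) = (t/2)^N and g(t) = (t/2)^M on [0, 1]. Both are monotone, so h reduces to ∫ log of the derivative, and a direct computation gives ∫₀¹ log f′ = log(N/2) − (N − 1)(1 + log 2). The sketch compares the sup-norm gaps with the gap in h as M grows for fixed N; the helper names `h_power` and `sup_gaps` are assumptions of this example.

```python
import math


def h_power(n):
    """Closed form of the integral over [0, 1] of log f'(t) for f(t) = (t/2)**n."""
    return math.log(n / 2) - (n - 1) * (1 + math.log(2))


def sup_gaps(n, m, grid=10_001):
    """Grid estimates of sup|f - g| and sup|f' - g'| on [0, 1]."""
    ts = [i / (grid - 1) for i in range(grid)]
    d0 = max(abs((t / 2) ** n - (t / 2) ** m) for t in ts)
    d1 = max(
        abs(n / 2 * (t / 2) ** (n - 1) - m / 2 * (t / 2) ** (m - 1)) for t in ts
    )
    return d0, d1


N = 10
results = []
for M in (20, 40, 80):
    d0, d1 = sup_gaps(N, M)
    h_gap = h_power(N) - h_power(M)  # = log(N/M) + (M - N) * (1 + log 2)
    results.append((M, d0, d1, h_gap))
    print(f"M={M:3d}  sup|f-g|={d0:.2e}  sup|f'-g'|={d1:.2e}  h(f)-h(g)={h_gap:.2f}")
```

The sup-norm gaps stay bounded by quantities depending only on N, while h(f) − h(g) grows linearly in M: the "extremely flat" function g is heavily penalized by the logarithm even though it is uniformly close to the "somewhat flat" function f.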
The quantities which actually bring h(f ) and h(g) close together are ∥f − g∥ ∞ and ∥f ′ − g ′ ∥ ∞ ; the "smallness" of these quantities, relative to the size of the third derivatives of f and g, will be quantified by a parameter ϵ.
However, Proposition 5 means that the "smallness" of ∥f − g∥ ∞ and ∥f ′ − g ′ ∥ ∞ alone cannot imply the result we are seeking.We will impose an additional regularity assumption.
Definition 6 (The Set A(δ)). Suppose that h : [0, 1] → R. For 0 < δ ≤ 1, we say h ∈ A(δ) if h, h′, h′′ and $h^{(3)}$ are continuous, and the following two conditions hold: )

We will require that f and g are in the set A(δ). Intuitively, the requirement for a function to be in A(δ) is a quantitative statement that its first and second derivatives do not simultaneously vanish, i.e., it does not have any "flat spots". The second half of the condition, equation (12), is not as natural; its purpose is simply to eliminate some uninteresting technical difficulties at the endpoints of [0, 1].
Lastly, we will need bounds on the ratios With these assumptions, we can bound the difference between h(f) and h(g):

Theorem 1 (Main Result). Suppose that f, g : [0, 1] → R are each members of A(δ), and for some $\epsilon \le 2^{-20}\delta^{8}$ we have where C is an absolute constant.
Our main result can be thought of as a converse to Proposition 5: If the functions f and g do not have any extreme "flat spots" then (essentially) the smallness of ∥f − g∥ ∞ and ∥f ′ − g ′ ∥ ∞ are sufficient to imply that h(f ) and h(g) are close.
Notice that the result is scale invariant: the assumptions and conclusion are not altered in the least if we multiply both f and g by the same non-zero constant.
In preparation for estimating the size of each of the four pieces, in Sections 9 and 10 we develop some results pursuing the implications of the regularity assumption. These implications are both quantitative and qualitative. The quantitative implications are pursued in Section 9. In Section 10, we define the space E(τ) (see Definition 11), which is meant to abstract just some "qualitative" or "incidence" properties which follow from the quantitative results, and we prove some results about E(τ).
In Section 11, we apply the results from Sections 9 and 10 to f and g. In brief, the bound on each piece is as follows.
Looking first at (L), the bound for (LG) is immediate, so we concentrate on (LB). The regularity assumption (Definition 6) will imply that critical points of f and g roughly coincide; we consider the portion of $B^*$ near just one critical point, call this $B^*_j$. As ϵ → 0, certainly $|B^*_j| \to 0$, but $B^*_j$ is where |f′| or |g′| are small (or zero), so log |f′|, log |g′| can be unbounded and we cannot say that (LB) is negligible just because $|B^*_j|$ is small. However, we observe that f can be approximated by a parabola $p_f$ near a critical point, and the same is true for g. We have and it is only $\log|p'_f|$ which is unbounded; the estimate of (LB) is completed by proving a bound on ∫

We turn next to (EB). This piece is bounded by "crude" size estimates, the main point being that the set $(G^*)^c$ is small. The integrand, is bounded in size by the logarithm of one plus the number of critical points of f (the regularity assumption will imply that f and g have the same number of critical points). The size of the set $(G^*)^c$ is bounded just by the number of components of $B^*$, the size of $B^*$, and some basic data about the functions f and g.

Finally, we have (EG). We in fact only need some "incidence" data about the functions f and g to prove the result here (this data is summarized in Definition 11). The bound on this piece proceeds essentially as follows: Once we consider only $t \in G^*$, we can say and we need to bound $\left|\sum_{i=0}^{k} p_i \log p_i - q_i \log q_i\right|$. On $G^*$, we will see that $p_i$, $q_i$ are bounded below, so we use the inequality and the proof is completed by estimating the difference which ultimately reduces to the fact that f ) ) are close together because $|f(t) - g(t)| \le \|f - g\|_\infty$, and $f_j^{-1}$, $g_j^{-1}$ are going to be close together when evaluated on points that are close, and finally f′, g′ differ by at most $\|f' - g'\|_\infty$ when evaluated on the same point, so they are going to be close together when evaluated on points that are close.
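The reason the unbounded term log |p′| can still be controlled near a critical point is that a logarithmic singularity is integrable. For a linear derivative p′(t) = α t (a parabola with its critical point shifted to 0), one has the closed form ∫ from −a to a of log |α t| dt = 2a(log(|α| a) − 1), which tends to 0 as a → 0. The sketch below checks this against a midpoint rule; the values of α and a, and the helper names, are hypothetical and chosen only for illustration.

```python
import math


def log_linear_integral(alpha, a):
    """Closed form of the integral of log|alpha * t| over [-a, a]."""
    return 2 * a * (math.log(abs(alpha) * a) - 1)


def midpoint_estimate(alpha, a, n=100_000):
    """Midpoint rule on [-a, a]; with n even, no node lands on the singularity t = 0."""
    step = 2 * a / n
    return sum(
        math.log(abs(alpha * (-a + (i + 0.5) * step))) for i in range(n)
    ) * step


alpha = 2.0
closed = [log_linear_integral(alpha, a) for a in (0.1, 0.01, 0.001)]
numeric = midpoint_estimate(alpha, 0.1)

for a, value in zip((0.1, 0.01, 0.001), closed):
    print(f"a={a:<6} integral of log|alpha t| over [-a, a] = {value:.5f}")
print("midpoint estimate for a=0.1:", numeric)
```

The integral shrinks like a·log(1/a), so a small neighborhood of a critical point contributes a small amount even though the integrand is unbounded there.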
In the following sections, the functions "f " and "g" will always refer to the two specific functions assumed by Theorem 1.We will use h and l to refer to generic functions.

The Space A(δ)
In the following, we assume h, l ∈ A(δ).
Lemma 8. Distinct zeros of h′ are separated by a distance of at least δ/2. This also means h′ has at most 2/δ zeros on [0, 1].
and then Lemma 7 implies |h ′′ (t)| > 0. Since h ′′ is continuous, h ′′ does not change sign between t and t 0 , so a zero of h ′ must be further than δ/2 away from t 0 .
and we apply Lemma 9 to l.We proceed analogously if we assume l ′ (t 0 ) = 0.

The Space E(τ )
Definition 11. We say that h : (3) are continuous and

Definition 12. For h, l ∈ E(τ), we say h ∼ l if the following two statements are true:

The set E(τ) just isolates some of the "incidence" properties of the critical points of h and l; we will make use of it because of the following two lemmas which connect it to A(δ): This follows directly from the definition of A(δ) and Lemma 8.

Lemma 14. If h, l ∈ A(δ) and
This follows directly from Lemma 10 and Lemma 14.

Lemma 15. If h, l ∈ E(τ ) and h ∼ l, then h and l have the same number of critical points, and if h
as follows: For a critical point t of h, let It is easy to check that Φ is well defined and is, in fact, a bijection. It follows that $s_i = \Phi(t_i)$, and so the second conclusion follows from the fact that

Definition 16. For functions h and l, and B ⊂ [0, 1], let (recall the notation from Definition 1).
Lemma 17. Suppose h, l ∈ E(τ ), and h ∼ l.By Lemma 15 suppose they have critical points Suppose that 1) holds and h(t) ∈ Range(h i ).Without loss of generality, suppose that h ′ i > 0. We know So, the left hand side is less than or equal to the right hand side and we have h(t), l(t) ∈ Range(h i ), Range(l i ).This proves 1) =⇒ 2), 3), 4) The other implications are seen in the same manner.
Lemma 18. Under the same assumptions as Lemma 17, for t ∈ G we have and so This follows from the observation that I contains no intervals of length less than $\|h - l\|_\infty$, and h(t), l(t) ∉ I.
We know by Lemma 17 that $u \in \mathrm{Range}(h_j) \cap \mathrm{Range}(l_j)$. Suppose $h_j(t_1) = u$. Then $t_1 \notin B$ by Lemma 18. From Lemma 15 we see that $\mathrm{Dom}(h_j) \cap B^c = \mathrm{Dom}(l_j) \cap B^c$. So, The same is true for $l_j^{-1}(u)$, and so

Applying the Results to f and g
We now return to the two specific functions f and g, which have the properties stated in the assumptions of Theorem 1. We introduce a variable, µ, which will allow us to divide [0, 1] into areas where either f′ or g′ are "small" according to µ, specifically and areas where both f′ and g′ are "large" according to µ, specifically $|f'(t)| \ge \mu\|f^{(3)}\|_\infty$ and $|g'(t)| \ge \mu\|g^{(3)}\|_\infty$. At the end of the proof, we will optimize µ; it will turn out that $\mu = \epsilon^{1/4}$ is optimal.
We state here some restrictions on the values of µ that we will consider: Notice that, since we will set $\mu = \epsilon^{1/4}$, we will require $\epsilon \le 2^{-20}\delta^{8}$. We know that f, g ∈ A(δ), and we also know that $\|f' - g'\|_\infty < \delta^2/16$ by (14) and (19). This means that we can apply the results of Section 9 to f and g, and with Lemmas 13 and 14 we see that we can also apply the results of Section 10. With this in mind, by Lemma 15 let us suppose that Next, we want to construct a region covering the critical points of f and g, and extending beyond the critical points far enough to cover the entire area where f′ or g′ is "small" (according to µ).
We see that the critical points of f and g are contained in the $J_i^*$. Furthermore, suppose that $|f'(t_0)| \le \mu\|f^{(3)}\|_\infty$. By Lemma 9, we see that the distance from $t_0$ to a critical point of f is no more than 4µ/δ, which means that $t_0$ is contained in some $J_i^*$. The same is true for g, and therefore we have For purposes of the expression H , the "bad" set will also include the endpoints of the interval, so we define the sets $B^*$ and $G^*$ as follows.
We are now ready to state how we will decompose the expression h(f) − h(g) in order to estimate it. First, recall by equation (7) that ) ) ) We will divide [0, 1] in different ways to estimate the different integrals: This allows us to decompose our main expression, h(f) − h(g), into the following four pieces: Now, we apply these propositions to f and g. We will abbreviate The assumptions of Theorem 1 mean that and without loss of generality we may assume $C_1, C_2 \ge 3$.
We know that $\|f' - g'\|_\infty < \delta^2/16$ by (14) and (19), and therefore by Lemma 10 and equation (14 To apply Proposition 23 (with a = 4µ/δ), we must have and this is implied by (19). So, with some simplifying to arrive at the second equation, we have Next, looking to Proposition 24, recalling equation (20) we have and so we can apply Proposition 24 to f and g, which gives Looking to Proposition 25, we have In the notation of Proposition Looking to Proposition 26, by Lemma 14 and Definition 21 we can apply Proposition 26 to f and g, so we have ) ) Simplifying the logarithm term in the last line was the reason for assuming $C_1 \ge 3$.
Combining equations (22), (23), (24), (25), and letting C denote an absolute constant, we have Examining just the last expression in the product, again letting C be an absolute constant, we have Treating δ as a fixed parameter, this expression is optimized if $\mu = \epsilon^{1/4}$, and using Lemma 8 we see that k ≤ 2/δ. Therefore, with C being another absolute constant, we have

Proofs of Main Propositions

Proof of Proposition 23
Proposition 23 states: Suppose that h, l ∈ A(δ) and and suppose that we know ∀i, Let us fix i and define the Taylor polynomials We see that h and so This is also true for We will make some abbreviations to aid our exposition in this section: Let $\alpha = h''(t_i)$, $\beta = l''(s_i)$ so we have $p'_h(t) = \alpha(t - t_i)$ and $p'_l(s) = \beta(s - s_i)$. Without loss of generality, suppose $t_i < s_i$ and abbreviate $\gamma = (s_i - t_i)/2$. Define $t_c = (t_i + s_i)/2$ so that we have (3) ∥ ∞ , ∥l (3) and therefore The same is true for $l_j$. By (26) ) Using the triangle inequality, we have we have Next, recalling Lemma 17 (which verifies that the expressions in the first following inequality are well-defined) we have: Lemma 19 also tells us that h′ does not change sign on $\langle h_j^{-1}(u), l_j^{-1}(u)\rangle$, which means Combining equations (27), (28), (29), (30), (31), (32), (33), (34), the proposition is proved.
so a method which proves a bound analogous to ours should respect this scaling. The assumption by Godavarti and Hero (in equation (4) of the present article) involves a constant, L, and the specific value of L is used in the proof (for example, we need N, K large enough that (essentially) $L/|\log K|^{\kappa-1}$ and $L/|\log N|^{\kappa-1}$ are sufficiently small). The scaling $\rho(u) \to |\lambda|^{-1}\rho(u/\lambda)$ transforms the integral in (4) into ∫ ρ log |λ| −1 + log ρ κ

© 2010 by the author; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.