Relative Entropy Derivative Bounds

Zegers, Pablo; Fuentes, Alexis; Alarcón, Carlos

doi:10.3390/e15072861

Open AccessArticle

Relative Entropy Derivative Bounds

by

Pablo Zegers

^*,

Alexis Fuentes

and

Carlos Alarcón

Universidad de los Andes, Facultad de Ingeniería y Ciencias Aplicadas, Monseñor Álvaro del Portillo 12455, Las Condes, Santiago, Chile

^*

Author to whom correspondence should be addressed.

Entropy 2013, 15(7), 2861-2873; https://doi.org/10.3390/e15072861

Submission received: 24 May 2013 / Revised: 12 July 2013 / Accepted: 16 July 2013 / Published: 23 July 2013

(This article belongs to the Special Issue Maximum Entropy and Bayes Theorem)

Download Versions Notes

Abstract

:

We show that the derivative of the relative entropy with respect to its parameters is lower and upper bounded. We characterize the conditions under which this derivative can reach zero. We use these results to explain when the minimum relative entropy and the maximum log likelihood approaches can be valid. We show that these approaches naturally activate in the presence of large data sets and that they are inherent properties of any density estimation process involving large numbers of random variables.

Keywords:

relative entropy; Kullback-Leibler divergence; Shannon differential entropy; asymptotic equipartition principle; typical set; Fisher information; maximum log likelihood

1. Introduction

Given a large ensemble of i.i.d.random variables, all of them generated according to some common density function, the asymptotic equipartition principle [1] guarantees that only a fraction of the possible ensembles, which is called the typical set, gathers almost all the probability ([2] [p. 226]). This fact opens interesting possibilities: What if the properties of the typical set impose conditions on any estimation process, such that the parameter search is focused on just a small subset of the available parameter set? We explore this using generalizations of the Shannon differential entropy [1] and the Fisher information [3,4,5] under typical set considerations. There are other works that take into account both concepts [6,7,8]. However, this work focuses on something new: defining algebraic bounds on the behavior of the derivative of the relative entropy with respect to their parameters. Furthermore, we characterize the conditions under which seeking for the extreme of this expression is a valid approach. Most importantly, we prove that these conditions automatically activate if large data sets are used. We finish this work discussing the relation between these bounds and the density function estimation problem. Under the conditions that activate these bounds, we characterize when the minimum relative entropy and maximum log likelihood approaches render the same solution. Furthermore, we show that these equivalences become true by the sole fact of having a large number of random variables.

2. Known Density Functions Case

Let us assume for the ensuing discussion that there exists a known source density function,

f_{X}

, with support,

S

. We also know a set of i.i.d. samples,

{x}_{n} \equiv {x_{1}, \dots, x_{n}}

, that was generated by

f_{X}

. There is also another density function,

f_{Y | θ} (y)

, with the same support,

S

. This density function is indexed by the parameter vector,

θ \in Θ \subseteq R^{m}

.

The first part of this work deals with this case, where all the density functions are known. The second part of this work studies the density function estimation problem, where the source density function,

f_{X}

, is not known.

3. Mixed Entropy Derivative: A Proxy for the Relative Entropy Derivative

We present the following definition:

Definition 1

The relative entropy ([2][p. 231]), also called Kullback-Leibler divergence, is defined by:

D_{R E} (f_{X} ∥ f_{Y | θ}) \equiv \int_{S} f_{X} (u) ln \frac{f_{X} (u)}{f_{Y | θ} (u)} d u

(1)

Notice that

{lim}_{u \to 0^{+}} u ln u = 0

. Even though the definition is valid for any pair of density functions, we are using those useful for the ensuing analysis. The domain of this integral, and that of all the integrals in this work, corresponds to the support,

S

.

Definition 2

The mixed entropy,

h_{M} (f_{X}, f_{Y | θ})

of two density functions is defined by:

h_{M} (f_{X}, f_{Y | θ}) \equiv - \int_{S} f_{X} (u) ln f_{Y | θ} (u) d u

(2)

when the integral exists.

Definition 3

The Shannon differential entropy [1] is defined by:

h_{S} (f_{X}) \equiv - \int_{S} f_{X} (u) ln f_{X} (u) d u

(3)

From these definitions, when

f_{Y | θ} = f_{X}

, from a functional point of view, the mixed entropy,

h_{M} (f_{X}, f_{Y | θ})

, equals the differential entropy,

h_{S} (f_{X})

From examination of the relative entropy definition:

\begin{matrix} D_{K L} (f_{X} ∥ f_{Y | θ}) & = & \int f_{X} (u) ln f_{X} (u) d u - \int f_{X} (u) ln f_{Y | θ} (u) d u \\ (4) & = & - h_{S} (f_{X}) + h_{M} (f_{X}, f_{Y | θ}) \end{matrix}

Hence:

\frac{\partial D_{K L} (f_{X} ∥ f_{Y | θ})}{\partial θ_{k}} = \frac{h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}

(5)

In the previous equation it was assumed that the

\frac{\partial}{\partial θ_{k}}

are defined in the interior of Θ for

k \in {1, \dots, m}

. In this work, where we assume that

f_{X}

does not depend on any

θ_{k}

, we use extensively the fact that the relative entropy derivative is equal to the mixed entropy derivative. Hence, in the following, we focus on studying the properties of the mixed entropy in order to be able to say something about the properties of the relative entropy derivatives.

4. Mixed Entropy Typical Set

An interesting insight related to sequences of i.i.d. random variables, discovered by Shannon [1], is the usefulness of the weak law of large numbers to characterize large ensembles of random variables. In the context of this work, this law implies:

Defining

Definition 4

The likelihood,

f_{{Y}_{n} | θ}

, is defined by:

f_{{Y}_{n} | θ} \equiv \prod_{k = 1}^{n} f_{Y | θ} (x_{k})

(6)

we can state:

Theorem 1

For any positive real number, ε, and fixed parameter vector,

θ

:

P \{|- \frac{1}{n} ln f_{{Y}_{n} | θ} - h_{M} (f_{X}, f_{Y | θ})| \leq ε\} \underset{n \to \infty}{\to} 1

(7)

Proof: Use that the random variables are i.i.d., and the weak law of large numbers. ☐

Hence, it makes sense to define the following set ([2][p. 226]):

Definition 5

For any positive real number, ε, fixed parameter vector,

θ

, and any

n \geq 1

, the mixed entropy typical set,

M_{ε}^{n}

, with respect to the density function,

f_{X}

, is defined by:

M_{ε}^{n} = \{{x}_{n} \in S^{n} : |- \frac{1}{n} ln f_{{Y}_{n} | θ} - h_{M} (f_{X}, f_{Y | θ})| \leq ε\}

(8)

Furthermore, using the weak law of large numbers, it is possible to prove:

Lemma 1

For any positive real number, ε, and any parameter vector,

θ

, it is true that:

P \{{x}_{n} \in M_{ε}^{n}\} \underset{n \to \infty}{\to} 1

(9)

Proof: Use the weak law of large numbers in the mixed entropy typical set definition. ☐

This proves that for large values of n, almost all the probability is contained by the mixed entropy typical set and that the probability of being outside this set becomes negligible.

Then, it is possible to prove that:

Lemma 2

For any positive real number, ε, and any fixed parameter vector,

θ

, it exists

n_{0} \equiv n_{0} (ε, θ)

, such that for all

n > n_{0}

, the total probability of the typical set is lower bounded by:

1 - ε \leq P \{M_{ε}^{n}\}

(10)

Proof: Check ([2][p. 226], [9][p. 118]) for details. ☐

5. Micro-Differences in Mixed Entropy Typical Sets

Assuming that

{x}_{n} \in M_{ε}^{n}

and from the definition of mixed entropy typical set:

ln f_{{Y}_{n} | θ} \in] - n h_{M} (f_{X}, f_{Y | θ}) - n ε, - n h_{M} (f_{X}, f_{Y | θ}) + n ε [

(11)

The width of this interval is

2 ε n

. This allows us to compare

ln f_{{Y}_{n} | θ}

with

- n h_{M} (f_{X}, f_{Y | θ})

, the value that is obtained in the limit thanks to the weak law of large numbers. Their ratio is:

|\frac{2 ε n}{- n h_{M} (f_{X}, f_{Y | θ})}| = |\frac{2 ε}{h_{M} (f_{X}, f_{Y | θ})}|

(12)

In other words, given that ε can be chosen as small as needed, the range of values that

ln f_{{Y}_{n} | θ}

can take may be indistinguishable from

- n h_{M} (f_{X}, f_{Y | θ})

, with high probability. This is another way of stating the weak law of large numbers.

The previous facts allow us to present the following definition:

Definition 6

When

{x}_{n} \in M_{ε}^{n}

, its associated micro-difference is defined by the following function:

δ \equiv δ (f_{X}, n, {x}_{n}, f_{Y | θ}, θ, ε) \in] 0, 1 [

(13)

such that:

ln f_{{Y}_{n} | θ} = - n (h_{M} (f_{X}, f_{Y | θ}) + ε) + 2 ε n δ

(14)

hence:

δ = \frac{1}{2 ε n} (ln f_{{Y}_{n} | θ} + n (h_{M} (f_{X}, f_{Y | θ}) + ε))

(15)

The analysis performed in the preceding paragraphs shows that the behavior of the micro-differences might be negligible in the case of large values of n. However, the following sections of this work deal with the the derivative of the micro-difference value with respect to the parameters; so, we cannot neglect it, and we need to study its behavior.

From Equation (14):

\frac{\partial ln f_{{Y}_{n} | θ}}{\partial θ_{k}} = - n \frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}} + 2 ε n \frac{\partial δ}{\partial θ_{k}}

(16)

for all

k \in [1, \dots, m]

. Thus:

\begin{matrix} 2 ε \frac{\partial δ}{\partial θ_{k}} & = & \frac{1}{n} \frac{\partial ln f_{{Y}_{n} | θ}}{\partial θ_{k}} + \frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}} \end{matrix}

(17)

\begin{matrix} = & \frac{\partial}{\partial θ_{k}} (\frac{1}{n} ln f_{{Y}_{n} | θ} + h_{M} (f_{X}, f_{Y | θ})) \end{matrix}

(18)

\begin{matrix} = & - \frac{\partial}{\partial θ_{k}} (- \frac{1}{n} ln f_{{Y}_{n} | θ} - h_{M} (f_{X}, f_{Y | θ})) \end{matrix}

(19)

This last expression helps us to understand that the derivative of δ with respect to the parameters is related to changes experienced by quantities upper bounded by ε.

6. Mixed Information

We define:

Definition 7

The mixed information,

i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k}

, is defined by:

i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k} \equiv \int_{S^{n}} f_{{X}_{n}} {(\frac{\partial ln f_{{Y}_{n} | θ}}{\partial θ_{k}})}^{2} d u

(20)

with:

f_{{X}_{n}} \equiv \prod_{k = 1}^{n} f_{X} (x_{k})

(21)

when the integral exists.

7. Mixed Entropy Derivative Bounds

The main contribution of this work is the following pair of inequalities, which are obtained thanks to a combination of the mixed information and the mixed entropy.

Theorem 2

Given a positive real number, ε, any parameter vector,

θ

, and an i.i.d. sequence with n elements, then the components of

\nabla_{θ} h_{M} (f_{X}, f_{Y | θ})

comply with:

\frac{\sqrt{I_{R}^{k}} - \frac{1}{n} \sqrt{i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k}}}{\sqrt{I_{L}}} \leq |\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}| \leq \frac{\sqrt{I_{R}^{k}} + \frac{1}{n} \sqrt{i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k}}}{\sqrt{I_{L}}}

(22)

where:

\begin{matrix} I_{R}^{k} & \equiv & \int_{M_{ε}^{n}} f_{{X}_{n}} {|2 ε \frac{\partial δ}{\partial θ_{k}}|}^{2} d u \end{matrix}

(23)

\begin{matrix} I_{L} & \equiv & \int_{M_{ε}^{n}} f_{{X}_{n}} d u \end{matrix}

(24)

Proof: From the mixed information definition:

i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k} = I_{M_{ε}^{n}}^{k} + I_{\sim M_{ε}^{n}}^{k} \geq I_{M_{ε}^{n}}^{k} > 0

(25)

where:

I_{M_{ε}^{n}}^{k} \equiv \int_{M_{ε}^{n}} f_{{X}_{n}} {(\frac{\partial ln f_{{Y}_{n} | θ}}{\partial θ_{k}})}^{2} d u

(26)

and the symbol, ∼, denotes the complement set.

Replacing Equation (16) in Equation (26), it is obtained:

\begin{matrix} I_{M_{ε}^{n}}^{k} & = & n^{2} {(\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}})}^{2} \int_{M_{ε}^{n}} f_{{X}_{n}} d u - 2 n \frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}} \int_{M_{ε}^{n}} f_{{X}_{n}} (2 ε n \frac{\partial δ}{\partial θ_{k}}) d u \\ (27) & + \int_{M_{ε}^{n}} f_{{X}_{n}} {(2 ε n \frac{\partial δ}{\partial θ_{k}})}^{2} d u \end{matrix}

Using the Cauchy-Bunyakovsky-Schwarz inequality:

\begin{matrix} {|\int_{M_{ε}^{n}} f_{{X}_{n}} (2 ε n \frac{\partial δ}{\partial θ_{k}}) d u|}^{2} & = & {|\int_{M_{ε}^{n}} \sqrt{f_{{X}_{n}}} \sqrt{f_{{X}_{n}}} (2 ε n \frac{\partial δ}{\partial θ_{k}}) d u|}^{2} \\ \leq & \int_{M_{ε}^{n}} f_{{X}_{n}} d u \cdot \int_{M_{ε}^{n}} f_{{X}_{n}} {|2 ε n \frac{\partial δ}{\partial θ_{k}}|}^{2} d u \\ (28) & = & I_{L} \cdot n^{2} I_{R}^{k} \end{matrix}

Therefore:

\begin{matrix} I_{M_{ε}^{n}}^{k} & \geq & n^{2} {(\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}})}^{2} I_{L} - 2 n^{2} |\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}| \sqrt{I_{L}} \sqrt{I_{R}^{k}} + n^{2} I_{R}^{k} \\ (29) & = & n^{2} {(|\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}| \sqrt{I_{L}} - \sqrt{I_{R}^{k}})}^{2} \end{matrix}

which implies:

\frac{1}{n^{2}} i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k} \geq {(|\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}| \sqrt{I_{L}} - \sqrt{I_{R}^{k}})}^{2}

(30)

Hence, using the negative solution of the square root:

\frac{1}{n} \sqrt{i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k}} \geq - |\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}| \sqrt{I_{L}} + \sqrt{I_{R}^{k}}

(31)

Therefore:

\frac{\sqrt{I_{R}^{k}} - \frac{1}{n} \sqrt{i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k}}}{\sqrt{I_{L}}} \leq |\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}|

(32)

Taking into account the positive solution of Equation (30):

|\frac{\partial h_{M} (f_{X}, f_{Y | θ})}{\partial θ_{k}}| \leq \frac{\sqrt{I_{R}^{k}} + \frac{1}{n} \sqrt{i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | θ})}_{k}}}{\sqrt{I_{L}}}

(33)

☐

The previous result allows us to state the following remarks:

Remark 1

This theorem states algebraic bounds on the derivative of the mixed entropy that are valid for any value of n.

Remark 2

The expression:

I_{R}^{k} \equiv \int_{M_{ε}^{n}} f_{{X}_{n}} {|2 ε \frac{\partial δ}{\partial θ_{k}}|}^{2} d u = \int_{M_{ε}^{n}} f_{{X}_{n}} {|- \frac{\partial}{\partial θ_{k}} (- \frac{1}{n} ln f_{{Y}_{n} | θ} - h_{M} (f_{X}, f_{Y | θ}))|}^{2} d u

(34)

refers to the accumulation of derivatives of very small differences. This differences become smaller and smaller as the weak law of large numbers is enacted.

8. Unknown Source Density Function, $f_{X}$

Now, we study what happens when the source density function,

f_{X}

, is not known.

It can be proven that

D_{K L} (f_{X} ∥ f_{Y | θ}) \geq 0

([2][p. 232]). Thus, from the corresponding definitions given at the start of this work:

h_{M} (f_{X}, f_{Y | θ}) \geq h_{S} (f_{X})

(35)

Therefore, the density estimation problem can be framed as the following optimization program:

θ_{\circ} \equiv θ_{\circ} (f_{X}, f_{Y | θ}) = arg min_{θ} h_{M} (f_{X}, f_{Y | θ})

(36)

Remark 3

The solutions of this optimization program pose the following options:

$h_{M} (f_{X}, f_{Y | θ_{\circ}}) = h_{S} (f_{X})$ , hence $f_{Y | θ_{\circ}} = f_{X}$ .
$h_{M} (f_{X}, f_{Y | θ_{\circ}}) > h_{S} (f_{X})$ , thus $f_{Y | θ_{\circ}} \neq f_{X}$ . This happens when the estimator cannot implement the desired density function, which is beyond its reach. One way of checking whether the global minimum has been reached is to compare the mixed entropy value to that of the Shannon differential entropy. If they differ, the global minimum has not been reached yet.

According to this result, it is straightforward to search for parameters that make true the following expression:

{\nabla_{θ} h_{M} (f_{X}, f_{Y | θ})|}_{θ = θ_{\circ}} = 0

(37)

However, this is out of the limits in our density estimation problem, given that we do not have access to

f_{X}

. Thank to the weak law of large numbers:

lim_{n \to \infty} - \frac{1}{n} ln f_{{Y}_{n} | θ} = h_{M} (f_{X}, f_{Y | θ})

(38)

in probability. Thus, for sufficiently large values of n:

- \frac{1}{n} ln f_{{Y}_{n} | θ} \approx h_{M} (f_{X}, f_{Y | θ})

(39)

Hence, the optimization program posed in Equation (36) can be approximated by:

θ_{\circ} \equiv θ_{\circ} (n, {x}_{n}, f_{Y | θ}) = arg min_{θ} (- \frac{1}{n} ln f_{{Y}_{n} | θ})

(40)

or:

θ_{\circ} \equiv θ_{\circ} (n, {x}_{n}, f_{Y | θ}) = arg max_{θ} (\frac{1}{n} ln f_{{Y}_{n} | θ})

(41)

which is the maximum log likelihood framework ([3][p. 261], [4][260], [5][p. 65]).

Remark 4

As in the mixed entropy case, the maximum log likelihood framework can exhibit the following cases:

The optimization program finds a parameter combination that makes the estimator equal to the unknown source density function, thus making true the following expression for a sufficiently large n:

$- \frac{1}{n} ln f_{{Y}_{n} | θ_{\circ}} \approx h_{M} (f_{X}, f_{Y | θ_{\circ}}) \approx h_{S} (f_{X})$

(42)
The estimator does not find the desired target function. This happens when the family of functions that can be implemented by the estimator does not include the target function. It also happens when the maximum log likelihood program is riddled with multiple local minima, and the optimization process gets trapped in one of them. Only under very specific conditions, it is possible to obtain the global maximum of the maximum log likelihood problem [10]. In this case, we cannot use a comparison between the value of the log likelihood term and that of the Shannon differential entropy to determine the goodness of our solution, as we could do in the case of the mixed entropy, because calculating the latter is analogous to the very problem we are trying to solve.

Hence, it is natural to work with this approximated program and look for parameters that make the following expression true:

{\nabla_{θ} (\frac{1}{n} ln f_{{Y}_{n} | θ})|}_{θ = θ_{\circ}} = 0

(43)

9. An Example: Mismatched Density Functions

9.1. Known Density Functions Case

Let us define a source density function:

f_{X} (x) = λ e^{- λ x} H (x)

(44)

with

x \in R

and

H (x)

, the Heaviside step function. This density function is completely defined by the parameter, λ. Its differential entropy is given by:

h_{S} (f_{X}) = 1 - ln λ = 1 + ln \frac{1}{λ}

(45)

We also have a data set,

{x}_{n}

, generated by the source density function.

The density function that depends on parameters is:

f_{Y | μ, σ} = \frac{1}{σ \sqrt{2 π}} exp (- \frac{{(μ - x)}^{2}}{2 σ^{2}})

(46)

with

x \in R

, μ its mean and σ its standard deviation. Its differential entropy is given by:

h_{S} (f_{Y | μ, σ}) = ln (σ \sqrt{2 π e}) = 1 + ln σ + ln \sqrt{\frac{2 π}{e}}

(47)

which does not depend on the mean, μ.

9.2. Compliance with Theorem 2

We first analyze the bounds associated with parameter μ. For these density functions:

\frac{1}{n} ln f_{{Y}_{n} | μ, σ} = \frac{1}{n} \sum_{k = 1}^{n} ln f_{Y | μ, σ} (x_{k}) = ln (\frac{1}{σ \sqrt{2 π}}) - \frac{1}{2 n σ^{2}} \sum_{k = 1}^{n} {(μ - x_{k})}^{2}

(48)

Hence:

\frac{1}{n} \frac{\partial ln f_{{Y}_{n} | μ, σ}}{\partial μ} = - \frac{1}{σ^{2}} (μ - \frac{1}{n} \sum_{k = 1}^{n} x_{k})

(49)

We calculate the mixed information for μ:

\begin{matrix} i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | μ, σ})}_{μ} & = & \int_{R_{+}^{n}} f_{{X}_{n}} {(\frac{\partial ln f_{{Y}_{n} | μ, σ}}{\partial μ})}^{2} d u \\ = & \int_{R_{+}^{n}} f_{{X}_{n}} {(\frac{n}{σ^{2}} (μ - \frac{1}{n} \sum_{k = 1}^{n} u_{k}))}^{2} d u \\ (50) & = & \frac{n^{2}}{σ^{4}} \int_{R_{+}^{n}} f_{{X}_{n}} {(μ - \frac{1}{n} \sum_{k = 1}^{n} u_{k})}^{2} d u \end{matrix}

Hence:

\frac{1}{n} \sqrt{i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | μ, σ})}_{μ}} = \frac{1}{σ^{2}} \sqrt{\int_{R_{+}^{n}} f_{{X}_{n}} {(μ - \frac{1}{n} \sum_{k = 1}^{n} u_{k})}^{2} d u}

(51)

Also:

\begin{matrix} h_{M} (f_{X}, f_{Y | μ, σ}) & = & - \int_{- \infty}^{\infty} f_{X} (u) ln f_{Y | μ, σ} (u) d u \\ = & - \int_{- \infty}^{\infty} λ e^{- λ u} H (u) ln f_{Y | μ, σ} (u) d u \\ = & - \int_{0}^{\infty} λ e^{- λ u} (ln (\frac{1}{σ \sqrt{2 π}}) - \frac{{(μ - u)}^{2}}{2 σ^{2}}) d u \\ = & ln (σ \sqrt{2 π}) \int_{0}^{\infty} λ e^{- λ u} d u + \frac{λ}{2 σ^{2}} \int_{0}^{\infty} {(μ - u)}^{2} e^{- λ u} d u \\ (52) & = & ln (σ \sqrt{2 π}) + \frac{1}{σ^{2}} (\frac{μ^{2}}{2} - \frac{μ}{λ} + \frac{1}{λ^{2}}) \end{matrix}

and:

\frac{\partial h_{M} (f_{X}, f_{Y | μ, σ})}{\partial μ} = \frac{1}{σ^{2}} (μ - \frac{1}{λ})

(53)

Finally, we calculate the micro-structure term using Equation (17):

2 ε \frac{\partial δ}{\partial μ} = \frac{1}{σ^{2}} (\frac{1}{n} \sum_{k = 1}^{n} x_{k} - \frac{1}{λ})

(54)

This allows us to calculate:

\begin{matrix} I_{R}^{μ} & = & \int_{M_{ε}^{n}} f_{{X}_{n}} {|2 ε \frac{\partial δ}{\partial μ}|}^{2} d u \\ = & \int_{M_{ε}^{n}} f_{{X}_{n}} {|\frac{1}{σ^{2}} (\frac{1}{n} \sum_{k = 1}^{n} u_{k} - \frac{1}{λ})|}^{2} d u \\ (55) & = & \frac{1}{σ^{4}} \int_{M_{ε}^{n}} f_{{X}_{n}} {(\frac{1}{n} \sum_{k = 1}^{n} u_{k} - \frac{1}{λ})}^{2} d u \end{matrix}

Now, for σ, we have:

\frac{1}{n} \frac{\partial ln f_{{Y}_{n} | μ, σ}}{\partial σ} = \frac{1}{σ^{3}} (- σ^{2} + \frac{1}{n} \sum_{k = 1}^{n} {(μ - u_{k})}^{2})

(56)

The mixed information for σ is defined by:

\begin{matrix} i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | μ, σ})}_{μ} & = & \int_{R_{+}^{n}} f_{{X}_{n}} {(\frac{\partial ln f_{{Y}_{n} | μ, σ}}{\partial σ})}^{2} d u \\ = & \int_{R_{+}^{n}} f_{{X}_{n}} {(\frac{n}{σ^{3}} (- σ^{2} + \frac{1}{n} \sum_{k = 1}^{n} {(μ - u_{k})}^{2}))}^{2} d u \\ (57) & = & \frac{n^{2}}{σ^{6}} \int_{R_{+}^{n}} f_{{X}_{n}} {(- σ^{2} + \frac{1}{n} \sum_{k = 1}^{n} {(μ - u_{k})}^{2})}^{2} d u \end{matrix}

Hence:

\frac{1}{n} \sqrt{i_{F} {(f_{{X}_{n}}, f_{{Y}_{n} | μ, σ})}_{μ}} = \frac{1}{σ^{3}} \sqrt{\int_{R_{+}^{n}} f_{{X}_{n}} {(- σ^{2} + \frac{1}{n} \sum_{k = 1}^{n} {(μ - u_{k})}^{2})}^{2} d u}

(58)

Again, the derivative of the mixed entropy with respect to σ:

\frac{\partial h_{M} (f_{X}, f_{Y | μ, σ})}{\partial σ} = \frac{1}{σ^{3}} ((σ^{2} - \frac{1}{λ^{2}}) - {(μ - \frac{1}{λ})}^{2})

(59)

Hence, the associated micro-differences expression is:

2 ε \frac{\partial δ}{\partial σ} = \frac{1}{σ^{3}} ((\frac{1}{n} \sum_{k = 1}^{n} {(μ - u_{k})}^{2} - \frac{1}{λ^{2}}) - {(μ - \frac{1}{λ})}^{2})

(60)

which allows us to calculate:

\begin{matrix} I_{R}^{σ} & = & \int_{M_{ε}^{n}} f_{{X}_{n}} {|2 ε \frac{\partial δ}{\partial σ}|}^{2} d u \\ = & \int_{M_{ε}^{n}} f_{{X}_{n}} {|\frac{1}{σ^{3}} ((\frac{1}{n} \sum_{k = 1}^{n} {(μ - u_{k})}^{2} - \frac{1}{λ^{2}}) - {(μ - \frac{1}{λ})}^{2})|}^{2} d u \\ (61) & = & \frac{1}{σ^{6}} \int_{M_{ε}^{n}} f_{{X}_{n}} {((\frac{1}{n} \sum_{k = 1}^{n} {(μ - u_{k})}^{2} - \frac{1}{λ^{2}}) - {(μ - \frac{1}{λ})}^{2})}^{2} d u \end{matrix}

9.3. Unknown Source Density Function, $f_{X}$

If

f_{X}

is unknown, then it is not possible to calculate the previous bounds. In this case, we resort to the optimization program specified in Equation (41). Hence, we use Equations (49) and (56), in order to obtain:

\begin{matrix} μ_{n} & = & \frac{1}{n} \sum_{k = 1}^{n} x_{k} \end{matrix}

(62)

\begin{matrix} σ_{n}^{2} & = & \frac{1}{n} \sum_{k = 1}^{n} {(μ_{n} - x_{k})}^{2} \end{matrix}

(63)

Using this choice of values immediately makes both mixed information expressions described by Equations (51) and (58) equal to zero, without the need for large values of n.

Furthermore, if we use:

\begin{matrix} μ_{\circ} = lim_{n \to \infty} μ_{n} & = & \frac{1}{λ} \end{matrix}

(64)

\begin{matrix} σ_{\circ} = lim_{n \to \infty} σ_{n} & = & \frac{1}{λ} \end{matrix}

(65)

then the micro-differences integrals also become equal to zero.

This example shows several things:

The use of the maximum log likelihood solutions and large values of n guarantees that the derivative of the mixed entropy equals zero; hence, the derivative of the relative entropy with respect to its parameters is zero, too.
However, this is not enough to guarantee that the density functions will match each other (a Gaussian cannot estimate an exponential). The maximum log likelihood solution only guarantees that the estimator will do the best it can do.
If not enough data is available, the derivative of the mixed entropy will be different from zero, because the micro-differences terms are not zero.

10. Discussion

If one wanted to estimate a density function out of a data set, one could minimize the relative entropy: once we reach its minimum, we can guarantee that the density function implemented by the estimator will be equal to that which originated the data. This is true only if the family of functions that the estimator can effectively implement includes that of the source density function. Moreover, realizing that it is not possible to measure the relative entropy, one would quickly settle to maximize the log likelihood and obtain the desired density function. Keeping this in mind, the results presented in this work seem just another way of saying the same. However, this is not so. The difference is subtle. Whereas there are many density function estimation principles and it could be discussed whether minimizing the relative entropy is the most convenient one, the bounds presented in this work show that minimizing the relative entropy automatically activates in the presence of large data sets and that one does not need to decide whether that estimation framework is the most convenient or not. Something similar happens with the weak law of large numbers: there are many ways of determining the expected value of the density function that generated the data, but everybody knows that the weak law of large numbers guarantees that in the presence of large data sets, this value can be effectively approximated by the average of the values. Our bounds, also based on the weak law of large numbers, state something similar: in the presence of large data sets, the source density function is perceived as the solution of the maximum log likelihood solution.

11. Conclusions

Theorem 2, which is the main result of this work, consists in a set of bounds on the partial derivative of the mixed entropy, which is a proxy to the relative entropy, with respect to the parameters of the estimator. These bounds relate the mixed information value, the capacity of the estimator to produce micro-differences and the number of random variables present in the analysis in such a way that it is possible to determine that these bounds activate automatically when the data set is large. If the micro-differences integral is zero, the mixed information derivative is zero for large data sets, independently of any other considerations. In other words, minimizing the mixed entropy is the preferred density estimation framework when large data sets are available.

Given the impossibility of estimating the mixed entropy, it is shown that the correct numerical approximation to this problem is finding the solution to the maximum log likelihood problem. Again, when the estimator is associated to a micro-differences integral equal to zero, this framework becomes the optimal one in the presence of large data sets.

Acknowledgments

The authors thank Jaime Cisternas, Jorge Silva and Jose Principe for their helpful insights about this work.

Conflict of Interest

The authors declare no conflict of interest.

References

Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Cover, T.; Thomas, J. Elements of Information Theory; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 1991. [Google Scholar]
Hogg, R.V.; Craig, A.T. Introduction to Mathematical Statistics; Prentice Hall: Upper Saddle River, NJ, USA, 1995. [Google Scholar]
Papoulis, A. Probability, Random Variables, and Stochastic Processes; McGraw-Hill: New York, NY, USA, 1991. [Google Scholar]
Van Trees, H.L. Detection, Estimation, and Modulation Theory: Part 1; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 2001. [Google Scholar]
Vignat, C.; Bercher, J. Analysis of signals in the Fisher-Shannon information plane. Phys. Lett. A 2003, 312. [Google Scholar] [CrossRef]
Romera, E.; Dehesa, J. The Fisher-Shannon information plane, an electron correlation tool. J. Chem. Phys. 2004, 120, 8906–8912. [Google Scholar] [CrossRef] [PubMed]
Dimitrov, V. On Shannon-Jaynes Entropy and Fisher Information. In Proceedings of the 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Saratoga Springs, New York, NY, USA, 8–13 July 2007.
Taubman, D.; Marcellin, M. JPEG2000: Image Compression Fundamentals, Standards, and Practice; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2002. [Google Scholar]
Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Share and Cite

MDPI and ACS Style

Zegers, P.; Fuentes, A.; Alarcón, C. Relative Entropy Derivative Bounds. Entropy 2013, 15, 2861-2873. https://doi.org/10.3390/e15072861

AMA Style

Zegers P, Fuentes A, Alarcón C. Relative Entropy Derivative Bounds. Entropy. 2013; 15(7):2861-2873. https://doi.org/10.3390/e15072861

Chicago/Turabian Style

Zegers, Pablo, Alexis Fuentes, and Carlos Alarcón. 2013. "Relative Entropy Derivative Bounds" Entropy 15, no. 7: 2861-2873. https://doi.org/10.3390/e15072861

Article Menu

Relative Entropy Derivative Bounds

Abstract

1. Introduction

2. Known Density Functions Case

3. Mixed Entropy Derivative: A Proxy for the Relative Entropy Derivative

4. Mixed Entropy Typical Set

5. Micro-Differences in Mixed Entropy Typical Sets

6. Mixed Information

7. Mixed Entropy Derivative Bounds

8. Unknown Source Density Function, $f_{X}$

9. An Example: Mismatched Density Functions

9.1. Known Density Functions Case

9.2. Compliance with Theorem 2

9.3. Unknown Source Density Function, $f_{X}$

10. Discussion

11. Conclusions

Acknowledgments

Conflict of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Relative Entropy Derivative Bounds

Abstract

1. Introduction

2. Known Density Functions Case

3. Mixed Entropy Derivative: A Proxy for the Relative Entropy Derivative

4. Mixed Entropy Typical Set

5. Micro-Differences in Mixed Entropy Typical Sets

6. Mixed Information

7. Mixed Entropy Derivative Bounds

8. Unknown Source Density Function, f X

9. An Example: Mismatched Density Functions

9.1. Known Density Functions Case

9.2. Compliance with Theorem 2

9.3. Unknown Source Density Function, f X

10. Discussion

11. Conclusions

Acknowledgments

Conflict of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

8. Unknown Source Density Function, $f_{X}$

9.3. Unknown Source Density Function, $f_{X}$