Relative Entropy Derivative Bounds

We show that the derivative of the relative entropy with respect to its parameters is lower and upper bounded. We characterize the conditions under which this derivative can reach zero. We use these results to explain when the minimum relative entropy and the maximum log likelihood approaches can be valid. We show that these approaches naturally activate in the presence of large data sets and that they are inherent properties of any density estimation process involving large numbers of random variables.


Introduction
Given a large ensemble of i.i.d. random variables, all generated according to a common density function, the asymptotic equipartition principle [1] guarantees that only a fraction of the possible ensembles, called the typical set, gathers almost all the probability ([2][p. 226]). This fact opens interesting possibilities: what if the properties of the typical set impose conditions on any estimation process, such that the parameter search is focused on just a small subset of the available parameter set? We explore this using generalizations of the Shannon differential entropy [1] and the Fisher information [3][4][5] under typical set considerations. There are other works that take into account both concepts [6][7][8]. However, this work focuses on something new: defining algebraic bounds on the behavior of the derivative of the relative entropy with respect to its parameters. Furthermore, we characterize the conditions under which seeking the extremum of this expression is a valid approach. Most importantly, we prove that these conditions automatically activate if large data sets are used. We finish this work by discussing the relation between these bounds and the density function estimation problem. Under the conditions that activate these bounds, we characterize when the minimum relative entropy and maximum log likelihood approaches render the same solution. Furthermore, we show that these equivalences become true by the sole fact of having a large number of random variables.

Known Density Functions Case
Let us assume for the ensuing discussion that there exists a known source density function, $f_X$, with support, $S$. We also know a set of i.i.d. samples, $\{x\}_n \equiv \{x_1, \ldots, x_n\}$, that was generated by $f_X$. There is also another density function, $f_{Y|\theta}(y)$, with the same support, $S$. This density function is indexed by the parameter vector, $\theta \equiv (\theta_1, \ldots, \theta_m) \in \Theta$. The first part of this work deals with this case, where all the density functions are known. The second part of this work studies the density function estimation problem, where the source density function, $f_X$, is not known.

Mixed Entropy Derivative: A Proxy for the Relative Entropy Derivative
We present the following definition:

Definition 1
The relative entropy ([2][p. 231]), also called Kullback-Leibler divergence, is defined by:

$$D_{KL}(f_X \| f_{Y|\theta}) \equiv \int f_X(x) \ln \frac{f_X(x)}{f_{Y|\theta}(x)} \, dx$$

Notice that $\lim_{u \to 0^+} u \ln u = 0$. Even though the definition is valid for any pair of density functions, we use those relevant to the ensuing analysis. The domain of this integral, and that of all the integrals in this work, corresponds to the support, $S$.

Definition 2
The mixed entropy, $h_M(f_X, f_{Y|\theta})$, of two density functions is defined by:

$$h_M(f_X, f_{Y|\theta}) \equiv -\int f_X(x) \ln f_{Y|\theta}(x) \, dx$$

when the integral exists.

Definition 3
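The Shannon differential entropy [1] is defined by:

$$h(f_X) \equiv -\int f_X(x) \ln f_X(x) \, dx$$

From these definitions, when $f_{Y|\theta} = f_X$, from a functional point of view, the mixed entropy is equal to the Shannon differential entropy:

$$h_M(f_X, f_X) = h(f_X)$$

From examination of the relative entropy definition:

$$D_{KL}(f_X \| f_{Y|\theta}) = h_M(f_X, f_{Y|\theta}) - h(f_X)$$

Hence:

$$\frac{\partial}{\partial \theta_k} D_{KL}(f_X \| f_{Y|\theta}) = \frac{\partial}{\partial \theta_k} h_M(f_X, f_{Y|\theta})$$

In the previous equation it was assumed that the derivatives $\frac{\partial}{\partial \theta_k}$ are defined in the interior of $\Theta$ for $k \in \{1, \ldots, m\}$. In this work, where we assume that $f_X$ does not depend on any $\theta_k$, we use extensively the fact that the relative entropy derivative is equal to the mixed entropy derivative. Hence, in the following, we focus on studying the properties of the mixed entropy in order to be able to say something about the properties of the relative entropy derivatives.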
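The relation between the relative entropy, the mixed entropy and the Shannon differential entropy can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes an exponential source with rate `LAM` and an exponential model with rate `theta` (both chosen only because they share the support $[0, \infty)$ and have closed forms), truncates the support at `UPPER`, and uses a plain midpoint rule. All names are hypothetical.

```python
import math

LAM = 2.0      # rate of the assumed exponential source density f_X
UPPER = 20.0   # truncation of the support [0, inf); the neglected tail is ~e^-40

def log_f_x(x):
    """ln f_X(x) for the assumed source, f_X(x) = LAM * exp(-LAM * x), x >= 0."""
    return math.log(LAM) - LAM * x

def log_f_y(x, theta):
    """ln f_{Y|theta}(x) for the assumed model, theta * exp(-theta * x), x >= 0."""
    return math.log(theta) - theta * x

def integrate(g, a=0.0, b=UPPER, steps=20_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def shannon_entropy():
    """h(f_X) = -int f_X ln f_X."""
    return integrate(lambda x: -math.exp(log_f_x(x)) * log_f_x(x))

def mixed_entropy(theta):
    """h_M(f_X, f_{Y|theta}) = -int f_X ln f_{Y|theta}."""
    return integrate(lambda x: -math.exp(log_f_x(x)) * log_f_y(x, theta))

def relative_entropy(theta):
    """D_KL(f_X || f_{Y|theta}) = int f_X ln(f_X / f_{Y|theta})."""
    return integrate(
        lambda x: math.exp(log_f_x(x)) * (log_f_x(x) - log_f_y(x, theta))
    )

def derivative(func, theta, step=1e-4):
    """Central finite difference of func at theta."""
    return (func(theta + step) - func(theta - step)) / (2.0 * step)

theta = 3.0
# Relative entropy versus mixed entropy minus Shannon entropy.
print(relative_entropy(theta), mixed_entropy(theta) - shannon_entropy())
# Both derivatives with respect to theta should agree.
print(derivative(relative_entropy, theta), derivative(mixed_entropy, theta))
```

Because $f_X$ does not depend on `theta` in this sketch, the two finite-difference derivatives agree up to rounding, which is the property the rest of the analysis relies on.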

Mixed Entropy Typical Set
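An interesting insight related to sequences of i.i.d. random variables, discovered by Shannon [1], is the usefulness of the weak law of large numbers to characterize large ensembles of random variables. In the context of this work, this law implies:

$$-\frac{1}{n} \sum_{i=1}^{n} \ln f_{Y|\theta}(x_i) \to h_M(f_X, f_{Y|\theta}) \quad \text{in probability}$$

Defining:

Definition 4
The likelihood, $f_{\{Y\}_n|\theta}$, is defined by:

$$f_{\{Y\}_n|\theta}(\{x\}_n) \equiv \prod_{i=1}^{n} f_{Y|\theta}(x_i)$$

we can state:

Theorem 1
For any positive real number, $\varepsilon$, and fixed parameter vector, $\theta$:

$$\lim_{n \to \infty} \Pr\left\{ \left| -\frac{1}{n} \ln f_{\{Y\}_n|\theta}(\{x\}_n) - h_M(f_X, f_{Y|\theta}) \right| < \varepsilon \right\} = 1$$

Proof: Use that the random variables are i.i.d., and the weak law of large numbers. □

Hence, it makes sense to define the following set ([2][p. 226]):

Definition 5
For any positive real number, $\varepsilon$, fixed parameter vector, $\theta$, and any $n \ge 1$, the mixed entropy typical set, $M_\varepsilon^n$, with respect to the density function, $f_X$, is defined by:

$$M_\varepsilon^n \equiv \left\{ \{x\}_n \in S^n : \left| -\frac{1}{n} \ln f_{\{Y\}_n|\theta}(\{x\}_n) - h_M(f_X, f_{Y|\theta}) \right| \le \varepsilon \right\}$$

Furthermore, using the weak law of large numbers, it is possible to prove:

Lemma 1
For any positive real number, $\varepsilon$, and any parameter vector, $\theta$, it is true that:

$$\lim_{n \to \infty} \Pr\left\{ \{x\}_n \in M_\varepsilon^n \right\} = 1$$

Proof: Use the weak law of large numbers in the mixed entropy typical set definition. □

This proves that for large values of $n$, almost all the probability is contained in the mixed entropy typical set and that the probability of being outside this set becomes negligible.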
Then, it is possible to prove that:

Lemma 2
For any positive real number, $\varepsilon$, and any fixed parameter vector, $\theta$, there exists $n_0 \equiv n_0(\varepsilon, \theta)$, such that for all $n > n_0$, the total probability of the typical set is lower bounded by:

$$\Pr\left\{ \{x\}_n \in M_\varepsilon^n \right\} > 1 - \varepsilon$$

Proof: Check ([2][p. 226], [9][p. 118]) for details. □
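The concentration of probability on the mixed entropy typical set can be observed with a small Monte Carlo experiment. The sketch below is only illustrative and rests on assumptions that are not in the original text: an exponential source with rate `LAM`, a Gaussian model with fixed `MU` and `SIGMA`, a tolerance of 0.1 and 100 trials per sample size. It takes membership in the typical set to mean that the normalized negative log likelihood lies within `eps` of the mixed entropy, and it uses the closed form of the mixed entropy for this particular pair of densities.

```python
import math
import random

LAM, MU, SIGMA = 1.0, 1.0, 1.0   # assumed source rate and Gaussian model parameters

def log_f_y(x):
    """ln of the assumed Gaussian model density f_{Y|theta} at x."""
    return (-0.5 * math.log(2.0 * math.pi * SIGMA**2)
            - (x - MU) ** 2 / (2.0 * SIGMA**2))

# Closed-form mixed entropy for an Exp(LAM) source and this Gaussian model:
# h_M = 0.5 * ln(2 * pi * sigma^2) + E[(X - mu)^2] / (2 * sigma^2),
# with E[(X - mu)^2] = 1 / LAM^2 + (1 / LAM - mu)^2.
H_M = 0.5 * math.log(2.0 * math.pi * SIGMA**2) + (
    1.0 / LAM**2 + (1.0 / LAM - MU) ** 2
) / (2.0 * SIGMA**2)

def in_typical_set(samples, eps):
    """True if -(1/n) ln f_{{Y}n|theta} is within eps of the mixed entropy."""
    n = len(samples)
    return abs(-sum(log_f_y(x) for x in samples) / n - H_M) <= eps

def typical_fraction(n, eps, trials, rng):
    """Fraction of simulated ensembles of size n that fall in the typical set."""
    hits = 0
    for _ in range(trials):
        samples = [rng.expovariate(LAM) for _ in range(n)]
        hits += in_typical_set(samples, eps)
    return hits / trials

rng = random.Random(0)
fractions = {
    n: typical_fraction(n, eps=0.1, trials=100, rng=rng)
    for n in (10, 100, 1000, 4000)
}
for n, frac in fractions.items():
    print(n, frac)
```

The printed fractions are expected to grow towards one as `n` increases, in line with Lemma 1; the exact values depend on the seed and on the assumed densities.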

Micro-Differences in Mixed Entropy Typical Sets
Assuming that $\{x\}_n \in M_\varepsilon^n$ and from the definition of mixed entropy typical set:

$$-n \left( h_M(f_X, f_{Y|\theta}) + \varepsilon \right) \le \ln f_{\{Y\}_n|\theta}(\{x\}_n) \le -n \left( h_M(f_X, f_{Y|\theta}) - \varepsilon \right)$$

The width of this interval is $2 \varepsilon n$. This allows us to compare $\ln f_{\{Y\}_n|\theta}$ with $-n h_M(f_X, f_{Y|\theta})$, the value that is obtained in the limit thanks to the weak law of large numbers. Their ratio is: In other words, given that $\varepsilon$ can be chosen as small as needed, the range of values that $\ln f_{\{Y\}_n|\theta}$ can take may be indistinguishable from $-n h_M(f_X, f_{Y|\theta})$, with high probability. This is another way of stating the weak law of large numbers.
The previous facts allow us to present the following definition:

Definition 6
When $\{x\}_n \in M_\varepsilon^n$, its associated micro-difference is defined by the following function: such that: hence: The analysis performed in the preceding paragraphs shows that the micro-differences might be negligible for large values of $n$. However, the following sections of this work deal with the derivative of the micro-difference value with respect to the parameters; so, we cannot neglect it, and we need to study its behavior.
From Equation (14): for all $k \in \{1, \ldots, m\}$. Thus: This last expression helps us to understand that the derivative of $\delta$ with respect to the parameters is related to changes experienced by quantities upper bounded by $\varepsilon$.

Mixed Information
We define:

Definition 7
The mixed information, $i_F(f_{\{X\}_n}, f_{\{Y\}_n|\theta_k})$, is defined by: with: when the integral exists.

Mixed Entropy Derivative Bounds
The main contribution of this work is the following pair of inequalities, which are obtained thanks to a combination of the mixed information and the mixed entropy.
Theorem 2
Given a positive real number, $\varepsilon$, any parameter vector, $\theta$, and an i.i.d. sequence with $n$ elements, then the components of $\nabla_\theta h_M(f_X, f_{Y|\theta})$ comply with: where: Proof: From the mixed information definition: where: and the symbol, ∼, denotes the complement set.
Substituting Equation (16) into Equation (26), we obtain: Using the Cauchy-Bunyakovsky-Schwarz inequality: Therefore: which implies: Hence, using the negative solution of the square root: Therefore: Taking into account the positive solution of Equation (30): The previous result allows us to state the following remarks:

Remark 1
This theorem states algebraic bounds on the derivative of the mixed entropy that are valid for any value of $n$.

Remark 2
The expression: refers to the accumulation of derivatives of very small differences. These differences become smaller and smaller as the weak law of large numbers takes effect.
Therefore, the density estimation problem can be framed as the following optimization program:

$$\min_{\theta \in \Theta} h_M(f_X, f_{Y|\theta}) \quad (36)$$

Remark 3
The solutions of this optimization program pose the following options: This happens when the estimator cannot implement the desired density function, which is beyond its reach. One way of checking whether the global minimum has been reached is to compare the mixed entropy value to that of the Shannon differential entropy. If they differ, the global minimum has not been reached yet.
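According to this result, it is straightforward to search for parameters that make true the following expression: However, this is out of reach in our density estimation problem, given that we do not have access to $f_X$. Thanks to the weak law of large numbers:

$$-\frac{1}{n} \sum_{i=1}^{n} \ln f_{Y|\theta}(x_i) \to h_M(f_X, f_{Y|\theta})$$

in probability. Thus, for sufficiently large values of $n$:

$$h_M(f_X, f_{Y|\theta}) \approx -\frac{1}{n} \sum_{i=1}^{n} \ln f_{Y|\theta}(x_i)$$

Hence, the optimization program posed in Equation (36) can be approximated by:

$$\min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^{n} \ln f_{Y|\theta}(x_i)$$

or:

$$\max_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} \ln f_{Y|\theta}(x_i) \quad (41)$$

which is the maximum log likelihood framework ([3][p. 261], [4][p. 260], [5][p. 65]).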
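The replacement of the mixed entropy by a sample average can be illustrated with a case in which the estimator family contains the source. The following sketch assumes, purely for illustration, an exponential source with rate `LAM` and an exponential model with rate `theta`; for this pair the maximum log likelihood solution is the reciprocal of the sample mean. The function names and the sample size are hypothetical choices.

```python
import math
import random

LAM = 2.0  # rate of the assumed exponential source; unknown to the estimator

def neg_mean_log_likelihood(samples, theta):
    """-(1/n) * sum of ln f_{Y|theta}(x_i) for an exponential model with rate theta."""
    return -sum(math.log(theta) - theta * x for x in samples) / len(samples)

def ml_rate(samples):
    """Closed-form maximizer of the exponential log likelihood: 1 / sample mean."""
    return len(samples) / sum(samples)

rng = random.Random(0)
samples = [rng.expovariate(LAM) for _ in range(50_000)]

theta_hat = ml_rate(samples)
# Shannon differential entropy of the exponential source, 1 - ln(LAM).
source_entropy = 1.0 - math.log(LAM)

print("estimated rate:", theta_hat)
print("negative mean log likelihood:", neg_mean_log_likelihood(samples, theta_hat))
print("source differential entropy:", source_entropy)
```

With a large sample, the estimated rate should be close to `LAM` and the negative mean log likelihood close to the source differential entropy, which is what one expects when the model family can reproduce the source.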
Remark 4
As in the mixed entropy case, the maximum log likelihood framework can exhibit the following cases:
• The optimization program finds a parameter combination that makes the estimator equal to the unknown source density function, thus making true the following expression for a sufficiently large $n$:
• The estimator does not find the desired target function. This happens when the family of functions that can be implemented by the estimator does not include the target function. It also happens when the maximum log likelihood program is riddled with multiple local optima, and the optimization process gets trapped in one of them. Only under very specific conditions is it possible to obtain the global maximum of the maximum log likelihood problem [10]. In this case, we cannot use a comparison between the value of the log likelihood term and that of the Shannon differential entropy to determine the goodness of our solution, as we could do in the case of the mixed entropy, because calculating the latter is analogous to the very problem we are trying to solve.
Hence, it is natural to work with this approximated program and look for parameters that make the following expression true:

An Example: Mismatched Density Functions

Known Density Functions Case
Let us define a source density function:

$$f_X(x) = \lambda e^{-\lambda x} H(x)$$

with $x \in \mathbb{R}$ and $H(x)$ the Heaviside step function. This density function is completely defined by the parameter $\lambda$. Its differential entropy is given by:

$$h(f_X) = 1 - \ln \lambda$$

We also have a data set, $\{x\}_n$, generated by the source density function.
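The density function that depends on parameters is:

$$f_{Y|\theta}(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

with $x \in \mathbb{R}$, $\mu$ its mean and $\sigma$ its standard deviation. Its differential entropy is given by:

$$h(f_{Y|\theta}) = \frac{1}{2} \ln\left( 2 \pi e \sigma^2 \right)$$

which does not depend on the mean, $\mu$.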

Compliance with Theorem 2
We first analyze the bounds associated with parameter $\mu$. For these density functions: Hence: We calculate the mixed information for $\mu$: Hence: Also: and: Finally, we calculate the micro-structure term using Equation (17): This allows us to calculate: Now, for $\sigma$, we have: The mixed information for $\sigma$ is defined by: Hence: Again, the derivative of the mixed entropy with respect to $\sigma$: Hence, the associated micro-differences expression is: which allows us to calculate: du (61)

Unknown Source Density Function, $f_X$
If $f_X$ is unknown, then it is not possible to calculate the previous bounds. In this case, we resort to the optimization program specified in Equation (41). Hence, we use Equations (49) and (56) in order to obtain:

$$\mu = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \sigma^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 \quad (62)$$

Using this choice of values immediately makes both mixed information expressions described by Equations (51) and (58) equal to zero, without the need for large values of $n$. Furthermore, if we use: then the micro-differences integrals also become equal to zero. This example shows several things:
• The use of the maximum log likelihood solutions and large values of $n$ guarantees that the derivative of the mixed entropy equals zero; hence, the derivative of the relative entropy with respect to its parameters is zero, too.
• However, this is not enough to guarantee that the density functions will match each other (a Gaussian cannot estimate an exponential).The maximum log likelihood solution only guarantees that the estimator will do the best it can do.
• If not enough data is available, the derivative of the mixed entropy will be different from zero, because the micro-differences terms are not zero.
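The mismatch described in this example can be made concrete with a short numerical sketch. It is not taken from the original derivation: it assumes an exponential source with a hypothetical rate `LAM`, fits a Gaussian by the usual sample mean and sample standard deviation, and uses the closed form of the mixed entropy between an exponential source and a Gaussian model, $h_M = \frac{1}{2} \ln(2 \pi \sigma^2) + E[(X - \mu)^2] / (2 \sigma^2)$.

```python
import math
import random

LAM = 1.5  # rate of the assumed exponential source density

def mixed_entropy_gauss(mu, sigma, lam=LAM):
    """Mixed entropy between an Exp(lam) source and a Gaussian(mu, sigma) model."""
    second_moment = 1.0 / lam**2 + (1.0 / lam - mu) ** 2  # E[(X - mu)^2]
    return 0.5 * math.log(2.0 * math.pi * sigma**2) + second_moment / (2.0 * sigma**2)

def gaussian_ml(samples):
    """Maximum log likelihood estimates of the Gaussian mean and standard deviation."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, math.sqrt(var)

# Shannon differential entropy of the source and the best mixed entropy a
# Gaussian can reach, attained at mu = 1/LAM and sigma = 1/LAM.
source_entropy = 1.0 - math.log(LAM)
best_mixed_entropy = mixed_entropy_gauss(1.0 / LAM, 1.0 / LAM)
kl_gap = best_mixed_entropy - source_entropy  # remaining relative entropy

rng = random.Random(0)
samples = [rng.expovariate(LAM) for _ in range(20_000)]
mu_hat, sigma_hat = gaussian_ml(samples)

print("ML estimates:", mu_hat, sigma_hat)
print("mixed entropy at ML estimates:", mixed_entropy_gauss(mu_hat, sigma_hat))
print("best Gaussian mixed entropy:", best_mixed_entropy)
print("source differential entropy:", source_entropy)
print("remaining relative entropy:", kl_gap)
```

In this sketch the fitted parameters approach the minimizer of the mixed entropy, yet the mixed entropy stays above the source differential entropy by $\frac{1}{2} \ln(2 \pi e) - 1$, independently of `LAM`. This is the sense in which the estimator does the best it can without matching the source.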

Discussion
If one wanted to estimate a density function from a data set, one could minimize the relative entropy: once we reach its minimum, we can guarantee that the density function implemented by the estimator will be equal to that which originated the data. This is true only if the family of functions that the estimator can effectively implement includes the source density function. Moreover, realizing that it is not possible to measure the relative entropy, one would quickly settle for maximizing the log likelihood to obtain the desired density function. Keeping this in mind, the results presented in this work may seem just another way of saying the same thing. However, this is not so. The difference is subtle. Whereas there are many density function estimation principles and it could be discussed whether minimizing the relative entropy is the most convenient one, the bounds presented in this work show that minimizing the relative entropy automatically activates in the presence of large data sets and that one does not need to decide whether that estimation framework is the most convenient or not. Something similar happens with the weak law of large numbers: there are many ways of determining the expected value of the density function that generated the data, but it is well known that the weak law of large numbers guarantees that, in the presence of large data sets, this value can be effectively approximated by the average of the values. Our bounds, also based on the weak law of large numbers, state something similar: in the presence of large data sets, the source density function is perceived as the solution of the maximum log likelihood problem.

Conclusions
Theorem 2, which is the main result of this work, consists of a set of bounds on the partial derivative of the mixed entropy, which is a proxy for the relative entropy, with respect to the parameters of the estimator. These bounds relate the mixed information value, the capacity of the estimator to produce micro-differences and the number of random variables present in the analysis in such a way that it is possible to determine that these bounds activate automatically when the data set is large. If the micro-differences integral is zero, the mixed information derivative is zero for large data sets, independently of any other considerations. In other words, minimizing the mixed entropy is the preferred density estimation framework when large data sets are available.
Given the impossibility of estimating the mixed entropy, it is shown that the correct numerical approximation to this problem is finding the solution to the maximum log likelihood problem. Again, when the estimator is associated with a micro-differences integral equal to zero, this framework becomes the optimal one in the presence of large data sets.

Unknown Source Density Function, $f_X$
Now, we study what happens when the source density function, $f_X$, is not known. It can be proven that $D_{KL}(f_X \| f_{Y|\theta}) \ge 0$ ([2][p. 232]). Thus, from the corresponding definitions given at the start of this work:

$$h_M(f_X, f_{Y|\theta}) \ge h(f_X)$$