Fisher Information Properties

A set of Fisher information properties is presented in order to draw a parallel with similar properties of Shannon differential entropy. Already known properties are presented together with new ones, which include: (i) a generalization of mutual information for Fisher information; (ii) a new proof that Fisher information increases under conditioning; (iii) a proof that Fisher information decreases in Markov chains; and (iv) a bound on the estimation error using Fisher information. This last result is especially important because it complements Fano’s inequality, i.e., a lower bound for the estimation error, by showing that Fisher information can be used to define an upper bound for this error. In this way, it is shown that Shannon’s differential entropy, which quantifies the behavior of the random variable, and the Fisher information, which quantifies the internal structure of the density function that defines the random variable, can be used together to characterize the estimation error.


Introduction
The birth of information theory was signaled by the publication of Claude Shannon's work [1], which is based on studying the behavior of systems described by density functions. However, long before that work was published, Ronald Fisher had already published the definition of a quantity called Fisher information [2], a hard bound on the capacity to estimate the parameters that define a system [3,4]. Hence, this quantity regulates how well it is possible to determine the internal structure of a system and provides another point of view that can be used to study systems: how they are composed, what they are made of. This work springs from the belief that the combination of these approaches is what completely defines systems: their behavior (Shannon) and their architecture (Fisher). In the following, a series of published results is summarized, together with new results, in order to present a coherent set of Fisher information properties that will hopefully be useful for those who work with this quantity.

Fisher Information and Other Fields
One connection between Fisher information and the Shannon differential entropy was stated by Kullback [5] (p. 26), who proved that the second derivatives of the Kullback-Leibler divergence with respect to the parameters of the density function produce the Fisher information matrix terms. Related results were presented by Blahut [6] (p. 300) and Frieden [7] (p. 37). Another important result that also relates these two frameworks is de Bruijn's identity ([8,9] and [10] (p. 672)), which establishes a relation between the derivative of Shannon differential entropy and Fisher information when the underlying random variable is subject to Gaussian perturbations. This result was recently generalized to non-Gaussian perturbations [11,12]. A consequence of these results is the convolution inequality for Fisher information ([8,9,13-16]; [10] (p. 674)).
Others have been studying the relation between Fisher information and physics. Here, it is important to point out the extreme physical information principle derived by Frieden and others in order to establish a general framework that explains physics [7,17-20]. Of special interest has been the role of Fisher information in generating thermodynamical theory [7,17-22]. It is very common in these approaches to use a special case of Fisher information where the estimated parameter is a location parameter. In this work, only the original and general Fisher information definition is addressed, and not the later special case.
Even though Shannon's ideas have been part of the machine learning tool set for a long time, Fisher information has not followed the same track. Although Fisher information is intimately connected to estimation theory [23], its use in the development of learning systems is still limited. Nevertheless, Amari discovered that natural gradient descent, i.e., common gradient descent corrected with the Fisher information matrix terms, takes into account the topology in a more precise manner, allowing for more efficient training procedures [24,25]. Fisher information has also been used to design objective functions that lead the estimation procedure. One of them mixes maximum entropy with minimum Fisher information [26,27]. On the other hand, mixing Shannon's differential entropy, Fisher information and the central limit theorem has allowed proving that in the presence of large datasets, it is natural to search for minimum Kullback-Leibler, or equivalent, solutions [28].

Contribution of This Work
This work is focused on presenting already known properties of Fisher information [3,4,7,8,10,29-32] and introducing new ones, such that the reader can have a better grasp of Fisher information and its usefulness. The main results presented in this work are: (i) the generalization of the mutual information concept using Fisher information expressions; (ii) a new proof that conditioning under certain assumptions increases Fisher information; (iii) a proof that in Markov chains, the Fisher information decreases as the random variables become further away from the estimated parameter; and (iv) an upper bound on estimation error, which is regulated by the Fisher information.
This work is structured roughly in the same way as the first chapter of the well-known book of Cover and Thomas [30], in order to help the reader draw a parallel between Shannon and Fisher information.

Notation
In the following sections, vectors and matrices are denoted with a bold font [7,31]. Furthermore, density functions are denoted by $f_{X;\boldsymbol{\theta}} \equiv f_{X;\boldsymbol{\theta}}(x)$, where the $f$ is reserved for density functions, the subscript $X$ corresponds to the name of the random variable, $\boldsymbol{\theta}$ represents the parameters that define the density function and the symbol within the $(\cdot)$ stands for the instance of the random variable that is used to evaluate the density function. In this way, as an example, a different random variable could be denoted by $f_{Y;\boldsymbol{\theta}} \equiv f_{Y;\boldsymbol{\theta}}(y)$. A similar notation is used in [33].

Fisher Information
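Let there be a random variable $X$ and its associated density function $f_{X;\boldsymbol{\theta}} \equiv f_{X;\boldsymbol{\theta}}(x)$, which has a support $S$ and depends on a set of parameters represented by the vector $\boldsymbol{\theta} \in \Theta$. The value $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$. According to the original definition designed by Fisher to characterize maximum likelihood estimation [2]:

Definition 1 (Fisher Information). Given a random variable $X$ and its associated density function $f_{X;\boldsymbol{\theta}}(x)$, which depends on the parameter vector $\boldsymbol{\theta} \in \Theta$, and $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$, then the Fisher information associated with $\theta_k$ is defined by:

$$i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} \equiv \int_S \left( \frac{\partial \ln f_{X;\boldsymbol{\theta}}(x)}{\partial \theta_k} \right)^2 f_{X;\boldsymbol{\theta}}(x) \, dx \qquad (1)$$

From the definition, it is clear that $i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} \geq 0$.

Example 1. In a Gaussian case with mean $\mu$ and standard deviation $\eta$, the density function is given by:

$$f_{X;\mu,\eta}(x) = \frac{1}{\sqrt{2\pi}\,\eta} \, e^{-\frac{(x-\mu)^2}{2\eta^2}}$$

In this case:

$$\ln f_{X;\mu,\eta}(x) = -\ln\left(\sqrt{2\pi}\,\eta\right) - \frac{(x-\mu)^2}{2\eta^2}$$

If the parameter to be estimated is the mean $\mu$, the previous expression needs to be differentiated with respect to $\mu$:

$$\frac{\partial \ln f_{X;\mu,\eta}(x)}{\partial \mu} = \frac{x-\mu}{\eta^2}$$

Replacing into the definition of Fisher information:

$$i_F(f_{X;\mu,\eta})_{\mu} = \int_{-\infty}^{\infty} \frac{(x-\mu)^2}{\eta^4} \, f_{X;\mu,\eta}(x) \, dx = \frac{1}{\eta^2}$$

This shows that for Gaussian density functions, the Fisher information associated with the mean is the reciprocal of the variance of the density function. Through the Cramer-Rao bound (Theorem 2), this means that the variance of any unbiased estimator of the mean is at least $\eta^2$.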
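The Gaussian case of Example 1 can be checked numerically. The following is a minimal sketch that only assumes the definition of Fisher information as the expected squared derivative of the log-density; the function names, the finite-difference step `h` and the midpoint integration grid are illustrative choices and not part of the original work.

```python
import math

def gaussian_log_pdf(x, mu, eta):
    """Log of the Gaussian density with mean mu and standard deviation eta."""
    return -0.5 * math.log(2.0 * math.pi * eta ** 2) - (x - mu) ** 2 / (2.0 * eta ** 2)

def fisher_information_mean(mu, eta, h=1e-4, n_steps=20000, width=10.0):
    """Midpoint-rule estimate of the integral of (d ln f / d mu)^2 * f.

    The integration range [mu - width*eta, mu + width*eta] and the
    central-difference step h are assumptions of this sketch.
    """
    lo = mu - width * eta
    dx = 2.0 * width * eta / n_steps
    total = 0.0
    for i in range(n_steps):
        x = lo + (i + 0.5) * dx
        # Central difference approximation of the derivative of ln f w.r.t. mu.
        score = (gaussian_log_pdf(x, mu + h, eta)
                 - gaussian_log_pdf(x, mu - h, eta)) / (2.0 * h)
        total += score ** 2 * math.exp(gaussian_log_pdf(x, mu, eta)) * dx
    return total

# For eta = 2 the result should be close to 1 / eta**2 = 0.25.
print(fisher_information_mean(mu=1.0, eta=2.0))
```

The same routine can be pointed at any other smooth log-density by replacing `gaussian_log_pdf`, provided the integration range covers the support.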
There is another expression that can be used to represent the Fisher information.
Theorem 1. Given a random variable $X$ and its associated density function $f_{X;\boldsymbol{\theta}}(x)$, which depends on the parameter vector $\boldsymbol{\theta} \in \Theta$ and complies with the boundary condition for $\theta_k$ (see Appendix A), where $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$, then the Fisher information associated with $\theta_k$ is equal to:

$$i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} = -\int_S \frac{\partial^2 \ln f_{X;\boldsymbol{\theta}}(x)}{\partial \theta_k^2} \, f_{X;\boldsymbol{\theta}}(x) \, dx \qquad (12)$$

A proof of this theorem can be found in [34] (p. 373).

Example 2. Continuing the Gaussian example, and using the alternative definition of the Fisher information, the required second derivative is first calculated:

$$\frac{\partial^2 \ln f_{X;\mu,\eta}(x)}{\partial \mu^2} = -\frac{1}{\eta^2}$$

Replacing into Equation (12), the same result is obtained:

$$i_F(f_{X;\mu,\eta})_{\mu} = \int_{-\infty}^{\infty} \frac{1}{\eta^2} \, f_{X;\mu,\eta}(x) \, dx = \frac{1}{\eta^2}$$

The importance of the Fisher information quantity stems from the Cramer-Rao bound [3,4,23,35]:

Theorem 2 (Cramer-Rao Bound). Given a random variable $X$ and its associated density function $f_{X;\boldsymbol{\theta}}(x)$, which depends on the parameter vector $\boldsymbol{\theta} \in \Theta$ and complies with the boundary condition for $\theta_k$ (see Appendix A), where $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$, also given that there is an unbiased estimator $\hat{\theta}_k(x)$ of the scalar parameter $\theta_k$, then:

$$\sigma^2_{\hat{\theta}_k} \geq \frac{1}{i_F(f_{X;\boldsymbol{\theta}})_{\theta_k}}$$

where:

$$\sigma^2_{\hat{\theta}_k} = \int_S \left( \hat{\theta}_k(x) - \theta_k \right)^2 f_{X;\boldsymbol{\theta}}(x) \, dx$$

is the variance of the estimator. Proofs of this theorem can be found in [7] (p. 29) and [23] (p. 66).
The Cramer-Rao bound establishes that the reciprocal of the Fisher information is a lower bound of the variance of an unbiased estimator. Any estimator that reaches the bound imposed by the Cramer-Rao theorem is called efficient [34]. It is important to notice that the bound does not depend on the estimator itself; it only depends on $i_F(f_{X;\boldsymbol{\theta}})_{\theta_k}$. In this work, neither the case of biased estimators nor the case where the parameters themselves are random variables is analyzed.
The following theorem states that the topology of the Fisher information in the density function space is very simple:

Theorem 3. The Fisher information $i_F(f_{X;\boldsymbol{\theta}})_{\theta_k}$ is convex in $f_{X;\boldsymbol{\theta}}$.

Proofs of this theorem can be found in [7] (p. 69) and [29].

Joint Fisher Information Definition
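Definition 2. Given two random variables $X$ and $Y$ and the associated joint density function $f_{X,Y;\boldsymbol{\theta}}(x, y)$, which depends on the parameter vector $\boldsymbol{\theta} \in \Theta$, and $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$, then the joint Fisher information associated with $\theta_k$ is defined by:

$$i_F(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \equiv \iint \left( \frac{\partial \ln f_{X,Y;\boldsymbol{\theta}}(x,y)}{\partial \theta_k} \right)^2 f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$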

An Equivalent Joint Fisher Information Definition
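Theorem 4. Given two random variables $X$ and $Y$ and the associated joint density function $f_{X,Y;\boldsymbol{\theta}}(x, y)$, which depends on the parameter vector $\boldsymbol{\theta} \in \Theta$ and complies with the boundary condition for $\theta_k$ (see Appendix A), where $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$, then the joint Fisher information associated with $\theta_k$ is equal to:

$$i_F(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = -\iint \frac{\partial^2 \ln f_{X,Y;\boldsymbol{\theta}}(x,y)}{\partial \theta_k^2} \, f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$

Proof. This follows trivially from the alternative definition of the Fisher information.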

Conditional Fisher Information Definition
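Definition 3. Given two random variables $X$ and $Y$ and the associated conditional density function $f_{Y|X;\boldsymbol{\theta}}(y|x)$, which depends on the parameter vector $\boldsymbol{\theta} \in \Theta$, and $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$, then the conditional Fisher information associated with $\theta_k$ is defined by:

$$i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} \equiv \iint \left( \frac{\partial \ln f_{Y|X;\boldsymbol{\theta}}(y|x)}{\partial \theta_k} \right)^2 f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$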

Chain Rule for Two Random Variables
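The following result was first published by Zamir [32], who used it to produce an alternative proof of the Fisher information inequality. In the following lines, the same chain rule is proven using the results presented in the previous sections.
Theorem 5 (Chain Rule for Two Random Variables). Given a joint density function $f_{X,Y;\boldsymbol{\theta}}(x, y)$, which depends on the parameter vector $\boldsymbol{\theta} \in \Theta$, and given that the density functions comply with the boundary condition for $\theta_k$ (see Appendix A), where $\theta_k$ is the $k$-th component of $\boldsymbol{\theta}$, then:

$$i_F(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} + i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} + i_F(f_{X|Y;\boldsymbol{\theta}})_{\theta_k}$$

Proof. Using Theorem 4 and $f_{X,Y;\boldsymbol{\theta}} = f_{X;\boldsymbol{\theta}} \, f_{Y|X;\boldsymbol{\theta}}$:

$$i_F(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = -\iint \frac{\partial^2 \ln f_{X;\boldsymbol{\theta}}(x)}{\partial \theta_k^2} f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy - \iint \frac{\partial^2 \ln f_{Y|X;\boldsymbol{\theta}}(y|x)}{\partial \theta_k^2} f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$

The first term is $i_F(f_{X;\boldsymbol{\theta}})_{\theta_k}$ by Theorem 1. If $f_{Y|X;\boldsymbol{\theta}}(y|x)$ complies with the boundary condition with respect to $\theta_k$ (see Appendix A), then:

$$-\iint \frac{\partial^2 \ln f_{Y|X;\boldsymbol{\theta}}(y|x)}{\partial \theta_k^2} f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy = \iint \left( \frac{\partial \ln f_{Y|X;\boldsymbol{\theta}}(y|x)}{\partial \theta_k} \right)^2 f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy = i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k}$$

Therefore, the theorem is proven. The other result is proven analogously.
When the chain rule is used to estimate the Fisher information associated with a parameter, it is important to take into account that all of the terms that come out after applying the chain rule contain derivatives with respect to the same parameter. Because some of these terms may involve density functions that do not depend on the parameter, some of these terms may be equal to zero.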
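Example 3. Given the random variable $Y = X + N$, where $X$ is a Gaussian random variable with mean $\mu$ and standard deviation $\eta$ and $N$ is another Gaussian random variable, independent of $X$, with mean zero and standard deviation $\nu$, if the joint density function is available, and the parameter to be estimated is $\mu$, then:

$$i_F(f_{X,Y;\mu,\eta,\nu})_{\mu} = i_F(f_{X;\mu,\eta})_{\mu} + i_F(f_{Y|X;\nu})_{\mu} = \frac{1}{\eta^2} + 0 = \frac{1}{\eta^2}$$

The conditional term vanishes because $f_{Y|X;\nu}(y|x)$ is a Gaussian density centered at $x$, which does not depend on $\mu$. The previous result implies that if the joint density function of the output $Y$ and the input $X$ is available, the noise does not affect the estimation process. This is not surprising, since $Y$ is a corrupted version of $X$, and it cannot provide more information on $\mu$ than that contained in $X$. Because all of the information hidden in $X$ is available through the joint density function, it makes sense that the Fisher information of the joint density function corresponds to that of the marginal density function $f_{X;\mu,\eta}$.
Given the density functions mentioned above, it is possible to prove that:

$$f_{Y;\mu,\eta,\nu}(y) = \frac{1}{\sqrt{2\pi(\eta^2+\nu^2)}} \, e^{-\frac{(y-\mu)^2}{2(\eta^2+\nu^2)}}$$

with Fisher information associated with $\mu$ equal to:

$$i_F(f_{Y;\mu,\eta,\nu})_{\mu} = \frac{1}{\eta^2+\nu^2}$$

Using the other expression for the chain rule:

$$i_F(f_{X,Y;\mu,\eta,\nu})_{\mu} = i_F(f_{Y;\mu,\eta,\nu})_{\mu} + i_F(f_{X|Y;\mu,\eta,\nu})_{\mu}$$

Using the previous results:

$$\frac{1}{\eta^2} = \frac{1}{\eta^2+\nu^2} + i_F(f_{X|Y;\mu,\eta,\nu})_{\mu}$$

which implies:

$$i_F(f_{X|Y;\mu,\eta,\nu})_{\mu} = \frac{\nu^2}{\eta^2(\eta^2+\nu^2)}$$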
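Both decompositions of the joint Fisher information in this example can be checked by simulation. The sketch below assumes that $X$ and $N$ are independent and uses the standard Gaussian conditioning formulas for $f_{X|Y}$, which are not spelled out in the text; the function name, sample size and seed are illustrative.

```python
import random

def chain_rule_terms(mu, eta, nu, n_samples=100000, seed=0):
    """Monte Carlo estimates of the Fisher information terms for Y = X + N.

    Returns (i_F(f_X), i_F(f_Y), i_F(f_X|Y)) with respect to mu.
    i_F(f_Y|X) is exactly zero because f_Y|X does not depend on mu.
    """
    rng = random.Random(seed)
    var_y = eta ** 2 + nu ** 2
    # Assumed closed form: X | Y = y is Gaussian with this variance.
    var_x_given_y = eta ** 2 * nu ** 2 / var_y
    info_x = info_y = info_x_given_y = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, eta)
        y = x + rng.gauss(0.0, nu)
        score_x = (x - mu) / eta ** 2
        score_y = (y - mu) / var_y
        # Conditional mean of X given Y = y; its derivative w.r.t. mu is nu^2 / var_y.
        mean_x_given_y = mu + eta ** 2 * (y - mu) / var_y
        score_x_given_y = (x - mean_x_given_y) / var_x_given_y * (nu ** 2 / var_y)
        info_x += score_x ** 2
        info_y += score_y ** 2
        info_x_given_y += score_x_given_y ** 2
    return info_x / n_samples, info_y / n_samples, info_x_given_y / n_samples

i_x, i_y, i_x_given_y = chain_rule_terms(mu=0.0, eta=2.0, nu=1.0)
# Both sums estimate the joint Fisher information 1 / eta**2.
print(i_x + 0.0, i_y + i_x_given_y)
```

With $\eta = 2$ and $\nu = 1$, the three estimates should be close to $1/4$, $1/5$ and $1/20$, so that both sides of the chain rule agree up to sampling noise.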

Chain Rule for Many Random Variables
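In the case of more than two random variables:

Theorem 6 (Chain Rule for Many Random Variables). Given a set of $n$ random variables $X_1, X_2, \ldots, X_n$, all of them depending on $\theta_k$, if the density functions comply with the boundary condition for $\theta_k$ (see Appendix A), then:

$$i_F(f_{X_1,X_2,\ldots,X_n;\boldsymbol{\theta}})_{\theta_k} = \sum_{i=1}^{n} i_F(f_{X_i|X_{i-1},\ldots,X_1;\boldsymbol{\theta}})_{\theta_k}$$

where the term for $i = 1$ is $i_F(f_{X_1;\boldsymbol{\theta}})_{\theta_k}$.

Proof. Applying Theorem 5 repeatedly:

$$i_F(f_{X_1,X_2,\ldots,X_n;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X_1;\boldsymbol{\theta}})_{\theta_k} + i_F(f_{X_2,\ldots,X_n|X_1;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X_1;\boldsymbol{\theta}})_{\theta_k} + i_F(f_{X_2|X_1;\boldsymbol{\theta}})_{\theta_k} + i_F(f_{X_3,\ldots,X_n|X_2,X_1;\boldsymbol{\theta}})_{\theta_k} = \cdots$$

If the $n$ random variables in Theorem 6 are i.i.d., then $i_F(f_{X_1,X_2,\ldots,X_n;\boldsymbol{\theta}})_{\theta_k} = n \, i_F(f_{X;\boldsymbol{\theta}})_{\theta_k}$.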
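For i.i.d. observations, the chain rule combined with the Cramer-Rao bound gives a variance floor of $1 / (n \, i_F)$ for unbiased estimators. The following sketch illustrates this for Gaussian samples, where $i_F = 1/\eta^2$ for the mean; the comparison between the sample mean and the sample median, the sample sizes and the seed are choices made here for illustration and do not come from the original work.

```python
import random
import statistics

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def estimator_variances(mu, eta, n, trials=20000, seed=0):
    """Empirical variances of the sample mean and the sample median
    over repeated draws of n i.i.d. Gaussian observations."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(mu, eta) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(statistics.median(sample))
    return variance(means), variance(medians)

mu, eta, n = 0.0, 2.0, 9
# Joint Fisher information of n i.i.d. samples is n / eta**2,
# so the Cramer-Rao bound is eta**2 / n.
cramer_rao_bound = eta ** 2 / n
var_mean, var_median = estimator_variances(mu, eta, n)
print(cramer_rao_bound, var_mean, var_median)
```

In this setting the sample mean should sit at the bound up to sampling noise (it is efficient), while the sample median, which is also unbiased by symmetry, should show a visibly larger variance.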

Relative Fisher Information Type I
In the following, the relative Fisher information is defined. As far as it was possible to determine, the first definition of the relative Fisher information was given by Otto and Villani [36], who defined it for the translationally-invariant case. This expression has since been rediscovered or simply used in many applications in different problems and fields [22,37-44]. Furthermore, it seems that the first general analysis of the relative Fisher information was presented by the author in [45]. The following sections focus on this latter general case, where there is no assumption of translational invariance.
Analogously to the Kullback-Leibler divergence [46], also known as relative entropy, which was designed to establish how much two density functions differ, the relative Fisher information of Type I is obtained when the ratio of the two intervening density functions is replaced into Equation (1), as is shown in the following definition.

Definition 4. The relative Fisher information Type I is defined by:

The same mechanism can be used to generate a second definition for the relative Fisher information. The same ratio can be replaced into Equation (12), producing an alternative and equally valid expression, which is designated as relative Fisher information Type II. This second expression is studied in the following sections.

Information Correlation
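Definition 5. The information correlation with respect to $\theta_k$ is defined by:

$$i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \equiv \iint \frac{\partial \ln f_{X;\boldsymbol{\theta}}(x)}{\partial \theta_k} \, \frac{\partial \ln f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \, f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$

The name information correlation comes from the similarity between this definition and that of the classical correlation coefficient. It is important to keep in mind that it is different from the terms that fill the Fisher information matrix [23].
According to the definition, $i_C(f_{X,X;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X;\boldsymbol{\theta}})_{\theta_k}$.

Example 4. Continuing with the example where $Y = X + N$, the information correlation between $Y$ and $X$ is given by:

$$i_C(f_{X,Y;\mu,\eta,\nu})_{\mu} = \iint \frac{\partial \ln f_{X;\mu,\eta}(x)}{\partial \mu} \, \frac{\partial \ln f_{Y;\mu,\eta,\nu}(y)}{\partial \mu} \, f_{X,Y;\mu,\eta,\nu}(x,y) \, dx \, dy$$

where:

$$\frac{\partial \ln f_{X;\mu,\eta}(x)}{\partial \mu} = \frac{x-\mu}{\eta^2}$$

Analogously:

$$\frac{\partial \ln f_{Y;\mu,\eta,\nu}(y)}{\partial \mu} = \frac{y-\mu}{\eta^2+\nu^2}$$

Replacing these derivatives into the information correlation expression:

$$i_C(f_{X,Y;\mu,\eta,\nu})_{\mu} = \iint \frac{(x-\mu)(y-\mu)}{\eta^2(\eta^2+\nu^2)} \, f_{X,Y;\mu,\eta,\nu}(x,y) \, dx \, dy = \frac{1}{\eta^2+\nu^2}$$

Theorem 7. The information correlation is bounded according to:

$$\left( i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \right)^2 \leq i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} \, i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k}$$

Proof. For any real number $a$:

$$\iint \left( a \, \frac{\partial \ln f_{X;\boldsymbol{\theta}}(x)}{\partial \theta_k} + \frac{\partial \ln f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \right)^2 f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy \geq 0$$

which can be reexpressed as:

$$a^2 \, i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} + 2a \, i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} + i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} \geq 0$$

This is a second degree inequality that is true for every possible $a$. Because the left-hand side is always greater than or equal to zero, the discriminant has to comply with

$$4 \left( i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \right)^2 - 4 \, i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} \, i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} \leq 0$$

which proves the theorem.

Definition 6. The information correlation coefficient is defined by:

$$\rho_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \equiv \frac{i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k}}{\sqrt{i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} \, i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k}}}$$

Theorem 8. The information correlation coefficient is limited by:

$$-1 \leq \rho_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \leq 1$$

Proof. This comes from the definition of the information correlation coefficient and Theorem 7.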
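Theorem 9. If at least one of the following conditions is true:
(1) $X$ and $Y$ are independent;
(2) either $f_{X;\boldsymbol{\theta}}$ or $f_{Y;\boldsymbol{\theta}}$ does not depend on $\theta_k$;
then:

$$i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = 0$$

Proof. Examination of the information correlation definition shows that either case directly implies that this quantity is zero. In the first case, the double integral factors into the product of the expected values of the two derivatives, each of which is zero under the boundary condition (see Appendix A). In the second case, one of the derivatives is identically zero.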
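The bound on the information correlation can be illustrated with the running example $Y = X + N$. The sketch below assumes that the information correlation is the expected product of the derivatives of the two marginal log-densities, and that the coefficient normalizes it by the geometric mean of the two Fisher informations; both forms are reconstructions consistent with the surrounding text rather than quotations of it, and the function name, sample size and seed are illustrative.

```python
import math
import random

def information_correlation_example(mu, eta, nu, n_samples=100000, seed=0):
    """Monte Carlo estimates of (i_C, i_F(f_X), i_F(f_Y), coefficient)
    with respect to mu for Y = X + N with independent Gaussian X and N."""
    rng = random.Random(seed)
    sum_xy = sum_xx = sum_yy = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, eta)
        y = x + rng.gauss(0.0, nu)
        score_x = (x - mu) / eta ** 2                # d ln f_X / d mu
        score_y = (y - mu) / (eta ** 2 + nu ** 2)    # d ln f_Y / d mu
        sum_xy += score_x * score_y
        sum_xx += score_x ** 2
        sum_yy += score_y ** 2
    i_c = sum_xy / n_samples
    i_x = sum_xx / n_samples
    i_y = sum_yy / n_samples
    return i_c, i_x, i_y, i_c / math.sqrt(i_x * i_y)

i_c, i_x, i_y, rho = information_correlation_example(mu=0.0, eta=2.0, nu=1.0)
print(i_c, rho)
```

Under these assumptions, the estimate of the information correlation should be close to $1/(\eta^2+\nu^2) = 0.2$ and the coefficient close to $\eta / \sqrt{\eta^2+\nu^2} \approx 0.894$, which respects the bound of Theorem 8.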

Mutual Fisher Information Type I
As happens in Shannon's differential entropy handling, in this work, mutual Fisher information is also defined as relative Fisher information Type I, where the argument is the ratio between a joint density function and the product of its marginals.
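From the definition, it is obvious that $m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \geq 0$.
Theorem 10. If the boundary condition (see Appendix A) with respect to $\theta_k$ holds for $f_{X,Y;\boldsymbol{\theta}}(x, y)$, the mutual Fisher information Type I can be reformulated as a function of the Fisher information as follows:

$$m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} + 2 \, i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X|Y;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} + 2 \, i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k}$$

Proof. From the definition, and using $f_{X,Y;\boldsymbol{\theta}} = f_{X;\boldsymbol{\theta}} \, f_{Y|X;\boldsymbol{\theta}}$:

$$m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = \iint \left( \frac{\partial \ln f_{Y|X;\boldsymbol{\theta}}(y|x)}{\partial \theta_k} - \frac{\partial \ln f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \right)^2 f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$

Simplifying:

$$m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} + i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} - 2 \iint \frac{\partial \ln f_{Y|X;\boldsymbol{\theta}}(y|x)}{\partial \theta_k} \, \frac{\partial \ln f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \, f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$

Assuming that $f_{X,Y;\boldsymbol{\theta}}$ complies with the boundary condition (see Appendix A) with respect to $\theta_k$, then:

$$\iint \frac{\partial \ln f_{X,Y;\boldsymbol{\theta}}(x,y)}{\partial \theta_k} \, \frac{\partial \ln f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \, f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy = \int \frac{\partial f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \, \frac{\partial \ln f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \, dy = i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k}$$

Using the previous result and $\ln f_{Y|X;\boldsymbol{\theta}} = \ln f_{X,Y;\boldsymbol{\theta}} - \ln f_{X;\boldsymbol{\theta}}$, it is obtained:

$$\iint \frac{\partial \ln f_{Y|X;\boldsymbol{\theta}}(y|x)}{\partial \theta_k} \, \frac{\partial \ln f_{Y;\boldsymbol{\theta}}(y)}{\partial \theta_k} \, f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy = i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} - i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k}$$

This implies:

$$m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} + 2 \, i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k}$$

The other result is obtained analogously.
Example 5. Continuing with the example where $Y = X + N$, the mutual Fisher information Type I is given by:

$$m_F^{(I)}(f_{X,Y;\mu,\eta,\nu})_{\mu} = i_F(f_{Y|X;\nu})_{\mu} - i_F(f_{Y;\mu,\eta,\nu})_{\mu} + 2 \, i_C(f_{X,Y;\mu,\eta,\nu})_{\mu} = 0 - \frac{1}{\eta^2+\nu^2} + \frac{2}{\eta^2+\nu^2} = \frac{1}{\eta^2+\nu^2} \qquad (87)$$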

Conditional Mutual Fisher Information of Type I
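Definition 8. The conditional information correlation with respect to $\theta_k$ of random variables $X$ and $Y$ given random variable $Z$ is defined by:

$$i_C(f_{X,Y|Z;\boldsymbol{\theta}})_{\theta_k} \equiv \iiint \frac{\partial \ln f_{X|Z;\boldsymbol{\theta}}(x|z)}{\partial \theta_k} \, \frac{\partial \ln f_{Y|Z;\boldsymbol{\theta}}(y|z)}{\partial \theta_k} \, f_{X,Y,Z;\boldsymbol{\theta}}(x,y,z) \, dx \, dy \, dz$$

Definition 9. The conditional mutual Fisher information of Type I of random variables $X$ and $Y$ given random variable $Z$ is defined by:

$$m_F^{(I)}(f_{X,Y|Z;\boldsymbol{\theta}})_{\theta_k} \equiv \iiint \left( \frac{\partial}{\partial \theta_k} \ln \frac{f_{X,Y|Z;\boldsymbol{\theta}}(x,y|z)}{f_{X|Z;\boldsymbol{\theta}}(x|z) \, f_{Y|Z;\boldsymbol{\theta}}(y|z)} \right)^2 f_{X,Y,Z;\boldsymbol{\theta}}(x,y,z) \, dx \, dy \, dz$$

Corollary 1. If the boundary condition (see Appendix A) with respect to $\theta_k$ holds for $f_{X,Y,Z;\boldsymbol{\theta}}(x, y, z)$, the conditional mutual Fisher information of Type I of random variables $X$ and $Y$ given random variable $Z$ can be reformulated as a function of the Fisher information as follows:

$$m_F^{(I)}(f_{X,Y|Z;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{Y|X,Z;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{Y|Z;\boldsymbol{\theta}})_{\theta_k} + 2 \, i_C(f_{X,Y|Z;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X|Y,Z;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{X|Z;\boldsymbol{\theta}})_{\theta_k} + 2 \, i_C(f_{X,Y|Z;\boldsymbol{\theta}})_{\theta_k}$$

Proof. This follows analogously to that of the simpler case.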

Relative Fisher Information Type II
Given that there is an alternative expression for the Fisher information (see Equation (12)), there is another way of defining the relative Fisher information expression.

Definition 10. The relative Fisher information Type II is defined by:

Even though both definitions for the relative Fisher information are derived from equivalent expressions, they are not equivalent. Why is this so? This is because the argument of the Fisher information definition is a density function, whereas the argument of the relative Fisher information expression is a ratio of density functions, not a density function, hence their difference.

Mutual Fisher Information Type II
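Analogously to the definition of the mutual Fisher information Type I, but in this case using the relative Fisher information of Type II, the following definition is obtained:

Definition 11. The mutual Fisher information Type II is defined by:

$$m_F^{(II)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \equiv -\iint \frac{\partial^2}{\partial \theta_k^2} \ln \left( \frac{f_{X,Y;\boldsymbol{\theta}}(x,y)}{f_{X;\boldsymbol{\theta}}(x) \, f_{Y;\boldsymbol{\theta}}(y)} \right) f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$

Theorem 11. The mutual Fisher information Type II can be reformulated as a function of the Fisher information as follows:

$$m_F^{(II)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{X;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k}$$

Proof. Expanding the logarithm in the definition:

$$m_F^{(II)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = -\iint \frac{\partial^2 \ln f_{X,Y;\boldsymbol{\theta}}}{\partial \theta_k^2} f_{X,Y;\boldsymbol{\theta}} \, dx \, dy + \iint \frac{\partial^2 \ln f_{X;\boldsymbol{\theta}}}{\partial \theta_k^2} f_{X,Y;\boldsymbol{\theta}} \, dx \, dy + \iint \frac{\partial^2 \ln f_{Y;\boldsymbol{\theta}}}{\partial \theta_k^2} f_{X,Y;\boldsymbol{\theta}} \, dx \, dy$$

from which the theorem follows after applying Theorems 1 and 4 to each term.

Corollary 2. The mutual Fisher information Type II can also be expressed as:

$$m_F^{(II)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{X|Y;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{X;\boldsymbol{\theta}})_{\theta_k}$$

Proof. This comes from combining Theorem 11 and the chain rule for Fisher information.
Example 6. For the example where $Y = X + N$, the mutual Fisher information Type II is given by:

$$m_F^{(II)}(f_{X,Y;\mu,\eta,\nu})_{\mu} = i_F(f_{Y|X;\nu})_{\mu} - i_F(f_{Y;\mu,\eta,\nu})_{\mu} = -\frac{1}{\eta^2+\nu^2}$$

Corollary 3. The two types of mutual Fisher information are related by:

$$m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = m_F^{(II)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} + 2 \, i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k}$$

Proof. This can be deduced from the mutual Fisher information theorems.
Given that $m_F^{(I)}$ is always greater than or equal to zero, the expression $m_F^{(II)}$ can be positive or negative according to the value of the information correlation.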

Lower Bound for Fisher Information
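Stam's inequality [8,9,40,47-50] states a lower bound for Fisher information, which links Fisher information and Shannon's entropy power. However, this expression is limited to the special case where the parameter in the Fisher information expression is a location parameter.
A more general result was recently proven by Stein et al. [51], which says that given a multidimensional random variable with density function $f_{\mathbf{X};\boldsymbol{\theta}}$ with:

$$\boldsymbol{\mu}(\boldsymbol{\theta}) = \int \mathbf{x} \, f_{\mathbf{X};\boldsymbol{\theta}}(\mathbf{x}) \, d\mathbf{x}, \qquad \boldsymbol{\Sigma}(\boldsymbol{\theta}) = \int (\mathbf{x} - \boldsymbol{\mu}(\boldsymbol{\theta}))(\mathbf{x} - \boldsymbol{\mu}(\boldsymbol{\theta}))^T f_{\mathbf{X};\boldsymbol{\theta}}(\mathbf{x}) \, d\mathbf{x}$$

If the Fisher information matrix is defined by:

$$\mathbf{F}(\boldsymbol{\theta}) = \int \left( \frac{\partial \ln f_{\mathbf{X};\boldsymbol{\theta}}(\mathbf{x})}{\partial \boldsymbol{\theta}} \right)^T \left( \frac{\partial \ln f_{\mathbf{X};\boldsymbol{\theta}}(\mathbf{x})}{\partial \boldsymbol{\theta}} \right) f_{\mathbf{X};\boldsymbol{\theta}}(\mathbf{x}) \, d\mathbf{x}$$

then:

$$\mathbf{F}(\boldsymbol{\theta}) \succeq \left( \frac{\partial \boldsymbol{\mu}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right)^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}) \left( \frac{\partial \boldsymbol{\mu}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right)$$

if $\frac{\partial \boldsymbol{\mu}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$ exists. The authors of [51] explain that this is the same as saying that:

$$\mathbf{a}^T \left( \mathbf{F}(\boldsymbol{\theta}) - \left( \frac{\partial \boldsymbol{\mu}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right)^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}) \left( \frac{\partial \boldsymbol{\mu}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right) \right) \mathbf{a} \geq 0 \quad \text{for every vector } \mathbf{a}$$

The previous expression states that the difference of matrices within the large parentheses is a positive semi-definite matrix. Thus, its diagonal elements are non-negative, and it can be stated:

Corollary 4. The following lower bound for Fisher information holds:

$$i_F(f_{\mathbf{X};\boldsymbol{\theta}})_{\theta_k} \geq \sum_{i} \sum_{j} \frac{\partial \mu_i(\boldsymbol{\theta})}{\partial \theta_k} \, c^{-1}_{ij} \, \frac{\partial \mu_j(\boldsymbol{\theta})}{\partial \theta_k}$$

where $\mu_i(\boldsymbol{\theta})$ is the $i$-th component of $\boldsymbol{\mu}(\boldsymbol{\theta})$ and $c^{-1}_{ij}$ stands for the $ij$-th element of $\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta})$.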
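For a scalar random variable and a scalar parameter, the bound above reduces to the squared derivative of the mean divided by the variance. The sketch below checks this scalar form numerically, under the assumption that the bound takes that shape; the Gaussian and logistic location families, the integration ranges and all function names are illustrative choices and are not taken from [51].

```python
import math

def integrate(func, lo, hi, n_steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    dx = (hi - lo) / n_steps
    return sum(func(lo + (i + 0.5) * dx) for i in range(n_steps)) * dx

def fisher_and_lower_bound(log_pdf, theta, lo, hi, h=1e-4):
    """Return (Fisher information, (d mean / d theta)^2 / variance) for scalar theta."""
    def pdf(x):
        return math.exp(log_pdf(x, theta))

    def score(x):
        # Central difference approximation of d ln f / d theta.
        return (log_pdf(x, theta + h) - log_pdf(x, theta - h)) / (2.0 * h)

    def mean(t):
        return integrate(lambda x: x * math.exp(log_pdf(x, t)), lo, hi)

    fisher = integrate(lambda x: score(x) ** 2 * pdf(x), lo, hi)
    mu = mean(theta)
    var = integrate(lambda x: (x - mu) ** 2 * pdf(x), lo, hi)
    d_mean = (mean(theta + h) - mean(theta - h)) / (2.0 * h)
    return fisher, d_mean ** 2 / var

def gaussian_log_pdf(x, mu, eta=1.0):
    return -0.5 * math.log(2.0 * math.pi * eta ** 2) - (x - mu) ** 2 / (2.0 * eta ** 2)

def logistic_log_pdf(x, mu, s=1.0):
    z = (x - mu) / s
    return -z - math.log(s) - 2.0 * math.log1p(math.exp(-z))

gauss_fisher, gauss_bound = fisher_and_lower_bound(gaussian_log_pdf, 0.0, -12.0, 12.0)
logistic_fisher, logistic_bound = fisher_and_lower_bound(logistic_log_pdf, 0.0, -40.0, 40.0)
print(gauss_fisher, gauss_bound)        # bound is attained for the Gaussian mean
print(logistic_fisher, logistic_bound)  # bound is strict for the logistic location
```

In the Gaussian case the two numbers should coincide, while for the logistic location parameter the Fisher information (close to $1/3$) should exceed the bound (close to $3/\pi^2$), which shows that the inequality is not always tight.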

In Some Cases, Conditioning Increases the Fisher Information
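The following result states that in some cases, conditioning a random variable on another variable may increase the Fisher information. This result is a generalization of another published previously by Zamir [32].
Theorem 12 (Conditioning Increases Information). If $f_{Y|X;\boldsymbol{\theta}}$ depends on $\theta_k$ and $f_X$ does not depend on it, then:

$$i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} \geq i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k}$$

Proof. From Theorem 10 and the non-negativity of the mutual Fisher information Type I:

$$m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} + 2 \, i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \geq 0$$

Thus, given that only $f_{Y|X;\boldsymbol{\theta}}$ depends on $\theta_k$, Theorem 9 guarantees that:

$$i_C(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} = 0$$

Hence, from the previous mutual Fisher information expressions:

$$i_F(f_{Y|X;\boldsymbol{\theta}})_{\theta_k} - i_F(f_{Y;\boldsymbol{\theta}})_{\theta_k} \geq 0$$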

Data Processing Inequality
Following the same analysis done by Cover and Thomas to present the data processing theorem for Shannon entropy [30] and continuing with the work done by Zamir [32], the case where the joint density function of the random variables $R$, $S$ and $T$ can be expressed by $f_{R,S,T;\boldsymbol{\theta}} = f_{R;\boldsymbol{\theta}} \cdot f_{S|R;\boldsymbol{\theta}} \cdot f_{T|S;\boldsymbol{\theta}}$ is considered. In this case, they form a short Markov chain that is represented by $R \to S \to T$. Because Markovicity implies conditional independence, it is true that $f_{R,T|S;\boldsymbol{\theta}} = f_{R|S;\boldsymbol{\theta}} \cdot f_{T|S;\boldsymbol{\theta}}$.
Theorem 13. Given a Markov chain $R \to S \to T$, where only $f_{T|S;\boldsymbol{\theta}}$ depends on $\theta_k$, then:

$$m_F^{(I)}(f_{S,T;\boldsymbol{\theta}})_{\theta_k} \geq m_F^{(I)}(f_{R,T;\boldsymbol{\theta}})_{\theta_k}$$

Proof. From the previous results: Analogously: Because only $f_{T|S;\boldsymbol{\theta}}$ depends on $\theta_k$ and all of the information correlation terms have derivatives of density functions that do not depend on this parameter, all of the information correlation terms are zero. Hence:

$$m_F^{(I)}(f_{R,T;\boldsymbol{\theta}})_{\theta_k} + m_F^{(I)}(f_{S,T|R;\boldsymbol{\theta}})_{\theta_k} = m_F^{(I)}(f_{S,T;\boldsymbol{\theta}})_{\theta_k} + m_F^{(I)}(f_{R,T|S;\boldsymbol{\theta}})_{\theta_k}$$

Given that $m_F^{(I)}(f_{R,T|S;\boldsymbol{\theta}})_{\theta_k} = 0$ because $R$ and $T$ are independent given $S$, and $m_F^{(I)}(f_{S,T|R;\boldsymbol{\theta}})_{\theta_k} \geq 0$, then:

$$m_F^{(I)}(f_{S,T;\boldsymbol{\theta}})_{\theta_k} \geq m_F^{(I)}(f_{R,T;\boldsymbol{\theta}})_{\theta_k}$$

Given that in the previous proof all of the information correlation terms are zero, $m_F^{(II)}(f_{R,T;\boldsymbol{\theta}})_{\theta_k} = m_F^{(I)}(f_{R,T;\boldsymbol{\theta}})_{\theta_k}$ and $m_F^{(II)}(f_{S,T;\boldsymbol{\theta}})_{\theta_k} = m_F^{(I)}(f_{S,T;\boldsymbol{\theta}})_{\theta_k}$. Thus, the following corollary is obtained:

Corollary 5. Given a Markov chain $R \to S \to T$, where only $f_{T|S;\boldsymbol{\theta}}$ depends on $\theta_k$, then:

$$m_F^{(II)}(f_{S,T;\boldsymbol{\theta}})_{\theta_k} \geq m_F^{(II)}(f_{R,T;\boldsymbol{\theta}})_{\theta_k}$$

Proof. This follows directly from the conditional independence provided by the Markovicity of the random variables and from the mutual Fisher information Type II definition; in this case, the values of mutual Fisher information Type I and mutual Fisher information Type II are identical.
Using the definition of mutual Fisher information Type II and the previous expression, a result already proven by Plastino et al. [52] is readily obtained in a simpler way:

Corollary 6. From the previous results, it is obvious that:

Proof. From Equation (121):

In other words, in any Markovian process, the further away the random variables used by the estimator are, the larger is the variance of the estimated parameter.

Upper Bound on Estimation Error
A well-known result states that given a variance $\eta^2$, of all possible density functions, the one that maximizes the differential entropy is the Gaussian density function [30]. Hence, for an arbitrary density function $f_X$, some side information $Y$ and an estimator $\hat{X}$, it is possible to obtain an estimation version of the Fano inequality [10] (p. 255):

$$E\left[ (X - \hat{X}(Y))^2 \right] \geq \frac{1}{2\pi e} \, e^{2h(X|Y)}$$

In the context of Fisher information, the same question arises: is it possible to bound the estimation error using this quantity as well? Surprisingly, the answer is yes, but in the form of an upper bound. Thus, Shannon entropy can be used to set lower bounds on the error, and Fisher information upper ones. In order to establish this bound, the following setup is defined, where a random variable $R$ is given, and a related random variable $Y$ is observed, which, in turn, is used to calculate a function $\hat{R} = g(Y)$. It is desired to bound the probability that $(R - \hat{R})^2 > \varepsilon$. It is important to note that $R \to Y \to \hat{R}$ is a Markov chain and that $R$ depends on $\boldsymbol{\theta}$.

Theorem 14. Given a random variable $R$ and an estimator of it named $\hat{R}$, the estimation error $E$ is defined by:

Then, the probability that the estimation error exceeds some value $\varepsilon$ satisfies:

Proof. Using the chain rule for Fisher information:

Using the fact that given $R$ and $\hat{R}$, $E$ is no longer a random variable, then:

Neglecting $i_F(f_{E|\hat{R};\boldsymbol{\theta}})_{\theta_k}$, because it is always greater than or equal to zero, it is obtained:

Moreover, the term:

Hence:

Discussion
The Fisher information, which sets a bound on how precise the estimation of an unknown parameter of a density function can be, has an associated set of properties that are analogous to those of Shannon's differential entropy. The properties presented in this work help to understand how to manipulate and use Fisher information in ways that so far have been exclusive to Shannon's differential entropy. Of special importance are the generalization of the mutual information concept to the Fisher information realm, a new version of the data processing theorem that shows that Fisher information decreases in a Markov chain and an upper bound on the estimation error of a random variable that is regulated by the Fisher information.

Acknowledgments
The author thanks Alexis Fuentes and Carlos Alarcón for reviewing this work and helping to improve some expressions. The author also thanks CONICYT Chile for its grant FONDECYT 1120680.

Conflicts of Interest
The author declares no conflict of interest.

A. Boundary Condition
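A general result from calculus establishes that for any function $g(x, \theta_k)$, the following is true:

$$\frac{\partial}{\partial \theta_k} \int_{a(\theta_k)}^{b(\theta_k)} g(x, \theta_k) \, dx = \int_{a(\theta_k)}^{b(\theta_k)} \frac{\partial g(x, \theta_k)}{\partial \theta_k} \, dx + g(b(\theta_k), \theta_k) \, \frac{\partial b(\theta_k)}{\partial \theta_k} - g(a(\theta_k), \theta_k) \, \frac{\partial a(\theta_k)}{\partial \theta_k} \qquad (141)$$

In the case of a vector integral, the previous expression applies to all of the components without any loss of generality. Some of the results in this work use the following condition:

Condition 1 (Boundary Condition). A function complies with the boundary condition if it is possible to neglect the boundary terms in Equation (141), such that:

$$\frac{\partial}{\partial \theta_k} \int_{a(\theta_k)}^{b(\theta_k)} g(x, \theta_k) \, dx = \int_{a(\theta_k)}^{b(\theta_k)} \frac{\partial g(x, \theta_k)}{\partial \theta_k} \, dx$$

This condition corresponds to what are sometimes called regular cases [34] (p. 373).
It is important to keep in mind that not all density functions comply with this condition. As an example, in calculations that involve the uniform density function, where the parameters define the support, it is not possible to neglect the boundary terms, and the boundary condition does not hold. Hence, it is always necessary to check whether the condition holds or not. If it does not, one may arrive at false results.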
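The difference between a regular case and the uniform case can be made concrete numerically. The sketch below compares the integral of the parameter derivative of a density for a Gaussian with unknown mean and for a uniform density on $[0, \theta]$; the choice of these two densities, the function names and the finite-difference step are assumptions made here for illustration.

```python
import math

def integrate(func, lo, hi, n_steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    dx = (hi - lo) / n_steps
    return sum(func(lo + (i + 0.5) * dx) for i in range(n_steps)) * dx

def integral_of_derivative(pdf, theta, lo, hi, h=1e-5):
    """Integrate d pdf / d theta over the fixed interval [lo, hi],
    using a central difference for the derivative."""
    return integrate(
        lambda x: (pdf(x, theta + h) - pdf(x, theta - h)) / (2.0 * h), lo, hi)

def gaussian_pdf(x, mu):
    """Unit-variance Gaussian density with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def uniform_pdf_inside_support(x, theta):
    """Uniform density on [0, theta], evaluated only for points inside the support."""
    return 1.0 / theta

theta = 2.0
gauss_term = integral_of_derivative(gaussian_pdf, 0.0, -10.0, 10.0)
uniform_term = integral_of_derivative(uniform_pdf_inside_support, theta, 0.0, theta)
# Boundary term of the Leibniz rule for the uniform case: g(b, theta) * db/dtheta.
uniform_boundary_term = (1.0 / theta) * 1.0

print(gauss_term)                            # close to 0: boundary condition holds
print(uniform_term)                          # close to -1/theta: it does not hold
print(uniform_term + uniform_boundary_term)  # close to 0 once the boundary term is kept
```

Since the integral of any density is one, its derivative with respect to the parameter must be zero. For the Gaussian the interior integral alone already gives zero, whereas for the uniform density zero is only recovered after adding the boundary term, which is exactly the term that the boundary condition assumes negligible.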

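Definition 7. The mutual Fisher information Type I is defined by:

$$m_F^{(I)}(f_{X,Y;\boldsymbol{\theta}})_{\theta_k} \equiv \iint \left( \frac{\partial}{\partial \theta_k} \ln \frac{f_{X,Y;\boldsymbol{\theta}}(x,y)}{f_{X;\boldsymbol{\theta}}(x) \, f_{Y;\boldsymbol{\theta}}(y)} \right)^2 f_{X,Y;\boldsymbol{\theta}}(x,y) \, dx \, dy$$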