Duality of Maximum Entropy and Minimum Divergence

We discuss a special class of generalized divergence measures constructed by the use of generator functions. Any divergence measure in the class is separated into the difference between a cross and a diagonal entropy. The diagonal entropy measure in the class is associated with a model of maximum entropy distributions; the divergence measure leads to statistical estimation via minimization for an arbitrarily given statistical model. The dualistic relationship between the maximum entropy model and the minimum divergence estimation is explored in the framework of information geometry. The model of maximum entropy distributions is characterized as totally geodesic with respect to the linear connection associated with the divergence. A natural extension of the classical theory for the maximum likelihood method under the maximum entropy model in terms of the Boltzmann-Gibbs-Shannon entropy is given. We discuss the duality in detail for the Tsallis entropy as a typical example.


Introduction
Information divergence plays a central role in the understanding and integration of statistics, information science, statistical physics and machine learning. Let F be the space of all probability density functions with a common support with respect to a carrier measure Λ of a data space. Usually Λ is taken as the Lebesgue measure or the counting measure, corresponding to continuous and discrete random variables, respectively. The most typical example of information divergence is the Kullback-Leibler divergence on F,

D_0(f, g) = ∫ f(x) log{f(x)/g(x)} dΛ(x),

which is decomposed into the difference of cross and diagonal entropy measures,

D_0(f, g) = C_0(f, g) − H_0(f),

where C_0(f, g) = −∫ f(x) log g(x) dΛ(x) and H_0(f) = C_0(f, f). The entropy H_0(f) is nothing but the Boltzmann-Gibbs-Shannon entropy. In effect, D_0(f, g) connects the maximum likelihood [1,2] and the maximum entropy [3]. If we take a canonical statistic t(X), then the maximum entropy distribution under a moment constraint for t(X) belongs to the exponential model associated with t(X),

M^(e) = {f_0(x, θ) := exp{θ^⊤ t(x) − κ_0(θ)} : θ ∈ Θ},   (1)

where κ_0(θ) = log ∫ exp{θ^⊤ t(x)} dΛ(x) and Θ = {θ : κ_0(θ) < ∞}. In this context, the statistic t(X) is minimally sufficient in the model, in which the maximum likelihood estimator (MLE) for the parameter θ of the model is given in one-to-one correspondence with t(X); see [4] for the convex geometry. If we consider the expectation parameter µ = E_{f_0(·,θ)}{t(X)} in place of θ, then for a given random sample X_1, ..., X_n, the MLE for µ is given by the sample mean of the t(X_i)'s, that is, µ̂ = (1/n) Σ_{i=1}^n t(X_i). We define two kinds of geodesic curves connecting f and g in F.
We call the curve

C^(m) = {C^(m)_t(x) := (1 − t) f(x) + t g(x) : t ∈ (0, 1)}   (2)

a mixture geodesic. Alternatively, we call the curve

C^(e) = {C^(e)_t(x) := exp{(1 − t) log f(x) + t log g(x) − κ(t)} : t ∈ (0, 1)}   (3)

an exponential geodesic, where κ(t) = log ∫ f(x)^{1−t} g(x)^t dΛ(x). We denote by Γ^(m) and Γ^(e) the two linear connections induced by the mixture and exponential geodesic curves on F, which we call the mixture connection and the exponential connection on F, respectively; see [5,6]. Thus all tangent vectors along a mixture geodesic curve are parallel to each other with respect to Γ^(m); all tangent vectors along an exponential geodesic curve are parallel to each other with respect to Γ^(e). It is well known that M^(e) is totally exponential-geodesic, that is, for any f_0(x, θ_0) and f_0(x, θ_1) in M^(e), the exponential geodesic curve connecting f_0(x, θ_0) and f_0(x, θ_1) lies in M^(e). In effect we observe that C^(e)_t(x) = f_0(x, θ_t) with θ_t = (1 − t)θ_0 + tθ_1; thus C^(e)_t(x) ∈ M^(e) for all t ∈ (0, 1) because Θ is a convex set. Alternatively, consider a parametric model M^(m) of mixtures of fixed component densities f_0, ..., f_d in F with mixing proportions as parameters. Then, M^(m) is totally mixture-geodesic.
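As a small numerical sketch, the two geodesics defined above can be traced explicitly on a discrete space; the three-point densities below are hypothetical example values, with Λ the counting measure:

```python
# Mixture and exponential geodesics between two discrete densities
# (hypothetical example values on a 3-point space, counting measure).
f = [0.5, 0.3, 0.2]
g = [0.2, 0.2, 0.6]

def mixture_geodesic(f, g, t):
    """C^(m)_t(x) = (1 - t) f(x) + t g(x): linear interpolation of densities."""
    return [(1 - t) * fi + t * gi for fi, gi in zip(f, g)]

def exponential_geodesic(f, g, t):
    """C^(e)_t(x) = exp{(1 - t) log f + t log g - kappa(t)}: linear interpolation
    of log-densities, renormalized by kappa(t) = log sum f^(1-t) g^t."""
    unnorm = [fi ** (1 - t) * gi ** t for fi, gi in zip(f, g)]
    z = sum(unnorm)  # exp(kappa(t))
    return [u / z for u in unnorm]

# both curves stay inside the probability simplex for every t
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    m, e = mixture_geodesic(f, g, t), exponential_geodesic(f, g, t)
    assert abs(sum(m) - 1) < 1e-12 and abs(sum(e) - 1) < 1e-12
```

Both curves match f at t = 0 and g at t = 1, and differ only in whether densities or log-densities are interpolated.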
Indeed, a mixture geodesic curve connecting two elements of M^(m) is again a convex combination of the component densities, and hence stays in M^(m).

We discuss generalized entropy and divergence measures with applications to statistical models and estimation. There have been recent developments in the generalization of the Boltzmann-Shannon entropy and the Kullback-Leibler divergence. We focus on the U-divergence with a generator function U, in which the U-divergence is separated into the difference between a cross entropy and a diagonal entropy. We observe a dualistic property associated with the U-divergence between a statistical model and estimation. The U-loss function is given by an empirical approximation of the U-divergence based on a given dataset under a statistical model, in which the U-estimator is defined by minimization of the U-loss function on the parameter space. On the other hand, the diagonal entropy leads to a maximum entropy distribution on a mean equal space; we call this family of distributions the U-model. In accordance with this, the U-divergence leads to a pair of a U-model and a U-estimator as a statistical model and estimation. The typical example is U(t) = exp(t), which is associated with the Kullback-Leibler divergence D_0(f, g), generating the pair of an exponential family M^(e) and the minus log-likelihood function.
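The decomposition of the Kullback-Leibler divergence into cross entropy minus diagonal entropy can be checked numerically; the two distributions below are hypothetical examples on a three-point space:

```python
import math

def cross_entropy(f, g):
    """C_0(f, g) = -sum f log g  (counting measure)."""
    return -sum(fi * math.log(gi) for fi, gi in zip(f, g))

def entropy(f):
    """H_0(f) = C_0(f, f), the Boltzmann-Gibbs-Shannon entropy."""
    return cross_entropy(f, f)

def kl(f, g):
    """D_0(f, g) = sum f log(f / g)."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g))

f = [0.5, 0.3, 0.2]   # hypothetical example densities
g = [0.2, 0.2, 0.6]

# the decomposition D_0(f, g) = C_0(f, g) - H_0(f)
assert abs(kl(f, g) - (cross_entropy(f, g) - entropy(f))) < 1e-12
```

The same decomposition pattern, with log replaced by ξ = (dU/ds)^{-1}, defines the U-cross and U-diagonal entropy in the next section.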
This aspect is characterized as a minimax game between a decision maker and Nature. The paper is organized as follows. Section 2 introduces the class of U-divergence measures. The information geometric framework associated with a divergence measure is given in Section 3. In Section 4 we discuss the maximum entropy model with respect to the U-diagonal entropy. The minimum divergence method via U-divergence is discussed in Section 5. We next explore the duality between maximum U-entropy and minimum U-divergence in Section 6. Finally, we discuss the relation to robust statistics by minimum divergence, and a future problem on MaxEnt, in Section 7.

U -Divergence
A class of information divergence measures is constructed from a generator function U via a simple employment of conjugate convexity; see [7]. We introduce a class U of generator functions, requiring U : R → R to be smooth, strictly convex and increasing, and the conjugate convex function

U*(t) = sup_{s ∈ R} {st − U(s)},

and hence U*(t) = tξ(t) − U(ξ(t)), where ξ(t) is the inverse function of the derivative of U(s), or equivalently (dU/ds)(ξ(t)) = t. The existence of ξ(t) is guaranteed by the assumption that U is in U, and we observe the important property that the derivative of U* is the inverse of the derivative of U, that is,

(d/dt)U*(t) = ξ(t).   (4)

The conjugate function U* of U is reflexive, that is, U** = U. By definition, for any s ∈ R and t > 0,

U(s) ≥ st − U*(t),   (5)

with equality if and only if s = ξ(t). We consider an information divergence functional using the generator function U,

D_U(f, g) = ∫ {U(ξ(g(x))) − U(ξ(f(x))) − f(x)(ξ(g(x)) − ξ(f(x)))} dΛ(x),   (6)

called the U-divergence. We can easily confirm that D_U(f, g) satisfies the first axiom of a distance function, since the integrand in Equation (6) is always nonnegative, with equality to 0 if and only if f(x) = g(x), because of Equation (5). It follows from the construction that

D_U(f, g) = C_U(f, g) − H_U(f).

Here

C_U(f, g) = ∫ {U(ξ(g(x))) − f(x) ξ(g(x))} dΛ(x)

is called the U-cross entropy;

H_U(f) = C_U(f, f)   (7)

is called the U-diagonal entropy. We can write H_U(f) = ∫ {U(ξ(f)) − f ξ(f)} dΛ = −∫ U*(f) dΛ by the definition of U*, which equals the diagonal C_U(f, f). We note that the U-divergence is expressed as

D_U(f, g) = ∫ {U*(f(x)) − U*(g(x)) − (f(x) − g(x)) ξ(g(x))} dΛ(x)

because of Equation (4), which implies that U* plays the role of a generator function in place of U. In fact, this is also called the U*-Bregman divergence, cf. [8,9].

The first example of U is U_0(s) = exp(s), which leads to U*_0(t) = t log t − t and ξ_0(t) = log t. Thus the U_0-divergence equals D_0(f, g) as defined in the Introduction, while the U_0-cross entropy and U_0-diagonal entropy equal C_0(f, g) and H_0(f) up to the additive constant ∫ g dΛ = 1. As the second example we consider

U_β(s) = (1/(β + 1)) (1 + βs)^{(β+1)/β},

where β is a scalar. The conjugate function becomes

U*_β(t) = t^{β+1}/(β(β + 1)) − t/β,

with ξ_β(t) = (t^β − 1)/β. Then the generator function U_β is associated with the β-power cross entropy

C_β(f, g) = −(1/β) ∫ f(x) g(x)^β dΛ(x) + (1/(β + 1)) ∫ g(x)^{β+1} dΛ(x) + 1/β

and the β-power diagonal entropy H_β(f) = C_β(f, f). We observe that

D_β(f, g) = ∫ { (g(x)^{β+1} − f(x)^{β+1})/(β + 1) − f(x)(g(x)^β − f(x)^β)/β } dΛ(x).

The class of β-power divergence functionals includes the Kullback-Leibler divergence in the limiting sense of β → 0, and the case β = 1, in which D_1(f, g) is a half of the squared L_2 norm. If we take the limit of β to −1, then D_β(f, g) becomes the Itakura-Saito divergence

D_{−1}(f, g) = ∫ { f(x)/g(x) − log(f(x)/g(x)) − 1 } dΛ(x),

which is widely applied in signal processing and speech recognition, cf. [10-12]. The β-power divergence D_β(p, q) is proposed in [13]; the β-power entropy H_β is equal to the Tsallis q-entropy with the relation q = β + 1, cf. [14-16]. Tsallis entropy is connected with spin-glass relaxation, dissipative optical lattices and so on, beyond the classical statistical physics associated with the Boltzmann-Shannon entropy H_0(p). See also [17,18] for the power entropy in the field of ecology. We will discuss the statistical properties of the minimum β-power divergence method in the presence of outliers departing from a supposed model, cf. [19-21]. The robustness performance is elucidated by an appropriate selection of β. Beyond the robustness perspective, a property of spontaneous learning applied to clustering analysis is focused on in [22]; see also [23] for nonnegative matrix analysis.
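A quick numerical sketch of the β-power divergence (with hypothetical example densities on a three-point space) confirms that β = 1 gives half the squared L_2 distance and that small β approaches the Kullback-Leibler divergence:

```python
import math

def beta_divergence(f, g, beta):
    """D_beta(f, g) = sum { (g^(b+1) - f^(b+1))/(b+1) - f (g^b - f^b)/b }."""
    b = beta
    return sum((gi ** (b + 1) - fi ** (b + 1)) / (b + 1)
               - fi * (gi ** b - fi ** b) / b
               for fi, gi in zip(f, g))

def kl(f, g):
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g))

f = [0.5, 0.3, 0.2]   # hypothetical example densities
g = [0.2, 0.2, 0.6]

# beta = 1: half the squared L2 distance
l2_half = 0.5 * sum((fi - gi) ** 2 for fi, gi in zip(f, g))
assert abs(beta_divergence(f, g, 1.0) - l2_half) < 1e-12

# beta -> 0: recovers the Kullback-Leibler divergence
assert abs(beta_divergence(f, g, 1e-6) - kl(f, g)) < 1e-4
```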
The third example of a generator function is U_η(s) = (1 − η) exp(s) − ηs with a scalar η. This generator function leads to the η-cross entropy and the η-entropy; see [24-27] for applications to pattern recognition. Obviously, if we take the limit of η to 0, then C_η(f, g), H_η(f) and D_η(f, g) converge to C_0(f, g), H_0(f) and D_0(f, g), respectively. A mislabel model is derived as a maximum η-entropy distribution under a moment constraint if we consider a binary regression model. See [25,27] for a detailed discussion.

Geometry Associated with U -Divergence
We investigate geometric properties associated with the U-divergence, which will help the discussion in subsequent sections. Let us arbitrarily fix a statistical model M = {f_θ(x) : θ ∈ Θ} embedded in the total space F with mild regularity conditions. In fact, we consider the mixture geodesic curve C^(m), the exponential geodesic curve C^(e), the mixture model M^(m) and the exponential model M^(e) as typical examples of M. There are difficult aspects in defining F as a differentiable manifold of infinite dimension, because the constraint of positivity on the support is intractable in the sense of the topology; see Section 2 in [6] for a detailed discussion and historical remarks. On the other hand, if we confine ourselves to a statistical model M, then we can formulate M as a finite dimensional manifold, as in the following discussion. Thus, we produce a path geometry in which, for any two elements f and g of F, a class of geodesic curves connecting f and g, including C^(m) and C^(e), is introduced, so that the class of geodesic subspaces is derived, as for M^(m) and M^(e).

Riemannian Metric and Linear Connections
We view the statistical model M as a d-dimensional differentiable manifold with the coordinate θ = (θ_1, ..., θ_d). Any information divergence is associated with a Riemannian metric and a pair of dual linear connections; see [28,29] for a detailed discussion. We focus on the geometry generated by the U-divergence D_U(f, g) as follows. The Riemannian metric at f_θ of M is given by

G^(U)_{ij}(θ) = ∫ ∂_i f_θ(x) ∂_j ξ(f_θ(x)) dΛ(x),   (10)

and the linear connections are

Γ^(U)_{ij,k}(θ) = ∫ ∂_i ∂_j f_θ(x) ∂_k ξ(f_θ(x)) dΛ(x)   (11)

and

*Γ^(U)_{ij,k}(θ) = ∫ ∂_k f_θ(x) ∂_i ∂_j ξ(f_θ(x)) dΛ(x),   (12)

where ∂_i = ∂/∂θ_i; see the Appendix for the derivation. Now we can assert the following theorem under an assumption for F: let f be arbitrarily fixed in F; if ∫ a(x){g(x) − f(x)} dΛ(x) = 0 for any g of F, then a(x) is constant in x almost everywhere with respect to Λ.
Theorem 1. Let Γ^(U) be the linear connection defined in Equation (11). Then any Γ^(U)-geodesic curve is equal to the mixture-geodesic curve defined in Equation (2).
Proof. Let C^(U) := {f_t(x) : t ∈ (0, 1)} be a Γ^(U)-geodesic curve with f_0 = f and f_1 = g. We consider an extended model around C^(U) with parameters θ = (s, t, u). Then we observe that, if u = s, the Γ^(U)-geodesic equation yields an integral of the form ∫ a(x){g(x) − f(x)} dΛ(x), which is identically 0 for any g of F. It follows from the assumption on F that (d²/dt²)f_t(x) = c almost everywhere with respect to Λ, which is solved by f_t(x) = (1 − t)f(x) + t g(x) + (c/2) t(t − 1) from the endpoint conditions for C^(U). We observe that c = 0 because f_t(x) ∈ F, which concludes that C^(U) equals the mixture geodesic. The proof is complete.

This property is elemental in characterizing the U-divergence class, and is closely related to the empirical reducibility discussed in a subsequent section. The assumption on F holds if the carrier measure Λ is the Lebesgue measure or the counting measure.
On the other hand, for a *Γ^(U)-geodesic curve *C^(U) := {f*_t(x) : t ∈ (0, 1)} with f*_0 = f and f*_1 = g, we consider an embedding into a two-dimensional model with θ = (s, t), where u(s) = (d/ds)U(s) and κ_θ is a normalizing constant. This leads to (d²/dt²) ξ(f*_t(x)) = c almost everywhere with respect to Λ, which is solved by

f*_t(x) = u((1 − t) ξ(f(x)) + t ξ(g(x)) − κ(t)),

with κ(t) a normalizing constant. We confirm that, if U = exp, then the *Γ^(U)-geodesic curve reduces to the exponential geodesic curve defined in Equation (3).

Generalized Pythagorean Theorems
We next consider the Pythagorean theorem based on the U-divergence as an extension of the result associated with the Kullback-Leibler divergence in [6].
Theorem 2. Let p, q and r be in F. We connect p with q by the mixture geodesic {f_t(x) := (1 − t)q(x) + t p(x) : t ∈ [0, 1]}. Alternatively, we connect q and r by the *Γ^(U)-geodesic curve

{r_s(x) := u((1 − s) ξ(q(x)) + s ξ(r(x)) − κ(s)) : s ∈ [0, 1]}.   (15)

The two curves orthogonally intersect at q with respect to the Riemannian metric G^(U) defined in Equation (10) if and only if

D_U(p, r) = D_U(p, q) + D_U(q, r).   (16)

Proof. A straightforward calculus yields that

∫ {p(x) − q(x)}{ξ(r(x)) − ξ(q(x))} dΛ(x) = D_U(p, q) + D_U(q, r) − D_U(p, r).   (17)

By the definition of G^(U) we see that the inner product G^(U)_{12}(θ) of the tangent vectors of the two curves at q is nothing but the left side of Equation (17), where θ = (t, s). Hence the orthogonality assumption is equivalent to Equation (16), which completes the proof.
Remark 1. We remark the further property that, for any s and t in [0, 1], the Pythagorean relation remains valid with p and r replaced by the corresponding points on the two curves. If U = exp, then Theorem 2 reduces to the Pythagoras theorem with the Kullback-Leibler divergence as shown in [6]. Consider two geodesic subspaces, a totally mixture-geodesic M^(m) and a totally *Γ^(U)-geodesic M^(U), both containing q.   (18)

For any m-geodesic curve C^(m) and U-geodesic curve *C^(U) meeting at q, we assume that C^(m) and *C^(U) orthogonally intersect at q in the sense of the Riemannian metric G^(U). Then, for any p ∈ M^(m) and r ∈ M^(U),

D_U(p, r) = D_U(p, q) + D_U(q, r),

in which the two-way projection is given as q = argmin_{p' ∈ M^(m)} D_U(p', r) and q = argmin_{r' ∈ M^(U)} D_U(p, r'). First we confirm a kind of reduction property of the Kullback-Leibler divergence to the framework of information geometry, such that (G^(U_0), Γ^(U_0), *Γ^(U_0)) = (G, Γ^(m), Γ^(e)), where G is the information metric. Second we return to the case of the β-power divergence, which reduces to a special case of Theorem 2. Consider the two curves connecting p with q and q with r as in Theorem 2 with ξ(t) = (t^β − 1)/β. Then we observe for the Riemannian metric G^(β) generated by the β-power divergence that the orthogonality condition at q is proportional to

∫ {p(x) − q(x)}{r(x)^β − q(x)^β} dΛ(x) = 0.

We observe that if C^(m) and C^(β) orthogonally intersect at q, then D_β(p, r) = D_β(p, q) + D_β(q, r).
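The Pythagorean relation for the Kullback-Leibler case (U = exp, ξ = log) can be verified numerically by choosing p on a mixture geodesic through q whose direction is orthogonal to log r − log q; the densities q and r below are hypothetical examples on a three-point space:

```python
import math

def kl(f, g):
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g))

q = [0.3, 0.3, 0.4]   # hypothetical example densities
r = [0.2, 0.5, 0.3]
a = [math.log(ri / qi) for ri, qi in zip(r, q)]  # xi(r) - xi(q) = log r - log q

# direction v with sum(v) = 0 (stays in the simplex) and sum(v * a) = 0
# (orthogonality): cross product of (1, 1, 1) and a
v = [a[2] - a[1], a[0] - a[2], a[1] - a[0]]

p = [qi + 0.1 * vi for qi, vi in zip(q, v)]  # on the mixture geodesic through q
assert all(pi > 0 for pi in p) and abs(sum(p) - 1) < 1e-12

# Pythagorean identity D_0(p, r) = D_0(p, q) + D_0(q, r)
assert abs(kl(p, r) - kl(p, q) - kl(q, r)) < 1e-9
```

The cross-product trick guarantees both constraints exactly up to rounding, so the additivity of the divergence holds to machine precision.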

Maximum Entropy Distribution
The maximum entropy principle is based on the Boltzmann-Shannon entropy, in which the maximum entropy distribution is characterized by an exponential model. The maximum entropy method has been widely employed in fields such as natural language processing and ecological analysis. However, there are other types of entropy measures proposed in different fields, such as the Hill diversity index, the Gini-Simpson index and the Tsallis entropy, cf. [14,17,18]. We introduced the class of U-entropy functionals, which includes all the entropy measures mentioned above. In this subsection, we discuss the maximum entropy distribution based on an arbitrarily fixed U-entropy.
We check a finite discrete case with K + 1 cells as a special situation, where F reduces to the K-dimensional simplex S_K. The maximum U-entropy distribution is defined by

f* = argmax_{f ∈ S_K} H_U(f).

The Lagrange function is

L(f, λ) = H_U(f) + λ ( Σ_{k=0}^{K} f_k − 1 ).

We observe that

(∂/∂f_k) L(f, λ) = −ξ(f_k) + λ = 0, and hence f_k = u(λ) for all k,

where u = (d/ds)U. Therefore the maximum U-entropy distribution f* is the uniform distribution on S_K for any generator function U.
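The uniformity of the maximum U-entropy distribution on the simplex can be illustrated with the Tsallis q-entropy (q = β + 1); this sketch compares the uniform distribution against randomly drawn probability vectors:

```python
import random

def tsallis_entropy(p, q):
    """Tsallis q-entropy S_q(p) = (1 - sum p_i^q) / (q - 1), with q = beta + 1."""
    return (1 - sum(pi ** q for pi in p)) / (q - 1)

K = 4
uniform = [1.0 / K] * K
rng = random.Random(0)

# for each q, no random probability vector beats the uniform distribution
for q in (0.5, 1.5, 2.0):
    s_max = tsallis_entropy(uniform, q)
    for _ in range(100):
        w = [rng.random() for _ in range(K)]
        p = [wi / sum(w) for wi in w]
        assert tsallis_entropy(p, q) <= s_max + 1e-12
```

Concavity of S_q for q > 0 is what makes the uniform distribution the global maximizer, mirroring the Lagrangian argument above.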
In general the U-entropy is an unbounded functional on F unless F is finite discrete. For this reason we introduce a moment constraint as follows. Let t(X) be a k-dimensional statistic vector. Henceforth we assume that E_f{‖t(X)‖²} < ∞ for all f of F. We consider the mean equal space for t(X),

Γ(τ) = {f ∈ F : E_f{t(X)} = τ},

where τ is a fixed vector in R^k. By definition Γ(τ) is totally mixture-geodesic, that is, if f and g are in Γ(τ), then (1 − t)f + tg is also in Γ(τ) for any t ∈ (0, 1).
Theorem 3. Let f* = argmax{H_U(f) : f ∈ Γ(τ)}, where H_U(f) is the U-diagonal entropy defined in Equation (7). Then the maximum U-entropy distribution is given by

f*(x) = u(θ^⊤ t(x) − κ_U(θ)),   (20)

where u = (d/ds)U, κ_U(θ) is the normalizing factor and θ is a parameter vector determined by the moment constraint E_{f*}{t(X)} = τ.

Proof. The Euler-Lagrange functional is given by

L(f) = H_U(f) + θ^⊤ ( E_f{t(X)} − τ ) + λ ( ∫ f dΛ − 1 ),   (21)

where θ and λ are Lagrange multipliers. The stationarity equation in Equation (21) yields ξ(f*(x)) = θ^⊤ t(x) − κ_U(θ), which concludes Equation (20). Since ξ(f*(x)) is affine in t(x), the U-cross entropy C_U(g_τ, f*) does not depend on g_τ ∈ Γ(τ); in particular C_U(g_τ, f*) = C_U(f*, f*) = H_U(f*). Since g_τ ∈ Γ(τ), we observe that

H_U(f*) − H_U(g_τ) = C_U(g_τ, f*) − H_U(g_τ) = D_U(g_τ, f*),

which is nonnegative by the definition of the U-divergence. The proof is complete.
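A minimal numerical illustration of Theorem 3, assuming the β-power generator with β = 1 (so u(s) = 1 + s, ξ(t) = t − 1 and H_1(f) = Σ{f − f²/2}) on a hypothetical three-point space with hypothetical values of t(x) and θ: the density u(θ t(x) − κ_U(θ)) maximizes H_1 over the mean equal space Γ(τ):

```python
# beta = 1 power generator: U(s) = (1 + s)^2 / 2, u(s) = 1 + s, xi(t) = t - 1.
t = [0.0, 1.0, 2.0]   # hypothetical canonical statistic on a 3-point space
theta = 0.1            # hypothetical natural parameter
kappa = (sum(1 + theta * ti for ti in t) - 1) / 3   # normalizer so sum f* = 1
f_star = [1 + theta * ti - kappa for ti in t]       # f* = u(theta t - kappa)
tau = sum(fi * ti for fi, ti in zip(f_star, t))     # induced mean constraint

def H1(f):
    """beta = 1 power entropy: H_1(f) = sum {U(xi(f)) - f xi(f)} = sum (f - f^2/2)."""
    return sum(fi - fi * fi / 2 for fi in f)

# perturbations that stay in Gamma(tau): v orthogonal to (1, 1, 1) and t
v = [1.0, -2.0, 1.0]
for eps in (-0.05, 0.02, 0.05):
    g = [fi + eps * vi for fi, vi in zip(f_star, v)]
    assert abs(sum(g) - 1) < 1e-12                              # still a density
    assert abs(sum(gi * ti for gi, ti in zip(g, t)) - tau) < 1e-12  # same mean
    assert H1(g) <= H1(f_star)                                  # entropy drops
```

Every admissible perturbation strictly lowers H_1, matching the identity H_U(f*) − H_U(g_τ) = D_U(g_τ, f*) ≥ 0 in the proof.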
Here we give a definition of the model of maximum U -entropy distributions as follows.
Definition 1. We define the k-dimensional model

M_U = {f_U(x, θ) := u(θ^⊤ t(x) − κ_U(θ)) : θ ∈ Θ},   (23)

where Θ = {θ ∈ R^k : κ_U(θ) < ∞}, which we call the U-model.

Naudts' deformed exponential family, discussed from a statistical physical viewpoint as in [15], is closely related with the U-model. The one-parameter family {r_s(x) : s ∈ [0, 1]} defined in Equation (15) is a one-dimensional U-model, and M^(U) defined in Equation (18) is a K-dimensional U-model. For a U-model M_U defined in Equation (23), the parameter θ is an affine parameter for the linear connection *Γ^(U) defined in Equation (12). In fact, we observe from the definition in Equation (12) that ∂_i ∂_j ξ(f_U(x, θ)) = −∂_i ∂_j κ_U(θ) does not depend on x, so that

*Γ^(U)_{ij,k}(θ) = −∂_i ∂_j κ_U(θ) ∫ ∂_k f_U(x, θ) dΛ(x),

which is identically 0 for all θ ∈ Θ. We have a geometric understanding of the U-model similar to the exponential model discussed in the Introduction.

Theorem 4. Assume for U that (d²/dt²)U(t) > 0 for any t in R. Then, the U-model is totally *Γ^(U)-geodesic.
Proof. For arbitrarily fixed θ_1 and θ_2 in Θ, we define the U-geodesic curve connecting f_U(x, θ_1) and f_U(x, θ_2) such that, for λ ∈ (0, 1),

f*_λ(x) = u((1 − λ) ξ(f_U(x, θ_1)) + λ ξ(f_U(x, θ_2)) − κ(λ)),

with a normalizing factor κ(λ), which is written as f*_λ(x) = f_U(x, θ_λ) with θ_λ = (1 − λ)θ_1 + λθ_2. Hence it suffices to show θ_λ ∈ Θ for all λ ∈ (0, 1), where Θ is defined in Definition 1. We look at the identity ∫ f_U(x, θ) dΛ(x) = 1, from the fact that f_U(x, θ) is a probability density function. The first derivative gives

∫ u'(θ^⊤ t(x) − κ_U(θ)) {t_i(x) − ∂_i κ_U(θ)} dΛ(x) = 0,   (24)

and the second derivative gives an expression of ∂_i ∂_j κ_U(θ) as a ratio of weighted integrals. Since the identity in Equation (24) shows that the Hessian of κ_U(θ) is proportional to a Gramian matrix, κ_U(θ) is a convex function of θ. This concludes that θ_λ ∈ Θ for any λ ∈ (0, 1), which completes the proof.

We discuss a typical example given by the power entropy H_β(f); see [15,30-34] from a viewpoint of statistical physics. First we consider a mean equal space of univariate distributions on (0, ∞) for a β-dependent statistic t(x); note that lim_{β→0} t(x) = (x, (κ − 1) log x). To obtain the maximum entropy distribution with H_β we consider the Euler-Lagrange function, where θ and λ are Lagrange multiplier parameters. This yields that the maximum entropy distribution is of the U-model form u_β(θ^⊤ t(x) − κ_β(θ)). A gamma distribution is defined by the density function

f(x) = (λ^κ / Γ(κ)) x^{κ−1} exp(−λx)  for x > 0,

with shape parameter κ and rate parameter λ. Second, we consider a case of multivariate distributions, where the moment constraints are supposed, for a fixed p-dimensional vector µ and matrix V of size p × p, to be E_f{X} = µ and E_f{(X − µ)(X − µ)^⊤} = V. If we consider the limit case of β to 0, then H_β(f) reduces to the Boltzmann-Shannon entropy and the maximum entropy distribution is the Gaussian distribution with the density function

f(x) = (2π)^{−p/2} det(V)^{−1/2} exp{−(1/2)(x − µ)^⊤ V^{−1} (x − µ)}.

In general we deduce that, if β > −2/(p + 2), then the maximum β-power entropy distribution uniquely exists, with a density function of the form

f(x) = c_β {1 − b_β (x − µ)^⊤ V^{−1} (x − µ)}_+^{1/β}

for suitable normalizing constants b_β and c_β. See [35,36] for the detailed discussion and [37,38] for the discussion on group invariance. Thus, if β > 0, then the maximum β-power entropy distribution has a compact support. The typical case is β = 2, which is called the Wigner semicircle distribution. On the other hand, if −2/(p + 2) < β < 0, the maximum β-power entropy distribution has a full support of R^p, and equals a p-variate t-distribution with a degree of freedom depending on β.

Minimum Divergence Method
We have shown a variety of U-divergence functionals using various generator functions, for which the minimum divergence methods are applied to analyses in statistics and statistical machine learning. In effect the U-cross entropy is affine in the first argument,

C_U(Σ_{j=1}^J λ_j f_j, g) = Σ_{j=1}^J λ_j C_U(f_j, g),

for any λ_j > 0 with Σ_{j=1}^J λ_j = 1. This is closely related to the characteristic property that the linear connection Γ^(U) associated with the U-divergence is equal to the mixture connection Γ^(m), as discussed in Theorem 1. Furthermore, for a fixed g, C_U(f, g) can be viewed as a functional of F in place of f as follows:

C_U(F, g) = ∫ U(ξ(g(x))) dΛ(x) − ∫ ξ(g(x)) dF(x),

where F is the probability distribution generated from f(x). If we assume a random sequence X_1, ..., X_n from a density function f(x), then the U-cross entropy is approximated as

C_U(F̂_n, g) = ∫ U(ξ(g(x))) dΛ(x) − (1/n) Σ_{i=1}^n ξ(g(X_i)),   (25)

where F̂_n is the empirical distribution based on the data. Consequently, if we model g by a model function f(·, θ), then the right side of Equation (25) depends only on the data set (X_i)_{i=1}^n and the parameter θ, without any knowledge of the underlying density function f(x). This gives the empirical approximation, which is advantageous over other classes of divergence measures. The minimum U-divergence method is directly applied via minimization of the empirical approximation with respect to θ. We note that the minimum divergence is equivalent to the minimum cross entropy, since the diagonal entropy is just a constant in θ. In particular, in the classical case U = exp, the empirical approximation is equivalent to the minus log-likelihood function. Let X_1, ..., X_n be independently and identically distributed from an underlying density function f(x), which is approximated by a statistical model M = {f(x, θ) : θ ∈ Θ}. The U-loss function is introduced by

L_U(θ) = −(1/n) Σ_{i=1}^n ξ(f(X_i, θ)) + b_U(θ),

where b_U(θ) = ∫ U(ξ(f(x, θ))) dΛ(x). We call θ̂_U = argmin_{θ∈Θ} L_U(θ) the U-estimator for the parameter θ. By definition E_f{L_U(θ)} = C_U(F, f(·, θ)) for all θ in Θ, which implies that L_U(θ) almost surely converges to C_U(F, f(·, θ)) as n goes to ∞. Let us define a statistical functional

θ_U(F) = argmin_{θ∈Θ} C_U(F, f(·, θ)),

where C_U(F, g) is written for C_U(f, g) with f replaced by the probability distribution F generated from f. Then θ_U(F) is model-consistent, that is, θ_U(F_θ) = θ for any θ ∈ Θ, because C_U(F_θ, f(·, θ')) − C_U(F_θ, f(·, θ)) = D_U(f(·, θ), f(·, θ')) ≥ 0. The estimating equation is given by (∂/∂θ) L_U(θ) = 0, which under the U-model M_U is written as

E_{f_U(·,θ)}{t(X)} = t̄,  where t̄ = (1/n) Σ_{i=1}^n t(X_i).

Hence, if we consider the U-estimator for the parameter η defined by the transformation ϕ(θ) = E_{f_U(·,θ)}{t(X)}, then the U-estimator η̂_U is nothing but the sample mean t̄. Here we confirm that the transformation ϕ(θ) is one-to-one as follows. The Jacobian matrix of the transformation is given by

(∂/∂θ^⊤) ϕ(θ) = ∫ u'(θ^⊤ t(x) − κ_U(θ)) {t(x) − ∂κ_U(θ)/∂θ}{t(x) − ∂κ_U(θ)/∂θ}^⊤ dΛ(x),

since the first identity for M_U leads to ∫ u'(θ^⊤ t(x) − κ_U(θ)) {t(x) − ∂κ_U(θ)/∂θ} dΛ(x) = 0. Therefore, we conclude that the Jacobian matrix is symmetric and positive-definite, since u'(s) is a positive function by the assumption of strict convexity of U, which implies that ϕ(θ) is one-to-one. Consequently, the estimator θ̂_U for θ is given by ϕ^{−1}(t̄). We summarize these results in the following theorem.
Theorem 5. Let M_U be a U-model with a canonical statistic t(X) as defined in Equation (28). Then the U-estimator for the expectation parameter η of t(X) is always t̄, where t̄ = (1/n) Σ_{i=1}^n t(X_i).
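In the classical case U = exp, Theorem 5 reduces to the familiar fact that the MLE of the expectation parameter in an exponential family is the sample mean of the canonical statistic; a sketch with hypothetical Poisson count data (t(X) = X), minimizing the minus log-likelihood by a crude grid search:

```python
import math

# Poisson model: an exponential family with canonical statistic t(X) = X.
data = [2, 0, 3, 1, 4, 2, 1]   # hypothetical counts

def neg_log_likelihood(mu):
    """Minus log-likelihood of a Poisson sample with mean mu."""
    return -sum(x * math.log(mu) - mu - math.lgamma(x + 1) for x in data)

# crude grid search for the minimizer over mu in [0.5, 4.5)
grid = [0.5 + 0.001 * k for k in range(4000)]
mu_hat = min(grid, key=neg_log_likelihood)
sample_mean = sum(data) / len(data)

# the minimizer agrees with the sample mean up to grid resolution
assert abs(mu_hat - sample_mean) < 1e-2
```

For a general U-model the same computation applies with the minus log-likelihood replaced by the U-loss L_U(θ), and the answer is again t̄ by Theorem 5.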
Remark 2. We remark that the empirical Pythagorean theorem holds, in the form

L_U(θ) = L_U(θ̂_U) + D_U(f_U(·, θ̂_U), f_U(·, θ)),

which gives another proof that θ̂_U is ϕ^{−1}(t̄). The statistic t̄ is a sufficient statistic in the sense that the U-loss function L_U(θ) is a function of t̄, as in Equation (29). Accordingly, the U-estimator under the U-model depends on the observations X_1, ..., X_n only through t̄. This extends the fact that the MLE is a function of t̄ under the exponential model with the canonical statistic t(X).
Let us look at the case of the β-power divergence. Under the β-power model, given by the U-model with generator U_β, the β-power estimator for the expectation parameter of t(X) is exactly given by t̄.

Discussion
We have concentrated on elucidating the dual structure of the U-estimator under the U-model, a perspective that extends the relation of the maximum likelihood under the exponential model with a functional degree of freedom. Thus, we explore a rich and practical class of duality structures; however, there remains an unsolved problem when we directly treat the space F as a differentiable manifold, see [39] for an infinite dimensional exponential family. The approach here is not a direct extension of an infinite dimensional manifold, but a path geometry in the following sense: for every pair of elements of F, the geodesic curve connecting the pair is represented in an explicit form in the class of *Γ^(U) connections in our context.
The U-divergence approach was the first attempt to introduce a dually flat structure on F different from the alpha-geometry. However, there are many related studies. For example, a nonparametric information geometry on the space of all functions, without constraints of positivity and normalization, is discussed in Zhang [40]. Amari [41] characterizes the (ρ, τ)-divergence with a decomposable dually flat structure; see also [42]. If ρ is the identity function and τ(s) = (d/ds)U(s), the (ρ, τ)-divergence is nothing but the U-divergence. In effect we confine ourselves to discussing the U-divergence class for the sake of the direct estimability of the U-estimator.
The duality between the maximum entropy and the minimum divergence has been explored in the minimax theorem for a zero-sum game between a decision maker and Nature. The pay-off function is taken to be the U-cross entropy, in which Nature tries to maximize the pay-off function under the mean equal constraint, while the decision maker tries to minimize it. The equilibrium is given by the minimax solution, which is the maximum U-entropy distribution; see [43] for the extensive discussion and the relation with Bayesian robustness. The observation explored in this paper is closely related to this minimax argument; however, we focus on the duality between the statistical model and estimation, where the minimum U-divergence leads to the projection onto the U-model.
In principle, the U-estimator is applicable to any statistical model, since the U-loss function is written in terms of the sample, as is the log-likelihood function. If the choice of the model differs from the U-model, then the U-estimator shows performance different from the present situation. For example, we may consider an exponential model (U(s) = exp(s)) together with a β-estimator (U(s) = U_β(s) as defined in Section 2) for obtaining a robustness property against outlying observations, cf. [19,20]. In such situations, the duality property is no longer valid, since the β-estimator for the parameter of the exponential model is not a function of the sufficient statistic t̄ defined in Theorem 5. Thus, we have to pay attention to aspects other than the duality structure in the presence of outliers, or of misspecification of the statistical model. Furthermore, another type of divergence measure, including the projective power divergence, is recommended to attain super robustness, cf. [21,44].
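The robustness trade-off mentioned above can be sketched for estimating a Gaussian mean with known unit variance: up to terms constant in the mean, the β-power loss reduces to −Σ exp{−β(x_i − µ)²/2}, which downweights outliers exponentially. The data below are hypothetical, with one gross outlier:

```python
import math

# Minimum beta-power divergence estimation of a Gaussian mean (sigma = 1 known).
# For this model the term b_U(theta) is constant in mu, so minimizing the
# beta-power loss is equivalent to minimizing -sum exp(-beta (x - mu)^2 / 2).
data = [0.1, -0.3, 0.2, 0.0, -0.1, 0.3, 10.0]   # hypothetical sample, outlier at 10

def beta_loss(mu, beta):
    return -sum(math.exp(-beta * (x - mu) ** 2 / 2) for x in data)

# crude grid search over mu in [-1, 11]
grid = [-1 + 0.001 * k for k in range(12001)]
mu_beta = min(grid, key=lambda m: beta_loss(m, 0.5))   # beta-estimator, beta = 0.5
mu_mle = sum(data) / len(data)                         # MLE = sample mean
```

The β-estimator stays near the bulk of the data at 0, while the sample mean is dragged toward the outlier; letting β → 0 recovers the non-robust MLE.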
We presented the method of generalized maximum entropy based on the proposed entropy measure, as an extension of the classical maximum entropy method based on the Boltzmann-Gibbs-Shannon entropy. Practical applications of MaxEnt based on the classical maximum entropy are actively pursued in ecological and computational linguistic research, cf. [45,46]. Difficult aspects have been discussed, in which MaxEnt is apt to over-learn on data sets because it basically employs the maximum likelihood estimator. There is great potential for the proposed method to be implemented in these research fields in order to overcome these difficulties, by selecting an appropriate generator function. A detailed discussion is beyond the scope of the present paper; however, we will take up this challenge in the near future with concrete objectives motivated by real data analysis.