Article

Duality of Maximum Entropy and Minimum Divergence

1 The Institute of Statistical Mathematics and The Graduate University for Advanced Studies, Tachikawa, Tokyo 190-8562, Japan
2 The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan
3 Department of Electrical and Electronics Engineering, University of Fukui, Fukui 910-8507, Japan
* Author to whom correspondence should be addressed.
Entropy 2014, 16(7), 3552-3572; https://doi.org/10.3390/e16073552
Submission received: 28 April 2014 / Revised: 19 June 2014 / Accepted: 24 June 2014 / Published: 26 June 2014
(This article belongs to the Special Issue Maximum Entropy and Its Application)

Abstract: We discuss a special class of generalized divergence measures defined by generator functions. Any divergence measure in the class separates into the difference between a cross entropy and a diagonal entropy. The diagonal entropy measure in the class is associated with a model of maximum entropy distributions; the divergence measure leads to statistical estimation via minimization for an arbitrarily given statistical model. The dualistic relationship between the maximum entropy model and the minimum divergence estimation is explored in the framework of information geometry. The model of maximum entropy distributions is characterized as totally geodesic with respect to the linear connection associated with the divergence. This gives a natural extension of the classical theory for the maximum likelihood method under the maximum entropy model in terms of the Boltzmann-Gibbs-Shannon entropy. We discuss the duality in detail for the Tsallis entropy as a typical example.

1. Introduction

Information divergence plays a central role in integrating statistics, information science, statistical physics and machine learning. Let $\mathcal{F}$ be the space of all probability density functions with a common support with respect to a carrier measure $\Lambda$ on a data space. Usually $\Lambda$ is taken as the Lebesgue measure or the counting measure, corresponding to continuous and discrete random variables, respectively. The most typical example of information divergence is the Kullback-Leibler divergence
$$D_0(f, g) = \int f(x)\,\{\log f(x) - \log g(x)\}\, d\Lambda(x)$$
on $\mathcal{F}$, which is decomposed into the difference of the cross entropy
$$C_0(f, g) = -\int f(x) \log g(x)\, d\Lambda(x)$$
and the diagonal entropy
$$H_0(f) = -\int f(x) \log f(x)\, d\Lambda(x).$$
The entropy $H_0(f)$ is nothing but the Boltzmann-Gibbs-Shannon entropy. In effect, $D_0(f, g)$ connects the maximum likelihood [1,2] and the maximum entropy [3]. If we take a canonical statistic $t(X)$, then the maximum entropy distribution under a moment constraint for $t(X)$ belongs to the exponential model associated with $t(X)$,
$$M^{(e)} = \{f_0(x, \theta) := \exp\{\theta^\top t(x) - \kappa_0(\theta)\} : \theta \in \Theta\} \qquad (1)$$
where $\kappa_0(\theta) = \log \int \exp\{\theta^\top t(x)\}\, d\Lambda(x)$ and $\Theta = \{\theta : \kappa_0(\theta) < \infty\}$. In this context, the statistic $t(X)$ is minimally sufficient in the model, and the maximum likelihood estimator (MLE) for the parameter $\theta$ of the model is in one-to-one correspondence with $t(X)$; see [4] for the convex geometry. If we consider the expectation parameter
$$\mu = \mathbb{E}_{f_0(\cdot, \theta)}\{t(X)\}$$
in place of $\theta$, then for a given random sample $X_1, \dots, X_n$, the MLE for $\mu$ is given by the sample mean of the $t(X_i)$'s, that is,
$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^n t(X_i).$$
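As a small numerical illustration of this fact, the following Python sketch checks that the MLE of the expectation parameter in an exponential family coincides with the sample mean of the canonical statistic; the Poisson model, the simulated sample and the grid search are our own illustrative choices, not taken from the text.

```python
import math
import random

def poisson_loglik(lam, xs):
    # Poisson log-likelihood: an exponential family with canonical
    # statistic t(x) = x and kappa_0(theta) = exp(theta)
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

random.seed(0)
xs = [random.randint(0, 10) for _ in range(50)]

# grid-search the MLE of the expectation parameter mu = E[t(X)]
grid = [0.01 * k for k in range(1, 2001)]
mu_hat = max(grid, key=lambda lam: poisson_loglik(lam, xs))

# the maximizer agrees with the sample mean of t(X_i) = X_i
print(mu_hat, sum(xs) / len(xs))
```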
We define two kinds of geodesic curves connecting $f$ and $g$ in $\mathcal{F}$. We call the curve
$$C^{(m)} = \{C^{(m)}_t(x) := (1 - t) f(x) + t g(x) : t \in (0, 1)\} \qquad (2)$$
the mixture geodesic. Alternatively, we call the curve
$$C^{(e)} = \{C^{(e)}_t(x) := \exp\{(1 - t) \log f(x) + t \log g(x) - \kappa(t)\} : t \in (0, 1)\} \qquad (3)$$
the exponential geodesic, where $\kappa(t) = \log \int f(x)^{1-t} g(x)^t\, d\Lambda(x)$. We denote by $\Gamma^{(m)}$ and $\Gamma^{(e)}$ the two linear connections induced by the mixture and exponential geodesic curves on $\mathcal{F}$, which we call the mixture connection and the exponential connection on $\mathcal{F}$, respectively; see [5,6]. Thus all tangent vectors on a mixture geodesic curve are parallel to each other with respect to $\Gamma^{(m)}$; all tangent vectors on an exponential geodesic curve are parallel to each other with respect to $\Gamma^{(e)}$. It is well known that $M^{(e)}$ is totally exponential-geodesic, that is, for any $f_0(x, \theta_0)$ and $f_0(x, \theta_1)$ in $M^{(e)}$, the exponential geodesic curve connecting them lies in $M^{(e)}$. In effect we observe that $C^{(e)}_t(x) = f_0(x, \theta_t)$ with $\theta_t = (1 - t)\theta_0 + t\theta_1$. Thus $C^{(e)}_t(x) \in M^{(e)}$ for all $t \in (0, 1)$ because $\Theta$ is a convex set. Alternatively, consider the parametric model
$$M^{(m)} = \Big\{f_1(x, \pi) := \sum_{j=0}^d \pi_j f_j(x) : \pi_j > 0\ (j = 0, \dots, d),\ \sum_{j=0}^d \pi_j = 1\Big\}.$$
Then $M^{(m)}$ is totally mixture-geodesic, because the mixture geodesic curve $C^{(m)}_t(x) = (1 - t) f_1(x, \pi_0) + t f_1(x, \pi_1)$ equals $f_1(x, \pi_t)$ with $\pi_t = (1 - t)\pi_0 + t\pi_1$, and hence lies in $M^{(m)}$ for any $t \in (0, 1)$.
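Both geodesics can be computed explicitly on a finite space. The Python sketch below (the two four-point densities are our own illustrative choice) evaluates the mixture and exponential curves and checks that they stay normalized, the exponential one thanks to its normalizer.

```python
import math

# two densities on a four-point space (counting measure)
f = [0.1, 0.2, 0.3, 0.4]
g = [0.4, 0.3, 0.2, 0.1]

def mixture_geodesic(t):
    # C_t^(m)(x) = (1 - t) f(x) + t g(x)
    return [(1 - t) * fx + t * gx for fx, gx in zip(f, g)]

def exponential_geodesic(t):
    # C_t^(e)(x) = exp{(1 - t) log f(x) + t log g(x) - kappa(t)};
    # dividing by the sum implements the normalizer kappa(t)
    w = [math.exp((1 - t) * math.log(fx) + t * math.log(gx)) for fx, gx in zip(f, g)]
    s = sum(w)
    return [wx / s for wx in w]

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    m, e = mixture_geodesic(t), exponential_geodesic(t)
    print(t, round(sum(m), 12), round(sum(e), 12))  # both curves stay normalized
```

At $t = 0$ and $t = 1$ both curves reproduce the endpoints $f$ and $g$ exactly.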
We discuss generalized entropy and divergence measures with applications to statistical models and estimation. There have been recent developments in the generalization of the Boltzmann-Shannon entropy and the Kullback-Leibler divergence. We focus on the U-divergence with a generator function U, in which the U-divergence separates into the difference between a cross entropy and a diagonal entropy. We observe a dualistic property associated with the U-divergence between statistical model and estimation. The U-loss function is given by an empirical approximation of the U-divergence based on a given dataset under a statistical model, and the U-estimator is defined by minimization of the U-loss function over the parameter space. On the other hand, the diagonal entropy leads to a maximum entropy distribution on a mean equal space, and we call the resulting family of distributions the U-model. In accordance with this, the U-divergence leads to a pair of U-model and U-estimator as a statistical model and estimation. The typical example is $U(t) = \exp(t)$, which is associated with the Kullback-Leibler divergence $D_0(f, g)$, generating the pair of an exponential family $M^{(e)}$ and the minus log-likelihood function.
This aspect is characterized as a minimax game between a decision maker and Nature. The paper is organized as follows. Section 2 introduces the class of U-divergence measures. The information geometric framework associated with a divergence measure is given in Section 3. Section 4 discusses the maximum entropy model with respect to the U-diagonal entropy. The minimum divergence method via U-divergence is discussed in Section 5. We next explore the duality between maximum U-entropy and minimum U-divergence in Section 6. Finally, we discuss the relation to robust statistics by minimum divergence, and a future problem on MaxEnt, in Section 7.

2. U-Divergence

A class of information divergence measures is constructed from a generator function U via a simple employment of conjugate convexity; see [7]. We introduce a class of generator functions by
$$\mathcal{U} = \Big\{U : \mathbb{R} \to \mathbb{R}_+ : \frac{d}{ds} U(s) > 0,\ \frac{d^2}{ds^2} U(s) > 0\Big\}.$$
Then we consider the conjugate convex function of $U$ in $\mathcal{U}$, defined on $\mathbb{R}_+$ as
$$U^*(t) = \max_{s \in \mathbb{R}}\{s t - U(s)\},$$
and hence $U^*(t) = t\,\xi(t) - U(\xi(t))$, where $\xi(t)$ is the inverse function of the derivative of $U(s)$, or equivalently $(dU/ds)(\xi(t)) = t$. The existence of $\xi(t)$ is guaranteed by the assumption that $U$ is in $\mathcal{U}$, and we observe the important property that the derivative of $U^*$ is the inverse of the derivative of $U$, that is,
$$\frac{d}{dt} U^*(t) = \xi(t). \qquad (4)$$
The conjugate function $U^*$ of $U$ is reflexive, that is, $U^{**} = U$. By definition, for any $s \in \mathbb{R}$ and $t \in \mathbb{R}_+$,
$$U^*(t) \geq s t - U(s) \qquad (5)$$
with equality if and only if $s = \xi(t)$. We consider an information divergence functional using the generator function $U$ as
$$D_U(f, g) = \int \{U^*(f) - f\,\xi(g) + U(\xi(g))\}\, d\Lambda, \qquad (6)$$
called the U-divergence.
called U-divergence. We can easily confirm that DU (f, g) satisfies the first axiom of a distance function since the integrand in Equation (6) is always nonnegative with equality of 0 if and only if f(x)= g(x) because Equation (5). It follows from the construction that DU (f, g) is decomposed into CU (f, g) and HU (f) such that
D U ( f , g ) = C U ( f , g ) H U ( f ) .
Here
C U ( f , g ) = { U ( ξ ( g ) ) f ξ ( g ) } d Λ ,
is called U-cross entropy;
H U ( f ) = U * ( f ) d Λ
is called U-diagonal entropy. We can write HU (f) = ∫ {U(ξ(f)) – (f)}dΛ by the definition for U*, which equals the diagonal CU (f, f). We note that the U-divergence is expressed as
D U ( f , g ) = { U * ( f ) U * ( g ) + ξ ( g ) ( f g ) } d Λ ,
because of Equation (4), which implies that U* plays a role on a generator function in place of U. In fact, this is also called U*-Bregman divergence, cf. [8,9]
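The construction above rests on the conjugacy between $U$ and $U^*$. A minimal Python check for the generator $U_0(s) = \exp(s)$ (the evaluation point $t$ and the candidate values of $s$ are arbitrary illustrative choices) verifies the inequality in Equation (5) and its equality case $s = \xi(t)$.

```python
import math

# U_0(s) = exp(s); its convex conjugate is U_0^*(t) = t log t - t,
# and xi(t) = (dU/ds)^{-1}(t) = log t
U = math.exp
Ustar = lambda t: t * math.log(t) - t
xi = math.log

t = 2.5
candidates = [-1.0, 0.0, 0.5, xi(t), 2.0]
# Young-Fenchel inequality: U^*(t) >= s*t - U(s), equality only at s = xi(t)
gaps = [Ustar(t) - (s * t - U(s)) for s in candidates]
print([round(g, 6) for g in gaps])
```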
The first example of $U$ is $U_0(s) = \exp(s)$, which leads to $U_0^*(t) = t \log t - t$ and
$$\log(t) = \operatorname{argmax}_{s}\{s t - \exp(s)\}.$$
Thus the $U_0$-divergence, $U_0$-cross entropy and $U_0$-diagonal entropy equal $D_0(f, g)$, $C_0(f, g)$ and $H_0(f)$ as defined in the Introduction, respectively. As the second example we consider
$$U_\beta(s) = \frac{1}{\beta + 1}(1 + \beta s)^{\frac{\beta + 1}{\beta}}$$
where $\beta$ is a scalar. The conjugate function becomes
$$U_\beta^*(t) = \frac{1}{\beta(\beta + 1)}\, t^{\beta + 1} - \frac{1}{\beta}\, t.$$
Then the generator function $U_\beta$ is associated with the $\beta$-power cross entropy
$$C_\beta(f, g) = \frac{1}{\beta + 1}\int g^{\beta + 1}\, d\Lambda - \frac{1}{\beta}\int f (g^\beta - 1)\, d\Lambda,$$
the $\beta$-power diagonal entropy
$$H_\beta(f) = -\frac{1}{\beta(\beta + 1)}\int f^{\beta + 1}\, d\Lambda + \frac{1}{\beta}$$
and the $\beta$-power divergence $D_\beta(f, g) = C_\beta(f, g) - H_\beta(f)$, that is,
$$D_\beta(f, g) = \frac{1}{\beta(\beta + 1)}\int f^{\beta + 1}\, d\Lambda - \frac{1}{\beta}\int f g^\beta\, d\Lambda + \frac{1}{\beta + 1}\int g^{\beta + 1}\, d\Lambda.$$
We observe that
$$\lim_{\beta \to 0}\,\big(C_\beta(f, g),\, H_\beta(f)\big) = \big(C_0(f, g),\, H_0(f)\big).$$
The class of $\beta$-power divergence functionals includes the Kullback-Leibler divergence in the limiting sense of $\lim_{\beta \to 0} D_\beta(f, g) = D_0(f, g)$. If $\beta = 1$, then $D_\beta(f, g) = \frac{1}{2}\int (f - g)^2\, d\Lambda$, which is half of the squared $L_2$ norm. If we take the limit of $\beta$ to $-1$, then $D_\beta(f, g)$ becomes the Itakura-Saito divergence
$$D_{\mathrm{IS}}(f, g) = \int \Big(\log g - \log f + \frac{f}{g} - 1\Big)\, d\Lambda,$$
which is widely applied in signal processing and speech recognition, cf. [10–12].
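The $\beta$-power family is easy to evaluate on a discrete space. The short Python sketch below (the two three-point densities are our own illustrative choice) checks the $\beta = 1$ and $\beta \to 0$ statements numerically.

```python
import numpy as np

def D_beta(f, g, beta):
    # beta-power divergence on a discrete space (counting measure)
    return (np.sum(f ** (beta + 1)) / (beta * (beta + 1))
            - np.sum(f * g ** beta) / beta
            + np.sum(g ** (beta + 1)) / (beta + 1))

def D_kl(f, g):
    # Kullback-Leibler divergence D_0(f, g)
    return np.sum(f * (np.log(f) - np.log(g)))

f = np.array([0.2, 0.3, 0.5])
g = np.array([0.4, 0.4, 0.2])

print(D_beta(f, g, 1.0), 0.5 * np.sum((f - g) ** 2))  # beta = 1: half squared L2
print(D_beta(f, g, 1e-6), D_kl(f, g))                 # beta -> 0: Kullback-Leibler
```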
The $\beta$-power divergence $D_\beta(p, q)$ was proposed in [13]; the $\beta$-power entropy $H_\beta$ is equal to the Tsallis q-entropy with the relation $q = \beta + 1$, cf. [14–16]. Tsallis entropy is connected with spin glass relaxation, dissipative optical lattices and other phenomena beyond the classical statistical physics associated with the Boltzmann-Shannon entropy $H_0(p)$. See also [17,18] for the power entropy in the field of ecology. We will discuss the statistical properties of the minimum $\beta$-divergence method in the presence of outliers departing from a supposed model, cf. [19–21]. The robustness performance is controlled by an appropriate selection of $\beta$. Beyond the robustness perspective, a property of spontaneous learning applied to clustering analysis is studied in [22]; see also [23] for nonnegative matrix analysis.
The third example of a generator function is $U_\eta(s) = (1 - \eta)\exp(s) - \eta s$ with a scalar $\eta$. This generator function leads to the $\eta$-cross entropy
$$C_\eta(f, g) = -\int \{f(x) + \eta\} \log\{g(x) + \eta\}\, d\Lambda(x)$$
and the $\eta$-entropy
$$H_\eta(f) = -\int \{f(x) + \eta\} \log\{f(x) + \eta\}\, d\Lambda(x),$$
so that the $\eta$-divergence is $D_\eta(f, g) = C_\eta(f, g) - H_\eta(f)$; see [24–27] for applications to pattern recognition. Obviously, if we take the limit of $\eta$ to 0, then $C_\eta(f, g)$, $H_\eta(f)$ and $D_\eta(f, g)$ converge to $C_0(f, g)$, $H_0(f)$ and $D_0(f, g)$, respectively. A mislabeled model is derived as a maximum $\eta$-entropy distribution with a moment constraint if we consider a binary regression model. See [25,27] for a detailed discussion.

3. Geometry Associated with U-Divergence

We investigate geometric properties associated with the U-divergence, which will support the discussion in subsequent sections. Let us arbitrarily fix a statistical model $M = \{f_\theta(x) : \theta \in \Theta\}$ embedded in the total space $\mathcal{F}$ under mild regularity conditions. In fact, we consider the mixture geodesic curve $C^{(m)}$, the exponential geodesic curve $C^{(e)}$, the mixture model $M^{(m)}$ and the exponential model $M^{(e)}$ as typical examples of $M$. There are difficult aspects in defining $\mathcal{F}$ as a differentiable manifold of infinite dimension, because the constraint of positivity on the support is intractable in the sense of the topology; see Section 2 in [6] for a detailed discussion and historical remarks. On the other hand, if we confine ourselves to a statistical model $M$, then we can formulate $M$ as a finite-dimensional manifold, as in the following discussion. Thus we produce a path geometry in which, for any two elements $f$ and $g$ of $\mathcal{F}$, a class of geodesic curves connecting $f$ and $g$, including $C^{(m)}$ and $C^{(e)}$, is introduced, so that the class of geodesic subspaces, such as $M^{(m)}$ and $M^{(e)}$, is derived.

3.1. Riemannian Metric and Linear Connections

We view the statistical model $M$ as a $d$-dimensional differentiable manifold with the coordinate $\theta = (\theta_1, \dots, \theta_d)$. Any information divergence associates with a Riemannian metric and a pair of dual linear connections; see [28,29] for a detailed discussion. We focus on the geometry generated by the U-divergence $D_U(f, g)$ as follows. The Riemannian metric at $f_\theta$ of $M$ is given by
$$G^{(U)}_{ij}(\theta) = \int \partial_i f_\theta\, \partial_j \xi(f_\theta)\, d\Lambda, \qquad (10)$$
and the linear connections are
$$\Gamma^{(U)}_{ij,k}(\theta) = \int \partial_i \partial_j f_\theta\, \partial_k \xi(f_\theta)\, d\Lambda \qquad (11)$$
and
$$\Gamma^{*(U)}_{ij,k}(\theta) = \int \partial_k f_\theta\, \partial_i \partial_j \xi(f_\theta)\, d\Lambda, \qquad (12)$$
where $\partial_i = \partial/\partial\theta_i$; see the Appendix for the derivation. Now we can assert the following theorem under an assumption on $\mathcal{F}$: let $f$ be arbitrarily fixed in $\mathcal{F}$; if $\int a(x)\{g(x) - f(x)\}\, d\Lambda(x) = 0$ for any $g$ of $\mathcal{F}$, then $a(x)$ is constant in $x$ almost everywhere with respect to $\Lambda$.
Theorem 1. Let Γ(U) be the linear connection defined in Equation (11). Then any Γ(U)-geodesic curve is equal to the mixture-geodesic curve defined in Equation (2).
Proof. Let $C^{(U)} := \{f_t(x) : t \in (0, 1)\}$ be a $\Gamma^{(U)}$-geodesic curve with $f_0 = f$ and $f_1 = g$. We consider a model defined by $f_\theta(x) = (1 - s + u) f_t(x) + (s - u) g(x)$, where $\theta = (s, t, u)$. Then we observe that, if $u = s$, then
$$\Gamma^{(U)}_{11,2}(\theta) = \int \Big(\frac{d^2}{dt^2} f_t\Big)\, \xi'(f_t)\,(g - f_t)\, d\Lambda,$$
which is identically 0 for any $g$ of $\mathcal{F}$. It follows from the assumption on $\mathcal{F}$ that $(d^2/dt^2) f_t(x) = c$ almost everywhere with respect to $\Lambda$, which is solved by
$$f_t(x) = \frac{1}{2}\, c\, t(t - 1) + (1 - t) f(x) + t g(x)$$
from the endpoint conditions for $C^{(U)}$. We observe that $c = 0$ because $f_t(x) \in \mathcal{F}$, which concludes that $C^{(U)}$ equals the mixture geodesic. The proof is complete.
This property is elemental in characterizing the U-divergence class, and is closely related to the empirical reducibility discussed in a subsequent section. The assumption on $\mathcal{F}$ holds if the carrier measure $\Lambda$ is the Lebesgue measure or the counting measure.
On the other hand, for a $\Gamma^{*(U)}$-geodesic curve $C^{*(U)} := \{f^*_t(x) : t \in (0, 1)\}$ with $f_0 = f$ and $f_1 = g$, we consider an embedding into a two-dimensional model,
$$f^*_\theta(x) = u\big((1 - s + t)\,\xi(f^*_t(x)) + (s - t)\,\xi(g(x)) - \kappa_\theta\big),$$
where $\theta = (s, t)$, $u(s) = (d/ds)\, U(s)$ and $\kappa_\theta$ is a normalizing constant satisfying $\int f^*_\theta(x)\, d\Lambda(x) = 1$. By definition,
$$\Gamma^{*(U)}_{11,2}(\theta) = \int \Big(\frac{d^2}{dt^2}\,\xi(f^*_t)\Big)\, u'(\xi(f^*_t))\,\{\xi(g) - \xi(f^*_t)\}\, d\Lambda = 0$$
if $s = t$. This leads to $(d^2/dt^2)\,\xi(f^*_t(x)) = c$ almost everywhere with respect to $\Lambda$, which is solved by
$$f^*_t(x) = u\big((1 - t)\,\xi(f(x)) + t\,\xi(g(x)) - \kappa_t\big).$$
We confirm that, if $U = \exp$, then the $\Gamma^{*(U)}$-geodesic curve reduces to the exponential geodesic curve defined in Equation (3). □

3.2. Generalized Pythagorean Theorems

We next consider the Pythagorean theorem based on the U-divergence as an extension of the result associated with the Kullback-Leibler divergence in [6].
Theorem 2. Let $p$, $q$ and $r$ be in $\mathcal{F}$. We connect $p$ with $q$ by the mixture geodesic
$$f^{(m)}_t(x) = (1 - t)\, p(x) + t\, q(x).$$
Alternatively, we connect $r$ and $q$ by the $\Gamma^{*(U)}$-geodesic curve
$$f^{(U)}_s(x) = u\big((1 - s)\,\xi(r(x)) + s\,\xi(q(x)) - \kappa(s)\big). \qquad (15)$$
The two curves $\{f^{(m)}_t(x) : t \in [0, 1]\}$ and $\{f^{(U)}_s(x) : s \in [0, 1]\}$ orthogonally intersect at $q$ with respect to the Riemannian metric $G^{(U)}$ defined in Equation (10) if and only if
$$D_U(p, r) = D_U(p, q) + D_U(q, r). \qquad (16)$$
Proof. A straightforward calculation yields that
$$\frac{\partial^2}{\partial t\, \partial s}\, D_U\big(f^{(m)}_t, f^{(U)}_s\big)\Big|_{t=1,\, s=1} = D_U(p, r) - \{D_U(p, q) + D_U(q, r)\}. \qquad (17)$$
By the definition of $G^{(U)}$ we see that $G^{(U)}_{12}(\theta)$ is nothing but the left side of Equation (17) when
$$f_\theta(x) = (1 - t)\, p(x) + t\, f^{(U)}_s(x),$$
where $\theta = (t, s)$. Hence the orthogonality assumption is equivalent to Equation (16), which completes the proof.
Remark 1. We remark the further property that, for any $s$ and $t$ in $[0, 1]$,
$$D_U(p_t, r_s) = D_U(p_t, q) + D_U(q, r_s),$$
where $p_t = f^{(m)}_t$ and $r_s = f^{(U)}_s$. If $U = \exp$, then Theorem 2 reduces to the Pythagorean theorem with the Kullback-Leibler divergence as shown in [6]. Consider the two geodesic subspaces defined by
$$M^{(m)} = \Big\{p_\pi(x) = \pi_0\, q(x) + \sum_{j=1}^J \pi_j\, p_j(x) : \pi_j \geq 0\ (j = 0, \dots, J),\ \sum_{j=0}^J \pi_j = 1\Big\}$$
and
$$M^{(U)} = \Big\{r_\ell(x) = u\Big(\ell_0\,\xi(q(x)) + \sum_{k=1}^K \ell_k\,\xi(r_k(x)) - \kappa(\ell)\Big) : \ell_k \geq 0\ (k = 0, \dots, K),\ \sum_{k=0}^K \ell_k = 1\Big\}. \qquad (18)$$
For any mixture geodesic curve $C^{(m)}$ and $\Gamma^{*(U)}$-geodesic curve $C^{*(U)}$ passing through $q$, we assume that $C^{(m)}$ and $C^{*(U)}$ orthogonally intersect at $q$ in the sense of the Riemannian metric $G^{(U)}$. Then, for any $p \in M^{(m)}$ and $r \in M^{(U)}$,
$$D_U(p, r) = D_U(p, q) + D_U(q, r),$$
with which the two-way projection is associated as
$$D_U(p, q) = \min_{r \in M^{(U)}} D_U(p, r) \quad\text{and}\quad D_U(q, r) = \min_{p \in M^{(m)}} D_U(p, r).$$
First we confirm a reduction property of the Kullback-Leibler divergence to the framework of information geometry, namely $(G^{(D_0)}, \Gamma^{(D_0)}, \Gamma^{*(D_0)}) = (G, \Gamma^{(m)}, \Gamma^{(e)})$, where $G$ is the information metric. Second, we return to the case of the $\beta$-power divergence, which reduces to a special case of Theorem 2. Consider the two curves $C^{(m)} = \{C^{(m)}_t(x) = (1 - t)\, p(x) + t\, q(x) : t \in [0, 1]\}$ and
$$C^{(\beta)} = \big\{C^{(\beta)}_s(x) = \{(1 - s)\, r(x)^\beta + s\, q(x)^\beta + c(s)\}^{\frac{1}{\beta}} : s \in [0, 1]\big\}.$$
Then we observe for the Riemannian metric $G^{(\beta)}$ generated by the $\beta$-power divergence that
$$G^{(\beta)}\big(\dot{C}^{(m)}_1, \dot{C}^{(\beta)}_1\big)(q) = D_\beta(p, r) - \{D_\beta(p, q) + D_\beta(q, r)\},$$
which equals $\frac{1}{\beta}\int (p - q)(q^\beta - r^\beta)\, d\Lambda$. We observe that, if $C^{(m)}$ and $C^{(\beta)}$ orthogonally intersect at $q$, then
$$D_\beta(p, r) = D_\beta(p, q) + D_\beta(q, r).$$
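This relation can be checked numerically. In the Python sketch below we take $\beta = 1$ and hand-pick a triple $(p, q, r)$ on a three-point space so that $p - q$ is orthogonal to $q^\beta - r^\beta$, the orthogonality condition at $q$ in this discrete setting; the specific densities are our own illustrative choice.

```python
import numpy as np

def D_beta(f, g, beta):
    # beta-power divergence on a discrete space (counting measure)
    return (np.sum(f ** (beta + 1)) / (beta * (beta + 1))
            - np.sum(f * g ** beta) / beta
            + np.sum(g ** (beta + 1)) / (beta + 1))

beta = 1.0
q = np.array([0.3, 0.3, 0.4])
r = np.array([0.5, 0.2, 0.3])
# p - q is chosen orthogonal to q**beta - r**beta, so the mixture segment
# from p to q meets the beta-curve from r to q at a right angle at q
p = np.array([0.3, 0.4, 0.3])

defect = D_beta(p, r, beta) - D_beta(p, q, beta) - D_beta(q, r, beta)
print(np.sum((p - q) * (q ** beta - r ** beta)), defect)  # both vanish
```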

4. Maximum Entropy Distribution

The maximum entropy principle is based on the Boltzmann-Shannon entropy, for which the maximum entropy distribution is characterized by an exponential model. The maximum entropy method has been widely applied in fields such as natural language processing and ecological analysis. However, other types of entropy measures have been proposed in different fields, such as the Hill diversity index, the Gini-Simpson index and the Tsallis entropy, cf. [14,17,18]. We introduced the class of U-entropy functionals, which includes all the entropy measures mentioned above. In this section, we discuss the maximum entropy distribution based on an arbitrarily fixed U-entropy.
We check a finite discrete case with $K + 1$ cells as a special situation, where $\mathcal{F}$ reduces to the $K$-dimensional simplex $\mathcal{S}_K$. The maximum U-entropy distribution is defined by
$$f^* = \operatorname{argmax}_{f \in \mathcal{S}_K} H_U(f).$$
The Lagrange function is
$$L(f, \lambda) = \sum_{i=1}^{K+1}\{-\xi(f_i)\, f_i + U(\xi(f_i))\} + \lambda\Big(\sum_{i=1}^{K+1} f_i - 1\Big).$$
We observe that
$$\frac{\partial}{\partial f_i} L(f, \lambda) = -\xi(f_i) + \lambda = 0,$$
which implies $f^*_i = 1/(K + 1)$ for $i = 1, \dots, K + 1$. Therefore the maximum U-entropy distribution $f^*$ is the uniform distribution on $\mathcal{S}_K$ for any generator function $U$.
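A quick Monte Carlo check of this conclusion for the $\beta$-power (Tsallis-type) entropy compares the uniform distribution against random points of the simplex; the values $\beta = 0.5$ and $K = 4$ are illustrative choices.

```python
import numpy as np

def H_beta(f, beta):
    # beta-power (Tsallis-type) diagonal entropy on a finite simplex
    return -np.sum(f ** (beta + 1)) / (beta * (beta + 1)) + 1.0 / beta

rng = np.random.default_rng(0)
K, beta = 4, 0.5                        # the simplex S_K has K + 1 cells
uniform = np.full(K + 1, 1.0 / (K + 1))

h_uniform = H_beta(uniform, beta)
h_random = [H_beta(rng.dirichlet(np.ones(K + 1)), beta) for _ in range(1000)]
print(h_uniform, max(h_random))  # no random point beats the uniform distribution
```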
In general the U-entropy is an unbounded functional on $\mathcal{F}$ unless the data space is finite discrete. For this reason we introduce a moment constraint as follows. Let $t(X)$ be a $k$-dimensional statistic vector. Henceforth we assume that $\mathbb{E}_f\{\|t(X)\|^2\} < \infty$ for all $f$ of $\mathcal{F}$. We consider the mean equal space for $t(X)$,
$$\Gamma(\tau) = \{f \in \mathcal{F} : \mathbb{E}_f\{t(X)\} = \tau\},$$
where $\tau$ is a fixed vector in $\mathbb{R}^k$. By definition $\Gamma(\tau)$ is totally mixture geodesic, that is, if $f$ and $g$ are in $\Gamma(\tau)$, then $(1 - t) f + t g$ is also in $\Gamma(\tau)$ for any $t \in (0, 1)$.
Theorem 3. Let $f^*_\tau = \operatorname{argmax}\{H_U(f) : f \in \Gamma(\tau)\}$, where $H_U(f)$ is the U-diagonal entropy defined in Equation (7). Then the maximum U-entropy distribution is given by
$$f^*_\tau(x) = u\big(\theta^\top t(x) - \kappa_U(\theta)\big), \qquad (20)$$
where $\kappa_U(\theta)$ is the normalizing factor and $\theta$ is a parameter vector determined by the moment constraint
$$\int t(x)\, u\big(\theta^\top t(x) - \kappa_U(\theta)\big)\, d\Lambda(x) = \tau.$$
Proof. The Euler-Lagrange functional is given by
$$\Phi(f, \theta, \lambda) = H_U(f) - \theta^\top\big[\mathbb{E}_f\{t(X)\} - \tau\big] - \lambda\Big\{\int f(x)\, d\Lambda(x) - 1\Big\}.$$
If $g_\tau \in \Gamma(\tau)$ and $f_t(x) = (1 - t)\, f^*_\tau(x) + t\, g_\tau(x)$, then $f_t \in \Gamma(\tau)$, and
$$\frac{d}{dt}\Phi(f_t, \theta, \lambda)\Big|_{t=0} = 0, \qquad \frac{d^2}{dt^2}\Phi(f_t, \theta, \lambda)\Big|_{t=0} < 0. \qquad (21)$$
The equation in Equation (21) yields that
$$\int \big\{\xi(f^*_\tau(x)) - \theta^\top(t(x) - \tau) - \lambda\big\}\,\big\{g_\tau(x) - f^*_\tau(x)\big\}\, d\Lambda(x) = 0$$
for any $g_\tau(x)$ in $\Gamma(\tau)$, which concludes Equation (20). Since $\xi(t)$ is an increasing function, we observe that
$$\frac{d^2}{dt^2}\Phi(f_t, \theta, \lambda) = -\int \xi'(f_t(x))\,\{g_\tau(x) - f^*_\tau(x)\}^2\, d\Lambda(x) < 0$$
for any $t \in [0, 1]$, which implies the inequality in Equation (21). Since $g_\tau \in \Gamma(\tau)$, we observe that
$$\mathbb{E}_{g_\tau}\{\xi(f^*_\tau(X))\} = \mathbb{E}_{f^*_\tau}\{\xi(f^*_\tau(X))\}.$$
Therefore we can confirm that $H_U(f^*_\tau) \geq H_U(g_\tau)$ for any $g_\tau \in \Gamma(\tau)$, since
$$H_U(f^*_\tau) - H_U(g_\tau) = D_U(g_\tau, f^*_\tau),$$
which is nonnegative by the definition of the U-divergence. The proof is complete.
Here we give a definition of the model of maximum U-entropy distributions as follows.
Definition 1. We define the $k$-dimensional model
$$M_U = \big\{f_U(x, \theta) := u\big(\theta^\top t(x) - \kappa_U(\theta)\big) : \theta \in \Theta\big\}, \qquad (23)$$
which is called the U-model, where $\Theta = \{\theta \in \mathbb{R}^k : \kappa_U(\theta) < \infty\}$.
Naudts' deformed exponential family, discussed from a statistical-physics viewpoint in [15], is closely related to the U-model. The one-parameter family $\{r_s(x) : s \in [0, 1]\}$ defined in Equation (15) is a one-dimensional U-model, and $M^{(U)}$ defined in Equation (18) is a $K$-dimensional U-model. For the U-model $M_U$ defined in Equation (23), the parameter $\theta$ is an affine parameter for the linear connection $\Gamma^{*(U)}$ defined in Equation (12). In fact, we observe from the definition in Equation (12) that
$$\Gamma^{*(U)}_{ij,k}(\theta) = -\partial_i \partial_j \kappa_U(\theta)\int \partial_k f_U(x, \theta)\, d\Lambda(x),$$
which is identically 0 for all $\theta \in \Theta$. We have a geometric understanding of the U-model similar to that for the exponential model discussed in the Introduction.
Theorem 4. Assume for $U$ that $U''(t) > 0$ for any $t$ in $\mathbb{R}$. Then the U-model is totally $\Gamma^{*(U)}$-geodesic.
Proof. For arbitrarily fixed $\theta_1$ and $\theta_2$ in $\Theta$, we define the $\Gamma^{*(U)}$-geodesic curve connecting $f_U(x, \theta_1)$ and $f_U(x, \theta_2)$ such that, for $\lambda \in (0, 1)$,
$$f_\lambda(x) = u\big(\lambda\,\xi(f_U(x, \theta_1)) + (1 - \lambda)\,\xi(f_U(x, \theta_2)) - \kappa(\lambda)\big)$$
with a normalizing factor $\kappa(\lambda)$, which is written as $f_\lambda(x) = f_U(x, \theta_\lambda)$, where $\theta_\lambda = \lambda\theta_1 + (1 - \lambda)\theta_2$. Hence it suffices to show $\theta_\lambda \in \Theta$ for all $\lambda \in (0, 1)$, where $\Theta$ is defined in Definition 1. We look at the identity $\int f_U(x, \theta)\, d\Lambda(x) = 1$, which holds because $f_U(x, \theta)$ is a probability density function. The first derivative gives
$$\int u'\big(\theta^\top t(x) - \kappa_U(\theta)\big)\,\big\{t(x) - \partial_\theta \kappa_U(\theta)\big\}\, d\Lambda(x) = 0$$
and the second derivative gives
$$\int u''\big(\theta^\top t(x) - \kappa_U(\theta)\big)\,\big\{t(x) - \partial_\theta \kappa_U(\theta)\big\}\big\{t(x) - \partial_\theta \kappa_U(\theta)\big\}^\top d\Lambda(x) - \int u'\big(\theta^\top t(x) - \kappa_U(\theta)\big)\, d\Lambda(x)\ \frac{\partial^2}{\partial\theta\,\partial\theta^\top}\kappa_U(\theta) = 0. \qquad (24)$$
The identity in Equation (24) shows that the Hessian of $\kappa_U(\theta)$ is proportional to a Gramian matrix, which implies that $\kappa_U(\theta)$ is convex in $\theta$. Hence $\kappa_U(\theta_\lambda) \leq \lambda\,\kappa_U(\theta_1) + (1 - \lambda)\,\kappa_U(\theta_2) < \infty$ for $\theta_1$ and $\theta_2$ in $\Theta$. This concludes that $\theta_\lambda \in \Theta$ for any $\lambda \in (0, 1)$, which completes the proof.
We discuss a typical example with the power entropy $H_\beta(f)$; see [15,30–34] from the viewpoint of statistical physics. First we consider a mean equal space of univariate distributions on $(0, \infty)$,
$$\Gamma(\mu) = \{f : \mathbb{E}_f\{t(X)\} = \mu\},$$
where
$$t(x) = \Big(x,\ \frac{x^{\beta(\kappa - 1)} - 1}{\beta}\Big).$$
Note that $\lim_{\beta \to 0} t(x) = (x, (\kappa - 1)\log x)$. To obtain the maximum entropy distribution with $H_\beta$, we consider the Euler-Lagrange function given by
$$E_\beta(f, \lambda) = -\frac{1}{\beta(\beta + 1)}\int_0^\infty f(x)^{1 + \beta}\, dx + \theta^\top\Big\{\int_0^\infty t(x) f(x)\, dx - \mu\Big\} + \lambda\Big\{\int_0^\infty f(x)\, dx - 1\Big\},$$
where $\theta$ and $\lambda$ are Lagrange multipliers. This yields that the maximum entropy distribution is
$$f_\beta(x, \theta) = Z_\beta(\theta)^{-1}\big(1 + \beta\,\theta^\top t(x)\big)^{\frac{1}{\beta}} = Z_\beta(\theta)^{-1}\big(\beta\theta_1 x + \theta_2 x^{\beta(\kappa - 1)}\big)^{\frac{1}{\beta}} = Z_\beta(\theta)^{-1}\, x^{\kappa - 1}\big(\theta_2 + \beta\theta_1 x^{1 - \beta(\kappa - 1)}\big)^{\frac{1}{\beta}},$$
where constants are absorbed into the normalizing factor, $\theta$ is determined by $\mu$ such that $\mathbb{E}_{f_\beta(\cdot,\theta)}\, t(X) = \mu$, and
$$Z_\beta(\theta) = \int_0^\infty x^{\kappa - 1}\big(\theta_2 + \beta\theta_1 x^{1 - \beta(\kappa - 1)}\big)^{\frac{1}{\beta}}\, dx.$$
In the limit of $\beta$ to 0, $f_\beta(x, \theta)$ reduces to a gamma distribution, defined by the density function
$$f(x, \kappa, \theta) = \frac{x^{\kappa - 1}\exp(-x/\theta)}{\Gamma(\kappa)\,\theta^\kappa}.$$
Second, we consider the case of multivariate distributions, where the moment constraints for a fixed $p$-dimensional vector $\mu$ and a $p \times p$ matrix $V$ are
$$\Gamma(\mu, V) = \{f : \mathbb{E}_f(X) = \mu,\ \mathbb{V}_f(X) = V\}.$$
Let
$$f_\beta(\cdot, \mu, V) = \operatorname{argmax}_{f \in \Gamma(\mu, V)} H_\beta(f).$$
If we consider the limit of $\beta$ to 0, then $H_\beta(f)$ reduces to the Boltzmann-Shannon entropy and the maximum entropy distribution is the Gaussian distribution with the density function
$$\varphi(x, \mu, V) = \{\det(2\pi V)\}^{-p/2}\exp\Big\{-\frac{1}{2}(x - \mu)^\top V^{-1}(x - \mu)\Big\}.$$
In general we deduce that, if $\beta > -\frac{2}{p + 2}$, then the maximum $\beta$-power entropy distribution uniquely exists with the density function
$$f_\beta(x, \mu, V) = c_\beta\, \det(2\pi V)^{-\frac{1}{2}}\Big\{1 - \frac{\beta}{2 + p\beta + 2\beta}\,(x - \mu)^\top V^{-1}(x - \mu)\Big\}_+^{\frac{1}{\beta}},$$
where
$$c_\beta = \begin{cases}\Big(\dfrac{2\beta}{2 + p\beta + 2\beta}\Big)^{\frac{p}{2}}\,\Gamma\Big(1 + \dfrac{p}{2} + \dfrac{1}{\beta}\Big)\Big\{\Gamma\Big(1 + \dfrac{1}{\beta}\Big)\Big\}^{-1} & \text{if } \beta > 0, \\[2ex] \Big(\dfrac{-2\beta}{2 + p\beta + 2\beta}\Big)^{\frac{p}{2}}\,\Gamma\Big(-\dfrac{1}{\beta}\Big)\Big\{\Gamma\Big(-\dfrac{1}{\beta} - \dfrac{p}{2}\Big)\Big\}^{-1} & \text{if } -\dfrac{2}{p + 2} < \beta < 0.\end{cases}$$
See [35,36] for the detailed discussion and [37,38] for the discussion on group invariance. Thus, if $\beta > 0$, then the maximum $\beta$-power entropy distribution has the compact support
$$\Big\{x \in \mathbb{R}^p : (x - \mu)^\top V^{-1}(x - \mu) \leq \frac{2}{\beta} + p + 2\Big\}.$$
A typical case is $\beta = 2$, which gives the Wigner semicircle distribution. On the other hand, if $-\frac{2}{p + 2} < \beta < 0$, the maximum $\beta$-power entropy distribution has the full support $\mathbb{R}^p$ and equals a $p$-variate t-distribution with a degree of freedom depending on $\beta$.

5. Minimum Divergence Method

We have shown a variety of U-divergence functionals defined by various generator functions, for which the minimum divergence methods are applied to analyses in statistics and statistical machine learning. In effect, the U-cross entropy $C_U(f, g)$ is convex-linear in $f$, that is,
$$C_U\Big(\sum_{j=1}^J \lambda_j f_j,\ g\Big) = \sum_{j=1}^J \lambda_j\, C_U(f_j, g)$$
for any $\lambda_j > 0$ with $\sum_{j=1}^J \lambda_j = 1$. This is closely related to the characteristic property that the linear connection $\Gamma^{(U)}$ associated with the U-divergence equals the mixture connection $\Gamma^{(m)}$, as discussed in Theorem 1. Furthermore, for a fixed $g$, $C_U(f, g)$ can be viewed as a functional of $F$ in place of $f$ as follows:
$$C_U(F, g) = -\int \xi(g(x))\, dF(x) + \int U(\xi(g(x)))\, d\Lambda(x),$$
where $F$ is the probability distribution generated from $f(x)$. If we assume a random sequence $X_1, \dots, X_n$ from a density function $f(x)$, then the U-cross entropy is approximated as
$$C_U(\bar{F}_n, g) = -\frac{1}{n}\sum_{i=1}^n \xi(g(X_i)) + \int U(\xi(g))\, d\Lambda, \qquad (25)$$
where $\bar{F}_n$ is the empirical distribution based on the data $X_1, \dots, X_n$, that is, $\bar{F}_n(B) = \frac{1}{n}\sum_{i=1}^n I(X_i \in B)$ for any Borel measurable set $B$. By definition,
$$\int \xi(g(x))\, d\bar{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \xi(g(X_i)).$$
Consequently, if we model $g$ by a model function $f(\cdot, \theta)$, then the right side of Equation (25) depends only on the data set $(X_i)_{i=1}^n$ and the parameter $\theta$, without any knowledge of the underlying density function $f(x)$. This gives the empirical approximation, which is advantageous over other classes of divergence measures. The minimum U-divergence method is directly applied as minimization of the empirical approximation with respect to $\theta$. We note that the minimum divergence is equivalent to the minimum cross entropy, since the diagonal entropy is just a constant in $\theta$. In particular, in the classical case,
$$C_0(\bar{F}_n, f(\cdot, \theta)) = -\frac{1}{n}\sum_{i=1}^n \log f(X_i, \theta) + 1,$$
which is equivalent to the minus log-likelihood function.
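For the classical case this identity is immediate to verify in code. The following Python sketch (a standard Gaussian model and a simulated sample, our own illustrative choices) compares the empirical $U_0$-cross entropy with the mean minus log-likelihood plus one.

```python
import math
import random

def C0_empirical(xs, logpdf):
    # C_0(F_n, g) = -(1/n) sum_i log g(X_i) + integral g dLambda,
    # and the integral equals 1 for any density g
    return -sum(logpdf(x) for x in xs) / len(xs) + 1.0

def gauss_logpdf(x, mu=0.0, sigma=1.0):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(100)]

nll = -sum(gauss_logpdf(x) for x in xs) / len(xs)  # mean minus log-likelihood
print(C0_empirical(xs, gauss_logpdf), nll + 1.0)   # the two quantities agree
```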
Let $X_1, \dots, X_n$ be independently and identically distributed from an underlying density function $f(x)$, which is approximated by a statistical model $M = \{f(x, \theta) : \theta \in \Theta\}$. The U-loss function is introduced by
$$L_U(\theta) = -\frac{1}{n}\sum_{i=1}^n \xi(f(X_i, \theta)) + b_U(\theta),$$
where $b_U(\theta) = \int U(\xi(f(x, \theta)))\, d\Lambda(x)$. We call $\hat{\theta}_U = \operatorname{argmin}_{\theta \in \Theta} L_U(\theta)$ the U-estimator for the parameter $\theta$. By definition $\mathbb{E}_f\{L_U(\theta)\} = C_U(F, f(\cdot, \theta))$ for all $\theta$ in $\Theta$, which implies that $L_U(\theta)$ almost surely converges to $C_U(F, f(\cdot, \theta))$ as $n$ goes to $\infty$. Let us define a statistical functional as
$$\theta_U(F) = \operatorname{argmin}_{\theta \in \Theta} C_U(F, f(\cdot, \theta)),$$
where $C_U(F, g)$ is $C_U(f, g)$ written with $f$ replaced by the probability distribution $F$ generated from $f$. Then $\theta_U(F)$ is model-consistent, i.e., $\theta_U(F_\theta) = \theta$ for any $\theta \in \Theta$, because
$$C_U(F_\theta, f(\cdot, \theta')) \geq H_U(f(\cdot, \theta))$$
with equality if and only if $\theta' = \theta$, where $F_\theta$ is the probability distribution induced from $f(x, \theta)$.
Hence the U-estimator $\hat{\theta}_U$ is asymptotically consistent. The estimating function is given by
$$s_U(x, \theta) = \partial_\theta\, \xi(f(x, \theta)) - \mathbb{E}_{f(\cdot, \theta)}\{\partial_\theta\, \xi(f(X, \theta))\}.$$
Consequently we confirm that $s_U(x, \theta)$ is unbiased in the sense that $\mathbb{E}_{f(\cdot, \theta)}\{s_U(X, \theta)\} = 0$.
We next investigate the asymptotic normality of the U-estimator. The estimating equation for the U-estimator is given by
$$\frac{1}{n}\sum_{i=1}^n s_U(X_i, \hat{\theta}_U) = 0,$$
of which the Taylor approximation gives
$$\frac{1}{n}\sum_{i=1}^n \Big\{s_U(X_i, \theta_U(F)) + \frac{\partial s_U}{\partial \theta^\top}(X_i, \theta_U(F))\,(\hat{\theta}_U - \theta_U(F))\Big\} = o_P(n^{-\frac{1}{2}}).$$
In accordance with this, we get the asymptotic approximation
$$\sqrt{n}\,\{\hat{\theta}_U - \theta_U(F)\} = \frac{1}{\sqrt{n}}\, J(\theta_U(F))^{-1}\sum_{i=1}^n s_U(X_i, \theta_U(F)) + o_P(1),$$
where
$$J(\theta) = -\mathbb{E}_{f(\cdot, \theta)}\Big\{\frac{\partial s_U}{\partial \theta^\top}(X, \theta)\Big\},$$
because the strong law of large numbers gives
$$-\frac{1}{n}\sum_{i=1}^n \frac{\partial s_U}{\partial \theta^\top}(X_i, \theta_U(F)) \xrightarrow{\mathrm{a.s.}} J(\theta_U(F))$$
as $n$ goes to $\infty$, where a.s. denotes almost sure convergence. If the underlying density function is in the model $M$, that is, $f(x) = f(x, \theta)$, then it follows from the model consistency of $\theta_U(F)$ that
$$\sqrt{n}\,(\hat{\theta}_U - \theta) = \frac{1}{\sqrt{n}}\, J(\theta)^{-1}\sum_{i=1}^n s_U(X_i, \theta) + o_P(1),$$
which implies that
$$\sqrt{n}\,(\hat{\theta}_U - \theta) \xrightarrow{D} N\big(0,\ J(\theta)^{-1} V(\theta)\, J(\theta)^{-1}\big),$$
where $\xrightarrow{D}$ denotes convergence in distribution and
$$V(\theta) = \mathbb{V}_{f(\cdot, \theta)}\{s_U(X, \theta)\}.$$
If the generator function is taken as $U(s) = \exp(s)$, then the U-estimator reduces to the MLE, with asymptotic normality $N(0, G(\theta)^{-1})$, where $G(\theta)$ is the Fisher information matrix for $\theta$.
Consider the U-estimator for the parameter $\theta$ of the exponential model $M^{(e)}$ in Equation (1). In particular, we are concerned with possible outliers contaminating the exponential model, and hence an $\epsilon$-contamination model is defined as
$$F_{\theta, \epsilon, y}(x) = (1 - \epsilon)\, F_0(x, \theta) + \epsilon\, \delta_y(x),$$
where $\epsilon$ ($0 < \epsilon < 1$) is a sufficiently small constant, $F_0(x, \theta)$ is the cumulative distribution function of the exponential model, and $\delta_y(x)$ denotes the degenerate distribution at $y$. The influence function for the U-estimator is given by
$$\mathrm{IF}(\hat{\theta}_U, y) := \lim_{\epsilon \to 0} \frac{\theta_U(F_{\theta, \epsilon, y}) - \theta}{\epsilon} = J(\theta)^{-1}\, s_U(y, \theta);$$
see [19,20,27]. Thus we can check the robustness of the U-estimator by whether or not the influence function is bounded in $y$. For example, if we adopt the $\beta$-power generator $U_\beta$, then
$$\mathrm{IF}(\hat{\theta}_U, y) = J(\theta)^{-1}\big[\{t(y) - \mu\}\, f_0(y, \theta)^\beta - b(\theta, \beta)\big], \qquad (27)$$
where $b(\theta, \beta) = \int \{t(x) - \mu\}\, f_0(x, \theta)^\beta\, d\Lambda(x)$. Thus, if $\beta > 0$, then the influence function is confirmed to be bounded in $y$ in most cases, including the normal, exponential and Poisson distribution models, since the term $\{t(y) - \mu\}\, f_0(y, \theta)^\beta$ in Equation (27) is bounded in $y$ for these models. On the other hand, if $\beta = 0$, that is, for the maximum likelihood estimator, the influence function is unbounded because the term $t(y) - \mu$ is unbounded in $y$ for these models.
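The contrast between $\beta > 0$ and $\beta = 0$ is visible numerically. The Python sketch below evaluates the $y$-dependent term of the influence function for the normal mean model with $t(y) = y$; the value $\beta = 0.5$ and the probe points $y$ are illustrative choices.

```python
import math

def phi(y, mu=0.0):
    # standard normal density: the model f_0(y, theta) with t(y) = y
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def if_term(y, beta, mu=0.0):
    # y-dependent part {t(y) - mu} f_0(y, theta)^beta of the influence function
    return (y - mu) * phi(y, mu) ** beta

ys = [1.0, 10.0, 100.0, 1000.0]
print([if_term(y, 0.5) for y in ys])   # beta > 0: decays to 0 as |y| grows
print([if_term(y, 0.0) for y in ys])   # beta = 0 (MLE): grows without bound
```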

6. Duality of Maximum Entropy and Minimum Divergence

In this section, we discuss a dualistic interplay between statistical model and estimation. In the statistical literature, maximum likelihood estimation has a special position among estimation methods in the sense of efficiency, invariance and sufficiency, while various candidates for the statistical model have been explored in the presence of misspecification. For example, we frequently consider a Laplace distribution for estimating a Gaussian mean, which leads to the sample median as the maximum likelihood estimator for the mean of the Laplace distribution. In this sense, there is an imbalance in the employment of the model and the estimator. In principle, we can select arbitrarily different generator functions $U_0$ and $U_1$ so that the $U_1$-estimation gives consistency under the $U_0$-model. A natural question is: what happens if we consider the U-estimation under the U-model?
Let MU be a U-model defined by
M_U = \left\{ f_U(x,\theta) := u(\theta^\top t(x) - \kappa_U(\theta)) : \theta \in \Theta \right\},
where Θ = {θ ∈ ℝ^k : κU(θ) < ∞}. The U-loss function under the U-model for a given data set {X1, ···, Xn} is defined by
L_U(\theta) = -\frac{1}{n}\sum_{i=1}^n \xi(f_U(X_i,\theta)) + \int U(\xi(f_U(x,\theta)))\, d\Lambda(x),
which is reduced to
L_U(\theta) = -\theta^\top \bar t + \kappa_U(\theta) + b_U(\theta),
where \bar t = \frac{1}{n}\sum_{i=1}^n t(X_i) and
b_U(\theta) = \int U(\theta^\top t(x) - \kappa_U(\theta))\, d\Lambda(x).
The estimating equation is given by
\frac{\partial}{\partial\theta} L_U(\theta) = -\bar t + \frac{\partial}{\partial\theta}\kappa_U(\theta) + \frac{\partial}{\partial\theta} b_U(\theta),
which can be written as
\frac{\partial}{\partial\theta} L_U(\theta) = -\bar t + \mathbb{E}_{f(\cdot,\theta)}\{t(X)\}.
Hence, if we consider the U-estimator for the parameter η defined by the transformation ϕ(θ) = 𝔼f(·,θ){t(X)}, then the U-estimator η̂U is nothing but the sample mean t̄. Here we confirm that the transformation ϕ(θ) is one-to-one as follows. The Jacobian matrix of the transformation is given by
\frac{\partial}{\partial\theta}\phi(\theta) = \int u'(\theta^\top t(x) - \kappa_U(\theta)) \left\{t(x) - \tfrac{\partial}{\partial\theta}\kappa_U(\theta)\right\}\left\{t(x) - \tfrac{\partial}{\partial\theta}\kappa_U(\theta)\right\}^\top d\Lambda(x),
since the first identity for MU leads to
\frac{\partial}{\partial\theta}\int f_U(x,\theta)\, d\Lambda(x) = \int u'(\theta^\top t(x) - \kappa_U(\theta))\left\{t(x) - \tfrac{\partial}{\partial\theta}\kappa_U(\theta)\right\} d\Lambda(x) = 0.
Therefore, we conclude that the Jacobian matrix is symmetric and positive-definite, since u′(t) is a positive function by the convexity assumption on U, which implies that ϕ(θ) is one-to-one. Consequently, the estimator θ̂U for θ is given by ϕ−1(t̄). We summarize these results in the following theorem.
Theorem 5. Let MU be a U-model with a canonical statistic t(X) as defined in Equation (28). Then the U-estimator for the expectation parameter η of t(X) is always t̄, where \bar t = \frac{1}{n}\sum_{i=1}^n t(X_i).
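As a minimal numerical sketch of Theorem 5 (our illustration, not from the paper), take U(s) = exp(s), so that MU is the Bernoulli exponential family with t(x) = x and κU(θ) = log(1 + e^θ); minimizing the U-loss by plain gradient descent recovers η̂ = t̄:

```python
import math

# Bernoulli as an exponential family: f(x, th) = exp(th*x - kappa(th)),
# kappa(th) = log(1 + e^th); expectation parameter eta = sigmoid(th).
def grad(th, tbar):
    # derivative of the U-loss: -tbar + E_f[t(X)] = -tbar + sigmoid(th)
    return -tbar + 1.0 / (1.0 + math.exp(-th))

data = [1, 1, 0, 1, 0, 1, 1, 0]
tbar = sum(data) / len(data)           # 0.625

th = 0.0
for _ in range(5000):                  # plain gradient descent on L_U
    th -= 0.5 * grad(th, tbar)

eta_hat = 1.0 / (1.0 + math.exp(-th))  # back to the expectation parameter
assert abs(eta_hat - tbar) < 1e-8      # U-estimator of eta equals the sample mean
```

Here the minimizer θ̂ equals log(t̄/(1 − t̄)), i.e., ϕ−1(t̄), as the theorem asserts.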
Remark 2. We remark that the empirical Pythagorean theorem holds as in
L_U(\theta) = L_U(\hat\theta_U) + D_U(\hat\theta_U, \theta),
since we observe that
L_U(\theta) - L_U(\hat\theta_U) = (\hat\theta_U - \theta)^\top \bar t + \kappa_U(\theta) + b_U(\theta) - \kappa_U(\hat\theta_U) - b_U(\hat\theta_U),
which gives another proof that θ̂U is ϕ−1(t̄). The statistic t̄ is a sufficient statistic in the sense that the U-loss function LU(θ) is a function of t̄, as in Equation (29). Accordingly, the U-estimator under the U-model depends on the observations X1, ···, Xn only through t̄. This extends the fact that the MLE is a function of t̄ under the exponential model with canonical statistic t(X).
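The empirical Pythagorean relation can be verified directly in the exponential-family case U(s) = exp(s) with Bernoulli t(x) = x, where DU reduces to the Kullback-Leibler divergence; this numerical check is our illustration:

```python
import math

def kappa(th):                        # log-partition of the Bernoulli family
    return math.log(1.0 + math.exp(th))

def loss(th, tbar):                   # U-loss with U(s) = exp(s), up to a constant
    return -th * tbar + kappa(th)

def kl(p, q):                         # KL divergence between Bernoulli(p), Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

tbar = 0.625                          # sample mean of 0/1 observations
th_hat = math.log(tbar / (1 - tbar))  # the estimator phi^{-1}(tbar)

for th in (-1.0, 0.2, 2.0):
    p = 1.0 / (1.0 + math.exp(-th))   # expectation parameter of th
    lhs = loss(th, tbar) - loss(th_hat, tbar)
    rhs = kl(tbar, p)                 # D_U(th_hat, th) is the KL divergence here
    assert abs(lhs - rhs) < 1e-12     # empirical Pythagorean identity
```

The excess loss at any θ over the loss at θ̂U equals exactly the divergence from θ̂U to θ, as in the relation above.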
Let us look at the case of the β-power divergence. Under the β-power model given by
M_\beta = \left\{ f_\beta(x,\theta) := \{\kappa_\beta(\theta) + \beta\theta^\top t(x)\}^{1/\beta} : \theta \in \Theta \right\},
the β-loss function is written by
L_\beta(\theta) = -\beta\theta^\top \bar t + \kappa_\beta(\theta) + b_\beta(\theta),
where
b_\beta(\theta) = \frac{1}{\beta+1} \int \{\kappa_\beta(\theta) + \beta\theta^\top t(x)\}^{\frac{1+\beta}{\beta}}\, d\Lambda(x).
The β-power estimator for the expectation parameter of t(X) is exactly given by t̄.
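As a numerical sketch of this fact for a power-type generator (our illustration, under the assumption U(s) = (1 + βs)^{(1+β)/β}/(1 + β) on a hypothetical three-point sample space), minimizing the U-loss by grid search recovers a model expectation equal to t̄:

```python
BETA = 0.5

def u(s):
    """u = U', the inverse of xi, truncated at zero."""
    base = 1.0 + BETA * s
    return base ** (1.0 / BETA) if base > 0 else 0.0

def U(s):
    base = 1.0 + BETA * s
    return base ** ((1.0 + BETA) / BETA) / (1.0 + BETA) if base > 0 else 0.0

X = [0.0, 1.0, 2.0]   # three-point sample space, t(x) = x, counting measure

def kappa(theta):
    # normalizing function: solve sum_x u(theta*x - kappa) = 1 by bisection
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if sum(u(theta * x - mid) for x in X) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def u_loss(theta, tbar):
    # L_U(theta) = -theta*tbar + kappa_U(theta) + sum_x U(theta*x - kappa_U(theta))
    k = kappa(theta)
    return -theta * tbar + k + sum(U(theta * x - k) for x in X)

tbar = 0.9
thetas = [i / 500.0 for i in range(-1500, 1501)]   # crude grid on [-3, 3]
th_hat = min(thetas, key=lambda th: u_loss(th, tbar))
k = kappa(th_hat)
eta_hat = sum(x * u(th_hat * x - k) for x in X)    # model expectation of t
assert abs(eta_hat - tbar) < 1e-2
```

The estimating equation −t̄ + 𝔼f(·,θ){t(X)} = 0 is satisfied at the grid minimizer up to the grid resolution, illustrating Theorem 5 in the β-power case.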

7. Discussion

We have concentrated on elucidating the dual structure of the U-estimator under the U-model, a perspective that extends the relation of the maximum likelihood estimator to the exponential model with a functional degree of freedom. Thus, we explore a rich and practical class of duality structures; however, an unsolved problem remains when we directly treat the space of densities as a differentiable manifold, see [39] for an infinite-dimensional exponential family. The approach here is not a direct extension to an infinite-dimensional manifold, but a path geometry in the following sense: for any pair of elements, the geodesic curve connecting them is represented in an explicit form in the class of *Γ(U) connections in our context.
The U-divergence approach was the first attempt to introduce a dually flat structure different from the alpha-geometry. However, there are many related studies. For example, a nonparametric information geometry on the space of all functions, without constraints of positivity and normalization, is discussed in Zhang [40]. Amari [41] characterizes the (ρ, τ)-divergence with a decomposable dually flat structure, see also [42]. If ρ is the identity function and τ(s) = (d/ds)U(s), the (ρ, τ)-divergence is nothing but the U-divergence. In effect, we confine ourselves to discussing the U-divergence class for the sake of the direct estimability of the U-estimator.
The duality between maximum entropy and minimum divergence has also been explored via the minimax theorem for a zero-sum game between a decision maker and Nature. The pay-off function is taken to be the cross U-entropy, in which Nature tries to maximize the pay-off function under the mean-equality constraint, while the decision maker tries to minimize it. The equilibrium is given by the minimax solution, which is the maximum U-entropy distribution; see [43] for an extensive discussion and the relation with Bayesian robustness. The observation explored in this paper is closely related to this minimax argument; however, we focus on the duality between the statistical model and estimation, where the minimum U-divergence leads to a projection onto the U-model.
In principle, the U-estimator is applicable to any statistical model, since the U-loss function is written in terms of the sample, just as the log-likelihood function is. If the chosen model differs from the U-model, then the U-estimator behaves differently from the present situation. For example, we may consider an exponential model (U(s) = exp(s)) together with a β-estimator (U(s) = (1 + βs)^{(1+β)/β}/(1 + β)) to obtain robustness against outlying observations, cf. [19,20]. In such situations, the duality property is no longer valid, since the β-estimator for the parameter of the exponential model is not a function of the sufficient statistic defined in Theorem 5. Thus, we have to pay attention to aspects other than the duality structure in the presence of outliers or model misspecification. Furthermore, other types of divergence measures, including the projective power divergence, are recommended to achieve super robustness, cf. [21,44].
We presented a method of generalized maximum entropy based on the proposed entropy measure, as an extension of the classical maximum entropy method based on the Boltzmann-Gibbs-Shannon entropy. Practical applications of classical MaxEnt are actively pursued in ecology and computational linguistics, cf. [45,46]. A known difficulty is that MaxEnt is apt to over-fit data sets because it essentially employs the maximum likelihood estimator. The proposed method has great potential in these research fields to overcome this difficulty by selecting an appropriate generator function. A detailed discussion is beyond the scope of the present paper; we will address it in the near future with concrete objectives motivated by real data analysis.

Author Contributions

Atsumi Ohara and Shinto Eguchi contributed to differential geometric parts associated with minimum divergence, and Osamu Komori and Shinto Eguchi contributed to statistical discussion for the maximum entropy model and minimum divergence estimation.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix: Derivation for G(U), Γ(U) and *Γ(U)

We apply the general formula for the Riemannian metric and the pair of linear connections discussed in [29] to the U-divergence DU(f, g). The Riemannian metric is defined by
G^{(U)}_{ij}(\theta) = -\frac{\partial^2}{\partial\theta_i\,\partial\theta_{1j}} D_U(f_\theta, f_{\theta_1})\Big|_{\theta_1=\theta}.
Hence G^{(U)}_{ij}(θ) is expressed by Equation (10). Next, the pair of linear connections Γ(U) and *Γ(U) are defined by
\Gamma^{(U)}_{ij,k}(\theta) = -\frac{\partial^3}{\partial\theta_i\,\partial\theta_j\,\partial\theta_{1k}} D_U(f_\theta, f_{\theta_1})\Big|_{\theta_1=\theta}
and
{}^*\Gamma^{(U)}_{ij,k}(\theta) = -\frac{\partial^3}{\partial\theta_i\,\partial\theta_j\,\partial\theta_{1k}} D_U(f_{\theta_1}, f_\theta)\Big|_{\theta_1=\theta},
which yield Equations (11) and (12), respectively. This confirms the formulas for G(U), Γ(U) and *Γ(U).
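As a sanity check of the metric formula (our illustration, using the Kullback-Leibler divergence, i.e., U(s) = exp(s), on the Bernoulli family in its mean parameter p), the negative second mixed derivative recovers the Fisher information 1/(p(1 − p)):

```python
import math

def kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def metric(p, h=1e-4):
    # G(p) = -d^2/(dp dq) D(p, q) |_{q=p} via a central finite difference
    mixed = (kl(p + h, p + h) - kl(p + h, p - h)
             - kl(p - h, p + h) + kl(p - h, p - h)) / (4 * h * h)
    return -mixed

p = 0.3
fisher = 1.0 / (p * (1 - p))          # the known Fisher information
assert abs(metric(p) - fisher) < 1e-3
```

The same finite-difference recipe applied to the third derivatives yields the connection coefficients, mirroring the definitions above.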

Acknowledgments

We thank the anonymous referees for their useful comments and suggestions for our revision. Shinto Eguchi and Osamu Komori were supported by the Japan Science and Technology Agency (JST), Core Research for Evolutionary Science and Technology (CREST).

References

  1. Fisher, R.A. On an Absolute Criterion for Fitting Frequency Curves. Messenger Math 1912, 41, 155–160. [Google Scholar]
  2. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci 1922, 222, 309–368. [Google Scholar]
  3. Jaynes, E.T. Information Theory and Statistical Mechanics. In Statistical Physics; Ford, K., Ed.; Benjamin: New York, NY, USA, 1963. [Google Scholar]
  4. Barndorff-Nielsen, O. Information and Exponential Families in Statistical Theory; John Wiley: Chichester, UK, 1978. [Google Scholar]
  5. Amari, S. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics, 28; Springer: New York, NY, USA, 1985. [Google Scholar]
  6. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
  7. Eguchi, S. Information divergence geometry and the application to statistical machine learning. In Information Theory and Statistical Learning; Emmert-Streib, F., Dehmer, M., Eds.; Springer US: New York, NY, USA, 2008; pp. 309–332. [Google Scholar]
  8. Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys 1967, 7, 200–217. [Google Scholar]
  9. Barndorff-Nielsen, O.E.; Jupp, P.E. Statistics, yokes and symplectic geometry. Ann. Fac. Sci. Toulouse Math 1997, 3, 389–427. [Google Scholar]
  10. Scharf, L.L. Statistical Signal Processing; Addison-Wesley: Reading, MA, USA, 1991; Volume 98. [Google Scholar]
  11. Fëvotte, C.; Bertin, N.; Durrieu, J.-L. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Comput 2009, 21, 793–830. [Google Scholar]
  12. Cichocki, A.; Amari, S.I. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar]
  13. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar]
  14. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys 1988, 52, 479–487. [Google Scholar]
  15. Naudts, J. Generalized Thermostatistics; Springer: New York, NY, USA, 2011. [Google Scholar]
  16. Tsallis, C. Introduction to Nonextensive Statistical Mechanics; Springer: New York, NY, USA, 2009. [Google Scholar]
  17. Simpson, E.H. Measurement of diversity. Nature 1949, 163, 688. [Google Scholar]
  18. Hill, M.O. Diversity and evenness: a unifying notation and its consequences. Ecology 1973, 54, 427–432. [Google Scholar]
  19. Minami, M.; Eguchi, S. Robust blind source separation by beta divergence. Neural Comput 2002, 14, 1859–1886. [Google Scholar]
  20. Fujisawa, H.; Eguchi, S. Robust estimation in the normal mixture model. J. Stat. Plan. Inference 2006, 136, 3989–4011. [Google Scholar]
  21. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal 2008, 99, 2053–2081. [Google Scholar]
  22. Notsu, A.; Komori, O.; Eguchi, S. Spontaneous clustering via minimum gamma-divergence. Neural Comput 2014, 26, 421–448. [Google Scholar]
  23. Cichocki, A.; Cruces, S.; Amari, S. Generalized Alpha-Beta Divergences and Their Application to Robust Nonnegative Matrix Factorization. Entropy 2011, 13, 134–170. [Google Scholar]
  24. Eguchi, S.; Copas, J. A class of logistic-type discriminant functions. Biometrika 2002, 89, 1–22. [Google Scholar]
  25. Takenouchi, T.; Eguchi, S. Robustifying AdaBoost by adding the naive error rate. Neural Comput 2004, 16, 767–787. [Google Scholar]
  26. Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-Boost and Bregman divergence. Neural Comput 2004, 16, 1437–1481. [Google Scholar]
  27. Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo. Amer. Math. Soc 2006, 19, 197–216. [Google Scholar]
  28. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat 1983, 11, 793–803. [Google Scholar]
  29. Eguchi, S. Geometry of minimum contrast. Hiroshima Math. J 1992, 22, 631–647. [Google Scholar]
  30. Naudts, J. The q-exponential family in statistical Physics. Cent. Eur. J. Phys 2009, 7, 405–413. [Google Scholar]
  31. Naudts, J. Generalized exponential families and associated entropy functions. Entropy 2008, 10, 131–149. [Google Scholar]
  32. Ohara, A.; Wada, T. Information geometry of q-Gaussian densities and behaviors of solutions to related diffusion equations. J. Phys. A: Math. Theor 2010. [Google Scholar] [CrossRef]
  33. Suyari, H. Mathematical structures derived from the q-multinomial coefficient in Tsallis statistics. Phys. A: Stat. Mech. Appl 2006, 368, 63–82. [Google Scholar]
  34. Suyari, H.; Wada, T. Multiplicative duality, q-triplet and μ, ν, q-relation derived from the one-to-one correspondence between the (μ, ν)-multinomial coefficient and Tsallis entropy Sq. Phys. A: Stat. Mech. Appl 2008, 387, 71–83. [Google Scholar]
  35. Eguchi, S.; Kato, S. Entropy and divergence associated with power function and the statistical application. Entropy 2010, 12, 262–274. [Google Scholar]
  36. Eguchi, S.; Komori, O.; Kato, S. Projective Power Entropy and Maximum Tsallis Entropy Distributions. Entropy 2011, 13, 1746–1764. [Google Scholar]
  37. Ohara, A.; Eguchi, S. Geometry on positive definite matrices deformed by V-potentials and its submanifold structure. In Geometric Theory of Information; Nielsen, F., Ed.; Springer: New York, NY, USA, 2014; Chapter 2; pp. 31–55. [Google Scholar]
  38. Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by beta-divergence. Entropy 2013, 15, 4732–4747. [Google Scholar]
  39. Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat 1995, 23, 1543–1561. [Google Scholar]
  40. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on Statistical Manifolds. Entropy 2013, 15, 5384–5418. [Google Scholar]
  41. Amari, S.-I. Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure. Entropy 2014, 16, 2131–2145. [Google Scholar]
  42. Harsha, K.V.; Subrahamanian, M.K.S. F-Geometry and Amari's α-Geometry on a Statistical Manifold. Entropy 2014, 16, 2472–2487. [Google Scholar]
  43. Grunwald, P.D.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Stat 2004, 32, 1367–1433. [Google Scholar]
  44. Chen, P.-W.; Hung, H.; Komori, O.; Huang, S.-Y.; Eguchi, S. Robust independent component analysis via minimum gamma-divergence estimation. IEEE J. Sel. Top. Signal Process 2013, 7, 614–624. [Google Scholar]
  45. Phillips, S.J.; Dudik, M. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 2008, 31, 161–175. [Google Scholar]
  46. Berger, A.L.; Pietra, V.J.D.; Pietra, S.A.D. A maximum entropy approach to natural language processing. Comput. Linguist 1996, 22, 39–71. [Google Scholar]
