Projective Power Entropy and Maximum Tsallis Entropy Distributions

Eguchi, Shinto; Komori, Osamu; Kato, Shogo

doi:10.3390/e13101746

Open AccessArticle

Projective Power Entropy and Maximum Tsallis Entropy Distributions

by

Shinto Eguchi

^*,

Osamu Komori

and

Shogo Kato

The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan

^*

Author to whom correspondence should be addressed.

Entropy 2011, 13(10), 1746-1764; https://doi.org/10.3390/e13101746

Submission received: 26 July 2011 / Revised: 20 September 2011 / Accepted: 20 September 2011 / Published: 26 September 2011

(This article belongs to the Special Issue Tsallis Entropy)

Download

Browse Figures

Versions Notes

Abstract

:

We discuss a one-parameter family of generalized cross entropy between two distributions with the power index, called the projective power entropy. The cross entropy is essentially reduced to the Tsallis entropy if two distributions are taken to be equal. Statistical and probabilistic properties associated with the projective power entropy are extensively investigated including a characterization problem of which conditions uniquely determine the projective power entropy up to the power index. A close relation of the entropy with the Lebesgue space

L_{p}

and the dual

L_{q}

is explored, in which the escort distribution associates with an interesting property. When we consider maximum Tsallis entropy distributions under the constraints of the mean vector and variance matrix, the model becomes a multivariate q-Gaussian model with elliptical contours, including a Gaussian and t-distribution model. We discuss the statistical estimation by minimization of the empirical loss associated with the projective power entropy. It is shown that the minimum loss estimator for the mean vector and variance matrix under the maximum entropy model are the sample mean vector and the sample variance matrix. The escort distribution of the maximum entropy distribution plays the key role for the derivation.

Keywords:

elliptical contoured distribution; escort distribution; L_p space; maximum entropy distribution; statistical distribution; Tsallis entropy

1. Introduction

In the classical statistical physics and the information theory the close relation with Boltzmann-Shannon entropy has been well established to offer elementary and clear understandings. The Kullback-Leibler divergence is directly connected with maximum likelihood, which is one of the most basic ideas in statistics. Tsallis opened new perspectives for the power entropy to elucidate non-equilibrium states in statistical physics, and these give the strong influence on the research for non-extensive and chaotic phenomenon, cf. [1,2]. There are proposed several generalized versions of entropy and divergence, cf. [3,4,5,6,7]. We consider generalized entropy and divergence defined on the space of density functions with finite mass,

\begin{matrix} F = \{f : \int f (x) d x < \infty, f (x) \geq 0 for almost everywhere x\} \end{matrix}

in a framework of information geometry originated by Amari, cf. [8,9].

A functional

D : F \times F \mapsto [0, \infty)

is called a divergence if

D (g, f) \geq 0

with equality if and only if

g = f

. It is shown in [10,11] that any divergence associates with a Riemannian metric and a pair of conjugate connections in a manifold modeled in

F

under mild conditions.

We begin with the original form of power cross entropy [12] with the index β of

R

defined by

\begin{matrix} C_{β}^{(o)} (g, f) = - \frac{1}{β} \int g (x) {f {(x)}^{β} - 1} d x + \frac{1}{1 + β} \int f {(x)}^{1 + β} d x \end{matrix}

for all g and f in

F

, and so the power (diagonal) entropy

\begin{matrix} H_{β}^{(o)} (f) = C_{β}^{(o)} (f, f) = - \frac{1}{β (β + 1)} \int f {(x)}^{1 + β} d x + \frac{1}{β} \int f (x) d x \end{matrix}

See [13,14] for the information geometry and statistical applications for the independent component analysis and pattern recognition. Note that this is defined in the continuous case for probability density functions, but can be reduced to a discrete case, see Tsallis [2] for the extensive discussion on statistical physics. In fact, the Tsallis entropy

\begin{matrix} S_{q} (f) = \frac{1}{q - 1} \{1 - \int f {(x)}^{q} d x\} \end{matrix}

for a probability density function

f (x)

is proportional to the power entropy to a constant with

q H_{β}^{(o)} (g) - 1

, where

q = 1 + β

. The power divergence is given by

\begin{matrix} D_{β}^{(o)} (g, f) = C_{β}^{(o)} (g, f) - H_{β}^{(o)} (g) \end{matrix}

as, in general, defined by the difference of the cross entropy and the diagonal entropy.

In this paper we focus on the projective power cross entropy defined by

\begin{matrix} C_{γ} (g, f) = - \frac{1}{γ (1 + γ)} \frac{\int g (x) f {(x)}^{γ} d x}{{\{\int f {(x)}^{1 + γ} d x\}}^{\frac{γ}{1 + γ}}} \end{matrix}

(1)

and so the projective power entropy is

\begin{matrix} H_{γ} (f) = - \frac{1}{γ (1 + γ)} {\{\int f {(x)}^{1 + γ} d x\}}^{\frac{1}{1 + γ}} \end{matrix}

(2)

The log expression for

C_{γ} (g, f)

is defined by

\begin{matrix} C_{γ}^{\log} (g, f) = - \frac{1}{γ} log {- γ (1 + γ) C_{γ} (g, f)} \end{matrix}

See [15,16] for the derivation of

C_{γ}^{\log}

, and detailed discussion on the relation between

C_{β}^{(o)} (g, f)

and

C_{γ} (g, f)

. The projective power cross entropy

C_{γ} (g, f)

satisfies the linearity with respect to g and the projective invariance, that is

C_{γ} (g, λ f) = C_{γ} (g, f)

for any constant

λ > 0

. Note that

H_{γ} (f)

has a one-to-one correspondence with

S_{q} (f)

as given by

\begin{matrix} H_{γ} (f) = - \frac{1}{q (q - 1)} {1 - (q - 1) S_{q} (f)}^{\frac{1}{q}} \end{matrix}

where

q = 1 + γ

. The projective power divergence is

\begin{matrix} D_{γ} (g, f) = C_{γ} (g, f) - H_{γ} (g) \end{matrix}

(3)

which will be discussed on a close relation with the H

\ddot{o}

lder’s inequality. The divergence defined by

C_{γ} (g, f)

satisfies

\begin{matrix} D_{γ}^{\log} (g, f) = C_{γ}^{\log} (g, f) - C_{γ}^{\log} (g, g) \geq 0 \end{matrix}

for all γ of

R

if there exist integrals in

D_{γ}^{\log} (g, f)

. The nonnegativity leads to

\begin{matrix} D_{γ} (g, f) \geq 0 \end{matrix}

(4)

We remark that the existence range of the power index γ for

C_{γ} (g, f)

and

H_{γ} (f)

depends on the sample space on which f and g are defined. If the sample space is compact, both

C_{γ} (g, f)

and

H_{γ}

are well-defined for all

γ \in R

. If the sample space is not compact,

C_{γ} (g, f)

is defined for

γ \geq 0

and

H_{γ} (f)

is for

γ > - 1

. More precisely we will explore the case that the sample space is

R^{d}

in a subsequent discussion together with moment conditions. Typically we observe that

\begin{matrix} lim_{γ \to 0} D_{γ} (g, f) = D_{0} (g, f) \end{matrix}

(5)

where

D_{0} (g, f)

denotes the Kullback-Leibler divergence,

\begin{matrix} D_{0} (g, f) = \int g (x) log \frac{g (x)}{f (x)} d x \end{matrix}

(6)

See Appendix 1 for the derivation of (5).

Let

{x_{1}, \dots, x_{n}}

be a random sample from a distribution with the probability density function

g (x)

. A statistical model

{f (x, θ) : θ \in Θ}

with parameter θ is assumed to sufficiently approximate the underlying density function

g (x)

, where Θ is a parameter space. Then the loss function associated with the projective power entropy

C_{γ} (g, f (\cdot, θ))

based on the sample is given by

\begin{matrix} L_{γ} (θ) = - \frac{1}{γ (1 + γ)} \frac{1}{n} \sum_{i = 1}^{n} k_{γ} (θ) f {(x_{i}, θ)}^{γ} \end{matrix}

in which we call

\begin{matrix} {\hat{θ}}_{γ} \equiv \underset{θ \in Θ}{argmin} L_{γ} (θ) \end{matrix}

(7)

the γ-estimator, where

\begin{matrix} k_{γ} (θ) = {\{\int f {(x, θ)}^{1 + γ} d x\}}^{- \frac{γ}{1 + γ}} \end{matrix}

We note that

\begin{matrix} E_{g} {L_{γ} (θ)} = C_{γ} (g, f (\cdot, θ)) \end{matrix}

where

E_{g}

denotes the statistical expectation with respect to g. It is observed that the 0-estimator is nothing but the maximum likelihood estimator (MLE) since the loss

L_{γ} (θ)

converges to the minus log-likelihood function,

\begin{matrix} L_{0} (θ) = - \frac{1}{n} \sum_{i = 1}^{n} log f (x_{i}, θ) \end{matrix}

in the sense that

\begin{matrix} lim_{γ \to 0} \{L_{γ} (θ) + \frac{1}{γ (1 + γ)}\} = L_{0} (θ) \end{matrix}

If the underlying density function

g (x)

belongs to a Gaussian model with mean μ and variance

σ^{2}

, then the MLEs for μ and

σ^{2}

are the sample mean and sample variance. The reverse statement is shown in [17,18]. We will extend this theory to a case of the γ-estimator under γ-model.

In Section 2 we discuss characterization of the projective power entropy. In Section 3 the maximum entropy distribution with the Tsallis entropy

S_{q}

with

q = 1 + γ

under the constraints of mean vector μ and variance matrix Σ is considered. We discuss the model of maximum entropy distributions, called the γ-model, in which 0-model and 2-model equal Gaussian and Wigner models, respectively. Then we show that the γ-estimators for μ and Σ under the γ-model are the sample mean and sample variance. Section 4 gives concluding remarks and further comments.

2. Projective Invariance

Let us look at a close relation of

F

with Lebesgue’s space

\begin{matrix} L_{p} = \{{f (x) : \int | f (x) |}^{p} d x < \infty\} \end{matrix}

where

p \geq 1

and the

L_{p}

-norm

{∥ ∥}_{p}

is defined by

\begin{matrix} {∥ f ∥}_{p} = {\{{\int | f (x) |}^{p} d x\}}^{\frac{1}{p}} \end{matrix}

Let q be the conjugate index of p satisfying

1 / p + 1 / q = 1

, in which p and q can be expressed as functions of the parameter

γ > 0

such that

p = 1 + γ^{- 1}

and

q = 1 + γ

. We note that this q is equal to the index q in Tsallis entropy

S_{q}

in the relation

q = 1 + γ

. For any probability density function

f (x)

we define the escort distribution with the probability density function,

\begin{matrix} e_{q} (f (x)) = \frac{f {(x)}^{q}}{\int f {(y)}^{q} d y} \end{matrix}

cf. [2] for extensive discussion. We discuss an interesting relation of the projective cross entropy (1) with the escort distribution. By the definition of the escort distribution,

\begin{matrix} C_{γ} (g, f) = - \frac{1}{γ (1 + γ)} \int {e_{q} (f (x))}^{\frac{1}{p}} g (x) d x \end{matrix}

(8)

We note that

e_{q} {(f)}^{\frac{1}{p}}

is in the unit sphere of

L_{p}

in the representation. The projective power diagonal entropy (2) is proportional to the

L_{q}

-norm, that is,

\begin{matrix} H_{γ} (f) = - \frac{1}{γ (1 + γ)} {∥ f ∥}_{q} \end{matrix}

from which the H

\ddot{o}

lder’s inequality

\begin{matrix} \int g (x) f {(x)}^{γ} d x \leq {∥ g ∥}_{q} {∥ f^{γ} ∥}_{p} \end{matrix}

(9)

claims that

C_{γ} (g, f) \geq H_{γ} (g)

, or equivalently

\begin{matrix} D_{γ} (g, f) \geq 0 \end{matrix}

(10)

for all f and g in

F

, which is also led by

C_{γ}^{(o)} (g, f) \geq H_{γ}^{(o)} (g)

. The equality in (10) holds if and only if

f (x) = λ g (x)

for almost everywhere x, where λ is a positive constant. The power transform suggests an interplay between the space

L_{p}

and

L_{q}

by the relation,

\begin{matrix} ∥ f^{γ} ∥_{p} = {∥ f ∥}_{q}^{γ} \end{matrix}

Taking the limit of γ to 0 in the H

\ddot{o}

lder’s inequality (9) yields that

\begin{matrix} \int g (x) log f (x) d x \leq \int g (x) log g (x) d x \end{matrix}

since

\begin{matrix} lim_{γ \to 0} \int g (x) \frac{f {(x)}^{γ} - 1}{γ} d x = \int g (x) log f (x) d x \end{matrix}

and

\begin{matrix} lim_{γ \to 0} \frac{∥ f^{γ} ∥_{p} {∥ g ∥}_{q} - 1}{γ} = \int g (x) log g (x) d x \end{matrix}

(11)

This limit regarding p associates with another space rather than the

L_{\infty}

space, which is nothing but the space of all density functions with finite Boltzmann-Shannon entropy, say

L_{\log}

. The power index γ reparameterizes the Lebesgue space

L_{p}

and the dual space

L_{q}

with the relation

p = 1 + γ^{- 1}

, however, to take the power transform

f {(x)}^{γ}

is totally different from the ordinary discussion of the Lebesgue space, so that the duality converges to

(L_{\log}, L_{1})

as observed in (11). In information geometry the pair

(L_{\log}, L_{1})

corresponds to that of mixture and exponential connections, cf. [9]. See also another one-parameterization of

L_{p}

space [19].

We now discuss a problem of the uniqueness for

C_{γ} (g, f)

as given in the following theorem. A general discussion on the characterization is given in [16], however, the derivation is rather complicated. Here we assume a key condition that a cross entropy

Γ (g, f)

is linear in g to give an elementary proof. The Riesz representation theorem suggests

\begin{matrix} Γ (g, f) = c (f) \int g (x) ψ (f (x)) d x \end{matrix}

where

c (f)

is a constant that depends on f. Thus we observe the following theorem when we make a specific form for

c (f)

to guarantee the scale invariance.

Theorem 1.. Define a functional

Γ : F \times F \mapsto R

by

\begin{matrix} Γ (g, f) = φ (\int ρ (f (x)) d x) \int g (x) ψ (f (x)) d x \end{matrix}

(12)

where φ, ρ and ψ are differentiable and monotonic functions. Assume that

(i).

Γ (g, g) = {min}_{f \in F} Γ (g, f)

for all

g \in F

,

and that

(ii).

Γ (g, f) = Γ (g, λ f)

for all

λ > 0

and all

g, f \in F .

Then there exists γ such that

Γ (g, f) = C_{γ} (g, f)

up to a constant factor, where

C_{γ} (g, f)

is the projective power cross entropy defined by (1).

Proof.. The requirement (ii) means that

\begin{matrix} \frac{\partial}{\partial λ} \{φ (\int ρ (λ f (x)) d x) \int ψ (λ f (x)) g (x) d x\} = 0 \end{matrix}

which implies that, if f is absolutely continuous and g is the Dirac measure at

x_{0}

, then

\begin{matrix} \frac{\dot{ψ} (λ f (x_{0}))}{ψ (λ f (x_{0}))} λ f (x_{0}) = c (λ) \end{matrix}

where

\begin{matrix} c (λ) = - \frac{λ \dot{φ} (\int ρ (λ f (x)) d x) \int \dot{ρ} (λ f (x)) f (x) d x}{φ (\int ρ (λ f (x)) d x)} \end{matrix}

Since we can take an arbitrary value

f (x_{0})

for any fixed λ,

\begin{matrix} \frac{\dot{ψ} (t)}{ψ (t)} = c (λ) t^{- 1} \end{matrix}

which is uniquely solved as

ψ (t) = t^{γ}

where

γ = c (λ)

. Next let us consider a case of a finite discrete space,

{x_{i} : 1 \leq i \leq m}

. Then, since

ψ (f) = f^{γ}

, we can write

\begin{matrix} Γ (g, f) = φ (\sum_{i = 1}^{m} ρ (f_{i})) \sum_{i = 1}^{m} g_{i} f_{i}^{γ} \end{matrix}

where

f_{i} = f (x_{i})

and

g_{i} = g (x_{i})

. The requirement (i) leads that

(\partial / \partial f_{j}) Γ (g, f) |_{f = g} = 0

for all

j, 1 \leq j \leq m

, which implies that

\begin{matrix} \dot{ρ} (g_{j}) = - γ c (g_{1}, \dots, g_{m}) g_{j}^{γ} \end{matrix}

(13)

where

\begin{matrix} c (g_{1}, \dots, g_{m}) = \frac{φ (\sum_{i = 1}^{m} ρ (g_{i}))}{\sum_{i = 1}^{m} g_{i}^{1 + γ} \dot{φ} (\sum_{i = 1}^{m} ρ (g_{i}))} \end{matrix}

(14)

It follows from (13) that

c (g_{1}, \dots, g_{m})

must be a constant in

g_{1}, \dots, g_{m}

, say C, so that we solve (13) as

ρ (g_{j}) = - γ C g_{j}^{1 + γ} / (1 + γ)

. Therefore, Equation (14) is written by

\begin{matrix} \frac{\dot{φ} (t)}{φ (t)} = - \frac{γ}{1 + γ} t^{- 1} \end{matrix}

which leads to

φ (t) = t^{- \frac{γ}{1 + γ}}

. We conclude that

Γ (g, f) \propto C_{γ} (g, f)

, which completes the proof.☐

Remark 1.. The proof above is essentially applicable for the case that the integral (11) is given by the summation just for a binary distributions. In this sense the statement of Theorem 1 is not tight, however, statistical inference is discussed in a unified manner such that the distribution is either continuous or discrete. In a subsequent discussion we focus on the case for continuous distributions defined on

R^{d}

.

Remark 2. We see the multiplicative decomposition for

C_{γ} (g, f)

for statistical independence. In fact, if f and g are decomposed as

f = f_{1} \otimes f_{2}, g = g_{1} \otimes g_{2}

in the same partition, then

\begin{matrix} C_{γ} (g, f) = C_{γ} (g_{1}, f_{1}) C_{γ} (g_{2}, f_{2}) \end{matrix}

This property is also elemental, but we do not assume this decomposability as the requirement in Theorem 1.

3. Model of Maximum Entropy Distributions

We will elucidate a dualistic structure between the maximum entropy model on

H_{γ}

, defined in (2) and the minimum cross entropy estimator on

C_{γ}

, defined in (1). Before the discussion we overview the classical case in which the maximum likelihood estimation nicely makes good performance under the maximum entropy model on the Boltzmann-Shannon entropy, that is, a Gaussian model if we consider the mean and variance constraint. We will use conventional notations that X denotes random variable with value x. Let

{x_{1}, \dots, x_{n}}

be a random sample from a Gaussian distribution with the density function

\begin{matrix} f_{0} (x, μ, Σ) = det {(2 π Σ)}^{- \frac{1}{2}} exp {- \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ)} \end{matrix}

The Gaussian density function is written by a canonical form

\begin{matrix} {(2 π)}^{- \frac{d}{2}} exp \{- \frac{1}{2} {(x - μ)}^{T} Ξ (x - μ) + \frac{1}{2} log det (Ξ)\} \end{matrix}

(15)

where Ξ is called the canonical parameter defined by

Σ^{- 1}

. The differentiation of (15) on μ and Ξ yields

\begin{matrix} E_{f_{0} (\cdot, μ, Σ)} (X) = μ and V_{f_{0} (\cdot, μ, Σ)} (X) = Σ \end{matrix}

where

E_{f}

and

V_{f}

denote the expectation vector and variance matrix with respect to a probability density function

f (x)

, respectively.

The maximum likelihood estimator is given by

\begin{matrix} ({\hat{μ}}_{0}, {\hat{Σ}}_{0}) = (\bar{x}, S) \end{matrix}

(16)

where

\bar{x}

and S are the sample mean vector and the sample variance matrix,

\begin{matrix} \bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}, S = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - \bar{x}) {(x_{i} - \bar{x})}^{T} \end{matrix}

(17)

This is because the minus log-likelihood function is

\begin{matrix} L_{0} (μ, Σ) = - \frac{1}{n} \sum_{i = 1}^{n} log f_{0} (x_{i}, μ, Σ) \end{matrix}

which is written by

\begin{matrix} \frac{1}{2} trace (S (μ) Ξ) - \frac{1}{2} log det (Ξ) \end{matrix}

apart from a constant, where

\begin{matrix} S (μ) = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - μ) {(x_{i} - μ)}^{T} \end{matrix}

(18)

Hence the estimating system is

\begin{matrix} [\begin{matrix} \frac{\partial}{\partial μ} L_{0} (μ, Σ) \\ \frac{\partial}{\partial Ξ} L_{0} (μ, Σ) \end{matrix}] = [\begin{matrix} Ξ (\bar{x} - μ) \\ \frac{1}{2} {S (μ) - Σ} \end{matrix}] = [\begin{matrix} 0 \\ O \end{matrix}] \end{matrix}

which concludes the Expression (16) of the MLE since

S (μ) = S + (\bar{x} - μ) {(\bar{x} - μ)}^{T}

. Alternatively, we have another route to show (16) as follows. The Kullback-Leibler divergence defined in (6) is given by

\begin{matrix} D_{0} (f_{0} (\cdot, μ, Σ), f_{0} (\cdot, μ_{1}, Σ_{1})) \end{matrix}

\begin{matrix} = & \frac{1}{2} {(μ - μ_{1})}^{T} Σ_{1}^{- 1} (μ - μ_{1}) + \frac{1}{2} trace {(Σ - Σ_{1}) Σ_{1}^{- 1}} - \frac{1}{2} log det (Σ Σ_{1}^{- 1}) \end{matrix}

Thus, we observe that

\begin{matrix} L_{0} (μ, Σ) - L_{0} (\bar{x}, S) = D_{0} (f_{0} (\cdot, \bar{x}, S), f_{0} (\cdot, μ, Σ)) \end{matrix}

(19)

which is nonnegative with equality if and only if

(μ, Σ) = (\bar{x}, S)

. This implies (16).

Under mild regularity conditions the reverse statement holds, that is, the MLE for a location and scatter model satisfies (16) if and only if the model is Gaussian, cf. [17,18]. However, even if we do not assume anything for the underling distribution

g (x)

, the statistics

\bar{x}

and S are asymptotically consistent for

\begin{matrix} μ_{g} = E_{g} (X) and Σ_{g} = V_{g} (X) \end{matrix}

This is a direct result from the strong law of large numbers, and the central limit theorem leads to the asymptotic normality for these two statistics. In this sense,

(\bar{x}, S)

is also a nonparametric estimator for

(μ_{g}, Σ_{g})

.

We explore a close relation of the statistical model and the estimation method. We consider a maximum entropy distribution with the γ-entropy

H_{γ}

over the space of d-dimensional distributions with a common mean and variance,

\begin{matrix} F_{(μ, Σ)} = {f \in F : \int f (x) d x = 1, E_{f} (x) = μ, V_{f} (x) = Σ} \end{matrix}

(20)

Then we define a distribution with a probability density function written by

\begin{matrix} f_{γ} (x, μ, Σ) = \frac{c_{γ}}{det {(2 π Σ)}^{\frac{1}{2}}} {\{1 - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} Σ^{- 1} (x - μ)\}}_{+}^{\frac{1}{γ}} \end{matrix}

(21)

where

{()}_{+}

denotes a positive part and

c_{γ}

is the normalizing factor,

\begin{matrix} c_{γ} = \{\begin{matrix} {(\frac{2 γ}{2 + d γ + 2 γ})}^{\frac{d}{2}} \frac{Γ (1 + \frac{d}{2} + \frac{1}{γ})}{Γ (1 + \frac{1}{γ})} & if γ > 0 \\ {(- \frac{2 γ}{2 + d γ + 2 γ})}^{\frac{d}{2}} \frac{Γ (- \frac{1}{γ})}{Γ (- \frac{1}{γ} - \frac{d}{2})} & if - \frac{2}{d + 2} < γ < 0 \end{matrix} \end{matrix}

(22)

See the derivation for

c_{γ}

in Appendix 2. If the dimension d equals 1, then

f_{γ} (x, μ, Σ)

is a q-Gaussian distribution with

q = γ + 1

. We remark that

\begin{matrix} lim_{γ ↑ 0} c_{γ} = lim_{γ ↓ 0} c_{γ} = 1 \end{matrix}

in which

f_{γ} (x, μ, Σ)

is reduced to a d-variate Gaussian density when

γ = 0

. The support of

f_{γ} (\cdot, μ, Σ)

becomes an ellipsoid defined as

\begin{matrix} \{x \in R^{d} : {(x - μ)}^{T} Σ^{- 1} (x - μ) \leq \frac{2 + d γ + 2 γ}{γ}\} \end{matrix}

if

γ > 0

. On the other hand, if

- \frac{2}{d + 2} < γ < 0

, the density function (21) is written as

\begin{matrix} f_{γ} (x, μ, Σ) = det {(π τ Σ)}^{- \frac{1}{2}} \frac{Γ (- \frac{1}{γ})}{Γ (- \frac{1}{γ} - \frac{d}{2})} {\{1 + \frac{1}{τ} {(x - μ)}^{T} Σ^{- 1} (x - μ)\}}^{\frac{1}{γ}} \end{matrix}

(23)

where

\begin{matrix} τ = - \frac{2 + (d + 2) γ}{γ} \end{matrix}

The d-variate t-distribution is defined by

\begin{matrix} g_{ν} (x, μ, P) = det {(π ν P)}^{- \frac{1}{2}} \frac{Γ (\frac{ν + d}{2})}{Γ (\frac{ν}{2})} {\{1 + \frac{1}{ν} {(x - μ)}^{T} P^{- 1} (x - μ)\}}^{- \frac{ν + d}{2}} \end{matrix}

(24)

cf. [20] for the extensive discussion. Assume that

\begin{matrix} \frac{ν + d}{2} = - \frac{1}{γ} and ν P = τ Σ \end{matrix}

Then we observe from (23) and (24) that

\begin{matrix} f_{γ} (x, μ, Σ) = g_{ν} (x, μ, P) \end{matrix}

Accordingly, the density function

f_{γ} (x, μ, Σ)

with

- \frac{2}{d + 2} < γ < 0

is a t-distribution. The distribution has elliptical contours on the Euclidean space

R^{d}

for any

γ > - \frac{2}{d + 2}

, as shown in Figure 1 for typical cases of γ.

Figure 1. t-distribution

(γ = - 0.4)

, Gaussian

(γ = 0)

and Wigner

(γ = 2)

distributions.

Figure 1. t-distribution

(γ = - 0.4)

, Gaussian

(γ = 0)

and Wigner

(γ = 2)

distributions.

Let

\begin{matrix} M_{γ} = \{f_{γ} (x, μ, Σ) : μ \in R^{d}, Σ \in S_{d}\} \end{matrix}

(25)

which we call γ-model, where

S_{d}

denotes the space of all symmetric, positive-definite matrices of order d. We confirm the mean and variance of the γ-model as follows.

Lemma. Under the model

M_{γ}

defined in (25) with the index

γ > - \frac{2}{d + 2}

,

\begin{matrix} E_{f_{γ} (\cdot, μ, Σ)} (X) = μ and V_{f_{γ} (\cdot, μ, Σ)} (X) = Σ \end{matrix}

Proof. We need to consider a family of escort distributions. In the model

M_{γ}

we can define the escort distribution as

\begin{matrix} e_{q} (f_{γ} (x, μ, Σ)) = \frac{c_{γ}^{*}}{det {(Σ)}^{\frac{1}{2}}} {\{1 - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} Σ^{- 1} (x - μ)\}}_{+}^{\frac{1 + γ}{γ}} \end{matrix}

(26)

where

q = 1 + γ

and

c_{γ}^{*}

is the normalizing factor. Hence,

\begin{matrix} e_{q} (f_{γ} (x, μ, Σ)) = c_{γ}^{*} {\{det {(Σ)}^{- \frac{1}{2} \frac{γ}{1 + γ}} - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} {det {(Σ)}^{- \frac{1}{2} \frac{γ}{1 + γ}} Σ^{- 1}} (x - μ)\}}_{+}^{\frac{1 + γ}{γ}} \end{matrix}

(27)

Here we define alternative parameter

Ξ_{γ}

to the original parameter Σ by the transform

\begin{matrix} Ξ_{γ} = det {(Σ)}^{- \frac{1}{2} \frac{γ}{1 + γ}} Σ^{- 1} \end{matrix}

(28)

and so that the inverse transform is given by

\begin{matrix} Σ = det {(Ξ_{γ})}^{\frac{γ}{d γ + 2 γ + 2}} Ξ_{γ}^{- 1} \end{matrix}

(29)

noting that

det (Ξ_{γ}) = det {(Σ)}^{- \frac{1}{2} \frac{d γ + 2 γ + 2}{1 + γ}}

. Thus, we get a canonical form of (26) as

\begin{matrix} e_{q} (f_{γ} (x, μ, Σ)) = c_{γ}^{*} {\{det {(Ξ_{γ})}^{\frac{γ}{2 + d γ + 2 γ}} - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} Ξ_{γ} (x - μ)\}}_{+}^{\frac{1 + γ}{γ}} \end{matrix}

(30)

By analogy of the discussion for an exponential family we have the following expression for the braced term in (30) as

\begin{matrix} - \frac{2 γ}{2 + d γ + 2 γ} \{\frac{1}{2} trace (x x^{T} Ξ_{γ}) - μ^{T} Ξ_{γ} x + \frac{1}{2} μ^{T} Ξ_{γ} μ - \frac{2 + d γ + 2 γ}{2 γ} det {(Ξ_{γ})}^{\frac{γ}{d γ + 2 γ + 2}}\} \end{matrix}

(31)

A property of the escort distribution suggests moment formulae for the distribution (25) as follows: We have an identity

\begin{matrix} \frac{\partial}{\partial μ} \int c_{γ}^{*} {\{det {(Ξ_{γ})}^{\frac{γ}{d γ + 2 γ + 2}} - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} Ξ_{γ} (x - μ)\}}_{+}^{\frac{1 + γ}{γ}} d x = 0 \end{matrix}

which implies that

\begin{matrix} \int {\{det {(Ξ_{γ})}^{\frac{γ}{d γ + 2 γ + 2}} - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} Ξ_{γ} (x - μ)\}}_{+}^{\frac{1}{γ}} Ξ_{γ} (x - μ) d x = 0 \end{matrix}

which concludes that

\begin{matrix} E_{f_{γ} (\cdot, μ, Σ)} (X) = μ \end{matrix}

Similarly,

\begin{matrix} \frac{\partial}{\partial Ξ_{γ}} \int c_{γ}^{*} {\{det {(Ξ_{γ})}^{\frac{γ}{d γ + 2 γ + 2}} - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} Ξ_{γ} (x - μ)\}}_{+}^{\frac{1 + γ}{γ}} d x = 0 \end{matrix}

which is

\begin{matrix} \int c_{γ}^{*} {\{det {(Ξ_{γ})}^{\frac{γ}{d γ + 2 γ + 2}} - \frac{γ}{2 + d γ + 2 γ} {(x - μ)}^{T} Ξ_{γ} (x - μ)\}}_{+}^{\frac{1}{γ}} \end{matrix}

\begin{matrix} \times & \{\frac{γ}{d γ + 2 γ + 2} det {(Ξ_{γ})}^{\frac{γ}{d γ + 2 γ + 2}} Ξ_{γ}^{- 1} - \frac{γ}{2 + d γ + 2 γ} (x - μ) {(x - μ)}^{T}\} d x = 0 \end{matrix}

(32)

which concludes that

\begin{matrix} V_{f_{γ} (\cdot, μ, Σ)} (X) = Σ \end{matrix}

because of the relation of

Ξ_{γ}

and Σ as observed in (29). The proof is complete. ☐

Remark 3. The canonical form (30) of the escort distribution (26) plays an important role on the proof of Lemma. Basically we can write the canonical form of (21), however it is not known any link to distributional properties like a case of exponential family.

Remark 4. In Equation (31) the function

\begin{matrix} φ (Ξ) = \frac{1}{2 ω} det {(Ξ)}^{ω} \end{matrix}

(33)

is viewed as a potential function in the Fenchel convex duality, where

\begin{matrix} ω = \frac{γ}{2 + d γ + 2 γ} \end{matrix}

cf. [21,22] for the covariance structure model.

From Lemma we observe that

f_{γ} (\cdot, μ, Σ) \in F (μ, Σ)

. Next we show that the distribution with density

f_{γ} (\cdot, μ, Σ)

maximizes the γ-entropy

H_{γ}

over the space

F (μ, Σ)

, where

H_{γ}

is defined in (2).

Theorem 2.

(i). If

- \frac{2}{d + 2} < γ \leq 0

, then

\begin{matrix} f_{γ} (\cdot, μ, Σ) = \underset{f \in F_{(μ, Σ)}}{argmax} H_{γ} (f) \end{matrix}

(34)

where

F_{(μ, Σ)}

is defined in (20).

(ii). If

γ > 0

, then

\begin{matrix} f_{γ} (\cdot, μ, Σ) = \underset{f \in {F_{(μ, Σ)}}^{(γ)}}{argmax} H_{γ} (f) \end{matrix}

(35)

where

\begin{matrix} {F_{(μ, Σ)}}^{(γ)} = {f \in F_{(μ, Σ)} : f (x) = 0 for almost everywhere x \in B (μ, Σ)} \end{matrix}

with

B (μ, Σ)

being

{x \in R^{d} : {(x - μ)}^{T} Σ^{- 1} (x - μ) > \frac{2 + d γ + 2 γ}{γ}}

.

Proof. By the definition of

F_{(μ, Σ)}

, we see from Lemma that

f_{γ} (\cdot, μ, Σ) \in F_{(μ, Σ)}

for any

γ \in (- \frac{2}{d + 2}, 0)

. This leads to

\begin{matrix} E_{f_{γ} (\cdot, μ, Σ)} {f_{γ} {(X, μ, Σ)}^{γ}} = E_{f} {f_{γ} {(X, μ, Σ)}^{γ}} \end{matrix}

for any f in

F_{(μ, Σ)}

, which implies that

\begin{matrix} H_{γ} (f_{γ} (\cdot, μ, Σ)) = C_{γ} (f, f_{γ} (\cdot, μ, Σ)) \end{matrix}

Hence

\begin{matrix} H_{γ} (f_{γ} (\cdot, μ, Σ)) - H_{γ} (f) = D_{γ} (f, f_{γ} (\cdot, μ, Σ)) \end{matrix}

(36)

which is nonnegative as discussed in (4). This concludes (34). Similarly, we observe that (36) holds for any

γ > 0

and any f in

{F_{(μ, Σ)}}^{(γ)}

since the support of f includes that of

f (\cdot, μ, Σ)

. This concludes (35). ☐

We would like to elucidate a similar structure for the statistical inference by the minimum projective cross entropy in which the data set

{x_{1}, \dots, x_{n}}

is assumed to follow the model

M_{γ}

. We recall (8) from the relation of the projective cross entropy with the escort distribution

\begin{matrix} C_{γ} (g, f) = - \frac{1}{γ (1 + γ)} \int e_{q} {(f (x))}^{\frac{γ}{1 + γ}} g (x) d x \end{matrix}

When we have got data

{x_{1}, \dots, x_{n}}

to be fitted to the model

M_{γ}

, the loss function is

\begin{matrix} L_{γ} (μ, Σ) = - \frac{1}{γ (1 + γ)} \frac{1}{n} \sum_{i = 1}^{n} e_{q} {(f_{γ} (x_{i}, μ, Σ))}^{\frac{γ}{1 + γ}} \end{matrix}

where

f_{γ} (x, μ, Σ)

defined in (21). The γ-estimator is defined by

\begin{matrix} ({\hat{μ}}_{γ}, {\hat{Σ}}_{γ}) = \underset{(μ, Σ)}{argmin} L_{γ} (μ, Σ) \end{matrix}

see the general definition (7). It follows from the canonical form defined in (30) with the canonical parameter

Ξ_{γ}

defined in (28) that

\begin{matrix} L_{γ} (μ, Σ) = - \frac{1}{γ (1 + γ)} {(c_{γ}^{*})}^{\frac{γ}{γ + 1}} [det {(Ξ_{γ})}^{ω} - ω {trace (Ξ_{γ} S) + {(μ - \bar{x})}^{T} Ξ_{γ} (μ - \bar{x})}] \end{matrix}

(37)

where

(\bar{x}, S)

and ω are defined in (17) and (33), and

c_{γ}^{*}

is the normalizing factor defined in (27). Here we note that if

γ > 0

, then the parameter

(μ, Σ)

must be assumed to be in

Θ_{n}

, where

\begin{matrix} Θ_{n} = {(μ, Σ) \in R^{d} \times S_{d} : {(x_{i} - μ)}^{T} Σ^{- 1} (x_{i} - μ) < ω^{- 1} (\forall i = 1, \dots, n)} \end{matrix}

(38)

We note that

L_{γ} (μ, Σ) = C_{γ} (f (\cdot, \bar{x}, S), f (\cdot, μ, Σ))

and

L_{γ} (\bar{x}, S) = H_{γ} (f (\cdot, \bar{x}, S))

since

\begin{matrix} E_{f (\cdot, \bar{x}, S)} (X) = \bar{x}, and V_{f (\cdot, \bar{x}, S)} (X) = S \end{matrix}

Accordingly, we observe the argument similar to (19) for the MLE. The projective divergence

D_{γ}

defined in (3) equals the difference of the γ-loss functions as

\begin{matrix} L_{γ} (μ, Σ) - L_{γ} (\bar{x}, S) = D_{γ} (f_{γ} (\cdot, \bar{x}, S), f_{γ} (\cdot, μ, Σ)), \end{matrix}

(39)

which is nonnegative with equality if and only if

(μ, Σ) = (\bar{x}, S)

. See the discussion after equation (10). In this way, we can summarize the above discussion as follows:

Theorem 3. Let

{x_{1}, \dots, x_{n}}

be a random sample from a γ-model defined in (21). Then the γ-estimator defined in (7) for

(μ, Σ)

is

(\bar{x}, S)

, where

(\bar{x}, S)

is defined in (17).

Proof. Let us give another proof. The estimating system is given by

\begin{matrix} [\begin{matrix} \frac{\partial}{\partial μ} L_{γ} (μ, Σ) \\ \frac{\partial}{\partial Ξ_{γ}} L_{γ} (μ, Σ) \end{matrix}] = [\begin{matrix} Ξ_{γ} (\bar{x} - μ) \\ ω {det {(Ξ_{γ})}^{ω} Ξ_{γ}^{- 1} - S (μ)} \end{matrix}] = [\begin{matrix} 0 \\ O \end{matrix}] \end{matrix}

(40)

which is equivalent to

\begin{matrix} [\begin{matrix} μ - \bar{x} \\ Σ - S (μ) \end{matrix}] = [\begin{matrix} 0 \\ O \end{matrix}] \end{matrix}

because of the relation of

Ξ_{γ}

into Σ as given in (29). Thus, we also attain the conclusion

({\hat{μ}}_{γ}, {\hat{Σ}}_{γ}) = (\bar{x}, S) .

In this way, we obtain the solution of the equation system defined by (40) via the parameter

Ξ_{γ}

using the relation of the escort distribution with the loss function (37). ☐

Remark 5. Consider the location model

{f_{γ} (\cdot, μ, Σ)}

with the location parameter μ, where Σ is known in Theorem 3. Then we easily see that the γ-estimator for μ is

\bar{x}

. What about the reverse statement? We observe that if the γ-estimator for μ is

\bar{x}

with the sample size

n \geq 3

, then the model is the γ-model,

{f_{γ} (\cdot, μ, Σ)

with the known Σ. The proof is parallel to that of Theorem 2 given in [17]. In fact, we conclude that the model density function

f (x)

satisfies that

\begin{matrix} {f (x - μ)}^{γ} = a + b {(x - μ)}^{T} Σ^{- 1} (x - μ) \end{matrix}

where a and b are constants.

Remark 6. If we look at jointly Theorem 2 and 3, then

\begin{matrix} min_{(μ, Σ) \in R^{d} \times S_{d}} L_{γ} (μ, Σ) = max_{f \in F (\bar{x}, S)} H_{γ} (f) \end{matrix}

(41)

since

L_{γ} (\bar{x}, S) = H_{γ} (f_{γ} (\cdot, \bar{x}, S))

. Both sides of (41) associate with inequalities (39) and (36) on γ-divergence in separate discussion.

Remark 7. The derivation of the γ-estimator in Theorem 3 is provided by the canonical parameter

Ξ_{γ}

of the escort distribution as given in (28). Here we directly calculate the gradient of the loss with respect to Σ as follows:

\begin{matrix} \frac{\partial}{\partial Σ} L_{γ} (μ, Σ) & = & - \frac{1}{2 {(1 + γ)}^{2}} det {(Σ)}^{- \frac{1}{2} \frac{γ}{1 + γ}} (1 - ω trace {S (μ) Σ^{- 1}}) Σ^{- 1} \\ + \frac{γ}{(1 + γ)} ω det {(Σ)}^{- \frac{1}{2} \frac{γ}{1 + γ}} Σ^{- 1} S (μ) Σ^{- 1} \\ = & - \frac{1}{2 {(1 + γ)}^{2}} det {(Σ)}^{- \frac{1}{2} \frac{γ}{1 + γ}} \\ \times [(1 - ω trace {S (μ) Σ^{- 1}}) Σ^{- 1} - \frac{1 + γ}{1 + \frac{1}{2} d γ + γ} Σ^{- 1} S (μ) Σ^{- 1}] \end{matrix}

Therefore we observe that if we put

μ = \bar{x}

and

Σ = α S (\bar{x})

, then

\begin{matrix} \frac{\partial}{\partial Σ} L_{γ} (\bar{x}, α S (\bar{x})) = - \frac{1}{2} \frac{γ}{1 + γ} det {(α S (\bar{x}))}^{- \frac{1}{2} \frac{γ}{1 + γ}} {(α S (\bar{x}))}^{- 1} \end{matrix}

\begin{matrix} \times [(1 - ω trace {S (\bar{x}) {(α S (\bar{x}))}^{- 1}}) α S (\bar{x}) - \frac{1 + γ}{1 + \frac{1}{2} d γ + γ} S (\bar{x})] {(α S (\bar{x}))}^{- 1} \end{matrix}

(42)

The bracketed term of (42) is given by

\begin{matrix} [α (1 - ω trace {S (\bar{x}) {(α S (\bar{x}))}^{- 1}}) - \frac{1 + γ}{1 + \frac{1}{2} d γ + γ}] S (\bar{x}) \\ = & (α - \frac{d γ}{2 + d γ + 2 γ} - \frac{1 + γ}{1 + \frac{1}{2} d γ + γ}) S (\bar{x}) \end{matrix}

which concludes that if

α = 1

, then

(\partial / \partial Σ) L_{γ} (\bar{x}, α S (\bar{x})) = 0

. This is a direct proof for Theorem 3, but it would accompany with a heuristic discussion for the substitution of

(μ, Σ)

into

(\bar{x}, α S (\bar{x}))

.

4. Concluding Remarks

We explored the elegant property (39), the empirical Pythagoras relation between the γ-model and γ-estimator, in the sense that (39) directly shows Theorem 3 without any differential calculus. Another elegant expression is in the minimax game between Nature and a decision maker, see [23]. Consider the space

F (μ, Σ)

defined in (20). The intersection of the γ-model (21) and

F (μ, Σ)

is a singleton

{f_{γ} (\cdot, μ, Σ)}

, which is the minimax solution of

\begin{matrix} max_{g \in F (μ, Σ)} min_{f \in F} C_{γ} (g, f) = C_{γ} (f_{γ} (\cdot, μ, Σ), f_{γ} (\cdot, μ, Σ)) \end{matrix}

Consider different indices γ and

γ^{*}

which specify the γ-model and

γ^{*}

-estimator, respectively. Basically the

γ^{*}

-estimator is consistent under the γ-model for any choice of γ and

γ^{*}

. If we specifically fix

γ = 0

for the model, that is, a Gaussian model, then the

γ^{*}

-estimator is shown to be qualitatively robust for any

γ^{*} > 0

, see [16]. The degree of robustness is proportional to the value of

γ^{*}

with a trade for the efficiency. The

γ^{*}

-estimator for

(μ, Σ)

of the Gaussian model is given by the solution of

\begin{matrix} μ & = & \frac{\sum_{i = 1}^{n} f_{0} {(x_{i}, μ, Σ)}^{γ *} x_{i}}{\sum_{i = 1}^{n} f_{0} {(x_{i}, μ, Σ)}^{γ *}} \end{matrix}

\begin{matrix} Σ & = & (1 + γ^{*}) \frac{\sum_{i = 1}^{n} f_{0} {(x_{i}, μ, Σ)}^{γ *} (x_{i} - μ) {(x_{i} - μ)}^{T}}{\sum_{i = 1}^{n} f_{0} {(x_{i}, μ, Σ)}^{γ *}} \end{matrix}

The weight function

f_{0} {(x_{i}, μ, Σ)}^{γ *}

for the i-th observation

x_{i}

becomes almost 0 when

x_{i}

is an outlier. Alternatively, the classical robust method employs

γ^{*} = 0

, that is, the MLE for the misspecified model

γ < 0

or t-distribution model, see [24,25]. Thus, the different indices γ and

γ^{*}

work robust statistics in a dualistic manner.

This property is an extension of that associated between the exponential model and MLE, however, it is fragile in the sense that (19) does not hold if the indices in the γ-model and

γ^{*}

-estimator are slightly different. In practice, we find some difficulties for a numerical task for solving the MLE under the γ-model with

γ > 0

because the support of the density depends on the parameter and the index γ. We discussed statistical and probabilistic properties on the model and estimation associated with the specific cross entropy. A part of properties discussed still holds for any cross entropy in a much wider class, which is investigated from the point of the Fenchel duality in [13,26].

Acknowledgements

We would like to express our thanks to two referees for their helpful comments and constructive suggestions.

References

Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Physics. 1988, 52, 479–487. [Google Scholar] [CrossRef]
Tsallis, C. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World; Springer-Verlag: New York, NY, USA, 2009. [Google Scholar]
Cichocki, A.; Cruces, S.; Amari, S. Families of alpha- beta- and gamma- divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar] [CrossRef]
Csiszàr, I. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 1967, 2, 229–318. [Google Scholar]
Rèny, I.A. On measures of entropy and information. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1961, 1, 547–561. [Google Scholar]
Tøpsoe, F. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inform. Theor. 2000, 46, 1602–1609. [Google Scholar]
Amari, S. Differential-geometrical methods in statistics. In Lecture Notes in Statistics; Springer-Verlag: New York, NY, USA, 1985; Volume 28. [Google Scholar]
Amari, S.; Nagaoka, H. Methods of information geometry. In Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 2000; Volume 191. [Google Scholar]
Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Statist. 1983, 11, 793–803. [Google Scholar] [CrossRef]
Eguchi, S. Geometry of minimum contrast. Hiroshima Math. J. 1992, 22, 631–647. [Google Scholar]
Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimizing a density power divergence. Biometrika 1988, 85, 549–559. [Google Scholar] [CrossRef]
Eguchi, S. Information divergence geometry and the application to statistical machine learning. In Information Theory and Statistical Learning; Emmert-Streib, F., Dehmer, M., Eds.; Springer: New York, NY, USA, 2008; pp. 309–332. [Google Scholar]
Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput. 2002, 14, 1859–1886. [Google Scholar]
Eguchi, S.; Kato, S. Entropy and divergence associated with power function and the statistical application. Entropy 2010, 12, 262–274. [Google Scholar] [CrossRef]
Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
Azzalini, A.; Genton, M.G. On Gauss’s characterization of the normal distribution. Bernoulli 2007, 13, 169–174. [Google Scholar] [CrossRef]
Teicher, H. Maximum likelihood characterization of distributions. Ann. Math. Statist. 1961, 32, 1214–1222. [Google Scholar] [CrossRef]
Amari, S.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185. [Google Scholar] [CrossRef]
Kotz, S.; Nadarajah, S. Multivariate T Distributions and Their Applications; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985, 15, 341–391. [Google Scholar]
Wakaki, H.; Eguchi, S.; Fujikoshi, Y. A class of tests for general covariance structure. J. Multivariate Anal. 1990, 32, 313–325. [Google Scholar] [CrossRef]
Grünwald, P.D.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Ann. Statist. 2004, 32, 1367–1433. [Google Scholar]
Kent, J.T.; Tyler, D.E. Redescending M-estimates of multivariate location and scatter. Ann. Statist. 1991, 19, 2102–2119. [Google Scholar] [CrossRef]
Marrona, R.A. Robust M-estimators of multivariate location and scatter. Ann. Statist. 1976, 4, 51–67. [Google Scholar] [CrossRef]
Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Exposition 2006, 19, 197–216. [Google Scholar]

Appendix 1

We show (5). It follows from l’H

\hat{o}

pital’s rule that

\begin{matrix} lim_{γ \to 0} D_{γ} (g, f) = {(\frac{\partial}{\partial γ} [{\{\int g {(x)}^{1 + γ} d x\}}^{\frac{1}{1 + γ}} - \frac{\int g (x) f {(x)}^{γ} d x}{{\{\int f {(x)}^{1 + γ} d x\}}^{\frac{γ}{1 + γ}}}])}_{γ = 0} \end{matrix}

which is written as

\begin{matrix} (\frac{1}{1 + γ} {\{\int g {(x)}^{1 + γ} d x\}}^{\frac{- γ}{1 + γ}} \int g {(x)}^{1 + γ} log g (x) d x - \frac{\int g (x) f {(x)}^{γ} log f (x) d x}{{\{\int f {(x)}^{1 + γ} d x\}}^{\frac{γ}{1 + γ}}} \\ {+ \frac{γ}{1 + γ} \frac{\int g (x) f {(x)}^{γ} d x}{{\{\int f {(x)}^{1 + γ} d x\}}^{\frac{1 + 2 γ}{1 + γ}}} \int f {(x)}^{1 + γ} log f (x) d x)}_{γ = 0} \end{matrix}

which is reduced to

\begin{matrix} \int g (x) log g (x) d x - \int g (x) log f (x) d x \end{matrix}

This completes the proof of (5). ☐

Appendix 2

First, we give the formula for

c_{γ}

in (22) when

γ > 0

. Let

\begin{matrix} I = \frac{1}{\det {(2 π Σ)}^{\frac{1}{2}}} \int {\{1 - ω {(x - μ)}^{T} Σ^{- 1} (x - μ)\}}_{+}^{\frac{1}{γ}} d x \end{matrix}

where

ω = \frac{γ}{2 + d γ + 2 γ} .

The integral is rewritten as

\begin{matrix} I = {(2 π ω)}^{- \frac{d}{2}} \int {(1 - y^{T} y)}_{+}^{\frac{1}{γ}} d y \end{matrix}

where

y = {(ω)}^{\frac{1}{2}} Σ^{- 1 / 2} (x - μ)

. It is expressed in polar coordinates as

\begin{matrix} I = {(2 π ω)}^{- \frac{d}{2}} S^{d - 1} \int_{0}^{1} {(1 - r^{2})}^{\frac{1}{γ}} r^{d - 1} d r \end{matrix}

(43)

where

S^{d - 1}

is the surface area of the unit sphere of

d - 1

dimension, that is,

\begin{matrix} S^{d - 1} = \frac{2 π^{\frac{d}{2}}}{Γ (\frac{d}{2})} \end{matrix}

Since the integral in (43) is expressed by a beta function, we have

\begin{matrix} c_{γ} = I^{- 1} = {(2 ω)}^{\frac{d}{2}} \frac{Γ (1 + \frac{d}{2} + \frac{1}{γ})}{Γ (1 + \frac{1}{γ})} \end{matrix}

Second, we give the formula when

- \frac{2}{d + 2} < γ < 0 .

The argument similar to the above

\begin{matrix} I = {(- 2 π ω)}^{- \frac{d}{2}} \int {(1 + y^{T} y)}^{\frac{1}{γ}} d y \end{matrix}

where

y = {(- 2 π ω)}^{1 / 2} Σ^{- 1 / 2} (x - μ)

. It is expressed in polar coordinates as

\begin{matrix} I = {(- 2 π ω)}^{- \frac{d}{2}} S^{d - 1} \int_{0}^{\infty} {(1 + r^{2})}^{\frac{1}{γ}} r^{d - 1} d r \end{matrix}

which leads that

\begin{matrix} c_{γ} = {(- 2 ω)}^{\frac{d}{2}} \frac{Γ (- \frac{1}{γ})}{Γ (- \frac{1}{γ} - \frac{d}{2})} \end{matrix}

© 2011 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/.)

Share and Cite

MDPI and ACS Style

Eguchi, S.; Komori, O.; Kato, S. Projective Power Entropy and Maximum Tsallis Entropy Distributions. Entropy 2011, 13, 1746-1764. https://doi.org/10.3390/e13101746

AMA Style

Eguchi S, Komori O, Kato S. Projective Power Entropy and Maximum Tsallis Entropy Distributions. Entropy. 2011; 13(10):1746-1764. https://doi.org/10.3390/e13101746

Chicago/Turabian Style

Eguchi, Shinto, Osamu Komori, and Shogo Kato. 2011. "Projective Power Entropy and Maximum Tsallis Entropy Distributions" Entropy 13, no. 10: 1746-1764. https://doi.org/10.3390/e13101746

APA Style

Eguchi, S., Komori, O., & Kato, S. (2011). Projective Power Entropy and Maximum Tsallis Entropy Distributions. Entropy, 13(10), 1746-1764. https://doi.org/10.3390/e13101746

Article Menu

Projective Power Entropy and Maximum Tsallis Entropy Distributions

Abstract

1. Introduction

2. Projective Invariance

3. Model of Maximum Entropy Distributions

4. Concluding Remarks

Acknowledgements

References

Appendix 1

Appendix 2

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI