Group Invariance of Information Geometry on q-Gaussian Distributions Induced by Beta-Divergence

We demonstrate that, among deformed exponential families, the q-exponential family admits particularly natural geometrical structures. The property is the invariance of the structures under the general linear group, which acts transitively on the space of positive definite matrices. We prove this property via the correspondence between the information geometry induced by a deformed potential on that space and the one induced by what we call the β-divergence defined on the q-exponential family with q = β + 1. The results are fundamental in robust multivariate analysis using the q-Gaussian family.


Introduction
Generalizations of the exponential family have recently received much attention in mathematical statistics and statistical physics [1-6]. One of their goals is to attack a wider class of statistical problems that range outside those solved via the well-established theory of the exponential family [7]. Among such generalizations, the q-exponential family, in which the ordinary exponential function is replaced by the q-exponential [8], naturally appears and plays important roles in both experimental and theoretical aspects.
For example, the family not only describes phenomena obeying power laws well [6], but is also theoretically proven to include the velocity distribution of a classical gas with N particles [9,10], an attracting invariant manifold of porous media flow [11], and so on. In statistics, it is reported to provide a reasonable statistical model for robust inference from data that deviate from normality [1,12-14]. In addition, quite interesting and abundant mathematical structures have been developed [15-19] for the q-exponential function itself.
On certain families of elliptical distributions [20,21], we can introduce information geometric structures [22,23] starting from what is called the U-divergence [1] instead of the Kullback-Leibler divergence, in order to geometrically tackle various statistical problems, including the above robustness analysis. Zero-mean elliptical density functions f_P(x) = u(−x^T P x/2 − c_U(det P)), with a fixed function, u, and the normalizing constant, c_U(det P), can be specified by positive definite matrices, P. Hence, we can discuss the geometric structures of such density functions explicitly, e.g., with no integrations, via the corresponding information geometry on the parameter space of positive definite matrices, called the V-geometry [24].
In the present paper, following the above idea, we focus on investigating the geometric structures of the q-Gaussian family induced from the β-divergence [1] via the V-geometry. For this purpose, we establish the correspondence between the two geometries and derive explicit formulas for important geometric quantities, such as the Riemannian metric and the mutually dual affine connections. Consequently, we can prove that the information geometry on the q-Gaussian family enjoys fairly natural group invariance properties. These invariances, or homogeneities, are important in multivariate analysis (see, e.g., [21,25]). Further, they practically assure that statistical inferences based on the geometrical theory are independent of linear transformations of the multivariate random variables, such as scaling or numerical conditioning in computations. It should additionally be mentioned that our results might shed new light on the rich mathematical structures of the q-exponential or power functions.
The organization of the paper is as follows: Section 2 collects necessary results on the V-geometry of positive definite matrices and the U-divergence defined on elliptical distributions. In Section 3, we discuss the group invariances of fundamental structures of the V-geometry induced by the power potentials. We find that its pair of mutually dual connections and its orthogonality are GL(n, R)-invariant, which is a natural requirement for geometries on positive definite matrices. Section 4 is devoted to demonstrating that the dualistic geometry on the q-Gaussian family induced by the β-divergence coincides with the V-geometry with the power potentials. Finally, Section 5 gives concluding remarks.

Preliminaries: Geometries on Positive Definite Matrices and the U -Model
We recall the relation between the information geometry on positive definite matrices induced by V-potentials and that on a multivariate statistical model called the U-model. Details of the results described in this section and the ideas behind them can be found in [1,24].

V-Potential Function and the Induced Geometry on Positive Definite Matrices
Denote by Sym(n, R) the vector space of n × n real symmetric matrices and by PD(n, R) the convex cone of n × n positive definite matrices in Sym(n, R). For two matrices, X and Y, in Sym(n, R), we denote tr(XY) by ⟨X, Y⟩. For an arbitrary set of basis matrices {E_i} of Sym(n, R), the space Sym(n, R) is isomorphic to the tangent space at each P ∈ PD(n, R). Hence, we identify each E_i with a tangent vector, (∂/∂x_i)_P.

Definition 1. Let V(s) be a smooth function of real numbers s > 0. The function defined by:

ϕ^(V)(P) = V(det P)

is called a V-potential on PD(n, R).
When V(s) = −log s, the V-potential reduces to the standard one, called the characteristic function on PD(n, R), which plays a fundamental role in the geometrical theory of PD(n, R) [26-28].
Let ν_i(s), i = 1, 2, …, be the functions defined by:

ν_1(s) = sV′(s),  ν_{i+1}(s) = sν_i′(s).

We assume that V(s) satisfies the following two conditions:

ν_1(s) < 0 and ν_1(s) − nν_2(s) < 0, for all s > 0,   (3)

which are later shown to ensure the convexity of ϕ^(V)(P) on PD(n, R). Note that the first condition, ν_1(s) < 0 for all s > 0, implies that the function, V(s), is strictly decreasing on s > 0.
Using the formula grad det P = (det P)P^{-1}, we have the gradient mapping, grad ϕ^(V):

grad ϕ^(V): P ↦ P* = ν_1(det P)P^{-1}.

The Hessian of ϕ^(V) at P ∈ PD(n, R), which we write as g^(V)_P, is given by:

g^(V)_P(X, Y) = ν_2(det P) tr(P^{-1}X) tr(P^{-1}Y) − ν_1(det P) tr(P^{-1}XP^{-1}Y)   (5)

for arbitrary tangent vectors, X and Y, in Sym(n, R).
Proposition 1. [24] The Hessian, g^(V), is positive definite on PD(n, R) if and only if the conditions in Equation (3) hold.
To establish the Legendre relation on PD(n, R), we consider the conjugate function of ϕ^(V), denoted by ϕ^(V)*. Define the Legendre transform:

ϕ^(V)*(P*) = sup_P { ⟨P, P*⟩ − ϕ^(V)(P) }.

Since the extremal condition is:

P* = grad ϕ^(V)(P),

and grad ϕ^(V) is invertible by the positive definiteness of g^(V), we have the following expression for ϕ^(V)* with respect to P:

ϕ^(V)*(P*) = ⟨P, P*⟩ − ϕ^(V)(P) = nν_1(det P) − V(det P).

Hence, the canonical divergence [23] is obtained as:

D^(V)(P, Q) = ϕ^(V)(P) + ϕ^(V)*(Q*) − ⟨P, Q*⟩ = V(det P) − V(det Q) − ν_1(det Q){tr(Q^{-1}P) − n}.   (9)

Regarding g^(V) as a Riemannian metric, we can consider PD(n, R) as a Riemannian manifold. Further, using the canonical flat affine connection on Sym(n, R), denoted by ∇, define the dual affine connection [23], *∇^(V), satisfying:

X g^(V)(Y, Z) = g^(V)(∇_X Y, Z) + g^(V)(Y, *∇^(V)_X Z)   (10)

for arbitrary tangent vector fields, X, Y and Z, on PD(n, R); then, we can introduce a dually flat structure [23] or a Hessian structure [29], (PD(n, R), g^(V), ∇, *∇^(V)). The covariant derivatives of *∇^(V) at P are given explicitly in Equation (11) (see [24]) in terms of the functions ν_i(s) with s = det P. Since several further properties of the pair of mutually dual connections, ∇ and *∇^(V), are stated in [24], we omit them here for the sake of simplicity. However, from a geometrical viewpoint, we should note that the following two important properties are related to the invariance of the structures. In Section 3, we shall return to these points and discuss them in detail.

Proposition 2. [24]
1. The dually flat structure, (PD(n, R), g^(V), ∇, *∇^(V)), is SL(n, R)-invariant.
2. When V(s) is a power function (with a constant term) of the form:

V(s) = c_1 + c_2 s^β   (12)

for real constants, c_1, c_2 and β, satisfying Equation (3), both affine connections, ∇ and *∇^(V), are GL(n, R)-invariant. Further, the orthogonality with respect to g^(V) is also GL(n, R)-invariant, while g^(V) itself is not.
Remark 1. One interesting implication of the second point is that, for the power function in Equation (12), both the ∇- and *∇^(V)-projections [23], which can also be variationally characterized by the divergence in Equation (9) and its dual, are GL(n, R)-invariant; hence, so is the Pythagorean theorem [23], as a one-dimensional special case of this implication. Conversely, the GL(n, R)-invariance of the Pythagorean theorem implies those of both projections, because ∇ and *∇^(V) are torsion-free [23].

Relation between Information Geometries on the U-model and Positive Definite Matrices
We briefly introduce the U-divergence and the U-model and show how the dualistic geometries induced from the U-divergence and the V-potential are related.
In the field of statistical inference, the well-established method is the maximum likelihood method, which is based on the Kullback-Leibler divergence. To improve the robustness of the method while maintaining its theoretical advantages, such as efficiency, methods that minimize general divergences have been proposed as alternatives to the maximum likelihood method [1,13,30-32].
Definition 2. Let U(s) be a smooth convex function with positive derivative u(s) = U′(s) > 0 on R or a (semi-infinite) interval thereof, and let ξ be the inverse function of u there. If the following functional for two functions, f(x) and g(x), on R^n:

D_U(f, g) = ∫ [ U(ξ(g(x))) − {U(ξ(f(x))) + u(ξ(f(x)))(ξ(g(x)) − ξ(f(x)))} ] dx

exists, we call it the U-divergence.
It follows that D_U(f, g) ≥ 0, and D_U(f, g) = 0 if and only if f = g, because the integrand, U(ξ_g) − {U(ξ_f) + u(ξ_f)(ξ_g − ξ_f)}, where ξ_f = ξ(f) and ξ_g = ξ(g), is the difference between the convex function, U, and its supporting function. If we set U(s) = (1/(β+1))(1 + βs)^{(β+1)/β} for β ∈ R, then the corresponding U-divergence is the β-divergence [1] defined by:

D_β(f, g) = ∫ [ {g(x)^{β+1} − f(x)^{β+1}}/(β + 1) − f(x){g(x)^β − f(x)^β}/β ] dx.   (14)

As β goes to zero, it reduces to the Kullback-Leibler divergence; on the other hand, as β goes to one, it reduces to the squared L_2-distance. Thus, the efficiency increases as β goes to zero, while the robustness increases as β goes to one. In this sense, we can choose an appropriate β between zero and one as a trade-off between efficiency and robustness. The β-divergence is strongly connected to the Tsallis entropy [33].
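These limiting claims are easy to check numerically. The following sketch (Python with NumPy; the function names are ours) integrates a β-divergence with integrand {g^{β+1} − f^{β+1}}/(β+1) − f{g^β − f^β}/β between two Gaussian densities on a grid: for small β it is close to the Kullback-Leibler divergence, and it vanishes when the two arguments coincide.

```python
import numpy as np

def beta_divergence(f, g, dx, beta):
    # Discretized beta-divergence: f, g are density values on a uniform grid
    # with spacing dx; the integrand follows Equation (14).
    integrand = (g**(beta + 1) - f**(beta + 1)) / (beta + 1) \
                + f * (f**beta - g**beta) / beta
    return float(np.sum(integrand) * dx)

def kl_divergence(f, g, dx):
    # Kullback-Leibler divergence on the same grid, for comparison.
    return float(np.sum(f * np.log(f / g)) * dx)

# Two unit-variance Gaussian densities with means 0 and 0.5.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
g = np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)

d_small = beta_divergence(f, g, dx, beta=1e-4)  # close to KL as beta -> 0
d_half = beta_divergence(f, g, dx, beta=0.5)    # positive for f != g
```

For these two Gaussians, the KL divergence is (0.5)^2/2 = 0.125, and `d_small` agrees with it to grid accuracy.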
When we consider a family of functions parametrized by the elements of a manifold, M, a divergence on the family induces a dualistic structure on M [1]. Concretely, we here confine our attention to the family of multivariate probability density functions specified by P in M = PD(n, R) and then study its structure on PD(n, R). The family is natural in the sense that it is a dually flat statistical manifold with respect to the dualistic geometry induced by the U-divergence.

Definition 3. Let U and u be the functions given in Definition 2. The family of elliptical distributions with the following density functions:

f_U(x, P) = u(−x^T P x/2 − c_U(det P))

is called the U-model associated with the U-divergence. Here, we set f_U(x, P) = 0 if the right-hand side is nonpositive or undefined, and c_U(det P) is a normalizing constant.
Note that if u satisfies the following self-similarity property:

u(s − t) = a_t u(b_t s),

with positive constants, a_t and b_t, depending on t, then, taking t = c_U(det P), the density function, f_U, in the U-model can alternatively be expressed in the usual form of an elliptical distribution [20,21], i.e.,

f_U(x, P) = a_t u(−b_t x^T P x/2).

One such example is the β-model, to be discussed in Section 4 and the Appendix. The probability density function, f_U(x, P), has mean vector zero and variance matrix cP^{-1}, where c is a positive constant obtained from the characteristic function of f_U(x, P), and P is called the precision matrix. Now, we consider the correspondence between the dualistic geometry induced by D_U on the U-model and that induced on PD(n, R) by the V-potential function discussed above.

Proposition 3. [24] Define the V-potential function, ϕ^(V), from U as in [24], and assume that V satisfies the conditions in Equation (3); then, the dually flat structure, (g^(V), ∇, *∇^(V)), on PD(n, R) coincides with that on the U-model induced by the U-divergence, D_U.

Statistical Estimation on the U-Model
We discuss statistical estimation of the precision matrix parameter, P, in the U-model, M_U. The U-divergence, D_U(f, g), is decomposed into the difference of the U-cross entropy, C_U(f, g), and the U-entropy, H_U(f):

D_U(f, g) = C_U(f, g) − H_U(f),  C_U(f, g) = ∫ { U(ξ(g)) − f ξ(g) } dx,  H_U(f) = C_U(f, f).

Consider a maximum U-entropy distribution on the space of distributions with equal moments. Let F be the space of all probability density functions on R^n and F(P) the zero-mean and equal-variance subspace of F:

F(P) = { f ∈ F | E_f[x] = 0, E_f[xx^T] = cP^{-1} },

where E_f denotes the statistical expectation with respect to f, and c is the aforementioned constant. Then, we observe that:

H_U(f) ≤ H_U(f_U(·, P))

for any f in F(P), with equality only if f = f_U(·, P). Thus, the U-model, M_U, is characterized by maximum U-entropy distributions.
Let {X_i}_{1≤i≤N} be random samples from a probability density function, f_U(x, P). Then, the U-loss function is defined by:

L_U(P) = −(1/N) Σ_{i=1}^{N} ξ(f_U(X_i, P)) + ∫ U(ξ(f_U(x, P))) dx,

and the U-estimator, P̂_U, is defined as the minimizer of L_U(P). The U-loss function is an empirical analogue of the U-cross entropy, in the sense that its expectation under the sampling distribution equals the corresponding U-cross entropy. By definition, P̂_U is a solution of ∂L_U(P)/∂P = 0 if the solution is unique. Hence, we conclude that if N ≥ n, then P̂_U = cS^{-1}, where S is the sample variance matrix defined by

S = (1/N) Σ_{i=1}^{N} X_i X_i^T,

and cP^{-1} = ∫ xx^T f_U(x, P) dx. We remark that S is positive definite with probability one if N ≥ n.
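The closed form P̂_U = cS^{-1} can be illustrated in the simplest case, the Gaussian model (U = exp), where the constant c equals one; a minimal sketch in Python with NumPy, with our own function name:

```python
import numpy as np

def precision_estimate(samples):
    # U-estimator of the precision matrix, assuming the Gaussian case (U = exp,
    # c = 1): invert the zero-mean sample variance S = (1/N) sum_i X_i X_i^T.
    X = np.asarray(samples, dtype=float)
    S = X.T @ X / X.shape[0]
    return np.linalg.inv(S)

# Deterministic check: these four samples have sample variance diag(2, 0.5),
# so the estimated precision matrix is diag(0.5, 2).
samples = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
P_hat = precision_estimate(samples)
```

The inversion requires N ≥ n samples in general position, matching the remark above that S is positive definite with probability one when N ≥ n.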
The derivation of P̂_U is confirmed by the fact that

L_U(P) ≥ L_U(P̂_U),

with equality only if P = P̂_U. Surprisingly, such U-estimators are independent of the choice of U, which implies that the U-estimator for a U-model equals the maximum likelihood estimator for the Gaussian model. On the other hand, assume that {X_i}_{1≤i≤N} are random samples from a Gaussian density function, that is, f_U(x, P) with U = exp. Then, unless U = exp, the U-estimator, P̂_U, for the precision parameter, P, in the Gaussian model has no exact expression. This is a rather different aspect from the situation discussed above, and P̂_U, obtained by an iteration algorithm, is shown to be robust against heavy outliers if the generator function, U, satisfies a tail condition.
For example, if we select U(s) = (1/(β+1))(1 + βs)^{(β+1)/β} with a fixed β > 0, then the corresponding divergence is the β-divergence given in Equation (14), and the corresponding U-estimator is called the β-estimator. The estimator is associated with an iteration algorithm, say {P_t, t = 1, 2, …}, with an initial value, P_1, in which the update, P_{t+1}, from the t-th step, P_t, takes the weighted variance form of Equation (26), with the weight w(x, P) = exp(−(β/2) x^T P x) and the constant d_β = β/(β + 1)^{n+1}. See [1] for a detailed discussion. We remark that the β-estimator, P̂_β, satisfies the fixed-point condition obtained by setting P_{t+1} = P_t = P̂_β in Equation (26). Therefore, if the i-th observation, X_i, has an extremely large value of X_i^T P̂_β X_i, then the i-th weight, w(X_i, P̂_β), in the weighted variance form of Equation (26) becomes negligible, so that the β-estimator is automatically robust against these outliers. The degree of robustness of P̂_β depends on the value of β. In this way, it is also possible to introduce a dualistic structure on the pairs of U-models and U-estimators.
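The robustness mechanism of the weighted iteration can be sketched as follows (Python with NumPy). This is an illustrative variant, not the paper's exact Equation (26): we normalize by the sum of the weights instead of carrying the constant d_β, which is our simplification; the weight w(x, P) = exp(−(β/2) x^T P x) is as in the text.

```python
import numpy as np

def beta_estimate(samples, beta, n_iter=20):
    # Fixed-point iteration for a beta-estimator of the precision matrix:
    # repeatedly invert a weighted sample variance, where the weights
    # w(x, P) = exp(-(beta/2) x^T P x) make gross outliers negligible.
    X = np.asarray(samples, dtype=float)
    P = np.eye(X.shape[1])
    for _ in range(n_iter):
        w = np.exp(-0.5 * beta * np.einsum("ij,jk,ik->i", X, P, X))
        S_w = (X * w[:, None]).T @ X / w.sum()
        P = np.linalg.inv(S_w)
    return P

# Four unit-radius inliers plus one gross outlier at (20, 20).
samples = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0],
                    [20.0, 20.0]])
P_hat = beta_estimate(samples, beta=0.5)
# The outlier receives an essentially zero weight, so P_hat recovers the
# inverse variance of the inliers alone, diag(2, 2).
```

A plain inversion of the unweighted sample variance of the same data would instead be dominated by the single outlier, which is exactly the failure mode the downweighting avoids.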

GL(n, R)-Invariance Induced from the Power Function
The transformation group, {τ_G | G ∈ GL(n, R)}, acts transitively on PD(n, R), i.e., for any P, P′ ∈ PD(n, R), there exists G ∈ GL(n, R) such that τ_G P = GPG^T = P′. We denote by τ_G* the differential of τ_G. The invariance of the geometry under these transformations is defined as follows:

Definition 4. We say a dually flat structure, (PD(n, R), g^(V), ∇, *∇^(V)), is GL(n, R)-invariant if the Riemannian metric, g^(V), and the pair of mutually dual connections, ∇ and *∇^(V), satisfy:

g^(V)_{τ_G P}(X′, Y′) = g^(V)_P(X, Y),  τ_G*(∇_X Y) = ∇_{X′} Y′,  τ_G*(*∇^(V)_X Y) = *∇^(V)_{X′} Y′

for arbitrary G ∈ GL(n, R), where X′ = τ_G* X and Y′ = τ_G* Y.

We can similarly define SL(n, R)-invariance, although SL(n, R) does not act transitively on PD(n, R). These invariances mean that the geometries are homogeneous under the corresponding transformations. They practically imply that the obtained geometrical results are not influenced by scaling (unit changes), numerical conditioning, and so on. Note that (PD(n, R), g^(V), ∇, *∇^(V)) is SL(n, R)-invariant, because the dually flat structure can be alternatively derived from the canonical divergence [1,23]. Now, we fix the form of V as V(s) = c_1 + c_2 s^β to confirm the invariance property described in the second statement of Proposition 2.
First, the convexity conditions in Equation (3) reduce to:

c_2 β < 0 and nβ < 1.   (30)

The dual variable, P*, and the conjugate function, ϕ^(V)*, are expressed via P as:

P* = c_2 β (det P)^β P^{-1},  ϕ^(V)*(P*) = −c_1 + c_2 (nβ − 1)(det P)^β.

The corresponding Riemannian metric in Equation (5) and divergence in Equation (9) are respectively given by:

g^(V)_P(X, Y) = −c_2 β (det P)^β { tr(P^{-1}XP^{-1}Y) − β tr(P^{-1}X) tr(P^{-1}Y) },

D^(V)(P, Q) = c_2 { (det P)^β − (det Q)^β } − c_2 β (det Q)^β { tr(Q^{-1}P) − n }.

When we particularly set c_1 = −c_2 = 1/β, i.e., V(s) = (1 − s^β)/β, and let β go to zero, they converge to the standard Riemannian metric and divergence [27] for V(s) = −log s. We immediately see that the above g^(V) and D^(V) are not GL(n, R)- but SL(n, R)-invariant. However, g^(V)_P(X, Y) = 0 if and only if g^(V)_{τ_G P}(τ_G* X, τ_G* Y) = 0 for any G ∈ GL(n, R); thus, the orthogonality is GL(n, R)-invariant. Next, the covariant derivatives of ∇ vanish everywhere, since it is the canonical flat affine connection. For the covariant derivatives of *∇^(V), the first and second terms in Equation (11) are seen to be GL(n, R)-invariant, and the third and fourth terms respectively reduce to expressions that are independent of det P. Thus, we find that both ∇ and *∇^(V) are GL(n, R)-invariant.
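These (in)variance claims can be checked numerically. The sketch below (Python with NumPy; the explicit metric form, written via ν_1(s) = c_2 β s^β and ν_2(s) = βν_1(s) as g_P(X, Y) = ν_2(det P) tr(P^{-1}X) tr(P^{-1}Y) − ν_1(det P) tr(P^{-1}XP^{-1}Y), is our assumption consistent with the Hessian of V(det P)) verifies that τ_G rescales the metric by (det G)^{2β}, so vanishing, i.e., orthogonality, survives a general G, while the metric itself survives only det G = ±1.

```python
import numpy as np

def metric(P, X, Y, beta, c2=-1.0):
    # Hessian metric of the power potential V(s) = c1 + c2 * s**beta at
    # s = det P, with nu1(s) = c2*beta*s**beta and nu2(s) = beta*nu1(s).
    s = np.linalg.det(P)
    nu1 = c2 * beta * s**beta
    nu2 = beta * nu1
    Pi = np.linalg.inv(P)
    return nu2 * np.trace(Pi @ X) * np.trace(Pi @ Y) \
        - nu1 * np.trace(Pi @ X @ Pi @ Y)

beta = 0.25                                   # satisfies c2*beta < 0, n*beta < 1
P = np.array([[2.0, 0.3], [0.3, 1.0]])
X = np.array([[1.0, 0.2], [0.2, 0.5]])
Y = np.array([[0.4, -0.1], [-0.1, 2.0]])
G = np.array([[1.5, 0.0], [0.7, 2.0]])        # an arbitrary GL(2, R) element

tau = lambda A: G @ A @ G.T                   # tau_G and its differential
lhs = metric(tau(P), tau(X), tau(Y), beta)
rhs = np.linalg.det(G)**(2 * beta) * metric(P, X, Y, beta)
# lhs == rhs: the metric picks up the factor (det G)^(2*beta), so it is
# SL(n, R)- but not GL(n, R)-invariant, while orthogonality is preserved.
```

Since tr(P^{-1}X) and tr(P^{-1}XP^{-1}Y) are themselves GL(n, R)-invariant, the whole transformation behavior sits in the scalar factor (det P)^β, which is the content of the check.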
Finally, consider two smooth curves, γ = {C_γ(t) | −ε < t < ε} and γ* = {C_{γ*}(t) | −ε < t < ε}, in PD(n, R), satisfying:

C_γ(0) = C_{γ*}(0) = Q,  (d/dt)C_γ(0) = X,  (d/dt)C_{γ*}(0) = Y.   (35)

Since ∇ is the canonical flat connection on Sym(n, R), γ is a ∇-geodesic iff it is represented by C_γ(t) = Q + tX. On the other hand, γ* is a *∇^(V)-geodesic iff it is represented as a straight line in the dual variable, C*_{γ*}(t) [23]. Let us obtain its explicit form. Since

C*_{γ*}(t) = ν_1(det C_{γ*}(t)) C_{γ*}(t)^{-1},

it follows that:

(d/dt)C*_{γ*}(0) = ν_2(det Q) tr(Q^{-1}Y) Q^{-1} − ν_1(det Q) Q^{-1}YQ^{-1},

by substituting Equation (35). Thus, γ* is a *∇^(V)-geodesic iff it is represented in the dual variables by:

C*_{γ*}(t) = Q* + t { ν_2(det Q) tr(Q^{-1}Y) Q^{-1} − ν_1(det Q) Q^{-1}YQ^{-1} },  Q* = ν_1(det Q) Q^{-1}.

Assume that X and Y are mutually orthogonal at Q, and that two points, P and R, in PD(n, R) are, respectively, located on the ∇-geodesic γ and the *∇^(V)-geodesic γ* satisfying Equation (35), i.e.,

P = C_γ(t_1),  R = C_{γ*}(t_2)   (39)

for some real numbers, t_1 and t_2. Then, we have:

D^(V)(P, R) = D^(V)(P, Q) + D^(V)(Q, R),   (40)

which is the Pythagorean theorem. If we, respectively, transform Q, X and Y in Equation (35) to τ_G Q, τ_G* X and τ_G* Y, we see that Equations (39) and (40) hold with t_2 replaced by t_2 (det G)^{2β}. Thus, even if the three points, P, Q, R, are, respectively, transformed by τ_G to P′, Q′, R′ with an arbitrary, but common, G ∈ GL(n, R), the Pythagorean theorem still holds for P′, Q′, R′. The GL(n, R)-invariances of both the ∇- and *∇^(V)-projections are similarly confirmed. Thus, we have confirmed the second statement of Proposition 2, i.e., the above GL(n, R)-invariance holds if V(s) = c_1 + c_2 s^β. In fact, the converse of Proposition 2 is also true.
Theorem 1. Assume that the function, V, satisfies Equation (3). The mutually dual connections, ∇ and *∇^(V), and the orthogonality with respect to g^(V) are GL(n, R)-invariant if and only if V(s) = c_1 + c_2 s^β or V(s) = c_1 + c_2 log s for real constants, c_1, c_2 and β.

Proof. We only show the "only if" part. The covariant derivatives of ∇ are independent of V and clearly invariant. For those of *∇^(V) in Equation (11), the first and second terms can readily be seen to be invariant. The third term, Φ, and the fourth term, Φ⊥, are invariant only if the ratios, ν_2(s)/ν_1(s) and ν_3(s)/ν_1(s), are as well, because the coefficients in these terms are expressed by these ratios together with tr(P^{-1}X), tr(P^{-1}Y) and tr(P^{-1}XP^{-1}Y), which are invariant. From the definition of the ν_i(s)'s, the invariance of ν_2(s)/ν_1(s) is satisfied by the solutions, ν_1(s), of the ordinary differential equation:

s ν_1′(s) = β ν_1(s)

for a real constant, β. By solving the ODE and integrating again, we have V(s) = c_1 + c_2 s^β or V(s) = c_1 + c_2 log s for real constants, c_i, i = 1, 2. These forms also satisfy the invariance of ν_3(s)/ν_1(s). The invariance of the orthogonality for such functions, V, has already been confirmed.
Note that the above theorem does not exclude the possibility of GL(n, R)-invariance for potentials of forms other than the V-potentials.
Geometry on the q-Gaussian Family Induced by the β-Divergence and the V-Geometry

This section demonstrates the main result. Let β be a real parameter satisfying β ≠ 0 and β ≠ −1, and define a function, U, by:

U(s) = (1/(β+1)) (1 + βs)^{(β+1)/β}.

Using its derivative, we define a function, u, by:

u(s) = U′(s) = (1 + βs)^{1/β},

and the inverse, ξ, of u on I_β, the range of u, by:

ξ(t) = (t^β − 1)/β.

Note that U is convex and u is positive where s > −1/β if β > 0 and s < −1/β if β < 0, respectively. Further, u and ξ respectively approach the usual exponential and logarithmic functions as β goes to zero. Hence, by introducing the parameter q = 1 + β, they are called the q-exponential and q-logarithmic functions in the literature of nonextensive statistical physics [4,6].
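As a small illustration (Python with NumPy; the function names are ours), u and ξ can be implemented directly, and one can check that they are mutually inverse and tend to exp and log as β → 0:

```python
import numpy as np

def q_exp(s, beta):
    # u(s) = (1 + beta*s)^(1/beta), defined where 1 + beta*s > 0.
    return (1.0 + beta * s) ** (1.0 / beta)

def q_log(t, beta):
    # xi(t) = (t^beta - 1)/beta, the inverse of u on the range of u.
    return (t ** beta - 1.0) / beta

round_trip = q_log(q_exp(0.7, beta=0.5), beta=0.5)  # recovers 0.7
near_exp = q_exp(0.7, beta=1e-6)                    # close to exp(0.7)
near_log = q_log(2.3, beta=1e-6)                    # close to log(2.3)
```

For β = 0.5, for instance, q_exp(0.7) = (1.35)^2 = 1.8225, and applying q_log returns 0.7 exactly.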
Let us fix the parameter, β, arbitrarily, and consider the elliptical density function, f(x, P), specified by P ∈ PD(n, R) using the q-exponential function, u:

f(x, P) = u(−x^T P x/2 − c_β(det P)),

where c_β(det P) is the normalizing constant. The density function, f, is called the (zero-mean) q-Gaussian [4,6], and we call the family of all such f's, denoted by M_β, the q-Gaussian family or β-model. Starting from the β-divergence, we can define a dualistic structure on M_β, which invokes the corresponding V-geometry on the parameter space, PD(n, R). The V-potential for this V-geometry is obtained as follows:

Theorem 2. The information geometry on M_β induced from the β-divergence is characterized by the dually flat structure, (g^(V), ∇, *∇^(V)), on PD(n, R) induced by V-potentials of the power form in Equation (12) satisfying Equation (30), in which s = det P, the exponent is determined through n_β = n/2 + 1/β, and the coefficients, c_±, are constants depending on β and n; the explicit expressions are given in the Appendix.
The proof can be found in the Appendix.
The above theorem implies that the geometric structure on M_β induced from the β-divergence admits the natural invariance properties discussed in Section 3.

Conclusions
We have proven that the information geometry on the q-Gaussian family induced by the β-divergence is equivalently characterized by the V-geometry on the space of positive definite matrices induced by the power potential. Studying the corresponding V-geometry, we have shown that some of the dually flat structures of the q-Gaussian family admit GL(n, R)-invariances. This fact implies the importance of the family in multivariate statistical analysis, as well as giving a geometrical viewpoint on the mathematical properties of the q-exponential functions. Following the approach given in Section 2.1, we can introduce other dually flat structures via any convex potentials in addition to the V-potential, by defining Riemannian metrics as their Hessians and dual flat connections satisfying Equation (10) for ∇. The relations between such dually flat structures and those on other deformed exponential families are left for future work.
Robustness in statistical estimation involving the q-Gaussian family (β-model) and the β-divergence, which is another important aspect, is briefly discussed at the end of Section 2.3.
Recently, the theory of optimal transportation has made great developments, in which geometrical insight is founded with a close relation to the resolution of the Poincaré conjecture [34]. We can find some ideas and arguments similar to those established in this paper, although no direct link exists between the objectives. In effect, a divergence is defined in [35] and shown to play a significant role in checking a condition for the existence of optimal transport, where geometry is deeply explored by considering a family of probability density functions on a space in place of investigating the properties of the space directly. It is expected that a coupling of that theory and information geometry will give fruitful results in the near future.