Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure

Information geometry studies the dually flat structure of a manifold, highlighted by the generalized Pythagorean theorem. The present paper studies a class of Bregman divergences called the (ρ, τ)-divergence. A (ρ, τ)-divergence generates a dually flat structure in the manifold of positive measures, as well as in the manifold of positive-definite matrices. The class is composed of decomposable divergences, which are written as a sum of componentwise divergences. Conversely, a decomposable dually flat divergence is shown to be a (ρ, τ)-divergence. A (ρ, τ)-divergence is determined by two monotone scalar functions, ρ and τ. The class includes the KL-divergence and the α-, β- and (α, β)-divergences as special cases. The transformation between an affine parameter and its dual is easily calculated in the case of a decomposable divergence. Therefore, such a divergence is useful for obtaining the center of a cluster of points, which can be applied to classification and information retrieval in vision. For the manifold of positive-definite matrices, in addition to dual flatness and decomposability, we require invariance under linear transformations, in particular under orthogonal transformations. This opens a way to define a new class of divergences, called the (ρ, τ)-structure, in the manifold of positive-definite matrices.


Introduction
Information geometry, which originated from the invariant structure of a manifold of probability distributions, consists of a Riemannian metric and dually coupled affine connections with respect to the metric [1]. A manifold having a dually flat structure is particularly interesting and important. In such a manifold, there are two dually coupled affine coordinate systems and a canonical divergence, which is a Bregman divergence. The highlight is given by the generalized Pythagorean theorem and the projection theorem. Information geometry is useful not only for statistical inference, but also for machine learning, pattern recognition, optimization and even neural networks. It is also related to the statistical physics of the Tsallis q-entropy [2-4].
The present paper studies a general and unique class of decomposable divergence functions in R^n_+, the manifold of n-dimensional positive measures. These are the (ρ, τ)-divergences, introduced by Zhang [5,6] from the point of view of "representation duality". They are Bregman divergences generating a dually flat structure. The class includes the well-known Kullback-Leibler divergence, α-divergence, β-divergence and (α, β)-divergence [1,7-9] as special cases. The merit of a decomposable Bregman divergence is that the θ-η Legendre transformation is computationally tractable, where θ and η are the two affine coordinate systems coupled by the Legendre transformation. When one uses a dually flat divergence to define the center of a cluster of elements, the center is easily given by the arithmetic mean of the dual coordinates of the elements [10,11]. However, we then need to calculate its primal coordinates; this is the θ-η transformation. Hence, our new type of divergence has the advantage of easily calculating the θ-coordinates for clustering and related pattern-matching problems. The most general class of dually flat divergences, not necessarily decomposable, is further given in R^n_+; they are again of the (ρ, τ) type.

Positive-definite (PD) matrices appear in many engineering problems, such as convex programming, diffusion tensor analysis and multivariate statistical analysis [12-20]. The manifold, PD_n, of n × n PD matrices forms a cone, and its geometry is by itself an important subject of research. If we consider the submanifold consisting of only diagonal matrices, it is equivalent to the manifold of positive measures. Hence, PD matrices can be regarded as a generalization of positive measures. There are many studies on the geometry and divergences of the manifold of positive-definite matrices. We introduce a general class of dually flat divergences, the (ρ, τ)-divergence, and analyze the cases when a (ρ, τ)-divergence is invariant under the general linear transformations, Gl(n), and invariant under the orthogonal transformations, O(n). They not only include many well-known divergences of PD matrices, but also give new important divergences.
The present paper is organized as follows. Section 2 is a preliminary, giving a short introduction to the dually flat manifold and the Bregman divergence. It also defines the cluster center due to a divergence. Section 3 defines the (ρ, τ)-structure in R^n_+. It gives dually flat decomposable affine coordinates and the related canonical divergence (Bregman divergence). Section 4 is devoted to the (ρ, τ)-structure of the manifold, PD_n, of PD matrices. We first study the class of divergences that are invariant under O(n). We further study a decomposable divergence that is invariant under Gl(n); it coincides with the invariant divergence derived from zero-mean Gaussian distributions with PD covariance matrices. These include not only various known divergences, but also new remarkable ones. Section 5 discusses a general class of non-decomposable flat divergences and miscellaneous topics. Section 6 concludes the paper.

Dually Flat Manifold
A manifold is said to have a dually flat Riemannian structure when it has two affine coordinate systems θ = (θ^1, ..., θ^n) and η = (η_1, ..., η_n) (with respect to two flat affine connections), together with two convex functions, ψ(θ) and ϕ(η), such that the two coordinates are connected by the Legendre transformations

η = ∇ψ(θ), θ = ∇ϕ(η),

where ∇ is the gradient operator. The Riemannian metric is given by

g_ij(θ) = ∂²ψ(θ)/∂θ^i ∂θ^j, g^ij(η) = ∂²ϕ(η)/∂η_i ∂η_j,

in the respective coordinate systems. A curve that is linear in the θ-coordinates is called a θ-geodesic, and a curve linear in the η-coordinates is called an η-geodesic.
A dually flat manifold has a unique canonical divergence, which is the Bregman divergence defined by the convex functions,

D[P : Q] = ψ(θ_P) + ϕ(η_Q) − Σ_i θ^i_P η_{Qi},

where θ^i_P and η_{Qi} are the components of θ_P and η_Q, respectively. The Pythagorean and projection theorems hold in a dually flat manifold:

Pythagorean Theorem. Given three points, P, Q, R, when the η-geodesic connecting P and Q is orthogonal to the θ-geodesic connecting Q and R with respect to the Riemannian metric,

D[P : R] = D[P : Q] + D[Q : R].

Projection Theorem. Given a smooth submanifold, S, let P̂_S be the minimizer of the divergence from P to S,

P̂_S = argmin_{Q ∈ S} D[P : Q].

Then, P̂_S is the η-geodesic projection of P to S; that is, the η-geodesic connecting P and P̂_S is orthogonal to S.
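As a concrete illustration, the canonical divergence can be evaluated numerically. The following sketch is ours, not from the paper; it takes the convex function ψ(θ) = Σ(θ_i log θ_i − θ_i), whose Bregman divergence is the extended KL-divergence on positive measures.

```python
import numpy as np

# Bregman divergence D[P:Q] = psi(theta_P) - psi(theta_Q)
#                             - <grad psi(theta_Q), theta_P - theta_Q>
def bregman(psi, grad_psi, theta_p, theta_q):
    return psi(theta_p) - psi(theta_q) - grad_psi(theta_q) @ (theta_p - theta_q)

# Illustrative convex function: psi(theta) = sum(theta*log(theta) - theta).
psi = lambda t: np.sum(t * np.log(t) - t)
grad_psi = lambda t: np.log(t)

p = np.array([0.5, 1.0, 2.0])
q = np.array([1.0, 1.0, 1.0])
d = bregman(psi, grad_psi, p, q)
kl = np.sum(p * np.log(p / q) - p + q)  # extended KL-divergence
```

Here `d` and `kl` coincide, showing that the canonical divergence of this ψ is the extended KL-divergence.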
We have the duals of the above theorems, in which the θ- and η-geodesics are exchanged and the divergence D[P : Q] is replaced by its dual, D[Q : P].

Decomposable Divergence
A divergence, D[P : Q], is said to be decomposable when it is written as a sum of component-wise divergences,

D[P : Q] = Σ_i d(θ^i_P, θ^i_Q),

where θ^i_P and θ^i_Q are the components of θ_P and θ_Q and d(θ^i_P, θ^i_Q) is a scalar divergence function.
An f-divergence,

D_f[P : Q] = Σ_i q_i f(p_i / q_i),

is a typical example of a decomposable divergence in the manifold of probability distributions, where P = (p_i) and Q = (q_i) are two probability vectors with Σ p_i = Σ q_i = 1. A convex function, ψ(θ), is said to be decomposable when it is written as

ψ(θ) = Σ_i ψ̃(θ^i)

by using a scalar convex function, ψ̃. The Bregman divergence derived from a decomposable convex function is decomposable. When ψ(θ) is a decomposable convex function, its Legendre dual is also decomposable. The Legendre transformation is given componentwise as

η_i = ψ̃'(θ^i),

where ψ̃' is the derivative of ψ̃, so that it is computationally tractable. Its inverse transformation is also componentwise,

θ^i = φ̃'(η_i),

where φ̃ is the scalar Legendre dual of ψ̃.
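The componentwise Legendre transformation can be sketched as follows; the scalar convex function f(t) = e^t is an illustrative choice of ours, not one singled out by the paper.

```python
import numpy as np

# Componentwise Legendre transform of a decomposable convex function
# psi(theta) = sum_i f(theta_i); here f(t) = exp(t), so f'(t) = exp(t)
# and the inverse map is (f')^{-1} = log.
theta = np.array([-1.0, 0.0, 2.0])
eta = np.exp(theta)          # eta_i = f'(theta_i): forward transform
theta_back = np.log(eta)     # theta_i = (f')^{-1}(eta_i): inverse transform
```

Both directions are elementwise scalar maps, which is why the θ-η transformation stays tractable in high dimensions.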

Cluster Center
Consider a cluster of points P_1, ..., P_k. The center, R, of the cluster with respect to the divergence, D[P : Q], is defined by

R = argmin_Q Σ_i D[Q : P_i].

By differentiating Σ_i D[Q : P_i] with respect to θ (the θ-coordinates of Q), we have

Σ_i { ∇ψ(θ_Q) − η_{P_i} } = 0.

Hence, the cluster-center theorem due to Banerjee et al. [10] follows; see also [11]:

Cluster-Center Theorem. The η-coordinates η_R of the cluster center are given by the arithmetic average of the η-coordinates of the points in the cluster:

η_R = (1/k) Σ_i η_{P_i}.

When we need the θ-coordinates of the cluster center, they are given by the η-θ Legendre transformation. However, in many cases, this transformation is computationally heavy and intractable when the dimension of the manifold is large. The transformation is easy in the case of a decomposable divergence. This is the motivation for considering a general class of decomposable Bregman divergences.
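The cluster-center computation can be sketched numerically. The choice f(t) = t log t − t is ours for illustration; with it, η = log θ componentwise, and the center becomes the componentwise geometric mean of the points.

```python
import numpy as np

# Cluster center under a decomposable Bregman divergence (illustrative
# convex function f(t) = t*log(t) - t, so eta = f'(theta) = log(theta)).
points_theta = np.array([[1.0, 2.0],
                         [2.0, 1.0],
                         [4.0, 4.0]])
points_eta = np.log(points_theta)        # componentwise theta -> eta
eta_center = points_eta.mean(axis=0)     # cluster-center theorem
theta_center = np.exp(eta_center)        # tractable eta -> theta transform
```

For this f, averaging in η-coordinates and mapping back yields the geometric mean (1·2·4)^{1/3} = 2 in each component.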
(ρ, τ)-Structure in R^n_+

A measure (or a weight) of an atom x_i is given by

ξ_i = m(x_i) > 0,

and ξ = (ξ_1, ..., ξ_n) forms a coordinate system of R^n_+. Let ρ(ξ) and τ(ξ) be two monotonically increasing differentiable functions. We call ρ(ξ) and τ(ξ) the ρ- and τ-representations of the positive measure ξ. This is a generalization of the ±α-representations [1] and was introduced in [5] for a manifold of probability distributions. See also [6]. By using these functions, we construct new coordinate systems θ and η of R^n_+. They are given, for θ = (θ_i) and η = (η_i), by the componentwise relations

θ_i = ρ(ξ_i), η_i = τ(ξ_i).

They are called the ρ- and τ-representations of ξ ∈ R^n_+, respectively. We search for convex functions, ψ_{ρ,τ}(θ) and ϕ_{ρ,τ}(η), which are Legendre duals to each other, such that θ and η are two dually coupled affine coordinate systems.
We then define two decomposable convex functions of θ and η by

ψ_{ρ,τ}(θ) = Σ_i ∫ τ(ρ^{−1}(θ_i)) dθ_i, ϕ_{ρ,τ}(η) = Σ_i ∫ ρ(τ^{−1}(η_i)) dη_i.

They are Legendre duals to each other.

(ρ, τ )-Divergence
The (ρ, τ)-divergence between two points, ξ, ξ′ ∈ R^n_+, is defined by

D_{ρ,τ}[ξ : ξ′] = ψ_{ρ,τ}(θ) + ϕ_{ρ,τ}(η′) − Σ_i θ_i η′_i,

where θ and η′ are the ρ- and τ-representations of ξ and ξ′, respectively. The (ρ, τ)-divergence gives a dually flat structure having θ and η as affine and dual affine coordinate systems. This is originally due to Zhang [5] and is a generalization of our previous results concerning the q- and deformed exponential families [4]. The transformation between θ and η is simple in the (ρ, τ)-structure, because it can be done componentwise,

η_i = τ(ρ^{−1}(θ_i)), θ_i = ρ(τ^{−1}(η_i)).

The Riemannian metric is

g_ij(ξ) = ρ′(ξ_i) τ′(ξ_i) δ_ij,

and hence the manifold is Euclidean, because the Riemann-Christoffel curvature due to the Levi-Civita connection vanishes, too.
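A minimal numerical sketch of the (ρ, τ)-structure, with the illustrative (assumed) choice ρ = log and τ = identity; under this choice θ = log ξ, η = ξ, and the canonical divergence works out to the reverse extended KL-divergence.

```python
import numpy as np

# (rho, tau)-structure with rho = log, tau = identity (illustrative choice).
rho, rho_inv, tau = np.log, np.exp, lambda x: x

xi  = np.array([0.5, 1.0, 2.0])
xi2 = np.array([1.0, 2.0, 0.5])
theta, eta2 = rho(xi), tau(xi2)
# theta <-> eta transforms are componentwise: eta_i = tau(rho^{-1}(theta_i))
eta_from_theta = tau(rho_inv(theta))
# psi(theta) = sum exp(theta_i), phi(eta) = sum eta_i*log(eta_i) - eta_i
div = np.sum(np.exp(theta)) + np.sum(eta2 * np.log(eta2) - eta2) - theta @ eta2
```

Expanding the three terms shows div = Σ ξ′_i log(ξ′_i/ξ_i) − ξ′_i + ξ_i, i.e. the extended KL-divergence with its arguments reversed.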
Theorem 1. The (ρ, τ)-divergences form a unique class of divergences in R^n_+ that are dually flat and decomposable.

Various Examples
As a special case of the (ρ, τ)-divergence, we have the (α, β)-divergence obtained from the following power functions,

ρ(ξ) = ξ^α / α, τ(ξ) = ξ^β / β.

This was introduced by Cichocki, Cruces and Amari in [7,8]. The affine and dual affine coordinates are

θ_i = ξ_i^α / α, η_i = ξ_i^β / β,

and the convex functions are

ψ(θ) = (1/(β(α+β))) Σ_i ξ_i^{α+β}, ϕ(η) = (1/(α(α+β))) Σ_i ξ_i^{α+β}.

The induced (α, β)-divergence has a simple form,

D_{α,β}[ξ : ξ′] = (1/(αβ)) Σ_i { (α ξ_i^{α+β} + β ξ′_i^{α+β})/(α+β) − ξ_i^α ξ′_i^β },

for ξ, ξ′ ∈ R^n_+. It is defined similarly in the manifold, S_n, of probability distributions, but it is not a Bregman divergence in S_n. This is because the total mass constraint Σ ξ_i = 1 is not linear in the θ- or η-coordinates in general.
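The (α, β)-divergence can be checked numerically; the normalization below follows the Cichocki-Cruces-Amari convention as we read it (constant factors vary across papers). For α = β = 1 it reduces to half the squared Euclidean distance.

```python
import numpy as np

# (alpha, beta)-divergence of positive measures (normalization assumed).
def ab_div(x, y, alpha, beta):
    g = alpha + beta
    return np.sum(alpha * x**g + beta * y**g
                  - g * x**alpha * y**beta) / (alpha * beta * g)

x = np.array([0.5, 1.5, 2.0])
y = np.array([1.0, 1.0, 1.0])
```

Identity of indiscernibles and nonnegativity follow from the Bregman construction; the α = β = 1 case gives a quick sanity check.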
The α-divergence is a special case of the (α, β)-divergence, so that it is a (ρ, τ)-divergence. By putting β = 1 − α, we have

D_α[ξ : ξ′] = (1/(α(1−α))) Σ_i { α ξ_i + (1−α) ξ′_i − ξ_i^α ξ′_i^{1−α} }.

The β-divergence [19] is obtained from

ρ(ξ) = ξ, τ(ξ) = ξ^β / β.

It is written as

D_β[ξ : ξ′] = (1/(β(β+1))) Σ_i { ξ_i^{β+1} + β ξ′_i^{β+1} − (β+1) ξ_i ξ′_i^β }.

The β-divergence is special in the sense that it gives a dually flat structure even in S_n. This is because ρ(ξ) = ξ is linear in ξ.
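The β-divergence and its KL limit can be verified numerically (standard normalization assumed): as β → 0, the β-divergence approaches the extended KL-divergence, the intersection point of the α- and β-families.

```python
import numpy as np

# beta-divergence of positive measures (standard normalization assumed).
def beta_div(x, y, beta):
    return np.sum(x**(beta + 1) + beta * y**(beta + 1)
                  - (beta + 1) * x * y**beta) / (beta * (beta + 1))

x = np.array([0.5, 2.0])
y = np.array([1.0, 1.5])
kl = np.sum(x * np.log(x / y) - x + y)  # extended KL-divergence
```

Evaluating at a small β numerically confirms the limit.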
The classes of α-divergences and β-divergences intersect at the KL-divergence, and their duals are different in general.They are the only intersecting points of the two classes.
Zhang [5] already introduced a different (α, β)-divergence, which is not a (ρ, τ)-divergence in R^n_+ and differs from ours. We regret the confusion in the naming of the (α, β)-divergence.

Invariant and Decomposable Convex Function
Let P be a positive-definite matrix and ψ(P) be a convex function of P. Then, a Bregman divergence is defined between two positive-definite matrices, P and Q, by

D[P : Q] = ψ(P) − ψ(Q) − ⟨∇ψ(Q), P − Q⟩,

where ∇ is the gradient operator with respect to the matrix P = (P_ij), so that ∇ψ(P) is a matrix, and the inner product of two matrices A and B is defined by

⟨A, B⟩ = tr(A B^T),

where tr is the trace of a matrix. It induces a dually flat structure in the manifold of positive-definite matrices, where the affine coordinate system (θ-coordinates) is Θ = P and the dual affine coordinate system (η-coordinates) is

H = ∇ψ(P).

A convex function, ψ(P), is said to be invariant under the orthogonal group O(n) when

ψ(O P O^T) = ψ(P)

holds for any orthogonal transformation O, where O^T is the transpose of O. An invariant function is written as a symmetric function of the n eigenvalues λ_1, ..., λ_n of P. See Dhillon and Tropp [12]. When an invariant convex function of P is written, by using a convex function, f, of one variable, in the additive form

ψ(P) = Σ_i f(λ_i),

it is said to be decomposable. We then have

ψ(P) = tr f(P).
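A sketch of the matrix Bregman divergence for the decomposable invariant convex function ψ(P) = −log det P (our illustrative choice, i.e. f(λ) = −log λ on the eigenvalues), including a numerical check of O(n)-invariance. For this ψ, ∇ψ(P) = −P^{−1} and the divergence reduces to tr(Q^{−1}P) − log det(Q^{−1}P) − n.

```python
import numpy as np

# Bregman divergence of psi(P) = -log det P on PD matrices.
def logdet_div(P, Q):
    n = P.shape[0]
    QinvP = np.linalg.solve(Q, P)
    return np.trace(QinvP) - np.log(np.linalg.det(QinvP)) - n

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); P = A @ A.T + np.eye(3)   # random PD matrix
B = rng.standard_normal((3, 3)); Q = B @ B.T + np.eye(3)
O, _ = np.linalg.qr(rng.standard_normal((3, 3)))            # random orthogonal
```

The invariance D[OPO^T : OQO^T] = D[P : Q] holds because trace and determinant are invariant under orthogonal conjugation.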

Invariant, Flat and Decomposable Divergence
A divergence D[P : Q] is said to be invariant under O(n) when it satisfies

D[O P O^T : O Q O^T] = D[P : Q].

When it is derived from a decomposable convex function, ψ(P), it is invariant, flat and decomposable. Well-known examples of decomposable convex functions are ψ(P) = (1/2) tr P², ψ(P) = −log det P and ψ(P) = tr(P log P − P), which derive the Euclidean, Gaussian and von Neumann divergences, respectively. These divergences are special examples of (ρ, τ)-divergences. They are given, respectively, by

(ρ(ξ), τ(ξ)) = (ξ, ξ), (ξ, −1/ξ), (ξ, log ξ).

When ρ and τ are power functions, we have the (α, β)-structure in the manifold of positive-definite matrices.
By using the (α, β) power functions, we obtain the matrix (α, β)-structure, so that the (α, β)-divergence of matrices is

D_{α,β}[P : Q] = (1/(αβ)) tr { (α P^{α+β} + β Q^{α+β})/(α+β) − P^α Q^β }.

This is a Bregman divergence, where the affine coordinate system is Θ = P^α and its dual is H = P^β.
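A numerical sketch of the matrix (α, β)-divergence, transcribed from the scalar case via eigen-decomposition; the normalization here is our assumption, and constant factors may differ from the paper's.

```python
import numpy as np

def mat_power(P, a):
    """P**a for a symmetric PD matrix via its eigen-decomposition."""
    lam, U = np.linalg.eigh(P)
    return (U * lam**a) @ U.T

# Matrix (alpha, beta)-divergence transcribed from the scalar form.
def ab_matrix_div(P, Q, alpha=1.0, beta=2.0):
    g = alpha + beta
    return np.trace(alpha * mat_power(P, g) + beta * mat_power(Q, g)
                    - g * mat_power(P, alpha) @ mat_power(Q, beta)) / (alpha * beta * g)

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3)); P = A @ A.T + np.eye(3)
B = rng.standard_normal((3, 3)); Q = B @ B.T + np.eye(3)
```

Since tr f(P) is convex for convex scalar f, the Bregman construction guarantees nonnegativity, with equality only at P = Q.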

Invariance Under Gl(n)
We extend the concept of invariance under the orthogonal group to invariance under the general linear group, Gl(n), that is, the set of invertible matrices L with det L ≠ 0. This is a stronger condition. A divergence is said to be invariant under Gl(n) when

D[L P L^T : L Q L^T] = D[P : Q]

holds for any L ∈ Gl(n).
We identify the matrix P with the zero-mean Gaussian distribution

p(x, P) = c exp{ −(1/2) x^T P^{−1} x },

where c is the normalization constant. We know that, in the case of a manifold of probability distributions, an invariant divergence belongs to the class of f-divergences, where invariance means that the geometry does not change under a one-to-one mapping of x to y. Moreover, the only invariant flat divergence is the KL-divergence [22]. These facts lead to the following proposition.
Proposition. The invariant, flat and decomposable divergence under Gl(n) is the KL-divergence given by

D_{KL}[P : Q] = (1/2) { tr(Q^{−1} P) − log det(Q^{−1} P) − n }.
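The KL-divergence between zero-mean Gaussians and its Gl(n)-invariance can be checked numerically (the formula below is the standard Gaussian KL-divergence).

```python
import numpy as np

# KL-divergence between zero-mean Gaussians N(0, P) and N(0, Q).
def gaussian_kl(P, Q):
    n = P.shape[0]
    QinvP = np.linalg.solve(Q, P)
    return 0.5 * (np.trace(QinvP) - np.log(np.linalg.det(QinvP)) - n)

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3)); P = A @ A.T + np.eye(3)
B = rng.standard_normal((3, 3)); Q = B @ B.T + np.eye(3)
L = rng.standard_normal((3, 3))   # a generic L is invertible: L in Gl(n)
```

Invariance follows because (LQL^T)^{−1}(LPL^T) is similar to Q^{−1}P, so its trace and determinant are unchanged.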

Non-Decomposable Divergence
We have focused on flat and decomposable divergences.
There are many interesting non-decomposable divergences. We first discuss a general class of flat divergences in R^n_+ and then touch upon interesting flat and non-flat divergences in the manifold of positive-definite matrices.

General Flat Divergences in R^n_+
We can describe a general class of flat divergences in R^n_+ which are not necessarily decomposable. This was introduced in [23], which studies the conformal structure of general total Bregman divergences [11,13]. When R^n_+ is endowed with a dually flat structure, it has a θ-coordinate system given by

θ = ρ(ξ),

which is not necessarily a componentwise function. Any pair of an invertible map θ = ρ(ξ) and a convex function ψ(θ) defines a dually flat structure and, hence, a Bregman divergence in R^n_+. The dual coordinates η = τ(ξ) are given by

η = ∇ψ(θ),

so that we have

τ(ξ) = ∇ψ(ρ(ξ)).

This implies that a pair (ρ, τ) of coordinate systems can define dually coupled affine coordinates and, hence, a dually flat structure, when and only when η = τ{ρ^{−1}(θ)} is the gradient of a convex function. This is different from the case of a decomposable divergence, where any monotone pair of ρ(ξ) and τ(ξ) gives a dually flat structure.

γ-Divergence in PD_n
The γ-divergence was introduced by Fujisawa and Eguchi [24]. It gives a super-robust estimator. It is interesting to generalize it to PD_n. The resulting divergence is neither flat nor decomposable. It is a projective divergence in the sense that, for any c, c′ > 0,

D_γ[c P : c′ Q] = D_γ[P : Q].

Therefore, it can be defined on the submanifold of tr P = 1.
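The projective property can be verified numerically with a transcription of the scalar γ-divergence to PD_n, replacing sums over components with traces of matrix powers; this matrix form is our assumption for illustration, not a formula quoted verbatim from [24].

```python
import numpy as np

def mat_power(P, a):
    lam, U = np.linalg.eigh(P)   # symmetric PD eigen-decomposition
    return (U * lam**a) @ U.T

# gamma-divergence transcribed to PD matrices (assumed matrix form).
def gamma_div(P, Q, gamma=0.5):
    return (np.log(np.trace(mat_power(P, 1 + gamma))) / (gamma * (1 + gamma))
            - np.log(np.trace(P @ mat_power(Q, gamma))) / gamma
            + np.log(np.trace(mat_power(Q, 1 + gamma))) / (1 + gamma))

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3)); P = A @ A.T + np.eye(3)
B = rng.standard_normal((3, 3)); Q = B @ B.T + np.eye(3)
```

Scaling P by c multiplies the first and second traces by c^{1+γ} and c, respectively, and the logarithmic coefficients make these contributions cancel; the same holds for c′ on Q.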

Concluding Remarks
We have shown that the (ρ, τ)-divergence introduced by Zhang [5] gives the general dually flat decomposable structure of the manifold of positive measures. We then extended it to the manifold of positive-definite matrices, where the criterion of invariance under linear transformations (in particular, under orthogonal transformations) was added. Decomposability is useful from the computational point of view, because the θ-η transformation is tractable. This is the motivation for studying decomposable flat divergences.
When we treat the manifold of probability distributions, it is a submanifold of the manifold of positive measures in which the total sum of measures is restricted to one. This is a nonlinear constraint in the θ- or η-coordinates, so that the manifold is not flat, but curved in general. Hence, our arguments hold in this case only when at least one of the ρ and τ functions is linear. The U-divergence [21] and β-divergence [19] are such cases. However, for clustering, we can take the average of the η-coordinates of the member probability distributions in the larger manifold of positive measures and then project it to the manifold of probability distributions. This is called the exterior average, and the projection is simply a normalization of the result. Therefore, the (ρ, τ)-structure is useful in the case of probability distributions, too. The same situation holds in the case of positive-definite matrices.
Quantum information theory deals with positive-definite Hermitian matrices of trace one [25,26]. We would need to extend our discussion to the case of complex matrices. The trace-one constraint is not linear with respect to the θ- or η-coordinates, just as in the case of probability distributions. Many interesting divergence functions have been introduced in the manifold of positive-definite Hermitian matrices. It is an interesting future problem to apply our theory to quantum information theory.