Article

Divergences Induced by the Cumulant and Partition Functions of Exponential Families and Their Deformations Induced by Comparative Convexity

Frank Nielsen
Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2024, 26(3), 193; https://doi.org/10.3390/e26030193
Submission received: 20 December 2023 / Revised: 21 February 2024 / Accepted: 21 February 2024 / Published: 23 February 2024

Abstract
Exponential families are statistical models which are the workhorses in statistics, information theory, and machine learning, among others. An exponential family can either be normalized subtractively by its cumulant or free energy function, or equivalently normalized divisively by its partition function. Both the cumulant and partition functions are strictly convex and smooth functions inducing corresponding pairs of Bregman and Jensen divergences. It is well known that skewed Bhattacharyya distances between the probability densities of an exponential family amount to skewed Jensen divergences induced by the cumulant function between their corresponding natural parameters, and that in limit cases the sided Kullback–Leibler divergences amount to reverse-sided Bregman divergences. In this work, we first show that the α-divergences between non-normalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function. We then show how comparative convexity with respect to a pair of quasi-arithmetical means allows both convex functions and their arguments to be deformed, thereby defining dually flat spaces with corresponding divergences when ordinary convexity is preserved.

1. Introduction

In information geometry [1], any strictly convex and smooth function induces a dually flat space (DFS) with a canonical divergence which can be expressed in charts either as dual Bregman divergences [2] or equivalently as dual Fenchel–Young divergences [3]. For example, the cumulant function of an exponential family [4] (also called the free energy) generates a DFS, that is, an exponential family manifold [5] with the canonical divergence yielding the reverse Kullback–Leibler divergence. Another typical example of a strictly convex and smooth function generating a DFS is the negative entropy of a mixture family, that is, a mixture family manifold with the canonical divergence yielding the (forward) Kullback–Leibler divergence [3]. In addition, any strictly convex and smooth function induces a family of scaled skewed Jensen divergences [6,7], which in limit cases includes the sided forward and reverse Bregman divergences.
In Section 2, we present two equivalent approaches to normalizing an exponential family: first by its cumulant function, and second by its partition function. Because both the cumulant and partition functions are strictly convex and smooth, they induce corresponding families of scaled skewed Jensen divergences and Bregman divergences, with corresponding dually flat spaces and related statistical divergences.
In Section 3, we recall the well-known result that the statistical α-skewed Bhattacharyya distances between the probability densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the cumulant function between their natural parameters. In Section 4, we prove that the α-divergences [8] between the unnormalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function between their natural parameters (Proposition 5). More generally, we explain in Section 5 how to deform a convex function using comparative convexity [9]: when the ordinary convexity of the deformed convex function is preserved, we obtain new skewed Jensen divergences and Bregman divergences with corresponding dually flat spaces. Finally, Section 6 concludes this work with a discussion.

2. Dual Subtractive and Divisive Normalizations of Exponential Families

2.1. Natural Exponential Families

Let $(\mathcal{X}, \mathcal{A}, \mu)$ be a measure space [10], where $\mathcal{X}$ denotes the sample set (e.g., a finite alphabet, $\mathbb{N}$, $\mathbb{R}^d$, the space of positive-definite matrices $\mathrm{Sym}^{++}(d)$, etc.), $\mathcal{A}$ a σ-algebra on $\mathcal{X}$ (e.g., the power set $2^{\mathcal{X}}$, the Borel σ-algebra $\mathcal{B}(\mathcal{X})$, etc.), and $\mu$ a positive measure (e.g., the counting measure or the Lebesgue measure) on the measurable space $(\mathcal{X}, \mathcal{A})$.
A natural exponential family [4,11] (commonly abbreviated as NEF [12]) is a set of probability distributions $\mathcal{P} = \{P_\theta : \theta\in\Theta\}$ all dominated by $\mu$ such that their Radon–Nikodym densities $p_\theta(x) = \frac{\mathrm{d}P_\theta}{\mathrm{d}\mu}(x)$ can be expressed canonically as
$$p_\theta(x) \propto \tilde{p}_\theta(x) = \exp\left(\sum_{i=1}^{m} \theta_i x_i\right), \tag{1}$$
where $\theta$ is called the natural parameter and $x = (x_1, \ldots, x_m)$ denotes the linear sufficient statistic vector [11]. The order of the NEF [13] is $m$. When the parameter $\theta$ ranges in the full natural parameter space
$$\Theta = \left\{\theta : \int_{\mathcal{X}} \tilde{p}_\theta(x)\,\mathrm{d}\mu(x) < \infty\right\} \subseteq \mathbb{R}^m,$$
the family is called full. The NEF is said to be regular when Θ is topologically open.
The unnormalized positive density $\tilde{p}_\theta(x)$ is indicated with a tilde notation, and the corresponding normalized probability density is obtained as $p_\theta(x) = \frac{1}{Z(\theta)}\tilde{p}_\theta(x)$, where $Z(\theta) = \int \tilde{p}_\theta(x)\,\mathrm{d}\mu(x)$ is the Laplace transform of $\mu$ (the density normalizer). For example, the family of exponential distributions $\mathcal{E} = \{\lambda e^{-\lambda x} : \lambda > 0\}$ is an NEF with densities defined on the support $\mathcal{X} = \mathbb{R}_{\geq 0} = \{x\in\mathbb{R} : x\geq 0\}$, natural parameter $\theta = -\lambda$ in $\Theta = \mathbb{R}_{<0} = \{\theta\in\mathbb{R} : \theta < 0\}$, sufficient linear statistic $x$, and normalizer $Z(\theta) = -\frac{1}{\theta}$.
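The normalizer of this toy NEF can be checked numerically. The following is a minimal sketch (not part of the original paper) that approximates the Laplace transform $Z(\theta) = \int_0^\infty e^{\theta x}\,\mathrm{d}x$ for a negative natural parameter by a quadrature rule and compares it with the closed form $-1/\theta$; the grid bounds and parameter value are arbitrary illustration choices.

```python
# Minimal numerical sanity check (illustrative sketch, not from the paper):
# for the exponential NEF, ptilde_theta(x) = exp(theta * x) with theta = -lambda < 0,
# and Z(theta) = int_0^inf exp(theta * x) dx = -1/theta.
import numpy as np

def Z_numeric(theta, x_max=200.0, n=200_000):
    """Approximate Z(theta) by the trapezoidal rule on [0, x_max] (theta < 0)."""
    x = np.linspace(0.0, x_max, n)
    return np.trapz(np.exp(theta * x), x)

theta = -1.5                 # natural parameter (lambda = 1.5)
print(Z_numeric(theta))      # ~0.6667
print(-1.0 / theta)          # closed form: 2/3
```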

2.2. Exponential Families

More generally, exponential families include many well known distributions after reparameterization [4] of their ordinary parameter λ by θ ( λ ) . The general canonical form of the densities of an exponential family is
$$p_\lambda(x) \propto \tilde{p}_\lambda(x) = \exp\left(\langle\theta(\lambda), t(x)\rangle\right) h(x), \tag{2}$$
where $t(x) = (t_1(x), \ldots, t_m(x))$ is the sufficient statistic vector (such that $1, t_1(x), \ldots, t_m(x)$ are linearly independent), $h(x)$ is an auxiliary term used to define the base measure with respect to $\mu$, and $\langle\cdot,\cdot\rangle$ is an inner product (e.g., the scalar product of $\mathbb{R}^m$, the trace product of symmetric matrices, etc.). By defining a new measure $\nu$ such that $\frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x) = h(x)$, we may consider without loss of generality the densities $\bar{p}_\lambda(x) = \frac{\mathrm{d}P_\lambda}{\mathrm{d}\nu}(x)$ with $h(x) = 1$.
For example, the Bernoulli distributions, Gaussian or normal distributions, Gamma and Beta distributions, Poisson distributions, Rayleigh distributions, and Weibull distributions with prescribed shape parameter are just a few examples of exponential families with the inner product on $\mathbb{R}^m$ defined as the scalar product. The categorical distributions (i.e., discrete distributions on a finite alphabet sample space) form an exponential family as well [1]. Zero-centered Gaussian distributions and Wishart distributions are examples of exponential families parameterized by positive-definite matrices, with the inner product defined by the matrix trace product $\langle A, B\rangle = \operatorname{tr}(AB)$.
Exponential families abound in statistics and machine learning. Any two probability measures $Q$ and $R$ with densities $q$ and $r$ with respect to a dominating measure, say, $\mu = \frac{Q+R}{2}$, define an exponential family
$$\mathcal{P}_{Q,R} = \left\{p_\lambda(x) \propto q^\lambda(x)\, r^{1-\lambda}(x) : \lambda\in(0,1)\right\},$$
which is called the likelihood ratio exponential family [14], as the sufficient statistic is $t(x) = \log\frac{q(x)}{r(x)}$ (with auxiliary carrier term $h(x) = r(x)$), or the Bhattacharyya arc, as the cumulant function of $\mathcal{P}_{Q,R}$ is expressed as the negative of the skewed Bhattacharyya distances [7,15].
In machine learning, undirected graphical models [16] and energy-based models [17], including Markov random fields [18] and conditional random fields, are exponential families [19]. Exponential families are universal approximators of smooth densities [20].
From a theoretical standpoint, it is often enough to consider (without loss of generality) natural exponential families with densities expressed as in Equation (1). However, here we consider generic exponential families with the densities expressed in Equation (2) in order to report common examples encountered in practice, such as the multivariate Gaussian family [21].
When the natural parameter space $\Theta$ is not full but rather parameterized as $\theta = c(u)$ for $u\in U$ with $\dim(U) < m$ and a smooth function $c(u)$, the exponential family is called a curved exponential family [1]. For example, the special family of normal distributions $\{p_{\mu,\sigma^2=\mu^2} : \mu\in\mathbb{R}\}$ is a curved exponential family with $u = \mu$ and $c(u) = (u, u^2)$ [1].

2.3. Normalizations of Exponential Families

Recall that $\tilde{p}_\theta(x) = \exp(\langle\theta, t(x)\rangle)\, h(x)$ denotes the unnormalized density expressed using the natural parameter $\theta = \theta(\lambda)$. We can normalize $\tilde{p}_\theta(x)$ using either the partition function $Z(\theta)$ or equivalently using the cumulant function $F(\theta)$, as follows:
$$p_\theta(x) = \frac{\exp(\langle\theta, t(x)\rangle)}{Z(\theta)}\, h(x), \tag{3}$$
$$p_\theta(x) = \exp\left(\langle\theta, t(x)\rangle - F(\theta) + k(x)\right), \tag{4}$$
where $h(x) = \exp(k(x))$, $Z(\theta) = \int\tilde{p}_\theta(x)\,\mathrm{d}\mu(x)$, and $F(\theta) = \log Z(\theta) = \log\int\tilde{p}_\theta(x)\,\mathrm{d}\mu(x)$. Thus, the logarithm and exponential functions allow conversion to and from the dual normalizers $Z$ and $F$:
$$Z(\theta) = \exp(F(\theta)) \quad\Longleftrightarrow\quad F(\theta) = \log Z(\theta).$$
We may view Equation (3) as an exponential tilting [13] of density h ( x ) d μ ( x ) .
In the context of λ-deformed exponential families [22] which generalize exponential families, the function $Z(\theta)$ is called the divisive normalization factor (Equation (3)) and the function $F(\theta)$ is called the subtractive normalization factor (Equation (4)). Notice that $F(\theta)$ is called the cumulant function because when $X\sim p_\theta(x)$ is a random variable following a probability distribution of an exponential family, the function $F(\theta)$ appears in the cumulant generating function of $X$: $K_X(t) = \log E_X[e^{\langle t, X\rangle}] = F(\theta+t) - F(\theta)$. In statistical physics, the cumulant function is called the log-normalizer or log-partition function. Because $Z > 0$ and $F = \log Z$, we can deduce that $F \leq Z$, as $\log x \leq x$ for $x > 0$.
It is well known that the cumulant function F ( θ ) is a strictly convex function and that the partition function Z ( θ ) is strictly log-convex [11].
Proposition 1
([11]). The natural parameter space Θ of an exponential family is convex.
Proposition 2
([11]). The cumulant function F ( θ ) is strictly convex and the partition function Z ( θ ) is positive and strictly log-convex.
It can be shown that the cumulant and partition functions are smooth $C^\infty$ analytic functions [4]. A remarkable property is that strictly log-convex functions are also strictly convex.
Proposition 3
([23], Section 3.5). A strictly log-convex function Z : Θ R m R is strictly convex.
The converse of Proposition 3 is not necessarily true, however; certain convex functions are not log-convex, and as such the class of strictly log-convex functions is a proper subclass of strictly convex functions. For example, $\theta^2$ is convex but log-concave, as $(\log\theta^2)'' = -\frac{2}{\theta^2} < 0$ (Figure 1).
Remark 1.
Because Z = exp ( F ) is strictly convex (Proposition 3), F is exponentially convex.
Definition 1.
The cumulant function $F$ and partition function $Z$ of a regular exponential family are both strictly convex and smooth functions inducing a pair of dually flat spaces with corresponding Bregman divergences [2] $B_F$ (i.e., $B_{\log Z}$) and $B_Z$ (i.e., $B_{\exp F}$):
$$B_Z(\theta_1:\theta_2) = Z(\theta_1) - Z(\theta_2) - \langle\theta_1-\theta_2, \nabla Z(\theta_2)\rangle \geq 0,$$
$$B_{\log Z}(\theta_1:\theta_2) = \log\frac{Z(\theta_1)}{Z(\theta_2)} - \left\langle\theta_1-\theta_2, \frac{\nabla Z(\theta_2)}{Z(\theta_2)}\right\rangle \geq 0,$$
along with a pair of families of skewed Jensen divergences $J_{F,\alpha}$ and $J_{Z,\alpha}$:
$$J_{Z,\alpha}(\theta_1:\theta_2) = \alpha Z(\theta_1) + (1-\alpha) Z(\theta_2) - Z(\alpha\theta_1 + (1-\alpha)\theta_2) \geq 0,$$
$$J_{\log Z,\alpha}(\theta_1:\theta_2) = \log\frac{Z(\theta_1)^\alpha\, Z(\theta_2)^{1-\alpha}}{Z(\alpha\theta_1 + (1-\alpha)\theta_2)} \geq 0.$$
For a strictly convex function F ( θ ) , we define the symmetric Jensen divergence as follows:
$$J_F(\theta_1,\theta_2) = J_{F,\frac{1}{2}}(\theta_1:\theta_2) = \frac{F(\theta_1)+F(\theta_2)}{2} - F\!\left(\frac{\theta_1+\theta_2}{2}\right).$$
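To make these parameter divergences concrete, here is a minimal sketch (not part of the paper) of generic Bregman and skewed Jensen divergences for a smooth strictly convex generator with known gradient; the Poisson cumulant $F(\theta) = e^\theta$ and the parameter values are arbitrary illustration choices. It also illustrates numerically the limit behavior recalled below, namely that the rescaled skewed Jensen divergence tends to the reverse Bregman divergence as α → 1.

```python
# Minimal sketch (illustrative, not from the paper): Bregman and skewed Jensen
# divergences induced by a smooth strictly convex generator F with gradient gradF.
import numpy as np

def bregman(F, gradF, t1, t2):
    """B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, gradF(t2)> (scalar parameters here)."""
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def jensen_skewed(F, t1, t2, alpha):
    """J_{F,alpha}(t1 : t2) = alpha F(t1) + (1 - alpha) F(t2) - F(alpha t1 + (1 - alpha) t2)."""
    return alpha * F(t1) + (1 - alpha) * F(t2) - F(alpha * t1 + (1 - alpha) * t2)

# Example generator: the Poisson cumulant F(theta) = exp(theta), gradF = exp.
F, gradF = np.exp, np.exp
t1, t2 = 0.3, 1.1

print(bregman(F, gradF, t1, t2))                     # B_F(t1 : t2) > 0
alpha = 1.0 - 1e-6                                   # J_{F,alpha}/(alpha(1-alpha)) -> B_F(t2 : t1)
print(jensen_skewed(F, t1, t2, alpha) / (alpha * (1 - alpha)))
print(bregman(F, gradF, t2, t1))                     # reverse Bregman divergence
```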
Let $\mathcal{B}(\Theta)$ denote the set of real-valued strictly convex and differentiable functions defined on an open set $\Theta$, called Bregman generators. We may equivalently consider the set of strictly concave and differentiable functions $G(\theta)$ and let $F(\theta) = -G(\theta)$; see [24] (Equation (1)).
Remark 2.
The non-negativity of the Bregman divergences for the cumulant and partition functions provides criteria for checking the strict convexity or strict log-convexity of a $C^1$ function:
$$F(\theta)\ \text{is strictly convex} \;\Leftrightarrow\; \forall\theta_1\neq\theta_2,\ B_F(\theta_1:\theta_2) > 0 \;\Leftrightarrow\; \forall\theta_1\neq\theta_2,\ F(\theta_1) > F(\theta_2) + \langle\theta_1-\theta_2, \nabla F(\theta_2)\rangle,$$
and
$$Z(\theta)\ \text{is strictly log-convex} \;\Leftrightarrow\; \forall\theta_1\neq\theta_2,\ B_{\log Z}(\theta_1:\theta_2) > 0 \;\Leftrightarrow\; \forall\theta_1\neq\theta_2,\ \log Z(\theta_1) > \log Z(\theta_2) + \left\langle\theta_1-\theta_2, \frac{\nabla Z(\theta_2)}{Z(\theta_2)}\right\rangle.$$
The forward Bregman divergence $B_F(\theta_1:\theta_2)$ and reverse Bregman divergence $B_F(\theta_2:\theta_1)$ can be unified with the α-skewed Jensen divergences by rescaling $J_{F,\alpha}$ and allowing α to range in $\mathbb{R}$ [6,7]:
$$J^s_{F,\alpha}(\theta_1:\theta_2) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\, J_{F,\alpha}(\theta_1:\theta_2), & \alpha\in\mathbb{R}\setminus\{0,1\},\\ B_F(\theta_1:\theta_2), & \alpha = 0,\\ 4\, J_F(\theta_1,\theta_2), & \alpha = \frac{1}{2},\\ B^*_F(\theta_1:\theta_2) = B_F(\theta_2:\theta_1), & \alpha = 1, \end{cases}$$
where $B^*_F$ denotes the reverse Bregman divergence obtained by swapping the parameter order (reference duality [6]): $B^*_F(\theta_1:\theta_2) = B_F(\theta_2:\theta_1)$.
Remark 3.
Alternatively, we may rescale $J_{F,\alpha}$ by the factor $\kappa(\alpha) = \frac{1}{4\alpha(1-\alpha)}$, i.e., $J^{\bar{s}}_{F,\alpha}(\theta_1:\theta_2) = \kappa(\alpha)\, J_{F,\alpha}(\theta_1:\theta_2)$, such that $\kappa(\frac{1}{2}) = 1$ and $J^{\bar{s}}_{F,\frac{1}{2}}(\theta_1:\theta_2) = J_F(\theta_1,\theta_2)$.
Next, in Section 3 we first recall the connections between these Jensen and Bregman divergences, which are divergences between parameters, and the statistical divergence counterparts between probability densities. Then, in Section 4 we introduce the novel connections between these parameter divergences and α -divergences between unnormalized densities.

3. Divergences Related to the Cumulant Function

Consider the scaled α-skewed Bhattacharyya distances [7,15] between two probability densities $p(x)$ and $q(x)$:
$$D^s_{B,\alpha}(p:q) = -\frac{1}{\alpha(1-\alpha)}\log\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu, \quad\alpha\in\mathbb{R}\setminus\{0,1\}.$$
The scaled α-skewed Bhattacharyya distances can additionally be interpreted as Rényi divergences [25] scaled by $\frac{1}{\alpha}$: $D^s_{B,\alpha}(p:q) = \frac{1}{\alpha}\, D_{R,\alpha}(p:q)$, where the Rényi α-divergences are defined by
$$D_{R,\alpha}(p:q) = \frac{1}{\alpha-1}\log\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu.$$
The Bhattacharyya distance $D_B(p,q) = -\log\int\sqrt{p\,q}\,\mathrm{d}\mu$ corresponds to one-fourth of $D^s_{B,\frac{1}{2}}(p:q)$: $D_B(p,q) = \frac{1}{4} D^s_{B,\frac{1}{2}}(p:q)$. Because $D^s_{B,\alpha}$ tends to the Kullback–Leibler divergence $D_{\mathrm{KL}}$ when $\alpha\to 1$ and to the reverse Kullback–Leibler divergence $D^*_{\mathrm{KL}}$ when $\alpha\to 0$, we have
$$D^s_{B,\alpha}(p:q) = \begin{cases} -\frac{1}{\alpha(1-\alpha)}\log\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu, & \alpha\in\mathbb{R}\setminus\{0,1\},\\ D_{\mathrm{KL}}(p:q), & \alpha = 1,\\ 4\, D_B(p,q), & \alpha = \frac{1}{2},\\ D^*_{\mathrm{KL}}(p:q) = D_{\mathrm{KL}}(q:p), & \alpha = 0. \end{cases}$$
When both probability densities belong to the same exponential family $\mathcal{E} = \{p_\theta(x) : \theta\in\Theta\}$ with cumulant function $F(\theta)$, we have the following proposition.
Proposition 4
([7]). The scaled α-skewed Bhattacharyya distances between two probability densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family amount to the scaled α-skewed Jensen divergence between their natural parameters:
$$D^s_{B,\alpha}(p_{\theta_1}:p_{\theta_2}) = J^s_{F,\alpha}(\theta_1:\theta_2). \tag{10}$$
Proof. 
The proof follows by first considering the α-skewed Bhattacharyya similarity coefficient $\rho_\alpha(p:q) = \int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu$:
$$\rho_\alpha(p_{\theta_1}:p_{\theta_2}) = \int \exp\left(\langle\theta_1,x\rangle - F(\theta_1)\right)^\alpha \exp\left(\langle\theta_2,x\rangle - F(\theta_2)\right)^{1-\alpha}\mathrm{d}\mu = \int \exp\left(\langle\alpha\theta_1+(1-\alpha)\theta_2, x\rangle\right)\exp\left(-(\alpha F(\theta_1)+(1-\alpha)F(\theta_2))\right)\mathrm{d}\mu.$$
Multiplying the last equation by $\exp(F(\bar\theta))\exp(-F(\bar\theta)) = \exp(0) = 1$ with $\bar\theta = \alpha\theta_1+(1-\alpha)\theta_2$, we obtain
$$\rho_\alpha(p_{\theta_1}:p_{\theta_2}) = \exp\left(-(\alpha F(\theta_1)+(1-\alpha)F(\theta_2)) + F(\bar\theta)\right)\int\exp\left(\langle\bar\theta, x\rangle - F(\bar\theta)\right)\mathrm{d}\mu.$$
Because $\bar\theta\in\Theta$, we have $\int\exp(\langle\bar\theta, x\rangle - F(\bar\theta))\,\mathrm{d}\mu = 1$; therefore, we obtain
$$\rho_\alpha(p_{\theta_1}:p_{\theta_2}) = \exp\left(-J_{F,\alpha}(\theta_1:\theta_2)\right).$$
Taking $-\frac{1}{\alpha(1-\alpha)}\log$ on both sides yields $D^s_{B,\alpha}(p_{\theta_1}:p_{\theta_2}) = J^s_{F,\alpha}(\theta_1:\theta_2)$. □
For practitioners in machine learning, it is well known that the Kullback–Leibler divergence between two probability densities p θ 1 and p θ 2 of an exponential family amounts to a Bregman divergence for the cumulant generator on a swapped parameter order (e.g., [26,27]):
$$D_{\mathrm{KL}}(p_{\theta_1}:p_{\theta_2}) = B_F(\theta_2:\theta_1).$$
This is a particular instance of Equation (10) obtained for α = 1 :
$$D^s_{B,1}(p_{\theta_1}:p_{\theta_2}) = J^s_{F,1}(\theta_1:\theta_2).$$
This formula has been further generalized in [28] by considering truncations of exponential family densities. Let $\mathcal{X}_1\subseteq\mathcal{X}_2\subseteq\mathcal{X}$, and let $\mathcal{E}_1 = \{1_{\mathcal{X}_1}(x)\, p_\theta(x)\}$ and $\mathcal{E}_2 = \{1_{\mathcal{X}_2}(x)\, q_\theta(x)\}$ be two truncated families of $\mathcal{X}$ with corresponding cumulant functions
$$F_1(\theta) = \log\int_{\mathcal{X}_1}\exp(\langle t(x),\theta\rangle)\,\mathrm{d}\mu$$
and
$$F_2(\theta) = \log\int_{\mathcal{X}_2}\exp(\langle t(x),\theta\rangle)\,\mathrm{d}\mu \geq F_1(\theta).$$
Then, we have
$$D_{\mathrm{KL}}(p_{\theta_1}:q_{\theta_2}) = B_{F_2,F_1}(\theta_2:\theta_1) = F_2(\theta_2) - F_1(\theta_1) - \langle\theta_2-\theta_1, \nabla F_1(\theta_1)\rangle.$$
Truncated exponential families are normalized exponential families which may not be regular [29], i.e., the parameter space Θ may not be open.
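As a quick numerical illustration of Proposition 4 (a sketch that is not part of the paper), the following snippet compares the scaled α-skewed Bhattacharyya distance between two exponential distributions, computed by numerical integration, with the scaled α-skewed Jensen divergence induced by the cumulant $F(\theta) = -\log\theta$ of that family (the natural parameter convention $\theta = \lambda$, $t(x) = -x$ of Example 1 below is assumed; the rate and skewness values are arbitrary).

```python
# Minimal sketch (illustrative, not from the paper): Proposition 4 on the family of
# exponential distributions p_lambda(x) = lambda * exp(-lambda x) on [0, inf).
import numpy as np

def bhattacharyya_scaled(l1, l2, alpha, x_max=100.0, n=400_000):
    """Scaled alpha-skewed Bhattacharyya distance computed by numerical integration."""
    x = np.linspace(0.0, x_max, n)
    p1 = l1 * np.exp(-l1 * x)
    p2 = l2 * np.exp(-l2 * x)
    rho = np.trapz(p1**alpha * p2**(1.0 - alpha), x)   # Bhattacharyya coefficient
    return -np.log(rho) / (alpha * (1.0 - alpha))

def jensen_scaled_cumulant(l1, l2, alpha):
    """Scaled alpha-skewed Jensen divergence for the cumulant F(theta) = -log(theta)."""
    F = lambda t: -np.log(t)
    J = alpha * F(l1) + (1.0 - alpha) * F(l2) - F(alpha * l1 + (1.0 - alpha) * l2)
    return J / (alpha * (1.0 - alpha))

l1, l2, alpha = 0.8, 2.5, 0.3
print(bhattacharyya_scaled(l1, l2, alpha))    # numerical integration
print(jensen_scaled_cumulant(l1, l2, alpha))  # closed form; agrees up to quadrature error
```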

4. Divergences Related to the Partition Function

Certain exponential families have intractable cumulant/partition functions (e.g., exponential families with sufficient statistics $t(x) = (x, x^2, \ldots, x^m)$ for high degrees $m$ [20]) or cumulant/partition functions which require exponential time to compute [30] (e.g., graphical models [16], high-dimensional grid sample spaces, energy-based models [17] in deep learning, etc.). In such cases, the maximum likelihood estimator (MLE) cannot be used to infer the natural parameter of exponential densities. Many alternative methods have been proposed to handle such exponential families with intractable partition functions, e.g., score matching [31] or divergence-based inference [32,33]. Thus, it is important to consider dissimilarities between non-normalized statistical models.
The squared Hellinger distance [1] between two positive, potentially unnormalized densities $\tilde{p}$ and $\tilde{q}$ is defined by
$$D^2_H(\tilde{p},\tilde{q}) = \frac{1}{2}\int\left(\sqrt{\tilde{p}} - \sqrt{\tilde{q}}\right)^2\mathrm{d}\mu = \frac{\int\tilde{p}\,\mathrm{d}\mu + \int\tilde{q}\,\mathrm{d}\mu}{2} - \int\sqrt{\tilde{p}\,\tilde{q}}\,\mathrm{d}\mu.$$
Notice that the Hellinger divergence can be interpreted as the integral of the difference between the arithmetical mean $A(\tilde{p},\tilde{q}) = \frac{\tilde{p}+\tilde{q}}{2}$ and the geometrical mean $G(\tilde{p},\tilde{q}) = \sqrt{\tilde{p}\,\tilde{q}}$ of the densities: $D^2_H(\tilde{p},\tilde{q}) = \int\left(A(\tilde{p},\tilde{q}) - G(\tilde{p},\tilde{q})\right)\mathrm{d}\mu$. This further proves that $D_H(\tilde{p},\tilde{q})\geq 0$, as $A\geq G$. The Hellinger distance $D_H$ satisfies the metric axioms of distances.
When considering unnormalized densities $\tilde{p}_{\theta_1} = \exp(\langle t(x),\theta_1\rangle)$ and $\tilde{p}_{\theta_2} = \exp(\langle t(x),\theta_2\rangle)$ of an exponential family $\mathcal{E}$ with partition function $Z(\theta) = \int\tilde{p}_\theta\,\mathrm{d}\mu$, we obtain
$$D^2_H(\tilde{p}_{\theta_1},\tilde{p}_{\theta_2}) = \frac{Z(\theta_1)+Z(\theta_2)}{2} - Z\!\left(\frac{\theta_1+\theta_2}{2}\right) = J_Z(\theta_1,\theta_2),$$
as $\sqrt{\tilde{p}_{\theta_1}\tilde{p}_{\theta_2}} = \tilde{p}_{\frac{\theta_1+\theta_2}{2}}$.
The Kullback–Leibler divergence [1] as extended to two positive densities $\tilde{p}$ and $\tilde{q}$ is defined by
$$D_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = \int\left(\tilde{p}\log\frac{\tilde{p}}{\tilde{q}} + \tilde{q} - \tilde{p}\right)\mathrm{d}\mu.$$
When considering unnormalized densities $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ of $\mathcal{E}$, we obtain
$$D_{\mathrm{KL}}(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = \int\left(\tilde{p}_{\theta_1}(x)\log\frac{\tilde{p}_{\theta_1}(x)}{\tilde{p}_{\theta_2}(x)} + \tilde{p}_{\theta_2}(x) - \tilde{p}_{\theta_1}(x)\right)\mathrm{d}\mu(x)$$
$$= \int\left(e^{\langle t(x),\theta_1\rangle}\,\langle\theta_1-\theta_2, t(x)\rangle + e^{\langle t(x),\theta_2\rangle} - e^{\langle t(x),\theta_1\rangle}\right)\mathrm{d}\mu(x)$$
$$= \left\langle\int t(x)\, e^{\langle t(x),\theta_1\rangle}\,\mathrm{d}\mu(x),\ \theta_1-\theta_2\right\rangle + Z(\theta_2) - Z(\theta_1)$$
$$= \langle\theta_1-\theta_2, \nabla Z(\theta_1)\rangle + Z(\theta_2) - Z(\theta_1) = B_Z(\theta_2:\theta_1),$$
as $\nabla Z(\theta) = \int t(x)\,\tilde{p}_\theta(x)\,\mathrm{d}\mu(x)$. Let $D^*_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = D_{\mathrm{KL}}(\tilde{q}:\tilde{p})$ denote the reverse KLD.
More generally, the family of α-divergences [1] between the unnormalized densities $\tilde{p}$ and $\tilde{q}$ is defined for $\alpha\in\mathbb{R}$ by
$$D_\alpha(\tilde{p}:\tilde{q}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\alpha\tilde{p} + (1-\alpha)\tilde{q} - \tilde{p}^\alpha\tilde{q}^{1-\alpha}\right)\mathrm{d}\mu, & \alpha\notin\{0,1\},\\ D^*_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = D_{\mathrm{KL}}(\tilde{q}:\tilde{p}), & \alpha = 0,\\ 4\, D^2_H(\tilde{p},\tilde{q}), & \alpha = \frac{1}{2},\\ D_{\mathrm{KL}}(\tilde{p}:\tilde{q}), & \alpha = 1. \end{cases}$$
We have $D^*_\alpha(\tilde{p}:\tilde{q}) = D_\alpha(\tilde{q}:\tilde{p}) = D_{1-\alpha}(\tilde{p}:\tilde{q})$, and the α-divergences are homogeneous divergences of degree 1: for all $\lambda > 0$, we have $D_\alpha(\lambda\tilde{p}:\lambda\tilde{q}) = \lambda\, D_\alpha(\tilde{p}:\tilde{q})$. Moreover, because $\alpha\tilde{p} + (1-\alpha)\tilde{q} - \tilde{p}^\alpha\tilde{q}^{1-\alpha}$ can be expressed as the difference of the weighted arithmetic mean minus the weighted geometric mean, $A(\tilde{p},\tilde{q};\alpha,1-\alpha) - G(\tilde{p},\tilde{q};\alpha,1-\alpha)$, it follows from the arithmetic–geometric mean inequality that $D_\alpha(\tilde{p}:\tilde{q})\geq 0$.
When considering unnormalized densities $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ of $\mathcal{E}$, we obtain
$$D_\alpha(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\, J_{Z,\alpha}(\theta_1:\theta_2), & \alpha\notin\{0,1\},\\ B_Z(\theta_1:\theta_2), & \alpha = 0,\\ 4\, J_Z(\theta_1,\theta_2), & \alpha = \frac{1}{2},\\ B^*_Z(\theta_1:\theta_2) = B_Z(\theta_2:\theta_1), & \alpha = 1. \end{cases}$$
Proposition 5.
The α-divergences between the unnormalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function between their natural parameters:
$$D_\alpha(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = J^s_{Z,\alpha}(\theta_1:\theta_2).$$
When $\alpha\in\{0,1\}$, the oriented Kullback–Leibler divergences between unnormalized exponential family densities amount to reverse Bregman divergences induced by the partition function between their natural parameters:
$$D_{\mathrm{KL}}(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = B_Z(\theta_2:\theta_1).$$
Proof. 
For $\alpha\notin\{0,1\}$, consider
$$D_\alpha(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = \frac{1}{\alpha(1-\alpha)}\int\left(\alpha\tilde{p}_{\theta_1} + (1-\alpha)\tilde{p}_{\theta_2} - \tilde{p}_{\theta_1}^\alpha\tilde{p}_{\theta_2}^{1-\alpha}\right)\mathrm{d}\mu.$$
Here, we have $\int\alpha\tilde{p}_{\theta_1}\,\mathrm{d}\mu = \alpha Z(\theta_1)$, $\int(1-\alpha)\tilde{p}_{\theta_2}\,\mathrm{d}\mu = (1-\alpha) Z(\theta_2)$, and $\int\tilde{p}_{\theta_1}^\alpha\tilde{p}_{\theta_2}^{1-\alpha}\,\mathrm{d}\mu = \int\tilde{p}_{\alpha\theta_1+(1-\alpha)\theta_2}\,\mathrm{d}\mu = Z(\alpha\theta_1+(1-\alpha)\theta_2)$. It follows that
$$D_\alpha(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = \frac{1}{\alpha(1-\alpha)}\, J_{Z,\alpha}(\theta_1:\theta_2) = J^s_{Z,\alpha}(\theta_1:\theta_2).\qquad\square$$
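The following minimal sketch (not part of the paper) checks Proposition 5 numerically on unnormalized exponential densities $\tilde{p}_\theta(x) = e^{-\theta x}$ on $[0,\infty)$, for which $Z(\theta) = 1/\theta$; the parameter and skewness values are arbitrary illustration choices.

```python
# Minimal sketch (illustrative, not from the paper): alpha-divergence between
# unnormalized densities vs. scaled skewed Jensen divergence of the partition function.
import numpy as np

def alpha_div_unnormalized(t1, t2, alpha, x_max=200.0, n=400_000):
    """D_alpha between ptilde_t1(x) = exp(-t1 x) and ptilde_t2(x) = exp(-t2 x) on [0, inf)."""
    x = np.linspace(0.0, x_max, n)
    p, q = np.exp(-t1 * x), np.exp(-t2 * x)
    integrand = alpha * p + (1.0 - alpha) * q - p**alpha * q**(1.0 - alpha)
    return np.trapz(integrand, x) / (alpha * (1.0 - alpha))

def jensen_scaled_partition(t1, t2, alpha):
    """J^s_{Z,alpha} for the partition function Z(theta) = 1/theta."""
    Z = lambda t: 1.0 / t
    J = alpha * Z(t1) + (1.0 - alpha) * Z(t2) - Z(alpha * t1 + (1.0 - alpha) * t2)
    return J / (alpha * (1.0 - alpha))

t1, t2, alpha = 0.8, 2.5, 0.3
print(alpha_div_unnormalized(t1, t2, alpha))
print(jensen_scaled_partition(t1, t2, alpha))   # the two values agree
```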
Notice that the KLD extended to unnormalized densities can be written as a generalized relative entropy, i.e., it can be obtained as the difference of the extended cross-entropy minus the extended entropy (self cross-entropy):
$$D_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = H^\times(\tilde{p}:\tilde{q}) - H(\tilde{p}) = \int\left(\tilde{p}\log\frac{\tilde{p}}{\tilde{q}} + \tilde{q} - \tilde{p}\right)\mathrm{d}\mu,$$
with
$$H^\times(\tilde{p}:\tilde{q}) = \int\left(\tilde{p}(x)\log\frac{1}{\tilde{q}(x)} + \tilde{q}(x)\right)\mathrm{d}\mu(x) - 1$$
and
$$H(\tilde{p}) = H^\times(\tilde{p}:\tilde{p}) = \int\left(\tilde{p}(x)\log\frac{1}{\tilde{p}(x)} + \tilde{p}(x)\right)\mathrm{d}\mu(x) - 1.$$
Remark 4.
In general, we can consider two unnormalized positive densities $\tilde{p}(x)$ and $\tilde{q}(x)$. Let $p(x) = \frac{\tilde{p}(x)}{Z_p}$ and $q(x) = \frac{\tilde{q}(x)}{Z_q}$ denote their corresponding normalized densities (with normalizing factors $Z_p = \int\tilde{p}\,\mathrm{d}\mu$ and $Z_q = \int\tilde{q}\,\mathrm{d}\mu$); then, the KLD between $\tilde{p}$ and $\tilde{q}$ can be expressed using the KLD between their normalized densities and their normalizing factors, as follows:
$$D_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = Z_p\, D_{\mathrm{KL}}(p:q) + Z_p\log\frac{Z_p}{Z_q} + Z_q - Z_p. \tag{17}$$
Similarly, we have
$$H^\times(\tilde{p}:\tilde{q}) = Z_p\, H^\times(p:q) - Z_p\log Z_q + Z_q - 1,$$
$$H(\tilde{p}) = Z_p\, H(p) - Z_p\log Z_p + Z_p - 1,$$
and $D_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = H^\times(\tilde{p}:\tilde{q}) - H(\tilde{p})$.
Notice that Equation (17) allows us to derive the following identity between $B_Z$ and $B_F$:
$$B_Z(\theta_2:\theta_1) = Z(\theta_1)\, B_F(\theta_2:\theta_1) + Z(\theta_1)\log\frac{Z(\theta_1)}{Z(\theta_2)} + Z(\theta_2) - Z(\theta_1)$$
$$= \exp(F(\theta_1))\, B_F(\theta_2:\theta_1) + \exp(F(\theta_1))\left(F(\theta_1) - F(\theta_2)\right) + \exp(F(\theta_2)) - \exp(F(\theta_1)).$$
Let $D_{\mathrm{skl}}(a:b) = a\log\frac{a}{b} + b - a$ be the scalar KLD for $a > 0$ and $b > 0$. Then, we can rewrite Equation (17) as
$$D_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = Z_p\, D_{\mathrm{KL}}(p:q) + D_{\mathrm{skl}}(Z_p:Z_q),$$
and we have
$$B_Z(\theta_2:\theta_1) = Z(\theta_1)\, B_F(\theta_2:\theta_1) + D_{\mathrm{skl}}(Z(\theta_1):Z(\theta_2)).$$
In addition, the KLD between the unnormalized densities $\tilde{p}$ and $\tilde{q}$ with support $\mathcal{X}$ can be written as a definite integral of a scalar Bregman divergence:
$$D_{\mathrm{KL}}(\tilde{p}:\tilde{q}) = \int_{\mathcal{X}} D_{\mathrm{skl}}(\tilde{p}(x):\tilde{q}(x))\,\mathrm{d}\mu(x) = \int_{\mathcal{X}} B_{f_{\mathrm{skl}}}(\tilde{p}(x):\tilde{q}(x))\,\mathrm{d}\mu(x),$$
where $f_{\mathrm{skl}}(x) = x\log x - x$. Because $B_{f_{\mathrm{skl}}}(a:b)\geq 0$ for all $a > 0$ and $b > 0$, we can deduce that $D_{\mathrm{KL}}(\tilde{p}:\tilde{q})\geq 0$, with equality iff $\tilde{p}(x) = \tilde{q}(x)$ μ-almost everywhere.
Notice that $B_Z(\theta_2:\theta_1) = Z(\theta_1)\, B_F(\theta_2:\theta_1) + D_{\mathrm{skl}}(Z(\theta_1):Z(\theta_2))$ can be interpreted as the sum of two divergences, namely a conformal Bregman divergence plus a scalar Bregman divergence.
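This decomposition can be verified numerically; the following minimal sketch (not part of the paper) uses the univariate generator $F(\theta) = e^\theta$ with $Z = \exp(F)$ as an arbitrary illustration.

```python
# Minimal sketch (illustrative, not from the paper): check that
# B_Z(t2 : t1) = Z(t1) * B_F(t2 : t1) + D_skl(Z(t1) : Z(t2)) when Z = exp(F).
import numpy as np

def D_skl(a, b):
    """Scalar KLD D_skl(a : b) = a log(a/b) + b - a."""
    return a * np.log(a / b) + b - a

def bregman(F, gradF, t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

F, gradF = np.exp, np.exp                 # cumulant-like generator F(theta) = exp(theta)
Z = lambda t: np.exp(F(t))                # partition-like generator Z = exp(F)
gradZ = lambda t: np.exp(F(t)) * gradF(t)

t1, t2 = 0.2, 1.3
lhs = bregman(Z, gradZ, t2, t1)
rhs = Z(t1) * bregman(F, gradF, t2, t1) + D_skl(Z(t1), Z(t2))
print(lhs, rhs)                           # the two values coincide up to rounding
```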
Remark 5.
Consider the KLD between the normalized density $p_{\theta_1}$ and the unnormalized density $\tilde{p}_{\theta_2}$ of the same exponential family. In this case, we have
$$D_{\mathrm{KL}}(p_{\theta_1}:\tilde{p}_{\theta_2}) = B_F(\theta_2:\theta_1) - \log Z(\theta_2) + Z(\theta_2) - 1 = Z(\theta_2) - 1 - F(\theta_1) - \langle\theta_2-\theta_1, \nabla F(\theta_1)\rangle = B_{Z-1,F}(\theta_2:\theta_1).$$
The divergence $B_{Z-1,F}$ is a dual Bregman pseudo-divergence [28]:
$$B_{F_1,F_2}(\theta_1:\theta_2) = F_1(\theta_1) - F_2(\theta_2) - \langle\theta_1-\theta_2, \nabla F_2(\theta_2)\rangle,$$
for $F_1$ and $F_2$ two strictly convex and smooth functions such that $F_1\geq F_2$. Indeed, we can check that the generators $F_1(\theta) = Z(\theta) - 1$ and $F_2(\theta) = F(\theta)$ are both Bregman generators; then, we have $F_1(\theta)\geq F_2(\theta)$, as $e^x\geq x+1$ for all $x$ (with equality when $x = 0$), i.e., $Z(\theta) - 1\geq F(\theta)$.
More generally, the α-divergences between $p_{\theta_1}$ and $\tilde{p}_{\theta_2}$ can be written as
$$D_\alpha(p_{\theta_1}:\tilde{p}_{\theta_2}) = \frac{1}{\alpha(1-\alpha)}\left(\alpha + (1-\alpha)\, Z(\theta_2) - \frac{Z(\alpha\theta_1+(1-\alpha)\theta_2)}{Z(\theta_1)^\alpha}\right),$$
with the (signed) α-skewed Bhattacharyya distances provided by
$$D_{B,\alpha}(p_{\theta_1}:\tilde{p}_{\theta_2}) = \alpha\log Z(\theta_1) - \log Z(\alpha\theta_1+(1-\alpha)\theta_2).$$
Let us illustrate Proposition 5 with some examples.
Example 1.
Consider the family of exponential distributions $\mathcal{E} = \{p_\lambda(x) = 1_{x\geq 0}\,\lambda\exp(-\lambda x) : \lambda > 0\}$, an exponential family with natural parameter $\theta = \lambda$, parameter space $\Theta = \mathbb{R}_{>0}$, and sufficient statistic $t(x) = -x$. The partition function is $Z(\theta) = \frac{1}{\theta}$, with $Z'(\theta) = -\frac{1}{\theta^2}$ and $Z''(\theta) = \frac{2}{\theta^3} > 0$, while the cumulant function is $F(\theta) = \log Z(\theta) = -\log\theta$, with moment parameter $\eta = E_{p_\lambda}[t(x)] = F'(\theta) = -\frac{1}{\theta}$. The α-divergences between two unnormalized exponential distributions are
$$D_\alpha(\tilde{p}_{\lambda_1}:\tilde{p}_{\lambda_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\, J_{Z,\alpha}(\theta_1:\theta_2) = \frac{(\lambda_1-\lambda_2)^2}{\alpha\lambda_1^2\lambda_2 + (1-\alpha)\lambda_1\lambda_2^2}, & \alpha\notin\{0,1\},\\ D_{\mathrm{KL}}(\tilde{p}_{\lambda_2}:\tilde{p}_{\lambda_1}) = B_Z(\theta_1:\theta_2) = \frac{(\lambda_1-\lambda_2)^2}{\lambda_1\lambda_2^2}, & \alpha = 0,\\ 4\, J_Z(\theta_1,\theta_2) = \frac{2(\lambda_1-\lambda_2)^2}{\lambda_1\lambda_2^2 + \lambda_1^2\lambda_2}, & \alpha = \frac{1}{2},\\ D_{\mathrm{KL}}(\tilde{p}_{\lambda_1}:\tilde{p}_{\lambda_2}) = B_Z(\theta_2:\theta_1) = \frac{(\lambda_1-\lambda_2)^2}{\lambda_2\lambda_1^2}, & \alpha = 1. \end{cases}$$
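A minimal sketch (not part of the paper) checking these closed forms against the generic scaled skewed Jensen divergence induced by $Z(\theta) = 1/\theta$; the rate parameters are arbitrary illustration choices.

```python
# Minimal sketch (illustrative, not from the paper): closed forms of Example 1.
import numpy as np

def jensen_scaled_Z(l1, l2, alpha):
    """J^s_{Z,alpha} computed directly from the partition function Z(theta) = 1/theta."""
    Z = lambda t: 1.0 / t
    J = alpha * Z(l1) + (1.0 - alpha) * Z(l2) - Z(alpha * l1 + (1.0 - alpha) * l2)
    return J / (alpha * (1.0 - alpha))

def closed_form(l1, l2, alpha):
    """Closed form of Example 1 for alpha not in {0, 1}."""
    return (l1 - l2) ** 2 / (alpha * l1**2 * l2 + (1.0 - alpha) * l1 * l2**2)

l1, l2 = 1.5, 4.0
for alpha in (0.1, 0.3, 0.5, 0.9):
    print(alpha, jensen_scaled_Z(l1, l2, alpha), closed_form(l1, l2, alpha))
```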
Example 2.
Consider the family of univariate centered normal distributions with $\tilde{p}_{\sigma^2}(x) \propto \exp\left(-\frac{x^2}{2\sigma^2}\right)$ and partition function $Z(\sigma^2) = \sqrt{2\pi\sigma^2}$, such that $p_{\sigma^2}(x) = \frac{1}{Z(\sigma^2)}\tilde{p}_{\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right)$. Here, we have natural parameter $\theta = \frac{1}{\sigma^2}\in\Theta = \mathbb{R}_{>0}$ and sufficient statistic $t(x) = -\frac{x^2}{2}$. The partition function expressed with the natural parameter is $Z(\theta) = \sqrt{\frac{2\pi}{\theta}}$, with $Z'(\theta) = -\sqrt{\frac{\pi}{2}}\,\theta^{-\frac{3}{2}}$ and $Z''(\theta) = \frac{3}{2}\sqrt{\frac{\pi}{2}}\,\theta^{-\frac{5}{2}} > 0$ (strictly convex on $\Theta$). The unnormalized KLD between $\tilde{p}_{\sigma_1^2}$ and $\tilde{p}_{\sigma_2^2}$ is
$$D_{\mathrm{KL}}(\tilde{p}_{\sigma_1^2}:\tilde{p}_{\sigma_2^2}) = B_Z(\theta_2:\theta_1) = \sqrt{\frac{\pi}{2}}\left(2\sigma_2 - 3\sigma_1 + \frac{\sigma_1^3}{\sigma_2^2}\right).$$
We can check that $D_{\mathrm{KL}}(\tilde{p}_{\sigma^2}:\tilde{p}_{\sigma^2}) = 0$.
For the Hellinger divergence, we have
$$D^2_H(\tilde{p}_{\sigma_1^2},\tilde{p}_{\sigma_2^2}) = J_Z(\theta_1,\theta_2) = \sqrt{\frac{\pi}{2}}\,(\sigma_1+\sigma_2) - \frac{2\sqrt{\pi}\,\sigma_1\sigma_2}{\sqrt{\sigma_1^2+\sigma_2^2}},$$
and we can check that $D_H(\tilde{p}_{\sigma^2},\tilde{p}_{\sigma^2}) = 0$.
Consider now the $d$-variate case of centered normal distributions with unnormalized density
$$\tilde{p}_\Sigma(x) \propto \exp\left(-\frac{1}{2}x^\top\Sigma^{-1}x\right) = \exp\left(-\frac{1}{2}\operatorname{tr}(x^\top\Sigma^{-1}x)\right) = \exp\left(-\frac{1}{2}\operatorname{tr}(x x^\top\Sigma^{-1})\right),$$
obtained using the cyclic property of the matrix trace, where $\Sigma$ is the covariance matrix. Here, we have $\theta = \Sigma^{-1}$ (the precision matrix) and $\Theta = \mathrm{Sym}^{++}(d)$ for $t(x) = -\frac{1}{2}x x^\top$, with the matrix inner product $\langle A, B\rangle = \operatorname{tr}(AB)$. The partition function $Z(\Sigma) = (2\pi)^{\frac{d}{2}}\sqrt{\det(\Sigma)}$ expressed with the natural parameter is $Z(\theta) = (2\pi)^{\frac{d}{2}}\frac{1}{\sqrt{\det(\theta)}}$. This is a convex function with
$$\nabla Z(\theta) = -\frac{1}{2}(2\pi)^{\frac{d}{2}}\,\frac{\nabla_\theta\det(\theta)}{\det(\theta)^{\frac{3}{2}}} = -\frac{1}{2}(2\pi)^{\frac{d}{2}}\,\theta^{-1}\det(\theta)^{-\frac{1}{2}},$$
as $\nabla_\theta\det(\theta) = \det(\theta)\,\theta^{-1}$ for symmetric invertible $\theta$ by matrix calculus.
Now, consider the family of univariate normal distributions
$$\mathcal{E} = \left\{p_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)\right\}.$$
Let $\theta = \left(\theta_1 = \frac{1}{\sigma^2},\ \theta_2 = \frac{\mu}{\sigma^2}\right)$ and
$$Z(\theta_1,\theta_2) = \sqrt{\frac{2\pi}{\theta_1}}\exp\left(\frac{\theta_2^2}{2\theta_1}\right).$$
The unnormalized densities are $\tilde{p}_\theta(x) = \exp\left(-\theta_1\frac{x^2}{2} + \theta_2 x\right)$, and we have
$$\nabla Z(\theta) = \left(-\sqrt{\frac{\pi}{2}}\,\frac{(\theta_1+\theta_2^2)\exp\left(\frac{\theta_2^2}{2\theta_1}\right)}{\theta_1^{\frac{5}{2}}},\ \ \sqrt{2\pi}\,\frac{\theta_2\exp\left(\frac{\theta_2^2}{2\theta_1}\right)}{\theta_1^{\frac{3}{2}}}\right).$$
It follows that $D_{\mathrm{KL}}(\tilde{p}_\theta:\tilde{p}_{\theta'}) = B_Z(\theta':\theta)$.
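The following minimal sketch (not part of the paper) checks this last identity numerically for the univariate normal family, using the gradient of $Z$ given above; the two natural parameter vectors and the integration grid are arbitrary illustration choices.

```python
# Minimal sketch (illustrative, not from the paper): D_KL(ptilde_theta : ptilde_theta')
# equals B_Z(theta' : theta) for ptilde_theta(x) = exp(-theta1 x^2/2 + theta2 x).
import numpy as np

def Z(t):
    t1, t2 = t
    return np.sqrt(2 * np.pi / t1) * np.exp(t2**2 / (2 * t1))

def gradZ(t):
    t1, t2 = t
    e = np.exp(t2**2 / (2 * t1))
    dZ1 = -np.sqrt(np.pi / 2) * (t1 + t2**2) * e / t1**2.5
    dZ2 = np.sqrt(2 * np.pi) * t2 * e / t1**1.5
    return np.array([dZ1, dZ2])

def kl_unnormalized(t, tp, x_min=-20.0, x_max=20.0, n=400_000):
    x = np.linspace(x_min, x_max, n)
    log_p = -t[0] * x**2 / 2 + t[1] * x      # log of unnormalized densities
    log_q = -tp[0] * x**2 / 2 + tp[1] * x
    p, q = np.exp(log_p), np.exp(log_q)
    return np.trapz(p * (log_p - log_q) + q - p, x)

def bregman_Z(t1, t2):
    return Z(t1) - Z(t2) - np.dot(np.asarray(t1) - np.asarray(t2), gradZ(t2))

theta  = np.array([1.0, 0.5])    # sigma^2 = 1, mu = 0.5
thetap = np.array([2.0, -1.0])   # sigma^2 = 0.5, mu = -0.5
print(kl_unnormalized(theta, thetap))
print(bregman_Z(thetap, theta))  # B_Z(theta' : theta); the two values agree
```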

5. Deforming Convex Functions and Their Induced Dually Flat Spaces

5.1. Comparative Convexity

The log-convexity can be interpreted as a special case of comparative convexity with respect to a pair ( M , N ) of comparable weighted means [9], as follows.
A function $Z$ is $(M,N)$-convex if and only if for $\alpha\in[0,1]$ we have
$$Z(M(x,y;\alpha,1-\alpha)) \leq N(Z(x), Z(y);\alpha,1-\alpha),$$
and is strictly $(M,N)$-convex iff the inequality is strict for $\alpha\in(0,1)$ and $x\neq y$. Furthermore, a function $Z$ is (strictly) $(M,N)$-concave if $-Z$ is (strictly) $(M,N)$-convex.
Log-convexity corresponds to $(A,G)$-convexity, i.e., convexity with respect to the weighted arithmetical and geometrical means defined respectively by $A(x,y;\alpha,1-\alpha) = \alpha x + (1-\alpha)y$ and $G(x,y;\alpha,1-\alpha) = x^\alpha y^{1-\alpha}$. Ordinary convexity is $(A,A)$-convexity.
A weighted quasi-arithmetical mean [34] (also called a Kolmogorov–Nagumo mean [35]) is defined for a continuous and strictly increasing function $h$ by
$$M_h(x,y;\alpha,1-\alpha) = h^{-1}\left(\alpha h(x) + (1-\alpha) h(y)\right).$$
We let $M_h(x,y) = M_h\left(x,y;\frac{1}{2},\frac{1}{2}\right)$. Quasi-arithmetical means include the arithmetical mean obtained for $h(u) = \mathrm{id}(u) = u$ and the geometrical mean for $h(u) = \log(u)$, and more generally the power means
$$M_p(x,y;\alpha,1-\alpha) = \left(\alpha x^p + (1-\alpha) y^p\right)^{\frac{1}{p}} = M_{h_p}(x,y;\alpha,1-\alpha), \quad p\neq 0,$$
which are quasi-arithmetical means obtained for the family of generators $h_p(u) = \frac{u^p-1}{p}$ with inverse $h_p^{-1}(u) = (1+up)^{\frac{1}{p}}$. In the limit $p\to 0$, we have $M_0(x,y) = G(x,y)$ for the generator $\lim_{p\to 0} h_p(u) = h_0(u) = \log u$.
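A minimal sketch (not part of the paper) of weighted quasi-arithmetical means and of the power means generated by $h_p$, illustrating the limit $p\to 0$ toward the geometric mean; the numerical values are arbitrary.

```python
# Minimal sketch (illustrative, not from the paper): quasi-arithmetic and power means.
import numpy as np

def quasi_arithmetic_mean(h, h_inv, x, y, alpha):
    """M_h(x, y; alpha, 1 - alpha) = h^{-1}(alpha h(x) + (1 - alpha) h(y))."""
    return h_inv(alpha * h(x) + (1 - alpha) * h(y))

def power_mean(x, y, alpha, p):
    """Power mean M_p generated by h_p(u) = (u^p - 1)/p with inverse (1 + p v)^(1/p)."""
    h = lambda u: (u**p - 1.0) / p
    h_inv = lambda v: (1.0 + p * v) ** (1.0 / p)
    return quasi_arithmetic_mean(h, h_inv, x, y, alpha)

x, y, alpha = 2.0, 8.0, 0.5
print(power_mean(x, y, alpha, 1.0))     # arithmetic mean: 5.0
print(power_mean(x, y, alpha, 1e-9))    # ~ geometric mean as p -> 0
print(np.sqrt(x * y))                   # geometric mean: 4.0
```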
Proposition 6
([36,37]). A function $Z(\theta)$ is strictly $(M_\rho, M_\tau)$-convex with respect to two strictly increasing smooth functions $\rho$ and $\tau$ if and only if the function $F = \tau\circ Z\circ\rho^{-1}$ is strictly convex.
Notice that the set of strictly increasing smooth functions forms a non-Abelian group, with function composition as the group operation, the identity function as the neutral element, and the functional inverse as the inverse element.
Because log-convexity is $(A = M_{\mathrm{id}}, G = M_{\log})$-convexity, a function $Z$ is strictly log-convex iff $\log\circ\, Z\circ\mathrm{id}^{-1} = \log Z$ is strictly convex. We have
$$Z = \tau^{-1}\circ F\circ\rho \quad\Longleftrightarrow\quad F = \tau\circ Z\circ\rho^{-1}.$$
Starting from a given convex function $F(\theta)$, we can thus deform $F(\theta)$ to obtain a function $Z(\theta)$ using two strictly monotone functions $\tau$ and $\rho$: $Z(\theta) = \tau^{-1}(F(\rho(\theta)))$.
For a $(M_\rho, M_\tau)$-convex function $Z(\theta)$ which is also strictly convex, we can define a pair of Bregman divergences $B_Z$ and $B_F$ with $F(\theta) = \tau(Z(\rho^{-1}(\theta)))$, and a corresponding pair of skewed Jensen divergences.
Thus, we have the following generic deformation scheme.
$$\underbrace{F = \tau\circ Z\circ\rho^{-1}}_{(M_{\rho^{-1}},\,M_{\tau^{-1}})\text{-convex when } Z \text{ is convex}} \qquad \overset{(\rho,\tau)\text{-deformation}}{\underset{(\rho^{-1},\tau^{-1})\text{-deformation}}{\rightleftarrows}} \qquad \underbrace{Z = \tau^{-1}\circ F\circ\rho}_{(M_\rho,\,M_\tau)\text{-convex when } F \text{ is convex}}$$
In particular, when the function $Z$ is deformed by the strictly increasing power generators $h_{p_1}$ and $h_{p_2}$ for $p_1, p_2\in\mathbb{R}$ as
$$Z_{p_1,p_2} = h_{p_2}\circ Z\circ h_{p_1}^{-1},$$
then $Z_{p_1,p_2}$ is strictly convex when $Z$ is strictly $(M_{p_1}, M_{p_2})$-convex, and as such induces corresponding Bregman and Jensen divergences.
Example 3.
Consider the partition function $Z(\theta) = \frac{1}{\theta}$ of the exponential distribution family ($\theta > 0$ with $\Theta = \mathbb{R}_{>0}$). Let $Z_p(\theta) = (h_p\circ Z)(\theta) = \frac{\theta^{-p}-1}{p}$; then, we have $Z_p''(\theta) = (1+p)\frac{1}{\theta^{2+p}} > 0$ when $p > -1$. Thus, we can deform $Z$ smoothly by $Z_p$ while preserving convexity by ranging $p$ from $-1$ to $+\infty$. In this way, we obtain a corresponding family of Bregman and Jensen divergences.
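A minimal sketch (not part of the paper) of this power deformation and of the Bregman divergences it induces; the deformation parameters and the evaluation points are arbitrary illustration choices.

```python
# Minimal sketch (illustrative, not from the paper): the deformed generator
# Z_p(theta) = (theta^{-p} - 1)/p of Example 3 stays strictly convex for p > -1,
# so each admissible p induces a Bregman divergence B_{Z_p}.
import numpy as np

def Z_p(theta, p):
    return (theta**(-p) - 1.0) / p

def grad_Z_p(theta, p):
    return -theta**(-p - 1.0)          # derivative of Z_p

def bregman_Zp(t1, t2, p):
    return Z_p(t1, p) - Z_p(t2, p) - (t1 - t2) * grad_Z_p(t2, p)

for p in (-0.5, 0.5, 1.0, 2.0):        # all within the convexity range p > -1
    print(p, bregman_Zp(1.5, 4.0, p))  # non-negative for each admissible p
```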
The proposed convex deformation using quasi-arithmetical mean generators differs from the interpolation of convex functions using the technique of proximal averaging [38].
Note that in [37] the comparative convexity with respect to a pair of quasi-arithmetical means ( M ρ , M τ ) is used to define a ( M ρ , M τ ) -Bregman divergence, which turns out to be equivalent to a conformal Bregman divergence on the ρ -embedding of the parameters.

5.2. Dually Flat Spaces

We start with a refinement of the class of convex functions used to generate dually flat spaces.
Definition 2
(Legendre type function [39]). $(\Theta, F)$ is of Legendre type if the function $F:\Theta\to\mathbb{R}$ is strictly convex and differentiable with $\Theta\neq\emptyset$ and
$$\lim_{\lambda\to 0}\frac{\mathrm{d}}{\mathrm{d}\lambda} F(\lambda\theta + (1-\lambda)\bar\theta) = -\infty, \quad\forall\theta\in\Theta,\ \forall\bar\theta\in\partial\Theta.$$
A Legendre-type function $(\Theta, F)$ admits a convex conjugate $F^*(\eta)$ via the Legendre transform $F^*(\eta) = \sup_{\theta\in\Theta}\{\langle\theta,\eta\rangle - F(\theta)\}$:
$$F^*(\eta) = \langle(\nabla F)^{-1}(\eta), \eta\rangle - F\left((\nabla F)^{-1}(\eta)\right).$$
A smooth and strictly convex function $(\Theta, F(\theta))$ of Legendre type induces a dually flat space [1] $M$, i.e., a smooth Hessian manifold [40] with a single global chart $(\Theta, \theta(\cdot))$ [1]. A canonical divergence $D(p:q)$ between two points $p$ and $q$ of $M$ is viewed as a single-parameter contrast function [41] $D(r_{pq})$ on the product manifold $M\times M$. The canonical divergence and its dual canonical divergence $D^*(r_{qp}) = D(r_{pq})$ can be expressed equivalently as either dual Bregman divergences or dual Fenchel–Young divergences (Figure 2):
$$D(r_{pq}) = B_F(\theta(p):\theta(q)) = Y_{F,F^*}(\theta(p):\eta(q)) = D^*(r_{qp}) = B_{F^*}(\eta(q):\eta(p)) = Y_{F^*,F}(\eta(q):\theta(p)),$$
where $Y_{F,F^*}$ is the Fenchel–Young divergence:
$$Y_{F,F^*}(\theta(p):\eta(q)) = F(\theta(p)) + F^*(\eta(q)) - \langle\theta(p), \eta(q)\rangle.$$
We have the dual global coordinate system $\eta = \nabla F(\theta)$ and the domain $H = \{\nabla F(\theta) : \theta\in\Theta\}$, which defines the dual Legendre-type potential function $(H, F^*(\eta))$. The Legendre type of the function ensures that $F^{**} = F$ (a sufficient condition is to have $F$ convex and lower semi-continuous [42]).
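The Bregman/Fenchel–Young correspondence can be illustrated with a minimal sketch (not part of the paper), using the Poisson cumulant $F(\theta) = e^\theta$, whose conjugate is $F^*(\eta) = \eta\log\eta - \eta$; the parameter values are arbitrary.

```python
# Minimal sketch (illustrative, not from the paper): B_F(theta_p : theta_q) equals the
# Fenchel-Young divergence Y_{F,F*}(theta_p : eta_q) with eta_q = gradF(theta_q).
import numpy as np

F      = np.exp                                 # F(theta) = exp(theta)
gradF  = np.exp                                 # eta = F'(theta) = exp(theta)
F_star = lambda eta: eta * np.log(eta) - eta    # convex conjugate of exp

def bregman_F(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def fenchel_young(t1, eta2):
    return F(t1) + F_star(eta2) - t1 * eta2

theta_p, theta_q = 0.4, 1.7
eta_q = gradF(theta_q)                # dual (expectation) parameter of q
print(bregman_F(theta_p, theta_q))
print(fenchel_young(theta_p, eta_q))  # coincides with the Bregman divergence
```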
A manifold $M$ is called dually flat, as the torsion-free affine connections $\nabla$ and $\nabla^*$ induced by the potential functions $F(\theta)$ and $F^*(\eta)$ linked by the Legendre–Fenchel transformation are flat [1]; that is, their Christoffel symbols vanish in the dual coordinate systems: $\Gamma(\theta) = 0$ and $\Gamma^*(\eta) = 0$.
The Legendre-type function $(\Theta, F(\theta))$ is not defined uniquely; the function $\bar{F}(\bar\theta) = F(A\theta + b) + \langle c, \theta\rangle + d$ with $\bar\theta = A\theta + b$, for $A$ an invertible matrix, $b$ and $c$ vectors, and $d$ a scalar, defines the same dually flat space with the same canonical divergence $D(p:q)$:
$$D(p:q) = B_F(\theta(p):\theta(q)) = B_{\bar{F}}(\bar\theta(p):\bar\theta(q)).$$
Thus, a log-convex Legendre-type function $Z(\theta)$ induces two dually flat spaces by considering the DFSs induced by $Z(\theta)$ and $F(\theta) = \log Z(\theta)$. Let the corresponding gradient maps be $\eta = \nabla Z(\theta)$ and $\tilde\eta = \nabla F(\theta) = \frac{\eta}{Z(\theta)}$.
When F ( θ ) is chosen as the cumulant function of an exponential family, the Bregman divergence B F ( θ 1 : θ 2 ) can be interpreted as a statistical divergence between corresponding probability densities, meaning that the Bregman divergence amounts to the reverse Kullback–Leibler divergence: B F ( θ 1 : θ 2 ) = D KL * ( p θ 1 : p θ 2 ) , where D KL * is the reverse KLD.
Notice that deforming a convex function F ( θ ) into F ( ρ ( θ ) ) such that F ρ remains strictly convex has been considered by Yoshizawa and Tanabe [43] to build a two-parameter deformation ρ α , β of the dually flat space induced by the cumulant function F ( θ ) of the multivariate normal family. Additionally, see the method of Hougaard [44] for obtaining other exponential families from a given exponential family.
Thus, in general, there are many more dually flat spaces with corresponding divergences and statistical divergences than the usually considered exponential family manifold [5] induced by the cumulant function. It is interesting to consider their use in information sciences.

6. Conclusions and Discussion

For machine learning practitioners, it is well known that the Kullback–Leibler divergence (KLD) between two probability densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family with cumulant function $F$ (the free energy in thermodynamics) amounts to a reverse Bregman divergence [26] induced by $F$, or equivalently to a reverse Fenchel–Young divergence [27]:
$$D_{\mathrm{KL}}(p_{\theta_1}:p_{\theta_2}) = B_F(\theta_2:\theta_1) = Y_{F,F^*}(\theta_2:\eta_1),$$
where $\eta = \nabla F(\theta)$ is the dual moment or expectation parameter.
In this paper, we have shown that the KLD as extended to positive unnormalized densities $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ of an exponential family with convex partition function $Z(\theta)$ (the Laplace transform) amounts to a reverse Bregman divergence induced by $Z$, or equivalently to a reverse Fenchel–Young divergence:
$$D_{\mathrm{KL}}(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = B_Z(\theta_2:\theta_1) = Y_{Z,Z^*}(\theta_2:\tilde\eta_1),$$
where $\tilde\eta = \nabla Z(\theta)$.
More generally, we have shown that the scaled α -skewed Jensen divergences induced by the cumulant and partition functions between natural parameters coincide with the scaled α -skewed Bhattacharyya distances between probability densities and the α -divergences between unnormalized densities, respectively:
$$D^s_{B,\alpha}(p_{\theta_1}:p_{\theta_2}) = J^s_{F,\alpha}(\theta_1:\theta_2), \qquad D_\alpha(\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}) = J^s_{Z,\alpha}(\theta_1:\theta_2).$$
We have noted that the partition functions Z of exponential families are both convex and log-convex, and that the corresponding cumulant functions are both convex and exponentially convex.
Figure 3 summarizes the relationships between statistical divergences and between the normalized and unnormalized densities of an exponential family, as well as the corresponding divergences between their natural parameters. Notice that Brekelmans and Nielsen [45] considered deformed uni-order likelihood ratio exponential families (LREFs) for annealing paths and obtained an identity for the α -divergences between unnormalized densities and Bregman divergences induced by multiplicatively scaled partition functions.
Because the log-convex partition function is also convex, we have generalized the principle of building pairs of convex generators using the comparative convexity with respect to a pair of quasi-arithmetical means, and have further discussed the induced dually flat spaces and divergences. In particular, by considering the convexity-preserving deformations obtained by power mean generators, we have shown how to obtain a family of convex generators and dually flat spaces. Notice that some parametric families of Bregman divergences, such as the α -divergences [46], β -divergences [47], and V-geometry [48] of symmetric positive-definite matrices, yield families of dually flat spaces.
Banerjee et al. [49] proved a duality between regular exponential families and a subclass of Bregman divergences, which they accordingly termed regular Bregman divergences. In particular, this duality allows the Maximum Likelihood Estimator (MLE) of an exponential family with a cumulant function F to be viewed as a right-sided Bregman centroid with respect to the Legendre–Fenchel dual F * . In [50], the scope of this duality was further extended for arbitrary Bregman divergences by introducing a class of generalized exponential families.
Concave deformations have been recently studied in [51], where the authors introduced the log-ϕ-concavity induced by a positive continuous function ϕ generating a deformed logarithm $\log_\phi$ as the $(A, \log_\phi)$-comparative concavity (Definition 1.2 in [51]), as well as the weaker notion of F-concavity, which corresponds to $(A, F)$-concavity (Definition 2.1 in [51], requiring strictly increasing functions $F$). Our deformation framework $Z = \tau^{-1}\circ F\circ\rho$ is more general, as it is double-sided: we jointly deform the function $F$ by $F_\tau = \tau^{-1}\circ F$ and its argument $\theta$ by $\theta_\rho = \rho(\theta)$.
Exponentially concave functions have been considered as generators of L-divergences in [24]; α-exponentially concave functions $G$, such that $\exp(\alpha G)$ is concave for $\alpha > 0$, generalize the L-divergences to $L_\alpha$-divergences, which can be expressed equivalently using a generalization of the Fenchel–Young divergence based on c-transforms [24]. When $\alpha < 0$, exponentially convex functions are considered instead of exponentially concave functions. The information geometry induced by $L_\alpha$-divergences is dually projectively flat with constant curvature; reciprocally, a dually projectively flat structure with constant curvature induces (locally) a canonical $L_\alpha$-divergence. Wong and Zhang [52] investigated a one-parameter deformation of convex duality, called λ-duality, by considering functions $f$ such that $\frac{1}{\lambda}(e^{\lambda f} - 1)$ is convex for $\lambda\neq 0$. They defined the λ-conjugate transform as a particular case of the c-transform [24] and studied the information geometry of the induced λ-logarithmic divergences. The λ-duality yields a generalization of exponential and mixture families to λ-exponential and λ-mixture families related to the Rényi divergence.
Finally, certain statistical divergences, called projective divergences, are invariant under rescaling, and as such can define dissimilarities between non-normalized densities. For example, the γ -divergences [32] D γ are such that D γ ( p : q ) = D γ ( p ˜ : q ˜ ) (with γ -divergences tending to the KLD when γ 0 ) or the Cauchy–Schwarz divergence [53].

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The author heartily thanks the three reviewers for their helpful comments which led to this improved paper.

Conflicts of Interest

The author, Frank Nielsen, is employed by Sony Computer Science Laboratories Inc. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016. [Google Scholar]
  2. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  3. Nielsen, F.; Hadjeres, G. Monte Carlo information-geometric structures. In Geometric Structures of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 69–103. [Google Scholar]
  4. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. In Lecture Notes-Monograph Series; Cornell University: Ithaca, NY, USA, 1986; Volume 9. [Google Scholar]
  5. Scarfone, A.M.; Wada, T. Legendre structure of κ-thermostatistics revisited in the framework of information geometry. J. Phys. Math. Theor. 2014, 47, 275002. [Google Scholar] [CrossRef]
  6. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef] [PubMed]
  7. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
  8. Cichocki, A.; Amari, S.I. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
  9. Niculescu, C.; Persson, L.E. Convex Functions and Their Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 23, first edition published in 2006. [Google Scholar]
  10. Billingsley, P. Probability and Measure; John Wiley & Sons: Hoboken, NJ, USA, 2017. [Google Scholar]
  11. Barndorff-Nielsen, O. Information and Exponential Families; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  12. Morris, C.N. Natural exponential families with quadratic variance functions. Ann. Stat. 1982, 10, 65–80. [Google Scholar] [CrossRef]
  13. Efron, B. Exponential Families in Theory and Practice; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
  14. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
  15. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
  16. Wainwright, M.J.; Jordan, M.I. Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 2008, 1, 1–305. [Google Scholar]
  17. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. In Predicting Structured Data; University of Toronto: Toronto, ON, USA, 2006; Volume 1. [Google Scholar]
  18. Kindermann, R.; Snell, J.L. Markov Random Fields and Their Applications; American Mathematical Society: Providence, RI, USA, 1980; Volume 1. [Google Scholar]
  19. Dai, B.; Liu, Z.; Dai, H.; He, N.; Gretton, A.; Song, L.; Schuurmans, D. Exponential family estimation via adversarial dynamics embedding. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; Volume 32. [Google Scholar]
  20. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983, 78, 124–130. [Google Scholar] [CrossRef]
  21. Garcia, V.; Nielsen, F. Simplification and hierarchical representations of mixtures of exponential families. Signal Process. 2010, 90, 3197–3212. [Google Scholar] [CrossRef]
  22. Zhang, J.; Wong, T.K.L. λ-Deformed probability families with subtractive and divisive normalizations. In Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 2021; Volume 45, pp. 187–215. [Google Scholar]
  23. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  24. Wong, T.K.L. Logarithmic divergences from optimal transport and Rényi geometry. Inf. Geom. 2018, 1, 39–78. [Google Scholar] [CrossRef]
  25. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
  26. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246. [Google Scholar] [CrossRef]
  27. Amari, S.I. Differential-Geometrical Methods in Statistics, 1st ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 28. [Google Scholar]
  28. Nielsen, F. Statistical divergences between densities of truncated exponential families with nested supports: Duo Bregman and duo Jensen divergences. Entropy 2022, 24, 421. [Google Scholar] [CrossRef] [PubMed]
  29. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66. [Google Scholar] [CrossRef]
  30. Wainwright, M.J.; Jaakkola, T.S.; Willsky, A.S. A new class of upper bounds on the log partition function. IEEE Trans. Inf. Theory 2005, 51, 2313–2335. [Google Scholar] [CrossRef]
  31. Hyvärinen, A.; Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
  32. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  33. Eguchi, S.; Komori, O. Minimum Divergence Methods in Statistical Machine Learning; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  34. Kolmogorov, A. Sur la Notion de la Moyenne; Cold Spring Harbor Laboratory: Cold Spring Harbor, NY, USA, 1930. [Google Scholar]
  35. Komori, O.; Eguchi, S. A unified formulation of k-Means, fuzzy c-Means and Gaussian mixture model by the Kolmogorov–Nagumo average. Entropy 2021, 23, 518. [Google Scholar] [CrossRef]
  36. Aczél, J. A generalization of the notion of convex functions. Det K. Nor. Vidensk. Selsk. Forh. Trondheim 1947, 19, 87–90. [Google Scholar]
  37. Nielsen, F.; Nock, R. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127. [Google Scholar] [CrossRef]
  38. Bauschke, H.H.; Goebel, R.; Lucet, Y.; Wang, X. The proximal average: Basic theory. SIAM J. Optim. 2008, 19, 766–785. [Google Scholar] [CrossRef]
  39. Rockafellar, R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967, 19, 200–205. [Google Scholar] [CrossRef]
  40. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007. [Google Scholar]
  41. Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985, 15, 341–391. [Google Scholar] [CrossRef]
  42. Rockafellar, R. Convex Analysis; Princeton Landmarks in Mathematics and Physics; Princeton University Press: Princeton, NJ, USA, 1997. [Google Scholar]
  43. Yoshizawa, S.; Tanabe, K. Dual differential geometry associated with the Kullbaek-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 1999, 35, 113–137. [Google Scholar] [CrossRef]
  44. Hougaard, P. Convex Functions in Exponential Families; Department of Mathematical Sciences, University of Copenhagen: Copenhagen, Denmark, 1983. [Google Scholar]
  45. Brekelmans, R.; Nielsen, F. Variational representations of annealing paths: Bregman information under monotonic embeddings. Inf. Geom. 2024. [Google Scholar] [CrossRef]
  46. Amari, S.I. α-Divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
  47. Hennequin, R.; David, B.; Badeau, R. Beta-divergence as a subclass of Bregman divergence. IEEE Signal Process. Lett. 2010, 18, 83–86. [Google Scholar] [CrossRef]
  48. Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by Beta-divergence. Entropy 2013, 15, 4732–4747. [Google Scholar] [CrossRef]
  49. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  50. Frongillo, R.; Reid, M.D. Convex foundations for generalized MaxEnt models. AIP Conf. Proc. 2014, 1636, 11–16. [Google Scholar]
  51. Ishige, K.; Salani, P.; Takatsu, A. Hierarchy of deformations in concavity. Inf. Geom. 2022, 7, 251–269. [Google Scholar] [CrossRef]
  52. Zhang, J.; Wong, T.K.L. λ-Deformation: A canonical framework for statistical manifolds of constant curvature. Entropy 2022, 24, 193. [Google Scholar] [CrossRef] [PubMed]
  53. Jenssen, R.; Principe, J.C.; Erdogmus, D.; Eltoft, T. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. 2006, 343, 614–629. [Google Scholar] [CrossRef]
Figure 1. Strictly log-convex functions form a proper subset of strictly convex functions.
Figure 2. The canonical divergence $D$ and dual canonical divergence $D^*$ on a dually flat space $M$ equipped with potential functions $F$ and $F^*$ can be viewed as single-parameter contrast functions on the product manifold $M\times M$: the divergence $D$ can be expressed using either the $\theta\times\theta$-coordinate system as a Bregman divergence or the mixed $\theta\times\eta$-coordinate system as a Fenchel–Young divergence. Similarly, the dual divergence $D^*$ can be expressed using either the $\eta\times\eta$-coordinate system as a dual Bregman divergence or the mixed $\eta\times\theta$-coordinate system as a dual Fenchel–Young divergence.
Figure 3. Statistical divergences between normalized $p_\theta$ and unnormalized $\tilde{p}_\theta$ densities of an exponential family $\mathcal{E}$, with corresponding divergences between their natural parameters. Without loss of generality, we consider a natural exponential family (i.e., $t(x) = x$ and $k(x) = 0$) with cumulant function $F$ and partition function $Z$, with $J_F$ and $B_F$ respectively denoting the Jensen and Bregman divergences induced by the generator $F$. The statistical divergences $D_{R,\alpha}$ and $D_{B,\alpha}$ denote the Rényi α-divergences and skewed α-Bhattacharyya distances, respectively. The superscript “s” indicates rescaling by the multiplicative factor $\frac{1}{\alpha(1-\alpha)}$, while the superscript “*” denotes the reverse divergence obtained by swapping the parameter order.

