Article

Affine Calculus for Constrained Minima of the Kullback–Leibler Divergence

by
Giovanni Pistone
1,2
1
De Castro Statistics, Collegio Carlo Alberto, 10122 Torino, Italy
2
Nuovo SEFIR, c/o Coworld, Centro Direzionale Milano Due, Palazzo Canova, 20054 Segrate, Italy
Stats 2025, 8(2), 25; https://doi.org/10.3390/stats8020025
Submission received: 6 February 2025 / Revised: 17 March 2025 / Accepted: 19 March 2025 / Published: 21 March 2025

Abstract

The non-parametric version of Amari’s dually affine Information Geometry provides a practical calculus to perform computations of interest in statistical machine learning. The method uses the notion of a statistical bundle, a mathematical structure that includes both probability densities and random variables to capture the spirit of Fisherian statistics. We focus on computations involving a constrained minimization of the Kullback–Leibler divergence. We show how to obtain neat and principled versions of known computations in applications such as mean-field approximation, adversarial generative models, and variational Bayes.

1. Introduction and Notations

Many modern Artificial Intelligence (AI) and machine learning (ML) algorithms are based on non-parametric statistical methods and on optimization algorithms that minimize a divergence measure between probability functions. In particular, one computes the gradient of a function defined on the probability simplex; then, the learning uses a gradient ascent technique. Such a basic approach is illustrated, for example, in the textbook [1] (Ch. 18).
In most papers, ordinary convex calculus tools on the open probability simplex provide the relevant derivatives and gradients. The relation between the analytic computations and their statistical meaning is usually not made explicit. This paper focuses on the derivative and gradient computations by providing the geometric framework called Information Geometry (IG). This geometry differs from the usual convex analysis because its devices have a direct statistical meaning. For example, the velocity of a one-dimensional parametric curve $\theta \mapsto p(\theta)$ in the open probability simplex is defined to be the Fisher score $\frac{d}{d\theta}\log p(\theta)$ instead of the ordinary derivative $\frac{d}{d\theta}p(\theta)$. Generally speaking, IG is a geometric interpretation of Fisherian inference ([1], Ch. 5).
Amari's Information Geometry (IG) [2,3,4] has been successfully applied to modern AI algorithms; see, for example, [5]. Here, we use the non-parametric version of IG of [6,7]. This version is non-parametric because the basic set of states is the open probability simplex; it is affine because it satisfies a generalization of the classical Weyl axioms [8]. Moreover, it is dually affine in the sense already defined in Amari's contributions because the covariance bilinear operator appears as a duality pairing on the vector space of coordinates.
The specific applications we will consider as examples come from the literature in statistical ML, particularly those that involve the constrained minimization of the Kullback–Leibler divergence (KL-divergence). Indeed, our main result in Section 2 is a form of the total gradient of the KL-divergence as expressed in the dually affine geometry. Namely, we consider symmetric divergences [9], generative adversarial networks [10], mixed entropy and transport optimization [11,12], and variational Bayes [13,14].
Non-parametric IG can be developed on general sample spaces under various functional assumptions. One option, among many, is the use of Orlicz spaces [15]; see [6,16]. In this paper, we are not interested in discussing the functional setup. Still, we are interested in presenting the peculiar affine calculus of positive probability functions on a finite state space $\Omega$ in a geometric language compatible with the infinite-dimensional theory [17]. Such a calculus provides principled definitions of a curve's velocity, of a scalar field's gradient, and of the gradient flow.

1.1. Prerequisites

Below, we provide a schematic summary of the theory. For complete details, we refer to previous presentations in [7,18].
Let $\Omega$ be a finite sample space. We look at the open simplex as the maximal exponential model, denoted $\mathcal E(\Omega)$. In fact, we present every couple of positive probability functions on $\Omega$, say $p, q$, in the form inspired by Statistical Physics [19]:

q = e^{v - \psi} \cdot p ,

where $p$ represents a ground state, $q$ is a perturbation of the ground state, $v$ is a random variable, $\psi$ is a normalizing constant, and $\psi = \kappa_p(v) = \log E_p[e^v]$ is the cumulant function.
The random variable $v$ depends on $p$ and $q$ up to a constant. If we specify $E_p[v] = 0$ in Equation (1), then a straightforward computation gives

v = \log\frac{q}{p} - E_p\left[\log\frac{q}{p}\right] , \qquad q = e^{v - \kappa_p(v)} \cdot p , \qquad \kappa_p(v) = \log E_p[e^v] = E_p\left[\log\frac{p}{q}\right] = D(p \| q) ,

where $D(p \| q)$ is the KL-divergence. Regarding the entropy,

D(p \| q) = E_p\left[\log\frac{p}{q}\right] = E_p[\log p] - E_p[\log q] = -H(p) + H(p, q) .

If we specify $E_q[v] = 0$ in Equation (1), an analogous computation gives

v = \log\frac{q}{p} - E_q\left[\log\frac{q}{p}\right] , \qquad q = e^{v - \kappa_p(v)} \cdot p , \qquad \kappa_p(v) = \log E_p[e^v] = E_q\left[\log\frac{p}{q}\right] = -D(q \| p) .
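As a concrete finite-state check of the representation above (this numerical sketch is an addition of the present edit, not part of the original text; it assumes NumPy, and the helper name random_pf is chosen here), one can verify that the p-centered random variable v reproduces q and that its cumulant equals D(p‖q):

import numpy as np

rng = np.random.default_rng(0)

def random_pf(n):
    # a strictly positive probability function on a finite sample space
    w = rng.uniform(0.5, 1.5, size=n)
    return w / w.sum()

p, q = random_pf(5), random_pf(5)

# the p-centered random variable v with E_p[v] = 0
v = np.log(q / p) - np.sum(p * np.log(q / p))

kappa = np.log(np.sum(p * np.exp(v)))          # cumulant kappa_p(v) = log E_p[e^v]
D_pq = np.sum(p * np.log(p / q))               # KL-divergence D(p || q)

assert np.isclose(np.sum(p * v), 0.0)          # E_p[v] = 0
assert np.isclose(kappa, D_pq)                 # kappa_p(v) = D(p || q)
assert np.allclose(np.exp(v - kappa) * p, q)   # q = e^{v - kappa_p(v)} . p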
A vector bundle is a collection of vector spaces, each vector space being called a fiber of the bundle. For example, in differential geometry the tangent bundle collects all tangent vectors at each point. In Fisher's statistics of the open probability simplex, one considers the vector space of all Fisher scores of one-dimensional models through the probability function $q$. Inspired by this last example, we call statistical bundle the vector bundle $S\mathcal E(\Omega)$ of all couples $(q, v)$ of a positive probability function $q$ and a $q$-centered random variable $v$, $E_q[v] = 0$,

S\mathcal E(\Omega) = \{ (q, v) \mid q \in \mathcal E(\Omega) , \ E_q[v] = 0 \} .

Each fiber $S_q\mathcal E(\Omega)$ is a Euclidean space for the covariance inner product $\langle v, w \rangle_q = E_q[vw]$.
The covariance inner product is both a Riemannian metric and a duality pairing. The metric interpretation leads to the Riemannian version of IG; the duality pairing interpretation leads to our dually affine IG. Because of that, we distinguish between the fibers $S_p\mathcal E(\Omega)$ and the dual fibers $S^*_p\mathcal E(\Omega)$. The first bundle is called the exponential bundle, while the second is called the mixture bundle. We use the notation

S^*_p\mathcal E(\Omega) \times S_p\mathcal E(\Omega) \ni (v, w) \mapsto \langle v, w \rangle_p , \qquad p \in \mathcal E(\Omega) .

In our setup, all the vector spaces of random variables are finite-dimensional; hence, the fibers $S_p\mathcal E(\Omega)$ and $S^*_p\mathcal E(\Omega)$ are equal as vector spaces. However, the distinction is useful, as will be apparent in the discussion of parallel transports below.
The definition of the statistical bundle aims to capture an essential mechanism of Fisher's approach to statistics ([1], Ch. 4). Suppose $t \mapsto q(t) \in \mathcal E(\Omega)$ is a one-dimensional statistical model. In that case, the Fisher score is $t \mapsto \frac{d}{dt}\log q(t) = \overset{\star}{q}(t)$, and $t \mapsto (q(t), \overset{\star}{q}(t)) \in S\mathcal E(\Omega)$ is the lift of the curve to the statistical bundle.
Dually affine geometry follows from the definition of two parallel transports on the fibers and two affine charts. The parallel transports act between the fibers
S_q\mathcal E(\Omega) \ni v \mapsto U^e_{q \to r}\, v = v - E_r[v] \in S_r\mathcal E(\Omega) \quad \text{(exponential transport)} ,

S^*_q\mathcal E(\Omega) \ni w \mapsto U^m_{q \to r}\, w = \frac{q}{r}\, w \in S^*_r\mathcal E(\Omega) \quad \text{(mixture transport)} .
It is easy to check that the transports are duals of each other:
\langle v , U^e_{r \to q}\, w \rangle_q = \langle U^m_{q \to r}\, v , w \rangle_r \quad \text{(duality of the transports)} ,

\langle v , w \rangle_q = \langle U^m_{q \to r}\, v , U^e_{q \to r}\, w \rangle_r \quad \text{(push of the inner product)} .
The affine charts that define the two dual affine geometries by mapping the base set to a vector space of coordinates are
q \mapsto s_p(q) = \log\frac{q}{p} - E_p\left[\log\frac{q}{p}\right] \in S_p\mathcal E(\Omega) \quad \text{(exponential chart)} ,

q \mapsto \eta_p(q) = \frac{q}{p} - 1 \in S^*_p\mathcal E(\Omega) \quad \text{(mixture chart)} ,
and the geometries defined by the two atlases are affine because the parallelogram law holds in both cases:
s_p(q) + U^e_{q \to p}\, s_q(r) = s_p(r) , \qquad \eta_p(q) + U^m_{q \to p}\, \eta_q(r) = \eta_p(r) .
The inverse of the exponential chart is a non-parametric exponential family ([1], Ch. 5), and the known mechanisms of the cumulant function provide a fundamental calculus tool [20]. If $K_p$ is the restriction of $\kappa_p$ to $S_p\mathcal E(\Omega)$ and $v = s_p(q)$, then $q = s_p^{-1}(v) = e_p(v) = e^{v - K_p(v)} \cdot p$, and

K_p(v) = \log E_p[e^v] \quad \text{(cumulant function)} ,

D(p \| e_p(v)) = K_p(v) \quad \text{(the cumulant function expresses the KL-divergence)} ,

dK_p(v)[h] = E_{e_p(v)}[h] \quad \text{(derivative of } K_p \text{ in the direction } h\text{)} ,

d^2K_p(v)[h, k] = \mathrm{Cov}_{e_p(v)}(h, k) \quad \text{(second derivative of } K_p \text{ in the directions } h, k\text{)} .
Equations (10) and (11) are the non-parametric version of the well-known properties of the derivative of the cumulant function in exponential models; see ([1], § 5.5) and [20].
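The following finite-difference check of Equations (10) and (11) is an illustration added in this edit (NumPy; the helper names center, K, and e_p are mine), a sketch under those assumptions rather than code from the paper:

import numpy as np

rng = np.random.default_rng(1)
n = 6
w = rng.uniform(0.5, 1.5, n)
p = w / w.sum()

def center(u, pf):                 # remove the pf-mean: u - E_pf[u]
    return u - np.sum(pf * u)

def K(v):                          # cumulant K_p(v) = log E_p[e^v]
    return np.log(np.sum(p * np.exp(v)))

def e_p(v):                        # exponential chart inverse: e_p(v) = e^{v - K_p(v)} p
    return np.exp(v - K(v)) * p

v = center(rng.normal(size=n), p)  # a point of S_p E(Omega)
h = center(rng.normal(size=n), p)  # directions in S_p E(Omega)
k = center(rng.normal(size=n), p)
q = e_p(v)
eps = 1e-6

# dK_p(v)[h] = E_{e_p(v)}[h]  (Equation (10)), via a central difference
dK_num = (K(v + eps * h) - K(v - eps * h)) / (2 * eps)
assert np.isclose(dK_num, np.sum(q * h), atol=1e-6)

# d^2K_p(v)[h, k] = Cov_{e_p(v)}(h, k)  (Equation (11))
d2K_num = (np.sum(e_p(v + eps * k) * h) - np.sum(e_p(v - eps * k) * h)) / (2 * eps)
cov_hk = np.sum(q * h * k) - np.sum(q * h) * np.sum(q * k)
assert np.isclose(d2K_num, cov_hk, atol=1e-6)

# D(p || e_p(v)) = K_p(v)
assert np.isclose(np.sum(p * np.log(p / q)), K(v))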
We can now show that Fisher's score is a velocity in the technical sense, namely the velocity computed in the moving frame of both charts. If $t \mapsto q(t) \in \mathcal E(\Omega)$ is a smooth curve and $\Phi \colon \mathcal E(\Omega) \to \mathbb R$ is a smooth mapping,

\overset{\star}{q}(t) = \left. \frac{d}{dt} s_p(q(t)) \right|_{p = q(t)} = \left. \frac{d}{dt} \eta_p(q(t)) \right|_{p = q(t)} = \frac{d}{dt} \log q(t) \quad \text{(velocity)} ,

\frac{d}{dt} \Phi(q(t)) = \left\langle \operatorname{grad} \Phi(q(t)) , \overset{\star}{q}(t) \right\rangle_{q(t)} \quad \text{(gradient)} .
The squared norm of the velocity (12),
\left\langle \overset{\star}{q}(t) , \overset{\star}{q}(t) \right\rangle_{q(t)} = E_{q(t)}\left[ \left( \frac{d}{dt} \log q(t) \right)^2 \right] = \sum_{x \in \Omega} \frac{\dot q(x; t)^2}{q(x; t)} ,

is the Fisher information, which appeared first in the classical Cramér–Rao lower bound.
The gradient defined in Equation (13) is frequently called the natural gradient in the IG literature, following the use introduced in the case of parametric models by Amari [3]. In Riemannian geometry [17,21], the metric acts as a duality pairing, and the definition of the gradient is similar to Equation (13). The classic example of the computation of the gradient is the gradient of the expected value as a function of the probability function,
\frac{d}{dt} E_{q(t)}[u] = \left\langle u , \frac{\dot q(t)}{q(t)} \right\rangle_{q(t)} = \left\langle u - E_{q(t)}[u] , \overset{\star}{q}(t) \right\rangle_{q(t)} ,

so that $\operatorname{grad} E_q[u] = u - E_q[u]$.
The gradient of Φ gives the velocities of curves “orthogonal” to the surfaces of constant Φ -value, that is, the curves of steepest ascent. The solutions of the equation grad Φ = 0 are the stationary points of Φ , and an equation of the form
\overset{\star}{q}(t) = \epsilon(t)\, \operatorname{grad} \Phi(q(t))
is a gradient flow equation.
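As an added illustration of the velocity and gradient equations above (not in the original; NumPy-based, with an arbitrary exponential curve through the simplex chosen here), the natural gradient of q ↦ E_q[u] can be checked against a finite-difference derivative:

import numpy as np

rng = np.random.default_rng(2)
n = 5
w = rng.uniform(0.5, 1.5, n)
p = w / w.sum()
u = rng.normal(size=n)                      # a fixed random variable on Omega
a = rng.normal(size=n); a -= np.sum(p * a)  # direction of an exponential curve

def q_of(t):                                # q(t) proportional to e^{t a} p
    z = p * np.exp(t * a)
    return z / z.sum()

def velocity(t, eps=1e-6):                  # q*(t) = d/dt log q(t)
    return (np.log(q_of(t + eps)) - np.log(q_of(t - eps))) / (2 * eps)

t = 0.3
q = q_of(t)
grad = u - np.sum(q * u)                    # natural gradient of q |-> E_q[u]

lhs = (np.sum(q_of(t + 1e-6) * u) - np.sum(q_of(t - 1e-6) * u)) / 2e-6
rhs = np.sum(q * grad * velocity(t))        # <grad, q*(t)>_{q(t)}
assert np.isclose(lhs, rhs, atol=1e-6)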
In conclusion, we review the derivation of a mapping $f$ between two maximal exponential models in the mixture charts of Equation (7). The expression of $f$ in the mixture charts centered at $p_1$ and $p_2$, respectively, is

f_{p_1 \cdot p_2} = \eta_{p_2} \circ f \circ \eta_{p_1}^{-1} \colon S_{p_1}\mathcal E(\mu_1) \to S_{p_2}\mathcal E(\mu_2) ,

while the derivative $df(q_1) \colon S_{q_1}\mathcal E(\mu_1) \to S_{f(q_1)}\mathcal E(\mu_2)$ is related to the derivative of the expression, $df_{p_1 \cdot p_2}(\eta_{p_1}(q_1))$, by the mixture transports $U^m_{q_1 \to p_1}$ and $U^m_{p_2 \to f(q_1)}$.
It follows that the computation of the derivative from its expression is

df(q)[\overset{\star}{q}] = U^m_{p_2 \to f(q)}\, df_{p_1 \cdot p_2}\big( \eta_{p_1}(q) \big)\big[ U^m_{q \to p_1}\, \overset{\star}{q} \big] .

1.2. Summary of Content

In the following sections, we give both new results and new versions of the known results. The aim is to show the interest of the non-parametric dually affine IG in computing the gradient flow of a constrained KL-divergence.
In Section 2, we show how to use the statistical bundle formalism to compute derivatives of functions defined on the open probability simplex and how to compute natural gradients and total natural gradients of the KL-divergence, the cross entropy, the entropy, and the Jensen–Shannon divergence.
In Section 3, we apply the general computations of the previous section to independence models and marginal conditional probabilities in a factorial product setting. The dually affine methodology systematically reproduces known computations and suggests neat variations of potential interest. In particular, Section 3.5 contains a fully worked example of the derivation of a gradient flow equation of interest in approximate Bayes computations.

2. Total Natural Gradient of the KL-Divergence

The KL-divergence ([5], Ch. 3) as a function of two variables is
D \colon \mathcal E(\Omega) \times \mathcal E(\Omega) \ni (q, r) \mapsto D(q \| r) = E_q\left[ \log\frac{q}{r} \right] .

The computation of the total derivative is well known in Information Theory. However, we provide a proof in the affine setting, expressing the result in the affine charts.
In the exponential chart at p and in the mixture chart at p, the expressions of the probability functions q and r are, respectively,
q = e_p(v) = e^{v - K_p(v)} \cdot p , \qquad r = \eta_p^{-1}(w) = (1 + w) \cdot p .
By plugging (16) into (15) and using Equation (10), one sees that the expressions of the partial KL-divergences are, respectively,
D(e_p(v) \| r) = E_{e_p(v)}[v - K_p(v)] - E_{e_p(v)}\left[ \log\frac{r}{p} \right] = E_{e_p(v)}[v] - K_p(v) - E_{e_p(v)}[s_p(r)] + D(p \| r) = dK_p(v)[v] - K_p(v) - dK_p(v)[s_p(r)] + D(p \| r) ,
and
D\big( q \,\big\|\, \eta_p^{-1}(w) \big) = D(q \| p) - E_q[\log(1 + w)] .
Notice that the peculiar choice of the charts, exponential for the first variable and mixture for the second, is inessential in the finite state space case, because any other choice produces the same final result in the computation of the total natural gradient; it is, however, consistent with the dually affine setting, in which two connections exist between one space and its dual. On the other hand, the expression of the KL-divergence using the exponential chart in both variables is interesting because, in such a case, the resulting expression equals the Bregman divergence of the cumulant function $K_p$,
D\big( e_p(v) \,\big\|\, e_p(w) \big) = K_p(w) - K_p(v) - dK_p(v)[w - v] ,

which, in turn, is the second-order remainder of the Taylor expansion. For example, one closed form is

D\big( e_p(v) \,\big\|\, e_p(w) \big) = \int_0^1 \int_0^1 d^2K_p\big( v + st(w - v) \big)[w - v, w - v]\; s \, ds \, dt .

If $q(s, t) = e_p\big( v + st(w - v) \big) \propto q^{1 - st} r^{st}$, then by Equation (11),

D(q \| r) = \int_0^1 \int_0^1 \mathrm{Var}_{q(s, t)}\!\left( \log\frac{r}{q} \right) s \, ds \, dt .

2.1. Total Natural Gradient of the KL-Divergence

We compute our gradients in the duality induced on each fiber by the covariance; hence, the total natural gradient of the KL-divergence has two components implicitly defined by
\frac{d}{dt} D(q(t) \| r(t)) = \left\langle \overset{\star}{q}(t) , \operatorname{grad}_1 D(q(t) \| r(t)) \right\rangle_{q(t)} + \left\langle \operatorname{grad}_2 D(q(t) \| r(t)) , \overset{\star}{r}(t) \right\rangle_{r(t)} ,

where $\operatorname{grad}_1 D(q \| r)$ is a random variable in the fiber at $q$, while $\operatorname{grad}_2 D(q \| r)$ is a random variable in the fiber at $r$. The adjective total refers to the fact that $D$ is a function of two variables.
Proposition 1.
The total natural gradient of the KL-divergence is
(q, r) \mapsto \operatorname{grad} D(q \| r) = \big( -s_q(r) ,\ -\eta_r(q) \big) \in S\mathcal E(\Omega) \times S^*\mathcal E(\Omega) .

That is, more explicitly, for each couple of smooth curves $t \mapsto q(t)$ and $t \mapsto r(t)$, Equation (19) becomes

\frac{d}{dt} D(q(t) \| r(t)) = - \left\langle \overset{\star}{q}(t) , s_{q(t)}(r(t)) \right\rangle_{q(t)} - \left\langle \eta_{r(t)}(q(t)) , \overset{\star}{r}(t) \right\rangle_{r(t)} .
Proof. 
From Equation (11), the derivative at $v \in S_p\mathcal E(\Omega)$ of Equation (17) in the direction $h = U^e_{q \to p}\, \overset{\star}{q}$ is

d^2K_p(v)[v, h] + dK_p(v)[h] - dK_p(v)[h] - d^2K_p(v)[s_p(r), h] = \mathrm{Cov}_q\big( s_p(q) - s_p(r) , h \big) = \mathrm{Cov}_q\!\left( \log\frac{q}{r} - E_p\left[\log\frac{q}{r}\right] , h \right) = E_q\!\left[ \left( \log\frac{q}{r} - E_q\left[\log\frac{q}{r}\right] \right)\big( h - E_q[h] \big) \right] = - \left\langle s_q(r) , \overset{\star}{q} \right\rangle_q .

The derivative at $w \in S^*_p\mathcal E(\Omega)$ of Equation (18) in the direction $k = U^m_{r \to p}\, \overset{\star}{r}$ is

- E_q\!\left[ \frac{p}{r}\, k \right] = - E_r\!\left[ \frac{q}{r}\, \overset{\star}{r} \right] = - \left\langle \eta_r(q) , \overset{\star}{r} \right\rangle_r .
The gradient computation leads to the corresponding gradient flow equation, whose discretization provides basic optimization algorithms. Here are two basic examples.
Given r E Ω , the solution of the gradient flow equation
\overset{\star}{q}(t) = - \operatorname{grad}_1 D(q(t) \| r) = s_{q(t)}(r) , \qquad q(0) = q_0 ,
is the exponential family
t \mapsto q(t) = e^{e^{-t} v_0 - K_r(e^{-t} v_0)} \cdot r , \qquad v_0 = s_r(q_0) .
In fact, the LHS of Equation (21) is given by
\log q(t) = e^{-t} v_0 - K_r(e^{-t} v_0) + \log r , \qquad \overset{\star}{q}(t) = - e^{-t} v_0 + dK_r(e^{-t} v_0)\big[ e^{-t} v_0 \big] ,
while the RHS is
s_{q(t)}(r) = - e^{-t} v_0 + K_r(e^{-t} v_0) - E_{q(t)}\big[ - e^{-t} v_0 + K_r(e^{-t} v_0) \big] = - e^{-t} v_0 + E_{q(t)}\big[ e^{-t} v_0 \big] .
The conclusion follows from Equation (10).
Given q E Ω , the solution of the gradient flow equation
\overset{\star}{r}(t) = - \operatorname{grad}_2 D(q \| r(t)) = \eta_{r(t)}(q) , \qquad r(0) = r_0 ,
is the mixture family
t \mapsto r(t) = e^{-t} r_0 + (1 - e^{-t})\, q .
The LHS of Equation (22) is
\overset{\star}{r}(t) = \frac{\dot r(t)}{r(t)} = \frac{- e^{-t} r_0 + e^{-t} q}{e^{-t} r_0 + (1 - e^{-t}) q} = \frac{q - r_0}{r_0 + (e^t - 1) q} ,
while the RHS is
\eta_{r(t)}(q) = \frac{q}{e^{-t} r_0 + (1 - e^{-t}) q} - 1 = \frac{q - e^{-t} r_0 - (1 - e^{-t}) q}{e^{-t} r_0 + (1 - e^{-t}) q} = \frac{q - r_0}{r_0 + (e^t - 1) q} .
Notice that in both cases, the t parameter appears in the solution in exponential form. Other forms of the temperature parameter will follow from a weighted form of the gradient flow equation.
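The two closed-form solutions above can be verified numerically. The sketch below is an addition of this edit (NumPy; the helper names pf, s, and eta for the two charts are mine) and simply checks that the stated curves satisfy the corresponding gradient flow equations:

import numpy as np

rng = np.random.default_rng(3)
n = 5

def pf():
    w = rng.uniform(0.5, 1.5, n)
    return w / w.sum()

def s(base, other):            # exponential chart s_base(other)
    u = np.log(other / base)
    return u - np.sum(base * u)

def eta(base, other):          # mixture chart eta_base(other) = other/base - 1
    return other / base - 1

# flow q*(t) = s_{q(t)}(r): candidate solution q(t) = e_r(e^{-t} v0), v0 = s_r(q0)
r, q0 = pf(), pf()
v0 = s(r, q0)
def q_of(t):
    z = r * np.exp(np.exp(-t) * v0)
    return z / z.sum()

# flow r*(t) = eta_{r(t)}(q): candidate solution r(t) = e^{-t} r0 + (1 - e^{-t}) q
r0, q = pf(), pf()
def r_of(t):
    return np.exp(-t) * r0 + (1 - np.exp(-t)) * q

t, eps = 0.7, 1e-6
vel_q = (np.log(q_of(t + eps)) - np.log(q_of(t - eps))) / (2 * eps)
vel_r = (np.log(r_of(t + eps)) - np.log(r_of(t - eps))) / (2 * eps)
assert np.allclose(vel_q, s(q_of(t), r), atol=1e-5)
assert np.allclose(vel_r, eta(r_of(t), q), atol=1e-5)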

2.2. Natural Gradient of the Entropy and Total Natural Gradient of the Cross Entropy

The KL-divergence equals the cross entropy minus the entropy,
D(q \| r) = - E_q[\log r] + E_q[\log q] = H(q, r) - H(q) .
In the exponential chart at p for the first variable, the cross entropy is
H\big( s_p^{-1}(v), r \big) = - E_{e_p(v)}[\log r] = - E_{e_p(v)}\big[ \log r - E_p[\log r] \big] + H(p, r) = - dK_p(v)\big[ \log r - E_p[\log r] \big] + H(p, r) ,

with derivative at $v$ in the direction $h$

- d^2K_p(v)\big[ \log r - E_p[\log r] , h \big] = - \mathrm{Cov}_q(\log r, h) = - E_q\big[ (\log r + H(q, r))\big( h - E_q[h] \big) \big] = - \left\langle \overset{\star}{q} , \log r + H(q, r) \right\rangle_q .

In the mixture chart at $p$ for the second variable,

H\big( q, \eta_p^{-1}(w) \big) = - E_q\big[ \log\big( (1 + w) \cdot p \big) \big] = - E_q[\log(1 + w)] + H(q, p) ,

with derivative at $w$ in the direction $k$,

- E_q\big[ (1 + w)^{-1} k \big] = - E_r\!\left[ \frac{q}{r}\, U^m_{p \to r}\, k \right] = - E_r\!\left[ \left( \frac{q}{r} - 1 \right) \overset{\star}{r} \right] = - \left\langle \eta_r(q) , \overset{\star}{r} \right\rangle_r .
Proposition 2.
The total natural gradient of the cross entropy is
\operatorname{grad} H(q, r) = \big( - \log r - H(q, r) ,\ - \eta_r(q) \big)
and the natural gradient of the entropy is
\operatorname{grad} H(q) = - \log q - H(q) .
Proof. 
The first statement follows from Equations (23) and (24). From the decomposition $H(q) = H(q, r) - D(q \| r)$, we find the gradient of the entropy,

\operatorname{grad} H(q) = - \log r - H(q, r) + s_q(r) = - \log q - H(q) .

2.3. Total Natural Gradient of the Jensen–Shannon Divergence

The Jensen–Shannon divergence [9] is
JS(q, r) = \frac12\, D\!\left( q \,\middle\|\, \frac{q + r}{2} \right) + \frac12\, D\!\left( r \,\middle\|\, \frac{q + r}{2} \right) = H\!\left( \frac{q + r}{2} \right) - \frac12 H(q) - \frac12 H(r) .
It is the minimum value of the function
\phi \colon p \mapsto \frac12 \big( D(q \| p) + D(r \| p) \big) .
In fact,
\operatorname{grad} \phi(p) = - \frac12 \big( \eta_p(q) + \eta_p(r) \big) = - \frac12 \left( \frac{q}{p} - 1 + \frac{r}{p} - 1 \right) = - \left( \frac{\frac12(q + r)}{p} - 1 \right) ,

which vanishes for $p = \frac12(q + r)$.
Let us compute the derivative of $f \colon q \mapsto \frac12(q + r)$. The mixture expression of $f$ at $p$ according to Equation (7) is the affine function

f_p(v) = \eta_p\big( f(\eta_p^{-1}(v)) \big) = \frac{\frac12\big( (1 + v) \cdot p + r \big)}{p} - 1 = \frac12 v + \frac12 \eta_p(r) ,

so that the derivative in the direction $h$ is $df_p(v)[h] = h/2$.
The push-back, according to the mixture transport Equation (3), is
df(q)[\overset{\star}{q}] = U^m_{p \to \frac12(q + r)}\!\left( \frac12\, U^m_{q \to p}\, \overset{\star}{q} \right) = \frac{p}{\frac12(q + r)} \cdot \frac12 \cdot \frac{q}{p}\, \overset{\star}{q} = \frac12\, \frac{q}{\frac12(q + r)}\, \overset{\star}{q} = \frac12\, U^m_{q \to \frac12(q + r)}\, \overset{\star}{q} .
We now compute the gradient of $q \mapsto JS(q, r)$ of Equation (26), using the total natural gradient of the KL-divergence of Proposition 1, the derivative Equation (27), and the duality of the parallel transports Equation (4):
\operatorname{grad}\big( q \mapsto JS(q, r) \big) = - \frac12\, s_q\!\left( \tfrac12(q + r) \right) - \frac12 \cdot \frac12\, U^e_{\frac12(q + r) \to q}\, \eta_{\frac12(q + r)}(q) - \frac12 \cdot \frac12\, U^e_{\frac12(q + r) \to q}\, \eta_{\frac12(q + r)}(r) = - \frac12\, s_q\!\left( \tfrac12(q + r) \right) - \frac14\, U^e_{\frac12(q + r) \to q}\!\left[ \eta_{\frac12(q + r)}(q) + \eta_{\frac12(q + r)}(r) \right] = - \frac12\, s_q\!\left( \tfrac12(q + r) \right) - \frac14\, U^e_{\frac12(q + r) \to q}\!\left[ \left( \frac{q}{\frac12(q + r)} - 1 \right) + \left( \frac{r}{\frac12(q + r)} - 1 \right) \right] = - \frac12\, s_q\!\left( \tfrac12(q + r) \right) .
It is also instructive to use the expression of the Jensen–Shannon divergence in terms of entropies. From Equation (25),
\operatorname{grad}\big( q \mapsto JS(q, r) \big) = \frac12\, U^e_{\frac12(q + r) \to q}\!\left[ - \log\tfrac12(q + r) - H\!\left( \tfrac12(q + r) \right) \right] - \frac12\big[ - \log q - H(q) \big] = \frac12\left[ - \log\tfrac12(q + r) + E_q\!\left[ \log\tfrac12(q + r) \right] + \log q - E_q[\log q] \right] = - \frac12\, s_q\!\left( \tfrac12(q + r) \right) .
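A numerical check of the identity grad(q ↦ JS(q, r)) = −½ s_q(½(q+r)) is added here as an illustration (NumPy; the curve through q and the helper names D, JS, s, pf are choices of this edit). It compares a finite-difference directional derivative with the inner product against the claimed gradient:

import numpy as np

rng = np.random.default_rng(4)
n = 6

def pf():
    w = rng.uniform(0.5, 1.5, n)
    return w / w.sum()

q, r = pf(), pf()

def D(a, b):                       # KL-divergence D(a || b)
    return np.sum(a * np.log(a / b))

def JS(a, b):
    m = 0.5 * (a + b)
    return 0.5 * D(a, m) + 0.5 * D(b, m)

def s(base, other):                # exponential chart s_base(other)
    u = np.log(other / base)
    return u - np.sum(base * u)

# exponential curve through q with velocity q*(0) = a (q-centered)
a = rng.normal(size=n); a -= np.sum(q * a)
def q_of(t):
    z = q * np.exp(t * a)
    return z / z.sum()

eps = 1e-6
lhs = (JS(q_of(eps), r) - JS(q_of(-eps), r)) / (2 * eps)   # d/dt JS(q(t), r) at t = 0
grad = -0.5 * s(q, 0.5 * (q + r))                          # claimed natural gradient
rhs = np.sum(q * grad * a)                                 # <grad, q*(0)>_q
assert np.isclose(lhs, rhs, atol=1e-6)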

3. Product Sample Space

This section uses $\Omega = \Omega_1 \times \Omega_2$ as a factorial sample space. For each $r \in \mathcal E(\Omega)$, the margins are $r_1 \in \mathcal E(\Omega_1)$ and $r_2 \in \mathcal E(\Omega_2)$. In the mean-field assumption, the model equals the tensor product of the margins,

\bar r = r_1 \otimes r_2 \in \mathcal E(\Omega_1) \otimes \mathcal E(\Omega_2) \subset \mathcal E(\Omega) .

The velocities are, respectively,

\overset{\star}{r}(x, y; t) = \frac{d}{dt} \log r(x, y; t) ,

\overset{\star}{\bar r}(x, y; t) = \frac{d}{dt} \log \bar r(x, y; t) = \overset{\star}{r}_1(x; t) + \overset{\star}{r}_2(y; t) .
Below, we will discuss the optimality of a mean-field approximation.

3.1. Product Sample Space: Marginalization

The (first) marginalization is
\Pi_1 \colon \mathcal E(\Omega_1 \times \Omega_2) \ni r \mapsto r_1 \in \mathcal E(\Omega_1) , \qquad r_1(x) = \sum_b r(x, b) .
We will compute the bundle derivative of Equation (30) following the scheme of Equation (14).
Proposition 3.
The derivative d Π 1 of the marginalization Equation (30) is
d\Pi_1 \colon S^*\mathcal E(\Omega) \ni (r, \overset{\star}{r}) \mapsto \big( r_1 ,\ E_r[\overset{\star}{r} \mid \Pi_1] \big) \in S^*\mathcal E(\Omega_1) .
Proof. 
In the mixture charts centered at $p_1 \otimes p_2$ and $p_1$, respectively, the expression of the marginalization is

\eta_{p_1} \circ \Pi_1 \circ \eta_{p_1 \otimes p_2}^{-1}(v) = \frac{\Pi_1\big( \eta_{p_1 \otimes p_2}^{-1}(v) \big)}{p_1} - 1 = \frac{\sum_b (1 + v(\cdot, b))\, p_1\, p_2(b)}{p_1} - 1 = \sum_b v(\cdot, b)\, p_2(b) .

Note that the expression in Equation (31) is linear. Hence, the derivative at $v$ in the direction $h$ is $x \mapsto \sum_b h(x, b)\, p_2(b)$, with $h = U^m_{r \to p_1 \otimes p_2}\, \overset{\star}{r}$, so that the bundle derivative is

d\Pi_1(r)[\overset{\star}{r}] = U^m_{p_1 \to r_1} \sum_b \frac{r(\cdot, b)}{p_1\, p_2(b)}\, \overset{\star}{r}(\cdot, b)\, p_2(b) = \frac{p_1}{r_1} \sum_b \frac{r(\cdot, b)}{p_1\, p_2(b)}\, \overset{\star}{r}(\cdot, b)\, p_2(b) = \sum_b \overset{\star}{r}(\cdot, b)\, r_{2|1}(b \mid \cdot) = E_r[\overset{\star}{r} \mid \Pi_1] .
There is an interesting relation between conditional expectation and mixture transport. The conditional expectation commutes with the mixture transports,
U^m_{r_1 \to q_1}\, E_r\big[ U^m_{q \to r}\, v \mid \Pi_1 \big] = E_q[v \mid \Pi_1] , \qquad v \in S^*_q\mathcal E(\Omega) .
It is a way to express Bayes’ theorem for conditional expectations. For all ϕ ,
E_q\big[ U^m_{r_1 \to q_1} E_r[U^m_{q \to r}\, v \mid \Pi_1]\, \phi(\Pi_1) \big] = E_{q_1}\!\left[ \frac{r_1}{q_1}\, E_r\!\left[ \frac{q}{r}\, v \,\middle|\, \Pi_1 \right] \phi(\Pi_1) \right] = E_{r_1}\!\left[ E_r\!\left[ \frac{q}{r}\, v \,\middle|\, \Pi_1 \right] \phi(\Pi_1) \right] = E_r\!\left[ E_r\!\left[ \frac{q}{r}\, v \,\middle|\, \Pi_1 \right] \phi(\Pi_1) \right] = E_r\!\left[ \frac{q}{r}\, v\, \phi(\Pi_1) \right] = E_q[v\, \phi(\Pi_1)] .
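The next sketch, added in this edit (NumPy arrays indexed by Ω₁ × Ω₂; the helper names margin1 and cond_exp1 are mine), illustrates Proposition 3 and the commutation of conditional expectation with the mixture transport on a small product space:

import numpy as np

rng = np.random.default_rng(5)
n1, n2 = 3, 4
w = rng.uniform(0.5, 1.5, (n1, n2))
r = w / w.sum()                                   # joint pf on Omega_1 x Omega_2

# curve through r with r-centered velocity a = r*(0)
a = rng.normal(size=(n1, n2)); a -= np.sum(r * a)
def r_of(t):
    z = r * np.exp(t * a)
    return z / z.sum()

def margin1(joint):                               # Pi_1(r)
    return joint.sum(axis=1)

def cond_exp1(joint, v):                          # E_joint[v | Pi_1], a function of x
    return (joint * v).sum(axis=1) / joint.sum(axis=1)

eps = 1e-6
# velocity of the marginal curve d/dt log r_1(t) at t = 0 ...
lhs = (np.log(margin1(r_of(eps))) - np.log(margin1(r_of(-eps)))) / (2 * eps)
# ... equals the conditional expectation of the joint velocity (Proposition 3)
assert np.allclose(lhs, cond_exp1(r, a), atol=1e-6)

# mixture transport commutes with conditional expectation:
# U^m_{r_1 -> q_1} E_r[ U^m_{q -> r} v | Pi_1 ] = E_q[ v | Pi_1 ]
w2 = rng.uniform(0.5, 1.5, (n1, n2)); q = w2 / w2.sum()
v = rng.normal(size=(n1, n2)); v -= np.sum(q * v)          # v in the fiber at q
transported = cond_exp1(r, (q / r) * v) * (margin1(r) / margin1(q))
assert np.allclose(transported, cond_exp1(q, v), atol=1e-12)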

3.2. Product Sample Space: Mean-Field Approximation

The derivative of the joint marginalization
\Pi \colon \mathcal E(\Omega_1 \times \Omega_2) \ni r \mapsto \Pi_1(r) \otimes \Pi_2(r) \in \mathcal E(\Omega_1 \times \Omega_2) ,
follows from the derivative of the marginalization in Equation (29).
Proposition 4.
The derivative d Π of the joint marginalization in  Section 3.2  is
d\Pi \colon (r, \overset{\star}{r}) \mapsto \big( r_1 \otimes r_2 ,\ E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big) .
Proof. 
Compose the partial derivatives with the mapping $(r_1, r_2) \mapsto r_1 \otimes r_2$.
The decomposition of the velocity $\overset{\star}{r}$ according to Equation (32) provides a finer decomposition than the sum $\overset{\star}{r}_1 + \overset{\star}{r}_2$ of Equations (28) and (29) and provides a definition of the mean-field approximation. In the language of the ANOVA decomposition of statistical interactions, the derivative part in Equation (32) is the sum of the simple effects of the velocity,

\overset{\star}{q} = \overset{\star}{q}_1 + \overset{\star}{q}_2 + \overset{\star}{q}_{12} ,

where $\overset{\star}{q}_i = E_q[\overset{\star}{q} \mid \Pi_i]$, $i = 1, 2$, and the last term is the interaction, namely the $q$-orthogonal residual. See [22] for a discussion of the ANOVA decomposition in the context of the statistical bundle.
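To make the ANOVA decomposition of a velocity concrete, here is an added sketch (NumPy; it assumes, for simplicity of illustration, a product probability function q = q₁ ⊗ q₂, for which the simple effects and the interaction are mutually q-orthogonal):

import numpy as np

rng = np.random.default_rng(6)
n1, n2 = 3, 4

def pf(n):
    w = rng.uniform(0.5, 1.5, n)
    return w / w.sum()

q1, q2 = pf(n1), pf(n2)
q = np.outer(q1, q2)                        # product (mean-field) probability function

# a q-centered random variable on the product space, e.g. a velocity
v = rng.normal(size=(n1, n2)); v -= np.sum(q * v)

# simple effects: conditional expectations given the two projections
v1 = (q * v).sum(axis=1) / q.sum(axis=1)    # E_q[v | Pi_1], a function of x
v2 = (q * v).sum(axis=0) / q.sum(axis=0)    # E_q[v | Pi_2], a function of y
v12 = v - v1[:, None] - v2[None, :]         # interaction (residual)

# the interaction has vanishing conditional expectations under the product q
assert np.allclose((q * v12).sum(axis=1), 0, atol=1e-12)   # E_q[v12 | Pi_1] = 0
assert np.allclose((q * v12).sum(axis=0), 0, atol=1e-12)   # E_q[v12 | Pi_2] = 0
# and the three terms reconstruct v
assert np.allclose(v1[:, None] + v2[None, :] + v12, v)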
The equation for the total natural gradient of the KL-divergence and the computation of the derivative above provide the natural gradients of the divergence between the joint probability function and the mean-field approximation. In information theory [23], the KL-divergence in Equation (34) is called mutual information.
Proposition 5.
The natural gradients of the divergences of a joint distribution r and its mean-field approximation Π ( r ) are
\operatorname{grad} D(\Pi(r) \| r) = E_{r_1 \otimes r_2}\big[ s_r(r_1 \otimes r_2) \mid \Pi_1 \big] + E_{r_1 \otimes r_2}\big[ s_r(r_1 \otimes r_2) \mid \Pi_2 \big] - \eta_r(r_1 \otimes r_2) ,

\operatorname{grad} D(r \| \Pi(r)) = - s_r(r_1 \otimes r_2) + E_r\big[ \eta_r(r_1 \otimes r_2) \mid \Pi_1 \big] + E_r\big[ \eta_r(r_1 \otimes r_2) \mid \Pi_2 \big] .
The conditional terms in Equation (33) depend on the mean-field model; hence, we could express them as a disintegration of r. For example,
E_{r_1 \otimes r_2}\big[ s_r(r_1 \otimes r_2) \,\big|\, \Pi_1 = x \big] = \sum_y r_2(y) \log\frac{r_1(x)\, r_2(y)}{r(x, y)} - \sum_{x, y} r_1(x)\, r_2(y) \log\frac{r_1(x)\, r_2(y)}{r(x, y)} = H\big( r_2, r_{2|1}(\cdot \mid x) \big) - \sum_x r_1(x)\, H\big( r_2, r_{2|1}(\cdot \mid x) \big) ,
where the last term is the conditional entropy H Π 2 | Π 1 under r.
Proof of Equation (33).
We find the natural gradient of $r \mapsto D(\Pi(r) \| r)$ by computing, with Equations (20) and (32), the variation along a smooth curve $t \mapsto r(t) \in \mathcal E(\Omega_1 \times \Omega_2)$ such that $r(0) = r$ and $\overset{\star}{r}(0) = \overset{\star}{r}$. It holds that

\left. \frac{d}{dt} D(\Pi(r(t)) \| r(t)) \right|_{t = 0} = - \big\langle s_{\Pi(r)}(r) , d\Pi(r)[\overset{\star}{r}] \big\rangle_{\Pi(r)} - \big\langle \eta_r(\Pi(r)) , \overset{\star}{r} \big\rangle_r = - \big\langle s_{r_1 \otimes r_2}(r) , E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big\rangle_{r_1 \otimes r_2} - \big\langle \eta_r(r_1 \otimes r_2) , \overset{\star}{r} \big\rangle_r .

We want to present the first term of the RHS as an inner product at $r$ applied to $\overset{\star}{r}$. Let us push the inner product from $r_1 \otimes r_2$ to $r$ with Equation (5). It holds that

- \big\langle s_{r_1 \otimes r_2}(r) , E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big\rangle_{r_1 \otimes r_2} = - \big\langle U^e_{r_1 \otimes r_2 \to r}\, s_{r_1 \otimes r_2}(r) , U^m_{r_1 \otimes r_2 \to r}\big( E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big) \big\rangle_r = \big\langle s_r(r_1 \otimes r_2) , \frac{r_1 \otimes r_2}{r}\big( E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big) \big\rangle_r = E_r\!\left[ s_r(r_1 \otimes r_2)\, \frac{r_1 \otimes r_2}{r}\big( E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big) \right] = E_r\!\left[ \left( E_r\!\left[ s_r(r_1 \otimes r_2)\, \frac{r_1 \otimes r_2}{r} \,\middle|\, \Pi_1 \right] + E_r\!\left[ s_r(r_1 \otimes r_2)\, \frac{r_1 \otimes r_2}{r} \,\middle|\, \Pi_2 \right] \right) \overset{\star}{r} \right] = \big\langle E_{r_1 \otimes r_2}[ s_r(r_1 \otimes r_2) \mid \Pi_1 ] + E_{r_1 \otimes r_2}[ s_r(r_1 \otimes r_2) \mid \Pi_2 ] , \overset{\star}{r} \big\rangle_r ,

where we used $U^e_{r_1 \otimes r_2 \to r}\, s_{r_1 \otimes r_2}(r) = - s_r(r_1 \otimes r_2)$. The last equality follows from

E_r\!\left[ s_r(r_1 \otimes r_2)\, \frac{r_1 \otimes r_2}{r} \,\middle|\, \Pi_i \right] = E_{r_1 \otimes r_2}\big[ s_r(r_1 \otimes r_2) \mid \Pi_i \big] , \qquad i = 1, 2 .
Proof of Equation (34).
\left. \frac{d}{dt} D(r(t) \| \Pi(r(t))) \right|_{t = 0} = - \big\langle s_r(r_1 \otimes r_2) , \overset{\star}{r} \big\rangle_r - \big\langle \eta_{r_1 \otimes r_2}(r) , d\Pi(r)[\overset{\star}{r}] \big\rangle_{r_1 \otimes r_2} = - \big\langle s_r(r_1 \otimes r_2) , \overset{\star}{r} \big\rangle_r - \big\langle \eta_{r_1 \otimes r_2}(r) , E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big\rangle_{r_1 \otimes r_2}

and compute the second term as

\big\langle \eta_{r_1 \otimes r_2}(r) , E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big\rangle_{r_1 \otimes r_2} = \big\langle U^m_{r_1 \otimes r_2 \to r}\, \eta_{r_1 \otimes r_2}(r) , U^e_{r_1 \otimes r_2 \to r}\big( E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big) \big\rangle_r = - \big\langle \eta_r(r_1 \otimes r_2) , E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big\rangle_r = - E_r\Big[ \eta_r(r_1 \otimes r_2)\big( E_r[\overset{\star}{r} \mid \Pi_1] + E_r[\overset{\star}{r} \mid \Pi_2] \big) \Big] = - E_r\Big[ \big( E_r[\eta_r(r_1 \otimes r_2) \mid \Pi_1] + E_r[\eta_r(r_1 \otimes r_2) \mid \Pi_2] \big)\, \overset{\star}{r} \Big] = - \big\langle E_r[\eta_r(r_1 \otimes r_2) \mid \Pi_1] + E_r[\eta_r(r_1 \otimes r_2) \mid \Pi_2] , \overset{\star}{r} \big\rangle_r ,

where $U^m_{r_1 \otimes r_2 \to r}\, \eta_{r_1 \otimes r_2}(r) = - \eta_r(r_1 \otimes r_2)$.
Equation (34) follows. □

3.3. Product Sample Space: Kantorovich and Schrödinger

If Π denotes the joint marginalization, the set of transport plans with margins q 1 and q 2 is
\Gamma(q_1, q_2) = \Pi^{-1}(q_1 \otimes q_2) = \big\{ q \mid \Pi(q) = q_1 \otimes q_2 \big\} .
Here, we deal with a classical topic with considerable literature. We mention only the monograph of ref. [12] and, from the Information Geometry perspective, ref. [11,22].
Let us consider first the Kantorovich problem. Given the cost function (i.e., potential function)
U \colon \Omega \to \mathbb R ,

and a curve $t \mapsto q(t) \in \Gamma(q_1, q_2)$, we want to minimize the cost

S(q(t)) = E_{q(t)}[U] .

As $E_{q(t)}[\phi(\Pi_i)] = E_{q_i}[\phi]$ for all $\phi$,

0 = \frac{d}{dt} E_{q(t)}[\phi(\Pi_i)] = \big\langle \phi(\Pi_i) - E_{q(t)}[\phi(\Pi_i)] , \overset{\star}{q}(t) \big\rangle_{q(t)} ,

so that $E_{q(t)}[\overset{\star}{q}(t) \mid \Pi_i] = 0$. The velocity of a curve in the set of transport plans is an interaction. Now, the derivative of the cost is

\frac{d}{dt} S(q(t)) = \big\langle U - E_{q(t)}[U] , \overset{\star}{q}(t) \big\rangle_{q(t)} .

From the interaction property of $\overset{\star}{q}(t)$, it follows that if the ANOVA decomposition

U = E_q[U] + \big( u_1(\Pi_1; q) + u_2(\Pi_2; q) \big) + u_{12}(\Pi_1, \Pi_2; q)

holds, then

\frac{d}{dt} S(q(t)) = \big\langle u_{12}(\Pi_1, \Pi_2; q(t)) , \overset{\star}{q}(t) \big\rangle_{q(t)} .
The Schrödinger problem is similar. Given the cost function (i.e., potential function)

U \colon \Omega \to \mathbb R ,

consider the exponential perturbation of the mean-field probability function

\exp\!\left( - \frac{U}{\epsilon} - \psi(\epsilon) \right) \cdot (q_1 \otimes q_2) .

The parameter $\epsilon > 0$ is called the temperature, and the normalizing constant is

\psi(\epsilon) = \log E_{q_1 \otimes q_2}\big[ e^{-U/\epsilon} \big] .

The KL-divergence of $q$ relative to the perturbed probability function of Equation (36) is

S_\epsilon(q) = D\big( q \,\big\|\, e^{-U/\epsilon - \psi(\epsilon)} \cdot (q_1 \otimes q_2) \big) = E_q\big[ \log q - \log(q_1 \otimes q_2) \big] + \epsilon^{-1} E_q[U] + \psi(\epsilon) = \epsilon^{-1}\big( S(q) + \epsilon\, D(q \| q_1 \otimes q_2) + \epsilon\, \psi(\epsilon) \big) .

The gradient of $q \mapsto S_\epsilon(q)$ is, from Equations (35) and (34),

\operatorname{grad} S_\epsilon(q) = \epsilon^{-1}\big( U - E_q[U] \big) - s_q(q_1 \otimes q_2) + E_q\big[ \eta_q(q_1 \otimes q_2) \mid \Pi_1 \big] + E_q\big[ \eta_q(q_1 \otimes q_2) \mid \Pi_2 \big] .

Only the interaction part is relevant in the constrained problem $q \in \Gamma(q_1, q_2)$, and the interaction kills the two conditional expectations, which leaves

\big( U - E_q[U] \big)_{12; q} - \big( s_q(q_1 \otimes q_2) \big)_{12; q} .
We refer to [22] for a method to compute the interaction part of a random variable.
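As a hedged illustration of one way to obtain the interaction part of a random variable — added here, and not necessarily the method of [22] — the following sketch (NumPy) computes the q-orthogonal projection onto the additive functions u₁(Π₁) + u₂(Π₂) by weighted least squares and takes the residual as U_{12;q}:

import numpy as np

rng = np.random.default_rng(7)
n1, n2 = 3, 4
w = rng.uniform(0.5, 1.5, (n1, n2))
q = w / w.sum()                                   # a generic (non-product) pf
U = rng.normal(size=(n1, n2))                     # cost / potential function on Omega

# design matrix of additive functions (x-indicators and y-indicators)
X = np.zeros((n1 * n2, n1 + n2))
for x in range(n1):
    for y in range(n2):
        X[x * n2 + y, x] = 1.0
        X[x * n2 + y, n1 + y] = 1.0

sw = np.sqrt(q).ravel()                           # weighted least squares in L^2(q)
coef, *_ = np.linalg.lstsq(X * sw[:, None], U.ravel() * sw, rcond=None)
additive = (X @ coef).reshape(n1, n2)             # additive part E_q[U] + u_1 + u_2
U12 = U - additive                                # interaction part U_{12;q}

# the interaction has vanishing q-conditional expectations given each margin
assert np.allclose((q * U12).sum(axis=1), 0, atol=1e-10)
assert np.allclose((q * U12).sum(axis=0), 0, atol=1e-10)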

3.4. Product Sample Space: Conditional Probability Function

When the sample space is a product, Ω = Ω 1 × Ω 2 , we can represent each probability function in the maximal exponential model via conditioning on one margin,
\mathcal E(\Omega_1 \times \Omega_2) = \big\{ q = q_{1|2} \cdot q_2 \;\big|\; q_{1|2}(\cdot \mid y) \in \mathcal E(\Omega_1) , \ y \in \Omega_2 , \ q_2 \in \mathcal E(\Omega_2) \big\}

\phantom{\mathcal E(\Omega_1 \times \Omega_2)} = \big\{ q = q_{2|1} \cdot q_1 \;\big|\; q_{2|1}(\cdot \mid x) \in \mathcal E(\Omega_2) , \ x \in \Omega_1 , \ q_1 \in \mathcal E(\Omega_1) \big\} .
The two representations are
\mathcal E(\Omega_1)^{\Omega_2} \times \mathcal E(\Omega_2) \;\leftrightarrow\; \mathcal E(\Omega_1 \times \Omega_2) \;\leftrightarrow\; \mathcal E(\Omega_1) \times \mathcal E(\Omega_2)^{\Omega_1} .
Following the approach of [10], ([5], Ch. 11), we look at the transition mapping
\Omega_2 \ni y \mapsto q_{1|2}(\cdot \mid y) \in \mathcal E(\Omega_1)

as a family of probability functions representing alternative probability models. The other transition mapping

\Omega_1 \ni x \mapsto q_{2|1}(\cdot \mid x) \in \mathcal E(\Omega_2)

is the discriminator, that is, $q_{2|1}(y \mid x)$ is the probability that the sample $x$ comes from $q_{1|2}(\cdot \mid y)$.
The second mapping in Equation (39), read from right to left,

B \colon \mathcal E(\Omega_1) \times \mathcal E(\Omega_2)^{\Omega_1} \ni \big( q_1 , (q_{2|1}(\cdot \mid x))_{x \in \Omega_1} \big) \mapsto q = q_{2|1} \cdot q_1 \in \mathcal E(\Omega_1 \times \Omega_2) ,
maps the vector of the 1-margin and the set of alternative probability functions to the joint probability function. The kinematics of Equation (40), that is, the computation of velocities, is
\overset{\star}{q}(x, y; t) = \overset{\star}{q}_1(x; t) + \overset{\star}{q}_{2|1}(y \mid x; t) .

Hence, the total derivative of B is

dB\big( q_1 , (q_{2|1}(\cdot \mid x))_{x \in \Omega_1} \big)\big[ \overset{\star}{q}_1 , (\overset{\star}{q}_{2|1}(\cdot \mid x))_{x \in \Omega_1} \big] \colon (x, y) \mapsto \overset{\star}{q}_1(x) + \overset{\star}{q}_{2|1}(y \mid x) .

The transposed total derivative is defined by

\big\langle v , dB(q_1, q_{2|1})[\overset{\star}{q}_1, \overset{\star}{q}_{2|1}] \big\rangle_{B(q_1, q_{2|1})} = \big\langle dB(q_1, q_{2|1})^*[v] , (\overset{\star}{q}_1, \overset{\star}{q}_{2|1}) \big\rangle_{(q_1, q_{2|1})} ,

that is,

\sum_{x, y} v(x, y)\big( \overset{\star}{q}_1(x) + \overset{\star}{q}_{2|1}(y \mid x) \big)\, q_1(x)\, q_{2|1}(y \mid x) = \sum_{x, y} v(x, y)\, \overset{\star}{q}_1(x)\, q_1(x)\, q_{2|1}(y \mid x) + \sum_{x, y} v(x, y)\, \overset{\star}{q}_{2|1}(y \mid x)\, q_1(x)\, q_{2|1}(y \mid x) = \sum_x \Big( \sum_y v(x, y)\, q_{2|1}(y \mid x) \Big) \overset{\star}{q}_1(x)\, q_1(x) + \sum_x q_1(x) \sum_y v(x, y)\, \overset{\star}{q}_{2|1}(y \mid x)\, q_{2|1}(y \mid x) = \big\langle E_q[v \mid \Pi_1] , \overset{\star}{q}_1 \big\rangle_{q_1} + \sum_x q_1(x) \sum_y v(x, y)\, \overset{\star}{q}_{2|1}(y \mid x)\, q_{2|1}(y \mid x) = \big\langle E_q[v \mid \Pi_1] , \overset{\star}{q}_1 \big\rangle_{q_1} + \sum_x q_1(x) \big\langle v(x, \cdot) - E_{q_{2|1}(\cdot \mid x)}[v(x, \cdot)] , \overset{\star}{q}_{2|1}(\cdot \mid x) \big\rangle_{q_{2|1}(\cdot \mid x)} .

In conclusion, the transposed total derivative is

dB(q_1, q_{2|1})^*[v] = \Big( E_q[v \mid \Pi_1] , \ \big( q_1(x)\big( v(x, \cdot) - E_{q_{2|1}(\cdot \mid x)}[v(x, \cdot)] \big) \big)_{x \in \Omega_1} \Big) .
It is interesting to derive $dB$ in the mixture atlas. The mixture expression of $B$ with respect to $(p_1, p_2^{\Omega_1})$ and $p_1 \otimes p_2$ is

B_{p_1, p_2} \colon \big( v_1 , (v_{2|1}(\cdot \mid x))_{x \in \Omega_1} \big) \mapsto (x, y) \mapsto \frac{(1 + v_1(x))\, p_1(x)\, (1 + v_{2|1}(y \mid x))\, p_2(y)}{p_1(x)\, p_2(y)} - 1 = (1 + v_1)(1 + v_{2|1}) - 1 ,

and the total derivative in the directions $h_1, h_{2|1}$ is

dB_{p_1, p_2}\big( v_1 , (v_{2|1}(\cdot \mid z))_{z \in \Omega_1} \big)\big[ h_1 , (h_{2|1}(\cdot \mid z))_{z \in \Omega_1} \big] = (1 + v_{2|1})\, h_1 + \sum_z (1 + v_1(z))\, h_{2|1}(\cdot \mid z) .

The push-back of the total derivative expression to the statistical bundles uses the equations

1 + v_1 = \frac{q_1}{p_1} , \quad 1 + v_{2|1} = \frac{q_{2|1}}{p_2} , \quad h_1 = U^m_{q_1 \to p_1}\, \overset{\star}{q}_1 , \quad h_{2|1}(\cdot \mid z) = U^m_{q_{2|1}(\cdot \mid z) \to p_2}\, \overset{\star}{q}_{2|1}(\cdot \mid z)

to obtain

dB\big( q_1 , (q_{2|1}(\cdot \mid z))_{z \in \Omega_1} \big)\big[ \overset{\star}{q}_1 , (\overset{\star}{q}_{2|1}(\cdot \mid z))_{z \in \Omega_1} \big] = U^m_{p_1 \otimes p_2 \to q_{2|1} \cdot q_1}\!\left( \frac{q_{2|1}}{p_2}\, U^m_{q_1 \to p_1}\, \overset{\star}{q}_1 + \sum_z \frac{q_1(z)}{p_1(z)}\, U^m_{q_{2|1}(\cdot \mid z) \to p_2}\, \overset{\star}{q}_{2|1}(\cdot \mid z) \right) = \frac{p_1 \otimes p_2}{q_{2|1} \cdot q_1}\!\left( \frac{q_{2|1}}{p_2}\, \frac{q_1}{p_1}\, \overset{\star}{q}_1 + \sum_z \frac{q_1(z)}{p_1(z)}\, \frac{q_{2|1}(\cdot \mid z)}{p_2}\, \overset{\star}{q}_{2|1}(\cdot \mid z) \right) .
In our affine language, we repeat the computations in [10]. In particular, we derive the natural gradient of a composite function by the equation

\operatorname{grad}(\Phi \circ B) = dB^*\big[ \operatorname{grad} \Phi(B) \big] .
For a given target p E Ω 1 × Ω 2 , we express the KL-divergence as a function of B in Equation (40),
K_p \colon \big( q_1 , (q_{2|1}(\cdot \mid x))_{x \in \Omega_1} \big) \mapsto D\big( p \,\big\|\, B\big( q_1 , (q_{2|1}(\cdot \mid x))_{x \in \Omega_1} \big) \big) = \sum_{x, y} p(x, y) \log\frac{p(x, y)}{q_1(x)\, q_{2|1}(y \mid x)} .
We have
\operatorname{grad} K_p\big( B(q_1, q_{2|1}) \big) = - \eta_{q_{2|1} \cdot q_1}(p) = 1 - \frac{p}{q_{2|1} \cdot q_1} .
The first component of grad K p B is
\big[ dB(q_1, q_{2|1})^*\, \operatorname{grad} K_p(B(q_1, q_{2|1})) \big]_1 \colon x \mapsto \sum_z \left( 1 - \frac{p(x, z)}{q_1(x)\, q_{2|1}(z \mid x)} \right) q_{2|1}(z \mid x) = \sum_z \frac{q_1(x)\, q_{2|1}(z \mid x) - p(x, z)}{q_1(x)} = 1 - \frac{p_1(x)}{q_1(x)} ,

so that $\big[ \operatorname{grad} K_p \circ B(q_1, q_{2|1}) \big]_1 = - \eta_{q_1}(p_1)$. The $x$-component is

\big[ dB(q_1, q_{2|1})^*\, \operatorname{grad} K_p(B(q_1, q_{2|1})) \big]_x \colon y \mapsto q_1(x)\left( 1 - \frac{p(x, y)}{q_1(x)\, q_{2|1}(y \mid x)} \right) - q_1(x) \sum_y \left( 1 - \frac{p(x, y)}{q_1(x)\, q_{2|1}(y \mid x)} \right) q_{2|1}(y \mid x) = - \frac{p(x, y)}{q_{2|1}(y \mid x)} + p_1(x) ,

so that $\big[ dB(q_1, q_{2|1})^*\, \operatorname{grad} K_p(B(q_1, q_{2|1})) \big]_x = - p_1(x)\, \eta_{q_{2|1}(\cdot \mid x)}\big( p_{2|1}(\cdot \mid x) \big)$.
We now assume a target probability function $g \in \mathcal E(\Omega_1)$ and consider the probability function on the product sample space where all the model probability functions equal $g$ and the discriminator is uniform, that is, $p(x, y) = g(x)/m$, $m = \#\Omega_2$.
In this case, Equation (41) becomes

\big[ \operatorname{grad} K_p \circ B(q_1, q_{2|1}) \big]_1 = - \eta_{q_1}(g)
and Equation (42) becomes

\big[ \operatorname{grad} K_p \circ B(q_1, q_{2|1}) \big]_x \colon y \mapsto q_1(x)\left( 1 - \frac{g(x)/m}{q_1(x)\, q_{2|1}(y \mid x)} \right) .

3.5. Variational Bayes

We revisit and develop some computations of ([13], § 2.2). We keep the same notation as above so that Bayes’ formula is
q_{2|1}(y \mid x) = \frac{q_{1|2}(x \mid y)\, q_2(y)}{q_1(x)} = \frac{q_{1|2}(x \mid y)\, q_2(y)}{\sum_y q_{1|2}(x \mid y)\, q_2(y)} ,
where x is a sample value and y is a latent variable value.
For a fixed $x \in \Omega_1$, we look for an $r$ in some model $M \subset \mathcal E(\Omega_2)$ in order to approximate the conditional $q_{2|1}(\cdot \mid x)$. If $L(r, x)$ satisfies

\log q_1(x) = D\big( r \,\big\|\, q_{2|1}(\cdot \mid x) \big) + L(r, x) ,
then
L(r, x) = \log q_1(x) - \sum_y r(y) \log\frac{r(y)}{q_{2|1}(y \mid x)} = \sum_y r(y)\left( \log q_1(x) - \log\frac{r(y)}{q_{2|1}(y \mid x)} \right) = \sum_y r(y) \log\frac{q_{2|1}(y \mid x)\, q_1(x)}{r(y)} = \sum_y r(y) \log\frac{q_{12}(x, y)}{r(y)} = \sum_y r(y) \log\frac{q_2(y)}{r(y)} + \sum_y r(y) \log\frac{q_{12}(x, y)}{q_2(y)} = - D(r \| q_2) + \sum_y r(y) \log q_{1|2}(x \mid y) .
The so-called variational lower bound follows from $D(r \| q_{2|1}(\cdot \mid x)) \geq 0$,

\log q_1(x) = D\big( r \,\big\|\, q_{2|1}(\cdot \mid x) \big) - D(r \| q_2) + E_r\big[ \log q_{1|2}(x \mid \cdot) \big] \geq - D(r \| q_2) + E_r\big[ \log q_{1|2}(x \mid \cdot) \big] = L(r; x)

for all $r \in M$. The bound is exact, because $r(y) = q_{2|1}(y \mid x)$ for all $y$ if, and only if, $D(r \| q_{2|1}(\cdot \mid x)) = 0$. The lower-bound variation along a curve $t \mapsto r(t) \in M$ is

\frac{d}{dt} L(r(t); x) = \big\langle \overset{\star}{r}(t) , s_{r(t)}(q_2) \big\rangle_{r(t)} + \big\langle \overset{\star}{r}(t) , \log q_{1|2}(x \mid \cdot) - E_{r(t)}\big[ \log q_{1|2}(x \mid \cdot) \big] \big\rangle_{r(t)} = \big\langle \overset{\star}{r}(t) , \log\frac{q_2}{r(t)} + \log q_{1|2}(x \mid \cdot) - E_{r(t)}\!\left[ \log\frac{q_2}{r(t)} + \log q_{1|2}(x \mid \cdot) \right] \big\rangle_{r(t)} = \big\langle \overset{\star}{r}(t) , \log\frac{q_{12}(x, \cdot)}{r(t)} - E_{r(t)}\!\left[ \log\frac{q_{12}(x, \cdot)}{r(t)} \right] \big\rangle_{r(t)} .
If the model M is an exponential tilting of the margin q 2 ,
r = e_{q_2}(\theta \cdot u) = e^{\theta \cdot u - \psi(\theta)} \cdot q_2 ,

where $\theta \in \Theta \subset \mathbb R^d$ is a vector parameter and $u$ is the vector of sufficient statistics of the exponential family, with $E_{q_2}[u] = 0$, then the velocity in Equation (43) becomes

\overset{\star}{r}(t) = \dot\theta(t) \cdot \big( u - \nabla\psi(\theta(t)) \big) = \dot\theta(t) \cdot \big( u - E_{r(t)}[u] \big) = \dot\theta(t) \cdot U^e_{q_2 \to r(t)}\, u ,

and the gradient in Equation (43) becomes

\log\frac{q_{12}(x, \cdot)}{r(t)} - E_{r(t)}\!\left[ \log\frac{q_{12}(x, \cdot)}{r(t)} \right] = \log\frac{q_{12}(x, \cdot)}{q_2} - \big( \theta(t) \cdot u - \psi(\theta(t)) \big) - E_{r(t)}\!\left[ \log\frac{q_{12}(x, \cdot)}{q_2} - \big( \theta(t) \cdot u - \psi(\theta(t)) \big) \right] = \log q_{1|2}(x \mid \cdot) - \theta(t) \cdot u - E_{r(t)}\big[ \log q_{1|2}(x \mid \cdot) - \theta(t) \cdot u \big] = U^e_{q_2 \to r(t)}\Big( \log q_{1|2}(x \mid \cdot) - E_{q_2}\big[ \log q_{1|2}(x \mid \cdot) \big] - \theta(t) \cdot u \Big) .

Using Equation (11) repeatedly, we find the derivative of the lower bound in Equation (43),

\frac{d}{dt} L(r(t); x) = \dot\theta(t) \cdot \Big\langle U^e_{q_2 \to r(t)}\, u , U^e_{q_2 \to r(t)}\Big( \log q_{1|2}(x \mid \cdot) - E_{q_2}\big[ \log q_{1|2}(x \mid \cdot) \big] - \theta(t) \cdot u \Big) \Big\rangle_{r(t)} = \sum_{i=1}^d \dot\theta_i(t)\left( \mathrm{Cov}_{r(t)}\big( u_i , \log q_{1|2}(x \mid \cdot) \big) - \sum_{j=1}^d \theta_j(t)\, \mathrm{Cov}_{r(t)}(u_i, u_j) \right) = \dot\theta(t) \cdot \Big( \mathrm{Cov}_{r(t)}\big( u , \log q_{1|2}(x \mid \cdot) \big) - \operatorname{Hess}\psi(\theta(t))\, \theta(t) \Big) .

In conclusion, the gradient flow equation for the maximization of the lower bound under the model $M$ is

\dot\theta(t) = - \operatorname{Hess}\psi(\theta(t))\, \theta(t) + \mathrm{Cov}_{e_{q_2}(\theta(t) \cdot u)}\big( u , \log q_{1|2}(x \mid \cdot) \big) .
As a sanity check, assume that the model is exact for the given x,
q_{2|1}(y \mid x) = e^{\bar\theta \cdot u(y) - \psi(\bar\theta)} \cdot q_2(y) .

Hence,

\log q_{1|2}(x \mid y) = \log\frac{e^{\bar\theta \cdot u(y) - \psi(\bar\theta)}\, q_2(y)\, q_1(x)}{q_2(y)} = \bar\theta \cdot u(y) - \psi(\bar\theta) + \log q_1(x) ,

where the last two terms do not depend on $y$, so that

\mathrm{Cov}_{e_{q_2}(\theta \cdot u)}\big( u , \bar\theta \cdot u \big) = \operatorname{Hess}\psi(\theta)\, \bar\theta

and Equation (44) becomes

\dot\theta(t) = - \operatorname{Hess}\psi(\theta(t))\, \big( \theta(t) - \bar\theta \big) .
The solution of the gradient flow Equation (44) requires the ability to compute the covariance for the current model distribution. We do not discuss the numerical and simulation issues related to the implementation here.
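Purely as an added illustration of Equation (44) on a small finite model (NumPy; the explicit Euler discretization, the indicator sufficient statistics, and all names such as r_theta and lower_bound are choices of this edit), one can integrate the flow and check that, when the tilting family is rich enough, it recovers the exact conditional q_{2|1}(·|x) and the bound attains log q₁(x):

import numpy as np

rng = np.random.default_rng(8)
n1, n2 = 3, 4
w = rng.uniform(0.5, 1.5, (n1, n2))
q12 = w / w.sum()                                  # joint pf of (sample, latent)
q1, q2 = q12.sum(axis=1), q12.sum(axis=0)
x = 0                                              # fixed observed sample value

# q2-centered sufficient statistics; with d = n2 - 1 indicator statistics the
# tilting model covers the whole simplex, so the exact posterior is reachable
d = n2 - 1
u = np.eye(n2)[:, :d]
u = u - q2 @ u                                     # enforce E_{q2}[u] = 0

def r_theta(theta):                                # r = e_{q2}(theta . u)
    z = q2 * np.exp(u @ theta)
    return z / z.sum()

log_lik = np.log(q12[x, :] / q2)                   # y -> log q_{1|2}(x | y)

def lower_bound(theta):                            # L(r_theta; x)
    r = r_theta(theta)
    return -np.sum(r * np.log(r / q2)) + np.sum(r * log_lik)

theta, dt = np.zeros(d), 0.5
for _ in range(30000):                             # explicit Euler steps on the flow
    r = r_theta(theta)
    hess = (u * r[:, None]).T @ u - np.outer(u.T @ r, u.T @ r)     # Hess psi(theta)
    cov_ul = u.T @ (r * log_lik) - (u.T @ r) * np.sum(r * log_lik) # Cov(u, log q_{1|2})
    theta = theta + dt * (-hess @ theta + cov_ul)

posterior = q12[x, :] / q1[x]                      # exact q_{2|1}(. | x)
assert np.allclose(r_theta(theta), posterior, atol=1e-3)
assert np.isclose(lower_bound(theta), np.log(q1[x]), atol=1e-4)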

4. Discussion

In this paper, we have shown how the dually affine formalism for the open probability simplex provides a system of affine charts in which the statistical notion of Fisher’s score becomes the moving-chart affine velocity, and the natural gradient becomes the gradient. The construction applies to standard computations of interest for statistical machine learning. In particular, we have discussed the neat form of the total gradient of the KL-divergence and its applications in a factorial sample space, such as mean-field approximation and Bayes computations.
This approach is helpful because the unique features of the Fisherian approach to statistics, such as Fisher’s score, maximum likelihood, and Fisher’s information, are formalized as an affine calculus so that all the statistical tools are available in this more extensive theory. Moreover, this setting potentially unifies the formalisms of Statistics, Optimal Transport, and Statistical Physics, examples being the affine modeling of Optimal Transport [24] and the second-order methods of optimization [25].
We have not considered the implementation of the formal gradient flow equation as a practical learning algorithm. Such further development is currently outside the scope of this piece of research. It would require the numerical analysis of the continuous equation and the search for sampling versions of the expectation operators. We hope this note will prompt further research. On the abstract side, topics worth studying seem to be the cases of continuous sample space as in [16], or Gaussian models as in [26].

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study is purely methodological. No data were produced or used.

Acknowledgments

The Author acknowledges the partial support of de Castro Statistics, Collegio Carlo Alberto. He is a member of the non-profit organization Nuovo SEFIR and the UMI interest group AI&ML&MAT.

Conflicts of Interest

The Author declares no conflicts of interest.

References

  1. Efron, B.; Hastie, T. Computer Age Statistical Inference; Institute of Mathematical Statistics (IMS) Monographs; Algorithms, evidence, and data science; Cambridge University Press: New York, NY, USA, 2016; Volume 5, pp. xix+475. [Google Scholar]
  2. Amari, S. Geometry of Semiparametric Models and Applications. Invited Papers Meeting IP64 Likelihood and Geometry. Organizer Preben F. Blaesild. In Proceedings of the 51st Session of the International Statistical Institute, Istanbul, Turkey, 18–26 August 1997. [Google Scholar]
  3. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  4. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translated from the 1993 Japanese original by Daishi Harada; American Mathematical Society: Providence, RI, USA, 2000; pp. x+206. [Google Scholar]
  5. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016; Volume 194, pp. xiii+374. [Google Scholar]
  6. Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Statist. 1995, 23, 1543–1561. [Google Scholar] [CrossRef]
  7. Chirco, G.; Pistone, G. Dually affine Information Geometry modeled on a Banach space. arXiv 2022, arXiv:2204.00917. [Google Scholar] [CrossRef]
  8. Weyl, H. Space- Time- Matter; Translation of the 1921 RAUM ZEIT MATERIE; Dover: New York, NY, USA, 1952. [Google Scholar]
  9. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  10. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  11. Amari, S.I.; Karakida, R.; Oizumi, M. Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem. Inf. Geom. 2018, 1, 13–37. [Google Scholar] [CrossRef]
  12. Peyré, G.; Cuturi, M. Computational Optimal Transport. Found. Trends Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
  13. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  14. Khan, M.E.; Rue, H.V. The Bayesian learning rule. J. Mach. Learn. Res. 2023, 24, 46. [Google Scholar]
  15. Musielak, J. Orlicz Spaces and Modular Spaces; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1983; Volume 1034, pp. iii+222. [Google Scholar]
  16. Pistone, G. Information geometry of the Gaussian space. In Information Geometry and Its Applications; Springer: Cham, Switzerland, 2018; Volume 252, pp. 119–155. [Google Scholar]
  17. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: Berlin/Heidelberg, Germany, 1995; Volume 160, pp. xiv+364. [Google Scholar]
  18. Pistone, G. Information Geometry of the Probability Simplex: A Short Course. Nonlinear Phenom. Complex Syst. 2020, 23, 221–242. [Google Scholar] [CrossRef]
  19. Landau, L.D.; Lifshits, E.M. Course of Theoretical Physics, 3rd ed.; Statistical Physics; Butterworth-Heinemann: Oxford, UK, 1980; Volume V. [Google Scholar]
  20. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes; Monograph Series; Institute of Mathematical Statistics: Ann Arbor, MI, USA, 1986; pp. x+283. [Google Scholar]
  21. do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Translated from the Second Portuguese Edition by Francis Flaherty; Birkhäuser Boston Inc.: Berlin, Germany, 1992; pp. xiv+300. [Google Scholar]
  22. Pistone, G. Statistical Bundle of the Transport Model. In Geometric Science of Information; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 752–759. [Google Scholar] [CrossRef]
  23. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; pp. xxiv+748. [Google Scholar]
  24. Ay, N. Information geometry of the Otto metric. Inf. Geom. 2024, 1–24. [Google Scholar] [CrossRef]
  25. Chirco, G.; Malagò, L.; Pistone, G. Lagrangian and Hamiltonian dynamics for probabilities on the statistical bundle. Int. J. Geom. Methods Mod. Phys. 2022, 19, 2250214. [Google Scholar] [CrossRef]
  26. Malagò, L.; Montrucchio, L.; Pistone, G. Wasserstein Riemannian geometry of Gaussian densities. Inf. Geom. 2018, 1, 137–179. [Google Scholar] [CrossRef]