Abstract
The non-parametric version of Amari’s dually affine Information Geometry provides a practical calculus to perform computations of interest in statistical machine learning. The method uses the notion of a statistical bundle, a mathematical structure that includes both probability densities and random variables to capture the spirit of Fisherian statistics. We focus on computations involving a constrained minimization of the Kullback–Leibler divergence. We show how to obtain neat and principled versions of known computations in applications such as mean-field approximation, adversarial generative models, and variational Bayes.
1. Introduction and Notations
Many modern Artificial Intelligence (AI) and machine learning (ML) algorithms rely on non-parametric statistical methods and on optimization algorithms that minimize a divergence measure between probability functions. In particular, one computes the gradient of a function defined on the probability simplex; then, the learning uses a gradient ascent technique. Such a basic approach is illustrated, for example, in the textbook [] (Ch. 18).
In most papers, ordinary convex calculus tools on the open probability simplex provide the relevant derivatives and gradients. The relation between the analytic computations and their statistical meaning is usually not made explicit. This paper focuses on the derivative and gradient computations by providing the geometric framework called Information Geometry (IG). This geometry differs from the usual convex analysis because its devices have a direct statistical meaning. For example, the velocity of a one-dimensional parametric curve in the open probability simplex is defined to be Fisher’s score instead of the ordinary derivative. Generally speaking, IG is a geometric interpretation of Fisherian inference ([], Ch. 5).
Amari’s Information Geometry (IG) [,,] has been successfully applied to modern AI algorithms; see, for example, []. Here, we use the non-parametric version of IG of [,]. This version is non-parametric because the basic set of states is the open probability simplex; it is affine because it satisfies a generalization of the classical Weyl axioms []. Moreover, it is dually affine in the sense already defined in Amari’s contributions because the covariance bilinear operator appears as a duality pairing in the vector space of coordinates.
The specific applications we consider as examples come from the literature in statistical ML, particularly those that involve the constrained minimization of the Kullback–Leibler divergence (KL-divergence). Indeed, our main result in Section 2 is a form of the total gradient of the KL-divergence as expressed in the dually affine geometry. Specifically, we consider symmetric divergences [], generative adversarial networks [], mixed entropy and transport optimization [,], and variational Bayes [,].
Non-parametric IG can be developed on general sample spaces and under various functional assumptions. One option, among many, is the use of Orlicz spaces []; see [,]. In this paper, we are not interested in discussing the functional setup. Still, we are interested in presenting the peculiar affine calculus of positive probability functions on a finite state space in a geometric language compatible with the infinite-dimensional theory []. Such a calculus provides principled definitions of a curve’s velocity, a scalar field’s gradient, and the gradient flow.
1.1. Prerequisites
Below, we provide a schematic summary of the theory. For complete details, we refer to previous presentations in [,].
Let be a finite sample space. We look at the open simplex as the maximal exponential model denoted as . In fact, we present every pair of positive probability functions on , say , in the form inspired by Statistical Physics []:
where p represents a ground state, q is a perturbation of the ground state, v is a random variable, is a normalizing constant, and is the cumulant function.
The random variable v depends on p and q up to a constant. If we specify in Equation (1), then a straightforward computation gives
where is the KL-divergence. Regarding the entropy,
If we specify in Equation (1), an analogous computation gives
A vector bundle is a collection of vector spaces, and each vector space is called a fiber of the bundle. For example, in differential geometry, the tangent bundle collects all tangent vectors at each point. In Fisher’s statistics of the open probability simplex, one considers the vector space of all Fisher’s scores of one-dimensional models through the probability function q. Inspired by this last example, we call the statistical bundle the vector bundle of all pairs of a positive probability function q and a q-centered random variable,
Each fiber is a Euclidean space for the covariance inner product .
The covariance inner product is both a Riemannian metric and a duality pairing. The metric interpretation leads to the Riemannian version of IG. The duality pairing interpretation leads to our dually affine IG. Because of that, we want to distinguish between the fibers and the dual fibers . The first bundle is called the exponential bundle, while the second is called the mixture bundle. We use the notation
In our setup, all the vector spaces of random variables are finite-dimensional; hence, the fibers and are equal as vector spaces. However, the distinction is useful, as will become apparent in the discussion of parallel transports below.
The definition of the statistical bundle aims to capture an essential mechanism of Fisher’s approach to statistics ([], Ch. 4). Suppose is a one-dimensional statistical model. In that case, the Fisher’s score is , and is the lift of the curve to the statistical bundle.
Dually affine geometry follows from the definition of two parallel transports on the fibers and two affine charts. The parallel transports act between the fibers
It is easy to check that the transports are duals of each other:
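In a finite state space, this duality is an elementary change-of-weight identity, and it can be checked numerically. The following Python sketch is a minimal illustration, assuming the usual explicit formulas for the two transports, namely that the exponential transport recenters a random variable at the new probability function and the mixture transport multiplies it by the ratio of the probability functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_density(n):
    """A random point of the open probability simplex."""
    x = rng.random(n) + 0.1
    return x / x.sum()

n = 5
p, q = random_density(n), random_density(n)
v = rng.standard_normal(n); v -= np.sum(p * v)   # a vector of the fiber at p (p-centered)
w = rng.standard_normal(n); w -= np.sum(q * w)   # a vector of the dual fiber at q (q-centered)

eU = lambda a, b, u: u - np.sum(b * u)           # exponential transport from a to b: recenter at b
mU = lambda a, b, u: (a / b) * u                 # mixture transport from a to b: rescale by a / b

lhs = np.sum(q * eU(p, q, v) * w)                # <U^e_{p->q} v, w>_q
rhs = np.sum(p * v * mU(q, p, w))                # <v, U^m_{q->p} w>_p
print(np.isclose(lhs, rhs))                      # True: the two transports are dual
```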
The affine charts that define the two dual affine geometries by mapping the base set to a vector space of coordinates are
and the geometries defined by the two atlases are affine because the parallelogram law holds in both cases:
The inverse of the exponential chart is a non-parametric exponential family ([], Ch. 5), and the known mechanisms of the cumulant function provide a fundamental calculus tool []. If is the restriction of to and then ,
Equations (10) and (11) are the non-parametric version of the well-known properties of the derivative of the cumulant function in exponential models; see ([], § 5.5) and [].
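The following Python sketch illustrates these properties numerically, assuming the standard cumulant K_p(v) = log E_p[exp v] for a p-centered v and the exponential-chart inverse e_p(v) = exp(v − K_p(v)) p: the first derivative of K_p at v in the direction h is the expectation of h under q = e_p(v), and the second derivative in the direction h is the variance of h under q.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(6) + 0.1; p /= p.sum()            # the reference probability function p

def K(v):
    """Cumulant function K_p(v) = log E_p[exp(v)]."""
    return np.log(np.sum(p * np.exp(v)))

def e_p(v):
    """Inverse of the exponential chart: e_p(v) = exp(v - K_p(v)) * p."""
    return np.exp(v - K(v)) * p

v = rng.standard_normal(6); v -= np.sum(p * v)   # a p-centered random variable
h = rng.standard_normal(6); h -= np.sum(p * h)   # a p-centered direction
q = e_p(v)

eps = 1e-5
dK = (K(v + eps * h) - K(v - eps * h)) / (2 * eps)
print(np.isclose(dK, np.sum(q * h), atol=1e-7))            # dK_p(v)[h] = E_q[h]: True

eps2 = 1e-3
d2K = (K(v + eps2 * h) - 2 * K(v) + K(v - eps2 * h)) / eps2**2
var_q = np.sum(q * h**2) - np.sum(q * h)**2
print(np.isclose(d2K, var_q, atol=1e-5))                   # d^2K_p(v)[h, h] = Var_q(h): True
```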
We can now show that Fisher’s score is a velocity in the technical sense, namely, a velocity computed in the moving frame of both charts. If is a smooth curve, and is a smooth mapping,
The squared norm of the velocity (12),
is the Fisher information, which first appeared in the classical Cramér–Rao lower bound.
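A small numerical sketch of this statement follows, assuming the one-dimensional exponential model t ↦ e_p(t u): the velocity of the curve in the moving frame is the score, which is centered at the current probability function, and its squared norm is the variance of the sufficient statistic, that is, the Fisher information.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(5) + 0.1; p /= p.sum()
u = rng.standard_normal(5); u -= np.sum(p * u)   # sufficient statistic, p-centered

def q(t):
    """One-dimensional exponential model through p with direction u."""
    w = p * np.exp(t * u)
    return w / w.sum()

t, eps = 0.7, 1e-5
score = (np.log(q(t + eps)) - np.log(q(t - eps))) / (2 * eps)      # velocity in the moving frame
print(np.isclose(np.sum(q(t) * score), 0.0, atol=1e-8))            # the score is q(t)-centered: True

fisher = np.sum(q(t) * score**2)                                   # squared norm of the velocity
var_u = np.sum(q(t) * u**2) - np.sum(q(t) * u)**2
print(np.isclose(fisher, var_u, atol=1e-6))                        # = Var_{q(t)}(u): True
```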
The gradient defined in Equation (13) is frequently called the natural gradient in the IG literature, following the usage introduced for parametric models by Amari []. In Riemannian geometry [,], the metric acts as a duality pairing, and the definition of the gradient is similar to Equation (13). The classic example of a gradient computation is the gradient of the expected value as a function of the probability function,
so that .
The gradient of gives the velocities of curves “orthogonal” to the surfaces of constant -value, that is, the curves of steepest ascent. The solutions of the equation are the stationary points of , and an equation of the form
is a gradient flow equation.
In conclusion, we review the derivative of a function f between two maximal exponential models, computed using the mixture charts of Equation (7). The expressions of f and its derivative in the charts centered, respectively, at and , are
It follows that the computation of the derivative from its expression is
1.2. Summary of Content
In the following sections, we give both new results and new versions of known results. The aim is to show the usefulness of the non-parametric dually affine IG in computing the gradient flow of a constrained KL-divergence.
In Section 2, we show how to use the statistical bundle formalism to compute derivatives of functions defined on the open probability simplex and how to compute natural gradients and total natural gradients of the KL-divergence, the cross entropy, the entropy, and the Jensen–Shannon divergence.
In Section 3, we apply the general computations of the previous section to independence models and marginal conditional probabilities in a factorial product setting. The dually affine methodology systematically reproduces known computations and suggests neat variations of potential interest. In particular, Section 3.5 contains a fully worked example of the derivation of a gradient flow equation of interest in approximate Bayes computations.
2. Total Natural Gradient of the KL-Divergence
The KL-divergence ([], Ch. 3) as a function of two variables is
The computation of the total derivative is well known in Information Theory. However, we provide a proof in the affine setting, expressing the result in the affine charts.
In the exponential chart at p and in the mixture chart at p, the expressions of the probability functions q and r are, respectively,
By plugging (16) into (15) and using Equation (10), one sees that the expressions of the partial KL-divergences are, respectively,
and
Notice that the peculiar choice of the charts, exponential for the first variable and mixture for the second, is inessential in the finite state space case because any other choice produces the same final result in the computation of the total natural gradient. It is, however, consistent with the dually affine setting, in which two connections exist between one space and its dual. Moreover, the expression of the KL-divergence using the exponential chart in both variables is interesting because, in such a case, the resulting expression equals the Bregman divergence of the cumulant function ,
which, in turn, is the second-order remainder of the Taylor expansion. For example, one closed form is
If , then by Equation (11),
2.1. Total Natural Gradient of the KL-Divergence
We compute our gradients in the duality induced on each fiber by the covariance; hence, the total natural gradient of the KL-divergence has two components implicitly defined by
where is a random variable in the fiber at q, while is a random variable in the fiber at r. The adjective total refers to the fact that D is a function of two variables.
Proposition 1.
The total natural gradient of the KL-divergence is
That is, more explicitly, for each smooth pair of curves and , Equation (19) becomes
Proof.
The derivative at of Equation (18) in the direction is
□
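The following Python sketch checks Proposition 1 by finite differences, assuming the componentwise forms that are standard in this setting: the first component of the total natural gradient is log(q/r) − D(q‖r), a q-centered vector of the exponential fiber at q, and the second component is 1 − q/r, an r-centered vector of the mixture fiber at r.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_density(n):
    x = rng.random(n) + 0.1
    return x / x.sum()

def KL(q, r):
    return np.sum(q * np.log(q / r))

def move(a, u, t):
    """Exponential curve through a with velocity u at t = 0."""
    w = a * np.exp(t * u)
    return w / w.sum()

n = 6
q, r = random_density(n), random_density(n)
h = rng.standard_normal(n); h -= np.sum(q * h)   # velocity in the fiber at q
k = rng.standard_normal(n); k -= np.sum(r * k)   # velocity in the fiber at r

eps = 1e-5
d1 = (KL(move(q, h, eps), r) - KL(move(q, h, -eps), r)) / (2 * eps)
d2 = (KL(q, move(r, k, eps)) - KL(q, move(r, k, -eps))) / (2 * eps)

grad1 = np.log(q / r) - KL(q, r)                 # candidate component in the fiber at q
grad2 = 1.0 - q / r                              # candidate component in the fiber at r
print(np.isclose(d1, np.sum(q * grad1 * h), atol=1e-6))   # True
print(np.isclose(d2, np.sum(r * grad2 * k), atol=1e-6))   # True
```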
The gradient computation yields the corresponding gradient flow equation, whose discretization provides basic optimization algorithms. Here are two basic examples.
Given , the solution of the gradient flow equation
is the exponential family
The conclusion follows from Equation (10).
Given , the solution of the gradient flow equation
is the mixture family
Notice that in both cases, the t parameter appears in the solution in exponential form. Other forms of the temperature parameter will follow from a weighted form of the gradient flow equation.
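As an illustration, here is a minimal Python sketch of the first flow, under one reading of the closed-form solution in which the initial condition enters with weight e^{−t}, namely q_t ∝ r^{1−e^{−t}} q_0^{e^{−t}}; an explicit Euler discretization of the flow in the first argument is compared with this candidate exponential family.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
q0 = rng.random(n) + 0.1; q0 /= q0.sum()          # initial condition
r = rng.random(n) + 0.1; r /= r.sum()             # target of the flow

def KL(q, r):
    return np.sum(q * np.log(q / r))

# Explicit Euler discretization of one reading of the flow:
#   d/dt log q_t = -(log(q_t / r) - D(q_t || r))
dt, T = 1e-4, 2.0
q = q0.copy()
for _ in range(int(T / dt)):
    grad = np.log(q / r) - KL(q, r)
    q = q * np.exp(-dt * grad)
    q /= q.sum()

# Candidate closed-form solution: the exponential family connecting q0 and r
w = r ** (1 - np.exp(-T)) * q0 ** np.exp(-T)
w /= w.sum()
print(np.allclose(q, w, atol=1e-3))               # agreement up to the discretization error: True
```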
2.2. Natural Gradient of the Entropy and Total Natural Gradient of the Cross Entropy
The KL-divergence equals the cross entropy minus the entropy,
In the exponential chart at p for the first variable, the cross entropy is
with derivative at v in the direction h
In the mixture chart at p for the second variable
with derivative at w in the direction k,
Proposition 2.
The total natural gradient of the cross entropy is
and the natural gradient of the entropy is
2.3. Total Natural Gradient of the Jensen–Shannon Divergence
The Jensen–Shannon divergence [] is
It is the minimum value of the function
In fact,
which vanishes for .
Let us compute the derivative of . The mixture expression of f at p according to Equation (7) is the affine function
so that the derivative in the direction h is .
The push-back, according to the mixture transport Equation (3), is
We now compute the gradient of of Equation (26), using the total natural gradient of the KL-divergence of Proposition 1, the derivative Equation (27), and the duality of parallel transports Equation (4):
It is also instructive to use the expression of the Jensen–Shannon divergence in terms of entropies. From Equation (25),
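A short numerical check of this identity follows, assuming the midpoint form ½ D(q‖m) + ½ D(r‖m) with m = (q + r)/2 and the entropy expression H(m) − ½ H(q) − ½ H(r).

```python
import numpy as np

rng = np.random.default_rng(7)
q = rng.random(6) + 0.1; q /= q.sum()
r = rng.random(6) + 0.1; r /= r.sum()

KL = lambda a, b: np.sum(a * np.log(a / b))
H = lambda a: -np.sum(a * np.log(a))

m = (q + r) / 2                                    # midpoint mixture
js_kl = 0.5 * KL(q, m) + 0.5 * KL(r, m)            # definition via two KL-divergences
js_entropy = H(m) - 0.5 * H(q) - 0.5 * H(r)        # expression in terms of entropies
print(np.isclose(js_kl, js_entropy))               # True
```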
3. Product Sample Space
This section uses as a factorial sample space. For each , the margins are and . Under the mean-field assumption, the model equals the tensor product of the margins,
The velocities are, respectively,
Below, we will discuss the optimality of a mean-field approximation.
3.1. Product Sample Space: Marginalization
The (first) marginalization is
Proposition 3.
The derivative of the marginalization Equation (30) is
Proof.
In the mixture chart centered at and , respectively, the expression of the marginalization is
Note that the expression in Equation (31) is linear. Hence, the derivative at v in the direction h is with so that the bundle derivative is
□
There is an interesting relation between conditional expectation and mixture transport. The conditional expectation commutes with the mixture transports,
It is a way to express Bayes’ theorem for conditional expectations. For all ,
3.2. Product Sample Space: Mean-Field Approximation
The derivative of the joint marginalization
follows from the derivative of the marginalization in Equation (29).
Proposition 4.
Proof.
Compose the partial derivatives with the mapping □
The decomposition of the velocity according to Equation (32) improves on that of Equations (28) and (29) and provides a definition of the mean-field approximation. In the language of the ANOVA decomposition of statistical interactions, the derivative part in Equation (32) is the sum of the simple effects of the velocity,
where , , and the last term is the interaction, the q-orthogonal residual. See [] for a discussion of the ANOVA decomposition in the context of the statistical bundle.
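The following Python sketch illustrates the decomposition, under the simplifying assumption that q is itself a product probability: the simple effects are the conditional expectations along each coordinate, and the residual interaction is q-orthogonal to every random variable depending on a single coordinate.

```python
import numpy as np

rng = np.random.default_rng(8)
nx, ny = 3, 4
q1 = rng.random(nx) + 0.1; q1 /= q1.sum()          # first margin
q2 = rng.random(ny) + 0.1; q2 /= q2.sum()          # second margin
q = np.outer(q1, q2)                               # product (mean-field) probability on the grid

h = rng.standard_normal((nx, ny))
h -= np.sum(q * h)                                 # a q-centered random variable (a velocity)

# Simple effects: conditional expectations along each coordinate
h1 = (q * h).sum(axis=1) / q1                      # E_q[h | x], a function of x only
h2 = (q * h).sum(axis=0) / q2                      # E_q[h | y], a function of y only
h12 = h - h1[:, None] - h2[None, :]                # interaction: the residual term

# The residual is q-orthogonal to any function of x alone and to any function of y alone
gx, gy = rng.standard_normal(nx), rng.standard_normal(ny)
print(np.isclose(np.sum(q * h12 * gx[:, None]), 0.0, atol=1e-10))   # True
print(np.isclose(np.sum(q * h12 * gy[None, :]), 0.0, atol=1e-10))   # True
```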
The equation for the total natural gradient of the KL-divergence and the computation of the derivative above provide the natural gradients of the divergence between the joint probability function and the mean-field approximation. In information theory [], the KL-divergence in Equation (34) is called mutual information.
Proposition 5.
The natural gradients of the divergences between a joint distribution r and its mean-field approximation are
The conditional terms in Equation (33) depend on the mean-field model; hence, we could express them as a disintegration of r. For example,
where the last term is the conditional entropy under r.
Proof of Equation (33).
We find the natural gradient of by computing with Equations (20) and (32) the variation along a smooth curve such that and . It holds that
We want to present the first term of the RHS as an inner product at r applied to . Let us push the inner product from to r with Equation (5). It holds that
The last equality follows from
□
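A compact numerical illustration of the objects in Proposition 5 follows, with names chosen for the example: the divergence between a joint probability function r and its mean-field approximation is the mutual information, which, by the classical identity, also equals the sum of the marginal entropies minus the joint entropy.

```python
import numpy as np

rng = np.random.default_rng(9)
nx, ny = 3, 4
r = rng.random((nx, ny)) + 0.1; r /= r.sum()       # a joint probability on the product space

r1, r2 = r.sum(axis=1), r.sum(axis=0)              # the two margins
mean_field = np.outer(r1, r2)                      # mean-field approximation of r

KL = lambda a, b: np.sum(a * np.log(a / b))
H = lambda a: -np.sum(a * np.log(a))

mutual_information = KL(r, mean_field)             # divergence from the mean-field approximation
print(np.isclose(mutual_information, H(r1) + H(r2) - H(r)))   # classical identity: True
```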
3.3. Product Sample Space: Kantorovich and Schrödinger
If denotes the joint marginalization, the set of transport plans with margins and is
Here, we deal with a classical topic with considerable literature. We mention only the monograph of ref. [] and, from the Information Geometry perspective, ref. [,].
Let us consider first the Kantorovich problem. Given the cost function (i.e., potential function)
and a curve , we want to minimize the cost
As for all ,
so that . The velocity of a curve in the transport plans is an interaction. Now, the derivative of the cost is
From the interaction property of , it follows that if the ANOVA decomposition
holds, then
The Schrödinger problem is similar. Given the cost function (i.e., potential function)
consider the exponential perturbation of the mean-field probability function
The parameter is called temperature, and the normalizing constant is
The gradient of is, from Equations (35) and (34),
Only the interaction part is relevant in the constrained problem , and pairing with an interaction annihilates the two conditional-expectation terms, which leaves
We refer to [] for a method to compute the interaction part of a random variable.
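A minimal Python sketch of the exponential perturbation of the mean-field probability follows, with illustrative names (q1, q2, c, eps) and a temperature eps chosen here only for the example; the tilt does not, in general, belong to the set of transport plans, and the classical IPF/Sinkhorn scaling (a standard tool, not discussed in the text) is used to recover a plan with the prescribed margins.

```python
import numpy as np

rng = np.random.default_rng(10)
nx, ny = 4, 5
q1 = rng.random(nx) + 0.1; q1 /= q1.sum()          # prescribed first margin
q2 = rng.random(ny) + 0.1; q2 /= q2.sum()          # prescribed second margin
c = rng.random((nx, ny))                           # cost (potential) function on the product space
eps = 0.5                                          # temperature parameter

# Exponential perturbation of the mean-field probability
k = np.outer(q1, q2) * np.exp(-c / eps)
q_eps = k / k.sum()
print(np.abs(q_eps.sum(axis=1) - q1).max())        # the tilt violates the margin constraints

# Classical IPF/Sinkhorn scaling projects the tilt back onto the set of transport plans
a, b = np.ones(nx), np.ones(ny)
for _ in range(500):
    a = q1 / (k @ b)
    b = q2 / (k.T @ a)
plan = a[:, None] * k * b[None, :]
print(np.allclose(plan.sum(axis=1), q1), np.allclose(plan.sum(axis=0), q2))   # True True
```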
3.4. Product Sample Space: Conditional Probability Function
When the sample space is a product, , we can represent each probability function in the maximal exponential model via conditioning on one margin,
The two representations are
Following the approach of [], ([], Ch. 11), we look at the transition mapping
as a family of probability functions representing alternative probability models. The other transition mapping
is the discriminator, that is, is the probability that the sample x comes from .
The right-to-left second mapping in Equation (39),
maps the vector of the 1-margin and the set of alternative probability functions to the joint probability function. The kinematics of Equation (40), that is, the computation of velocities, is
Hence, the total derivative of B is
The transposed total derivative is defined by
that is,
In conclusion, the transposed total derivative is
It is also interesting to compute the derivative in the mixture atlas. The mixture expression of B with respect to and is
and the total derivative in the directions , is
The push-back of the total derivative expression to the statistical bundles uses the equations
to obtain
In our affine language, we repeat the computations of []. In particular, we derive the natural gradient of a composite function via the equation
We have
The first component of is
so that . The x-component is
so that .
We now assume a target probability function and consider the probability function on the product sample space where all the model probability functions equal g and the discriminator is uniform.
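A small Python sketch of the two factorizations follows, with illustrative names (mu for the margin on the model index, f for the family of alternative models): the map B builds the joint probability function from the margin and the conditionals, and the discriminator is the reverse conditional obtained by Bayes’ formula.

```python
import numpy as np

rng = np.random.default_rng(11)
ny, nx = 3, 5
mu = rng.random(ny) + 0.1; mu /= mu.sum()          # margin on the model index y
f = rng.random((ny, nx)) + 0.1
f /= f.sum(axis=1, keepdims=True)                  # family of alternative models f(x | y)

# The map B: (margin, family of conditionals) -> joint probability on the product space
r = mu[:, None] * f                                # r(y, x) = mu(y) f(x | y)
print(np.isclose(r.sum(), 1.0))                    # True: a probability function on the product

# The discriminator: probability that the sample x was produced by the model indexed by y
rx = r.sum(axis=0)                                 # x-margin of the joint
d = r / rx[None, :]                                # d(y | x), Bayes' formula
print(np.allclose(d.sum(axis=0), 1.0))             # True: each column is a probability on y
print(np.allclose(r, d * rx[None, :]))             # True: r(y, x) = d(y | x) r_X(x)
```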
3.5. Variational Bayes
We revisit and develop some computations of ([], § 2.2). We keep the same notation as above so that Bayes’ formula is
where x is a sample value and y is a latent variable value.
For a fixed , we look for an r in some model in order to approximate the conditional . If satisfies
then
The so-called variational lower bound follows from ,
for all . The bound is exact because for all y if, and only if, . The lower-bound variation along a curve is
If the model is an exponential tilting of the margin ,
where is a vector parameter, and u is the vector of sufficient statistics of the exponential family with , then the velocity in Equation (43) becomes
and the gradient in Equation (43) becomes
In conclusion, the gradient flow equation for the maximization of the lower bound under the model is
As a sanity check, assume that the model is exact for the given x,
The solution of the gradient flow Equation (44) requires the ability to compute the covariance for the current model distribution. We do not discuss the numerical and simulation issues related to the implementation here.
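To make the sanity check concrete, here is a minimal finite-state Python sketch with illustrative names (prior, lik, u, eta), assuming that the natural gradient of the lower bound in the parameter is the covariance of the sufficient statistics with log(g(x, ·)/r), preconditioned by the Fisher matrix: the model is the exponential tilting of the prior with saturated sufficient statistics, so the exact conditional belongs to the model, and the discretized natural-gradient flow recovers it; the covariance required by the flow is computed exactly because the state space is finite.

```python
import numpy as np

rng = np.random.default_rng(12)
ny, nx = 4, 6
prior = rng.random(ny) + 0.1; prior /= prior.sum()        # prior on the latent variable y
lik = rng.random((ny, nx)) + 0.1
lik /= lik.sum(axis=1, keepdims=True)                     # likelihood f(x | y)
x = 2                                                     # the fixed observed sample value

g = prior * lik[:, x]                                     # joint g(x, y), as a function of y
posterior = g / g.sum()                                   # the conditional to be approximated

# Model: exponential tilting of the prior, r_theta(y) propto prior(y) exp(theta . u(y)),
# with saturated sufficient statistics u_j(y) = 1[y = j], j = 0, ..., ny - 2.
u = np.eye(ny)[:, :-1]

def r_of(theta):
    w = prior * np.exp(u @ theta)
    return w / w.sum()

theta, eta = np.zeros(ny - 1), 0.5                        # parameter and step size of the flow
for _ in range(200):
    r = r_of(theta)
    phi = np.log(g) - np.log(r)                           # integrand of the lower bound
    fisher = u.T @ (r[:, None] * u) - np.outer(u.T @ r, u.T @ r)   # Cov_r(u), the Fisher matrix
    grad = u.T @ (r * phi) - (u.T @ r) * np.sum(r * phi)           # Cov_r(u, phi)
    theta += eta * np.linalg.solve(fisher, grad)          # natural-gradient ascent step

print(np.allclose(r_of(theta), posterior, atol=1e-8))     # the model is exact, so we recover it: True
```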
4. Discussion
In this paper, we have shown how the dually affine formalism for the open probability simplex provides a system of affine charts in which the statistical notion of Fisher’s score becomes the moving-chart affine velocity, and the natural gradient becomes the gradient in the duality induced by the covariance. The construction applies to standard computations of interest for statistical machine learning. In particular, we have discussed the neat form of the total gradient of the KL-divergence and its applications in a factorial sample space, such as mean-field approximation and Bayes computations.
This approach is helpful because the unique features of the Fisherian approach to statistics, such as Fisher’s score, maximum likelihood, and Fisher’s information, are formalized as an affine calculus so that all the statistical tools are available in this more extensive theory. Moreover, this setting potentially unifies the formalisms of Statistics, Optimal Transport, and Statistical Physics, examples being the affine modeling of Optimal Transport [] and the second-order methods of optimization [].
We have not considered the implementation of the formal gradient flow equation as a practical learning algorithm. Such further development is currently outside the scope of this work; it would require the numerical analysis of the continuous equation and the search for sampling versions of the expectation operators. We hope this note will prompt further research. On the abstract side, topics worth studying include the case of a continuous sample space, as in [], and Gaussian models, as in [].
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
This study is purely methodological. No data were produced or used.
Acknowledgments
The Author acknowledges the partial support of de Castro Statistics, Collegio Carlo Alberto. He is a member of the non-profit organization Nuovo SEFIR and the UMI interest group AI&ML&MAT.
Conflicts of Interest
The Author declares no conflicts of interest.
References
- Efron, B.; Hastie, T. Computer Age Statistical Inference; Institute of Mathematical Statistics (IMS) Monographs; Algorithms, evidence, and data science; Cambridge University Press: New York, NY, USA, 2016; Volume 5, pp. xix+475. [Google Scholar]
- Amari, S. Geometry of Semiparametric Models and Applications. Invited Papers Meeting IP64 Likelihood and Geometry. Organizer Preben F. Blaesild. In Proceedings of the 51st Session of the International Statistical Institute, Istanbul, Turkey, 18–26 August 1997. [Google Scholar]
- Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Translated from the 1993 Japanese original by Daishi Harada; American Mathematical Society: Providence, RI, USA, 2000; pp. x+206. [Google Scholar]
- Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016; Volume 194, pp. xiii+374. [Google Scholar]
- Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Statist. 1995, 23, 1543–1561. [Google Scholar] [CrossRef]
- Chirco, G.; Pistone, G. Dually affine Information Geometry modeled on a Banach space. arXiv 2022, arXiv:2204.00917. [Google Scholar] [CrossRef]
- Weyl, H. Space- Time- Matter; Translation of the 1921 RAUM ZEIT MATERIE; Dover: New York, NY, USA, 1952. [Google Scholar]
- Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
- Amari, S.I.; Karakida, R.; Oizumi, M. Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem. Inf. Geom. 2018, 1, 13–37. [Google Scholar] [CrossRef]
- Peyré, G.; Cuturi, M. Computational Optimal Transport. Found. Trends Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Khan, M.E.; Rue, H.V. The Bayesian learning rule. J. Mach. Learn. Res. 2023, 24, 46. [Google Scholar]
- Musielak, J. Orlicz Spaces and Modular Spaces; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1983; Volume 1034, pp. iii+222. [Google Scholar]
- Pistone, G. Information geometry of the Gaussian space. In Information Geometry and Its Applications; Springer: Cham, Switzerland, 2018; Volume 252, pp. 119–155. [Google Scholar]
- Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: Berlin/Heidelberg, Germany, 1995; Volume 160, pp. xiv+364. [Google Scholar]
- Pistone, G. Information Geometry of the Probability Simplex: A Short Course. Nonlinear Phenom. Complex Syst. 2020, 23, 221–242. [Google Scholar] [CrossRef]
- Landau, L.D.; Lifshits, E.M. Course of Theoretical Physics, 3rd ed.; Statistical Physics; Butterworth-Heinemann: Oxford, UK, 1980; Volume V. [Google Scholar]
- Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes; Monograph Series; Institute of Mathematical Statistics: Ann Arbor, MI, USA, 1986; pp. x+283. [Google Scholar]
- do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Translated from the Second Portuguese Edition by Francis Flaherty; Birkhäuser Boston Inc.: Berlin, Germany, 1992; pp. xiv+300. [Google Scholar]
- Pistone, G. Statistical Bundle of the Transport Model. In Geometric Science of Information; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 752–759. [Google Scholar] [CrossRef]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; pp. xxiv+748. [Google Scholar]
- Ay, N. Information geometry of the Otto metric. Inf. Geom. 2024, 1–24. [Google Scholar] [CrossRef]
- Chirco, G.; Malagò, L.; Pistone, G. Lagrangian and Hamiltonian dynamics for probabilities on the statistical bundle. Int. J. Geom. Methods Mod. Phys. 2022, 19, 2250214. [Google Scholar] [CrossRef]
- Malagò, L.; Montrucchio, L.; Pistone, G. Wasserstein Riemannian geometry of Gaussian densities. Inf. Geom. 2018, 1, 137–179. [Google Scholar] [CrossRef]