Abstract
The non-parametric version of Amari’s dually affine Information Geometry provides a practical calculus to perform computations of interest in statistical machine learning. The method uses the notion of a statistical bundle, a mathematical structure that includes both probability densities and random variables to capture the spirit of Fisherian statistics. We focus on computations involving a constrained minimization of the Kullback–Leibler divergence. We show how to obtain neat and principled versions of known computations in applications such as mean-field approximation, adversarial generative models, and variational Bayes.
1. Introduction and Notations
Many modern Artificial Intelligence (AI) and machine learning (ML) algorithms rely on non-parametric statistical methods and on optimization algorithms that minimize a divergence measure between probability functions. In particular, one computes the gradient of a function defined on the probability simplex; then, the learning uses a gradient ascent technique. Such a basic approach is illustrated, for example, in the textbook [] (Ch. 18).
In most papers, ordinary convex calculus tools on the open probability simplex provide the relevant derivatives and gradients. The relation between the analytic computations and their statistical meaning is usually not made explicit. This paper focuses on the derivative and gradient computations by providing the geometric framework called Information Geometry (IG). This geometry differs from the usual convex analysis because its devices have a direct statistical meaning. For example, the velocity of a one-dimensional parametric curve in the open probability simplex is defined to be Fisher’s score instead of the ordinary derivative. Generally speaking, IG is a geometric interpretation of Fisherian inference ([], Ch. 5).
Amari’s Information Geometry (IG) [,,] has been successfully applied to modern AI algorithms; see, for example, []. Here, we use the non-parametric version of IG of [,]. This version is non-parametric because the basic set of states is the open probability simplex; it is affine because it satisfies a generalization of the classical Weyl axioms []. Moreover, it is dually affine in the sense already defined in Amari’s contributions because the covariance bilinear operator appears as a duality pairing in the vector space of coordinates.
The specific applications we consider as examples come from the literature in statistical ML, particularly those that involve the constrained minimization of the Kullback–Leibler divergence (KL-divergence). Indeed, our main result in Section 2 is a form of the total gradient of the KL-divergence as expressed in the dually affine geometry. Specifically, we consider symmetric divergences [], generative adversarial networks [], mixed entropy and transport optimization [,], and variational Bayes [,].
Non-parametric IG can be developed on general sample spaces and under various functional assumptions. One option, among many, is the use of Orlicz spaces []; see [,]. In this paper, we are not interested in discussing the functional setup. Still, we are interested in presenting the peculiar affine calculus of positive probability functions on a finite state space in a geometric language compatible with the infinite-dimensional theory []. Such a calculus provides principled definitions of a curve’s velocity, a scalar field’s gradient, and the gradient flow.
1.1. Prerequisites
Below, we provide a schematic summary of the theory. For complete details, we refer to previous presentations in [,].
Let be a finite sample space. We look at the open simplex as the maximal exponential model denoted as . In fact, we present every pair of positive probability functions on , say , in the form inspired by Statistical Physics []:
where p represents a ground state, q is a perturbation of the ground state, v is a random variable, is a normalizing constant, and is the cumulant function.
The random variable v depends on p and q up to a constant. If we specify in Equation (1), then a straightforward computation gives
where is the KL-divergence. Regarding the entropy,
If we specify in Equation (1), an analogous computation gives
A vector bundle is a collection of vector spaces, and each vector space is called a fiber of the bundle. For example, in differential geometry, the tangent bundle collects all tangent vectors at each point. In Fisher’s statistics of the open probability simplex, one considers the vector space of all Fisher’s scores of one-dimensional models through the probability function q. Inspired by this last example, we call the statistical bundle the vector bundle of all pairs of a positive probability function q and a q-centered random variable,
Each fiber is a Euclidean space for the covariance inner product .
The covariance inner product is both a Riemannian metric and a duality pairing. The metric interpretation leads to the Riemannian version of IG. The duality pairing interpretation leads to our dually affine IG. Because of that, we want to distinguish between the fibers and the dual fibers . The first bundle is called the exponential bundle, while the second is called the mixture bundle. We use the notation
In our setup, all the vector spaces of random variables are finite-dimensional; hence, the fibers and are equal as vector spaces. However, the distinction is useful, as will become apparent in the discussion of parallel transports below.
The definition of the statistical bundle aims to capture an essential mechanism of Fisher’s approach to statistics ([], Ch. 4). Suppose is a one-dimensional statistical model. In that case, the Fisher’s score is , and is the lift of the curve to the statistical bundle.
Dually affine geometry follows from the definition of two parallel transports on the fibers and two affine charts. The parallel transports act between the fibers
It is easy to check that the transports are duals of each other:
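In a finite state space, this duality is an elementary change-of-weight identity, and it can be checked numerically. The following Python sketch is a minimal illustration, assuming the usual explicit formulas for the two transports, namely that the exponential transport recenters a random variable at the new probability function and the mixture transport multiplies it by the ratio of the probability functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_density(n):
    """A random point of the open probability simplex."""
    x = rng.random(n) + 0.1
    return x / x.sum()

n = 5
p, q = random_density(n), random_density(n)
v = rng.standard_normal(n); v -= np.sum(p * v)   # a vector of the fiber at p (p-centered)
w = rng.standard_normal(n); w -= np.sum(q * w)   # a vector of the dual fiber at q (q-centered)

eU = lambda a, b, u: u - np.sum(b * u)           # exponential transport from a to b: recenter at b
mU = lambda a, b, u: (a / b) * u                 # mixture transport from a to b: rescale by a / b

lhs = np.sum(q * eU(p, q, v) * w)                # <U^e_{p->q} v, w>_q
rhs = np.sum(p * v * mU(q, p, w))                # <v, U^m_{q->p} w>_p
print(np.isclose(lhs, rhs))                      # True: the two transports are dual
```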
The affine charts that define the two dual affine geometries by mapping the base set to a vector space of coordinates are
and the geometries defined by the two atlases are affine because the parallelogram law holds in both cases:
The inverse of the exponential chart is a non-parametric exponential family ([], Ch. 5), and the known mechanisms of the cumulant function provide a fundamental calculus tool []. If is the restriction of to and then ,
Equations (10) and (11) are the non-parametric version of the well-known properties of the derivative of the cumulant function in exponential models; see ([], § 5.5) and [].
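The following Python sketch illustrates these properties numerically, assuming the standard cumulant K_p(v) = log E_p[exp v] for a p-centered v and the exponential-chart inverse e_p(v) = exp(v − K_p(v)) p: the first derivative of K_p at v in the direction h is the expectation of h under q = e_p(v), and the second derivative in the direction h is the variance of h under q.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(6) + 0.1; p /= p.sum()            # the reference probability function p

def K(v):
    """Cumulant function K_p(v) = log E_p[exp(v)]."""
    return np.log(np.sum(p * np.exp(v)))

def e_p(v):
    """Inverse of the exponential chart: e_p(v) = exp(v - K_p(v)) * p."""
    return np.exp(v - K(v)) * p

v = rng.standard_normal(6); v -= np.sum(p * v)   # a p-centered random variable
h = rng.standard_normal(6); h -= np.sum(p * h)   # a p-centered direction
q = e_p(v)

eps = 1e-5
dK = (K(v + eps * h) - K(v - eps * h)) / (2 * eps)
print(np.isclose(dK, np.sum(q * h), atol=1e-7))            # dK_p(v)[h] = E_q[h]: True

eps2 = 1e-3
d2K = (K(v + eps2 * h) - 2 * K(v) + K(v - eps2 * h)) / eps2**2
var_q = np.sum(q * h**2) - np.sum(q * h)**2
print(np.isclose(d2K, var_q, atol=1e-5))                   # d^2K_p(v)[h, h] = Var_q(h): True
```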
We can now show that Fisher’s score is a velocity in the technical sense, namely, a velocity computed in the moving frame of both charts. If is a smooth curve, and is a smooth mapping,
The squared norm of the velocity (12),
is the Fisher information, which first appeared in the classical Cramér–Rao lower bound.
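A small numerical sketch of this statement follows, assuming the one-dimensional exponential model t ↦ e_p(t u): the velocity of the curve in the moving frame is the score, which is centered at the current probability function, and its squared norm is the variance of the sufficient statistic, that is, the Fisher information.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(5) + 0.1; p /= p.sum()
u = rng.standard_normal(5); u -= np.sum(p * u)   # sufficient statistic, p-centered

def q(t):
    """One-dimensional exponential model through p with direction u."""
    w = p * np.exp(t * u)
    return w / w.sum()

t, eps = 0.7, 1e-5
score = (np.log(q(t + eps)) - np.log(q(t - eps))) / (2 * eps)      # velocity in the moving frame
print(np.isclose(np.sum(q(t) * score), 0.0, atol=1e-8))            # the score is q(t)-centered: True

fisher = np.sum(q(t) * score**2)                                   # squared norm of the velocity
var_u = np.sum(q(t) * u**2) - np.sum(q(t) * u)**2
print(np.isclose(fisher, var_u, atol=1e-6))                        # = Var_{q(t)}(u): True
```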
The gradient defined in Equation (13) is frequently called the natural gradient in the IG literature, following the usage introduced for parametric models by Amari []. In Riemannian geometry [,], the metric acts as a duality pairing, and the definition of the gradient is similar to Equation (13). The classic example of a gradient computation is the gradient of the expected value as a function of the probability function,
so that .
The gradient of gives the velocities of curves “orthogonal” to the surfaces of constant -value, that is, the curves of steepest ascent. The solutions of the equation are the stationary points of , and an equation of the form
is a gradient flow equation.
In conclusion, we review the derivative of a function f between two maximal exponential models, computed using the mixture charts of Equation (7). The expressions of f and its derivative in the charts centered, respectively, at and , are
It follows that the computation of the derivative from its expression is
1.2. Summary of Content
In the following sections, we give both new results and new versions of known results. The aim is to show the usefulness of the non-parametric dually affine IG in computing the gradient flow of a constrained KL-divergence.
In Section 2, we show how to use the statistical bundle formalism to compute derivatives of functions defined on the open probability simplex and how to compute natural gradients and total natural gradients of the KL-divergence, the cross entropy, the entropy, and the Jensen–Shannon divergence.
In Section 3, we apply the general computations of the previous section to independence models and marginal conditional probabilities in a factorial product setting. The dually affine methodology systematically reproduces known computations and suggests neat variations of potential interest. In particular, Section 3.5 contains a fully worked example of the derivation of a gradient flow equation of interest in approximate Bayes computations.
2. Total Natural Gradient of the KL-Divergence
The KL-divergence ([], Ch. 3) as a function of two variables is
The computation of the total derivative is well known in Information Theory. However, we provide a proof in the affine setting, expressing the result in the affine charts.
In the exponential chart at p and in the mixture chart at p, the expressions of the probability functions q and r are, respectively,
By plugging (16) into (15) and using Equation (10), one sees that the expressions of the partial KL-divergences are, respectively,
and
Notice that the peculiar choice of the charts, exponential for the first variable and mixture for the second, is inessential in the finite state space case because any other choice produces the same final result in the computation of the total natural gradient. It is, however, consistent with the dually affine setting, in which two connections exist between one space and its dual. Moreover, the expression of the KL-divergence using the exponential chart in both variables is interesting because, in such a case, the resulting expression equals the Bregman divergence of the cumulant function ,
which, in turn, is the second-order remainder of the Taylor expansion. For example, one closed form is
If , then by Equation (11),
2.1. Total Natural Gradient of the KL-Divergence
We compute our gradients in the duality induced on each fiber by the covariance; hence, the total natural gradient of the KL-divergence has two components implicitly defined by
where is a random variable in the fiber at q, while is a random variable in the fiber at r. The adjective total refers to the fact that D is a function of two variables.
Proposition 1.
The total natural gradient of the KL-divergence is
That is, more explicitly, for each smooth pair of curves and , Equation (19) becomes
Proof.
The derivative at of Equation (18) in the direction is
□
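The following Python sketch checks Proposition 1 by finite differences, assuming the componentwise forms that are standard in this setting: the first component of the total natural gradient is log(q/r) − D(q‖r), a q-centered vector of the exponential fiber at q, and the second component is 1 − q/r, an r-centered vector of the mixture fiber at r.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_density(n):
    x = rng.random(n) + 0.1
    return x / x.sum()

def KL(q, r):
    return np.sum(q * np.log(q / r))

def move(a, u, t):
    """Exponential curve through a with velocity u at t = 0."""
    w = a * np.exp(t * u)
    return w / w.sum()

n = 6
q, r = random_density(n), random_density(n)
h = rng.standard_normal(n); h -= np.sum(q * h)   # velocity in the fiber at q
k = rng.standard_normal(n); k -= np.sum(r * k)   # velocity in the fiber at r

eps = 1e-5
d1 = (KL(move(q, h, eps), r) - KL(move(q, h, -eps), r)) / (2 * eps)
d2 = (KL(q, move(r, k, eps)) - KL(q, move(r, k, -eps))) / (2 * eps)

grad1 = np.log(q / r) - KL(q, r)                 # candidate component in the fiber at q
grad2 = 1.0 - q / r                              # candidate component in the fiber at r
print(np.isclose(d1, np.sum(q * grad1 * h), atol=1e-6))   # True
print(np.isclose(d2, np.sum(r * grad2 * k), atol=1e-6))   # True
```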
The gradient computation yields the corresponding gradient flow equation, whose discretization provides basic optimization algorithms. Here are two basic examples.
Given , the solution of the gradient flow equation
is the exponential family
The conclusion follows from Equation (10).
Given , the solution of the gradient flow equation
is the mixture family
Notice that in both cases, the t parameter appears in the solution in exponential form. Other forms of the temperature parameter will follow from a weighted form of the gradient flow equation.
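As an illustration, here is a minimal Python sketch of the first flow, under one reading of the closed-form solution in which the initial condition enters with weight e^{−t}, namely q_t ∝ r^{1−e^{−t}} q_0^{e^{−t}}; an explicit Euler discretization of the flow in the first argument is compared with this candidate exponential family.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
q0 = rng.random(n) + 0.1; q0 /= q0.sum()          # initial condition
r = rng.random(n) + 0.1; r /= r.sum()             # target of the flow

def KL(q, r):
    return np.sum(q * np.log(q / r))

# Explicit Euler discretization of one reading of the flow:
#   d/dt log q_t = -(log(q_t / r) - D(q_t || r))
dt, T = 1e-4, 2.0
q = q0.copy()
for _ in range(int(T / dt)):
    grad = np.log(q / r) - KL(q, r)
    q = q * np.exp(-dt * grad)
    q /= q.sum()

# Candidate closed-form solution: the exponential family connecting q0 and r
w = r ** (1 - np.exp(-T)) * q0 ** np.exp(-T)
w /= w.sum()
print(np.allclose(q, w, atol=1e-3))               # agreement up to the discretization error: True
```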
2.2. Natural Gradient of the Entropy and Total Natural Gradient of the Cross Entropy
The KL-divergence equals the cross entropy minus the entropy,
In the exponential chart at p for the first variable, the cross entropy is
with derivative at v in the direction h
In the mixture chart at p for the second variable
with derivative at w in the direction k,
Proposition 2.
The total natural gradient of the cross entropy is
and the natural gradient of the entropy is
2.3. Total Natural Gradient of the Jensen–Shannon Divergence
The Jensen–Shannon divergence [] is
It is the minimum value of the function
In fact,
which vanishes for .
Let us compute the derivative of . The mixture expression of f at p according to Equation (7) is the affine function
so that the derivative in the direction h is .
The push-back, according to the mixture transport Equation (3), is
We now compute the gradient of of Equation (26), using the total natural gradient of the KL-divergence of Proposition 1, the derivative Equation (27), and the duality of parallel transports Equation (4):
It is also instructive to use the expression of the Jensen–Shannon divergence in terms of entropies. From Equation (25),
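A short numerical check of this identity follows, assuming the midpoint form ½ D(q‖m) + ½ D(r‖m) with m = (q + r)/2 and the entropy expression H(m) − ½ H(q) − ½ H(r).

```python
import numpy as np

rng = np.random.default_rng(7)
q = rng.random(6) + 0.1; q /= q.sum()
r = rng.random(6) + 0.1; r /= r.sum()

KL = lambda a, b: np.sum(a * np.log(a / b))
H = lambda a: -np.sum(a * np.log(a))

m = (q + r) / 2                                    # midpoint mixture
js_kl = 0.5 * KL(q, m) + 0.5 * KL(r, m)            # definition via two KL-divergences
js_entropy = H(m) - 0.5 * H(q) - 0.5 * H(r)        # expression in terms of entropies
print(np.isclose(js_kl, js_entropy))               # True
```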
3. Product Sample Space
This section uses as a factorial sample space. For each , the margins are and . Under the mean-field assumption, the model equals the tensor product of the margins,
The velocities are, respectively,
Below, we will discuss the optimality of a mean-field approximation.
3.1. Product Sample Space: Marginalization
The (first) marginalization is
Proposition 3.
The derivative of the marginalization Equation (30) is
Proof.
In the mixture chart centered at and , respectively, the expression of the marginalization is
Note that the expression in Equation (31) is linear. Hence, the derivative at v in the direction h is with so that the bundle derivative is
□
There is an interesting relation between conditional expectation and mixture transport. The conditional expectation commutes with the mixture transports,
It is a way to express Bayes’ theorem for conditional expectations. For all ,
3.2. Product Sample Space: Mean-Field Approximation
The derivative of the joint marginalization
follows from the derivative of the marginalization in Equation (29).
Proposition 4.
Proof.
Compose the partial derivatives with the mapping □
The decomposition of the velocity according to Equation (32) improves on that of Equations (28) and (29) and provides a definition of the mean-field approximation. In the language of the ANOVA decomposition of statistical interactions, the derivative part in Equation (32) is the sum of the simple effects of the velocity,
where , , and the last term is the interaction, the q-orthogonal residual. See [] for a discussion of the ANOVA decomposition in the context of the statistical bundle.
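The following Python sketch illustrates the decomposition, under the simplifying assumption that q is itself a product probability: the simple effects are the conditional expectations along each coordinate, and the residual interaction is q-orthogonal to every random variable depending on a single coordinate.

```python
import numpy as np

rng = np.random.default_rng(8)
nx, ny = 3, 4
q1 = rng.random(nx) + 0.1; q1 /= q1.sum()          # first margin
q2 = rng.random(ny) + 0.1; q2 /= q2.sum()          # second margin
q = np.outer(q1, q2)                               # product (mean-field) probability on the grid

h = rng.standard_normal((nx, ny))
h -= np.sum(q * h)                                 # a q-centered random variable (a velocity)

# Simple effects: conditional expectations along each coordinate
h1 = (q * h).sum(axis=1) / q1                      # E_q[h | x], a function of x only
h2 = (q * h).sum(axis=0) / q2                      # E_q[h | y], a function of y only
h12 = h - h1[:, None] - h2[None, :]                # interaction: the residual term

# The residual is q-orthogonal to any function of x alone and to any function of y alone
gx, gy = rng.standard_normal(nx), rng.standard_normal(ny)
print(np.isclose(np.sum(q * h12 * gx[:, None]), 0.0, atol=1e-10))   # True
print(np.isclose(np.sum(q * h12 * gy[None, :]), 0.0, atol=1e-10))   # True
```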
The equation for the total natural gradient of the KL-divergence and the computation of the derivative above provide the natural gradients of the divergence between the joint probability function and the mean-field approximation. In information theory [], the KL-divergence in Equation (34) is called mutual information.
Proposition 5.
The natural gradients of the divergences between a joint distribution r and its mean-field approximation are
The conditional terms in Equation (33) depend on the mean-field model; hence, we could express them as a disintegration of r. For example,
where the last term is the conditional entropy under r.
Proof of Equation (33).
We find the natural gradient of by computing with Equations (20) and (32) the variation along a smooth curve such that and . It holds that
We want to present the first term of the RHS as an inner product at r applied to . Let us push the inner product from to r with Equation (5). It holds that
The last equality follows from
□
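A compact numerical illustration of the objects in Proposition 5 follows, with names chosen for the example: the divergence between a joint probability function r and its mean-field approximation is the mutual information, which, by the classical identity, also equals the sum of the marginal entropies minus the joint entropy.

```python
import numpy as np

rng = np.random.default_rng(9)
nx, ny = 3, 4
r = rng.random((nx, ny)) + 0.1; r /= r.sum()       # a joint probability on the product space

r1, r2 = r.sum(axis=1), r.sum(axis=0)              # the two margins
mean_field = np.outer(r1, r2)                      # mean-field approximation of r

KL = lambda a, b: np.sum(a * np.log(a / b))
H = lambda a: -np.sum(a * np.log(a))

mutual_information = KL(r, mean_field)             # divergence from the mean-field approximation
print(np.isclose(mutual_information, H(r1) + H(r2) - H(r)))   # classical identity: True
```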
3.3. Product Sample Space: Kantorovich and Schrödinger
If denotes the joint marginalization, the set of transport plans with margins and is
Here, we deal with a classical topic with considerable literature. We mention only the monograph of ref. [] and, from the Information Geometry perspective, ref. [,].
Let us consider first the Kantorovich problem. Given the cost function (i.e., potential function)
and a curve , we want to minimize the cost
As for all ,
so that . The velocity of a curve in the transport plans is an interaction. Now, the derivative of the cost is
From the interaction property of , it follows that if the ANOVA decomposition
holds, then
The Schrödinger problem is similar. Given the cost function (i.e., potential function)
consider the exponential perturbation of the mean-field probability function
The parameter is called temperature, and the normalizing constant is
The gradient of is, from Equations (35) and (34),
Only the interaction part is relevant in the constrained problem , and pairing with an interaction annihilates the two conditional-expectation terms, which leaves
We refer to [] for a method to compute the interaction part of a random variable.
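A minimal Python sketch of the exponential perturbation of the mean-field probability follows, with illustrative names (q1, q2, c, eps) and a temperature eps chosen here only for the example; the tilt does not, in general, belong to the set of transport plans, and the classical IPF/Sinkhorn scaling (a standard tool, not discussed in the text) is used to recover a plan with the prescribed margins.

```python
import numpy as np

rng = np.random.default_rng(10)
nx, ny = 4, 5
q1 = rng.random(nx) + 0.1; q1 /= q1.sum()          # prescribed first margin
q2 = rng.random(ny) + 0.1; q2 /= q2.sum()          # prescribed second margin
c = rng.random((nx, ny))                           # cost (potential) function on the product space
eps = 0.5                                          # temperature parameter

# Exponential perturbation of the mean-field probability
k = np.outer(q1, q2) * np.exp(-c / eps)
q_eps = k / k.sum()
print(np.abs(q_eps.sum(axis=1) - q1).max())        # the tilt violates the margin constraints

# Classical IPF/Sinkhorn scaling projects the tilt back onto the set of transport plans
a, b = np.ones(nx), np.ones(ny)
for _ in range(500):
    a = q1 / (k @ b)
    b = q2 / (k.T @ a)
plan = a[:, None] * k * b[None, :]
print(np.allclose(plan.sum(axis=1), q1), np.allclose(plan.sum(axis=0), q2))   # True True
```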
3.4. Product Sample Space: Conditional Probability Function
When the sample space is a product, , we can represent each probability function in the maximal exponential model via conditioning on one margin,
The two representations are
Following the approach of [], ([], Ch. 11), we look at the transition mapping
as a family of probability functions representing alternative probability models. The other transition mapping
is the discriminator, that is, is the probability that the sample x comes from .
The right-to-left second mapping in Equation (39),
maps the vector of the 1-margin and the set of alternative probability functions to the joint probability function. The kinematics of Equation (40), that is, the computation of velocities, is
Hence, the total derivative of B is
The transposed total derivative is defined by
that is,
In conclusion, the transposed total derivative is
It is also interesting to compute the derivative in the mixture atlas. The mixture expression of B with respect to and is
and the total derivative in the directions , is
The push-back of the total derivative expression to the statistical bundles uses the equations
to obtain
In our affine language, we repeat the computations of []. In particular, we derive the natural gradient of a composite function via the equation
We have
The first component of is
so that . The x-component is
so that .
We now assume a target probability function and consider the probability function on the product sample space where all the model probability functions equal g and the discriminator is uniform.
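A small Python sketch of the two factorizations follows, with illustrative names (mu for the margin on the model index, f for the family of alternative models): the map B builds the joint probability function from the margin and the conditionals, and the discriminator is the reverse conditional obtained by Bayes’ formula.

```python
import numpy as np

rng = np.random.default_rng(11)
ny, nx = 3, 5
mu = rng.random(ny) + 0.1; mu /= mu.sum()          # margin on the model index y
f = rng.random((ny, nx)) + 0.1
f /= f.sum(axis=1, keepdims=True)                  # family of alternative models f(x | y)

# The map B: (margin, family of conditionals) -> joint probability on the product space
r = mu[:, None] * f                                # r(y, x) = mu(y) f(x | y)
print(np.isclose(r.sum(), 1.0))                    # True: a probability function on the product

# The discriminator: probability that the sample x was produced by the model indexed by y
rx = r.sum(axis=0)                                 # x-margin of the joint
d = r / rx[None, :]                                # d(y | x), Bayes' formula
print(np.allclose(d.sum(axis=0), 1.0))             # True: each column is a probability on y
print(np.allclose(r, d * rx[None, :]))             # True: r(y, x) = d(y | x) r_X(x)
```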
3.5. Variational Bayes
We revisit and develop some computations of ([], § 2.2). We keep the same notation as above so that Bayes’ formula is
where x is a sample value and y is a latent variable value.
For a fixed , we look for an r in some model in order to approximate the conditional . If satisfies
then
The so-called variational lower bound follows from ,
for all . The bound is exact because for all y if, and only if, . The lower-bound variation along a curve is
If the model is an exponential tilting of the margin ,
where is a vector parameter, and u is the vector of sufficient statistics of the exponential family with , then the velocity in Equation (43) becomes
and the gradient in Equation (43) becomes
In conclusion, the gradient flow equation for the maximization of the lower bound under the model is
As a sanity check, assume that the model is exact for the given x,
The solution of the gradient flow Equation (44) requires the ability to compute the covariance for the current model distribution. We do not discuss the numerical and simulation issues related to the implementation here.
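To make the sanity check concrete, here is a minimal finite-state Python sketch with illustrative names (prior, lik, u, eta), assuming that the natural gradient of the lower bound in the parameter is the covariance of the sufficient statistics with log(g(x, ·)/r), preconditioned by the Fisher matrix: the model is the exponential tilting of the prior with saturated sufficient statistics, so the exact conditional belongs to the model, and the discretized natural-gradient flow recovers it; the covariance required by the flow is computed exactly because the state space is finite.

```python
import numpy as np

rng = np.random.default_rng(12)
ny, nx = 4, 6
prior = rng.random(ny) + 0.1; prior /= prior.sum()        # prior on the latent variable y
lik = rng.random((ny, nx)) + 0.1
lik /= lik.sum(axis=1, keepdims=True)                     # likelihood f(x | y)
x = 2                                                     # the fixed observed sample value

g = prior * lik[:, x]                                     # joint g(x, y), as a function of y
posterior = g / g.sum()                                   # the conditional to be approximated

# Model: exponential tilting of the prior, r_theta(y) propto prior(y) exp(theta . u(y)),
# with saturated sufficient statistics u_j(y) = 1[y = j], j = 0, ..., ny - 2.
u = np.eye(ny)[:, :-1]

def r_of(theta):
    w = prior * np.exp(u @ theta)
    return w / w.sum()

theta, eta = np.zeros(ny - 1), 0.5                        # parameter and step size of the flow
for _ in range(200):
    r = r_of(theta)
    phi = np.log(g) - np.log(r)                           # integrand of the lower bound
    fisher = u.T @ (r[:, None] * u) - np.outer(u.T @ r, u.T @ r)   # Cov_r(u), the Fisher matrix
    grad = u.T @ (r * phi) - (u.T @ r) * np.sum(r * phi)           # Cov_r(u, phi)
    theta += eta * np.linalg.solve(fisher, grad)          # natural-gradient ascent step

print(np.allclose(r_of(theta), posterior, atol=1e-8))     # the model is exact, so we recover it: True
```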
4. Discussion
In this paper, we have shown how the dually affine formalism for the open probability simplex provides a system of affine charts in which the statistical notion of Fisher’s score becomes the moving-chart affine velocity, and the natural gradient becomes the gradient in the duality induced by the covariance. The construction applies to standard computations of interest for statistical machine learning. In particular, we have discussed the neat form of the total gradient of the KL-divergence and its applications in a factorial sample space, such as mean-field approximation and Bayes computations.
This approach is helpful because the unique features of the Fisherian approach to statistics, such as Fisher’s score, maximum likelihood, and Fisher’s information, are formalized as an affine calculus so that all the statistical tools are available in this more extensive theory. Moreover, this setting potentially unifies the formalisms of Statistics, Optimal Transport, and Statistical Physics, examples being the affine modeling of Optimal Transport [] and the second-order methods of optimization [].
We have not considered the implementation of the formal gradient flow equation as a practical learning algorithm. Such further development is currently outside the scope of this work; it would require the numerical analysis of the continuous equation and the search for sampling versions of the expectation operators. We hope this note will prompt further research. On the abstract side, topics worth studying include the case of a continuous sample space, as in [], and Gaussian models, as in [].
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
This study is purely methodological. No data were produced or used.
Acknowledgments
The Author acknowledges the partial support of de Castro Statistics, Collegio Carlo Alberto. He is a member of the non-profit organization Nuovo SEFIR and the UMI interest group AI&ML&MAT.
Conflicts of Interest
The Author declares no conflicts of interest.
References
- Efron, B.; Hastie, T. Computer Age Statistical Inference; Institute of Mathematical Statistics (IMS) Monographs; Algorithms, evidence, and data science; Cambridge University Press: New York, NY, USA, 2016; Volume 5, pp. xix+475. [Google Scholar]
- Amari, S. Geometry of Semiparametric Models and Applications. Invited Papers Meeting IP64 Likelihood and Geometry. Organizer Preben F. Blaesild. In Proceedings of the 51st Session of the International Statistical Institute, Istanbul, Turkey, 18–26 August 1997. [Google Scholar]
- Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Translated from the 1993 Japanese original by Daishi Harada; American Mathematical Society: Providence, RI, USA, 2000; pp. x+206. [Google Scholar]
- Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016; Volume 194, pp. xiii+374. [Google Scholar]
- Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Statist. 1995, 23, 1543–1561. [Google Scholar] [CrossRef]
- Chirco, G.; Pistone, G. Dually affine Information Geometry modeled on a Banach space. arXiv 2022, arXiv:2204.00917. [Google Scholar] [CrossRef]
- Weyl, H. Space- Time- Matter; Translation of the 1921 RAUM ZEIT MATERIE; Dover: New York, NY, USA, 1952. [Google Scholar]
- Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
- Amari, S.I.; Karakida, R.; Oizumi, M. Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem. Inf. Geom. 2018, 1, 13–37. [Google Scholar] [CrossRef]
- Peyré, G.; Cuturi, M. Computational Optimal Transport. Found. Trends Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Khan, M.E.; Rue, H.V. The Bayesian learning rule. J. Mach. Learn. Res. 2023, 24, 46. [Google Scholar]
- Musielak, J. Orlicz Spaces and Modular Spaces; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1983; Volume 1034, pp. iii+222. [Google Scholar]
- Pistone, G. Information geometry of the Gaussian space. In Information Geometry and Its Applications; Springer: Cham, Switzerland, 2018; Volume 252, pp. 119–155. [Google Scholar]
- Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: Berlin/Heidelberg, Germany, 1995; Volume 160, pp. xiv+364. [Google Scholar]
- Pistone, G. Information Geometry of the Probability Simplex: A Short Course. Nonlinear Phenom. Complex Syst. 2020, 23, 221–242. [Google Scholar] [CrossRef]
- Landau, L.D.; Lifshits, E.M. Course of Theoretical Physics, 3rd ed.; Statistical Physics; Butterworth-Heinemann: Oxford, UK, 1980; Volume V. [Google Scholar]
- Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes; Monograph Series; Institute of Mathematical Statistics: Ann Arbor, MI, USA, 1986; pp. x+283. [Google Scholar]
- do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Translated from the Second Portuguese Edition by Francis Flaherty; Birkhäuser Boston Inc.: Berlin, Germany, 1992; pp. xiv+300. [Google Scholar]
- Pistone, G. Statistical Bundle of the Transport Model. In Geometric Science of Information; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 752–759. [Google Scholar] [CrossRef]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; pp. xxiv+748. [Google Scholar]
- Ay, N. Information geometry of the Otto metric. Inf. Geom. 2024, 1–24. [Google Scholar] [CrossRef]
- Chirco, G.; Malagò, L.; Pistone, G. Lagrangian and Hamiltonian dynamics for probabilities on the statistical bundle. Int. J. Geom. Methods Mod. Phys. 2022, 19, 2250214. [Google Scholar] [CrossRef]
- Malagò, L.; Montrucchio, L.; Pistone, G. Wasserstein Riemannian geometry of Gaussian densities. Inf. Geom. 2018, 1, 137–179. [Google Scholar] [CrossRef]