# Natural Gradient Flow in the Mixture Geometry of a Discrete Exponential Family


## Abstract


## 1. Introduction

A mixture curve is the segment $p_t = (1-t)p_0 + t p_1$. A proper definition of the exponential and mixture geometry, where probability densities are considered points, requires the definition of the proper tangent space to hold the vectors representing the velocity of a curve. In both cases, the tangent space $T_p$ at a point $p$ is a space of random variables $V$ with zero expected value, $\mathrm{E}_p[V] = 0$. On the tangent space $T_p$, a natural scalar product is defined, $\langle U, V\rangle_p = \mathrm{E}_p[UV]$, so that a pseudo-Riemannian structure is available. Note that the Riemannian structure is a third geometry, different from both the exponential and the mixture geometries. Note also that both the expected value and the covariance can be naturally extended to be defined on ${\mathcal{P}}^{1}$.

Given a parameterization $\boldsymbol{\xi} \mapsto p_{\boldsymbol{\xi}}$ of $\mathcal{M}$, the parametric expression of the relaxed function is $\widehat{F}(\boldsymbol{\xi}) = \mathrm{E}_{p_{\boldsymbol{\xi}}}[f]$. Under integrability and differentiability conditions on both $\boldsymbol{\xi} \mapsto p_{\boldsymbol{\xi}}$ and $\boldsymbol{x} \mapsto f(\boldsymbol{x})$, $\widehat{F}$ is differentiable, with $\partial_j \widehat{F}(\boldsymbol{\xi}) = \mathrm{E}_{p_{\boldsymbol{\xi}}}[\partial_j \log(p_{\boldsymbol{\xi}})\, f]$ and $\mathrm{E}_{p_{\boldsymbol{\xi}}}[\partial_j \log(p_{\boldsymbol{\xi}})] = 0$; see [1,8]. In order to properly describe the gradient flow of a relaxed random variable, these classical computations are better cast in the formal language of information geometry (see [9]) and, even better, in the language of non-parametric differential geometry [10] that was used in [11]. The previous computations suggest taking the Fisher score $\partial_j \log(p_{\boldsymbol{\xi}})$ as the definition of the tangent vector of the $j$-th coordinate curve. While the development of this analogy in the finite state space case does not require a special setup, in the non-finite state space case, some care has to be taken.
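The score identity above can be checked directly on a toy family. The following is a sketch, not the paper's code; it assumes the one-parameter family $p_\theta(x) \propto e^{\theta x}$ on $\Omega = \{0, 1\}$ and an arbitrary $f$, and compares a numerical derivative of $\widehat{F}$ with $\mathrm{E}_{p_\theta}[\partial_\theta \log(p_\theta)\, f]$.

```python
import numpy as np

# Verify d/dθ E_pθ[f] = E_pθ[∂θ log pθ · f] for pθ(x) ∝ exp(θx) on Ω = {0, 1}.
# Here ∂θ log pθ(x) = x − ∇ψ(θ) = x − E[x], the Fisher score.

def p_theta(theta):
    w = np.exp(theta * np.array([0.0, 1.0]))
    return w / w.sum()

def relaxed_F(theta, f):
    return p_theta(theta) @ f

f = np.array([2.0, -1.0])          # an arbitrary function f: Ω → R
theta = 0.3

# Left-hand side: central-difference derivative of the relaxed function
eps = 1e-6
lhs = (relaxed_F(theta + eps, f) - relaxed_F(theta - eps, f)) / (2 * eps)

# Right-hand side: E[(x − E[x]) f], the score times f
p = p_theta(theta)
x = np.array([0.0, 1.0])
score = x - p @ x                  # ∂θ log pθ(x) = x − E[x]
rhs = p @ (score * f)

assert abs(lhs - rhs) < 1e-6
assert abs(p @ score) < 1e-12      # E[score] = 0, as stated in the text
```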

## 2. Gradient Flow of Relaxed Optimization

Let $\Omega$ be a finite state space with points $\boldsymbol{x} = (x_1, \dots, x_n)$, and let $\mu$ be the counting measure on $\Omega$. In this case, a density $p \in \mathcal{P}_{\ge}$ is a probability function, i.e., $p: \Omega \to \mathbb{R}_+$, such that $\sum_{x\in\Omega} p(\boldsymbol{x}) = 1$.

Let $\mathcal{B} = \{T_1, \dots, T_d\}$ be such that $c_1 T_1 + \dots + c_d T_d$ is constant only if $c_1 = \dots = c_d = 0$; for instance, consider $\mathcal{B}$ such that $\sum_{x\in\Omega} T_j(\boldsymbol{x}) = 0$, $j = 1,\dots,d$, and $\mathcal{B}$ is a linear basis. We say that $\mathcal{B}$ is a set of affinely independent random variables. If $\mathcal{B}$ is a linear basis, it is affinely independent if and only if $\{1, T_1, \dots, T_d\}$ is a linear basis.

The exponential family with sufficient statistics $\mathcal{B}$ consists of the densities $p_{\boldsymbol{\theta}}$ parameterized by $\boldsymbol{\theta}$, namely $p_{\boldsymbol{\theta}} \propto \exp\left(\sum_{j=1}^d \theta_j T_j\right)$. The gradient $\nabla\psi$ of the cumulant generating function gives the expected value of the sufficient statistics $\boldsymbol{T}$; in particular, the expected value at $p_{\boldsymbol{\theta}}$ is $\nabla\psi(\boldsymbol{\theta})$, which ranges in the interior of the marginal polytope $M = \mathrm{Conv}\,\{\boldsymbol{T}(\boldsymbol{x}) \mid \boldsymbol{x}\in\Omega\}$. Note that no extra condition is required, because on a finite state space, all random variables are bounded. Nonetheless, even in this case, the proof is not trivial; see [13].

For each fixed $\boldsymbol{\eta}$, the function $\boldsymbol{\theta} \mapsto \boldsymbol{\eta}\cdot\boldsymbol{\theta} - \psi(\boldsymbol{\theta})$ has derivative mapping $\boldsymbol{\theta} \mapsto \boldsymbol{\eta} - \nabla\psi(\boldsymbol{\theta})$, and the equation $\boldsymbol{\eta} = \nabla\psi(\boldsymbol{\theta})$ has a solution if and only if $\boldsymbol{\eta}$ belongs to the interior $M^\circ$ of the marginal polytope. The restriction $\varphi = \psi_*|_{M^\circ}$ is the Legendre conjugate of $\psi$, and it is computed by:

The inverse gradient mapping $(\nabla\psi)^{-1} = \nabla\varphi$ provides an alternative parameterization of $\mathcal{E}$ with the so-called expectation or mixture parameter $\boldsymbol{\eta} = \nabla\psi(\boldsymbol{\theta})$. In the $\boldsymbol{\theta}$ parameters, the entropy is $H(p_{\boldsymbol{\theta}}) = \psi(\boldsymbol{\theta}) - \boldsymbol{\theta}\cdot\nabla\psi(\boldsymbol{\theta})$; in the $\boldsymbol{\eta}$ parameters, the function $\varphi$ gives the negative entropy: $-H(p_{\boldsymbol{\eta}}) = \mathrm{E}_{p_{\boldsymbol{\eta}}}[\log p_{\boldsymbol{\eta}}] = \varphi(\boldsymbol{\eta})$.

**Proposition 1.**

- $\mathrm{Hess}\,\varphi(\boldsymbol{\eta}) = (\mathrm{Hess}\,\psi(\boldsymbol{\theta}))^{-1}$ when $\boldsymbol{\eta} = \nabla\psi(\boldsymbol{\theta})$.
- The Fisher information matrix of the statistical model given by the exponential family in the $\boldsymbol{\theta}$ parameters is $I_e(\boldsymbol{\theta}) = \mathrm{Cov}_{p_{\boldsymbol{\theta}}}(\nabla\log p_{\boldsymbol{\theta}}, \nabla\log p_{\boldsymbol{\theta}}) = \mathrm{Hess}\,\psi(\boldsymbol{\theta})$.
- The Fisher information matrix of the statistical model given by the exponential family in the $\boldsymbol{\eta}$ parameters is $I_m(\boldsymbol{\eta}) = \mathrm{Cov}_{p_{\boldsymbol{\eta}}}(\nabla\log p_{\boldsymbol{\eta}}, \nabla\log p_{\boldsymbol{\eta}}) = \mathrm{Hess}\,\varphi(\boldsymbol{\eta})$.

**Proof.** Differentiation of the equality $\nabla\varphi = (\nabla\psi)^{-1}$ gives the first item. The second item is a property of the cumulant generating function $\psi$. The third item follows from Equation (1). □
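Proposition 1 and the entropy identity above can be confirmed numerically on a toy family. The sketch below (our illustration, not the paper's code) assumes the family $p_{\boldsymbol{\theta}} \propto \exp(\boldsymbol{\theta}\cdot\boldsymbol{T})$ on $\Omega = \{0, 1, 2\}$ with statistics $T_1 = x$, $T_2 = x^2$, and checks that $\mathrm{Hess}\,\psi(\boldsymbol{\theta}) = \mathrm{Cov}_p(\boldsymbol{T}, \boldsymbol{T})$ and $H(p_{\boldsymbol{\theta}}) = \psi(\boldsymbol{\theta}) - \boldsymbol{\theta}\cdot\nabla\psi(\boldsymbol{\theta})$.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0])
T = np.stack([X, X**2])                  # sufficient statistics, shape (2, 3)

def psi(theta):
    return np.log(np.exp(theta @ T).sum())   # cumulant generating function

def p_theta(theta):
    w = np.exp(theta @ T)
    return w / w.sum()

def hess_psi(theta, eps=1e-5):
    # Second-order central differences of ψ
    d = len(theta)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.eye(d)[i] * eps
            ej = np.eye(d)[j] * eps
            H[i, j] = (psi(theta + ei + ej) - psi(theta + ei - ej)
                       - psi(theta - ei + ej) + psi(theta - ei - ej)) / (4 * eps**2)
    return H

theta = np.array([0.2, -0.1])
p = p_theta(theta)
eta = T @ p                              # expectation parameters η = ∇ψ(θ)
cov_T = (T * p) @ T.T - np.outer(eta, eta)   # Fisher information I(p) = Cov(T, T)

assert np.allclose(hess_psi(theta), cov_T, atol=1e-4)   # Hess ψ(θ) = Cov(T, T)
entropy = -(p @ np.log(p))
assert np.isclose(entropy, psi(theta) - theta @ eta)    # H = ψ(θ) − θ·∇ψ(θ)
```

By Proposition 1, $\mathrm{Hess}\,\varphi(\boldsymbol{\eta})$ is then the inverse of the matrix `cov_T` computed above.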

#### 2.1. Statistical Manifold

The exponential family $\mathcal{E}$ is a manifold with either the $\boldsymbol{\theta}$ or the $\boldsymbol{\eta}$ parameterization, named respectively the exponential or mixture parameterization. We discuss now the proper definition of the tangent bundle $T\mathcal{E}$.

**Definition 1** (Velocity). If $I \ni t \mapsto p_t$, $I$ an open interval, is a differentiable curve in $\mathcal{E}$, then its velocity vector is identified with its Fisher score:

**Definition 2** (Tangent space). In the expression of the curve by the exponential parameters, the velocity is a linear combination of the sufficient statistics centered at $p_t$. As a consequence, we identify the tangent space at each $p \in \mathcal{E}$ with the vector space of centered sufficient statistics, that is, $T_p\mathcal{E} = \mathrm{Span}(T_j - \mathrm{E}_p[T_j] \mid j = 1,\dots,d)$. In the mixture parameters, the velocity is $\dot{\boldsymbol{\eta}}(t)\cdot\mathrm{Hess}\,\varphi(\boldsymbol{\eta}(t))\,(\boldsymbol{T} - \boldsymbol{\eta}(t))$, so that $\mathrm{Hess}\,\varphi(\boldsymbol{\eta}(t))(\boldsymbol{T} - \boldsymbol{\eta}(t))$ is the basis of velocities along the $\boldsymbol{\eta}$ coordinates.

**Definition 3** (Tangent bundle $T\mathcal{E}$). The tangent space at each $p\in\mathcal{E}$ is the vector space of random variables $T_p\mathcal{E} = \mathrm{Span}(T_j - \mathrm{E}_p[T_j] \mid j = 1,\dots,d)$, and the tangent bundle $T\mathcal{E} = \{(p, V) \mid p\in\mathcal{E},\ V\in T_p\mathcal{E}\}$, as a manifold, is defined by the chart:

**Proposition 2.**

- If $V = \boldsymbol{v}\cdot(\boldsymbol{T}-\boldsymbol{\eta}) \in T_{p_{\boldsymbol{\eta}}}\mathcal{E}$, then $V$ is represented in the conjugate basis as: $$V = \boldsymbol{v}\cdot(\boldsymbol{T}-\boldsymbol{\eta}) = \boldsymbol{v}\cdot(\mathrm{Hess}\,\varphi(\boldsymbol{\eta}))^{-1}\,\mathrm{Hess}\,\varphi(\boldsymbol{\eta})(\boldsymbol{T}-\boldsymbol{\eta}) = \left(\mathrm{Hess}\,\varphi(\boldsymbol{\eta})^{-1}\boldsymbol{v}\right)\cdot\mathrm{Hess}\,\varphi(\boldsymbol{\eta})(\boldsymbol{T}-\boldsymbol{\eta}).$$
- The mapping $(\mathrm{Hess}\,\varphi(\boldsymbol{\eta}))^{-1}$ maps the coordinates $\boldsymbol{v}$ of a tangent vector $V\in T_{p_{\boldsymbol{\eta}}}\mathcal{E}$ with respect to the basis of centered sufficient statistics to the coordinates $\boldsymbol{v}^*$ with respect to the conjugate basis.
- In the $\boldsymbol{\theta}$ parameters, the transformation is $\boldsymbol{v} \mapsto \boldsymbol{v}^* = \mathrm{Hess}\,\psi(\boldsymbol{\theta})\,\boldsymbol{v}$.

**Remark 1.**In the finite state space case, it is not necessary to go on to the formal construction of a dual tangent bundle, because all finite dimensional vector spaces are isomorphic. However, this step is compulsory in the infinite state space case, as was done in [1]. Moreover, the explicit construction of natural connections and natural parallel transports of the tangent and dual tangent bundle is unavoidable when considering the second-order calculus, as was done in [1,8], in order to compute Hessians and implement Newton methods of optimization. However, the scope of the present paper is restricted to a basic study of gradient flows; hence, from now on, we focus on the Riemannian structure and disregard all second-order topics.

**Proposition 3** (Riemannian metric). The tangent bundle has a Riemannian structure given by the natural scalar product on each $T_p\mathcal{E}$, $\langle V, W\rangle_p = \mathrm{E}_p[VW]$. In the basis of sufficient statistics, the metric is expressed by the Fisher information matrix $I(p) = \mathrm{Cov}_p(\boldsymbol{T}, \boldsymbol{T})$, while in the conjugate basis, it is expressed by the inverse Fisher matrix $I^{-1}(p)$.

**Proof.** In the basis of the sufficient statistics, $V = \boldsymbol{v}\cdot(\boldsymbol{T} - \mathrm{E}_p[\boldsymbol{T}])$ and $W = \boldsymbol{w}\cdot(\boldsymbol{T} - \mathrm{E}_p[\boldsymbol{T}])$, so that $\langle V, W\rangle_p = \boldsymbol{v}\cdot\mathrm{Cov}_p(\boldsymbol{T},\boldsymbol{T})\,\boldsymbol{w}$, where $\mathrm{Cov}_p(\boldsymbol{T},\boldsymbol{T})$ is the Fisher information matrix.

At $p = p_{\boldsymbol{\theta}} = p_{\boldsymbol{\eta}}$, the conjugate basis at $p$ is $I^{-1}(p)(\boldsymbol{T} - \mathrm{E}_p[\boldsymbol{T}])$. In the conjugate basis, $V = \boldsymbol{v}^*\cdot I^{-1}(p)(\boldsymbol{T} - \mathrm{E}_p[\boldsymbol{T}])$ and $W = \boldsymbol{w}^*\cdot I^{-1}(p)(\boldsymbol{T} - \mathrm{E}_p[\boldsymbol{T}])$; thus, $\langle V, W\rangle_p = \boldsymbol{v}^*\cdot I^{-1}(p)\,\boldsymbol{w}^*$. □

#### 2.2. Gradient

Given a $C^1$ real function $F:\mathcal{E}\to\mathbb{R}$, its gradient is defined by taking the derivative along a $C^1$ curve $I\ni t\mapsto p(t)$, $p = p(0)$, and writing it with the Riemannian metric. If $\widehat{F}$ is the expression of $F$ in the parameter $\boldsymbol{\theta}$ and $t\mapsto\boldsymbol{\theta}(t)$ is the expression of the curve, then $\frac{d}{dt}\widehat{F}(\boldsymbol{\theta}(t)) = \nabla\widehat{F}(\boldsymbol{\theta}(t))\cdot\dot{\boldsymbol{\theta}}(t)$, so that at $p = p_{\boldsymbol{\theta}(0)}$, with velocity $V = \frac{D}{dt}p(t)|_{t=0} = \dot{\boldsymbol{\theta}}(0)\cdot(\boldsymbol{T}-\nabla\psi(\boldsymbol{\theta}(0)))$, we obtain Amari's celebrated natural gradient [16]:

If $\stackrel{\u2323}{F}$ is the expression of $F$ in the parameter $\boldsymbol{\eta}$ and $t\mapsto\boldsymbol{\eta}(t)$ is the expression of the curve, then $\frac{d}{dt}\stackrel{\u2323}{F}(\boldsymbol{\eta}(t)) = \nabla\stackrel{\u2323}{F}(\boldsymbol{\eta}(t))\cdot\dot{\boldsymbol{\eta}}(t)$, so that at $p = p_{\boldsymbol{\eta}(0)}$, with velocity $V = \frac{d}{dt}\log(p(t))|_{t=0} = \dot{\boldsymbol{\eta}}(0)\cdot\mathrm{Hess}\,\varphi(\boldsymbol{\eta}(0))(\boldsymbol{T}-\boldsymbol{\eta}(0))$,

**Definition 4** (Gradients).

- The random variable $\nabla F(p)$, uniquely defined by Equation (9), is called the (geometric) gradient of $F$ at $p$. The mapping $\nabla F:\mathcal{E}\ni p\mapsto\nabla F(p)$ is a vector field of $T\mathcal{E}$.
- The vector $\tilde{\nabla}\widehat{F}(\boldsymbol{\theta}) = \mathrm{Hess}\,\psi(\boldsymbol{\theta})^{-1}\nabla\widehat{F}(\boldsymbol{\theta})$ of Equation (10) is the expression of the geometric gradient in the $\boldsymbol{\theta}$ parameters and in the basis of sufficient statistics, and it is called the natural gradient, while $\nabla\widehat{F}(\boldsymbol{\theta})$, which is the expression in the conjugate basis of the sufficient statistics, is called the vanilla gradient.
- The vector $\tilde{\nabla}\stackrel{\u2323}{F}(\boldsymbol{\eta}) = \mathrm{Hess}\,\varphi(\boldsymbol{\eta})^{-1}\nabla\stackrel{\u2323}{F}(\boldsymbol{\eta})$ of Equation (10) is the expression of the geometric gradient in the $\boldsymbol{\eta}$ parameters and in the conjugate basis of sufficient statistics, and it is called the natural gradient, while $\nabla\stackrel{\u2323}{F}(\boldsymbol{\eta})$, which is the expression in the basis of sufficient statistics, is called the vanilla gradient.

Given a vector field $G:\mathcal{E}\ni p\mapsto G(p)\in T_p\mathcal{E}$, which is called a section of the tangent bundle in standard differential geometric language, an integral curve from $p$ is a curve $I\ni t\mapsto p(t)$ such that $p(0) = p$ and $\frac{D}{dt}p(t) = G(p(t))$. In the $\boldsymbol{\theta}$ parameters, $G(p_{\boldsymbol{\theta}}) = \widehat{\boldsymbol{G}}(\boldsymbol{\theta})\cdot(\boldsymbol{T}-\nabla\psi(\boldsymbol{\theta}))$, so that the differential equation is expressed by $\dot{\boldsymbol{\theta}}(t) = \widehat{\boldsymbol{G}}(\boldsymbol{\theta}(t))$. In the $\boldsymbol{\eta}$ parameters, $G(p_{\boldsymbol{\eta}}) = \stackrel{\u2323}{\boldsymbol{G}}(\boldsymbol{\eta})\cdot\mathrm{Hess}\,\varphi(\boldsymbol{\eta})(\boldsymbol{T}-\boldsymbol{\eta})$, and the differential equation is $\dot{\boldsymbol{\eta}}(t) = \stackrel{\u2323}{\boldsymbol{G}}(\boldsymbol{\eta}(t))$.

**Definition 5** (Gradient flow). The gradient flow of the real function $F:\mathcal{E}\to\mathbb{R}$ is the flow of the differential equation $\frac{D}{dt}p(t) = \nabla F(p(t))$, i.e., $\frac{d}{dt}p(t) = p(t)\,\nabla F(p(t))$. The expression in the $\boldsymbol{\theta}$ parameters is $\dot{\boldsymbol{\theta}}(t) = \tilde{\nabla}\widehat{F}(\boldsymbol{\theta}(t))$, and the expression in the $\boldsymbol{\eta}$ parameters is $\dot{\boldsymbol{\eta}}(t) = \tilde{\nabla}\stackrel{\u2323}{F}(\boldsymbol{\eta}(t))$.
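A gradient flow of this kind can be integrated numerically. The sketch below is an illustration under stated assumptions (not the paper's code): the Bernoulli-type family $p_\theta(x)\propto e^{\theta x}$ on $\{0, 1\}$, the relaxed objective $F(p) = \mathrm{E}_p[f]$ to be maximized, and an explicit Euler discretization of $\dot\theta = \mathrm{Hess}\,\psi(\theta)^{-1}\,\mathrm{Cov}_\theta(f, T)$.

```python
import numpy as np

f = np.array([0.0, 1.0])           # maximize E[f]: the optimum concentrates on x = 1
x = np.array([0.0, 1.0])           # sufficient statistic T = x

def flow_step(theta, dt=0.5):
    # One Euler step of the natural-gradient flow in the θ parameter
    w = np.exp(theta * x)
    p = w / w.sum()
    m = p @ x                                  # η = ∇ψ(θ)
    var = p @ (x - m)**2                       # Hess ψ(θ), scalar here
    cov_fT = p @ ((f - p @ f) * (x - m))       # vanilla gradient Cov(f, T)
    return theta + dt * cov_fT / var           # natural-gradient ascent step

theta = 0.0
for _ in range(200):
    theta = flow_step(theta)

p = np.exp(theta * x)
p /= p.sum()
assert p[1] > 0.99   # the flow drives the probability mass onto the maximizer
```

Note the behavior anticipated in Remark 2 below the relaxation results: the limit point of the flow is a boundary density with defective support.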

Let $\boldsymbol{\zeta}\mapsto p_{\boldsymbol{\zeta}}$ be another parameterization of the exponential family. Consider the change of parametrization from $\boldsymbol{\zeta}$ to $\boldsymbol{\theta}$, $\boldsymbol{\theta} = \boldsymbol{\theta}(\boldsymbol{\zeta})$. We have:

The expression of the gradient in the $\boldsymbol{\zeta}$ coordinate basis of the tangent space $T_{p_{\boldsymbol{\zeta}}}\mathcal{E}$ consists of the components of the gradient with respect to $\boldsymbol{\zeta}$:

#### 2.3. Gradient Flow in the Mixture Geometry

Consider the gradient flow in the $\boldsymbol{\eta}$ parameters. From Definition 4, we have:

where $I(p) = \mathrm{Cov}_p(\boldsymbol{T},\boldsymbol{T})$. As $p\mapsto\mathrm{Cov}_p(\boldsymbol{T},\boldsymbol{T})$ is the restriction to the simplex of a quadratic function, while $p\mapsto\boldsymbol{\eta}$ is the restriction to the exponential family $\mathcal{E}$ of a linear function, in some cases, we can naturally consider the extension of the gradient flow equation outside $M^\circ$. One notable case is when the function $F$ is the relaxation of a non-constant state space function $f:\Omega\to\mathbb{R}$, as defined in, e.g., [3].

**Proposition 4.** Let $f:\Omega\to\mathbb{R}$, and let $F(p) = \mathrm{E}_p[f]$ be its relaxation on $p\in\mathcal{E}$. It follows that:

- $\nabla F(p)$ is the least-squares projection of $f$ onto $T_p\mathcal{E}$, that is: $$\nabla F(p) = I(p)^{-1}\,\mathrm{Cov}_p(f,\boldsymbol{T})\cdot(\boldsymbol{T}-\mathrm{E}_p[\boldsymbol{T}]).$$
- The expressions in the exponential parameters $\boldsymbol{\theta}$ are $\tilde{\nabla}\widehat{F}(\boldsymbol{\theta}) = (\mathrm{Hess}\,\psi(\boldsymbol{\theta}))^{-1}\mathrm{Cov}_{\boldsymbol{\theta}}(f,\boldsymbol{T})$ and $\nabla\widehat{F}(\boldsymbol{\theta}) = \mathrm{Cov}_{\boldsymbol{\theta}}(f,\boldsymbol{T})$, respectively.
- The expressions in the mixture parameters $\boldsymbol{\eta}$ are $\tilde{\nabla}\stackrel{\u2323}{F}(\boldsymbol{\eta}) = \mathrm{Cov}_{\boldsymbol{\eta}}(f,\boldsymbol{T})$ and $\nabla\stackrel{\u2323}{F}(\boldsymbol{\eta}) = \mathrm{Hess}\,\varphi(\boldsymbol{\eta})\,\mathrm{Cov}_{\boldsymbol{\eta}}(f,\boldsymbol{T})$, respectively.
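These covariance formulas are easy to verify numerically. The following sketch (a toy family, not the paper's code) checks, for $p_{\boldsymbol{\theta}}\propto\exp(\boldsymbol{\theta}\cdot\boldsymbol{T})$ on three points, that the vanilla gradient in $\boldsymbol{\theta}$ is $\mathrm{Cov}_{\boldsymbol{\theta}}(f,\boldsymbol{T})$ and that one $\boldsymbol{\theta}$-step $\tilde{\nabla}\widehat{F}$ moves $\boldsymbol{\eta}$ by $\mathrm{Hess}\,\psi\,\tilde{\nabla}\widehat{F} = \mathrm{Cov}(f,\boldsymbol{T})$, the natural gradient in $\boldsymbol{\eta}$.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0])
T = np.stack([X, X**2])                      # sufficient statistics

def density(theta):
    w = np.exp(theta @ T)
    return w / w.sum()

theta = np.array([0.1, -0.3])
p = density(theta)
f = np.array([1.0, -2.0, 0.5])

eta = T @ p
I = (T * p) @ T.T - np.outer(eta, eta)       # Fisher information = Hess ψ(θ)
cov_fT = T @ (p * f) - eta * (p @ f)         # Cov_p(f, T)

nat_grad_theta = np.linalg.solve(I, cov_fT)  # ∇̃F̂(θ) = (Hess ψ)^{-1} Cov(f, T)
# Since η = ∇ψ(θ), to first order dη = Hess ψ(θ) dθ, which recovers Cov(f, T):
assert np.allclose(I @ nat_grad_theta, cov_fT)

# Finite-difference cross-check of ∂F̂/∂θ_j = Cov(f, T_j):
eps = 1e-6
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    num = (density(theta + e) @ f - density(theta - e) @ f) / (2 * eps)
    assert abs(num - cov_fT[j]) < 1e-6
```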

**Proof.** On a generic curve through $p$ with velocity $V$, we have $\frac{d}{dt}\mathrm{E}_{p(t)}[f]|_{t=0} = \mathrm{Cov}_p(f, V) = \langle f, V\rangle_p$. If $V\in T_p\mathcal{E}$, we can orthogonally project $f$ to get $\langle\nabla F, V\rangle_p = \langle (I^{-1}(p)\,\mathrm{Cov}_p(f,\boldsymbol{T}))\cdot(\boldsymbol{T}-\mathrm{E}_p[\boldsymbol{T}]), V\rangle_p$. □

**Remark 2.** Let us briefly recall the behavior of the gradient flow in the relaxation case. Let $\boldsymbol{\theta}_n$, $n = 1, 2, \dots$, be a minimizing sequence for $\widehat{F}$, and let $\overline{p}$ be a limit point of the sequence $(p_{\boldsymbol{\theta}_n})_n$. It follows that $\overline{p}$ has a defective support; in particular, $\overline{p}\notin\mathcal{E}$; see [18,19]. For a proof along lines coherent with the present paper, see [20] (Theorem 1). It is found that the support $\underline{F}\subset\Omega$ is exposed, that is, $\boldsymbol{T}(\underline{F})$ is a face of the marginal polytope $M = \mathrm{Conv}\,\{\boldsymbol{T}(\boldsymbol{x})\mid\boldsymbol{x}\in\Omega\}$. In particular, $\mathrm{E}_{\overline{p}}[\boldsymbol{T}] = \overline{\boldsymbol{\eta}}$ belongs to a face of the marginal polytope $M$. If $\boldsymbol{a}$ is the (interior) orthogonal of the face, that is, $\boldsymbol{a}\cdot\boldsymbol{T}(\boldsymbol{x}) + b \ge 0$ for all $\boldsymbol{x}\in\Omega$ and $\boldsymbol{a}\cdot\boldsymbol{T}(\boldsymbol{x}) + b = 0$ on the exposed set, then $\boldsymbol{a}\cdot(\boldsymbol{T}(\boldsymbol{x})-\overline{\boldsymbol{\eta}}) = 0$ on the face, so that $\boldsymbol{a}\cdot\mathrm{Cov}_{\overline{p}}(f,\boldsymbol{T}) = 0$. If we extend the mapping $\boldsymbol{\eta}\mapsto\mathrm{Cov}_{\boldsymbol{\eta}}(f,\boldsymbol{T})$ on the closed marginal polytope $M$ to be the limit of the gradient vector field on the faces of the marginal polytope, we expect to see that such a vector field is tangent to the faces. This remark is further elaborated below in the binary case.

#### 2.4. The Saturated Model

Assume $\boldsymbol{x} = 0$ is a distinguished point of the finite state space $\Omega$. If $X$ is the identity on $\Omega$, we define the sufficient statistics to be the indicator functions of points, $T_j = (X = j)$, $j = 1,\dots,d$. The saturated exponential family consists of all of the positive densities, written as:

where $\eta_j = \mathrm{E}((X = j))$ is the probability of the case $\boldsymbol{x} = j$, and the marginal polytope is the probability simplex $\Delta_d$.

The exponential parameters are recovered from $\boldsymbol{\eta}\in\,]0, 1[^d$ by:

and the $\boldsymbol{\eta}$ parameterization (1) of the probability is:

**Remark 3.** The previous equation prompts three crucial remarks:

- The expression of the probability in the $\boldsymbol{\eta}$ parameters is a normalized monomial in the parameters.
- The expression continuously extends the exponential family to the probabilities in ${\mathcal{P}}_{\ge}$.
- The expression actually is a polynomial parameterization of the signed densities ${\mathcal{P}}^{1}$.

The matrix $\mathrm{Hess}\,\psi(\boldsymbol{\theta})$ is the Fisher information matrix $I(p)$ of the exponential family at $p = p_{\boldsymbol{\theta}}$, and the matrix $\mathrm{Hess}\,\varphi(\boldsymbol{\eta})$ is the inverse Fisher information matrix $I^{-1}(p)$ at $p = p_{\boldsymbol{\eta}}$. It follows that the natural gradient of a function $\boldsymbol{\eta}\mapsto h(\boldsymbol{\eta})$ will be:

**Proposition 5.**

- The inverse Fisher information matrix $I(p)^{-1}$ is zero on the vertices of the simplex, only.
- The determinant of the inverse Fisher information matrix $I(p)^{-1}$ is: $$\det(I(p)^{-1}) = \left(1 - \sum_{i=1}^{n} p_i\right)\prod_{i=1}^{n} p_i.$$
- The determinant of the inverse Fisher information matrix $I(p)^{-1}$ is zero on the borders of the simplex, only.
- On the interior of each facet, the rank of the inverse Fisher information matrix $I(p)^{-1}$ is $(n-1)$, and the $(n-1)$ linearly independent column vectors generate the subspace parallel to the facet itself.
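These properties can be checked numerically. The sketch below is ours, not the paper's code, and assumes (consistently with the determinant formula above) that the matrix in question is the saturated-model covariance $\mathrm{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^T$ in the mixture chart, with $p_0 = 1 - \sum_{i=1}^n p_i$ the probability of the distinguished point.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))[:3]          # (p1, p2, p3); p0 = 1 − Σ p_i > 0

M = np.diag(p) - np.outer(p, p)            # diag(p) − p pᵀ, the mixture-chart matrix
det_formula = (1 - p.sum()) * np.prod(p)   # (1 − Σ p_i) Π p_i
assert np.isclose(np.linalg.det(M), det_formula)

# On the interior of the facet p0 = 0 (i.e. Σ q_i = 1), the matrix is singular
# with rank n − 1, and its kernel is the direction orthogonal to the facet:
q = rng.dirichlet(np.ones(3))
Mq = np.diag(q) - np.outer(q, q)
assert np.isclose(np.linalg.det(Mq), 0.0, atol=1e-12)
assert np.linalg.matrix_rank(Mq) == 2
assert np.allclose(Mq @ np.ones(3), 0.0)   # the all-ones vector spans the kernel
```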

## 3. Toric Models: A Tutorial Example

Exponential families on a finite state space, such as a subset of $\mathbb{Z}^2$ or $\{+1, -1\}^d$, have special algebro-combinatorial features that fall under the name of algebraic statistics. Seminal papers have been [22,23]. Monographs on the topic are [24–26]. The book [27] covers both information geometry and algebraic statistics.

In the tutorial example, $T_1$ and $T_2$ are the sufficient statistics of the exponential family:

where $T_3$ is an integer-valued vector basis of the orthogonal space $\mathrm{Span}(1, T_1, T_2)^{\perp}$.

The state space is a finite subset of $\mathbb{Z}^2\supset\Omega$, $\Omega = \{(0, 0), (0, 1), (1, 0), (2, 1)\}$; see Figure 1. Given a finite subset of $\mathbb{R}^d$, it is a general algebraic fact that there exists a filtering set of monomial functions that is a vector basis of all real functions on the subset itself; see an exposition and the applications to statistics in [24] or [27]. In our case, the monomial basis is $1, T_1, T_2, T_1T_2$, and we define the matrix of the saturated model to be:

With the change of variables $\zeta_1 = \exp(\theta_1)$, $\zeta_2 = \exp(\theta_2)$, the model is shown to be a (piece of an) algebraic variety, that is, a set described by the rational parametric equations:

The unnormalized probabilities $[q_1 : q_2 : q_3 : q_4]$ belong to the projective space $\mathbb{P}^3$. Precisely, the (real) projective space $\mathbb{P}^3$ is the set of all non-zero points of $\mathbb{R}^4$ together with the equivalence relation $[q_1:q_2:q_3:q_4] = [\overline{q}_1:\overline{q}_2:\overline{q}_3:\overline{q}_4]$ if, and only if, $[q_1:q_2:q_3:q_4] = k[\overline{q}_1:\overline{q}_2:\overline{q}_3:\overline{q}_4]$, $k\ne 0$. The domain of unnormalized signed probabilities as projective points is the open subset $\mathbb{P}^3_*$ of $\mathbb{P}^3$ where $q_1+q_2+q_3+q_4 \ne 0$. On this set, we can compute the normalization:

where $^*\mathcal{E}$ is the affine space generated by the simplex $\Delta_3$. Notice that this embedding produces a number of natural geometrical structures on $^*\mathcal{E}$.

A positive density belongs to the model if and only if $\log p\in\mathrm{Span}(1, T_1, T_2)$, which, in turn, is equivalent to $\log p \perp T_3$. We can rewrite the orthogonality as:

The positive densities described by Equation (17) with $\theta_1, \theta_2\in\mathbb{R}$, or by Equation (18) with $\zeta_1, \zeta_2\in\mathbb{R}_{>}$, are equivalently described by the equations:

This is a polynomial invariant of the model in the simplex $\Delta_3$, which is represented in Figure 2 by the triangularization of a grid of points that satisfy the invariant.
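The binomial invariant of this toric model can be verified numerically. The sketch below (ours, not the paper's code) samples positive parameters $\zeta_1, \zeta_2$, builds the density on $\Omega = \{(0,0), (0,1), (1,0), (2,1)\}$, and checks the invariant $p_1^2 p_4 = p_2 p_3^2$, which appears below in the implicit description of the variety; both sides are homogeneous of degree 3, so normalization does not matter.

```python
import numpy as np

# Rows of T: the points of Ω; columns: the statistics T1, T2
T = np.array([[0, 0], [0, 1], [1, 0], [2, 1]], dtype=float)

rng = np.random.default_rng(1)
for _ in range(5):
    zeta = rng.uniform(0.1, 5.0, size=2)
    q = np.prod(zeta ** T, axis=1)      # unnormalized: ζ1^T1(x) · ζ2^T2(x)
    p = q / q.sum()                     # density in the exponential family
    assert np.isclose(p[0]**2 * p[3], p[1] * p[2]**2)   # p1² p4 = p2 p3²
```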

Given a basis of $\mathrm{Span}(1, T_1, T_2)^{\perp}$, we can embed the marginal polytope of Figure 1 into the associated full marginal polytope. By expressing probabilities as a function of the expectation parameters, Equation (21) identifies a relationship between $\eta_1, \eta_2$ and the expected values of the chosen basis for the orthogonal space. This corresponds to an equivalent invariant in the expectation parameters, which, in turn, identifies a surface in the full marginal polytope.

Let $\boldsymbol{\eta} = (\eta_1, \eta_2, \eta_3)$, with $\eta_3 = \mathrm{E}[T_3]$, which corresponds to the choice of $T_3$ as a basis for the space orthogonal to the span of the sufficient statistics of the model, together with the constant $\mathbf{1}$, as in Equation (12). We introduce the following matrix:

This gives the invariant in the parameters $\boldsymbol{\eta}$ associated with the model:

a cubic polynomial in $\eta_3$, which can be computed as a function of the other expectation parameters. Solving Equation (25) for $\eta_3$ allows one to express explicitly the value of $\eta_3$ given $(\eta_1, \eta_2)$ and to represent the surface associated with the invariant in the full marginal polytope. However, the cubic polynomial in Equation (25) in general admits three roots. The unique value of $\eta_3$ can be obtained from the roots of the cubic polynomial by imposing that $\eta_3$ must be real and belong to the full marginal polytope given by $\mathrm{Conv}\,\{(T_1(\boldsymbol{x}), T_2(\boldsymbol{x}), T_3(\boldsymbol{x}))\mid\boldsymbol{x}\in\Omega\}$.

Consider the discriminant $\Delta$ of Equation (25) with respect to the $\eta_3$ variable. For the roots $\eta_{3,i}$ of Equation (25), $\Delta < 0$ for $\eta_2 - 1 \ne 0$ and for:

Figure 3(a) shows in blue the region of points $(\eta_1, \eta_2)$ where $\Delta < 0$, in red where $\Delta > 0$, and the points where $\Delta = 0$ with a dashed line. For $\Delta < 0$, the only real root is $\eta_{3,1}$, which identifies the blue surface in the full marginal polytope in Figure 3(b). For $\Delta > 0$, it is easy to verify that only $\eta_{3,2}$ belongs to the interior of the full marginal polytope parametrized by $(\eta_1, \eta_2, \eta_3)$, since it satisfies the inequalities given by the facets of the marginal polytope; it is represented in Figure 3(b) by the red surface. Finally, the three real roots coincide for $\Delta = 0$, that is, for $\eta_2 = 1$, and where:

Consider, in the ring $\mathbb{R}[p_1, p_2, p_3, p_4]$, the model ideal of the toric variety in $\mathbb{R}^4$ defined by Equations (20) and (21). The Jacobian matrix is:

where $^*\mathcal{E} = \{p\in\mathbb{R}^4 \mid p_1+p_2+p_3+p_4 = 1\}$. This (double) critical line intersects the simplex along the edge $\delta_2\leftrightarrow\delta_4$. Outside $\mathrm{C}$, that is, in the open complement set, the equations of the toric variety are locally solvable in two among the $p_i$'s, under the condition that the corresponding minor is not zero. To have a picture of what this critical set looks like, let us intersect our surface with the plane $p_3 = 0$. On the affine space $p_1+p_2+p_4 = 1$, we have $p_1^2 p_4 = 0$, that is, the union of the double line $p_1^2 = 0$ with the line $p_4 = 0$.

The critical set is parameterized over the projective line $\mathbb{P}^1$ by the equations:

In particular, $\boldsymbol{p}(\tau, 0) = (0, \tau, 0, 1-\tau)$, which satisfies Equation (42) for $\tau\in\mathbb{R}$. If $0\le\tau\le 1$, then $\boldsymbol{p}(\tau, 0)$ belongs to the edge $\delta_2\leftrightarrow\delta_4$.

If $p_1, p_3$ are not both zero and $\alpha p_1 + \beta p_3 = 0$, then $\alpha p_1 = -\beta p_3 \ne 0$. It follows that $(\alpha p_1)^2 = (\beta p_3)^2 \ne 0$; hence:

The same condition applies to component 4. As $[\alpha : \beta] = \left[\frac{\alpha}{\beta-\alpha} : \frac{\beta}{\beta-\alpha}\right]$, we can always assume $\beta > 0$ and $\beta - \alpha = 1$, that is, $\alpha = \beta - 1$; hence, $\beta < 1$. The parameterization of the positive probabilities in the model becomes:

This gives $\boldsymbol{\eta}$ as a function of $(\alpha, t)$, i.e.:

Solving for $p_1$ and $p_3$, we obtain $\beta = p_1/(p_1 + p_3)$. As $p_2 + p_4 = 1 - t$, we have the chart:

which depends on $p_1$ and $p_3$ only.

This mapping sends the parameter domain onto the interior of the marginal polytope $M$; see Figure 3(b). The resulting equations hold for $(\eta_1, \eta_2)\in M^\circ$. We study the local solvability in $\zeta_1, \zeta_2$ of:

Eliminating $\eta_{12}$, from Equations (15) and (18), we have the system:

Eliminating $\eta_3$, from Equations (16) and (41), it is possible to derive the equation of the model variety in the $\eta_1, \eta_2, \eta_3$ parameters. From Equation (18), we obtain the invariant as a function of $\boldsymbol{\zeta}$, that is:

and similarly in $\boldsymbol{\eta}$. In the example, the sample space has four points; the monomials $1, T_1, T_2, T_1T_2$ are a vector basis of the linear space generated by the columns of the matrix $A$; in particular, $T_3$ is a linear combination:

#### 3.1. Border

The full model has sufficient statistics $T_1, T_2, T_1T_2$ and is parametrized by $(\eta_1, \eta_2, \eta_{12})$. By Equation (16), the marginal polytope is represented by the inequalities:

with respect to the basis $\{T_1, T_2, T_1T_2\}$. As the critical set is the edge $\delta_2\leftrightarrow\delta_4$ in the $\boldsymbol{p}$ space, it is the edge $(0, 1, 0)\leftrightarrow(2, 1, 2)$ in the $\boldsymbol{\eta}$ space.

The bounds for $\eta_1$ and $\eta_2$ are obtained from Equation (15).

#### 3.2. Fisher Information

Denote by $A_{|12}$ the block of the two central columns of $A$ in Equation (14) and by $\boldsymbol{p}$ the row vector of probabilities. Then, the variance matrix is:

On the vertices of the simplex, the matrix $\mathrm{diag}(\boldsymbol{p}) - \boldsymbol{p}^T\boldsymbol{p}$ is zero; hence, the covariance matrix is zero. In each of the cases where the probability is supported by a facet, say $\{1, 2\}$, the matrix $\mathrm{diag}(\boldsymbol{p}) - \boldsymbol{p}^T\boldsymbol{p}$ reduces to the corresponding block, and the covariance matrix is:

whose determinant is a polynomial in $p_1, p_2, p_3$. This polynomial is zero on each facet.

The border of the model in the $\boldsymbol{\eta}$ parameters can be given as a function of either $\boldsymbol{\theta}$ or $\boldsymbol{\zeta}$. We have:

Eliminating $\zeta_1, \zeta_2$, we obtain:

a system of polynomial equations in $\zeta_1, \zeta_2$, with coefficients that are polynomials in $\eta_1, \eta_2, \eta_3$, to be considered together with the implicit equation derived from $p_1^2 p_4 - p_2 p_3^2 = 0$. The system is:

#### 3.3. Extension of the Model

It is of interest to extend the gradient field in the $\boldsymbol{\eta}$ parameters, in particular when the critical points belong to the boundary of the model. Indeed, by extending the gradient field outside the marginal polytope, we can identify open neighborhoods of critical points on the boundary of the polytope, which allow one to study the convergence of the differential equations associated with the gradient flows, for instance, by means of Lyapunov stability.

Let $\boldsymbol{a}$ be the coordinates of a point over the edge $\delta_2\leftrightarrow\delta_4$ of the full marginal polytope parametrized by $(\eta_1, \eta_2, \eta_3)$, and let $\boldsymbol{b}$ be the coordinates of the corresponding point over $\delta_1\leftrightarrow\delta_3$, obtained by intersecting the line of the ruled surface through $\boldsymbol{a}$ with the edge $\delta_1\leftrightarrow\delta_3$. The values of the $\eta_2$ coordinate for $\boldsymbol{a}$ and $\boldsymbol{b}$ are one and zero, respectively. The other coordinates of $\boldsymbol{b}$ depend on those of $\boldsymbol{a}$ through $\alpha$. First, we obtain the values of the $\eta_3$ coordinates as a function of the $\eta_1$ coordinate. For $\boldsymbol{a}$, we find the equation of the line to which $\delta_2\leftrightarrow\delta_4$ belongs, given by:

from which $\eta_3 = 1 - \eta_1$. Similarly, for the $\eta_3$ coordinate of $\boldsymbol{b}$, we consider the line through $\delta_1\leftrightarrow\delta_3$, that is:

from which $\eta_3 = 4\eta_1 - 2$. Finally, for the $\eta_1$ coordinate, we use Equations (44). In $\boldsymbol{a}$, since $t = 0$ and $p_1 = p_3 = 0$, then $p_2 = \frac{\beta^2}{\alpha^2+\beta^2}$ and $p_4 = \frac{\alpha^2}{\alpha^2+\beta^2}$. From Equation (24), it follows that:

For $\boldsymbol{b}$, we have $p_2 = p_4 = 0$ and $t = 1$, so that $p_1 = \alpha + 1$ and $p_3 = -\alpha$. From Equation (24), it follows that:

The coordinates of $\boldsymbol{a}$ and $\boldsymbol{b}$ both depend on $\alpha$ as follows:

This gives the ruled surface joining $\boldsymbol{a}$ and $\boldsymbol{b}$, described by the following implicit representation, for $-1 < \alpha < 1$ and $0 < t < 1$:

Consider the lines joining $\boldsymbol{a}$ and $\boldsymbol{b}$ for different values of $\alpha$. For $\alpha\to\pm\infty$, the $\eta_1$ coordinate of $\boldsymbol{b}$ tends to $\mp\infty$, while the $\eta_1$ coordinate of $\boldsymbol{a}$ tends to one. For $\alpha\to\pm\infty$, the ruled surface admits the same limit, given by the line parallel to $\delta_1\leftrightarrow\delta_3$ passing through $(1, 1, 0)$. The surface intersects the interior of the marginal polytope for $t\in(0, 1)$ and $\alpha\in(-1, 0)$. Moreover, the surface intersects the critical line twice, for $t = 0$, $\alpha\in[-1, 0]$ and for $t = 0$, $\alpha\notin[-1, 0]$.

The construction can be represented both in the full marginal polytope parametrized by $(\eta_1, \eta_2, \eta_3)$ and in the marginal polytope parametrized by $(\eta_1, \eta_2)$.

#### 3.4. Optimization and Natural Gradient Flows

The monomial basis of the example is $\{1, x_1, x_2, x_1x_2\}$, so that any $f:\Omega\to\mathbb{R}$ can be written as:

Here, $T_3 = 4x_1 + 3x_2 - 5x_1x_2 - 2$ and $\eta_3 = \mathrm{E}[T_3]$, so that:

In order to plot the relaxed function $F_{\boldsymbol{\eta}}(\boldsymbol{\eta})$ over the marginal polytope parameterized by $(\eta_1, \eta_2)$, we need to express $\eta_3$ as a function of $\eta_1$ and $\eta_2$. In order to do that, we parametrize the exponential family as a ruled surface by means of the $(\alpha, t)$ parameters. Moreover, this parametrization has a natural extension outside the marginal polytope, which allows one to study the stability of the critical points on the boundary of the marginal polytope. We start by evaluating the gradient field of $F_{\alpha,t}(\alpha, t)$ in the $(\alpha, t)$ parametrization; then, we map it to the marginal polytope in the $\boldsymbol{\eta}$ parameterization.

By expressing $(\eta_1, \eta_2)$ as a function of $(\alpha, t)$, we obtain:

The gradient field can be mapped to the $(\eta_1, \eta_2)$ parameterization. Indeed, by the chain rule and the inverse function theorem, we have:

where the Jacobian of the transformation from $(\alpha, t)$ to $(\eta_1, \eta_2)$ is:

Notice that the inverse of the Jacobian, $J^{-1}$, is not defined for the values of $t$ which satisfy Equation (69).

#### 3.5. Examples with Global and Local Optima

In the first example, we set $c_0 = 0$, $c_1 = 1$, $c_2 = 2$, $c_3 = 3$, so that:

The vector fields and some gradient flows are plotted in the $(\eta_1, \eta_2)$ parameterization. Notice that, as expected, differently from the vanilla gradient, the natural gradient flows converge to the unique global optimum, which corresponds to the vertex where all of the probability is concentrated on $\{1\}$. In the $(\alpha, t)$ parameterization, the flows have been extended outside the statistical model by prolonging the lines of the ruled surface, and, as we can see, they remain compatible with the flows on the interior of the model, in the sense that the nature of a critical point is the same for trajectories with initial conditions in the interior and in the exterior of the model. In other words, the global optimum is an attractor from both the interior and the exterior of the model, and similarly for the other critical points on the vertices, both for saddle points and for unstable points, where the natural gradient vanishes.

In the second example, we set $c_0 = 0$, $c_1 = 1$, $c_2 = 2$, $c_3 = -5/2$, and we have:

In this case, $f_2$ admits a minimum on $\{4\}$. In Figures 12 and 13, we plot the vector fields associated with the vanilla and natural gradients, together with some gradient flows for different initial conditions, in the $(\alpha, t)$ and $(\eta_1, \eta_2)$ parameterizations, respectively. As in the previous example, the natural gradient flows converge to the vertices of the model; however, in this case, we have one local optimum in $\{1\}$ and one global optimum in $\{4\}$, together with a saddle point in the interior of the model. Similarly to the previous example, in the $(\alpha, t)$ parameterization, the flows have been extended outside the statistical model, and the nature of the critical points is the same for trajectories with initial conditions in the statistical model and in its extension.

## 4. Pseudo-Boolean Functions

Pseudo-Boolean functions are defined on $\boldsymbol{x} = (x_1,\dots,x_n)\in\{+1, -1\}^n = \Omega$. Any function $f:\Omega\to\mathbb{R}$ can be written, with multi-index notation, as $f(\boldsymbol{x}) = \sum_{\alpha\in L} a_\alpha\boldsymbol{x}^\alpha$, with $L = \{0, 1\}^n$, $\boldsymbol{x}^\alpha = \prod_{i=1}^n x_i^{\alpha_i}$ and $0^0 = 1$. If $M\subset L^* = L\setminus\{0\}$, the model $\mathcal{E}$ is defined by $p\in\mathcal{E}$ if:

**Proposition 6** (Implicitization of the exponential family). Given a function $p:\Omega\to\mathbb{R}$, $p\in\mathcal{E}$ if, and only if, the following conditions all hold:

- $p(\boldsymbol{x}) > 0$, $\boldsymbol{x}\in\Omega$;
- $\sum_{\boldsymbol{x}\in\Omega} p(\boldsymbol{x}) = 1$;
- $\prod_{\boldsymbol{x}:\,\boldsymbol{x}^\beta = 1} p(\boldsymbol{x}) = \prod_{\boldsymbol{x}:\,\boldsymbol{x}^\beta = -1} p(\boldsymbol{x})$ for all $\beta\in L^*\setminus M$.
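The binomial condition of Item 3 can be checked on the smallest non-trivial case. The sketch below (ours, not the paper's code) takes $n = 2$ with values $\pm 1$ and the no-interaction model $M = \{(1,0), (0,1)\}$, so that the only $\beta\in L^*\setminus M$ is $\beta = (1,1)$ and the condition reads $p(+,+)\,p(-,-) = p(+,-)\,p(-,+)$.

```python
import numpy as np
from itertools import product

omega = list(product([1, -1], repeat=2))

def density(theta):
    # p(x) ∝ exp(θ1 x1 + θ2 x2): the independence (no-interaction) model
    w = np.array([np.exp(theta[0]*x1 + theta[1]*x2) for x1, x2 in omega])
    return w / w.sum()

p = density((0.4, -0.7))
px = dict(zip(omega, p))
lhs = np.prod([px[x] for x in omega if x[0]*x[1] == 1])    # x^β = +1
rhs = np.prod([px[x] for x in omega if x[0]*x[1] == -1])   # x^β = −1

assert np.isclose(lhs, rhs)                # Item 3 of Proposition 6
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)   # Items 1 and 2
```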

**Proof.** (⇒) If $p\in\mathcal{E}$, then $p(\boldsymbol{x}) > 0$, $\boldsymbol{x}\in\Omega$ (Item 1), and $\sum_{\boldsymbol{x}\in\Omega} p(\boldsymbol{x}) = 1$ (Item 2). Moreover, $\log p(\boldsymbol{x}) = \sum_{\alpha\in M}\theta_\alpha\boldsymbol{x}^\alpha - \psi(\boldsymbol{\theta})$. The function $\log p$ is orthogonal to each $\boldsymbol{X}^\beta$, $\beta\in L^*\setminus M$. Hence:

(⇐) Conversely, Items 1 and 3 imply that $\log p$ is orthogonal to each $\boldsymbol{X}^\beta$; hence, there exists $\boldsymbol{\theta}$ such that $\log p = \sum_{\alpha\in M}\theta_\alpha\boldsymbol{X}^\alpha + C$. Now, Item 2 implies $C = -\psi(\boldsymbol{\theta})$. □

Consider the marginal polytope $M = \mathrm{Conv}\,\{\boldsymbol{T}(\boldsymbol{x})\mid\boldsymbol{x}\in\Omega\}$. Given a binary model $M$, the set of polynomials:

**Proposition 7.**

- The exponential variety of $M$ is the Zariski closure of the exponential model $\mathcal{E}$.
- The closure $\overline{\mathcal{E}}$ of $\mathcal{E}$ in $\mathcal{P}_{\ge}$ is characterized by $p(\boldsymbol{x})\ge 0$, $\boldsymbol{x}\in\Omega$, together with Items 2 and 3 of Proposition 6.
- The algebraic variety of the ring $\mathbb{R}[p(\boldsymbol{x}):\boldsymbol{x}\in\Omega]$, which is generated by the polynomials $\sum_{\boldsymbol{x}\in\Omega} p(\boldsymbol{x}) - 1$ and $\prod_{\boldsymbol{x}:\,\boldsymbol{x}^\beta=1} p(\boldsymbol{x}) - \prod_{\boldsymbol{x}:\,\boldsymbol{x}^\beta=-1} p(\boldsymbol{x})$, $\beta\in L^*\setminus M$, is an extension $\mathcal{E}^1$ of $\mathcal{E}$ to ${\mathcal{P}}^{1}$.
- Define the moments $\eta_\alpha = \sum_{\boldsymbol{x}\in\Omega}\boldsymbol{x}^\alpha p(\boldsymbol{x})$, $\alpha\in L$, i.e., the discrete Fourier transform of $p$, with inverse $p(\boldsymbol{x}) = 2^{-n}\sum_{\alpha\in L}\boldsymbol{x}^\alpha\eta_\alpha$. There exists an algebraic extension of the moment function $\mathcal{E}\ni p\leftrightarrow\boldsymbol{\eta}(p)\in M^\circ$ to a mapping defined on $\mathcal{E}^1$.

**Proof.**

1. According to the implicitization Proposition 6, the exponential family is characterized by the positivity condition together with the algebraic binomial conditions.
2. This follows from the implicit form, and it is proven, for example, in [20].
3. By definition.
4. As the mapping from the probabilities to the moments is affine and one-to-one, such a transformation extends to a one-to-one mapping from the extended model to the affine space of the marginal polytope. □
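The moment mapping of Item 4 is the discrete Fourier transform on the group $\{+1, -1\}^n$; the sketch below (ours, not the paper's code) checks the transform and its stated inverse for $n = 3$.

```python
import numpy as np
from itertools import product

n = 3
omega = list(product([1, -1], repeat=n))   # the 2^n points of Ω
L = list(product([0, 1], repeat=n))        # the 2^n multi-indices α

# Character matrix X[α, x] = x^α = Π x_i^{α_i}
X = np.array([[np.prod(np.array(x, float) ** np.array(a)) for x in omega]
              for a in L])

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(2**n))           # a random probability on Ω

eta = X @ p                                # moments η_α = Σ_x x^α p(x)
p_back = (X.T @ eta) / 2**n                # inverse p(x) = 2^{-n} Σ_α x^α η_α

assert np.allclose(p_back, p)              # the transform is invertible
assert np.isclose(eta[0], 1.0)             # α = 0 gives the total mass
```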

On Ω = {0, 1}^{3}, the full model in the statistics 0 ↦ 1, 1 ↦ −1, that is, t = (−1)^{x} = 1 − 2x, is described by the matrix:

The exponential family without the statistic T_{1}T_{2}T_{3} is the same model as the toric model without the three-way interaction, which is based on the matrix:
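The structure of these matrices can be sketched as follows: rows are indexed by the points of {0, 1}^3 coded as t = 1 − 2x, columns by the monomials t^α; dropping the column of T_{1}T_{2}T_{3} gives the toric model without the three-way interaction. A minimal sketch (the helper `model_matrix` is ours):

```python
import itertools
import numpy as np

def model_matrix(n, alphas):
    """Rows: points x in {0,1}^n coded as t = 1 - 2x in {+1,-1}^n.
    Columns: the monomials t^alpha for the given exponents alphas."""
    points = [tuple(1 - 2 * x for x in bits)
              for bits in itertools.product([0, 1], repeat=n)]
    return np.array([[int(np.prod([t[j] ** a[j] for j in range(n)]))
                      for a in alphas] for t in points])

full = list(itertools.product([0, 1], repeat=3))   # all 8 monomials
no_3way = [a for a in full if sum(a) < 3]          # drop t1*t2*t3

A_full = model_matrix(3, full)       # 8x8, a row-permuted Hadamard matrix
A_toric = model_matrix(3, no_3way)   # 8x7: toric model w/o 3-way interaction
```

The columns of the full matrix are the characters of {+1, −1}^3, so they are mutually orthogonal with squared norm 2^3 = 8.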

∆_{7} if (η_{1}, η_{2}, η_{3}, η_{4}, η_{5}, η_{6}) is in the interior of the marginal polytope. As in the example of the previous section, it is possible to intersect the polynomial invariant in Equation (83) with one or more sheaves of hyperplanes around some faces of the simplex, in order to lower the degree of the invariant and thus decompose the model as the convex hull of probabilities on the boundary of the model. We do not describe the details here, and we postpone the discussion of this example to a paper which is in preparation.
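A toric model of this kind can be checked against its binomial invariant numerically. In the following sketch (the setup and function names are ours), the sufficient statistics are the monomials of degree at most two on {+1, −1}^3, the excluded monomial is β = (1, 1, 1), and the binomial is ∏_{x: x^β = 1} p(x) − ∏_{x: x^β = −1} p(x):

```python
import itertools
import math

def p_theta(theta, n=3):
    """Positive toric model on {+1,-1}^3 with sufficient statistics the
    monomials of degree 1 and 2 (the three-way interaction is excluded)."""
    M = [a for a in itertools.product([0, 1], repeat=n) if 0 < sum(a) <= 2]
    points = list(itertools.product([1, -1], repeat=n))
    w = [math.exp(sum(th * math.prod(x[j] ** a[j] for j in range(n))
                      for th, a in zip(theta, M))) for x in points]
    Z = sum(w)  # normalizing constant exp(psi(theta))
    return {x: wi / Z for x, wi in zip(points, w)}

def binomial_invariant(p, beta):
    """prod_{x: x^beta = 1} p(x) - prod_{x: x^beta = -1} p(x)."""
    pos = math.prod(px for x, px in p.items()
                    if math.prod(x[j] ** beta[j] for j in range(len(x))) == 1)
    neg = math.prod(px for x, px in p.items()
                    if math.prod(x[j] ** beta[j] for j in range(len(x))) == -1)
    return pos - neg
```

The invariant vanishes for every **θ** because Σ_x x^β log p(x) = 0 whenever the character x^β is orthogonal to all of the sufficient statistics and to the constant.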

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Pistone, G. Nonparametric information geometry. In Geometric Science of Information, Proceedings of the First International Conference (GSI 2013), Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Springer: Heidelberg, Germany, 2013; Volume 8085, pp. 5–36.
- Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
- Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA '11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
- Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by estimation of empirical covariances. Proceedings of the 2011 IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
- Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. Proceedings of the 2013 IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
- Wierstra, D.; Schaul, T.; Peters, J.; Schmidhuber, J. Natural evolution strategies. Proceedings of the 2008 IEEE Congress on Evolutionary Computation, Hong Kong, China, 1–6 June 2008; pp. 3381–3387.
- Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles **2011**. arXiv:1106.3708.
- Malagò, L.; Pistone, G. Combinatorial Optimization with Information Geometry: Newton method. Entropy **2014**, 16, 4260–4289.
- Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; translated from the 1993 Japanese original by Daishi Harada.
- Bourbaki, N. Variétés différentielles et analytiques. Fascicule de résultats / Paragraphes 1 à 7; Number XXXIII in Éléments de mathématiques; Hermann: Paris, France, 1971.
- Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. **1995**, 23, 1543–1561.
- Malagò, L.; Pistone, G. Gradient Flow of the Stochastic Relaxation on a Generic Exponential Family. Proceedings of the Conference on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Clos Lucé, Amboise, France, 21–26 September 2014; Mohammad-Djafari, A., Barbaresco, F., Eds.; pp. 353–360.
- Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes, Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
- Rockafellar, R.T. Convex Analysis; Princeton Mathematical Series No. 28; Princeton University Press: Princeton, NJ, USA, 1970.
- Do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; translated from the second Portuguese edition by Francis Flaherty.
- Amari, S.I. Natural gradient works efficiently in learning. Neur. Comput. **1998**, 10, 251–276.
- Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007.
- Rinaldo, A.; Fienberg, S.E.; Zhou, Y. On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. **2009**, 3, 446–484.
- Rauh, J.; Kahle, T.; Ay, N. Support Sets in Exponential Families and Oriented Matroid Theory. Int. J. Approx. Reas. **2011**, 52, 613–626.
- Malagò, L.; Pistone, G. A note on the border of an exponential family **2010**. arXiv:1012.0637v1.
- Pistone, G.; Rogantin, M. The gradient flow of the polarization measure. With an appendix **2015**. arXiv:1502.06718.
- Diaconis, P.; Sturmfels, B. Algebraic algorithms for sampling from conditional distributions. Ann. Stat. **1998**, 26, 363–397.
- Pistone, G.; Wynn, H.P. Generalised confounding with Gröbner bases. Biometrika **1996**, 83, 653–666.
- Pistone, G.; Riccomagno, E.; Wynn, H.P. Algebraic Statistics: Computational Commutative Algebra in Statistics; Volume 89, Monographs on Statistics and Applied Probability; Chapman & Hall/CRC: Boca Raton, FL, USA, 2001.
- Drton, M.; Sturmfels, B.; Sullivant, S. Lectures on Algebraic Statistics; Volume 39, Oberwolfach Seminars; Birkhäuser Verlag: Basel, Germany, 2009.
- Pachter, L.; Sturmfels, B. (Eds.) Algebraic Statistics for Computational Biology; Cambridge University Press: Cambridge, UK, 2005.
- Gibilisco, P.; Riccomagno, E.; Rogantin, M.P.; Wynn, H.P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
- 4ti2 team. 4ti2—A software package for algebraic, geometric and combinatorial problems on linear spaces. Available online: http://www.4ti2.de (accessed on 2 June 2015).
- Michałek, M.; Sturmfels, B.; Uhler, C.; Zwiernik, P. Exponential Varieties **2014**. arXiv:1412.6185.
- Sturmfels, B. Gröbner Bases and Convex Polytopes; American Mathematical Society: Providence, RI, USA, 1996.
- Geiger, D.; Meek, C.; Sturmfels, B. On the toric algebra of graphical models. Ann. Stat. **2006**, 34, 1463–1492.
- Rapallo, F. Toric statistical models: Parametric and binomial representations. Ann. Inst. Stat. Math. **2007**, 59, 727–740.
- Beltrametti, M.; Carletti, E.; Gallarati, D.; Monti Bragadin, G. Lectures on Curves, Surfaces and Projective Varieties: A Classical View of Algebraic Geometry; EMS Textbooks in Mathematics; European Mathematical Society: Zürich, Switzerland, 2009.
- Rinaldo, A.; Fienberg, S.E.; Zhou, Y. On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. **2009**, 3, 446–484.
- Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2009; Chapter 21; pp. 339–363.
- Abbott, J.; Bigatti, A.; Lagorio, G. CoCoA-5: A system for doing Computations in Commutative Algebra. Available online: http://cocoa.dima.unige.it (accessed on 2 June 2015).

**Figure 1.** Marginal polytope of the exponential family in Equations (12) and (13). The coordinates of the vertices are given by (T_{1}, T_{2}).

**Figure 2.** Representation of the exponential family in Equations (12) and (13) as a surface that intersects the probability simplex ∆_{3}. The surface is obtained by the triangularization of a grid of points that satisfy the invariant in Equation (21).

**Figure 3.** Marginal polytope of the exponential family in Equations (12) and (13) (**a**). The dashed lines correspond to the points where ∆ = 0, where ∆ is the discriminant in Equation (31); over the red regions ∆ > 0 and over the blue regions ∆ < 0. Representation of the exponential family as a surface in the full marginal polytope parametrized by (η_{1}, η_{2}, η_{3}) (**b**). The blue surface is given by the unique real root η_{3,1} in Equation (32); the red surface corresponds to the unique real root η_{3,2}, which belongs to the full marginal polytope; over the dashed lines, which have been computed by solving Equation (40) numerically, Equation (26) admits a real root with multiplicity equal to three.

**Figure 4.** Representation of the exponential family in Equations (12) and (13) as a ruled surface in the probability simplex (**a**) and in the parameter space (α, t) (**b**). The dashed line corresponds to the critical edge δ_{2} ↔ δ_{4} and the blue line to the case $\alpha =-\frac{1}{2}$.

**Figure 5.** Representation of the exponential family in Equations (12) and (13) as a ruled surface in the marginal polytope (η_{1}, η_{2}) (**a**) and in the full marginal polytope parametrized by (η_{1}, η_{2}, η_{3}) (**b**). The dashed line corresponds to the critical line δ_{2} ↔ δ_{4} and the blue line to the case $\alpha =-\frac{1}{2}$.

**Figure 6.** The segments that form the ruled surface in Figure 4 have been extended, for −0.5 < t < 1.5. New lines described by Equations (60) have been represented for 0 < α < exp(0.7) (shading from red to black for increasing values of α) and for −exp(0.7) − 1 < α < −1 (shading from red to white for decreasing values of α). The simplex in (**b**) has been rotated with respect to Figure 4(a) to better visualize the intersection of the lines with the critical edge δ_{2} ↔ δ_{4}.

**Figure 7.** Extension of the ruled surface associated with the exponential family in Equations (12) and (13) as in Figure 6(b), for −exp(3.5) − 1 < α < exp(3.5) and −0.5 < t < 1.5; for α → ±∞, the lines of the extended surface admit the same limit.

**Figure 8.** The segments that form the ruled surface in Figure 5 have been extended, for −0.5 < t < 1.5. New lines described by Equations (60) have been represented for 0 < α < exp(1) (shading from blue to black for increasing values of α) and for −exp(1) − 1 < α < −1 (shading from blue to white for decreasing values of α). The full marginal polytope in (**b**) has been rotated with respect to Figure 5(b) to better visualize the intersection of the lines with the critical edge δ_{2} ↔ δ_{4}.

**Figure 9.** Extension of the ruled surface associated with the exponential family in Equations (12) and (13) as in Figure 8(b), for −exp(3) − 1 < α < exp(3) and −0.5 < t < 1.5; notice that for α → ±∞, the lines of the extended surface admit the same limit.

**Figure 10.** Vanilla gradient field and flows in blue (**a**) and natural gradient field and flows in red (**b**), together with the level lines associated with F_{α,t}(α, t) in the (α, t) parameterization, for c_{0} = 0, c_{1} = 1, c_{2} = 2 and c_{3} = 3; the dashed blue lines in (**b**) represent the points where $\tilde{\nabla}{F}_{\alpha ,t}\left(\alpha ,t\right)$ is not defined; see Equation (68).

**Figure 11.** Vanilla gradient field in blue (**a**) and natural gradient field and flows in red (**b**), together with the level lines associated with F_{η}(α, t) over the marginal polytope, for c_{0} = 0, c_{1} = 1, c_{2} = 2 and c_{3} = 3.

**Figure 12.** Vanilla gradient field and flows in blue (**a**) and natural gradient field and flows in red (**b**) as in Figure 10, for c_{0} = 0, c_{1} = 1, c_{2} = 2 and ${c}_{3}=-\frac{5}{2}$.

**Figure 13.** Vanilla gradient field in blue (**a**) and natural gradient field and flows in red (**b**) as in Figure 11, for c_{0} = 0, c_{1} = 1, c_{2} = 2 and ${c}_{3}=-\frac{5}{2}$.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Malagò, L.; Pistone, G. Natural Gradient Flow in the Mixture Geometry of a Discrete Exponential Family. *Entropy* **2015**, *17*, 4215-4254.
https://doi.org/10.3390/e17064215
