Article

Natural Gradient Flow in the Mixture Geometry of a Discrete Exponential Family †

1  Department of Electrical and Electronic Engineering, Shinshu University, Nagano, Japan
2  Inria Saclay, Île-de-France, Orsay Cedex, France
3  De Castro Statistics, Collegio Carlo Alberto, Moncalieri, Italy
*  Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the Proceedings of MaxEnt 2014 Conference on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Amboise, France, 21–26 September 2014.
Entropy 2015, 17(6), 4215-4254; https://doi.org/10.3390/e17064215
Received: 31 January 2015 / Revised: 21 May 2015 / Accepted: 2 June 2015 / Published: 18 June 2015
(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)

Abstract:
In this paper, we study Amari’s natural gradient flows of real functions defined on the densities belonging to an exponential family on a finite sample space. Our main example is the minimization of the expected value of a real function defined on the sample space. In such a case, the natural gradient flow converges to densities with reduced support that belong to the border of the exponential family. We have suggested in previous works to use the natural gradient evaluated in the mixture geometry. Here, we show that in some cases, the differential equation can be extended to a bigger domain in such a way that the densities at the border of the exponential family are actually internal points in the extended problem. The extension is based on the algebraic concept of an exponential variety. We study in full detail a toy example and obtain positive partial results in the important case of a binary sample space.

1. Introduction

For the purpose of obtaining a clear presentation of our approach to the geometry of statistical models, we start with a recap of the nonparametric statistical manifold; see, e.g., the review paper [1]. However, we will shortly move to the actual setup of the present paper, i.e., the finite state space case.
Let (Ω, A, µ) be a measured space of sample points x ∈ Ω. We denote by P ⊂ L¹(µ) the simplex of (probability) densities and by P_> ⊂ P the convex set of strictly positive densities. If Ω is finite, then P_> is the topological interior of P. We denote by P_1 the affine space generated by P.
The set P_> carries the exponential geometry, which is an affine geometry whose geodesics are curves of the form t ↦ p_t ∝ p_0^{1−t} p_1^t. The set P_1 carries the mixture geometry, whose geodesics are of the form t ↦ p_t = (1 − t)p_0 + t p_1. A proper definition of the exponential and mixture geometries, where probability densities are considered points, requires the definition of the proper tangent space to hold the vectors representing the velocity of a curve. In both cases, the tangent space T_p at a point p is a space of random variables V with zero expected value, E_p[V] = 0. On the tangent space T_p, a natural scalar product is defined, ⟨U, V⟩_p = E_p[UV], so that a pseudo-Riemannian structure is available. Note that the Riemannian structure is a third geometry, different from both the exponential and the mixture geometries. Note also that both the expected value and the covariance can be naturally extended to be defined on P_1.
For each lower bounded objective function f : Ω → ℝ and each statistical model M ⊂ P_>, the (stochastic) relaxation of f to M is the function F(p) = E_p[f], p ∈ M; cf. [2]. The minimization of the stochastic relaxation as a tool to minimize the objective function has been studied by many authors [3–7].
If we have a parameterization ξ ↦ p_ξ of M, the parametric expression of the relaxed function is F̂(ξ) = E_{p_ξ}[f]. Under integrability and differentiability conditions on both ξ ↦ p_ξ and x ↦ f(x), F̂ is differentiable, with ∂_j F̂(ξ) = E_{p_ξ}[∂_j log(p_ξ) f] and E_{p_ξ}[∂_j log(p_ξ)] = 0; see [1,8]. In order to properly describe the gradient flow of a relaxed random variable, these classical computations are better cast into the formal language of information geometry (see [9]) and, even better, into the language of non-parametric differential geometry [10] that was used in [11]. The previous computations suggest taking the Fisher score ∂_j log(p_ξ) as the definition of the tangent vector of the j-th coordinate curve. While the development of this analogy in the finite state space case does not require a special setup, in the non-finite state space case, some care has to be taken.
In this paper, we follow the non-parametric setup discussed in [1] and, in particular, the notion of an exponential family ℇ and the identification of the tangent space at each p ∈ ℇ with a space of p-centered random variables.
The paper is organized as follows. We discuss in Section 2 the generalities of the finite state space case; in particular, we carefully define the various notions of the Fisher information matrix and natural gradient that arise from a given parameterization. In Section 3, we discuss a toy example in order to introduce the construction of an algebraic variety extending the exponential family from positive probabilities P > to signed probabilities P 1; this construction is applied to the natural gradient flow in the expectation parameters; moreover, it is shown that this model has a variety that is ruled. The last Section 4 is devoted to the treatment of the special important case when the sample space is binary.
The present paper is a development of the paper [12], which was presented as a poster at the MaxEnt 2014 conference. While the topic is the same, the actual overlap between the two papers is minimal and concerns mainly the generalities, which are repeated for the convenience of the reader.

2. Gradient Flow of Relaxed Optimization

Let Ω be a finite set of points x = (x1, …, xn) and µ the counting measure of Ω. In this case, a density p ∈ P is a probability function, i.e., p : Ω → ℝ_{≥0}, such that Σ_{x∈Ω} p(x) = 1.
Let B = {T1, …, Td} be a set of random variables such that, if Σ_{j=1}^d c_j T_j is constant, then c1 = ⋯ = cd = 0; for instance, consider B such that Σ_{x∈Ω} T_j(x) = 0, j = 1, …, d, and B is a linear basis. We say that B is a set of affinely independent random variables. If B is a linear basis, it is affinely independent if and only if {1, T1, …, Td} is a linear basis.
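In practice, affine independence can be checked as a rank condition on the matrix with columns 1, T1, …, Td. A minimal numerical sketch (not from the paper; the statistics of the Section 3 example are used as an illustrative choice):

```python
import numpy as np

# Affine independence of B = {T1, ..., Td} is equivalent to linear
# independence of {1, T1, ..., Td}, i.e. full column rank of the
# design matrix below. T is an illustrative choice.
T = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [2, 1]], dtype=float)
design = np.hstack([np.ones((4, 1)), T])   # columns 1, T1, T2
assert np.linalg.matrix_rank(design) == 3  # affinely independent
```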
We consider the statistical model ℇ whose elements are uniquely identified by the natural parameters θ in the exponential family with sufficient statistics B, namely:
p_θ : log p_θ(x) = Σ_{i=1}^d θ_i T_i(x) − ψ(θ),
see [13].
The proper convex function ψ : ℝ^d → ℝ,
ψ(θ) = log Σ_{x∈Ω} e^{θ·T(x)} = θ·E_{p_θ}[T] − E_{p_θ}[log(p_θ)],
is the cumulant generating function of the sufficient statistics T, in particular,
∇ψ(θ) = E_θ[T],    Hess ψ(θ) = Cov_θ(T, T).
Moreover, the entropy of pθ is:
H(p_θ) = −E_{p_θ}[log(p_θ)] = ψ(θ) − θ·∇ψ(θ).
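The identities ∇ψ(θ) = E_θ[T] and Hess ψ(θ) = Cov_θ(T, T) are easy to verify numerically on a small sample space. A sketch (statistics and parameter values are illustrative choices, borrowed from the design of the Section 3 example):

```python
import numpy as np

# Numerical check: grad psi = E_theta[T] via central finite differences.
T = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 1.0]])

def psi(theta):
    # cumulant generating function psi(theta) = log sum_x exp(theta . T(x))
    return np.log(np.sum(np.exp(T @ theta)))

theta = np.array([0.3, -0.5])
p = np.exp(T @ theta - psi(theta))               # p_theta(x)
eta = p @ T                                      # E_theta[T]
cov = T.T @ np.diag(p) @ T - np.outer(eta, eta)  # Cov_theta(T, T)

eps = 1e-6
for j in range(2):
    e = np.zeros(2); e[j] = eps
    assert abs((psi(theta + e) - psi(theta - e)) / (2 * eps) - eta[j]) < 1e-6
```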
The mapping ∇ψ is one-to-one onto the interior M° of the marginal polytope, that is, the convex span of the values of the sufficient statistics, M = Conv {T(x) | x ∈ Ω}. Note that no extra condition is required, because on a finite state space all random variables are bounded. Nonetheless, even in this case, the proof is not trivial; see [13].
Convex conjugation applies [14] (Section 25) with the definition:
ψ*(η) = sup { θ·η − ψ(θ) | θ ∈ ℝ^d },    η ∈ ℝ^d.
The concave function θ ↦ η·θ − ψ(θ) has gradient mapping θ ↦ η − ∇ψ(θ), and the equation η = ∇ψ(θ) has a solution if and only if η belongs to the interior M° of the marginal polytope. The restriction ϕ = ψ*|_{M°} is the Legendre conjugate of ψ, and it is computed by:
ϕ : M° ∋ η ↦ (∇ψ)⁻¹(η)·η − ψ((∇ψ)⁻¹(η)).
The Legendre conjugate ϕ is such that ∇ϕ = (∇ψ)⁻¹, and it provides an alternative parameterization of ℇ with the so-called expectation or mixture parameter η = ∇ψ(θ),
p_η = exp((T − η)·∇ϕ(η) + ϕ(η)).     (1)
While in the θ parameters the entropy is H(p_θ) = ψ(θ) − θ·∇ψ(θ), in the η parameters the ϕ function gives the negative entropy: H(p_η) = −E_{p_η}[log p_η] = −ϕ(η).
Proposition 1.
  • Hess ϕ (η) = (Hess ψ(θ))−1 when η = ∇ψ(θ).
  • The Fisher information matrix of the statistical model given by the exponential family in the θ parameters is Ie(θ) = Cov (∇ log pθ, ∇ log pθ) = Hess ψ(θ).
  • The Fisher information matrix of the statistical model given by the exponential family in the η parameters is I_m(η) = Cov_{p_η}(∇ log p_η, ∇ log p_η) = Hess ϕ(η).
Proof. Differentiation of the equality ∇ϕ = (∇ψ)⁻¹ gives the first item. The second item is a property of the cumulant generating function ψ. The third item follows from Equation (1). □

2.1. Statistical Manifold

The exponential family ℇ is an elementary manifold in either the θ or the η parameterization, named respectively the exponential or the mixture parameterization. We discuss now the proper definition of the tangent bundle Tℇ.
Definition 1 (Velocity). If I ∋ t ↦ p_t, with I an open interval, is a differentiable curve in ℇ, then its velocity vector is identified with its Fisher score:
D/dt p(t) = d/dt log(p_t).
The capital D notation is taken from differential geometry; see the classical monograph [15].
Definition 2 (Tangent space). In the expression of the curve by the exponential parameters, the velocity is:
D/dt p(t) = d/dt log(p_t) = d/dt (θ(t)·T − ψ(θ(t))) = θ̇(t)·(T − E_{θ(t)}[T]),
that is, it equals the statistic whose coordinates are θ̇(t) in the basis of the sufficient statistics centered at p_t. As a consequence, we identify the tangent space at each p ∈ ℇ with the vector space of centered sufficient statistics, that is:
T_p ℇ = Span (T_j − E_p[T_j] | j = 1, …, d).
In the mixture parameterization of Equation (1), the computation of the velocity is:
D/dt p(t) = d/dt log(p_t) = d/dt (∇ϕ(η(t))·(T − η(t)) + ϕ(η(t))) = (Hess ϕ(η(t)) η̇(t))·(T − η(t)) = η̇(t)·[Hess ϕ(η(t))(T − η(t))].
The last equality provides the interpretation of η ˙ ( t ) as the coordinate of the velocity in the conjugate vector basis Hess ϕ (η(t)) (Tη(t)), that is the basis of velocities along the η coordinates.
In conclusion, the first order geometry is characterized as follows.
Definition 3 (Tangent bundle Tℇ). The tangent space at each p ∈ ℇ is a vector space of random variables T_pℇ = Span (T_j − E_p[T_j] | j = 1, …, d), and the tangent bundle Tℇ = {(p, V) | p ∈ ℇ, V ∈ T_pℇ}, as a manifold, is defined by the chart:
Tℇ ∋ (e^{θ·T − ψ(θ)}, v·(T − E_θ[T])) ↦ (θ, v).
Proposition 2.
  • If V = v·(T − η) ∈ Tℇ, then V is represented in the conjugate basis as:
    V = v·(T − η) = v·(Hess ϕ(η))⁻¹ Hess ϕ(η)(T − η) = (Hess ϕ(η)⁻¹ v)·Hess ϕ(η)(T − η).
  • The mapping (Hess ϕ (η))−1 maps the coordinates v of a tangent vector VT ℇ with respect to the basis of centered sufficient statistics to the coordinates v* with respect to the conjugate basis.
  • In the θ parameters, the transformation is v ↦ v* = Hess ψ(θ)v.
Remark 1. In the finite state space case, it is not necessary to go on to the formal construction of a dual tangent bundle, because all finite dimensional vector spaces are isomorphic. However, this step is compulsory in the infinite state space case, as was done in [1]. Moreover, the explicit construction of natural connections and natural parallel transports of the tangent and dual tangent bundle is unavoidable when considering the second-order calculus, as was done in [1,8], in order to compute Hessians and implement Newton methods of optimization. However, the scope of the present paper is restricted to a basic study of gradient flows; hence, from now on, we focus on the Riemannian structure and disregard all second-order topics.
Proposition 3 (Riemannian metric). The tangent bundle has a Riemannian structure with the natural scalar product of each T_p, ⟨V, W⟩_p = E_p[VW]. In the basis of sufficient statistics, the metric is expressed by the Fisher information matrix I(p) = Cov_p(T, T), while in the conjugate basis, it is expressed by the inverse Fisher matrix I⁻¹(p).
Proof. In the basis of the sufficient statistics, V = v · (T − Ep [T]), W = w · (T − Ep [T]), so that:
⟨V, W⟩_p = v·E_p[(T − E_p[T])(T − E_p[T])ᵀ]·w = v·Cov_p(T, T)·w = v·I(p)·w,
where I(p) = Covp (T, T) is the Fisher information matrix.
If p = pθ = pη, the conjugate basis at p is:
Hess ϕ(η)(T − η) = Hess ψ(θ)⁻¹(T − ∇ψ(θ)) = I⁻¹(p)(T − E_p[T]),
so that for elements of the tangent space expressed in the conjugate basis, we have V = v* · I−1(p) (T − Ep [T]), W = w* · I−1(p) (T − Ep [T]); thus:
⟨V, W⟩_p = v*·E_p[I⁻¹(p)(T − E_p[T])(T − E_p[T])ᵀ I⁻¹(p)]·w* = v*·I⁻¹(p)·w*. □
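Proposition 3 can be spot-checked numerically: the pairing E_p[VW] agrees with v·I(p)·w in the basis of sufficient statistics and with v*·I⁻¹(p)·w* in the conjugate basis. A sketch with arbitrary illustrative data:

```python
import numpy as np

# Same scalar product computed in the two bases of Proposition 3.
rng = np.random.default_rng(7)
T = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 1.0]])
p = rng.random(4); p /= p.sum()
eta = p @ T
fisher = T.T @ np.diag(p) @ T - np.outer(eta, eta)   # I(p) = Cov_p(T, T)

v, w = rng.normal(size=2), rng.normal(size=2)
V = (T - eta) @ v                 # V = v . (T - E_p[T])
W = (T - eta) @ w
pairing = p @ (V * W)             # E_p[V W]
assert abs(pairing - v @ fisher @ w) < 1e-10

v_star, w_star = fisher @ v, fisher @ w              # conjugate coordinates
assert abs(pairing - v_star @ np.linalg.solve(fisher, w_star)) < 1e-8
```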

2.2. Gradient

For each C¹ real function F : ℇ → ℝ, its gradient is defined by taking the derivative along a C¹ curve I ∋ t ↦ p(t), p = p(0), and writing it with the Riemannian metric,
d/dt F(p(t))|_{t=0} = ⟨∇F(p), D/dt p(t)|_{t=0}⟩_p,    ∇F(p) ∈ T_pℇ.     (9)
If θ ↦ F̂(θ) is the expression of F in the parameter θ and t ↦ θ(t) is the expression of the curve, then d/dt F̂(θ(t)) = ∇F̂(θ(t))·θ̇(t); at p = p_{θ(0)}, the velocity is V = D/dt p(t)|_{t=0} = θ̇(0)·(T − ∇ψ(θ(0))), so that we obtain the celebrated Amari natural gradient of [16]:
⟨∇F(p), V⟩_p = (Hess ψ(θ(0))⁻¹ ∇F̂(θ(0)))ᵀ Hess ψ(θ(0)) θ̇(0).     (10)
If η ↦ F(η) is the expression of F in the parameter η and t ↦ η(t) is the expression of the curve, then d/dt F(η(t)) = ∇F(η(t))·η̇(t), so that at p = p_{η(0)}, with velocity V = d/dt log(p(t))|_{t=0} = η̇(0)·Hess ϕ(η(0))(T − η(0)),
⟨∇F(p), V⟩_p = (Hess ϕ(η(0))⁻¹ ∇F(η(0)))ᵀ Hess ϕ(η(0)) η̇(0).
We summarize all notions of gradient in the following definition.
Definition 4 (Gradients).
  • The random variable ∇F(p) uniquely defined by Equation (9) is called the (geometric) gradient of F at p. The mapping ∇F : p ↦ ∇F(p) is a vector field of Tℇ.
  • The vector ∇̃F̂(θ) = Hess ψ(θ)⁻¹ ∇F̂(θ) of Equation (10) is the expression of the geometric gradient in the θ parameters in the basis of sufficient statistics, and it is called the natural gradient, while ∇F̂(θ), which is the expression in the conjugate basis of the sufficient statistics, is called the vanilla gradient.
  • The vector ∇̃F(η) = Hess ϕ(η)⁻¹ ∇F(η) of Equation (10) is the expression of the geometric gradient in the η parameter and in the conjugate basis of the sufficient statistics, and it is called the natural gradient, while ∇F(η), which is the expression in the basis of sufficient statistics, is called the vanilla gradient.
Given a vector field of ℇ, i.e., a mapping G defined on ℇ such that G(p) ∈ T_pℇ (called a section of the tangent bundle in the standard differential geometric language), an integral curve from p is a curve I ∋ t ↦ p(t) such that p(0) = p and D/dt p(t) = G(p(t)). In the θ parameters, G(p_θ) = Ĝ(θ)·(T − ∇ψ(θ)), so that the differential equation is expressed by θ̇(t) = Ĝ(θ(t)). In the η parameters, G(p_η) = G(η)·Hess ϕ(η)(T − η), and the differential equation is η̇(t) = G(η(t)).
Definition 5 (Gradient flow). The gradient flow of the real function F : ℇ → ℝ is the flow of the differential equation D/dt p(t) = ∇F(p(t)), i.e., d/dt p(t) = p(t) ∇F(p(t)). The expression in the θ parameters is θ̇(t) = ∇̃F̂(θ(t)), and the expression in the η parameters is η̇(t) = ∇̃F(η(t)).
The cases of gradient computation we have discussed above are just special cases of a generic argument. Let us briefly study the gradient flow in a general chart f : ζ ↦ p_ζ. Consider the change of parameterization from ζ to θ,
ζ ↦ p_ζ ↦ θ(p_ζ) = I(p_ζ)⁻¹ Cov_{p_ζ}(T, log p_ζ),
and denote the Jacobian matrix of the parameters’ change by J(ζ). We have:
log p_ζ = T·θ(ζ) − ψ(θ(ζ)) = T·I(p_ζ)⁻¹ Cov_{p_ζ}(T, log p_ζ) − ψ(I(p_ζ)⁻¹ Cov_{p_ζ}(T, log p_ζ)),
and the ζ coordinate basis of the tangent space T_{p_ζ}ℇ consists of the components of the gradient with respect to ζ,
(∇_ζ log p_ζ) = J(ζ)ᵀ (T − E_{p_ζ}[T]).
It should be noted that in this case, the expression of the Fisher information matrix does not have the form of a Hessian of a potential function. In fact, the cases of the exponential and the mixture parameters point to a special structure, which is called the Hessian manifold; see [17].

2.3. Gradient Flow in the Mixture Geometry

From now on, we are going to focus on the expression of the gradient flow in the η parameters. From Definition 4, we have:
∇̃F(η) = Hess ϕ(η)⁻¹ ∇F(η) = Hess ψ(∇ϕ(η)) ∇F(η) = I(p_η) ∇F(η),
where I(p) = Cov_p(T, T). As p ↦ Cov_p(T, T) is the restriction to the simplex of a quadratic function, while η ↦ p_η is the restriction to the exponential family of a linear function, in some cases we can naturally consider the extension of the gradient flow equation outside M°. One notable case is when the function F is the relaxation of a non-constant state space function f : Ω → ℝ, as it is defined in, e.g., [3].
Proposition 4. Let f : Ω → ℝ, and let F (p) = Ep [f] be its relaxation on pℇ. It follows:
  • ∇F(p) is the least squares projection of f onto T_pℇ, that is:
    ∇F(p) = (I(p)⁻¹ Cov_p(f, T))·(T − E_p[T]).
  • The expressions in the exponential parameters θ are ∇̃F̂(θ) = (Hess ψ(θ))⁻¹ Cov_θ(f, T) and ∇F̂(θ) = Cov_θ(f, T), respectively.
  • The expressions in the mixture parameters η are ∇̃F(η) = Cov_η(f, T) and ∇F(η) = Hess ϕ(η) Cov_η(f, T), respectively.
Proof. On a generic curve through p with velocity V, we have d/dt E_{p(t)}[f]|_{t=0} = Cov_p(f, V) = ⟨f, V⟩_p. If V ∈ T_pℇ, we can orthogonally project f to get ⟨∇F, V⟩_p = ⟨(I⁻¹(p) Cov_p(f, T))·(T − E_p[T]), V⟩_p. □
Remark 2. Let us briefly recall the behavior of the gradient flow in the relaxation case. Let θ_n, n = 1, 2, …, be a minimizing sequence for F̂, and let p̄ be a limit point of the sequence (p_{θ_n})_n. It follows that p̄ has a defective support, in particular p̄ ∉ ℇ; see [18,19]. For a proof along lines coherent with the present paper, see [20] (Theorem 1). It is found that the support F̄ ⊂ Ω is exposed, that is, T(F̄) is a face of the marginal polytope M = Conv {T(x) | x ∈ Ω}. In particular, E_p̄[T] = η̄ belongs to a face of the marginal polytope M. If a is the (interior) orthogonal of the face, that is, a·T(x) + b ≥ 0 for all x ∈ Ω and a·T(x) + b = 0 on the exposed set, then a·(T(x) − η̄) = 0 on the face, so that a·Cov_p̄(f, T) = 0. If we extend the mapping η ↦ Cov_η(f, T) on the closed marginal polytope M to be the limit of the vector field of the gradient on the faces of the marginal polytope, we expect to see that such a vector field is tangent to the faces. This remark is further elaborated below in the binary case.
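For the saturated model of the next subsection, where the sufficient statistics are the indicators (X = j), Proposition 4 gives Cov_η(f, T_j) = p_j(f(j) − E_p[f]), so the minimizing flow η̇ = −Cov_η(f, T) takes a replicator-like form. A forward-Euler sketch (the objective f and the step size below are illustrative choices, not from the paper):

```python
import numpy as np

# Euler discretization of eta_dot = -Cov_eta(f, T) on the saturated model:
# eta_dot_j = -p_j (f(j) - E_p[f]) for indicator statistics.
f = np.array([3.0, 1.0, 2.0, 0.5])   # objective on Omega; minimum at x = 4
p = np.full(4, 0.25)                 # start at the uniform density
step = 0.1
for _ in range(200):
    p = p - step * p * (f - p @ f)   # one Euler step of the flow

# the trajectory stays in the simplex and concentrates on argmin f
```

Note that each Euler step preserves the affine constraint Σp = 1 exactly, since the increments sum to zero; this is consistent with Remark 2, which predicts the extended vector field to be tangent to the faces.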

2.4. The Saturated Model

A case of special tutorial interest is obtained when the exponential family contains all positive probability densities, that is, when ℇ = P_>. This case has been treated by many authors; here, we use the presentation of [21].
It is convenient to recode the sample space as Ω = {0, …, d}, where x = 0 is a distinguished point. If X is the identity on Ω, we define the sufficient statistics to be the indicator functions of points Tj = (X = j), j = 1, …, d. The saturated exponential family consists of all of the positive densities written as:
p ( x ; θ ) = exp ( j = 1 d θ j ( X = j ) ψ ( θ ) ) ,
where:
ψ ( θ ) = log ( 1 + j = 1 d e θ j ) .
Note that, in this case, the expectation parameter ηj = E ((X = j)) is the probability of case x = j and the marginal polytope is the probability simplex Δd.
The gradient mapping is:
η = ∇ψ(θ) = ( e^{θ_j} / (1 + Σ_{i=1}^d e^{θ_i}) | j = 1, …, d ),
the inverse gradient mapping is defined for η ∈ ]0, 1[^d with Σ_i η_i < 1 by:
θ = (∇ψ)⁻¹(η) = ∇ϕ(η) = ( log(η_j / (1 − Σ_{i=1}^d η_i)) | j = 1, …, d ),
the negative entropy (Legendre conjugate) is:
ϕ(η) = η·∇ϕ(η) − ψ(∇ϕ(η)) = Σ_{j=1}^d η_j log( η_j / (1 − Σ_{i=1}^d η_i) ) + log(1 − Σ_{i=1}^d η_i),
the η parameterization (1) of the probability is:
p_η = exp((T − η)·∇ϕ(η) + ϕ(η))
    = exp( Σ_{j=1}^d ((X = j) − η_j) log(η_j / (1 − Σ_i η_i)) + Σ_{j=1}^d η_j log(η_j / (1 − Σ_i η_i)) + log(1 − Σ_i η_i) )
    = exp( Σ_{j=1}^d (X = j) log(η_j / (1 − Σ_i η_i)) + log(1 − Σ_i η_i) )
    = Π_{j=1}^d ( η_j / (1 − Σ_i η_i) )^{(X=j)} · (1 − Σ_i η_i)
    = (1 − Σ_i η_i)^{(X=0)} Π_{j=1}^d η_j^{(X=j)}.
Remark 3. The previous equation prompts three crucial remarks:
  • The expression of the probability in the η parameters is a normalized monomial in the parameters.
  • The expression continuously extends the exponential family to the probabilities in P .
  • The expression actually is a polynomial parameterization of the signed densities P 1.
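The first remark can be checked numerically: the monomial expression in η coincides with the exponential form of the saturated model. A sketch with an arbitrary illustrative choice of d and θ:

```python
import numpy as np

# Compare the exponential form p(j) = exp(theta_j - psi) with the
# monomial form p(j) = eta_j, p(0) = 1 - sum eta.
rng = np.random.default_rng(6)
d = 3
theta = rng.normal(size=d)
q = np.exp(theta)
psi = np.log(1 + q.sum())
eta = q / (1 + q.sum())                       # eta = grad psi(theta)

p_exp = np.concatenate([[np.exp(-psi)], np.exp(theta - psi)])   # x = 0, ..., d
p_mono = np.concatenate([[1 - eta.sum()], eta])                 # monomial form
assert np.allclose(p_exp, p_mono)
```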
We proceed to approach the three issues above. The Hessian functions are:
Hess ψ(θ) = diag(p) − p pᵀ,    p = (1 + Σ_{j=1}^d e^{θ_j})⁻¹ (e^{θ_1}, …, e^{θ_d}),
Hess ϕ(η) = diag(η)⁻¹ + η_0⁻¹ [1]_{i,j=1}^d,    η_0 = 1 − Σ_{j=1}^d η_j.
The matrix Hess ψ(θ) is the Fisher information matrix I(p) of the exponential family at p = pθ, and the matrix Hess ϕ (η) is the inverse Fisher information matrix I−1(p) at p = pη. It follows that the natural gradient of a function ηh(η) will be:
∇̃h(η) = Hess ϕ(η)⁻¹ ∇h(η),
whose behavior depends on the following theorem; see [21] (Proposition 3).
Proposition 5.
  • The inverse Fisher information matrix I(p)⁻¹ is zero on the vertices of the simplex, only.
  • The determinant of the inverse Fisher information matrix I(p)−1 is:
    det(I(p)⁻¹) = (1 − Σ_{i=1}^n p_i) Π_{i=1}^n p_i.
  • The determinant of the inverse Fisher information matrix I(p)−1 is zero on the borders of the simplex, only.
  • On the interior of each facet, the rank of the inverse Fisher information matrix I(p)⁻¹ is (n − 1), and the (n − 1) linearly independent column vectors generate the subspace parallel to the facet itself.
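Both Proposition 1 (the two Hessians are mutually inverse) and the determinant formula above can be verified numerically for the saturated model. A sketch with arbitrary illustrative values (d = 3):

```python
import numpy as np

# Hess psi = diag(p) - p p' and Hess phi = diag(eta)^{-1} + [1]/eta_0
# are mutually inverse; det(diag(p) - p p') = (1 - sum p_i) prod p_i.
rng = np.random.default_rng(0)
d = 3
theta = rng.normal(size=d)
q = np.exp(theta)
eta = q / (1 + q.sum())
eta0 = 1 - eta.sum()

hess_psi = np.diag(eta) - np.outer(eta, eta)
hess_phi = np.diag(1 / eta) + np.ones((d, d)) / eta0
assert np.allclose(hess_psi @ hess_phi, np.eye(d))

assert abs(np.linalg.det(hess_psi) - eta0 * eta.prod()) < 1e-12
```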
A generic statistical model can be seen as a submanifold of the saturated model, so that the form of the gradient in the submanifold is derived according to the general results in differential geometry. We do not do that here, and we switch to some very specific examples.

3. Toric Models: A Tutorial Example

Exponential families whose sample space is an integer lattice, such as finite subsets of ℤ² or {+1, −1}^d, have special algebro-combinatorial features that fall under the name of algebraic statistics. Seminal papers are [22,23]. Monographs on the topic are [24–26]. The book [27] covers both information geometry and algebraic statistics.
We do not assume the reader has detailed information about algebraic statistics. In this section, we work on a toy example intended to show both the basic mechanism of algebraic statistics and how the algebraic concepts are applied to the gradient flow problem as it was described in the previous section.
First, we give a general definition of the object on which we focus. A toric model is an exponential family, such that the orthogonal space of the space generated by the sufficient statistics and the constant has a vector basis of integer-valued random variables. We consider this example:
Ω    T1   T2   T3
1     0    0   −2
2     0    1    1
3     1    0    2
4     2    1   −1     (12)
which corresponds to a variation of the classical independence model, where the design corresponds to the vertices of a square. In this example, we moved the point {4} from (1, 1) to (2, 1).
In Equation (12), T1 and T2 are the sufficient statistics of the exponential family:
p_θ = exp(θ1 T1 + θ2 T2 − ψ(θ)),    ψ(θ) = log(1 + e^{θ2} + e^{θ1} + e^{2θ1+θ2}),     (13)
while T3 is an integer-valued vector basis of the orthogonal space Span(1, T1, T2)^⊥.
For the purpose of the generalization to less trivial examples, it should be noted that T3 = T3⁺ − T3⁻, that is, (−2, 1, 2, −1) = (0, 1, 2, 0) − (2, 0, 0, 1). The couple (T3⁺, T3⁻) connects the lattice defined by:
ℒ = { (Y, Z) ∈ ℕ⁴ × ℕ⁴ | BᵀY = BᵀZ },    B = [1  T1  T2].
Such a set of generators is called a Markov basis of the lattice; see [22]. Algorithms are available to compute such a set of generators and are implemented, for instance, in the software suite 4ti2; see [28].
The sample space can be identified with the values of the sufficient statistics, hence with a finite subset of ℚ², Ω ≅ {(0, 0), (0, 1), (1, 0), (2, 1)}; see Figure 1. Given a finite subset of ℝ^d, it is a general algebraic fact that there exists a filtering set of monomial functions that is a vector basis of all real functions on the subset itself; see an exposition and the applications to statistics in [24] or [27]. In our case, the monomial basis is 1, T1, T2, T1T2, and we define the matrix of the saturated model to be:
      1  T1  T2  T1T2
A = [ 1   0   0   0
      1   0   1   0
      1   1   0   0
      1   2   1   2 ],    A⁻¹ = (1/2) [  2   0   0   0
                                        −2   0   2   0
                                        −2   2   0   0
                                         2  −1  −2   1 ].
The matrix A maps probabilities one-to-one onto expected values,
[1  η1  η2  η12] = [1  E[T1]  E[T2]  E[T1T2]] = [p1  p2  p3  p4] [ 1  0  0  0
                                                                   1  0  1  0
                                                                   1  1  0  0
                                                                   1  2  1  2 ],     (15)
and vice versa,
[p1  p2  p3  p4] = [1  η1  η2  η12] [  1    0    0    0
                                      −1    0    1    0
                                      −1    1    0    0
                                       1  −1/2  −1   1/2 ].
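The two matrices and the round trip between probabilities and moments can be checked mechanically; a sketch:

```python
import numpy as np

# A sends p to (1, eta1, eta2, eta12) by right multiplication;
# the second matrix is its inverse.
A = np.array([[1, 0, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 2, 1, 2]], dtype=float)
A_inv = 0.5 * np.array([[ 2,  0,  0, 0],
                        [-2,  0,  2, 0],
                        [-2,  2,  0, 0],
                        [ 2, -1, -2, 1]])
assert np.allclose(A @ A_inv, np.eye(4))

rng = np.random.default_rng(2)
p = rng.random(4); p /= p.sum()
moments = p @ A                  # (1, E[T1], E[T2], E[T1 T2])
assert abs(moments[0] - 1) < 1e-12
assert np.allclose(moments @ A_inv, p)   # round trip
```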
On Model (13), the (positive) probabilities are constrained by the model:
Ω    p_θ(x) = exp(θ1 T1(x) + θ2 T2(x) − ψ(θ)),    ψ(θ) = log(1 + e^{θ2} + e^{θ1} + e^{2θ1+θ2}):
1    p(1; θ) = exp(−ψ(θ))
2    p(2; θ) = exp(θ2 − ψ(θ))
3    p(3; θ) = exp(θ1 − ψ(θ))
4    p(4; θ) = exp(2θ1 + θ2 − ψ(θ)).     (17)
If we introduce the parameters ζ1 = exp (θ1), ζ2 = exp (θ2), the model is shown to be a (piece of an) algebraic variety, that is a set described by the rational parametric equations:
Ω    p(x; ζ) = ζ1^{T1(x)} ζ2^{T2(x)} / (1 + ζ2 + ζ1 + ζ1²ζ2):
1    p(1; ζ) = 1 / (1 + ζ2 + ζ1 + ζ1²ζ2)
2    p(2; ζ) = ζ2 / (1 + ζ2 + ζ1 + ζ1²ζ2)
3    p(3; ζ) = ζ1 / (1 + ζ2 + ζ1 + ζ1²ζ2)
4    p(4; ζ) = ζ1²ζ2 / (1 + ζ2 + ζ1 + ζ1²ζ2).     (18)
The peculiar structure of the toric model is best seen by considering the unnormalized probabilities:
Ω    q(x; ζ) = ζ1^{T1(x)} ζ2^{T2(x)}:
1    q(1; ζ) = 1
2    q(2; ζ) = ζ2
3    q(3; ζ) = ζ1
4    q(4; ζ) = ζ1²ζ2,      p(x; ζ) = q(x; ζ) / (1 + ζ2 + ζ1 + ζ1²ζ2).
In algebraic terms, the homogeneous coordinates [q1 : q2 : q3 : q4] belong to the projective space ℙ³. Precisely, the (real) projective space ℙ³ is the set of all non-zero points of ℝ⁴ together with the equivalence relation [q1 : q2 : q3 : q4] = [q̄1 : q̄2 : q̄3 : q̄4] if, and only if, (q1, q2, q3, q4) = k(q̄1, q̄2, q̄3, q̄4), k ≠ 0. The domain of unnormalized signed probabilities as projective points is the open subset ℙ³_* of ℙ³ where q1 + q2 + q3 + q4 ≠ 0. On this set, we can compute the normalization:
ℙ³_* ∋ [q1 : q2 : q3 : q4] ↦ (q1, q2, q3, q4) / (q1 + q2 + q3 + q4),
whose image is the affine space generated by the simplex Δ3. Notice that this embedding produces a number of natural geometrical structures on that affine space.
Because of the form of (13), a positive density p belongs to that family if, and only if, log p ∈ Span(1, T1, T2), which, in turn, is equivalent to log p ⊥ T3. We can rewrite the orthogonality as:
0 = Σ_{x∈Ω} log p(x) T3(x) = Σ_{x: T3(x)>0} log p(x) T3⁺(x) − Σ_{x: T3(x)<0} log p(x) T3⁻(x)
  = log( Π_{x: T3(x)>0} p(x)^{T3⁺(x)} ) − log( Π_{x: T3(x)<0} p(x)^{T3⁻(x)} ).
Dropping the log function in the last expression, we observe that the positive probabilities described by either Equation (17) with θ1, θ2 ∈ ℝ or Equation (18) with ζ1, ζ2 ∈ ℝ> are equivalently described by the equations:
p1 + p2 + p3 + p4 − 1 = 0,     (20)
p1² p4 − p2 p3² = 0.     (21)
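A direct numerical check that the densities of the model satisfy both equations for random parameter values:

```python
import numpy as np

# Densities of the toric model satisfy sum p - 1 = 0 and
# p1^2 p4 - p2 p3^2 = 0 for every choice of (theta1, theta2).
rng = np.random.default_rng(3)
for _ in range(5):
    t1, t2 = rng.normal(size=2)
    q = np.array([1.0, np.exp(t2), np.exp(t1), np.exp(2 * t1 + t2)])
    p = q / q.sum()
    assert abs(p.sum() - 1) < 1e-12                      # Equation (20)
    assert abs(p[0]**2 * p[3] - p[1] * p[2]**2) < 1e-12  # Equation (21)
```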
Equation (21) identifies a surface within the probability simplex Δ3, which is represented in Figure 2 by the triangularization of a grid of points that satisfy the invariant.
By choosing a basis for the space orthogonal to Span (1, T1, T2), we can embed the marginal polytope of Figure 1 into the associated full marginal polytope. By expressing probabilities as a function of the expectation parameters, Equation (21) identifies a relationship between η1, η2 and the expected values of the chosen basis for the orthogonal space. This corresponds to an equivalent invariant in the expectation parameters, which, in turn, identifies a surface in the full marginal polytope.
For instance, consider the full marginal polytope parametrized by η = (η1, η2, η3), with η 3 = E [ T 3 ], which corresponds to the choice of T3 as a basis for the space orthogonal to the span of the sufficient statistics of the model, together with the constant 1, as in Equation (12). We introduce the following matrix:
      1  T1  T2  T3
B = [ 1   0   0  −2
      1   0   1   1
      1   1   0   2
      1   2   1  −1 ],
and similarly to Equation (15), we use the B matrix to one-to-one map probabilities into expected values, that is:
[1  η1  η2  η3]ᵀ = [  1  1  1   1
                      0  0  1   2
                      0  1  0   1
                     −2  1  2  −1 ] [p1  p2  p3  p4]ᵀ,
and:
[p1  p2  p3  p4]ᵀ = [  3/5  −1/5  −2/5   −1/5
                       1/5  −2/5   7/10   1/10
                       2/5   1/5  −3/5    1/5
                      −1/5   2/5   3/10  −1/10 ] [1  η1  η2  η3]ᵀ.
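The pair of matrices can be checked for mutual inversion (the signs being recovered from T3 = (−2, 1, 2, −1)); a sketch:

```python
import numpy as np

# The matrix mapping p to (1, eta1, eta2, eta3) and its inverse.
M = np.array([[ 1, 1, 1,  1],
              [ 0, 0, 1,  2],
              [ 0, 1, 0,  1],
              [-2, 1, 2, -1]], dtype=float)
M_inv = np.array([[ 3/5, -1/5, -2/5,  -1/5],
                  [ 1/5, -2/5,  7/10,  1/10],
                  [ 2/5,  1/5, -3/5,   1/5],
                  [-1/5,  2/5,  3/10, -1/10]])
assert np.allclose(M @ M_inv, np.eye(4))
assert np.allclose(M_inv @ M, np.eye(4))
```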
Then, by expressing probabilities as a function of the expectation parameters in Equation (21), we obtain the following invariant in η associated with the model:
(4η1 + 3η2 − η3 − 2)(η1 + 2η2 + η3 − 3)² + (4η1 − 7η2 − η3 − 2)(η1 − 3η2 + η3 + 2)² = 0.     (25)
From the linear relationship between probabilities and expectation parameters, we know that on the interior of the full marginal polytope there exists a unique η3, which can be computed as a function of the other expectation parameters. Solving Equation (25) for η3 allows one to express explicitly the value of η3 given (η1, η2) and to represent the surface associated with the invariant in the full marginal polytope. However, the cubic polynomial in Equation (25) in general admits three roots. The unique value of η3 can be obtained from the roots of the cubic polynomial by imposing that η3 must be real and belong to the full marginal polytope given by Conv {(T1(x), T2(x), T3(x)) | x ∈ Ω}.
Recall that the discriminant Δ associated with the cubic equation obtained from Equation (25) in the η3 variable,
a η 3 3 + b η 3 2 + c η 3 + d = 0 ,
with:
a = 1,
b = −2η1 + η2 + 1,
c = −(4η1 + 3η2 − 2)(η1 + 2η2 − 3) + (1/2)(η1 + 2η2 − 3)² + (−4η1 + 7η2 + 2)(η1 − 3η2 + 2) + (1/2)(η1 − 3η2 + 2)²,
d = −(1/2)(4η1 + 3η2 − 2)(η1 + 2η2 − 3)² − (1/2)(4η1 − 7η2 − 2)(η1 − 3η2 + 2)²,
is given by:
Δ = 18abcd − 4b³d + b²c² − 4ac³ − 27a²d².
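The listed coefficients can be validated against the invariant itself: expanding Equation (25) in η3 yields −2 times the monic cubic. A numerical sketch:

```python
import numpy as np

# The invariant of Equation (25) equals -2 * (eta3^3 + b eta3^2 + c eta3 + d).
def invariant(e1, e2, e3):
    return ((4*e1 + 3*e2 - e3 - 2) * (e1 + 2*e2 + e3 - 3)**2
            + (4*e1 - 7*e2 - e3 - 2) * (e1 - 3*e2 + e3 + 2)**2)

def cubic(e1, e2, e3):
    b = -2*e1 + e2 + 1
    c = (-(4*e1 + 3*e2 - 2) * (e1 + 2*e2 - 3) + 0.5 * (e1 + 2*e2 - 3)**2
         + (-4*e1 + 7*e2 + 2) * (e1 - 3*e2 + 2) + 0.5 * (e1 - 3*e2 + 2)**2)
    d = (-0.5 * (4*e1 + 3*e2 - 2) * (e1 + 2*e2 - 3)**2
         - 0.5 * (4*e1 - 7*e2 - 2) * (e1 - 3*e2 + 2)**2)
    return e3**3 + b * e3**2 + c * e3 + d

rng = np.random.default_rng(4)
for e1, e2, e3 in rng.normal(size=(5, 3)):
    assert abs(invariant(e1, e2, e3) + 2 * cubic(e1, e2, e3)) < 1e-9
```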
For Δ = 0, the polynomial has a repeated real root; for Δ < 0, we have one real root and two complex conjugate roots, while for Δ > 0, there exist three distinct real roots. The three roots of the polynomial as functions of the coefficients are given by:
η_{3,k} = −(1/(3a)) ( b + u_k C + Δ0 / (u_k C) ),
for k ∈ {1, 2, 3}, with:
u1 = 1,
u2 = (−1 + i√3)/2,
u3 = (−1 − i√3)/2,
and:
C = ∛( (Δ1 + √(Δ1² − 4Δ0³)) / 2 ),
Δ0 = b² − 3ac,
Δ1 = 2b³ − 9abc + 27a²d.
For the cubic polynomial in η3 of Equation (25), Δ < 0 for η2 − 1 ≠ 0 and for:
4η1⁴ − 8η1³η2 + 24η1²η2² − 20η1η2³ − 2η2⁴ − 8η1³ − 12η1²η2 + 4η2³ + 8η1² + 16η1η2 − η2² − 4η1 − 2η2 + 1 > 0.
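The sign condition can be spot-checked against the discriminant formula at random points; a sketch (the reconstructed signs of the quartic are what is being tested):

```python
import numpy as np

# Delta < 0 exactly where the quartic is positive (away from eta2 = 1
# and from the zero locus of the quartic).
def quartic(e1, e2):
    return (4*e1**4 - 8*e1**3*e2 + 24*e1**2*e2**2 - 20*e1*e2**3 - 2*e2**4
            - 8*e1**3 - 12*e1**2*e2 + 4*e2**3 + 8*e1**2 + 16*e1*e2
            - e2**2 - 4*e1 - 2*e2 + 1)

def discriminant(e1, e2):
    b = -2*e1 + e2 + 1
    c = (-(4*e1 + 3*e2 - 2) * (e1 + 2*e2 - 3) + 0.5 * (e1 + 2*e2 - 3)**2
         + (-4*e1 + 7*e2 + 2) * (e1 - 3*e2 + 2) + 0.5 * (e1 - 3*e2 + 2)**2)
    d = (-0.5 * (4*e1 + 3*e2 - 2) * (e1 + 2*e2 - 3)**2
         - 0.5 * (4*e1 - 7*e2 - 2) * (e1 - 3*e2 + 2)**2)
    return 18*b*c*d - 4*b**3*d + b**2*c**2 - 4*c**3 - 27*d**2   # a = 1

rng = np.random.default_rng(5)
for e1, e2 in rng.uniform(0.0, 2.0, size=(20, 2)):
    if abs(quartic(e1, e2)) > 1e-3 and abs(e2 - 1) > 0.05:
        assert (discriminant(e1, e2) < 0) == (quartic(e1, e2) > 0)
```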
In Figure 3(a), we represent in blue the region of the space (η1, η2) where Δ < 0, in red the region where Δ > 0, and with a dashed line the points where Δ = 0. For Δ < 0, the only real root is η3,1, which identifies the blue surface in the full marginal polytope in Figure 3(b). For Δ > 0, it is easy to verify that only η3,2 belongs to the interior of the full marginal polytope parametrized by (η1, η2, η3), since it satisfies the inequalities given by the facets of the marginal polytope; it is represented in Figure 3(b) by the red surface. Finally, the three real roots coincide for Δ = 0, that is, for η2 = 1 or where:
4 η 1 4 8 η 1 3 η 2 + 24 η 1 2 η 2 2 20 η 1 η 2 3 2 η 2 4 8 η 1 3 12 η 1 2 η 2 + 4 η 2 3 + 8 η 1 2 + 16 η 1 η 2 η 2 2 4 η 1 2 η 2 + 1 = 0.
In the polynomial ring ℚ [p1, p2, p3, p4], the model ideal:
\[ I = \left\langle\, p_1 + p_2 + p_3 + p_4 - 1,\; p_1^2 p_4 - p_2 p_3^2 \,\right\rangle \]
consists of all the polynomials of the form:
\[ A\,(p_1 + p_2 + p_3 + p_4 - 1) + B\,(p_1^2 p_4 - p_2 p_3^2), \qquad A, B \in \mathbb{Q}[p_1, p_2, p_3, p_4]. \]
The algebraic variety of I uniquely extends the exponential family outside the positive octant. In the language of commutative algebra, it is the real Zariski closure of the exponential family model; cf. [29]. It is a notable example of a toric variety. The general theory is in the monograph [30], and the applications to statistical models were first discussed in [31,32].
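As a quick sanity check (a sketch we add here, writing ζ1 = e^{θ1}, ζ2 = e^{θ2} for the exponentiated natural parameters), every density of the exponential family satisfies both generators of the ideal I, so the family indeed lies on the variety:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    z1, z2 = rng.uniform(0.1, 5.0, size=2)         # zeta_i = exp(theta_i) > 0
    D = 1 + z2 + z1 + z1*z1*z2                     # normalizing constant
    # Monomial parameterization of the family: p ~ (1, zeta2, zeta1, zeta1^2 zeta2).
    p1, p2, p3, p4 = 1/D, z2/D, z1/D, z1*z1*z2/D
    assert abs(p1 + p2 + p3 + p4 - 1) < 1e-12      # first generator
    assert abs(p1**2 * p4 - p2 * p3**2) < 1e-12    # second generator (binomial)
```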
Let us discuss in some detail the parameterization of the toric variety as the submanifold of ℝ4 defined by Equations (20) and (21). The Jacobian matrix is:
\[ J = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2p_1p_4 & -p_3^2 & -2p_2p_3 & p_1^2 \end{bmatrix}. \]
It has rank one, that is, there is a singularity, if, and only if,
\[ 2p_1p_4 = -p_3^2 = -2p_2p_3 = p_1^2. \]
This is equivalent to p1 = p3 = 0, which defines a subspace of dimension two, whose intersection with Equation (20) is a line C in the affine space {p ∈ ℝ⁴ | p1 + p2 + p3 + p4 = 1}. This (double) critical line intersects the simplex along the edge δ2δ4. Outside C, that is, in the open complement, the equations of the toric variety are locally solvable in two of the pi’s, under the condition that the corresponding minor is not zero. To get a picture of this critical set, let us intersect our surface with the plane p3 = 0. On the affine space p1 + p2 + p4 = 1, we have p1²p4 = 0, that is, the union of the double line p1 = 0 with the line p4 = 0.
In the following, we derive a parameterization based on an algebraic argument, the Bézout theorem. It is remarkable that the cubic surface defined by Equations (20) and (21) is a well-known example of a ruled surface; see Exercise 5.8.15 in [33]. Since the singular line is a double line, the residual intersection of the cubic surface with any plane through the singular line is of degree 1 = 3 − 2, by the Bézout theorem, and thus it is a line.
The line C is said to be double because the polynomial p 1 2 p 4 p 2 p 3 2 belongs to the ideal generated by p 1 2 and p 3 2. Let us consider the sheaf of planes through the singular line defined for each [α : β] ∈ P1 by the equations:
\[ P_{[\alpha:\beta]} = \left\{\, p_1 + p_2 + p_3 + p_4 - 1 = 0,\; \alpha p_1 + \beta p_3 = 0 \,\right\}. \]
Let us intersect each plane P [ α : β ] of the sheaf with the model variety M by solving the system of equations:
\[ \begin{cases} p_1 + p_2 + p_3 + p_4 = 1 \\ p_1^2 p_4 - p_2 p_3^2 = 0 \\ \alpha p_1 + \beta p_3 = 0. \end{cases} \]
On the critical line C, a generic point is parameterized as p(τ, 0) = (0, τ, 0, 1 − τ), which satisfies Equation (42) for τ ∈ ℝ. If 0 ≤ τ ≤ 1, then p(τ, 0) belongs to the edge δ2δ4.
As the critical line is double and the intersection of the model variety with the plane of the sheaf is a cubic curve, we expect the remaining part to be of degree 3 − 2 = 1, that is to be a line. Assume first α, β ≠ 0. Outside the critical line, as p1, p3 are not both zero and αp1 + βp3 = 0, then αp1 = − βp3 ≠ 0. It follows (αp1)2 = (βp3)2≠ 0; hence:
\[ p_1^2 p_4 - p_2 p_3^2 = 0 \;\Longleftrightarrow\; \beta^2 (\alpha p_1)^2 p_4 - \alpha^2 p_2 (\beta p_3)^2 = 0 \;\Longleftrightarrow\; \beta^2 p_4 - \alpha^2 p_2 = 0. \]
We have found that for α, β ≠ 0, the intersection between the plane P [ α : β ] and the model variety M is the union of the critical line C and the line of equations:
\[ \begin{cases} p_1 + p_2 + p_3 + p_4 = 1 \\ \alpha p_1 + \beta p_3 = 0 \\ \alpha^2 p_2 - \beta^2 p_4 = 0. \end{cases} \]
This line intersects the critical line where:
\[ p_1 = p_3 = 0, \qquad p_2 + p_4 = 1, \qquad \alpha^2 p_2 - \beta^2 p_4 = 0, \]
that is in the point:
\[ p([\alpha:\beta], 0) = \left( 0,\; \frac{\beta^2}{\alpha^2+\beta^2},\; 0,\; \frac{\alpha^2}{\alpha^2+\beta^2} \right). \]
In parametric form, the line in Equations (43) is:
\[ p([\alpha:\beta], t) = p([\alpha:\beta], 0) + u\,t, \qquad u = \left( \beta,\; \frac{\beta^2(\alpha-\beta)}{\alpha^2+\beta^2},\; -\alpha,\; \frac{\alpha^2(\alpha-\beta)}{\alpha^2+\beta^2} \right), \]
that is,
\[ \begin{aligned} p_1([\alpha:\beta], t) &= \beta t, \\ p_2([\alpha:\beta], t) &= \frac{\beta^2}{\alpha^2+\beta^2} + \frac{\beta^2(\alpha-\beta)}{\alpha^2+\beta^2}\, t, \\ p_3([\alpha:\beta], t) &= -\alpha t, \\ p_4([\alpha:\beta], t) &= \frac{\alpha^2}{\alpha^2+\beta^2} + \frac{\alpha^2(\alpha-\beta)}{\alpha^2+\beta^2}\, t. \end{aligned} \]
The same equations hold in the previously excluded case αβ = 0.
Positive values of components 1 and 3 of the probability are obtained in Equation (44) for αβ < 0 and βt > 0, say α < 0, β > 0, t > 0. In this case, we have for component 2:
\[ \frac{\beta^2}{\alpha^2+\beta^2} + \frac{\beta^2(\alpha-\beta)}{\alpha^2+\beta^2}\, t = \frac{\beta^2}{\alpha^2+\beta^2}\bigl( 1 - (\beta-\alpha)\,t \bigr), \]
which is positive if t < (β − α)⁻¹. The same condition applies to component 4. As [α : β] = [α/(β−α) : β/(β−α)], we can always assume β > 0 and β − α = 1, that is, α = β − 1; hence, β < 1. The parameterization of the positive probabilities in the model becomes:
\[ \begin{aligned} p_1(\alpha, t) &= (\alpha+1)\,t \\ p_2(\alpha, t) &= \frac{(\alpha+1)^2 (1-t)}{2\alpha^2+2\alpha+1} \\ p_3(\alpha, t) &= -\alpha t \\ p_4(\alpha, t) &= \frac{\alpha^2 (1-t)}{2\alpha^2+2\alpha+1}, \end{aligned} \qquad 0 < t < 1,\; -1 < \alpha < 0. \]
For example, with α = −1/2, we have:
\[ p_1(\alpha, t) = \tfrac{1}{2}t, \qquad p_2(\alpha, t) = \tfrac{1}{2}(1-t), \qquad p_3(\alpha, t) = \tfrac{1}{2}t, \qquad p_4(\alpha, t) = \tfrac{1}{2}(1-t), \qquad 0 < t < 1. \]
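The parameterization can be verified numerically; the following sketch (assuming the form of Equation (45) with β = α + 1 derived above) checks positivity, normalization, the invariant of Equation (21), and the inversion of the chart:

```python
import numpy as np

def p_ruled(alpha, t):
    """Point of the model in the (alpha, t) chart, with beta = alpha + 1."""
    K = 2*alpha**2 + 2*alpha + 1
    return np.array([(alpha + 1)*t,
                     (alpha + 1)**2*(1 - t)/K,
                     -alpha*t,
                     alpha**2*(1 - t)/K])

for alpha in (-0.9, -0.5, -0.1):
    for t in (0.2, 0.7):
        p = p_ruled(alpha, t)
        assert np.all(p > 0) and abs(p.sum() - 1) < 1e-12  # a probability
        assert abs(p[0]**2*p[3] - p[1]*p[2]**2) < 1e-12    # invariant of Eq. (21)
        # Chart inversion: beta = p1/(p1+p3) and t = p1 + p3.
        assert abs(p[0]/(p[0] + p[2]) - (alpha + 1)) < 1e-12
        assert abs((p[0] + p[2]) - t) < 1e-12
```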
In Figure 4(a), we represent the surface associated with the invariant of Equation (21) as a ruled surface in the probability simplex, according to Equations (45), where the blue line corresponds to the case α = −1/2. The ruled surface corresponds to the surface in Figure 2 that was approximated by the triangulation of a grid of points satisfying the invariant. In Figure 4(b), we represent the same lines of Figure 4(a) in the chart (α, t).
From Equation (45), we can express the expectation parameters η as a function of (α, t), i.e.,
\[ \eta_1 = \frac{2\alpha^2 - (2\alpha^3+4\alpha^2+\alpha)\,t}{2\alpha^2+2\alpha+1}, \qquad \eta_2 = 1 - t, \qquad \eta_3 = \frac{2\alpha+1 - (8\alpha^3+12\alpha^2+10\alpha+3)\,t}{2\alpha^2+2\alpha+1}. \]
Notice that the dependence on (α, t) is rational. In Figure 5(a), the ruled surface has been represented in the full marginal polytope, while in Figure 5(b), the lines have been projected onto the marginal polytope.
Let us invert Equation (45) to obtain the corresponding chart p ↦ (β, t). From p1 and p3, we obtain β = p1/(p1 + p3). As p2 + p4 = 1 − t, we have the chart:
\[ \beta = \frac{p_1}{p_1 + p_3}, \qquad t = 1 - p_2 - p_4 = p_1 + p_3. \]
It is remarkable that the chart depends on the restriction of the probability to {1, 3}; similarly, on the model, the expectation parameters depend on p1 and p3 only.
From the theory of exponential families, we know that the gradient mapping:
\[ (\theta_1, \theta_2) \mapsto \nabla\psi(\theta_1, \theta_2) = \begin{bmatrix} \dfrac{2e^{2\theta_1+\theta_2} + e^{\theta_1}}{e^{2\theta_1+\theta_2} + e^{\theta_1} + e^{\theta_2} + 1} \\[2ex] \dfrac{e^{2\theta_1+\theta_2} + e^{\theta_2}}{e^{2\theta_1+\theta_2} + e^{\theta_1} + e^{\theta_2} + 1} \end{bmatrix} \]
is one-to-one from ℝ2 onto the interior of the marginal polytope M; see Figure 3(b). The equations:
\[ \eta_1 = \frac{\zeta_1 + 2\zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}, \qquad \eta_2 = \frac{\zeta_2 + \zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}, \]
are uniquely solvable for (η1, η2) ∈ M°. We study the local solvability in ζ1, ζ2 of:
\[ (1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2)\,\eta_1 = \zeta_1 + 2\zeta_1^2\zeta_2, \qquad (1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2)\,\eta_2 = \zeta_2 + \zeta_1^2\zeta_2, \]
that is,
\[ 0 = \eta_1 + (\eta_1 - 1)\zeta_1 + \eta_1\zeta_2 + (\eta_1 - 2)\zeta_1^2\zeta_2, \qquad 0 = \eta_2 + \eta_2\zeta_1 + (\eta_2 - 1)\zeta_2 + (\eta_2 - 1)\zeta_1^2\zeta_2. \]
The Jacobian is:
\[ \begin{bmatrix} (\eta_1 - 1) + 2(\eta_1 - 2)\zeta_1\zeta_2 & \eta_1 + (\eta_1 - 2)\zeta_1^2 \\ \eta_2 + 2(\eta_2 - 1)\zeta_1\zeta_2 & (\eta_2 - 1) + (\eta_2 - 1)\zeta_1^2 \end{bmatrix}. \]
If we introduce the extra variable η12, from Equations (15) and (18) we have the system:
\[ (1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2)\,\eta_1 = \zeta_1 + 2\zeta_1^2\zeta_2, \qquad (1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2)\,\eta_2 = \zeta_2 + \zeta_1^2\zeta_2, \qquad (1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2)\,\eta_{12} = 2\zeta_1^2\zeta_2. \]
Instead, if we use the variable η3, from Equations (16) and (41), it is possible to derive the equation of the model variety in the η1, η2, η3 parameters. From Equation (18), we have:
\[ \eta_1 = \mathbb{E}_\zeta[T_1] = \frac{\zeta_1 + 2\zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}, \qquad \eta_2 = \mathbb{E}_\zeta[T_2] = \frac{\zeta_2 + \zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}, \qquad \eta_3 = \mathbb{E}_\zeta[T_3] = \frac{-2 + \zeta_2 + 2\zeta_1 - \zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}. \]
There is another way to derive the model constraint in the η parameters. In the example, the sample space has four points, and the monomials 1, T1, T2, T1T2 form a vector basis of the linear space spanned by the columns of the matrix A; in particular, T3 is the linear combination T3 = −2 + 4T1 + 3T2 − 5T1T2:
\[ \begin{array}{c|ccccc} \Omega & 1 & T_1 & T_2 & T_1T_2 & T_3 \\ \hline 1 & 1 & 0 & 0 & 0 & -2 \\ 2 & 1 & 0 & 1 & 0 & 1 \\ 3 & 1 & 1 & 0 & 0 & 2 \\ 4 & 1 & 2 & 1 & 2 & -1 \end{array} \]
It follows that:
\[ \begin{aligned} \eta_3 = \mathbb{E}_\theta[T_3] &= \mathbb{E}_\theta[-2 + 4T_1 + 3T_2 - 5T_1T_2] \\ &= -2 + 4\,\mathbb{E}_\theta[T_1] + 3\,\mathbb{E}_\theta[T_2] - 5\,\mathrm{Cov}_\theta(T_1, T_2) - 5\,\mathbb{E}_\theta[T_1]\,\mathbb{E}_\theta[T_2] \\ &= -2 + 4\,\partial_1\psi(\theta) + 3\,\partial_2\psi(\theta) - 5\,\partial_1\partial_2\psi(\theta) - 5\,\partial_1\psi(\theta)\,\partial_2\psi(\theta) \\ &= -2 + 4\eta_1 + 3\eta_2 - 5\,\partial_1\partial_2\psi(\theta) - 5\,\eta_1\eta_2. \end{aligned} \]
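The linear relation can be checked directly on the values of the statistics (a small sketch; the values of T1, T2, T3 are read off the design matrix above):

```python
import numpy as np

# Sufficient statistics on Omega = {1, 2, 3, 4}.
T1 = np.array([0, 0, 1, 2])
T2 = np.array([0, 1, 0, 1])
T3 = np.array([-2, 1, 2, -1])

# T3 is the stated linear combination of 1, T1, T2, T1*T2.
assert np.array_equal(T3, -2 + 4*T1 + 3*T2 - 5*T1*T2)

# Hence eta3 = -2 + 4*eta1 + 3*eta2 - 5*E[T1*T2] for any probability p.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))
eta1, eta2, eta3 = p @ T1, p @ T2, p @ T3
assert abs(eta3 - (-2 + 4*eta1 + 3*eta2 - 5*(p @ (T1*T2)))) < 1e-12
```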

3.1. Border

Let us consider the points in the model variety that are probabilities, that is,
\[ p_1 + p_2 + p_3 + p_4 = 1, \qquad p_1^2 p_4 = p_2 p_3^2, \qquad p_1, p_2, p_3, p_4 \geq 0. \]
From the equation above, we see that single zeros are not allowed; that is to say, there are no intersections between the model in Equation (49) and the open facets of the probability simplex. We now consider the full marginal polytope obtained by adding the sufficient statistic T1T2 and parametrized by (η1, η2, η12). By Equation (16), the marginal polytope is represented by the inequalities:
\[ p_1 = 1 - \eta_1 - \eta_2 + \eta_{12} \geq 0, \qquad p_2 = \eta_2 - \tfrac{1}{2}\eta_{12} \geq 0, \qquad p_3 = \eta_1 - \eta_{12} \geq 0, \qquad p_4 = \tfrac{1}{2}\eta_{12} \geq 0, \]
which define a convex set with vertices (0, 0, 0), (0, 1, 0), (1, 0, 0), (2, 1, 2), namely the full marginal polytope associated with the sufficient statistics {T1, T2, T1T2}. As the critical set is the edge δ2δ4 in the p space, it is the edge (0, 1, 0) ↔ (2, 1, 2) in the η space.
We have the following possible models on the border of the probability simplex and on the border of the full marginal polytope, where the values for η1 and η2 are obtained from Equation (15).
\[ \begin{array}{cccc|cc} p_1 & p_2 & p_3 & p_4 & \eta_1 & \eta_2 \\ \hline 0 & 0 & + & + & p_3 + 2p_4 & p_4 \\ 0 & + & 0 & + & 2p_4 & p_2 + p_4 \\ + & 0 & + & 0 & p_3 & 0 \\ + & + & 0 & 0 & 0 & p_2 \end{array} \qquad \begin{array}{cccc|cc} p_1 & p_2 & p_3 & p_4 & \eta_1 & \eta_2 \\ \hline + & 0 & 0 & 0 & 0 & 0 \\ 0 & + & 0 & 0 & 0 & 1 \\ 0 & 0 & + & 0 & 1 & 0 \\ 0 & 0 & 0 & + & 2 & 1 \end{array} \]
That is, the domains that can be supports of probabilities in the algebraic model are the faces of the marginal polytope. This is general; see [20,34].

3.2. Fisher Information

Let us consider the covariance matrix of the sufficient statistics. Let us denote by A|12 the block of the two central columns in A in Equation (14) and by p the row vector of probabilities. Then, the variance matrix is:
\[ A_{|12}^T \operatorname{diag}(p)\, A_{|12} - (pA_{|12})^T pA_{|12} = A_{|12}^T \operatorname{diag}(p)\, A_{|12} - A_{|12}^T p^T p\, A_{|12} = A_{|12}^T \left( \operatorname{diag}(p) - p^T p \right) A_{|12}. \]
In each of the cases where the probability is supported by a single point, the matrix diag(p) − pᵀp is zero; hence, the covariance matrix is zero. In each of the cases where the probability is supported by a facet, say {1, 2}, the matrix diag(p) − pᵀp reduces to the corresponding block, and the covariance matrix is:
\[ \begin{bmatrix} 0 & 0 & 1 & 2 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} p_1 - p_1^2 & -p_1p_2 & 0 & 0 \\ -p_1p_2 & p_2 - p_2^2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} p_1 - p_1^2 & -p_1p_2 \\ -p_1p_2 & p_2 - p_2^2 \end{bmatrix} \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & p_2 - p_2^2 \end{bmatrix}. \]
The range of the covariance matrix is the span of (0, 1), that is, the direction of the affine space that contains the facet itself. Analogous results hold for each facet, and this result is general.
We note that the determinant of the covariance matrix is a polynomial of degree six in the indeterminates p1, p2, p3. This polynomial is zero on each facet.
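The computation can be reproduced numerically (a sketch; `cov12` below is the covariance matrix of (T1, T2), i.e., A|12ᵀ(diag(p) − pᵀp)A|12):

```python
import numpy as np

A12 = np.array([[0, 0], [0, 1], [1, 0], [2, 1]])   # columns T1, T2 of A

def cov12(p):
    """Covariance of (T1, T2) under p: A12^T (diag(p) - p p^T) A12."""
    p = np.asarray(p, float)
    return A12.T @ (np.diag(p) - np.outer(p, p)) @ A12

# Interior point: full-rank (positive-definite) covariance.
assert np.linalg.det(cov12([0.1, 0.2, 0.3, 0.4])) > 0
# Probability concentrated on a single point: covariance vanishes.
assert np.allclose(cov12([0, 1, 0, 0]), 0)
# Supported on the facet {1, 2}: singular, with only the (2,2) entry nonzero.
C = cov12([0.3, 0.7, 0, 0])
assert np.isclose(np.linalg.det(C), 0) and np.isclose(C[1, 1], 0.7 - 0.7**2)
```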
The η parameters can be given as a function of either θ or ζ. We have:
\[ \eta_1 = \frac{\zeta_1 + 2\zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}, \qquad \eta_2 = \frac{\zeta_2 + \zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}, \qquad \eta_3 = \frac{-2 + \zeta_2 + 2\zeta_1 - \zeta_1^2\zeta_2}{1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2}. \]
We know from the theory of exponential families that the mapping:
\[ \left]0, \infty\right[ \times \left]0, \infty\right[ \;\ni\; (\zeta_1, \zeta_2) \;\longmapsto\; (\eta_1, \eta_2) \in \operatorname{Conv}\{(T_1(x), T_2(x)) \mid x \in \Omega\}^{\circ} \]
is one-to-one. We look for an algebraic inversion of the equations:
\[ (1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2)\,\eta_1 = \zeta_1 + 2\zeta_1^2\zeta_2, \qquad (1 + \zeta_2 + \zeta_1 + \zeta_1^2\zeta_2)\,\eta_2 = \zeta_2 + \zeta_1^2\zeta_2. \]
If we rewrite Equations (50) as polynomials in ζ1, ζ2, we obtain:
\[ \eta_1 + (\eta_1 - 1)\zeta_1 + \eta_1\zeta_2 + (\eta_1 - 2)\zeta_1^2\zeta_2 = 0, \]
\[ \eta_2 + \eta_2\zeta_1 + (\eta_2 - 1)\zeta_2 + (\eta_2 - 1)\zeta_1^2\zeta_2 = 0, \]
\[ (\eta_3 + 2) + (\eta_3 - 2)\zeta_1 + (\eta_3 - 1)\zeta_2 + (\eta_3 + 1)\zeta_1^2\zeta_2 = 0. \]
Gaussian elimination produces a linear system in ζ1, ζ2 with coefficients that are polynomials in η1, η2, η3, to be considered together with the implicit equation derived from p1²p4 − p2p3² = 0. The system is:
2 η 2 η 3 2 η 1 + 2 η 2 = ( 2 η 2 η 3 2 η 1 + 2 ) ζ 1 + ( 2 η 2 η 3 + 2 η 2 + 2 η 3 2 ) ζ 2 , η 2 = η 2 ζ 1 + ( η 2 1 ) ζ 2 .

3.3. Extension of the Model

In this subsection, we study an extension to signed probabilities of the exponential family in Equations (12) and (13) based on the representation of the statistical model as a ruled surface in the probability simplex. Our motivation for such an analysis is the study of the stability of the critical points of a gradient field in the η parameters, in particular when the critical points belong to the boundary of the model. Indeed, by extending the gradient field outside the marginal polytope, we can identify open neighborhoods for critical points on the boundary of the polytope, which allow one to study the convergence of the differential equations associated with the gradient flows, for instance by means of Lyapunov stability.
In the following, we describe in more detail how the extension can be obtained. Let a be a point along the edge δ2δ4 of the full marginal polytope parametrized by (η1, η2, η3), and let b be the coordinates of the corresponding point over δ1δ3, obtained by intersecting the line of the ruled surface through a with the edge δ1δ3. The values of the η2 coordinate for a and b are one and zero, respectively. The other coordinates of b depend on those of a through α. First, we obtain the values of the η3 coordinates as a function of the η1 coordinate. For a, we find the equation of the line to which δ2δ4 belongs, given by:
\[ \begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} + u \begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix} = \begin{pmatrix} 2u \\ 1 \\ 1 - 2u \end{pmatrix}, \]
from which we obtain η3 = 1 − η1. Similarly, for the η3 coordinate of b, we consider the line through δ1δ3, that is:
\[ \begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ -2 \end{pmatrix} + u \begin{pmatrix} 1 \\ 0 \\ 4 \end{pmatrix} = \begin{pmatrix} u \\ 0 \\ 4u - 2 \end{pmatrix}, \]
which gives us η3 = 4η1 − 2. Finally, for the η1 coordinate, we use Equations (44). In a, since t = 0 and p1 = p3 = 0, then p 2 = β 2 α 2 + β 2 and p 4 = α 2 α 2 + β 2. From Equation (24), it follows that:
\[ \eta_1 = \frac{2\alpha^2}{2\alpha^2 + 2\alpha + 1}. \]
Similarly, for b, we have p2 = p4 = 0 and t = 1, so that p1 = α + 1 and p3 = −α. From Equation (24), it follows that:
\[ \eta_1 = -\alpha. \]
As a result, the coordinates of a and b both depend on α as follows,
\[ a = \left( \frac{2\alpha^2}{2\alpha^2+2\alpha+1},\; 1,\; \frac{2\alpha+1}{2\alpha^2+2\alpha+1} \right), \qquad b = \left( -\alpha,\; 0,\; -4\alpha - 2 \right). \]
The ruled surface in the full marginal polytope is given by the lines through a and b, described by the following parametric representation, for −1 < α < 0 and 0 < t < 1,
\[ \begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} = \begin{bmatrix} -\alpha \\ 0 \\ -4\alpha - 2 \end{bmatrix} + t \begin{bmatrix} \dfrac{2\alpha^3 + 4\alpha^2 + \alpha}{2\alpha^2 + 2\alpha + 1} \\[1.5ex] 1 \\[0.5ex] \dfrac{8\alpha^3 + 12\alpha^2 + 10\alpha + 3}{2\alpha^2 + 2\alpha + 1} \end{bmatrix}. \]
The ruled surface can be extended outside the marginal polytope by taking values of α, t ∈ ℝ and considering the set of lines through a and b for different values of α. For α → ±∞, the η1 coordinate of b tends to ∓∞, while the η1 of a tends to one. For α → ±∞, the ruled surface admits the same limit given by the line parallel to δ1δ3 passing through (1, 1, 0). The surface intersects the interior of the marginal polytope for t ∈ (0, 1) and α ∈ (−1, 0). Moreover, the surface intersects the critical line twice, for t = 0, α ∈ [−1, 0] and for t = 0, α ∉ [−1, 0].
In Figures 6 and 7, we represent the extension of the ruled surface outside the probability simplex and in the (α, t) chart, while in Figures 8 and 9, the extended surface has been represented in the full marginal polytope parametrized by (η1, η2, η3) and in the marginal polytope parametrized by (η1, η2).

3.4. Optimization and Natural Gradient Flows

We are interested in the study of natural gradient flows of functions defined over statistical models. Our motivation is the study of the optimization of the stochastic relaxation of a function, i.e., the optimization of the expected value of the function itself with respect to a distribution p in a statistical model. Natural gradient flows associated with the stochastic relaxation converge to the boundary of the model, where the probability mass is concentrated on some instances of the search space. To study the convergence over the boundary, we proposed to extend the natural gradient field outside the marginal polytope and the probability simplex, by employing a parameterization that describes the model as a ruled surface, as we described in the tutorial example of this section.
In the following, we focus on the optimization of a function f : Ω → ℝ, and we consider its stochastic relaxation with respect to a probability distribution in the exponential family in Equations (12) and (13). First, we compute a basis for all real-valued functions defined over Ω using algebraic arguments. Consider the zero-dimensional ideal I associated with the set of points in Ω, and let R be the polynomial ring with the field of real coefficients; a vector space basis for the quotient ring R/I defines a basis for all functions defined over Ω. In CoCoA [36], this can be computed with the command QuotientBasis.
Coming back to our example, with Ω = {1, 2, 3, 4}, by fixing the graded reverse lexicographical monomial order, which is the default one in CoCoA [36], we obtain a basis given by {1, x1, x2, x1 x2}, so that any f : Ω → ℝ can be written as:
f = c 0 + c 1 x 1 + c 2 x 2 + c 12 x 1 x 2 .
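Since the design matrix of the basis {1, x1, x2, x1x2} on Ω is invertible, the coefficients (c0, c1, c2, c12) of any f are obtained by solving a linear system; a sketch with hypothetical function values, assuming the embedding of Ω = {1, 2, 3, 4} as the points (0,0), (0,1), (1,0), (2,1) used above:

```python
import numpy as np

# Points of Omega in the (x1, x2) coordinates.
X = np.array([[0, 0], [0, 1], [1, 0], [2, 1]])
# Design matrix in the basis {1, x1, x2, x1*x2}.
V = np.column_stack([np.ones(4), X[:, 0], X[:, 1], X[:, 0]*X[:, 1]])

# Hypothetical function values on the four points.
f = np.array([0.0, 2.0, 1.0, 10.0])
c = np.linalg.solve(V, f)                 # unique coefficients (c0, c1, c2, c12)
assert np.allclose(c, [0, 1, 2, 3])
assert np.allclose(V @ c, f)              # exact interpolation on Omega
```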
We are interested in the study of the natural gradient field of F ( p ) = E p [ f ]. Recall that T3 = 4x1 + 3x2 − 5x1x2 − 2 and η 3 = E [ T 3 ], so that:
\[ \mathbb{E}[x_1x_2] = \tfrac{1}{5}\left( 4\eta_1 + 3\eta_2 - \eta_3 - 2 \right), \]
which implies:
\[ F_\eta(\eta) = c_0 - \tfrac{2}{5}c_{12} + \left( c_1 + \tfrac{4}{5}c_{12} \right)\eta_1 + \left( c_2 + \tfrac{3}{5}c_{12} \right)\eta_2 - \tfrac{1}{5}c_{12}\,\eta_3. \]
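The identity between the direct expectation and its expression in the η parameters can be checked numerically (a sketch; the 3/5 and 4/5 coefficients follow from the relation for η3 above):

```python
import numpy as np

# Statistics on Omega and arbitrary example coefficients.
x1 = np.array([0, 0, 1, 2]); x2 = np.array([0, 1, 0, 1])
T3 = -2 + 4*x1 + 3*x2 - 5*x1*x2
c0, c1, c2, c12 = 0.5, 1.0, 2.0, 3.0
f = c0 + c1*x1 + c2*x2 + c12*x1*x2

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(4))
eta1, eta2, eta3 = p @ x1, p @ x2, p @ T3

F_direct = p @ f                           # F(p) = E_p[f]
F_eta = c0 - 2/5*c12 + (c1 + 4/5*c12)*eta1 + (c2 + 3/5*c12)*eta2 - 1/5*c12*eta3
assert abs(F_direct - F_eta) < 1e-12
```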
In order to study the gradient field of Fη(η) over the marginal polytope parameterized by (η1, η2), we need to express η3 as a function of η1 and η2. In order to do that, we parametrize the exponential family as a ruled surface by means of the (α, t) parameters. Moreover, this parametrization has a natural extension outside the marginal polytope, which allows one to study the stability of the critical points on the boundary of the marginal polytope. We start by evaluating the gradient field of Fα,t(α, t) in the (α, t) parametrization, then we map it to the marginal polytope in the η parameterization.
By expressing (η1, η2) as a function of (α, t), we obtain:
\[ F_{\alpha,t}(\alpha, t) = \frac{2\alpha^2(c_1+c_{12}) + (2\alpha^2+2\alpha+1)(c_0+c_2) - \bigl( 2\alpha^2(c_1+c_{12}) + (2\alpha^2+2\alpha+1)(c_1\alpha+c_2) \bigr)\,t}{2\alpha^2+2\alpha+1}. \]
If we take partial derivatives of Equation (64) with respect to α and t, we have:
\[ \frac{\partial}{\partial\alpha} F_{\alpha,t}(\alpha, t) = \frac{4(\alpha^2+\alpha)(c_1+c_{12}) - \bigl( (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)c_1 + 4(\alpha^2+\alpha)c_{12} \bigr)\,t}{4\alpha^4+8\alpha^3+8\alpha^2+4\alpha+1}, \]
\[ \frac{\partial}{\partial t} F_{\alpha,t}(\alpha, t) = -\,\frac{2\alpha^2 c_{12} + (2\alpha^3+4\alpha^2+\alpha)c_1 + (2\alpha^2+2\alpha+1)c_2}{2\alpha^2+2\alpha+1}. \]
In the (α, t) parameterization, the Fisher information matrix reads:
\[ I_{\alpha,t}(\alpha, t) = \mathbb{E}_{\alpha,t}\left[ -\nabla^2 \log p(x; \alpha, t) \right] = \begin{bmatrix} \dfrac{4\alpha^2 + 4\alpha - (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)\,t}{4\alpha^6+12\alpha^5+16\alpha^4+12\alpha^3+5\alpha^2+\alpha} & 0 \\ 0 & (t - t^2)^{-1} \end{bmatrix}. \]
Finally, the natural gradient becomes:
\[ \widetilde{\nabla} F_{\alpha,t}(\alpha, t) = I_{\alpha,t}(\alpha, t)^{-1} \nabla F_{\alpha,t}(\alpha, t) = \begin{bmatrix} \dfrac{(4\alpha^6+12\alpha^5+16\alpha^4+12\alpha^3+5\alpha^2+\alpha)\bigl( 4(\alpha^2+\alpha)c_1 + 4(\alpha^2+\alpha)c_{12} - \bigl((4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)c_1 + 4(\alpha^2+\alpha)c_{12}\bigr)t \bigr)}{(4\alpha^4+8\alpha^3+8\alpha^2+4\alpha+1)\bigl( 4\alpha^2+4\alpha - (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)t \bigr)} \\[2.5ex] \dfrac{\bigl( 2\alpha^2 c_{12} + (2\alpha^3+4\alpha^2+\alpha)c_1 + (2\alpha^2+2\alpha+1)c_2 \bigr)(t^2-t)}{2\alpha^2+2\alpha+1} \end{bmatrix}. \]
We obtained a rational formula for the natural gradient in the (α, t) parameterization, which can be easily extended outside the marginal polytope. However, notice that the inverse Fisher information matrix and the natural gradient are not defined for:
\[ t = \frac{4(\alpha^2+\alpha)}{4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1}. \]
We also remark that over the boundary of the model, for t ∈ {0, 1} and α ∈ {−1, 0}, the determinant of the inverse Fisher information vanishes, so that the matrix is not full rank. It follows that the trajectories associated with natural gradient flows with initial conditions in the interior of the marginal polytope remain in the marginal polytope.
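Two structural facts used here can be verified numerically: the (α, t) chart is orthogonal for the Fisher metric, and the (t, t) entry equals (t − t²)⁻¹. The sketch below (based on the parameterization of Equation (45)) uses finite differences:

```python
import numpy as np

def model(a, t):
    """Probabilities in the ruled-surface chart of Equation (45)."""
    K = 2*a*a + 2*a + 1
    return np.array([(a + 1)*t, (a + 1)**2*(1 - t)/K, -a*t, a*a*(1 - t)/K])

def fisher(a, t, h=1e-6):
    """I_ij = sum_x (d_i p)(d_j p)/p, the Fisher matrix in the (alpha, t) chart."""
    da = (model(a + h, t) - model(a - h, t))/(2*h)
    dt = (model(a, t + h) - model(a, t - h))/(2*h)
    J = np.column_stack([da, dt])
    return J.T @ np.diag(1.0/model(a, t)) @ J

for a, t in [(-0.7, 0.3), (-0.3, 0.6)]:
    I = fisher(a, t)
    assert abs(I[0, 1]) < 1e-6                  # the chart is orthogonal
    assert abs(I[1, 1] - 1/(t*(1 - t))) < 1e-4  # I_tt = (t - t^2)^{-1}
```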
In order to study the natural gradient field over the marginal polytope, we apply a reparameterization of a tangent vector from the (α, t) parameterization to the (η1, η2) parameterization. Indeed, by the chain rule and the inverse function theorem, we have:
\[ \nabla F_\eta(\alpha, t) = \nabla F_{\alpha,t}(\alpha, t)^T J(\alpha, t)^{-1}. \]
The Jacobian of the map (α, t) ↦ (η1, η2) is:
\[ J(\alpha, t) = \begin{bmatrix} \dfrac{4\alpha^2+4\alpha - (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)\,t}{(2\alpha^2+2\alpha+1)^2} & -\dfrac{2\alpha^3+4\alpha^2+\alpha}{2\alpha^2+2\alpha+1} \\[2ex] 0 & -1 \end{bmatrix}, \]
with inverse:
\[ J(\alpha, t)^{-1} = \begin{bmatrix} \dfrac{4\alpha^4+8\alpha^3+8\alpha^2+4\alpha+1}{4\alpha^2+4\alpha - (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)\,t} & -\dfrac{4\alpha^5+12\alpha^4+12\alpha^3+6\alpha^2+\alpha}{4\alpha^2+4\alpha - (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)\,t} \\[2ex] 0 & -1 \end{bmatrix}. \]
It follows that:
\[ \nabla F_\eta(\alpha, t) = \begin{bmatrix} \dfrac{4(\alpha^2+\alpha)c_1 + 4(\alpha^2+\alpha)c_{12} - \bigl( (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)c_1 + 4(\alpha^2+\alpha)c_{12} \bigr)\,t}{4\alpha^2+4\alpha - (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)\,t} \\[2.5ex] \dfrac{-4(\alpha^3+\alpha^2)c_{12} + 4(\alpha^2+\alpha)c_2 - \bigl( 2(2\alpha^4-\alpha^2)c_{12} + (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)c_2 \bigr)\,t}{4\alpha^2+4\alpha - (4\alpha^4+8\alpha^3+12\alpha^2+8\alpha+1)\,t} \end{bmatrix}. \]
Notice that, as for the inverse Fisher information matrix, the inverse Jacobian J(α, t)−1 is not defined for t which satisfies Equation (69).
We compute the inverse Fisher information matrix by evaluating the covariance between the sufficient statistics of the exponential family. Since over Ω we have x1² = x1 + x1x2 and x2² = x2, it follows that:
\[ I_\eta(\eta)^{-1} = \begin{bmatrix} \tfrac{1}{5}(9\eta_1+3\eta_2-\eta_3-2) - \eta_1^2 & \tfrac{1}{5}(4\eta_1+3\eta_2-\eta_3-2) - \eta_1\eta_2 \\[1ex] \tfrac{1}{5}(4\eta_1+3\eta_2-\eta_3-2) - \eta_1\eta_2 & \eta_2 - \eta_2^2 \end{bmatrix}. \]
By parameterizing I η 1 with (α, t), we have:
\[ I_\eta(\alpha, t)^{-1} = \begin{bmatrix} \dfrac{4\alpha^4+8\alpha^3+4\alpha^2 + (4\alpha^5-12\alpha^3-8\alpha^2-\alpha)\,t - (4\alpha^6+16\alpha^5+20\alpha^4+8\alpha^3+\alpha^2)\,t^2}{4\alpha^4+8\alpha^3+8\alpha^2+4\alpha+1} & \dfrac{(2\alpha^3+4\alpha^2+\alpha)(t - t^2)}{2\alpha^2+2\alpha+1} \\[2ex] \dfrac{(2\alpha^3+4\alpha^2+\alpha)(t - t^2)}{2\alpha^2+2\alpha+1} & t - t^2 \end{bmatrix}. \]
Finally, we derive the following rational formula for the natural gradient over the marginal polytope parametrized as a ruled surface by (α, t):
˜ F η ( α , t ) = I η ( α , t ) 1 F η ( α , t ) = [ ( ( 4 α 6 + 16 α 5 + 20 α 4 + 8 α 3 + α 2 ) c 1 + 2 ( 2 α 5 + 4 α 4 + α 3 ) c 12 + ( 4 α 5 + 12 α 4 + 12 α 3 + 6 α 2 + α ) c 2 ) t 2 4 ( α 4 + 2 α 3 + α 2 ) c 1 + 4 ( α 4 + 2 α 3 + α 2 ) c 12 ( ( 4 α 5 12 α 3 8 α 2 α ) c 1 + 2 ( 2 α 5 + 2 α 4 3 α 3 2 α 2 ) c 12 + ( 4 α 5 + 12 α 4 + 12 α 3 + 6 α 2 + α ) c 2 ) t 4 α 4 + 8 α 3 + 8 α 2 + 4 α + 1 ( 2 α 2 c 12 + ( 2 α 3 + 4 α 2 + α ) c 1 + ( 2 α 2 + 2 α + 1 ) c 2 ) t 2 ( 2 α 2 c 12 + ( 2 α 2 c 12 + ( 2 α 3 + 4 α 2 + α ) c 1 + ( 2 α 2 + 2 α + 1 ) c 2 ) t 2 α 2 + 2 α + 1 ] .

3.5. Examples with Global and Local Optima

We conclude this section with two examples of natural gradient flows associated with two different f functions. First, consider the case where c0 = 0, c1 = 1, c2 = 2, c12 = 3, so that:
\[ \begin{array}{c|cc|c} \Omega & x_1 & x_2 & f_1 \\ \hline 1 & 0 & 0 & 0 \\ 2 & 0 & 1 & 2 \\ 3 & 1 & 0 & 1 \\ 4 & 2 & 1 & 10 \end{array} \]
The function admits a minimum on {1}. In Figure 10, we plot the vector fields associated with the vanilla and natural gradient, together with some gradient flows for different initial conditions, in the (α, t) parameterization. In Figure 11, we represent the vanilla and natural gradient fields over the marginal polytope in the (η1, η2) parameterization. Notice that, as expected, and differently from the vanilla gradient, the natural gradient flows converge to the unique global optimum, which corresponds to the vertex where all of the probability is concentrated on {1}. In the (α, t) parameterization, the flows have been extended outside the statistical model by prolonging the lines of the ruled surface, and, as we can see, they remain compatible with the flows on the interior of the model, in the sense that the nature of each critical point is the same for trajectories with initial conditions in the interior and in the exterior of the model. In other words, the global optimum is an attractor from both the interior and the exterior of the model, and similarly for the other critical points at the vertices, both saddle points and unstable points, where the natural gradient vanishes.
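The convergence described above can be reproduced with a crude discretization of the natural gradient flow (a sketch, not the authors' implementation: finite-difference Jacobian, explicit Euler steps, and clipping at the boundary of the chart):

```python
import numpy as np

f = np.array([0.0, 2.0, 1.0, 10.0])   # f1 on Omega = {1, 2, 3, 4}

def model(a, t):
    # Probabilities in the ruled-surface chart (Equation (45), beta = alpha + 1).
    K = 2*a*a + 2*a + 1
    return np.array([(a + 1)*t, (a + 1)**2*(1 - t)/K, -a*t, a*a*(1 - t)/K])

def jacobian(a, t, h=1e-6):
    # Finite-difference Jacobian of (alpha, t) -> p.
    return np.column_stack([(model(a + h, t) - model(a - h, t))/(2*h),
                            (model(a, t + h) - model(a, t - h))/(2*h)])

theta = np.array([-0.4, 0.5])         # initial condition in the interior
for _ in range(4000):
    a, t = theta
    p = model(a, t)
    J = jacobian(a, t)
    grad = J.T @ f                    # vanilla gradient of F(theta) = E_p[f]
    fisher = J.T @ np.diag(1.0/p) @ J # Fisher information in the chart
    theta = theta - 0.01*np.linalg.solve(fisher, grad)  # natural gradient step
    theta[0] = np.clip(theta[0], -1 + 1e-6, -1e-6)      # stay inside the chart
    theta[1] = np.clip(theta[1], 1e-6, 1 - 1e-6)

p_final = model(*theta)
# The flow concentrates the probability on the minimizer {1} of f.
assert p_final[0] > 0.99
```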
In the second example, we set c0 = 0, c1 = 1, c2 = 2, c12 = −5/2, and we have:
\[ \begin{array}{c|cc|c} \Omega & x_1 & x_2 & f_2 \\ \hline 1 & 0 & 0 & 0 \\ 2 & 0 & 1 & 2 \\ 3 & 1 & 0 & 1 \\ 4 & 2 & 1 & -1 \end{array} \]
so that f2 admits a minimum on {4}. In Figures 12 and 13, we plot the vector fields associated with the vanilla and natural gradients, together with some gradient flows for different initial conditions, in the (α, t) and (η1, η2) parameterizations, respectively. As in the previous example, natural gradient flows converge to the vertices of the model; however, in this case, we have one local optimum in {1} and one global optimum in {4}, together with a saddle point in the interior of the model. Similarly to the previous example, in the (α, t) parameterization, the flows have been extended outside the statistical model, and the nature of the critical points is the same for trajectories with initial conditions in the statistical model and in its extension.
We conclude the section by noticing that in both examples, for certain values of t in Equation (69), the natural gradient flows are not defined on the extension of the statistical model. As represented in the figures, once a trajectory encounters the dashed blue line in the (α, t) parameterization, the flow stops at that point.

4. Pseudo-Boolean Functions

We turn to discuss a case of considerable practical interest to see which of the results obtained in the example of the previous section we are able to extend.
For binary variables, we use the coding ±1, that is, x = (x1,…,xn) ∈ {+1, −1}^n = Ω. Any function f: Ω → ℝ can be written in multi-index notation as f(x) = Σ_{α∈L} a_α x^α, with L = {0, 1}^n, x^α = Π_{i=1}^n x_i^{α_i} and 0⁰ = 1. Given M ⊆ L* = L \ {0}, the exponential family ε of densities
\[ p \propto \exp\left( \sum_{\alpha\in M} \theta_\alpha X^\alpha \right) = \prod_{\alpha\in M} \left( e^{\theta_\alpha} \right)^{X^\alpha} \]
has been considered in a number of papers on combinatorial optimization; see [35]. The following statements are results in algebraic statistics; cf. [20,35]. Let P1 = {p ∈ ℝ^Ω | Σ_{x∈Ω} p(x) = 1}.
Proposition 6 (Implicitization of the exponential family). Given a function p: Ω → ℝ, we have p ∈ ε if, and only if, the following conditions all hold:
  • p(x) > 0, x ∈ Ω;
  • Σ_{x∈Ω} p(x) = 1;
  • Π_{x: x^β = 1} p(x) = Π_{x: x^β = −1} p(x), for all β ∈ L* \ M.
Proof. (⇒) If p ∈ ε, then p(x) > 0, x ∈ Ω (Item 1) and Σ_{x∈Ω} p(x) = 1 (Item 2). Moreover, log p(x) = Σ_{α∈M} θ_α x^α − ψ(θ). The function log p is orthogonal to each X^β, β ∈ L* \ M. Hence:
\[ 0 = \sum_{x\in\Omega} \log p(x)\, x^\beta = \sum_{x:\, x^\beta = 1} \log p(x) - \sum_{x:\, x^\beta = -1} \log p(x) = \log \prod_{x:\, x^\beta = 1} p(x) - \log \prod_{x:\, x^\beta = -1} p(x), \]
which is equivalent to Item 3.
(⇐) Conversely, the computation in Equation (79) implies that log p is orthogonal to each X^β, β ∈ L* \ M; hence, there exists θ such that log p = Σ_{α∈M} θ_α X^α + C. Now, Item 2 implies C = −ψ(θ). □
Let ℝ [Ω] denote the ring of polynomials in the indeterminates {p(x)|x ∈ Ω}. Given a binary model M, the set of polynomials:
\[ \left\{\; \prod_{x:\, x^\beta = 1} p(x) \;-\; \prod_{x:\, x^\beta = -1} p(x) \;\middle|\; \beta \in L^* \setminus M \right\}, \]
generates an ideal J ( M ), which is called the toric ideal of the model M. Its variety V ( M ) is called the exponential variety of M.
Proposition 7.
  1. The exponential variety of M is the Zariski closure of the exponential model ε.
  2. The closure ε̄ of ε in the probability simplex is characterized by p(x) ≥ 0, x ∈ Ω, together with Items 2 and 3 of Proposition 6.
  3. The algebraic variety in the ring ℝ[p(x): x ∈ Ω] generated by the polynomials Σ_{x∈Ω} p(x) − 1 and Π_{x: x^β = 1} p(x) − Π_{x: x^β = −1} p(x), β ∈ L* \ M, is an extension ε1 of ε to P1.
  4. Define the moments η_α = Σ_{x∈Ω} x^α p(x), α ∈ L, i.e., the discrete Fourier transform of p, with inverse p(x) = 2^{−n} Σ_{α∈L} x^α η_α. There exists an algebraic extension of the moment function ε ∋ p ↦ η(p) ∈ M° to a mapping defined on ε1.
Proof.
  1. According to the implicitization Proposition 6, the exponential family is characterized by the positivity condition together with the algebraic binomial conditions.
  2. This follows from the implicit form and is proven, for example, in [20].
  3. By definition.
  4. As the mapping from the probabilities to the moments is affine and one-to-one, it extends to a one-to-one mapping from the extended model to the affine space of the marginal polytope. □
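Item 4 (the discrete Fourier transform pair between p and the moments η) can be checked directly; a sketch for n = 3:

```python
import itertools
import numpy as np

n = 3
pts = list(itertools.product([1, -1], repeat=n))      # Omega = {+1, -1}^n
alphas = list(itertools.product([0, 1], repeat=n))    # multi-indices L

def chi(x, a):
    """Monomial x^alpha = prod_i x_i^{alpha_i}."""
    return np.prod([xi**ai for xi, ai in zip(x, a)])

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(2**n))

# Moments eta_alpha = sum_x x^alpha p(x) ...
eta = {a: sum(chi(x, a)*px for x, px in zip(pts, p)) for a in alphas}
# ... and the inverse transform p(x) = 2^{-n} sum_alpha x^alpha eta_alpha.
p_back = [2**-n * sum(chi(x, a)*eta[a] for a in alphas) for x in pts]
assert np.allclose(p_back, p)
```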
We conclude this section by introducing the so-called no three-way interaction example. On Ω = {0, 1}3, the full model in the statistics 0 ↦ 1, 1 ↦ −1, that is t = (−1)x = 1 − 2x, is described by the matrix:
\[ D_3 = \begin{array}{c|rrrrrrrr} & 1 & T_3 & T_2 & T_2T_3 & T_1 & T_1T_3 & T_1T_2 & T_1T_2T_3 \\ \hline 000 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 001 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ 010 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 011 & 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ 100 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 101 & 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ 110 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ 111 & 1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{array} \]
Note the lexicographic order of both the sample points and the statistics’ exponents.
The exponential family without the interaction term T1T2T3 is the same model as the toric model without the three-way interaction, which is based on the matrix:
\[ B = \begin{array}{c|ccccccc} & C & \varsigma_1 & \varsigma_2 & \varsigma_3 & \varsigma_4 & \varsigma_5 & \varsigma_6 \\ \hline 000 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 001 & 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 010 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\ 011 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\ 100 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ 101 & 1 & 1 & 0 & 1 & 1 & 0 & 1 \\ 110 & 1 & 0 & 1 & 1 & 1 & 1 & 0 \\ 111 & 1 & 1 & 1 & 0 & 1 & 0 & 0 \end{array} \]
that is the probabilities as a function of the ζ’s are:
{ p 1 = c p 2 = c ς 1 ς 3 ς 5 p 3 = c ς 2 ς 3 ς 6 p 4 = c ς 1 ς 2 ς 5 ς 6 p 5 = c ς 4 ς 5 ς 6 p 6 = c ς 1 ς 3 ς 4 ς 6 p 7 = c ς 2 ς 3 ς 4 ς 5 p 8 = c ς 1 ς 2 ς 4 .
The toric ideal of the toric model in Equation (82) is generated by the binomial:
\[ p_2 p_3 p_5 p_8 - p_1 p_4 p_6 p_7; \]
this means that the closure of the exponential family is given by the solutions of the equations:
\[ \begin{cases} p_1 + p_2 + p_3 + p_4 + p_5 + p_6 + p_7 + p_8 = 1 \\ p_2 p_3 p_5 p_8 - p_1 p_4 p_6 p_7 = 0. \end{cases} \]
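The binomial can be checked on the monomial parameterization above (a sketch; c is recovered as the normalizing constant):

```python
import numpy as np

rng = np.random.default_rng(2)
s1, s2, s3, s4, s5, s6 = rng.uniform(0.5, 2.0, size=6)   # zeta parameters > 0

# Monomial parameterization of the no-three-way-interaction model.
q = np.array([1, s1*s3*s5, s2*s3*s6, s1*s2*s5*s6,
              s4*s5*s6, s1*s3*s4*s6, s2*s3*s4*s5, s1*s2*s4])
p = q / q.sum()                                          # normalize: c = 1/q.sum()

# The toric binomial p2 p3 p5 p8 - p1 p4 p6 p7 vanishes on the whole model.
assert abs(p[1]*p[2]*p[4]*p[7] - p[0]*p[3]*p[5]*p[6]) < 1e-12
assert abs(p.sum() - 1) < 1e-12
```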
The η parameters are the expected values of the sufficient statistics of the full model,
\[ \begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \\ \eta_5 \\ \eta_6 \\ \eta_7 \end{bmatrix} = \begin{bmatrix} 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6 \\ p_7 \\ p_8 \end{bmatrix}, \]
with the rows indexed by the multi-indices 001, 010, 011, 100, 101, 110, 111 and the columns by the sample points 000, …, 111.
In the ring:
\[ R = \mathbb{R}[p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, \eta_1, \eta_2, \eta_3, \eta_4, \eta_5, \eta_6, \eta_7] \]
we can consider the ideal I generated by the Equations (84) together with Equations (85). The elimination ideal:
\[ J = I \cap \mathbb{R}[\eta_1, \eta_2, \eta_3, \eta_4, \eta_5, \eta_6, \eta_7] \]
will express the model as a dependence between the η’s.
Computation with CoCoA [36] gives the following polynomial:
f ( η 1 , η 2 , η 3 , η 4 , η 5 , η 6 ; η 7 ) = η 1 2 η 3 η 4 + η 2 2 η 3 η 4 η 3 3 η 4 η 3 η 4 3 + η 1 2 η 2 η 5 η 2 3 η 5 + η 2 η 3 2 η 5 + η 2 η 4 2 η 5 + η 3 η 4 η 5 2 η 2 η 5 3 η 1 3 η 6 + η 1 η 2 2 η 6 + η 1 η 3 2 η 6 + η 1 η 4 2 η 6 + η 1 η 5 2 η 6 + η 3 η 4 η 6 2 + η 2 η 5 η 6 2 η 1 η 6 3 2 η 1 η 2 η 4 2 η 1 η 3 η 5 2 η 2 η 3 η 6 2 η 4 η 5 η 6 + η 3 η 4 + η 2 η 5 + η 1 η 6 + ( 2 η 1 η 2 η 3 2 η 1 η 4 η 5 2 η 2 η 4 η 6 2 η 3 η 5 η 6 + η 1 2 + η 2 2 + η 3 2 + η 4 2 + η 5 2 + η 6 2 1 ) η 7 + ( η 3 η 4 + η 2 η 5 + η 1 η 6 ) η 7 2 + ( 1 ) η 7 3 .
The equation:
f ( η 1 , η 2 , η 3 , η 4 , η 5 , η 6 ; η 7 ) = 0
is an expression of the model in the expectation parameters, and this expression is a polynomial equation. Unique solvability in η7 holds when (η1, η2, η3, η4, η5, η6) lies in the interior of the marginal polytope. As in the example of the previous section, it is possible to intersect the polynomial invariant in Equation (83) with one or more sheaves of hyperplanes around some faces of the simplex, in order to lower the degree of the invariant and, thus, decompose the model as a convex hull of probabilities on the boundary of the model. We do not describe the details here, and we postpone the discussion of this example to a paper in preparation.

5. Conclusions

Geometry and algebra play a fundamental role in the study of statistical models, and in particular of the exponential family. In the first part of the paper, starting from the definition of the natural gradient over an exponential family, we described the relationship between its expression in the basis of the sufficient statistics and in the conjugate basis. From this perspective, the terms natural gradient and vanilla gradient, denoting gradients evaluated with respect to the Fisher and the Euclidean geometry, together with their duality in the natural and expectation parameters, assume a new meaning, since these definitions depend on the choice of the basis for the tangent space.
In order to study natural gradient flows for a generic discrete exponential model and, in particular, their convergence, it is convenient to move to the mixture geometry of the expectation parameters and to study trajectories over the marginal polytope. However, to obtain explicit equations for the flows, one must determine the dependence between the moments associated with the sufficient statistics of the model, which are constrained to belong to the marginal polytope, and the remaining moments, which are not free. Such a relationship, which for finite sample spaces is given by a system of polynomial invariants, cannot in general be solved explicitly. In the second part of the paper, using algebraic tools, we proposed a novel parameterization of an exponential family based on ruled surfaces, which does not require solving the polynomial invariants explicitly. We applied our approach to a simple example and showed that the surface associated with the model in the full marginal polytope is a ruled surface. We claim that these results are not peculiar to the example we described, and we are working towards an extension of this approach to more general cases.
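The duality summarized above can be illustrated on a toy model. The sketch below (an illustration, not the authors' code) takes the exponential family on the two-bit sample space with sufficient statistics T = (x1, x2) and the objective values c = (0, 1, 2, 3) used in the figures; it computes the vanilla gradient of E_θ[f] via the covariance identity ∇_θ E_θ[f] = Cov_θ(f, T) and the natural gradient by inverting the Fisher information Cov_θ(T). NumPy is assumed.

```python
import numpy as np
from itertools import product

# sample space {0,1}^2 and sufficient statistics T = (x1, x2)
X = np.array(list(product([0, 1], repeat=2)), dtype=float)  # 4 points
T = X                                                       # shape (4, 2)
f = np.array([0.0, 1.0, 2.0, 3.0])  # objective values on the sample space

def density(theta):
    """Exponential-family density p_theta(x) ∝ exp(<T(x), theta>)."""
    w = np.exp(T @ theta)
    return w / w.sum()

def vanilla_and_natural_gradient(theta):
    p = density(theta)
    Ef, ET = p @ f, p @ T
    # vanilla gradient of E_theta[f]: Cov_theta(f, T)
    grad = (p * (f - Ef)) @ (T - ET)
    # Fisher information matrix: Cov_theta(T)
    fisher = (T - ET).T @ ((T - ET) * p[:, None])
    # natural gradient: Fisher^{-1} times the vanilla gradient
    nat = np.linalg.solve(fisher, grad)
    return grad, nat

grad, nat = vanilla_and_natural_gradient(np.array([0.3, -0.2]))
```

The two vector fields differ only by the metric used on the tangent space, which is the point made in the first part of the paper.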

Acknowledgments

The authors would like to thank Gianfranco Casnati from Politecnico di Torino for the useful discussions on the geometry of ruled surfaces. Giovanni Pistone is supported by de Castro Statistics of Collegio Carlo Alberto at Moncalieri and is a member of INdAM/GNAMPA.

Author Contributions

Both authors contributed to the design of the research, which was carried out by both of them. The manuscript was written by Luigi Malagò and Giovanni Pistone. Both authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pistone, G. Nonparametric information geometry. In Geometric Science of Information, Proceedings of the First International Conference (GSI 2013), Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Springer: Heidelberg, Germany, 2013; Volume 8085, pp. 5–36.
  2. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler, BC, Canada, 11–12 December 2009.
  3. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA '11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
  4. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by estimation of empirical covariances. In Proceedings of the 2011 IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
  5. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. In Proceedings of the 2013 IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
  6. Wierstra, D.; Schaul, T.; Peters, J.; Schmidhuber, J. Natural evolution strategies. In Proceedings of the 2008 IEEE Congress on Evolutionary Computation, Hong Kong, China, 1–6 June 2008; pp. 3381–3387.
  7. Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. 2011; arXiv:1106.3708.
  8. Malagò, L.; Pistone, G. Combinatorial Optimization with Information Geometry: The Newton Method. Entropy 2014, 16, 4260–4289.
  9. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; translated from the 1993 Japanese original by Daishi Harada.
  10. Bourbaki, N. Variétés différentielles et analytiques. Fascicule de résultats / Paragraphes 1 à 7; Number XXXIII in Éléments de mathématiques; Hermann: Paris, France, 1971.
  11. Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1995, 23, 1543–1561.
  12. Malagò, L.; Pistone, G. Gradient Flow of the Stochastic Relaxation on a Generic Exponential Family. In Proceedings of the Conference on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Clos Lucé, Amboise, France, 21–26 September 2014; Mohammad-Djafari, A., Barbaresco, F., Eds.; pp. 353–360.
  13. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes—Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
  14. Rockafellar, R.T. Convex Analysis; Princeton Mathematical Series No. 28; Princeton University Press: Princeton, NJ, USA, 1970.
  15. do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Birkhäuser: Boston, MA, USA, 1992; translated from the second Portuguese edition by Francis Flaherty.
  16. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
  17. Shima, H. The Geometry of Hessian Structures; World Scientific: Hackensack, NJ, USA, 2007.
  18. Rinaldo, A.; Fienberg, S.E.; Zhou, Y. On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. 2009, 3, 446–484.
  19. Rauh, J.; Kahle, T.; Ay, N. Support Sets in Exponential Families and Oriented Matroid Theory. Int. J. Approx. Reason. 2011, 52, 613–626.
  20. Malagò, L.; Pistone, G. A note on the border of an exponential family. 2010; arXiv:1012.0637v1.
  21. Pistone, G.; Rogantin, M. The gradient flow of the polarization measure. With an appendix. 2015; arXiv:1502.06718.
  22. Diaconis, P.; Sturmfels, B. Algebraic algorithms for sampling from conditional distributions. Ann. Stat. 1998, 26, 363–397.
  23. Pistone, G.; Wynn, H.P. Generalised confounding with Gröbner bases. Biometrika 1996, 83, 653–666.
  24. Pistone, G.; Riccomagno, E.; Wynn, H.P. Algebraic Statistics: Computational Commutative Algebra in Statistics; Monographs on Statistics and Applied Probability, Volume 89; Chapman & Hall/CRC: Boca Raton, FL, USA, 2001.
  25. Drton, M.; Sturmfels, B.; Sullivant, S. Lectures on Algebraic Statistics; Oberwolfach Seminars, Volume 39; Birkhäuser: Basel, Germany, 2009.
  26. Pachter, L.; Sturmfels, B., Eds. Algebraic Statistics for Computational Biology; Cambridge University Press: Cambridge, UK, 2005.
  27. Gibilisco, P.; Riccomagno, E.; Rogantin, M.P.; Wynn, H.P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
  28. 4ti2 team. 4ti2—A software package for algebraic, geometric and combinatorial problems on linear spaces. Available online: http://www.4ti2.de (accessed on 2 June 2015).
  29. Michałek, M.; Sturmfels, B.; Uhler, C.; Zwiernik, P. Exponential Varieties. 2014; arXiv:1412.6185.
  30. Sturmfels, B. Gröbner Bases and Convex Polytopes; American Mathematical Society: Providence, RI, USA, 1996.
  31. Geiger, D.; Meek, C.; Sturmfels, B. On the toric algebra of graphical models. Ann. Stat. 2006, 34, 1463–1492.
  32. Rapallo, F. Toric statistical models: Parametric and binomial representations. Ann. Inst. Stat. Math. 2007, 59, 727–740.
  33. Beltrametti, M.; Carletti, E.; Gallarati, D.; Monti Bragadin, G. Lectures on Curves, Surfaces and Projective Varieties: A Classical View of Algebraic Geometry; EMS Textbooks in Mathematics; European Mathematical Society: Zürich, Switzerland, 2009.
  34. Rinaldo, A.; Fienberg, S.E.; Zhou, Y. On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. 2009, 3, 446–484.
  35. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2009; Chapter 21, pp. 339–363.
  36. Abbott, J.; Bigatti, A.; Lagorio, G. CoCoA-5: A system for doing Computations in Commutative Algebra. Available online: http://cocoa.dima.unige.it (accessed on 2 June 2015).
Figure 1. Marginal polytope of the exponential family in Equations (12) and (13). The coordinates of the vertices are given by (T1, T2).
Figure 2. Representation of the exponential family in Equations (12) and (13) as a surface that intersects the probability simplex ∆3. The surface is obtained by the triangularization of a grid of points that satisfy the invariant in Equation (21).
Figure 3. Marginal polytope of the exponential family in Equations (12) and (13) (a). The dashed lines correspond to the points where ∆ = 0, where ∆ is the discriminant in Equation (31); over the red regions ∆ > 0 and over the blue regions ∆ < 0. Representation of the exponential family as a surface in the full marginal polytope parametrized by (η1, η2, η3) (b). The blue surface is given by the unique real root η3,1 in Equation (32); the red surface corresponds to the unique real root η3,2, which belongs to the full marginal polytope; over the dashed lines, which have been computed solving Equation (40) numerically, Equation (26) admits a real root with multiplicity equal to three.
Figure 4. Representation of the exponential family in Equations (12) and (13) as a ruled surface in the probability simplex (a) and in the parameter space (α, t) (b). The dashed line corresponds to the critical edge δ2δ4 and the blue line to the case α = 1/2.
Figure 5. Representation of the exponential family in Equations (12) and (13) as a ruled surface in the marginal polytope (η1, η2) (a) and in the full marginal polytope parametrized by (η1, η2, η3) (b). The dashed line corresponds to the critical line δ2δ4, α = 1/2.
Figure 6. The segments that form the ruled surface in Figure 4 have been extended, for −0.5 < t < 1.5. New lines described by Equations (60) have been represented for 0 < α < exp(0.7) (shading from red to black for increasing values of α) and for −exp(0.7) − 1 < α < −1 (shading from red to white for decreasing values of α). The simplex in (b) has been rotated with respect to Figure 4(a) to better visualize the intersection of the lines with the critical edge δ2δ4.
Figure 7. Extension of the ruled surface associated with the exponential family in Equations (12) and (13) as in Figure 6(b), for −exp(3.5) − 1 < α < exp(3.5) and −0.5 < t < 1.5; for α → ±∞, the lines of the extended surface admit the same limit.
Figure 8. The segments that form the ruled surface in Figure 5 have been extended, for −0.5 < t < 1.5. New lines described by Equations (60) have been represented for 0 < α < exp(1) (shading from blue to black for increasing values of α) and for −exp(1) − 1 < α < −1 (shading from blue to white for decreasing values of α). The full marginal polytope in (b) has been rotated with respect to Figure 5(b) to better visualize the intersection of the lines with the critical edge δ2δ4.
Figure 9. Extension of the ruled surface associated with the exponential family in Equations (12) and (13) as in Figure 8(b), for −exp(3) − 1 < α < exp(3) and −0.5 < t < 1.5; notice that for α → ±∞, the lines of the extended surface admit the same limit.
Figure 10. Vanilla gradient field and flows in blue (a) and natural gradient field and flows in red (b), together with level lines associated with Fα,t(α, t) in the (α, t) parameterization, for c0 = 0, c1 = 1, c2 = 2 and c3 = 3; the dashed blue lines in (b) represent the points where F̃α,t(α, t) is not defined; see Equation (68).
Figure 11. Vanilla gradient field in blue (a) and natural gradient field and flows in red (b), together with level lines associated with Fη(α, t) over the marginal polytope, for c0 = 0, c1 = 1, c2 = 2 and c3 = 3.
Figure 12. Vanilla gradient field and flows in blue (a) and natural gradient field and flows in red (b) as in Figure 10, for c0 = 0, c1 = 1, c2 = 2 and c3 = 5/2.
Figure 13. Vanilla gradient field in blue (a) and natural gradient field and flows in red (b) as in Figure 11, for c0 = 0, c1 = 1, c2 = 2 and c3 = 5/2.
