# Combinatorial Optimization with Information Geometry: The Newton Method

1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39/41, 20135 Milano, Italy
2 Castro Statistics, Collegio Carlo Alberto, Via Real Collegio 30, 10024 Moncalieri, Italy
* Author to whom correspondence should be addressed.

Entropy 2014, 16(8), 4260-4289; https://doi.org/10.3390/e16084260
Received: 31 March 2014 / Revised: 10 July 2014 / Accepted: 11 July 2014 / Published: 28 July 2014

## 1. Introduction

In this paper, statistical exponential families are thought of as differentiable manifolds, along the approach called information geometry or the exponential statistical manifold. Specifically, our aim is to discuss optimization on statistical manifolds using the Newton method, as suggested in (Ch. 5 and 6); see also the monograph. This method is based on classical Riemannian geometry, but here, we put our emphasis on coordinate-free differential geometry; see [7,8].
We mainly refer to the above-mentioned references [2,4], with one notable exception in the description of the tangent space. Our manifold will be an exponential family $\mathcal{E}_V$ of positive densities, V being a vector space of sufficient statistics. Given a one-dimensional statistical model p(t) ∈ $\mathcal{E}_V$, t ∈ I, we define its velocity at time t to be its Fisher score $s(t) = \frac{d}{dt} \ln p(t)$. The Fisher score s(t) is a random variable with zero expectation with respect to p(t), $E_{p(t)}[s(t)] = 0$. Because of that, the tangent space at p ∈ $\mathcal{E}_V$ is a vector space of random variables with zero expectation at p. A vector field is a mapping from p to a random variable V(p), such that for all p ∈ $\mathcal{E}_V$, the random variable V(p) is centered at p, $E_p[V(p)] = 0$. In other words, each point of the manifold has a different tangent space, and this tangent space can be used as a non-parametric model space of the manifold. In this formalism, a vector field is a mapping from densities to centered random variables, that is, what in statistics is called a pivot of the statistical model. To avoid confusion with the product of random variables, we do not use the standard notation for the action of a vector field on a real function. This approach is possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to the general state space.
A complete construction of the geometric framework based on the idea of using the Fisher scores as elements of the tangent bundle has already been worked out. In this paper, we go further by considering a second order geometry in the non-parametric setting.
Our main motivation for such a geometrical construction is its application to combinatorial optimization using exponential families, whose first order version was developed in . We give here an illustration of the methods in the following toy example.
Consider the function f(x₁, x₂) = a₀ + a₁x₁ + a₂x₂ + a₁₂x₁x₂, with x₁, x₂ = ±1, a₀, a₁, a₂, a₁₂ ∈ ℝ. The function f is a real random variable on the sample space Ω = {+1, −1}² with the uniform probability λ. Note that the coordinate mappings X₁, X₂ of Ω generate an orthonormal basis 1, X₁, X₂, X₁X₂ of L²(Ω, λ) and that f is the general form of a real random variable on such a space. Let $\mathcal{P}_>$ be the open simplex of positive densities on (Ω, λ), and let $\mathcal{E}_V$ be a statistical model, i.e., a subset of $\mathcal{P}_>$. The relaxed mapping $F \colon \mathcal{E}_V \to \mathbb{R}$,
$F(p) = E_p[f] = a_0 + a_1 E_p[X_1] + a_2 E_p[X_2] + a_{12} E_p[X_1X_2],$
is strictly bounded by the maximum of f, $F(p) = E_p[f] < \max_{x \in \Omega} f(x)$, unless f is constant. We are looking for a sequence pₙ, n ∈ ℕ, such that $E_{p_n}[f] \to \max_{x \in \Omega} f(x)$ as n → ∞. The existence of such a sequence is a nontrivial condition on the model $\mathcal{E}_V$. Precisely, the closure of $\mathcal{E}_V$ must contain a density whose support is contained in the set of maxima {x ∈ Ω | f(x) = max f}. This condition is satisfied by the independence model, V = Span{X₁, X₂}, where we can write:
$F(\eta_1, \eta_2) = a_0 + a_1\eta_1 + a_2\eta_2 + a_{12}\eta_1\eta_2, \quad \eta_i = E_p[X_i].$
The gradient of Equation (2) has components ∂₁F = a₁ + a₁₂η₂ and ∂₂F = a₂ + a₁₂η₁, and the flow along the gradient produces increasing values of F; however, the gradient flow does not converge to the maximum of F; see the dotted line in Figure 2. One can instead follow the suggestion by  and use a modified gradient (the “natural” gradient) flow, which produces better results in our problem; see Figure 3. Full details on this example are given in Section 2.5.2.
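The behavior of the natural-gradient flow in this toy example can be checked numerically. The sketch below (the coefficient values are an arbitrary assumption) integrates the flow in the expectation parameters η by Euler steps, using the form derived in Section 2.5.2, and verifies that the relaxed value F(η) reaches the maximum of f:

```python
import numpy as np

# Arbitrary choice of coefficients for f(x1, x2) = a0 + a1*x1 + a2*x2 + a12*x1*x2:
a0, a1, a2, a12 = 0.0, 1.0, 0.5, -2.0

def f(x):                        # f on the sample space Omega = {+1,-1}^2
    x1, x2 = x
    return a0 + a1*x1 + a2*x2 + a12*x1*x2

def F(eta):                      # relaxed function F(eta) = E_p[f], eta_i = E_p[X_i]
    return a0 + a1*eta[0] + a2*eta[1] + a12*eta[0]*eta[1]

def grad_F(eta):                 # Euclidean gradient of F in the eta coordinates
    return np.array([a1 + a12*eta[1], a2 + a12*eta[0]])

# Natural-gradient flow in the eta chart: eta_j' = (1 - eta_j^2) * dF/deta_j,
# integrated by explicit Euler steps (step size is an assumption)
eta = np.zeros(2)
h = 0.05
for _ in range(2000):
    eta += h * (1 - eta**2) * grad_F(eta)

fmax = max(f((x1, x2)) for x1 in (+1, -1) for x2 in (+1, -1))
print(F(eta), fmax)              # the relaxed value reaches the combinatorial maximum
```

The factor (1 − η_j²) keeps the trajectory inside the cube ]−1, +1[², so the flow approaches a vertex, i.e., a point mass on a maximizer of f.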
In combinatorial optimization, the values of the function f are assumed to be available at each point, and the curve of steepest ascent of the relaxed function is learned through a simulation procedure based on exponential statistical models.
In this paper, we introduce, in Section 2, the geometry of exponential families and its first order calculus. The second order calculus and the Hessian are discussed in Section 3. Finally, in Section 4, we apply the formalism to the discussion of the Newton method in the context of the maximization of the relaxed function.

## 2. Models on a Finite State Space

We consider here the exponential statistical manifold on the set of positive densities on a measure space (Ω, μ), with Ω finite and μ the counting measure. The setup we describe below is not strictly required in the finite case, because in such a case other approaches are possible, but it provides a mathematical formalism that has its own advantages and that scales naturally to the infinite case.
We provide below a schematic presentation of our formalism as an introduction to this section.
• Two different exponential families can actually be the same statistical model, that is, the two sets of densities can be equal. This is due both to the arbitrariness of the reference density and to the fact that the sufficient statistics are just one vector basis of the vector space they generate. In a non-parametric approach, we can refer directly to the vector space of centered log-densities, while a change of reference density is geometrically interpreted as a change of chart. The set of all such charts defines a manifold.
• We make a specific interpretation of the tangent bundle as the vector space of Fisher’s scores at each density and use such tangent spaces as the space of coordinates. This produces a different tangent space/space of coordinates at each density, and different tangent spaces are mapped one onto another by a proper parallel transport, which is nothing else than the re-centering of random variables.
• If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new chart, whose values are real vectors. In the real parametrization, the natural scalar product in each scores space is given by Fisher’s information matrix.
• Riemannian gradients are defined in the usual way. It is customary in information geometry to call “natural gradient” the real coordinate presentation of the Riemannian gradient. The natural gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean gradient. It seems that there are three gradients involved, but they all represent the same object when correctly understood.
• The classical notion of expectation parameters for exponential families carries over as another chart on the statistical manifold, which gives rise to a further presentation of the same geometrical objects.
• While the statistical manifold is unique, there are at least three relevant connections as structures on the vector bundles of the manifold: one relating to the exponential charts, one relating to the expectation charts and one depending on the Riemannian structure.

#### 2.1. Exponential Families As Manifolds

On the finite sample space Ω, #Ω = n, let a set of random variables $\mathcal{B} = \{X_1, \dots, X_m\}$ be given, such that $\sum_j \alpha_j X_j$ is constant if, and only if, all the αⱼ’s are zero or, equivalently, such that X₀ = 1, X₁, . . . , Xₘ are affinely independent. The condition implies, in particular, the linear independence of $\mathcal{B}$. A common choice is to take a set of linearly independent and μ-centered random variables.
We write $V = \operatorname{Span}\{X_1, \dots, X_m\}$ and, given $p \in \mathcal{P}_>$, define the following exponential family of positive densities:
$\mathcal{E}_V = \{q \in \mathcal{P}_> \mid q \propto e^V p,\ V \in V\}.$
Given any couple $p, q \in \mathcal{E}_V$, there exists a unique set of parameters θ = θₚ(q), such that:
$q = \exp\Big(\sum_j \theta_j U^e_p X_j - \psi_p(\theta)\Big) \cdot p,$
where $U^e_p$ is the centering at p, that is,
$U^e_p \colon V \ni U \mapsto U - E_p[U] \in U^e_p V.$
The linear mapping $U^e_p$ is one-to-one on V, and $U^e_pX_j$, j = 1, . . . , m, is a basis of $U^e_pV$. We view each choice of a specific reference p as providing a chart centered at p on the exponential family $\mathcal{E}_V$, namely:
$σ p : exp ( ∑ j θ j U e p X j - ψ p ( θ ) ) · p ↦ θ ,$
If:
$U = U e p U + E p [ U ] = ∑ j = 1 m θ j U e p X j + E p [ U ] ,$
then:
$E p [ U U e p X i ] = ∑ j = 1 m θ j E p [ U e p X i U e p X j ] ,$
so that $θ = I B - 1 ( p ) E p [ U U e p X ]$, where:
$I B ( p ) = [ Cov p ( X i , X j ) ] i j = E p [ X X ′ ] - E p [ X ] E p [ X ′ ]$
is the Fisher information matrix of the basis $ℬ$ = {X1, . . . ,Xm}.
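On a finite Ω these formulas are directly computable with densities as vectors. The sketch below (the reference density p and the coordinates θ are arbitrary assumptions) builds $q = e_p(\theta) \cdot p$ on Ω = {+1, −1}² and recovers θ through $I_\mathcal{B}^{-1}(p)\,E_p[\log(q/p)\,U^e_pX]$:

```python
import numpy as np

# Omega = {+1,-1}^2; densities are length-4 vectors w.r.t. the counting measure.
omega = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
Xb = omega.copy()                                # columns: the basis B = {X1, X2}

p = np.array([0.4, 0.3, 0.2, 0.1])               # reference density (assumption)
theta = np.array([0.7, -0.3])                    # chart coordinates (assumption)

Xc = Xb - (p[:, None] * Xb).sum(0)               # U^e_p X_j: the basis centered at p
I_B = (p[:, None, None] * Xc[:, :, None] * Xc[:, None, :]).sum(0)  # Cov_p(X_i, X_j)

U = Xc @ theta                                   # sum_j theta_j U^e_p X_j
psi = np.log((p * np.exp(U)).sum())              # psi_p(theta)
q = np.exp(U - psi) * p                          # q = e_p(theta)

# theta = I_B(p)^{-1} E_p[log(q/p) U^e_p X]
rhs = (p[:, None] * np.log(q / p)[:, None] * Xc).sum(0)
theta_hat = np.linalg.solve(I_B, rhs)
print(theta_hat)                                 # recovers [0.7, -0.3]
```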
The mappings:
$s_p \colon \mathcal{E}_V \ni q \mapsto U = \log\Big(\frac{q}{p}\Big) - E_p\Big[\log\Big(\frac{q}{p}\Big)\Big] \in U^e_pV,$
$\sigma_p \colon \mathcal{E}_V \ni q \mapsto \theta = I_\mathcal{B}^{-1}(p) E_p[U\,U^e_pX] = I_\mathcal{B}^{-1}(p) E_p\Big[\log\Big(\frac{q}{p}\Big)U^e_pX\Big] \in \mathbb{R}^m,$
are global charts in the non-parametric and parametric coordinates, respectively. Notice that Equation (12) provides the regression coefficients of the least squares estimate on $U^e_pV$ of the log-likelihood.
We denote by $e_p \colon \mathbb{R}^m \to \mathcal{E}_V$ the inverse of σₚ, i.e.,
$e p ( θ ) = exp ( ∑ j = 1 m θ j U e p X j - ψ p ( θ ) ) · p ,$
so that the representation of the divergence q ↦ D(p ||q) in the chart σp is ψp:
$\psi_p(\theta) = \log\Big(E_p\Big[e^{\sum_{j=1}^m \theta_j U^e_pX_j}\Big]\Big) = E_p\Big[\log\Big(\frac{p}{e_p(\theta)}\Big)\Big] = D(p \,\|\, e_p(\theta)).$
The mapping I$ℬ$: p ↦ Covp (X, X) ∈ ℝm×m is represented in the chart centered at p by:
$I B , p ( θ ) = I B ( e p ( θ ) ) = [ Cov e p ( θ ) ( X i , X j ) ] i , j = Hess ψ p ( θ ) ,$
See .

#### 2.2. Change of Chart

Fix $p, \bar{p} \in \mathcal{E}_V$; then, we can express p in the chart centered at $\bar{p}$:
$p = \exp(\bar{U} - k_{\bar{p}}(\bar{U})) \cdot \bar{p}, \quad \bar{U} \in U^e_{\bar{p}}V, \quad k_{\bar{p}}(\bar{U}) = \log(E_{\bar{p}}[e^{\bar{U}}]).$
In coordinates, $\bar{U} = \sum_{j=1}^m \bar{\theta}_j U^e_{\bar{p}}X_j$.
For all $q \in \mathcal{E}_V$, $q = \exp(U - k_p(U)) \cdot p$, $U \in U^e_pV$, $k_p(U) = \log(E_p[e^U])$; in coordinates, $U = \sum_{j=1}^m \theta_j U^e_pX_j$, and we can write:
$q = exp ( U - k p ( U ) ) · p = exp ( U - k p ( U ) ) exp ( U ¯ - k p ¯ ( U ¯ ) ) · p ¯ = exp ( U - k p ( U ) + U ¯ - k p ¯ ( U ¯ ) ) · p ¯ = exp ( ( ( U + U ¯ ) - E p ¯ [ U ] ) - ( k p ( U ) - k p ¯ ( U ¯ ) + E p ¯ [ U ] ) ) · p ¯ ,$
hence, the non-parametric coordinate of q in the chart centered at $\bar{p}$ is $U + \bar{U} - E_{\bar{p}}[U] = U^e_{\bar{p}}U + \bar{U}$, and:
$σ p ¯ ( q ) = I V - 1 ( p ¯ ) E p ¯ [ ( U e p ¯ U + U ¯ ) U e p ¯ X ] = θ + θ ¯$
This provides the change of charts $\sigma_{\bar{p}} \circ \sigma_p^{-1} \colon \theta \mapsto \theta + \bar{\theta}$. This atlas of charts defines the affine manifold $(\mathcal{E}_V, (\sigma_p))$. This fact has deep consequences that we do not discuss here; e.g., our manifold is an instance of a Hessian manifold.
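The additivity of the change of charts can be verified on the smallest nontrivial example, Ω = {+1, −1} with the single sufficient statistic X (the coordinate values below are arbitrary assumptions):

```python
import numpy as np

x = np.array([1.0, -1.0])                  # the single sufficient statistic X on {+1,-1}
lam = np.array([0.5, 0.5])                 # uniform reference density

def e(theta, base):                        # exponential chart e_base(theta)
    xc = x - np.sum(base * x)              # X centered at `base`
    u = theta * xc
    return np.exp(u - np.log(np.sum(base * np.exp(u)))) * base

def sigma(q, base):                        # theta = E_base[log(q/base) Xc] / Var_base(X)
    xc = x - np.sum(base * x)
    return np.sum(base * np.log(q / base) * xc) / np.sum(base * xc * xc)

thetabar, theta = 0.4, -0.9                # arbitrary chart coordinates
p = e(thetabar, lam)                       # p expressed in the chart centered at lambda
q = e(theta, p)                            # q expressed in the chart centered at p
print(sigma(q, lam))                       # = theta + thetabar = -0.5
```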

#### 2.3. Tangent Bundle

The space of Fisher scores at p is $U^e_pV$, and it is identified with the tangent space of the manifold at p, $T_p\mathcal{E}_V$; see the discussion in [3,9]. Let us check the consistency of this statement with our θ-parametrization.
Let:
$q(\tau) = \exp\Big(\sum_{j=1}^m \theta_j(\tau) U^e_{q(0)}X_j - \psi_{q(0)}(\theta(\tau))\Big) \cdot q(0),$
τ ∈ I, I an open interval containing zero, be a curve in $\mathcal{E}_V$. In the chart centered at q(0), we have, from Equation (12):
$\sigma_{q(0)}(q(\tau)) = I_\mathcal{B}^{-1}(q(0)) E_{q(0)}\Big[\log\Big(\frac{q(\tau)}{q(0)}\Big) U^e_{q(0)}X\Big] = I_\mathcal{B}^{-1}(q(0)) E_{q(0)}\Big[\Big(\sum_{j=1}^m \theta_j(\tau) U^e_{q(0)}X_j - \psi_{q(0)}(\theta(\tau))\Big) U^e_{q(0)}X\Big] = I_\mathcal{B}^{-1}(q(0)) \sum_{j=1}^m \theta_j(\tau) E_{q(0)}\big[U^e_{q(0)}X_j\, U^e_{q(0)}X\big] = I_\mathcal{B}^{-1}(q(0)) E_{q(0)}\big[U^e_{q(0)}X\, (U^e_{q(0)}X)'\big]\,\theta(\tau) = \theta(\tau).$
The vector space $T_{q(0)}\mathcal{E}_V = U^e_{q(0)}V$ is represented by the coordinates in the basis $\mathcal{B}$. The tangent bundle $T\mathcal{E}_V$ as a manifold is defined by the charts (σₚ, σ̇ₚ) on the domain:
$T E V = { ( p , v ) ∣ p ∈ E V , v ∈ T p E V }$
with:
$( σ p , σ ˙ p ) : ( q , V ) ↦ ( I B - 1 ( p ) E p [ log ( q p ) U e p X ] , I B - 1 ( p ) E p [ V U e p X ] ) .$
The dot notation σ̇p for the charts on the tangent spaces is justified by the computation in Equation (23) below:
$\frac{d}{d\tau}\sigma_{q(0)}(q(\tau))\Big|_{\tau=0} = I_\mathcal{B}^{-1}(q(0)) E_{q(0)}\Big[\frac{d}{d\tau}\log(q(\tau))\Big|_{\tau=0} U^e_{q(0)}X\Big] = I_\mathcal{B}^{-1}(q(0)) E_{q(0)}\big[\delta q(0)\, U^e_{q(0)}X\big] = \dot{\sigma}_{q(0)}(\delta q(0)).$
The velocity at τ = 0 is $δ q ( 0 ) = d d τ log ( q ( τ ) ) ∣ τ = 0 ∈ T q ( 0 ) E V$ and:
$d d τ θ ( τ ) | τ = 0 = I B - 1 ( q ( 0 ) ) E q ( 0 ) [ d d τ log ( q ( τ ) ) | τ = 0 U e q ( 0 ) X ] = I B - 1 ( q ( 0 ) ) E q ( 0 ) [ δ q ( 0 ) U e q ( 0 ) X ] ,$
which is consistent with both the definition of tangent space as set of Fisher scores and with the chart of the tangent bundle as defined in Equation (22).
The velocity at a generic τ is $\delta q(\tau) = \frac{d}{d\tau}\log(q(\tau)) \in T_{q(\tau)}\mathcal{E}_V$ and has coordinates, in the chart centered at q(0):
$d d τ θ ( τ ) = I B - 1 ( q ( 0 ) ) E q ( 0 ) [ d d τ log ( q ( τ ) ) U e q ( 0 ) X ] = I B - 1 ( q ( 0 ) ) E q ( 0 ) [ δ q ( τ ) U e q ( 0 ) X ] .$
If V, W are vector fields on $T\mathcal{E}_V$, i.e., $V(p), W(p) \in T_p\mathcal{E}_V = U^e_pV$, $p \in \mathcal{E}_V$, we define a Riemannian metric g(V, W) by:
$g ( V , W ) ( p ) = g p ( V ( p ) , W ( p ) ) = E p [ V ( p ) W ( p ) ]$
In coordinates at p, $V ( p ) = ∑ j σ ˙ p j ( V ) U e p X j , W ( p ) = ∑ j σ ˙ p j ( W ) U e p X j$, so that:
$g p ( V ( p ) , W ( p ) ) = σ ˙ p ( V ) ′ I B ( p ) σ ˙ p ( W ) .$
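This identity can be checked numerically on Ω = {+1, −1}² (the density and the coefficient vectors below are arbitrary assumptions): the inner product of two tangent vectors computed as centered random variables agrees with the coordinate form $\dot{\sigma}_p(V)' I_\mathcal{B}(p) \dot{\sigma}_p(W)$:

```python
import numpy as np

omega = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
Xb = omega.copy()                               # the basis B = {X1, X2} on Omega
p = np.array([0.1, 0.2, 0.3, 0.4])              # positive density (assumption)

Xc = Xb - (p[:, None] * Xb).sum(0)              # U^e_p X_j: the basis of T_p
I_B = (p[:, None, None] * Xc[:, :, None] * Xc[:, None, :]).sum(0)  # Cov_p(X_i, X_j)

v, w = np.array([1.5, -0.2]), np.array([0.3, 2.0])   # coordinates (assumption)
V, W = Xc @ v, Xc @ w                           # tangent vectors as centered random variables

g1 = (p * V * W).sum()                          # g_p(V, W) = E_p[V W]
g2 = v @ I_B @ w                                # coordinate form with the Fisher matrix
print(g1, g2)                                   # the two expressions agree
```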

Given a function $\varphi \colon \mathcal{E}_V \to \mathbb{R}$, let $\varphi_p = \varphi \circ e_p$, $e_p = \sigma_p^{-1}$, be its representation in the chart centered at p.
The derivative of θφp(θ) at θ = 0 along α ∈ ℝm is:
$∇ φ p ( 0 ) α = ∇ φ p ( 0 ) I B - 1 ( p ) I B ( p ) α = ( I B - 1 ( p ) ∇ φ p ( 0 ) ′ ) ′ I B ( p ) α = g p ( I B - 1 ( p ) ∇ φ p ( 0 ) ′ , α ) .$
The mapping $\tilde{\nabla}\varphi \colon p \mapsto I_\mathcal{B}^{-1}(p)(\nabla\varphi_p(0))' \in \mathbb{R}^m$ that appears in Equation (29) is Amari’s natural gradient of φ; see . It is a standard notion in Riemannian geometry; cf.  (p. 46).
More generally, the derivative of θφp(θ) at θ along α ∈ ℝm is:
$∇ φ p ( θ ) α = ∇ φ p ( θ ) I B - 1 ( e p ( θ ) ) I B ( e p ( θ ) ) α = ( I B - 1 ( e p ( θ ) ) ∇ φ p ( θ ) ′ ) ′ I B ( e p ( θ ) ) α = g e p ( θ ) ( I B - 1 ( e p ( θ ) ) ∇ φ p ( θ ) ′ , α ) .$
Let us compare ∇φq(0) and ∇φp(θ) when q = ep(θ). As φp = φep and φq = φeq, we have the change of charts:
$φ q = φ ∘ e q = φ ∘ e p ∘ σ p ∘ e q = φ p ∘ σ p ∘ e q ,$
hence ∇φq(0) = ∇φp(σp(q))J(σpeq)(0), where J(σpeq) is the Jacobian of σpeq. As σpeq(θ) = θ + σp(q), we have J(σpeq) = Id, and in conclusion, ∇φep(θ)(0) = ∇φp(θ). For all p and θ ∈ ℝm,
$∇ ˜ φ ( e p ( θ ) ) = I B - 1 ( e p ( θ ) ) ∇ φ p ( θ ) .$
Alternatively, for all $q, p \in \mathcal{E}_V$, $\tilde{\nabla}\varphi \colon \mathcal{E}_V \to \mathbb{R}^m$ is defined by:
$∇ ˜ φ ( q ) = I B - 1 ( q ) ∇ φ p ( σ p ( q ) ) .$
The Riemannian gradient of $\varphi \colon \mathcal{E}_V \to \mathbb{R}$ is the vector field ∇φ, such that $D_Y\varphi = g(\nabla\varphi, Y)$. Note that the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in ℝᵐ. We compute the Riemannian gradient at p as follows. If y = σ̇ₚ(Y(p)),
$D Y φ ( p ) = d φ p ( 0 ) y = g p ( ∇ ˜ φ ( p ) , y ) = E p [ ∇ φ ( p ) Y ( p ) ] ,$
hence $\tilde{\nabla}\varphi(p) = I_\mathcal{B}^{-1}(p)\nabla\varphi_p(0)'$ is the representation in the chart centered at p of the vector field $\nabla\varphi \colon \mathcal{E}_V \to T\mathcal{E}_V$. Explicitly, we have (see Equation (22)):
$∇ ˜ φ ( p ) = I B - 1 ( p ) ( ∇ φ p ( 0 ) ) ′ = I B - 1 ( p ) E p [ ∇ φ ( p ) U e p X ] ,$
$∇ φ ( p ) = ∑ j ( ∇ ˜ φ ( p ) ) j U e p X j$
The Euclidean gradient ∇φₚ(θ) is sometimes called the “vanilla gradient.” At θ = 0 it is the covariance between the Riemannian gradient ∇φ(p) and the basis X: $(\nabla\varphi_p(0))' = E_p[\nabla\varphi(p)\,X]$, the expectation being a covariance because ∇φ(p) is centered at p.
To summarize, the three gradients are related by $\tilde{\nabla}\varphi(p) = I_\mathcal{B}^{-1}(p)(\nabla\varphi_p(0))'$ (Euclidean to natural) and $\nabla\varphi(p) = \sum_j (\tilde{\nabla}\varphi(p))_j U^e_pX_j$ (natural to Riemannian).
In the following, we shall frequently use the fact that the representation of the gradient vector field ∇φ in a generic chart centered at p is:
$( ∇ φ ) p ( θ ) = σ ˙ p ( ∇ φ ( e p ( θ ) ) ) = ( ∇ ˜ φ ) ( e p ( θ ) ) = I B , p - 1 ( θ ) ∇ φ p ( θ ) .$
It should be noted that the leftmost term (∇φ)p(θ) is the presentation of the gradient in the charts of the tangent bundle, while in the rightmost term, ∇φp(θ) denotes the Euclidean gradient of the presentation of the function φ in the charts of the manifold.

#### 2.4.1. Expectation Parameters

As ψₚ is strictly convex, the gradient mapping θ ↦ (∇ψₚ(θ))′ is a homeomorphism from the space of parameters ℝᵐ to the interior of the convex set generated by the image of X; see . The function $\mu_p \colon \mathcal{E}_V \to \mathbb{R}^m$ defined by:
$μ p ( q ) = E q [ U e p X ] = E q [ X ] - E p [ X ] = ( ∇ ψ p ( θ ) ) ′ , θ = σ p ( q )$
is a chart for all $p \in \mathcal{E}_V$. The value of the inverse q = Lₚ(μ) is characterized as the unique $q \in \mathcal{E}_V$ such that $\mu = E_q[U^e_pX]$, i.e., by the maximum likelihood estimator.
Let us compute the change of chart from p to $\bar{p}$:
$μ p ¯ ∘ μ p - 1 ( η ) = η ¯ = η + E p [ X ] - E p ¯ [ X ] .$
In fact, $\mu = E_q[X] - E_p[X]$ and $\bar{\mu} = \mu_{\bar{p}}(L_p(\mu)) = E_q[X] - E_{\bar{p}}[X]$.
We do not discuss here the rich theory, started in , about the duality between σₚ and μₚ. We limit ourselves to the computation of the Riemannian gradient in the expectation parameters. If $\varphi \colon \mathcal{E}_V \to \mathbb{R}$,
$φ p ( θ ) = φ ∘ e p ( θ ) = φ ∘ L p ∘ μ p ∘ e p ( θ ) = ( φ ∘ L p ) ∘ ( ∇ ψ p ) ( θ ) ,$
because $\mu_p \circ e_p(\theta) = E_{e_p(\theta)}[U^e_pX] = (\nabla\psi_p(\theta))'$, hence:
$∇ φ p ( θ ) = ∇ ( φ ∘ L p ) ( ∇ ψ p ( θ ) ) Hess ψ p ( θ ) ,$
$\tilde{\nabla}\varphi(p) = I_\mathcal{B}(p)^{-1}\big(\nabla(\varphi \circ L_p)(0) \operatorname{Hess}\psi_p(0)\big)' = (\nabla(\varphi \circ L_p)(0))',$
$∇ φ ( p ) = ∇ ( φ ∘ L p ) ( 0 ) U e p X ,$
that is, the natural gradient ∇̃φ at p equals the Euclidean gradient of $\mu \mapsto \varphi \circ L_p(\mu)$ at μ = 0.

#### 2.4.2. Vector Fields

If V is a vector field of $T\mathcal{E}_V$ and $\varphi \colon \mathcal{E}_V \to \mathbb{R}$ is a real function, then we define the action of V on φ, $\nabla_V\varphi$, to be the real function:
$∇ V φ : E V ∋ p ↦ ∇ V φ ( p ) = ∇ φ p ( 0 ) σ ˙ p ( V ( p ) ) .$
We prefer to avoid the standard notation V φ, because in our setting, V (p) is a random variable, and the product V (p)φ(p) is otherwise defined as the ordinary product.
Let us represent ∇V φ in the chart centered at p:
$( ∇ V φ ) p ( θ ) = ∇ V φ ( e p ( θ ) ) = ∇ φ e p ( θ ) ( 0 ) σ ˙ e p ( θ ) ( V ( e p ( θ ) ) ) = ∇ φ p ( θ ) V p ( θ ) ,$
where we have used the equality ∇φep (θ)(0) = ∇φp(θ) and Vp(θ) = σ̇ep(θ) (V (ep(θ))).
If W is another vector field, we can compute $\nabla_W\nabla_V\varphi$ at p as:
$∇ W ∇ V φ ( p ) = ∇ ( ∇ V φ ) p ( 0 ) σ ˙ p ( W ( p ) ) = V p ( 0 ) ′ Hess φ p ( 0 ) W p ( 0 ) + ∇ φ p ( 0 ) J V p ( 0 ) W p ( 0 ) ,$
where J denotes the Jacobian matrix.
The Lie bracket [W, V ]φ (see  (§4.2),  (V, §1),  (Section 5.3.1)) is given by:
$[W, V]\varphi(p) = \nabla_W\nabla_V\varphi(p) - \nabla_V\nabla_W\varphi(p) = \nabla\varphi_p(0)\big(JV_p(0)W_p(0) - JW_p(0)V_p(0)\big),$
because of Equation (47) and the symmetry of the Hessian.
The flow of the smooth vector field $V \colon \mathcal{E}_V \to T\mathcal{E}_V$ is a family of curves γ(t, p), $p \in \mathcal{E}_V$, $t \in J_p$, with $J_p$ an open real interval containing zero, such that, for all $p \in \mathcal{E}_V$ and $t \in J_p$:
$γ ( 0 , p ) = p ,$
$δ γ ( t , p ) = V ( γ ( t , p ) ) .$
As uniqueness holds in Equation (50) (see  (VI, §1) or  (§4.1)), we have the semi-group property γ(s + t, p) = γ(s, γ(t, p)), and Equation (50) is equivalent to δγ(0, p) = V(γ(0, p)), $p \in \mathcal{E}_V$.
If a flow of V is available, we have an interpretation of $\nabla_V\varphi$ as the derivative of φ along t ↦ γ(t, p):
$\frac{d}{dt}\varphi(\gamma(t,p))\Big|_{t=0} = \nabla\varphi_p(\sigma_p(\gamma(t,p)))\Big(\frac{d}{dt}\sigma_p(\gamma(t,p))\Big)\Big|_{t=0} = \nabla\varphi_p(0)\,\dot{\sigma}_p(V(p)) = \nabla_V\varphi(p).$

#### 2.5. Examples

The following examples are intended to show how the formalism of gradients is usable in performing basic computations.

#### 2.5.1. Expectation

Let f be any random variable, and define $F \colon \mathcal{E}_V \to \mathbb{R}$ by $F(p) = E_p[f]$. In the chart centered at p, we have:
$F p ( θ ) = ∫ f exp ( ∑ j θ j U e p X j - ψ p ( θ ) ) · p d μ$
$∇ F p ( 0 ) = Cov p ( f , X ) ∈ ( R m ) ′ .$
$∇ ˜ F ( p ) = Cov p ( X , X ) - 1 Cov p ( X , f ) ∈ R m ,$
$∇ F ( p ) = ( ∇ ˜ F ( p ) ) ′ U e p X = Cov p ( f , X ) Cov p ( X , X ) - 1 U e p X ∈ T p E V .$
From Equation (55), it follows that ∇F(p) is the L²(p)-projection of f onto $U^e_pV$, while ∇̃F(p) in Equation (54) gives the coordinates of the projection. Let us consider the family of curves:
$γ ( t , p ) = exp ( ∑ j = 1 m t ( ∇ ˜ F ( p ) ) j U e p X j - ψ p ( t ∇ ˜ F ( p ) ) ) · p , t ∈ R .$
The velocity is:
$δ γ ( t , p ) = d d t ( ∑ j = 1 m t ( ∇ ˜ F ( p ) ) j U e p X j - ψ p ( t ∇ ˜ F ( p ) ) ) = ∇ F ( p ) - E γ ( t , p ) [ ∇ F ( p ) ] ,$
which is different from ∇F(γ(t, p)), unless $f \in V \oplus \mathbb{R}$. Then, γ is not, in general, the flow of ∇F, but it is a local approximation, as δγ(0, p) = ∇F(p).
These computations are the basis of model-based methods in combinatorial optimization; see .
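As a sketch of such a model-based method (the objective f, the model, and all tuning constants below are assumptions, not taken from the paper), the natural gradient of F(p) = E_p[f] can be estimated by least squares of sampled f-values on the centered basis, Equation (54), and the flow followed in the expectation parameters of an independence model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

def f(x):                                   # objective on {+1,-1}^3 (assumption)
    return x[..., 0] + 0.5 * x[..., 1] * x[..., 2] - x[..., 0] * x[..., 2]

eta = np.zeros(d)                           # expectation parameters of the independence model
for _ in range(300):
    # simulate N samples with P(X_j = +1) = (1 + eta_j)/2, independently
    xs = np.where(rng.random((2000, d)) < (1 + eta) / 2, 1.0, -1.0)
    fs = f(xs)
    # least squares of f on the centered basis estimates the natural gradient
    Xc = xs - xs.mean(0)
    g, *_ = np.linalg.lstsq(Xc, fs - fs.mean(), rcond=None)
    # flow in the eta chart: eta' = (1 - eta^2) * natural gradient (Euler step)
    eta = np.clip(eta + 0.1 * (1 - eta**2) * g, -0.999, 0.999)

print(np.sign(eta))                          # concentrates on a maximizer of f
```

Only pointwise evaluations of f are used, as assumed in the combinatorial setting; the (1 − η²) factor keeps the iterates inside the simplex.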

#### 2.5.2. Binary Independent Variables

Here, we present, in full generality, the toy example of the Introduction; see  for more information on the application to combinatorial optimization. Our example is a very special case of the exactly solvable Ising models , our aim here being to explore the geometric framework.
Let Ω = {+1, −1}ᵐ with counting measure μ, and let the space V be generated by the coordinate projections $\mathcal{B} = \{X_1, \dots, X_m\}$. Note that we use here the coding +1, −1 (from physics) instead of the coding 0, 1, which is more common in combinatorial optimization. The exponential family is $\mathcal{E}_V = \big\{\exp\big(\sum_{j=1}^m \theta_j X_j - \psi_\lambda(\theta)\big) \cdot 2^{-m}\big\}$, with λ(x) = 2⁻ᵐ, x ∈ Ω, the uniform density. The independence of the sufficient statistics Xⱼ under all distributions in $\mathcal{E}_V$ implies:
$ψ λ ( θ ) = ∑ j = 1 m ψ ( θ j ) , ψ ( θ ) = log ( cosh ( θ ) ) .$
We have:
$∇ ψ λ ( θ ) = [ tanh ( θ j ) : j = 1 , … , d ] = η λ ( θ ) ,$
$Hess ψ λ ( θ ) = diag ( cosh - 2 ( θ j ) : j = 1 , … , d ) = diag ( e - 2 ψ ( θ j ) : j = 1 , … , d ) = I B , λ ( θ ) ,$
$I B , λ ( θ ) - 1 = diag ( cosh 2 ( θ j ) : j = 1 , … , d ) = diag ( e 2 ψ ( θ j ) : j = 1 , … , d ) .$
The quadratic function $f(X) = a_0 + \sum_j a_jX_j + \sum_{\{i,j\}} a_{i,j}X_iX_j$ has expected value at p = e_λ(θ), i.e., relaxed value, equal to:
$F ( p ) = F λ ( θ ) = E θ [ f ( X ) ] = a 0 + ∑ j a j tanh ( θ j ) + ∑ { i , j } a i , j tanh ( θ i ) tanh ( θ j ) ,$
and covariance with Xk$ℬ$ equal to:
$Cov θ ( f ( X ) , X k ) = ∑ j a j Cov θ ( X j , X k ) + ∑ { i , j } a i , j Cov θ ( X i X j , X k ) = a k Var θ ( X k ) + ∑ i ≠ k a i , k E θ [ X i ] Var θ ( X k ) = cosh - 2 ( θ k ) ( a k + ∑ i ≠ k a i , k tanh ( θ i ) ) .$
In the computation, we have used the independence and the special algebra of ±1-valued variables, which implies $X_i^2 = 1$, so that $\operatorname{Cov}_\theta(X_iX_j, X_k) = 0$ if i, j ≠ k, and otherwise $\operatorname{Cov}_\theta(X_iX_k, X_k) = E_\theta[X_i] - E_\theta[X_i]E_\theta[X_k]^2$; see .
$∇ F λ ( θ ) = [ cosh - 2 ( θ j ) ( a j + ∑ i ≠ j a i , j tanh ( θ i ) ) : j = 1 , … , d ] ,$
$∇ ˜ F ( e λ ( θ ) ) = [ a j + ∑ i ≠ j a i , j tanh ( θ i ) : j = 1 , … , d ] ,$
$∇ F ( e λ ( θ ) ) = ∑ j = 1 m ( a j + ∑ i ≠ j a i , j E θ [ X i ] ) ( X j - E θ [ X j ] ) .$
The (natural) gradient flow equations are:
$θ ˙ j ( t ) = a j + ∑ i ≠ j a i , j tanh ( θ i ( t ) ) , j = 1 , … , d .$
Equations (64)–(66) are usable in practice if the aⱼ’s and the a_{i,j}’s are estimable. Otherwise, one can use Equation (63) and the following forms of the gradients:
$∇ F λ ( θ ) = [ Cov θ ( X j , f ( X ) ) : j = 1 , … , d ] ,$
$∇ ˜ F ( e λ ( θ ) ) = [ cosh 2 ( θ j ) Cov θ ( f ( X ) , X j ) : j = 1 , … , d ] ,$
in which case, the gradient flow equations are:
$θ ˙ j ( t ) = cosh 2 ( θ j ) Cov θ ( f ( X ) , X j ) , j = 1 , … , d .$
Let us study the relaxed function in the expectation parameters ηj = ηj(θ), j = 1, . . . , d,
$F λ ( η ) = a 0 + ∑ j a j η j + ∑ { i , j } a i , j η i η j , η ∈ ] - 1 , + 1 [ m .$
The Euclidean gradient with respect to η has components:
$∂ j F λ ( η ) = a j + ∑ i ≠ j a i , j η i ,$
which are equal to the components of the natural gradient; see Section 2.4.1. As:
$η ˙ j ( t ) = d d t tanh ( θ j ( t ) ) = cosh - 2 ( θ j ( t ) ) θ ˙ j ( t ) = ( 1 - η j ( t ) 2 ) θ ˙ j ( t ) , j = 1 , … , m ,$
the gradient flow expressed in the η-parameters has equations:
$η ˙ j ( t ) = ( 1 - η j ( t ) 2 ) ( a j + ∑ i ≠ j a i , j η i ( t ) ) , j = 1 , … , d .$
Alternatively, in vector form,
$η ˙ ( t ) = diag ( 1 - η j ( t ) 2 : j = 1 , … , d ) ( a + A η ( t ) ) ,$
where $a = [a_j \colon j = 1, \dots, d]'$ and $A_{i,j} = 0$ if i = j, $A_{i,j} = a_{i,j}$ otherwise. The matrix A is symmetric with zero diagonal, and it has the meaning of the adjacency matrix of the (weighted) interaction graph. We do not know a closed-form solution of Equation (74). An example of a numerical solution is shown in Figure 3.

#### 2.5.3. Escort Probabilities

For a given a > 0, consider the function $C^{(a)} \colon \mathcal{E}_V \to \mathbb{R}$ defined by $C^{(a)}(p) = \int p^a \, d\mu$. We have:
$C p ( a ) ( θ ) = ∫ exp ( a ∑ j = 1 m θ j U e p X j - a ψ p ( θ ) ) p a d μ$
and:
$d C p ( a ) ( 0 ) α = ∫ a ( ∑ j = 1 m α j U e p X j ) p a d μ = ∑ j = 1 m α j ∫ a U e p X j p a d μ = ∑ j = 1 m α j Cov p ( X j , a p a - 1 ) ,$
that is, the Euclidean gradient is $∇ C p ( a ) ( 0 ) = Cov p ( a p a - 1 , X )$ (row vector). The natural gradient is computed from Equation (35) as:
$∇ ˜ C ( a ) ( p ) = I B - 1 ( p ) ( ∇ C p ( a ) ( 0 ) ) ′ = Cov p ( X , X ) - 1 Cov p ( X , a p a - 1 ) ,$
while the Riemannian gradient follows from Equation (36):
$∇ C ( a ) ( p ) = Cov p ( a p a - 1 , X ) Cov p ( X , X ) - 1 U e p X .$
Note that the Riemannian gradient is the orthogonal projection of the random variable $ap^{a-1}$ onto the tangent space $T_p\mathcal{E}_V = U^e_pV$.
The probability density pa/C(p) is called the escort density in the literature on non-extensive statistical mechanics; see, e.g.,  (Section 7.4).
We now compute the tangent mapping of $p \mapsto p^a/C^{(a)}(p) \in \mathcal{P}_>$. Let us extend the basis X₁, . . . , Xₘ to a basis X₁, . . . , Xₙ, n ≥ m, whose exponential family is full, i.e., equal to $\mathcal{P}_>$. The non-parametric coordinate of $q = \big(\exp\big(\sum_{j=1}^m \theta_jU^e_pX_j - \psi_p(\theta)\big)p\big)^a / C^{(a)}_p(\theta)$ in the chart centered at $\bar{p} = p^a/C^{(a)}_p(0)$ is the $\bar{p}$-centering of the random variable:
$log ( q p ¯ ) = log ( ( exp ( ∑ j = 1 m θ j U e p X j - ψ p ( θ ) ) p ) a / C p ( a ) ( θ ) p a / C p ( a ) ( 0 ) ) = a ∑ j = 1 m θ j U e p X j - a ψ p ( θ ) + ln C p ( a ) ( 0 ) - ln C p ( a ) ( θ ) ,$
that is,
$v = a ∑ j = 1 m θ j U e p ¯ X j .$
The coordinates of v in the basis X₁, . . . , Xₙ are $(a\theta_1, \dots, a\theta_m, 0, \dots, 0)$, and the Jacobian of $\theta \mapsto (a\theta, 0_{n-m})$ is the m × n matrix $[aI_m \mid 0_{m \times (n-m)}]$.
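The covariance formula for the Euclidean gradient of C⁽ᵃ⁾ can be checked by finite differences in the chart (the density p and the exponent a below are arbitrary assumptions):

```python
import numpy as np

omega = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
Xb = omega.copy()                                # the basis B = {X1, X2} on Omega
p = np.array([0.4, 0.3, 0.2, 0.1])               # positive density (assumption)
a = 2.0                                          # escort exponent (assumption)

Xc = Xb - (p[:, None] * Xb).sum(0)               # basis centered at p

def e_p(theta):                                  # chart e_p(theta) centered at p
    u = Xc @ theta
    return np.exp(u - np.log((p * np.exp(u)).sum())) * p

def C(q):                                        # C^(a)(q) = integral of q^a d mu
    return (q ** a).sum()

# central finite differences of theta -> C(e_p(theta)) at theta = 0
eps = 1e-6
grad_fd = np.array([
    (C(e_p(eps * np.eye(2)[j])) - C(e_p(-eps * np.eye(2)[j]))) / (2 * eps)
    for j in range(2)
])
# covariance form of the Euclidean gradient: Cov_p(a p^{a-1}, X)
grad_cov = ((p * a * p ** (a - 1))[:, None] * Xc).sum(0)
print(grad_fd, grad_cov)                          # the two gradients agree
```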

#### 2.5.4. Polarization Measure

The polarization measure was introduced in economics by . Here, we consider the qualitative version of . If π is a distribution on a finite set, the probability that, in three independent samples from π, exactly two are equal is $3\sum_j \pi_j^2(1 - \pi_j)$. If $p \in \mathcal{E}_V$, define:
$G p = ∫ p 2 ( 1 - p ) d μ = C ( 2 ) ( p ) - C ( 3 ) ( p ) ,$
where C(2) and C(3) are defined as in Example 2.5.3.
From Equation (78), we find the natural gradient:
$∇ ˜ G ( p ) = Cov p ( X , X ) - 1 Cov p ( X , 2 p - 3 p 2 ) .$
Note that ∇̃G(p) = 0 if p is constant; see Figure 4.

## 3. Second Order Calculus

In this section, we turn to considering second order calculus, in particular Hessians, in order to prepare the discussion of the Newton method for the relaxed optimization of Section 4.

#### 3.1. Metric Derivative (Levi–Civita connection)

Let $V, W \colon \mathcal{E}_V \to T\mathcal{E}_V$ be vector fields, that is, $V(p), W(p) \in T_p\mathcal{E}_V = U^e_pV$, $p \in \mathcal{E}_V$. Consider the real function $R = g(V, W) \colon \mathcal{E}_V \to \mathbb{R}$, whose value at p is $R(p) = g_p(V(p), W(p)) = E_p[V(p)W(p)]$. Assuming smoothness, we want to compute the derivative of R along the vector field $Y \colon \mathcal{E}_V \to T\mathcal{E}_V$, that is, $(D_YR)(p) = dR_p(0)\alpha$, with α = σ̇ₚ(Y(p)). The expression of R in the chart centered at p is, according to Equation (27):
$θ ↦ R p ( θ ) = σ ˙ p ( V ( e p ( θ ) ) ) ′ I B ( e p ( θ ) ) σ ˙ p ( W ( e p ( θ ) ) ) = V p ( θ ) ′ I B , p ( θ ) W p ( θ ) ,$
where Vp and Wp are the presentation in the chart of the vector fields V and W, respectively.
The i-th component iRp(θ) of the Euclidean gradient ∇Rp(θ) is:
$∂ i R p ( θ ) = ∂ i ( V p ( θ ) ′ I B , p ( θ ) W p ( θ ) ) = ∂ i V p ( θ ) ′ I B , p ( θ ) W p ( θ ) + V p ( θ ) ′ ∂ i I B , p ( θ ) W p ( θ ) ) + V p ( θ ) ′ I B , p ( θ ) ∂ i W p ( θ ) = ( ∂ i V p ( θ ) + 1 2 I B , p - 1 ( θ ) ∂ i I B , p ( θ ) V p ( θ ) ) ′ I B , p ( θ ) W p ( θ ) + V p ( θ ) ′ I B , p ( θ ) ( ∂ i W p ( θ ) + 1 2 I B , p - 1 ( θ ) ∂ i I B , p ( θ ) W p ( θ ) ) ,$
so that the derivative at θ along α = σ̇ep(θ)(Y (ep(θ))) is:
$d R p ( θ ) α = ( d V p ( θ ) α + 1 2 I B , p - 1 ( θ ) ( d I B , p ( θ ) α ) V p ( θ ) ) ′ I B , p ( θ ) W p ( θ ) + V p ( θ ) ′ I B , p ( θ ) ( d W p ( θ ) α + 1 2 I B , p - 1 ( θ ) ( d I B , p ( θ ) α ) W p ( θ ) ) .$

#### Proposition 1

If we define $D_YV$ to be the vector field on $\mathcal{E}_V$, whose value at q = eₚ(θ) has coordinates centered at p given by:
$\dot{\sigma}_p(D_YV(q)) = dV_p(\theta)\alpha + \frac{1}{2}I_{\mathcal{B},p}^{-1}(\theta)\big(dI_{\mathcal{B},p}(\theta)\alpha\big)V_p(\theta), \quad \alpha = \dot{\sigma}_p(Y(q)),$
then:
$D Y g ( V , W ) = g ( D Y V , W ) + g ( V , D Y W ) ,$
i.e., Equation (87) is a metric covariant derivative; see  (Ch. 2 §3),  (VIII §4),  (§5.3.2).
The metric derivative of Equation (87) can be computed from the flow of the vector field Y. Let (t, p) ↦ γ(t, p) be the flow of Y, i.e., δγ(t, p) = Y(γ(t, p)) and γ(0, p) = p. Using Equation (23), we have:
$d d t σ ˙ ( V ( γ ( t , p ) ) ) | t = 0 = d d t V p ( σ p ( γ ( t , p ) ) ) | t = 0 = d V p ( σ p ( γ ( t , p ) ) ) d d t σ p ( γ ( t , p ) ) | t = 0 = d V p ( 0 ) σ ˙ p ( δ γ ( 0 , p ) ) = d V p ( 0 ) σ ˙ p ( Y ( p ) ) ,$
and:
$\frac{d}{dt}I_\mathcal{B}(\gamma(t,p))\Big|_{t=0} = \frac{d}{dt}I_{\mathcal{B},p}(\sigma_p\gamma(t,p))\Big|_{t=0} = dI_{\mathcal{B},p}(0)\,\dot{\sigma}_p(\delta\gamma(0,p)) = dI_{\mathcal{B},p}(0)\,\dot{\sigma}_p(Y(p)),$
so that:
$\dot{\sigma}_p(D_YV(p)) = \frac{d}{dt}\dot{\sigma}_p(V(\gamma(t,p)))\Big|_{t=0} + \frac{1}{2}I_\mathcal{B}^{-1}(p)\Big(\frac{d}{dt}I_\mathcal{B}(\gamma(t,p))\Big|_{t=0}\Big)V_p(0).$
Let us check the symmetry of the metric covariant derivative to show that it is actually the unique Riemannian or Levi–Civita affine connection; see  (Th. 3.6).
The Lie bracket of the vector fields V and W is the vector field [V, W], whose coordinates are:
$[ V , W ] p ( θ ) = d V p ( 0 ) σ ˙ p ( W ( p ) ) - d W p ( 0 ) σ ˙ p ( V ( p ) ) .$
As the ij entry of $\partial_kI_{\mathcal{B},p}(0)$ is $\partial_k\partial_i\partial_j\psi_p(0)$, the symmetry $(dI_{\mathcal{B},p}(0)\alpha)\beta = (dI_{\mathcal{B},p}(0)\beta)\alpha$ holds, and we have:
$σ ˙ p ( D W V ( p ) - D V W ( p ) ) = d V p ( 0 ) σ ˙ p ( W ( p ) ) + 1 2 I B - 1 ( p ) ( d I B , p ( 0 ) σ ˙ p ( W ( p ) ) ) V p ( 0 ) - d W p ( 0 ) σ ˙ p ( V ( p ) ) - 1 2 I B - 1 ( p ) ( d I B , p ( 0 ) σ ˙ p ( V ( p ) ) ) W p ( 0 ) = σ ˙ [ V , W ] ( p ) .$
The term $\Gamma^k(p) = \frac{1}{2}I_{\mathcal{B},p}^{-1}(0)\,\partial_kI_{\mathcal{B},p}(0)$ of Equation (87) is sometimes referred to as the Christoffel matrix, but we do not use this terminology in this paper. As:
$I_{\mathcal{B},p}(\theta) = I_\mathcal{B}(e_p(\theta)) = [\operatorname{Cov}_{e_p(\theta)}(X_i, X_j)]_{i,j=1,\dots,m} = [\partial_i\partial_j\psi_p(\theta)]_{i,j=1,\dots,m},$
we have $\partial_kI_\mathcal{B}(e_p(\theta)) = [\partial_i\partial_j\partial_k\psi_p(\theta)]_{i,j=1,\dots,m} = [\operatorname{Cov}_{e_p(\theta)}(X_i, X_j, X_k)]_{i,j=1,\dots,m}$ and:
$\Gamma^k(p) = \frac{1}{2}\big[\operatorname{Cov}_p(X_i, X_j)\big]_{i,j=1,\dots,m}^{-1}\big[\operatorname{Cov}_p(X_i, X_j, X_k)\big]_{i,j=1,\dots,m}.$
If V, W are vector fields of $T\mathcal{E}_V$, we have:
$Γ ( p , V , W ) = 1 2 I B - 1 ( p ) Cov p ( X , V , W ) = 1 2 I B - 1 ( p ) E p [ U e p X V W ] ,$
which is the projection of V(p)W(p)/2 onto $U^e_pV$.
Notice also that:
$\big(dI_{\mathcal{B},p}^{-1}(0)\alpha\big)I_{\mathcal{B},p}(0) = -I_{\mathcal{B},p}^{-1}(0)\big(dI_{\mathcal{B},p}(0)\alpha\big)I_{\mathcal{B},p}^{-1}(0)I_{\mathcal{B},p}(0) = -I_{\mathcal{B},p}^{-1}(0)\big(dI_{\mathcal{B},p}(0)\alpha\big).$

#### 3.2. Acceleration

Let p(t), t ∈ I, be a smooth curve in $\mathcal{E}_V$. Then, the velocity $\delta p(t) = \frac{d}{dt}\log(p(t))$ is a vector field V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of its own velocity field, we can compute the metric derivative of the velocity along the velocity itself, $D_{\delta p}\delta p$, from Equation (91):
$σ ˙ p ( D δ p δ p ) ( p ( 0 ) ) = d d t σ ˙ p ( 0 ) ( δ ( p ( t ) ) ) | t = 0 + 1 2 I B - 1 ( p ( 0 ) ) d d t I B ( p ( t ) ) | t = 0 = d 2 d t 2 σ p ( 0 ) ( p ( t ) ) | t = 0 + 1 2 I B - 1 ( p ( 0 ) ) d d t I B ( p ( t ) ) | t = 0 .$
This is defined to be the Riemannian acceleration of the curve at t = 0.
Let us write θ(t) = σp(p(t)), p = p(0) and:
$p ( t ) = exp ( ∑ j = 1 m θ j ( t ) U e p X j - ψ p ( θ ( t ) ) ) · p ,$
so that σ̇p(δp)(0) = θ̇(0) and $d 2 d t 2 σ p ( p ( t ) ) | t = 0 = θ ¨ ( 0 )$. We have:
$\frac{d}{dt} I_{\mathcal{B}}(p(t))\Big|_{t=0} = \frac{d}{dt} I_{\mathcal{B},p}(\theta(t))\Big|_{t=0} = \frac{d}{dt}\operatorname{Hess}\psi_p(\theta(t))\Big|_{t=0} = \operatorname{Cov}_p\Big(X, X, \sum_{j=1}^m \dot\theta^j(0) X_j\Big),$
so that the acceleration at p has coordinates:
$\ddot\theta(0) + \frac{1}{2}\sum_{i,j=1}^m \dot\theta^i(0)\dot\theta^j(0)\operatorname{Cov}_p(X,X)^{-1}\operatorname{Cov}_p(X, X_i, X_j) = \ddot\theta(0) + \frac{1}{2}\operatorname{Cov}_p(X,X)^{-1}\operatorname{Cov}_p\Big(X, \sum_{i=1}^m \dot\theta^i(0) X_i, \sum_{j=1}^m \dot\theta^j(0) X_j\Big).$
A geodesic is a curve whose acceleration is zero at each point. The exponential map is the mapping Exp: T$ℰ$ → $ℰ$ defined by:
$( p , U ) ↦ Exp p U = p ( 1 ) ,$
where tp(t) is the geodesic, such that p(0) = p and δp(0) = U, for all U, such that the geodesic exists for t = 1.
The exponential map is a particular retraction, that is, a family of mappings Rp, p ∈ $ℰ$, from the tangent space at p to the manifold; here Rp: Tp$ℰ$ → $ℰ$, such that Rp(0) = p and dRp(0) = Id; see  (§5.4). It should be noted that exponential manifolds have natural retractions other than Exp, a notable one being the exponential family itself. A retraction provides a crucial step in a gradient search algorithm by mapping a direction of increase of the objective function to a new trial point.

#### 3.2.1. Example: Binary Independent 2.5.2 Continued

Let us consider the binary independent model of Section 2.5.2. We have
$I B ( e λ ( θ ) ) = I B , λ ( θ ) = diag ( cosh - 2 ( θ j ) : j = 1 , … , d ) ,$
it follows that
$\partial_k I_{\mathcal{B},\lambda}(\theta) = \partial_k \operatorname{diag}(\cosh^{-2}(\theta_j)\colon j = 1, \dots, d) = -2\cosh^{-3}(\theta_k)\sinh(\theta_k)\, E_{kk},$
where Ekk is the d × d matrix with entry one at (k, k), zero otherwise. The k-th Christoffel matrix, appearing as the second term in the definition of the metric derivative (a.k.a. the Levi–Civita connection), is:
$Γ B k ( e λ ( θ ) ) = Γ λ k ( θ ) = 1 2 I B , λ - 1 ( θ ) ∂ k I B , λ ( θ ) = - tanh ( θ k ) E k k .$
In terms of the moments, we have I$ℬ$,λ(θ) = Covθ (X, X′) = Hess ψλ(θ). As ∂k∂i∂jψλ(θ) = Covθ (Xk, Xi, Xj), we can write:
$∂ k I B , λ ( θ ) = ∂ k diag ( Var θ ( X j ) : j = 1 , … , d ) = Cov θ ( X k , X k , X k ) E k k$
and:
$Γ λ k ( θ ) = 1 2 Cov θ ( X k , X k ) - 1 Cov θ ( X k , X k , X k ) E k k = 1 2 ( 1 - ( η k ) 2 ) - 1 ( - 2 η k + 2 ( η k ) 3 ) E k k = - η k E k k .$
The equations for the geodesics starting from θ(0) with velocity θ̇(0) = u are:
$\ddot\theta^k(t) + \sum_{i,j=1}^{d} \Gamma^k_{ij}(\theta(t))\,\dot\theta^i(t)\dot\theta^j(t) = \ddot\theta^k(t) - \tanh(\theta^k(t))\left(\dot\theta^k(t)\right)^2 = 0, \quad k = 1, \dots, d.$
The ordinary differential equation:
$θ ¨ - tanh ( θ ) θ ˙ 2 = 0$
has the closed form solution:
$\theta(t) = \operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t\right) = \tanh^{-1}\!\left(\sin\!\left(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t\right)\right)$
for all t, such that:
$- \pi/2 < \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t < \pi/2,$
where gd: ℝ →] −π/2,+ π/2[ is the Gudermannian function, that is, gd′(x) = 1/cosh x, gd(0) = 0; in closed form, gd(x) = arcsin(tanh(x)). In fact, if θ is a solution of Equation (109), then:
$\frac{d}{dt}\operatorname{gd}(\theta(t)) = \frac{\dot\theta(t)}{\cosh(\theta(t))}$
$\frac{d^2}{dt^2}\operatorname{gd}(\theta(t)) = -\frac{\sinh(\theta(t))\left(\dot\theta(t)\right)^2}{\cosh^2(\theta(t))} + \frac{\ddot\theta(t)}{\cosh(\theta(t))} = \frac{1}{\cosh(\theta(t))}\left(\ddot\theta(t) - \tanh(\theta(t))\left(\dot\theta(t)\right)^2\right) = 0,$
so that t ↦ gd(θ(t)) coincides (where it is defined) with an affine function characterized by the initial conditions.
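The affinity of t ↦ gd(θ(t)) can be verified by a quick finite-difference check. The following sketch is ours: it parametrizes gd(θ(t)) as an affine function whose slope v0/cosh(θ0) makes v0 the initial velocity, and checks both the initial conditions and the geodesic ODE numerically.

```python
import numpy as np

# Closed-form geodesic of the one-dimensional binary model:
#   theta(t) = gd^{-1}( gd(theta0) + (v0 / cosh(theta0)) * t ),
# with gd(x) = arcsin(tanh(x)) and gd^{-1}(y) = arctanh(sin(y)).
# We verify theta(0) = theta0, theta'(0) = v0, and the ODE
#   theta'' = tanh(theta) * theta'^2  by central differences.

gd = lambda x: np.arcsin(np.tanh(x))
gd_inv = lambda y: np.arctanh(np.sin(y))

theta0, v0 = 0.4, 0.7
theta = lambda t: gd_inv(gd(theta0) + v0 / np.cosh(theta0) * t)

h = 1e-4
assert abs(theta(0.0) - theta0) < 1e-12
assert abs((theta(h) - theta(-h)) / (2 * h) - v0) < 1e-6
for t in [0.0, 0.2, 0.5]:
    d1 = (theta(t + h) - theta(t - h)) / (2 * h)          # theta'(t)
    d2 = (theta(t + h) - 2 * theta(t) + theta(t - h)) / h**2   # theta''(t)
    assert abs(d2 - np.tanh(theta(t)) * d1**2) < 1e-4
```

Note that any affine profile of gd(θ(t)) solves the ODE; the specific slope only fixes the initial velocity.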
In particular, at t = 1, the geodesic Equation (110) defines the Riemannian exponential Exp: T$ℰ$ → $ℰ$. If (p, U) ∈ T$ℰ$, that is, p ∈ $ℰ$ and U ∈ Tp$ℰ$, then σλ(p) = θ(0) and U = ∑uj Xj, σ̇λ(U) = u. If:
$- \pi/2 < \operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)} < \pi/2,$
then we can take θ̇(0) = u and t = 1, so that:
$\operatorname{Exp}_p\colon U \mapsto u = \dot\sigma_\lambda(U) \mapsto \left[\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)}\right)\colon j = 1, \dots, d\right] \overset{e_\lambda}{\mapsto} \prod_{j=1}^{d} \exp\left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)}\right) X_j - \psi\!\left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)}\right)\right)\right) 2^{-d}.$
We have:
$\exp(\operatorname{gd}^{-1}(v)) = \exp(\tanh^{-1}(\sin(v))) = \sqrt{\frac{1 + \sin v}{1 - \sin v}}$
and:
$\psi(\operatorname{gd}^{-1}(v)) = \log\cosh(\operatorname{gd}^{-1}(v)) = \log\left(\frac{1}{\cos v}\right),$
hence $u ↦ Exp p ( ∑ j = 1 d u j U e p X j )$ is given for:
$u ∈ ✕ j = 1 d ] cosh ( θ j ) ( - π / 2 - gd ( θ j ) ) , cosh ( θ j ) ( π / 2 - gd ( θ j ) ) [ ,$
by:
$\operatorname{Exp}_\theta(u) = \prod_{j=1}^{d} \cos\!\left(\operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)}\right)\left(\frac{1 + \sin\!\left(\operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)}\right)}{1 - \sin\!\left(\operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)}\right)}\right)^{X_j/2} = \prod_{j=1}^{d}\left(1 + \sin\!\left(\operatorname{gd}(\theta^j) + \frac{u^j}{\cosh(\theta^j)}\right) X_j\right) 2^{-d} \in \mathcal{E}_V.$
The expectation parameters are:
$\eta^j(t) = \operatorname{E}_{\theta=0}\!\left[X_j \prod_{i=1}^{d}\left(1 + \sin\!\left(\operatorname{gd}(\theta^i) + \frac{t\, u^i}{\cosh(\theta^i)}\right) X_i\right)\right] = \sin\!\left(\operatorname{gd}(\theta^j) + \frac{t\, u^j}{\cosh(\theta^j)}\right),$
and:
$gd ( θ j ) = arcsin ( η j ) , cosh ( θ j ) = 1 ( 1 - ( η j ) 2 ) 1 2 ,$
so that the exponential in terms of the expectation parameters is:
$\operatorname{Exp}_\eta(u) = \left(\sin\!\left(\arcsin \eta^j + \left(1 - (\eta^j)^2\right)^{\frac{1}{2}} u^j\right)\colon j = 1, \dots, d\right).$
The inverse of the Riemannian exponential provides a notion of translation between two elements of the exponential model, which is a particular parametrization of the model:
$\overrightarrow{\eta_1 \eta_2} = \operatorname{Exp}_{\eta_1}^{-1} \eta_2 = \left[\left(1 - (\eta_1^j)^2\right)^{-\frac{1}{2}}\left(\arcsin \eta_2^j - \arcsin \eta_1^j\right)\colon j = 1, \dots, d\right].$
In particular, at θ = 0, we have the geodesic:
$t \mapsto \prod_{j=1}^{d}\left(1 + \sin(t\, u^j) X_j\right) 2^{-d}, \quad |t| < \frac{\pi}{2 \max_j |u^j|}.$
Some geodesic curves are shown in Figure 5.
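In the expectation parameters, the exponential map and its inverse are easy to check numerically. The following sketch is ours: it implements the two componentwise maps above and verifies the round trip for sample values of η1 and η2.

```python
import numpy as np

# Riemannian exponential in the expectation parameters, componentwise:
#   Exp_eta(u)_j = sin( arcsin(eta_j) + sqrt(1 - eta_j^2) * u_j )
# and its inverse (the "translation" between two points of the model):
#   Exp_eta1^{-1}(eta2)_j = (arcsin(eta2_j) - arcsin(eta1_j)) / sqrt(1 - eta1_j^2)

def exp_eta(eta, u):
    return np.sin(np.arcsin(eta) + np.sqrt(1 - eta**2) * u)

def exp_eta_inv(eta1, eta2):
    return (np.arcsin(eta2) - np.arcsin(eta1)) / np.sqrt(1 - eta1**2)

eta1 = np.array([0.75, -0.2])
eta2 = np.array([-0.3, 0.6])
u = exp_eta_inv(eta1, eta2)
assert np.allclose(exp_eta(eta1, u), eta2)   # round trip recovers eta2
```

The round trip is exact because the argument of the sine is, by construction, arcsin of the target parameter, which stays in the admissible interval.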

#### 3.3. Riemannian Hessian

Let φ: $ℰ$ → ℝ with Riemannian gradient ∇φ(p) = ∑i(∇̃φ)i(p) Xi, $∇ ˜ φ ( p ) = I B - 1 ( p ) ∇ φ p ( 0 )$. The Riemannian Hessian of φ along the vector field Y is the metric derivative of the gradient ∇φ along Y, that is, HessY φ = DY∇φ; see  (Ch. 6, Ex. 11),  (§5.5). In the following, we denote by the symbol Hess, without a subscript, the ordinary Hessian matrix.
From Equation (87), we have the coordinates of HessY φ(p). Given a generic tangent vector α, we compute from Equation (38):
$d ( ∇ φ ) p ( θ ) α ∣ θ = 0 = d ( I B , p - 1 ( θ ) ∇ φ p ( θ ) ) α ∣ θ = 0 = ( d I B , p - 1 ( 0 ) α ) ∇ φ p ( 0 ) + I B , p - 1 ( 0 ) Hess φ p ( 0 ) α = - I B - 1 ( p ) ( d I B , p ( 0 ) α ) ∇ ˜ φ ( p ) + I B - 1 ( p ) Hess φ p ( 0 ) α$
and, upon substitution of (∇φ)p to Vp in Equation (87),
$\dot\sigma_p(\operatorname{Hess}_Y \varphi(p)) = d(\nabla\varphi)_p(0)\alpha + \frac{1}{2} I_{\mathcal{B}}^{-1}(p)\left(d I_{\mathcal{B},p}(0)\alpha\right)(\nabla\varphi)_p(0), \quad \alpha = \dot\sigma_p(Y(p)), \\ = -I_{\mathcal{B}}^{-1}(p)\left(d I_{\mathcal{B},p}(0)\alpha\right)\tilde\nabla\varphi(p) + I_{\mathcal{B}}^{-1}(p)\operatorname{Hess}\varphi_p(0)\alpha + \frac{1}{2} I_{\mathcal{B}}^{-1}(p)\left(d I_{\mathcal{B},p}(0)\alpha\right)\tilde\nabla\varphi(p) = I_{\mathcal{B}}^{-1}(p)\operatorname{Hess}\varphi_p(0)\alpha - \frac{1}{2} I_{\mathcal{B}}^{-1}(p)\left(d I_{\mathcal{B},p}(0)\alpha\right)\tilde\nabla\varphi(p) = I_{\mathcal{B}}^{-1}(p)\left(\operatorname{Hess}\varphi_p(0)\alpha - \frac{1}{2}\left(d I_{\mathcal{B},p}(0)\alpha\right)\tilde\nabla\varphi(p)\right).$
HessY φ is characterized by the values of g(HessY φ, X) for all vector fields X. We have from Equation (126), with α = σ̇p(Y (p)) and β = σ̇p(X(p)),
$g p ( Hess Y ( p ) φ ( p ) , X ( p ) ) = β ′ Hess φ p ( 0 ) α - 1 2 β ′ ( d I B , p ( 0 ) α ) ∇ ˜ φ ( p ) .$
This is the presentation of the Riemannian Hessian as a bi-linear form on Tp$ℰ$; see the comments in  (Prop. 5.5.2-3). Note that the Riemannian Hessian is positive definite if:
$α ′ Hess φ p ( 0 ) α ≥ 1 2 α ′ ( d I B , p ( 0 ) α ) ∇ ˜ φ ( p ) , α ∈ R m .$

## 4. Application to Combinatorial Optimization

We conclude our paper by showing how the geometric method applies to the problem of finding the maximum of the expected value of a function.

#### 4.1. Hessian of a Relaxed Function

Here is a key example of a vector field. Let f be any bounded random variable, and define the relaxed function to be φ(p) = $E_p[f]$, p ∈ $ℰ$. Define F(p) to be the projection of f, as an element of L2(p), onto the tangent space Tp$ℰ$, i.e., F(p) is the centered random variable in the span of the sufficient statistics such that:
$E p [ ( f - F ( p ) ) v ] = 0 , v ∈ U e p V$
In the basis $ℬ$, we have F(p) = ∑i f̂p,i Xi and:
$Cov p ( f , X j ) = ∑ i f ^ p , i E p [ U e p X i U e p X j ] , j = 1 , … , m ,$
so that $f ^ p = I B - 1 ( p ) Cov p ( X , f )$ and
$F ( p ) = f ^ p ′ U e p X = Cov p ( f , X ) I B - 1 ( p ) U e p X .$
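This projection is straightforward to verify numerically. The following sketch is ours: for a random density of the exponential family on {−1,+1}³ and a random function f, it computes F(p) by solving the normal equations and checks the orthogonality condition $E_p[(f - F(p))v] = 0$ for every centered sufficient statistic v.

```python
import itertools
import numpy as np

# L2(p)-projection of a function f onto the span of the centered
# sufficient statistics:  F(p) = hat_f' (X - E_p[X]),  with
# hat_f = Cov_p(X,X)^{-1} Cov_p(X,f).  We check the residual f - F(p)
# is orthogonal to every centered statistic.

rng = np.random.default_rng(0)
d = 3
states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
theta = rng.normal(size=d)
p = np.exp(states @ theta)
p /= p.sum()                              # density of the exponential family
f = rng.normal(size=len(states))          # an arbitrary bounded function

Xc = states - p @ states                  # centered sufficient statistics
cov2 = (Xc * p[:, None]).T @ Xc           # Cov_p(X, X)
cov_fX = Xc.T @ (p * (f - p @ f))         # Cov_p(X, f)
F = Xc @ np.linalg.solve(cov2, cov_fX)    # projection, evaluated statewise
for j in range(d):
    assert abs(p @ ((f - F) * Xc[:, j])) < 1e-12   # E_p[(f - F) v] = 0
```

The residual vanishes exactly (up to rounding) because the coefficients solve the normal equations of the weighted least-squares problem.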
Let us compute the gradient of the relaxed function φ = E•[f]: $ℰ$ → ℝ. We have φp(θ) = $E_{e_p(θ)}[f]$, and from the properties of exponential families, the Euclidean gradient is ∇φp(0) = Covp (f, X). It follows that the natural gradient is:
$\tilde\nabla\varphi_p(0) = I_{\mathcal{B}}^{-1}(p)\operatorname{Cov}_p(X, f) = \hat f_p,$
and the Riemannian gradient is ∇φ(p) = F(p).
From the properties of exponential families, we have:
$Hess φ p ( 0 ) = Cov p ( X , X , f ) ,$
so that, in this case, Equation (127), when written in terms of the moments, is:
$β ′ Cov p ( X , X , f ) α - 1 2 β ′ Cov p ( X , X , α · X ) Cov p ( X , X ) - 1 Cov p ( X , f ) .$

#### 4.1.1. Example: Binary Independent 2.5.2 and 3.2.1 Continued

We list below the computation of the Hessian in the case of two binary independent variables. Computations were done with Sage, which allows both the reduction $x i 2 = 1$ in the ring of polynomials and the simplifications in the symbolic ring of parameters.
$Cov η ( X , f ) = ( - ( η 1 2 - 1 ) a 1 - ( η 1 2 η 2 - η 2 ) a 12 - ( η 2 2 - 1 ) a 2 - ( η 1 η 2 2 - η 1 ) a 12 ) = ( - ( η 1 - 1 ) ( η 1 + 1 ) ( a 12 η 2 + a 1 ) - ( η 2 - 1 ) ( η 2 + 1 ) ( a 12 η 1 + a 2 ) )$
$Cov η ( X , X ) = ( - η 1 2 + 1 0 0 - η 2 2 + 1 ) = ( - ( η 1 - 1 ) ( η 1 + 1 ) 0 0 - ( η 2 - 1 ) ( η 2 + 1 ) )$
$Cov η ( X , X ) - 1 Cov η ( X , f ) = ( a 12 η 2 + a 1 a 12 η 1 + a 2 ) = ∇ F ( η )$
$Cov η ( X , X , f ) = ( 2 ( η 1 3 - η 1 ) a 1 + 2 ( η 1 3 η 2 - η 1 η 2 ) a 12 ( η 1 2 η 2 2 - η 1 2 - η 2 2 + 1 ) a 12 ( η 1 2 η 2 2 - η 1 2 - η 2 2 + 1 ) a 12 2 ( η 1 η 2 3 - η 1 η 2 ) a 12 + 2 ( η 2 3 - η 2 ) a 2 ) = ( 2 ( η 1 - 1 ) ( η 1 + 1 ) ( a 12 η 2 + a 1 ) η 1 ( η 2 - 1 ) ( η 2 + 1 ) ( η 1 - 1 ) ( η 1 + 1 ) a 12 ( η 2 - 1 ) ( η 2 + 1 ) ( η 1 - 1 ) ( η 1 + 1 ) a 12 2 ( η 2 - 1 ) ( η 2 + 1 ) ( a 12 η 1 + a 2 ) η 2 )$
$Cov η ( X , X ) - 1 Cov η ( X , X , f ) = ( - 2 ( a 12 η 2 + a 1 ) η 1 - a 12 η 2 2 + a 12 - a 12 η 1 2 + a 12 - 2 ( a 12 η 1 + a 2 ) η 2 )$
$Cov η ( X , X , ∇ F ( η ) ) = ( 2 ( a 12 η 2 + a 1 ) ( η 1 + 1 ) ( η 1 - 1 ) η 1 0 0 2 ( a 12 η 1 + a 2 ) ( η 2 + 1 ) ( η 2 - 1 ) η 2 )$
$Cov η ( X , X ) - 1 Cov η ( X , X , ∇ F ( η ) ) = ( - 2 ( a 12 η 2 + a 1 ) η 1 0 0 - 2 ( a 12 η 1 + a 2 ) η 2 )$
The Riemannian Hessian as a matrix in the basis of the tangent space is:
$Hess F ( η ) = Cov η ( X , X ) - 1 ( Cov η ( X , X , f ) - 1 2 Cov η ( X , X , ∇ F ( η ) ) ) = ( - ( a 12 η 2 + a 1 ) η 1 - a 12 ( η 2 + 1 ) ( η 2 - 1 ) - a 12 ( η 1 + 1 ) ( η 1 - 1 ) - ( a 12 η 1 + a 2 ) η 2 )$
As a check, let us compute the Riemannian Hessian as a natural Hessian in the Riemannian parameters, Hess F ∘ Expη(u) | u=0; see  (Prop. 5.5.4). We have:
$F ∘ Exp η ( u ) = a 12 sin ( - η 1 2 + 1 u 1 + arcsin ( η 1 ) ) sin ( - η 2 2 + 1 u 2 + arcsin ( η 2 ) ) + a 1 sin ( - η 1 2 + 1 u 1 + arcsin ( η 1 ) ) + a 2 sin ( - η 2 2 + 1 u 2 + arcsin ( η 2 ) )$
and:
$Hess F ∘ Exp η ( u ) ∣ u = 0 = ( ( η 1 2 - 1 ) a 12 η 1 η 2 + ( η 1 2 - 1 ) a 1 η 1 ( η 1 2 - 1 ) ( η 2 2 - 1 ) a 12 ( η 1 2 - 1 ) ( η 2 2 - 1 ) a 12 ( η 2 2 - 1 ) a 12 η 1 η 2 + ( η 2 2 - 1 ) a 2 η 2 ) = ( ( a 12 η 2 + a 1 ) ( η 1 + 1 ) ( η 1 - 1 ) η 1 a 12 ( η 1 + 1 ) ( η 1 - 1 ) ( η 2 + 1 ) ( η 2 - 1 ) a 12 ( η 1 + 1 ) ( η 1 - 1 ) ( η 2 + 1 ) ( η 2 - 1 ) ( a 12 η 1 + a 2 ) ( η 2 + 1 ) ( η 2 - 1 ) η 2 ) .$
Note the presence of the factor Covη (X,X).
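The symbolic computations above can be cross-checked numerically. The following sketch is ours (numpy instead of Sage): it enumerates the four states, builds all the covariances, assembles the Riemannian Hessian matrix, and compares it to the closed-form entries, for sample values of a1, a2, a12 and η.

```python
import itertools
import numpy as np

# Numerical check of the Riemannian Hessian of the relaxed function
#   Hess F(eta) = Cov(X,X)^{-1} ( Cov(X,X,f) - (1/2) Cov(X,X, grad_F . X) )
# for two independent binary variables; a1, a2, a12, eta are sample values.

a1, a2, a12 = 1.0, 2.0, 3.0
eta = np.array([0.3, -0.5])

states = np.array(list(itertools.product([-1.0, 1.0], repeat=2)))
# independence model: P(x) = prod_j (1 + eta_j x_j) / 2
p = np.prod((1 + states * eta) / 2, axis=1)
f = a1 * states[:, 0] + a2 * states[:, 1] + a12 * states[:, 0] * states[:, 1]

Xc = states - eta
fc = f - p @ f
cov2 = (Xc * p[:, None]).T @ Xc                      # Cov(X, X)
cov_Xf = Xc.T @ (p * fc)                             # Cov(X, f)
grad_nat = np.linalg.solve(cov2, cov_Xf)             # natural gradient
cov3_f = np.einsum("s,si,sj,s->ij", p, Xc, Xc, fc)   # Cov(X, X, f)
Fc = Xc @ grad_nat                                   # centered grad_F . X
cov3_F = np.einsum("s,si,sj,s->ij", p, Xc, Xc, Fc)   # Cov(X, X, grad_F . X)
hess = np.linalg.solve(cov2, cov3_f - 0.5 * cov3_F)

expected = np.array([
    [-(a12 * eta[1] + a1) * eta[0], a12 * (1 - eta[1] ** 2)],
    [a12 * (1 - eta[0] ** 2), -(a12 * eta[0] + a2) * eta[1]],
])
assert np.allclose(hess, expected)
assert np.allclose(grad_nat, [a1 + a12 * eta[1], a2 + a12 * eta[0]])
```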

#### 4.2. Newton Method

The Newton method is an iterative method that generates a sequence of points pt, with t = 0, 1, . . . , that converges towards a stationary point p̂ of F(p) = $E_p[f]$, p ∈ $ℰ$, that is, a critical point of the vector field p ↦ ∇F(p), ∇F(p̂) = 0. Here, we follow  (Ch. 5–6), and in particular Algorithm 5 on Page 113.
Let ∇F be a gradient field. We reproduce in our case the basic derivation of the Newton method in the following. Note that, in this section, we use the notation Hess •[α] to denote Hessα •. Using the definition of metric derivative, we have for a geodesic curve [0, 1] ∋ t ↦ p(t) ∈ $ℰ$ connecting p = p(0) to p̂ = p(1) that:
$d d t g p ( t ) ( ∇ F ( p ( t ) ) , δ p ( t ) ) = g p ( t ) ( Hess F ( p ( t ) ) [ δ p ( t ) ] , δ p ( t ) )$
hence the increment from p to p̂ is:
$g p ^ ( ∇ F ( p ^ ) , δ p ( 1 ) ) - g p ( ∇ F ( p ) , δ p ( 0 ) ) = ∫ 0 1 g p ( t ) ( Hess F ( p ( t ) ) [ δ p ( t ) ] , δ p ( t ) ) d t .$
Now, we assume that ∇F(p̂) = 0 and that in Equation (145), the integral is approximated by the initial value of the integrand, that is to say, the Hessian is approximately constant on the geodesic from p to p̂; we obtain:
$- g_p(\nabla F(p), \delta p(0)) = g_p(\operatorname{Hess} F(p)[\delta p(0)], \delta p(0)) + \epsilon.$
If we can solve the Newton equation:
$\operatorname{Hess} F(p)[u] = -\nabla F(p),$
then u is approximately equal to the initial velocity of the geodesic connecting p to p̂, that is, p̂ = Expp(u).
The particular structure of the exponential manifold suggests at least two natural retractions that could be used to map the step u to a new point p̂. Namely, we have the Riemannian exponential (θt, θt+1) ↦ Expθt (θt+1 − θt) and the e-retraction coming from the exponential family itself and defined by (θt, θt+1) ↦ eθt (θt+1 − θt), with θt+1 − θt = ut.
In the θ parameters, with the e-retraction, the Newton method generates a sequence (θt) according to the following updating rule:
$θ t + 1 = θ t - λ Hess F ( θ t ) - 1 ∇ ˜ F ( θ t )$
where λ > 0 is an extra parameter intended to control the step size and, in turn, the convergence to θ̂; see .
We can rewrite Equation (148) in terms of covariances as:
$θ t + 1 = θ t - λ ( Cov θ t ( X , X , f ) - 1 2 Cov θ t ( X , X , ∇ ˜ F ( θ t ) ) ) - 1 ∇ ˜ F ( θ t ) .$
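This update rule can be sketched in code. The following is ours, not from the paper: it runs the damped Newton iteration in the θ parameters with the e-retraction for the running two-variable example, reading Hess F(θt) as the Riemannian Hessian matrix of Section 3.3 and computing all covariances by enumeration.

```python
import itertools
import numpy as np

# Newton update in the theta parameters with the e-retraction:
#   theta <- theta - lam * Hess^{-1} grad_nat,
# where Hess = Cov(X,X)^{-1}( Cov(X,X,f) - (1/2) Cov(X,X, grad_F . X) )
# is the Riemannian Hessian matrix, for f = a1*x1 + a2*x2 + a12*x1*x2.

a1, a2, a12 = 1.0, 2.0, 3.0      # sample coefficients, as in the figures
lam = 0.015                      # step-size control
states = np.array(list(itertools.product([-1.0, 1.0], repeat=2)))
f = a1 * states[:, 0] + a2 * states[:, 1] + a12 * states[:, 0] * states[:, 1]

def newton_update(theta):
    p = np.exp(states @ theta)
    p /= p.sum()
    Xc = states - p @ states                             # centered statistics
    fc = f - p @ f
    cov2 = (Xc * p[:, None]).T @ Xc                      # Cov(X, X)
    grad_nat = np.linalg.solve(cov2, Xc.T @ (p * fc))    # natural gradient
    cov3_f = np.einsum("s,si,sj,s->ij", p, Xc, Xc, fc)   # Cov(X, X, f)
    Fc = Xc @ grad_nat
    cov3_F = np.einsum("s,si,sj,s->ij", p, Xc, Xc, Fc)   # Cov(X, X, grad_F . X)
    hess = np.linalg.solve(cov2, cov3_f - 0.5 * cov3_F)  # Riemannian Hessian
    return theta - lam * np.linalg.solve(hess, grad_nat)

theta = np.array([1.0, 1.0])
for _ in range(500):
    theta = newton_update(theta)
# from this start the iterates drift towards the vertex (+1, +1) of the cube,
# i.e., theta grows without bound in both components
assert np.all(theta > 3.0)
```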

#### 4.3. Example: Binary Independent

In the η parameters, the Newton step is:
$u = -\operatorname{Hess} F(\eta)^{-1}\nabla F(\eta) = \begin{pmatrix} \dfrac{a_{12}^2\eta_1 + a_{12}a_2 + (a_1 a_{12}\eta_1 + a_1 a_2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1 a_{12}\eta_1^2 + a_1 a_2\eta_1)\eta_2} \\[2ex] \dfrac{a_1 a_2\eta_1 + a_1 a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1 a_{12}\eta_1^2 + a_1 a_2\eta_1)\eta_2} \end{pmatrix}$
and the new η in the Riemannian retraction is:
$Exp η ( u ) = ( sin ( ( a 12 2 η 1 + a 12 a 2 + ( a 1 a 12 η 1 + a 1 a 2 ) η 2 ) - η 1 2 + 1 a 12 2 η 1 2 + ( a 12 a 2 η 1 + a 12 2 ) η 2 2 - a 12 2 + ( a 1 a 12 η 1 2 + a 1 a 2 η 1 ) η 2 + arcsin ( η 1 ) ) sin ( ( a 1 a 2 η 1 + a 1 a 12 + ( a 12 a 2 η 1 + a 12 2 ) η 2 ) - η 2 2 + 1 a 12 2 η 1 2 + ( a 12 a 2 η 1 + a 12 2 ) η 2 2 - a 12 2 + ( a 1 a 12 η 1 2 + a 1 a 2 η 1 ) η 2 + arcsin ( η 2 ) ) . )$
In Figure 6, we represented the vector field associated with the Newton step in the η parameters, with λ = 0.05, using the Riemannian retraction, for the case a1 = 1, a2 = 2 and a12 = 3, with:
$Exp η ( u ) = ( sin ( λ - η 1 2 + 1 ( ( 3 η 1 + 2 ) η 2 + 9 η 1 + 6 ) 3 ( 2 η 1 + 3 ) η 2 2 + 9 η 1 2 + ( 3 η 1 2 + 2 η 1 ) η 2 - 9 + arcsin ( η 1 ) ) sin ( λ ( 3 ( 2 η 1 + 3 ) η 2 + 2 η 1 + 3 ) - η 2 2 + 1 3 ( 2 η 1 + 3 ) η 2 2 + 9 η 1 2 + ( 3 η 1 2 + 2 η 1 ) η 2 - 9 + arcsin ( η 2 ) ) ) .$
The red dotted lines represented in the figure identify the basins of attraction of the vector field and correspond to the solutions of the explicit equation in η for which the Newton step u is not defined. This vector field can be compared to that in Figure 7, associated with the Newton step for F(η) using the Euclidean geometry. In the Euclidean geometry, F(η) is a quadratic function with one saddle point, so that from any η, the Newton step points in the direction of the critical point. This makes the Newton step unsuitable for an optimization algorithm. On the other hand, in the Riemannian geometry, the vertices of the polytope are critical points for F(η), and they determine the presence of multiple basins of attraction, as expected.
Figure 8 shows the Newton step in the θ parameters based on the e-retraction of Equation (149), while Figure 9 represents the Newton step evaluated with respect to the Euclidean geometry. A comparison of the two vector fields shows that, differently from the η parameters, the number of basins of attraction is the same in the two geometries; however, the scale of the vectors is different. In particular, notice how on the plateau, for diverging θ, the Newton step in the Euclidean geometry vanishes, while in the Riemannian geometry, it gets larger. This behavior suggests better convergence properties for an optimization algorithm based on the Newton step evaluated using the proper Riemannian geometry. In the θ parameters, the boundaries of the basins of attraction represented by the red dotted lines have been computed numerically and correspond to the values of θ for which the update step is not defined.
Finally, notice that in both the η and θ parameters, the step is not always in the direction of descent for the function, a common behavior of the Newton method, which converges to the critical points.
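The η-parameter iteration of this example can be sketched as follows. The code is ours: it uses the closed-form gradient and Hessian of the example above together with the Riemannian retraction, and illustrates convergence to the vertex (1, 1), where f attains its maximum.

```python
import numpy as np

# Damped Newton iteration in the eta parameters with the Riemannian
# retraction, for f = a1*x1 + a2*x2 + a12*x1*x2 (sample coefficients
# below are the ones used in the figures).

a1, a2, a12 = 1.0, 2.0, 3.0
lam = 0.05  # step-size control

def newton_step(eta):
    g = np.array([a12 * eta[1] + a1, a12 * eta[0] + a2])      # gradient
    hess = np.array([
        [-(a12 * eta[1] + a1) * eta[0], a12 * (1 - eta[1] ** 2)],
        [a12 * (1 - eta[0] ** 2), -(a12 * eta[0] + a2) * eta[1]],
    ])
    return -np.linalg.solve(hess, g)

def retract(eta, u):
    # Riemannian retraction Exp_eta(u), componentwise
    return np.sin(np.arcsin(eta) + np.sqrt(1 - eta**2) * u)

eta = np.array([0.9, 0.9])
for _ in range(200):
    eta = retract(eta, lam * newton_step(eta))

# from this starting point the iterates converge to the vertex (1, 1)
assert np.all(eta > 0.99)
```

Starting points in other basins of attraction converge to other critical points, matching the red dotted lines of Figure 6.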

## 5. Discussion and Conclusions

In this paper, we introduced second-order calculus over a statistical manifold, following the approach described in , which has been adapted to the special case of exponential statistical models [2,3]. By defining the Riemannian Hessian and using the notion of retraction, we developed the proper machinery necessary for the definition of the updating rule of the Newton method for the optimization of a function defined over an exponential family.
The examples discussed in the paper show that by taking into account the proper Riemannian geometry of a statistical exponential family, the vector fields associated with the Newton step in the different parametrizations change profoundly. Not only do new basins of attraction associated with local and global minima appear, as for the expectation parameters, but the magnitude of the Newton step is also affected, as over the plateau in the natural parameters. Such differences are expected to have a strong impact on the performance of an optimization algorithm based on the Newton step, from both the point of view of achievable convergence and the speed of convergence to the optimum.
The Newton method is a popular second order optimization technique based on the computation of the Hessian of the function to be optimized and is well known for its super-linear convergence properties. However, the use of the Newton method poses a number of issues in practice.
First of all, as the examples in Figures 6 and 8 show, the Newton step does not always point in the direction of the natural gradient, and the algorithm may not converge to a (local) optimum of the function. Such behavior is not unexpected; indeed, the Newton method tends to converge to critical points of the function to be optimized, which include local minima, local maxima and saddle points. In order to obtain a direction of ascent for the function to be optimized, the Hessian must be negative-definite, i.e., its eigenvalues must be strictly negative, which is not guaranteed in the general case. Another important remark is related to the computational complexity associated with the evaluation of the Hessian, compared to the (natural) gradient. Indeed, to obtain the Newton step, the Christoffel matrices have to be evaluated, together with the third-order covariances between the sufficient statistics and the function, and the Hessian has to be inverted. Finally, notice that when the Hessian is close to being non-invertible, numerical problems may arise in the computation of the Newton step, and the algorithm may become unstable and diverge.
In the literature, different methods have been proposed to overcome these issues. Among them, we mention quasi-Newton methods, where the update vector is obtained using a modified Hessian, which has been made negative-definite, for instance, by adding a proper correction matrix.
This paper represents the first step in the design of an algorithm based on the Newton method for the optimization over a statistical model. The authors are working on the computational aspects related to the implementation of the method, and a new paper with experimental results is in progress.

## Acknowledgments

Luigi Malagò was supported by the Xerox University Affairs Committee Award and by de Castro Statistics, Collegio Carlo Alberto, Moncalieri. Giovanni Pistone is supported by de Castro Statistics, Collegio Carlo Alberto, Moncalieri, and is a member of GNAMPA–INdAM, Roma.

## Author Contributions

All authors contributed to the design of the research. The research was carried out by all authors. The study of the Hessian and of the Newton method in statistical manifolds was originally suggested by Luigi Malagò. The manuscript was written by Luigi Malagò and Giovanni Pistone. All authors have read and approved the final manuscript.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

1. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes. Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986; p. 283.
2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; p. 206.
3. Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information; Proceedings of the First International Conference, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36.
4. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008; p. xvi+224.
5. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006; p. xxii+664.
6. Do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; p. xiv+300.
7. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; p. x+654.
8. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1995; p. xiv+364.
9. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2010.
10. Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution algorithms: Boundary analysis. In Proceedings of the 2008 GECCO Conference Companion on Genetic and Evolutionary Computation (GECCO '08); ACM: New York, NY, USA, 2008; pp. 2081–2088.
11. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
12. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical Covariances. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
13. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA '11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
14. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
15. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
16. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007; p. xiv+246.
17. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 2012.
18. Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin, Germany, 1999; p. xiv+339.
19. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
20. Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851.
21. Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev. 2005, 796–816.
22. Stein, W.; et al. Sage Mathematics Software (Version 6.0); The Sage Development Team, 2013. Available online: http://www.sagemath.org (accessed on 27 March 2014).
Figure 1. Relaxation of the Function (2) on the independence model. a1 = 1, a2 = 2, a12 = 3.
Figure 2. Gradient flow of the Function (2). The domain has been increased to include values outside the square [−1, +1]2.
Figure 3. Gradient flow (blue line) and natural gradient flow (black line) for the Function (2), starting at (−1/4, −1/4).
Figure 4. Normalized polarization.
Figure 5. Geodesics from η = (0.75, 0.75).
Figure 6. The Newton step in the η parameters, Riemannian retraction, λ = 0.05. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined; cf. Equation (150). The instability close to the critical lines is represented by the longer arrows.
Figure 7. The Newton step in the η parameters, Euclidean geometry, λ = 0.05.
Figure 8. The Newton step in the θ parameters, exponential retraction, λ = 0.015. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined. The instability along the critical lines, which identifies the basins of attraction, is not represented.
Figure 9. The Newton step in the θ parameters, Euclidean geometry, λ = 0.15. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined. The instability along the critical lines, which identifies the basins of attraction, is not represented.

Malagò, L.; Pistone, G. Combinatorial Optimization with Information Geometry: The Newton Method. Entropy 2014, 16, 4260-4289. https://doi.org/10.3390/e16084260