Combinatorial Optimization with Information Geometry: The Newton Method
1. Introduction
E_{p(t)}[s(t)] = 0. Because of that, the tangent space at each density p of the model is a vector space of random variables with zero expectation at p. A vector field is a mapping from p to a random variable V(p), such that, for all p in the model, the random variable V(p) is centered at p, E_p[V(p)] = 0. In other words, each point of the manifold has a different tangent space, and this tangent space can be used as a non-parametric model space of the manifold. In this formalism, a vector field is a mapping from densities to centered random variables; that is, it is what in statistics is called a pivot of the statistical model. To avoid confusion with the product of random variables, we do not use the standard notation for the action of a vector field on a real function. This approach is possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to the general state space; see the discussion in [9] and the review in [3].
E_p[f] < max_{x∈Ω} f(x), unless f is constant. We are looking for a sequence p_n, n ∈ ℕ, such that E_{p_n}[f] → max_{x∈Ω} f(x) as n → ∞. The existence of such a sequence is a nontrivial condition for the model. Precisely, the closure of the model must contain a density whose support is contained in the set of maxima {x ∈ Ω | f(x) = max f}. This condition is satisfied, for example, by the independence model generated by V = Span{X_1, X_2}.

2. Models on a Finite State Space
- Two different exponential families can actually be the same statistical model, because the sets of densities in the two exponential families coincide. This fact is due both to the arbitrariness of the reference density and to the fact that the sufficient statistics are just one vector basis of the vector space they generate. In a non-parametric approach, we can refer directly to the vector space of centered log-densities, while the change of reference density is geometrically interpreted as a change of chart. The set of all such charts defines a manifold.
- We make a specific interpretation of the tangent bundle as the collection of the vector spaces of Fisher's scores at each density and use each such tangent space as a space of coordinates. This produces a different tangent space/space of coordinates at each density, and different tangent spaces are mapped one onto another by a proper parallel transport, which is nothing other than the re-centering of random variables.
- If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new chart, whose values are real vectors. In the real parametrization, the natural scalar product on each space of scores is given by Fisher's information matrix.
- Riemannian gradients are defined in the usual way. It is customary in information geometry to call "natural gradient" the real-coordinate presentation of the Riemannian gradient. The natural gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean gradient. It seems that there are three gradients involved, but they all represent the same object when correctly understood; see the summary display after this list.
- The classical notion of expectation parameters for exponential families carries over as another chart on the statistical manifold, which gives rise to a further presentation of a geometrical object.
- While the statistical manifold is unique, there are at least three relevant connections as structures on the vector bundles of the manifold: one relating to the exponential charts, one relating to the expectation charts and one depending on the Riemannian structure.
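In the notation introduced in the following sections (φ a smooth real function on the manifold, φ_p its representation in the chart centered at p, I(p) the Fisher information matrix and X_1, . . . , X_m the sufficient statistics), the three gradients mentioned in the list above can be collected in a single display; it only restates formulas that appear in Sections 2.3 and 2.4:

$$ \tilde\nabla\varphi(p) = I(p)^{-1}\,\nabla\varphi_p(0), \qquad \nabla\varphi(p) = \sum_{i=1}^m \big(\tilde\nabla\varphi\big)_i(p)\,\big(X_i - \mathrm{E}_p[X_i]\big), \qquad g_p\big(\nabla\varphi(p), V(p)\big) = D_V\varphi(p). $$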
2.1. Exponential Families As Manifolds
Let B = Span{X_1, . . . , X_m} be a vector space of random variables on Ω and define the following exponential family ℰ of positive densities in ℘_>:
$$ \mathcal{E} = \Big\{ q = e^{U - k_p(U)}\, p \;:\; U \in B_p \Big\}, \qquad k_p(U) = \log \mathrm{E}_p\!\left[e^{U}\right], $$
where B_p denotes the space of the p-centered statistics X_j − E_p[X_j], j = 1, . . . , m. If q ∈ ℰ, then there exists a unique set of parameters θ = θ_p(q), such that:
$$ \log\frac{q}{p} - \mathrm{E}_p\!\left[\log\frac{q}{p}\right] = \sum_{j=1}^m \theta_j\,\big(X_j - \mathrm{E}_p[X_j]\big), $$
where X_j − E_p[X_j] is the centering of X_j at p. The parametrization e_p : θ ↦ q is one-to-one on ℝ^m, and the family of centered statistics X_j − E_p[X_j], j = 1, . . . , m, is a basis of B_p. We view each choice of a specific reference p as providing a chart centered at p on the exponential family ℰ, namely the chart σ_p that maps each q ∈ ℰ to the vector θ_p(q) ∈ ℝ^m of the coordinates, in the basis of the centered statistics, of the p-centered part of the log-likelihood. The parametrization e_p is
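To make the chart concrete, here is a minimal numerical sketch (ours, not the paper's code; the state space Ω = {+1, −1}², the uniform reference and all variable names are illustrative assumptions). It builds q = exp(U − k_p(U)) p for a given θ and checks that the p-centered log-likelihood returns θ in the basis of the centered statistics.

```python
import itertools
import numpy as np

# State space Omega = {+1,-1}^2, sufficient statistics X1, X2 (coordinate projections),
# uniform reference density p.
omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)  # shape (4, 2)
X = omega
p = np.full(len(omega), 0.25)

def e_p(theta):
    """Parametrization theta -> q = exp(U - k_p(U)) * p, with U = sum_j theta_j X_j."""
    U = X @ theta
    k = np.log(np.sum(p * np.exp(U)))          # k_p(U) = log E_p[exp(U)]
    return p * np.exp(U - k)

theta = np.array([0.7, -0.3])
q = e_p(theta)
print(np.isclose(q.sum(), 1.0))                # True: q is a probability density

# Chart: the p-centered log-likelihood, written in the basis of the p-centered statistics,
# recovers the natural parameters theta.
loglik = np.log(q / p)
centered = loglik - np.sum(p * loglik)
Xc = X - np.sum(p[:, None] * X, axis=0)        # p-centered statistics (here E_p[X] = 0)
theta_rec = np.linalg.lstsq(Xc, centered, rcond=None)[0]
print(np.allclose(theta_rec, theta))           # True
```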
the inverse of σ_p, i.e., e_p(σ_p(q)) = q for all q ∈ ℰ and σ_p(e_p(θ)) = θ for all θ ∈ ℝ^m.

2.2. Change of Chart
Let p̄ ∈ ℰ; then, we can express each q ∈ ℰ, q = exp(U − k_p(U)) p, U ∈ B_p, k_p(U) = log(E_p[e^U]), in the chart centered at p̄. Since the sufficient statistics are the same and only the centering changes, writing Ū = E_p̄[U], the p̄-centered part of the log-likelihood of q is:
$$ (U - \bar U) + \Big(\log\tfrac{p}{\bar p} - \mathrm{E}_{\bar p}\!\big[\log\tfrac{p}{\bar p}\big]\Big), \qquad \bar U = \mathrm{E}_{\bar p}[U], $$
and, in coordinates, we can write:
$$ \theta_{\bar p}(q) = \theta_p(q) - \theta_p(\bar p). $$
In particular, the charts are affinely related, so that they define an affine atlas
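As a numerical check of the change of chart on the same toy family (again an illustrative sketch under the assumptions of the previous listing), moving the reference density inside the family only translates the natural parameters: θ_p̄(q) = θ_p(q) − θ_p(p̄).

```python
import itertools
import numpy as np

omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)
X = omega
lam = np.full(len(omega), 0.25)                # uniform reference density

def e(theta, ref):
    """q = exp(U - k_ref(U)) * ref, with U = X @ theta."""
    U = X @ theta
    return ref * np.exp(U - np.log(np.sum(ref * np.exp(U))))

def chart(q, ref):
    """Natural parameters of q in the chart centered at ref: coordinates of the
    ref-centered log-likelihood in the basis of the ref-centered statistics."""
    ll = np.log(q / ref)
    ll = ll - np.sum(ref * ll)
    Xc = X - np.sum(ref[:, None] * X, axis=0)
    return np.linalg.lstsq(Xc, ll, rcond=None)[0]

theta_q, theta_pbar = np.array([0.9, -0.4]), np.array([0.2, 0.5])
q, pbar = e(theta_q, lam), e(theta_pbar, lam)

# Change of chart: the parameters of q w.r.t. pbar are the parameters w.r.t. lam, translated.
print(np.allclose(chart(q, pbar), theta_q - theta_pbar))   # True
```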
(ℰ, (σ_p)). This fact has deep consequences that we do not discuss here, e.g., our manifold is an instance of a Hessian manifold [16].

2.3. Tangent Bundle
The velocity of a smooth curve through p, expressed as a Fisher score, is a p-centered random variable, and it is identified with an element of the tangent space of the manifold at p, T_pℰ = B_p; see the discussion in [3,9]. Let us check the consistency of this statement with our θ-parametrization. Let t ↦ q(t) be a smooth curve in ℰ. In the chart centered at q(0), we have from Equation (12) that the velocity δq(0) = (d/dt) log q(t)|_{t=0} belongs to B_{q(0)} and is represented by its coordinates in the basis of the centered statistics, i.e., by σ̇_{q(0)}(δq(0)) ∈ ℝ^m. The tangent bundle Tℰ as a manifold is defined by the charts (σ_p, σ̇_p) on the domain:
$$ \big\{ (q, W) : q \in \mathcal{E},\; W \in T_q\mathcal{E} \big\}. $$
Given two vector fields V and W, i.e., V(p), W(p) ∈ T_pℰ = B_p, p ∈ ℰ, we define a Riemannian metric g(V, W) by:
$$ g_p\big(V(p), W(p)\big) = \mathrm{E}_p\big[V(p)\,W(p)\big], $$
that is, the covariance at p of the two centered random variables; in the θ-coordinates, g_p is represented by the Fisher information matrix I(p) = Cov_p(X, X).
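The following sketch (illustrative, same assumed toy family as above) verifies that the inner product of two tangent vectors, E_p[VW], agrees with the Fisher quadratic form v⊤ I(p) w in the θ-coordinates.

```python
import itertools
import numpy as np

omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)
X = omega
lam = np.full(len(omega), 0.25)

def e(theta):
    U = X @ theta
    return lam * np.exp(U - np.log(np.sum(lam * np.exp(U))))

p = e(np.array([0.3, -0.8]))
Xc = X - np.sum(p[:, None] * X, axis=0)          # basis of the tangent space B_p at p
I = (Xc * p[:, None]).T @ Xc                     # Fisher information: Cov_p(X, X)

v, w = np.array([1.0, -2.0]), np.array([0.5, 0.25])
V, W = Xc @ v, Xc @ w                            # tangent vectors as centered random variables

g_random_variables = np.sum(p * V * W)           # g_p(V, W) = E_p[V W]
g_fisher = v @ I @ w                             # same metric in theta-coordinates
print(np.allclose(g_random_variables, g_fisher)) # True
```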
2.4. Gradients
Given a smooth real function φ : ℰ → ℝ, let φ_p = φ ∘ e_p be its representation in the chart centered at p. The natural gradient of φ, ∇̃φ : ℰ → ℝ^m, is defined by:
$$ \tilde\nabla\varphi(p) = I(p)^{-1}\,\nabla\varphi_p(0), $$
where ∇φ_p(0) is the Euclidean gradient of the chart representation, computed for each p ∈ ℰ and θ ∈ ℝ^m at θ = 0, and I(p) is the Fisher information matrix; see [15]. The Riemannian gradient is a standard notion in Riemannian geometry; cf. [4] (p. 46): it is the vector field ∇φ, such that D_Yφ = g(∇φ, Y) for every vector field Y. Note that the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in ℝ^m. We compute the Riemannian gradient at p as follows. If y = σ̇_p(Y(p)), then D_Yφ(p) = ∇φ_p(0) · y = g_p(∇φ(p), Y(p)). Explicitly, we have (see Equation (22)):
$$ \nabla\varphi(p) = \sum_{i=1}^m \big(\tilde\nabla\varphi\big)_i(p)\,\big(X_i - \mathrm{E}_p[X_i]\big), \qquad \nabla\varphi_p(0)_i = \mathrm{E}_p\big[\nabla\varphi(p)\,X_i\big]. $$
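A minimal sketch of the three gradients for the relaxation φ(p) = E_p[f] (anticipating Section 2.5.1; the toy state space, the objective values and the test direction below are assumptions, and the code is ours): the Euclidean gradient of the chart representation, the natural gradient through the inverse Fisher matrix, and the Riemannian gradient as a centered random variable satisfying D_Yφ = g(∇φ, Y).

```python
import itertools
import numpy as np

omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)
X = omega
lam = np.full(len(omega), 0.25)
f = np.array([3.0, 1.0, 0.0, -2.0])            # an arbitrary objective on Omega

def e(theta):
    U = X @ theta
    return lam * np.exp(U - np.log(np.sum(lam * np.exp(U))))

theta0 = np.array([0.2, -0.5])
p = e(theta0)
Xc = X - np.sum(p[:, None] * X, axis=0)
I = (Xc * p[:, None]).T @ Xc                   # Fisher information at p

# Euclidean gradient of theta -> E_{e(theta)}[f] at theta0 is Cov_p(f, X);
# check it against a central finite difference.
euclid = Xc.T @ (p * f)
eps = 1e-6
fd = np.array([(np.sum(e(theta0 + eps * v) * f) - np.sum(e(theta0 - eps * v) * f)) / (2 * eps)
               for v in np.eye(2)])
print(np.allclose(euclid, fd, atol=1e-5))      # True

natural = np.linalg.solve(I, euclid)           # natural gradient: I(p)^{-1} * Euclidean gradient
riemannian = Xc @ natural                      # Riemannian gradient: a p-centered random variable

# Defining property of the Riemannian gradient: D_Y phi = g(grad phi, Y).
y = np.array([1.3, 0.7])
print(np.allclose(euclid @ y, np.sum(p * riemannian * (Xc @ y))))   # True
```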
2.4.1. Expectation Parameters
The expectation (or mean) parameters of the exponential family are the expected values of the sufficient statistics X; see [1]. The function μ_p : ℰ → ℝ^m defined by:
$$ \mu_p(q) = \mathrm{E}_q\big[X - \mathrm{E}_p[X]\big] $$
is one-to-one. The value of the inverse q = L_p(μ) is characterized as the unique q ∈ ℰ, such that μ = E_q[X − E_p[X]], i.e., the maximum likelihood estimator. If p̄ ∈ ℰ is a second reference density, then μ = μ_p(L_p(μ)) = E_q[X − E_p[X]] and μ̄ = μ_p̄(L_p(μ)) = E_q[X − E_p̄[X]], so that the two presentations of the expectation parameters differ by a constant vector. Moreover, for each p ∈ ℰ and θ ∈ ℝ^m, we have E_{e_p(θ)}[X − E_p[X]] = ∇_θ k_p(∑_j θ_j (X_j − E_p[X_j])); hence, the map from the natural to the expectation parameters is the gradient of the cumulant generating function, and its Jacobian is the Fisher information matrix.
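A small numerical sketch (same illustrative assumptions as in the previous listings) of the expectation parameters: the map θ ↦ E_θ[X] is the gradient of the cumulant generating function, and its Jacobian is the Fisher information matrix.

```python
import itertools
import numpy as np

omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)
X = omega
lam = np.full(len(omega), 0.25)                 # uniform reference: E_lam[X] = 0, so centered
                                                # and non-centered statistics coincide here

def k(theta):
    """Cumulant generating function k(theta) = log E_lam[exp(theta . X)]."""
    return np.log(np.sum(lam * np.exp(X @ theta)))

def density(theta):
    return lam * np.exp(X @ theta - k(theta))

def mu(theta):
    """Expectation parameters: mu(theta) = E_theta[X]."""
    return X.T @ density(theta)

theta = np.array([0.4, -0.9])
q = density(theta)
Xc = X - mu(theta)
fisher = (Xc * q[:, None]).T @ Xc               # Cov_theta(X, X)

eps = 1e-6
grad_k = np.array([(k(theta + eps * v) - k(theta - eps * v)) / (2 * eps) for v in np.eye(2)])
jac_mu = np.column_stack([(mu(theta + eps * v) - mu(theta - eps * v)) / (2 * eps)
                          for v in np.eye(2)])

print(np.allclose(grad_k, mu(theta), atol=1e-5))   # mu is the gradient of k
print(np.allclose(jac_mu, fisher, atol=1e-4))      # the Jacobian of mu is the Fisher matrix
```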
2.4.2. Vector Fields
Given a vector field V and a real function φ : ℰ → ℝ, we define the action of V on φ, ∇_Vφ, to be the real function:
$$ \nabla_V\varphi(p) = D_V\varphi(p) = g_p\big(\nabla\varphi(p), V(p)\big). $$
A (local) flow of the vector field V is a family of curves γ(t, p), p ∈ ℰ, t ∈ J_p, J_p an open real interval containing zero, such that, for all p ∈ ℰ and t ∈ J_p, γ(0, p) = p and the velocity of the curve equals the value of the vector field along the curve, δγ(t, p) = V(γ(t, p)).
2.5. Examples
2.5.1. Expectation
Given f : Ω → ℝ, define F : ℰ → ℝ by F(p) = E_p[f]. In the chart centered at p, we have:
$$ F_p(\theta) = \mathrm{E}_{e_p(\theta)}[f], \qquad \nabla F_p(0) = \mathrm{Cov}_p(f, X). $$
The Riemannian gradient ∇F(p) is the orthogonal projection of f − E_p[f] onto T_pℰ = B_p, while ∇̃F(p) in Equation (54) are the coordinates of the projection. Let us consider the family of curves:
$$ \gamma(t, p) = \exp\big(t\,\nabla F(p) - k_p(t\,\nabla F(p))\big)\,p, $$
which is well defined because the exponent belongs to B_p ⊕ ℝ. Then, γ is not, in general, the flow of ∇F, but it is a local approximation, as δγ(0, p) = ∇F(p).
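The following illustrative sketch (the objective values, the step size and the number of iterations are assumptions, and the code is ours, not the paper's) runs natural-gradient ascent on the relaxation F(p) = E_p[f] for the toy family; the expected value increases towards max_x f(x), providing a sequence p_n of the kind discussed in the Introduction.

```python
import itertools
import numpy as np

omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)
X = omega
lam = np.full(len(omega), 0.25)
f = np.array([3.0, 1.0, 0.0, -2.0])          # objective on Omega; max f = 3 at x = (+1, +1)

def density(theta):
    U = X @ theta
    return lam * np.exp(U - np.log(np.sum(lam * np.exp(U))))

theta, step = np.zeros(2), 0.5
for _ in range(20):
    q = density(theta)
    Xc = X - X.T @ q                         # statistics centered at the current density
    grad = Xc.T @ (q * f)                    # Euclidean gradient: Cov_q(f, X)
    fisher = (Xc * q[:, None]).T @ Xc
    theta = theta + step * np.linalg.solve(fisher, grad)   # natural-gradient ascent step

print(np.sum(density(theta) * f))            # approaches max f = 3
```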
2.5.2. Binary Independent Variables
Let Ω = {+1, −1}^d and let B be generated by the coordinate projections {X_1, . . . , X_d}. Note that we use here the coding +1, −1 (from physics) instead of the coding 0, 1, which is more common in combinatorial optimization. The exponential family is:
$$ \mathcal{E} = \Big\{ \exp\Big(\sum_{j=1}^d \theta_j X_j - \psi(\theta)\Big)\,\lambda \;:\; \theta \in \mathbb{R}^d \Big\}, $$
with λ(x) = 2^{−d} for x ∈ Ω being the uniform density. The independence of the sufficient statistics X_j under all distributions in ℰ implies that the Fisher information matrix is diagonal:
$$ \mathrm{Cov}_p(X_i, X_k) = \mathrm{E}_p[X_i X_k] - \mathrm{E}_p[X_i]\,\mathrm{E}_p[X_k] = \delta_{ik}\big(1 - \mathrm{E}_p[X_i]^2\big); $$
see [13].
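A quick numerical confirmation (illustrative sketch, dimension and parameter values assumed) that, for the independence model with the ±1 coding, the Fisher information matrix is diagonal with entries 1 − E_p[X_i]².

```python
import itertools
import numpy as np

d = 3
omega = np.array(list(itertools.product([1, -1], repeat=d)), dtype=float)  # {+1,-1}^d
X = omega
lam = np.full(len(omega), 2.0 ** (-d))                                     # uniform density

def density(theta):
    U = X @ theta
    return lam * np.exp(U - np.log(np.sum(lam * np.exp(U))))

theta = np.array([0.4, -1.1, 0.3])
p = density(theta)
m = X.T @ p                                   # marginal expectations E_p[X_i]
Xc = X - m
I = (Xc * p[:, None]).T @ Xc                  # Fisher information Cov_p(X, X)

print(np.allclose(I, np.diag(1.0 - m ** 2)))  # True: diagonal, entries 1 - E_p[X_i]^2
```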
2.5.3. Escort Probabilities
Given a > 0, consider the function C^{(a)} defined by C^{(a)}(p) = ∫ p^a dμ. We have the escort mapping ℰ ∋ p ↦ p^a/C^{(a)}(p) ∈ ℘_>. Let us extend the basis X_1, . . . , X_m to a basis X_1, . . . , X_n, n ≥ m, whose exponential family is full, i.e., equal to ℘_>. The non-parametric coordinate of the escort density of p = e_p̄(θ), in the chart centered at the escort density of p̄, is the centering of the random variable a ∑_{j=1}^m θ_j X_j; hence, its coordinates with respect to the basis X_1, . . . , X_n are (aθ_1, . . . , aθ_m, 0, . . . , 0), and the Jacobian of θ ↦ (aθ, 0_{n−m}) is the m × n matrix [aI_m | 0_{m×(n−m)}].
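As an illustrative check of the escort construction on the toy family (uniform reference, so the escort of the reference is the reference itself; the exponent a and the parameters are assumptions): raising a density of the family to the power a and renormalizing multiplies its natural parameters by a.

```python
import itertools
import numpy as np

omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)
X = omega
lam = np.full(len(omega), 0.25)              # uniform reference; its escort is lam itself

def density(theta):
    U = X @ theta
    return lam * np.exp(U - np.log(np.sum(lam * np.exp(U))))

def natural_parameters(q, ref):
    """Coordinates of q in the chart centered at ref."""
    ll = np.log(q / ref)
    ll = ll - np.sum(ref * ll)
    Xc = X - np.sum(ref[:, None] * X, axis=0)
    return np.linalg.lstsq(Xc, ll, rcond=None)[0]

a = 2.5
theta = np.array([0.6, -0.2])
p = density(theta)
escort = p ** a / np.sum(p ** a)             # escort density p^a / C^(a)(p)

print(np.allclose(natural_parameters(escort, lam), a * theta))   # True
```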
2.5.4. Polarization Measure
A further example is given by the polarization measures of [20,21], which are again real functions of the density p ∈ ℰ.

3. Second Order Calculus
3.1. Metric Derivative (Levi–Civita connection)
Let V and W be vector fields, that is, V(p), W(p) ∈ T_pℰ = B_p, p ∈ ℰ. Consider the real function R = g(V, W) : ℰ → ℝ, whose value at p ∈ ℰ is R(p) = g_p(V(p), W(p)) = E_p[V(p)W(p)]. Assuming smoothness, we want to compute the derivative of R along the vector field Y : ℰ → Tℰ, that is, (D_Y R)(p) = dR_p(0)α, with α = σ̇_p(Y(p)). The expression of R in the chart centered at p is, according to Equation (27), the θ-dependent scalar product of the coordinate expressions of V and W; its derivative at θ = 0 contains, besides the derivatives of the coordinates of V and W, a further term produced by the θ-dependence of the scalar product. This extra term is what defines the metric derivative.

Proposition 1. There exists a vector field D_Y V, the metric (Levi–Civita) derivative of V along Y, whose value at q = e_p(θ) has coordinates centered at p given by the differentiation described above; moreover, for all vector fields V, W and Y, we have:
$$ D_Y\, g(V, W) = g(D_Y V, W) + g(V, D_Y W). $$
3.2. Acceleration
Let I ∋ t ↦ p(t) be a smooth curve in ℰ. Then, the velocity δp(t) = (d/dt) log p(t) is a vector field V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of the velocity field, we can compute the metric derivative of the velocity along the velocity itself, D_{δp}δp, from Equation (91) with V(p(0)) = δp(0); we can use Equation (91) to get the (Riemannian) acceleration of the curve, the mapping I → Tℰ
defined by t ↦ D_{δp}δp(t).

3.2.1. Example: Binary Independent Variables (Section 2.5.2 Continued)
Consider again the independence model of Section 2.5.2, with uniform reference density λ, and a smooth curve I ∋ t ↦ p(t) ∈ ℰ with σ_λ(p(t)) = θ(t). If (p, U) ∈ Tℰ, that is, p ∈ ℰ and U ∈ T_pℰ, then σ_λ(p) = θ(0) and U = ∑_j u_j (X_j − E_p[X_j]), with σ̇_λ(U) = u. In this chart, the velocity and the acceleration of the curve can be computed componentwise, because the Fisher information matrix of the independence model is diagonal.
3.3. Riemannian Hessian
Let φ : ℰ → ℝ be a smooth function with Riemannian gradient ∇φ(p) = ∑_i (∇̃φ)_i(p) (X_i − E_p[X_i]), p ∈ ℰ. The Riemannian Hessian of φ is the metric derivative of the gradient ∇φ along the vector field Y, that is, Hess_Y φ = D_Y∇φ; see [6] (Ch. 6, Ex. 11), [4] (§5.5). In the following, we denote by the symbol Hess, without a subscript, the ordinary Hessian matrix. The Riemannian Hessian is characterized by the values of the bilinear form g(Hess_Y φ, X) for all vector fields X. We have from Equation (126), with α = σ̇_p(Y(p)) and β = σ̇_p(X(p)), an expression of this bilinear form in the chart centered at p: it is the ordinary Hessian of the chart representation φ_p evaluated at (α, β), plus a correction term that involves the gradient and vanishes at critical points; see the comments in [4] (Prop. 5.5.2-3). Note that the same expression characterizes when the Riemannian Hessian is positive definite.
4. Application to Combinatorial Optimization
4.1. Hessian of a Relaxed Function
Consider the relaxation of f, that is, the function p ↦ E_p[f], p ∈ ℘_>. Define F(p) to be the projection of f, as an element of L²(p), onto T_pℰ = B_p, i.e., F(p) is the element of B_p, such that:
$$ \mathrm{E}_p\big[F(p)\,\big(X_i - \mathrm{E}_p[X_i]\big)\big] = \mathrm{Cov}_p(f, X_i), \qquad i = 1, \dots, m. $$
In the basis of the centered statistics, we have F(p) = ∑_i f̂_{p,i} (X_i − E_p[X_i]) and:
$$ \hat f_p = \mathrm{Cov}_p(X, X)^{-1}\,\mathrm{Cov}_p(f, X). $$
Consider now the restriction of the relaxation to the exponential family, φ(p) = E_p[f] : ℰ → ℝ. We have φ_p(θ) = E_{e_p(θ)}[f], and from the properties of exponential families, the Euclidean gradient is ∇φ_p(0) = Cov_p(f, X). It follows that the natural gradient is:
$$ \tilde\nabla\varphi(p) = I(p)^{-1}\,\mathrm{Cov}_p(f, X) = \mathrm{Cov}_p(X, X)^{-1}\,\mathrm{Cov}_p(f, X) = \hat f_p, $$
i.e., the natural gradient of the relaxed function is the vector of the coordinates of the projection of f onto the tangent space. Moreover, a direct computation gives the ordinary Hessian of the chart representation:
$$ \operatorname{Hess}\varphi_p(0) = \mathrm{E}_p\big[f\,(X - \mathrm{E}_p[X])(X - \mathrm{E}_p[X])^\top\big] - \mathrm{E}_p[f]\, I(p). $$

4.1.1. Example: Binary Independent Variables (Sections 2.5.2 and 3.2.1 Continued)
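The computations of Section 4.1 can be sketched numerically for the independence model as follows (the dimension, the random objective and all names are assumptions, and the code is ours): the natural gradient is obtained componentwise thanks to the diagonal Fisher matrix, and the Hessian identity displayed above is checked against a finite difference.

```python
import itertools
import numpy as np

d = 3
omega = np.array(list(itertools.product([1, -1], repeat=d)), dtype=float)
X = omega
lam = np.full(len(omega), 2.0 ** (-d))
f = np.random.default_rng(0).normal(size=len(omega))   # an arbitrary objective on Omega

def density(theta):
    U = X @ theta
    return lam * np.exp(U - np.log(np.sum(lam * np.exp(U))))

def phi(theta):
    return np.sum(density(theta) * f)        # relaxed function E_theta[f]

theta = np.array([0.3, -0.6, 0.1])
q = density(theta)
m = X.T @ q
Xc = X - m
cov_fX = Xc.T @ (q * f)                      # Cov_q(f, X)
fisher = (Xc * q[:, None]).T @ Xc            # diagonal: diag(1 - m**2)

# Natural gradient: with the diagonal Fisher matrix, it is computed componentwise.
natural_grad = cov_fX / (1.0 - m ** 2)
print(np.allclose(natural_grad, np.linalg.solve(fisher, cov_fX)))   # True

# Ordinary Hessian of the chart representation, E_q[f (X-m)(X-m)^T] - E_q[f] * Fisher,
# checked against a second-order central finite difference.
hess = (Xc * (q * f)[:, None]).T @ Xc - phi(theta) * fisher
eps = 1e-4
num = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei, ej = eps * np.eye(d)[i], eps * np.eye(d)[j]
        num[i, j] = (phi(theta + ei + ej) - phi(theta + ei - ej)
                     - phi(theta - ei + ej) + phi(theta - ei - ej)) / (4 * eps ** 2)
print(np.allclose(hess, num, atol=1e-5))     # True
```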
4.2. Newton Method
We look for a (local) maximum p̂ of F(p) = E_p[f], p ∈ ℰ, that is, a critical point of the vector field p ↦ ∇F(p), ∇F(p̂) = 0. Here, we follow [4] (Ch. 5–6), and in particular Algorithm 5 on Page 113. A second-order Taylor expansion of F along a curve connecting p = p(0) to p̂ = p(1) shows that the Newton step at p is the tangent direction Y that solves the Newton equation Hess_Y F(p) = −∇F(p); the iteration then moves from p in that direction, in our case along the chart centered at p, and repeats.
4.3. Example: Binary Independent Variables
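To illustrate the Newton iteration, the following sketch (an assumed toy setup, not the paper's experiment) applies a plain Newton step in the θ-chart of the independence model, solving Hess φ(θ) Δ = −∇φ(θ) at each iteration. The ordinary Hessian agrees with the Riemannian Hessian at critical points, while away from them the metric correction term of the full Riemannian Newton method (Algorithm 5 of [4]) is omitted in this sketch. Note also that, for this multilinear toy objective, the interior critical point reached is a saddle of the relaxed function; the sketch only shows the mechanics and the fast local convergence of the iteration.

```python
import itertools
import numpy as np

omega = np.array(list(itertools.product([1, -1], repeat=2)), dtype=float)
X = omega
lam = np.full(len(omega), 0.25)
f = omega[:, 0] * omega[:, 1] + 0.5 * omega[:, 0]   # f(x) = x1*x2 + 0.5*x1

def density(theta):
    U = X @ theta
    return lam * np.exp(U - np.log(np.sum(lam * np.exp(U))))

def grad_and_hess(theta):
    """Euclidean gradient and ordinary Hessian of theta -> E_theta[f] in the theta-chart."""
    q = density(theta)
    m = X.T @ q
    Xc = X - m
    fisher = (Xc * q[:, None]).T @ Xc
    grad = Xc.T @ (q * f)                                       # Cov_theta(f, X)
    hess = (Xc * (q * f)[:, None]).T @ Xc - np.sum(q * f) * fisher
    return grad, hess

theta = np.array([0.3, -0.2])                # starting point near the critical point
for it in range(8):
    grad, hess = grad_and_hess(theta)
    delta = np.linalg.solve(hess, -grad)     # Newton equation: Hess * delta = -grad
    theta = theta + delta
    print(it, np.linalg.norm(grad))          # the gradient norm drops very quickly

# E_theta[f] = m1*m2 + 0.5*m1 with m = tanh(theta); its unique interior critical point
# is at m = (0, -0.5), a saddle of the relaxed function.
print(np.allclose(np.tanh(theta), [0.0, -0.5], atol=1e-8))      # True
```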
5. Discussion and Conclusions
Acknowledgments
Author Contributions
Conflicts of Interest
References
1. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; IMS Lecture Notes—Monograph Series, Volume 9; Institute of Mathematical Statistics: Hayward, CA, USA, 1986; p. 283.
2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; p. 206.
3. Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information, Proceedings of the First International Conference, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36.
4. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008; p. xvi+224.
5. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006; p. xxii+664.
6. Do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; p. xiv+300.
7. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; p. x+654.
8. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1995; p. xiv+364.
9. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2010.
10. Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution algorithms: Boundary analysis. In Proceedings of the 2008 GECCO Conference Companion on Genetic and Evolutionary Computation (GECCO ’08); ACM: New York, NY, USA, 2008; pp. 2081–2088.
11. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
12. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical Covariances. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
13. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
14. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
15. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
16. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007; p. xiv+246.
17. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 2012.
18. Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin, Germany, 1999; p. xiv+339.
19. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
20. Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851.
21. Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev. 2005, 796–816.
22. Stein, W.; et al. Sage Mathematics Software (Version 6.0); The Sage Development Team, 2013. Available online: http://www.sagemath.org (accessed on 27 March 2014).









© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).