
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

We consider three different approaches to define natural Riemannian metrics on polytopes of stochastic matrices. First, we define a natural class of stochastic maps between these polytopes and give a metric characterization of Chentsov type in terms of invariance with respect to these maps. Second, we consider the Fisher metric defined on arbitrary polytopes through their embeddings as exponential families in the probability simplex. We show that these metrics can also be characterized by an invariance principle with respect to morphisms of exponential families. Third, we consider the Fisher metric resulting from embedding the polytope of stochastic matrices in a simplex of joint distributions by specifying a marginal distribution. All three approaches result in slight variations of products of Fisher metrics. This is consistent with the nature of polytopes of stochastic matrices, which are Cartesian products of probability simplices. The first approach yields a scaled product of Fisher metrics; the second, a product of Fisher metrics; and the third, a product of Fisher metrics scaled by the marginal distribution.

The Riemannian structure of a function’s domain has a crucial impact on the performance of gradient optimization methods, especially in the presence of plateaus and local maxima. The natural gradient [

In learning theory, when modeling the policy of a system, it is often preferred to consider stochastic matrices instead of joint probability distributions. For example, in robotics applications, policies are optimized over a parametric set of stochastic matrices by following the gradient of a reward function [

In the first part, we take another look at Lebanon’s approach for characterizing a distinguished metric on polytopes of stochastic matrices. However, since the maps considered by Lebanon do not map stochastic matrices to stochastic matrices, we will use different maps. We show that the product of Fisher metrics can be characterized by an invariance principle with respect to natural maps between stochastic matrices.

In the second part, we consider an approach that allows us to define Riemannian structures on arbitrary polytopes. Any polytope can be identified with an exponential family by using the coordinates of the polytope vertices as observables. The inverse of the moment map then defines an embedding of the polytope in a probability simplex. This embedding can be used to pull back geometric structures from the probability simplex to the polytope, including Riemannian metrics, affine connections, divergences,

In the third part, we return to stochastic matrices. We study natural embeddings of conditional distributions in probability simplices as joint distributions with a fixed marginal. These embeddings define a Fisher metric equal to a weighted product of Fisher metrics. This result corresponds to the definitions commonly used in robotics applications.

All three approaches give very similar results. In all cases, the identified metric is a product metric. This is a sensible result, since the set of k × m stochastic matrices is a Cartesian product of k copies of the simplex ∆_{m−1}. Indeed, this is the result obtained from our second approach. The first approach yields that same result with an additional constant scaling factor.

Which metric to use depends on the concrete problem and whether a natural marginal distribution is defined and known. In Section 7, we do a case study using a reward function that is given as an expectation value over a joint distribution. In this simple example, the weighted product metric gives the best asymptotic rate of convergence, under the assumption that the weights are optimally chosen. In Section 8, we sum up our findings.

The paper is organized as follows. Section 2 contains basic definitions around the Fisher metric and concepts of differential geometry. In Section 3, we discuss the theorems of Chentsov, Campbell and Lebanon, which characterize natural geometric structures on the probability simplex, on the set of positive measures and on the cone of positive matrices, respectively. In Section 4, we study metrics on polytopes of stochastic matrices which are invariant under natural embeddings. In Section 5, we define a Riemannian structure for polytopes which generalizes the Fisher information metric of probability simplices and conditional models in a natural way. In Section 6, we study a class of weighted product metrics. In Section 7, we study the gradient flow with respect to an expectation value. Section 8 contains concluding remarks. The Appendix contains the proof of Theorem 11.

We will consider the simplex ∆_{m−1} of probability distributions on [m] = {1, …, m}. Its relative interior ∆°_{m−1} consists of all strictly positive probability distributions on [m]. The set of k × m stochastic matrices can be identified with the Cartesian product ∏_{i∈[k]} ∆_{m−1}. The relative interior of this polytope is the corresponding product of open simplices.

Given two random variables X and Y with values in [k] and [m], respectively, the conditional distribution of Y given X is described by a stochastic matrix (M_{xy})_{x∈[k], y∈[m]} with rows (M_{xy})_{y∈[m]} ∈ ∆_{m−1} for all x ∈ [k].

The tangent space of ∆°_{n−1} at any point p can be identified with the space of vectors u = (u_1, …, u_n) ∈ ℝ^n satisfying Σ_i u_i = 0.

The Fisher metric on the positive probability simplex ∆°_{n−1} is defined by g^{(n)}_p(u, v) = Σ_{i=1}^{n} u_i v_i / p_i for tangent vectors u, v at p.

The same formula
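For concreteness, the Fisher metric g_p(u, v) = Σ_i u_i v_i / p_i on the open simplex can be evaluated numerically as follows; this is a minimal sketch, and the point p and tangent vectors u, v below are our own illustrative values:

```python
import numpy as np

def fisher_metric_simplex(p, u, v):
    # Fisher (Shahshahani) metric on the open simplex:
    # g_p(u, v) = sum_i u_i v_i / p_i,
    # for tangent vectors u, v whose entries sum to zero.
    return float(np.sum(u * v / p))

p = np.array([0.5, 0.3, 0.2])              # point in the open simplex
u = np.array([0.1, -0.05, -0.05])          # tangent vector (entries sum to 0)
v = np.array([-0.02, 0.01, 0.01])          # another tangent vector
g = fisher_metric_simplex(p, u, v)
```

The metric is positive definite on the tangent space, so g_p(u, u) > 0 for any non-zero tangent vector u.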

Consider a smoothly parametrized family of probability distributions p(θ) ∈ ∆°_{n−1}, with parameter θ = (θ_1, …, θ_d). The pull-back of the Fisher metric g^{(n)} induces a Riemannian metric on the parameter space with components g_{ij}(θ) = Σ_x (∂_i p_x(θ)) (∂_j p_x(θ)) / p_x(θ), where ∂_i = ∂/∂θ_i.

Here, it is not necessary to assume that the parametrization θ ↦ p(θ) is an embedding; in general, one only obtains a (possibly degenerate) pull-back of g^{(n)}.
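As a numerical sketch of this pull-back construction, the components g_{ij}(θ) can be approximated by finite differences; the one-parameter two-state family below is our own illustrative choice:

```python
import numpy as np

def fisher_info(p_of_theta, theta, eps=1e-6):
    # Pull-back metric g_ij(theta) = sum_x (d_i p_x)(d_j p_x) / p_x,
    # with the Jacobian d p_x / d theta_i estimated by central differences.
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    d = len(theta)
    p = p_of_theta(theta)
    J = np.zeros((len(p), d))
    for i in range(d):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        J[:, i] = (p_of_theta(tp) - p_of_theta(tm)) / (2 * eps)
    return J.T @ np.diag(1.0 / p) @ J

# Two-state family p(theta) = (theta, 1 - theta):
# its Fisher information is 1/theta + 1/(1-theta) = 1/(theta(1-theta)).
G = fisher_info(lambda t: np.array([t[0], 1 - t[0]]), [0.3])
```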

Consider an embedding

where _{*} denotes the push-forward of _{p}ε

where
_{q}_{f}_{(}_{p}_{)}

An embedding

In this case, we say that the metric

One of the theoretical motivations for using the Fisher metric is provided by Chentsov’s characterization [

Let A_1, …, A_m be a partition of [n] into non-empty blocks, and let Q be an m × n row-stochastic matrix such that the i-th row of Q is a probability distribution supported on A_i (a row-partition matrix). The map

∆_{m−1} → ∆_{n−1}, p ↦ pQ,

is called a congruent embedding by a Markov mapping of ∆_{m−1} into ∆_{n−1}.

An example of a 3 × 5 row-partition matrix is:
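Since the defining properties do not single out specific entries, the following sketch constructs one possible 3 × 5 row-partition matrix; the blocks A_1 = {1, 2}, A_2 = {3}, A_3 = {4, 5} and the entries are our own illustrative choice, not necessarily those of the original example:

```python
import numpy as np

# Column indices (0-based) of the blocks A_1, A_2, A_3 of [5]:
blocks = [[0, 1], [2], [3, 4]]
Q = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1/3, 2/3],
])

# Defining properties: each row sums to one and is supported on its block.
assert np.allclose(Q.sum(axis=1), 1.0)
for a, block in enumerate(blocks):
    outside = [j for j in range(5) if j not in block]
    assert np.all(Q[a, outside] == 0.0)

# The induced Markov map p -> p Q sends the simplex of 3 states
# into the simplex of 5 states:
p = np.array([0.2, 0.3, 0.5])
q = p @ Q
```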

Markov maps preserve the 1-norm and restrict to embeddings ∆_{m−1} → ∆_{n−1} of the probability simplices. Chentsov's theorem (Theorem 2) states that if, for each m, g^{(m)} is a Riemannian metric on ∆°_{m−1} such that every congruent embedding by a Markov mapping is an isometry, then g^{(m)} is a constant multiple of the Fisher metric.

The main result in Campbell’s work [

g^{(m)}_{ij}(p) = A(|p|) + δ_{ij} C(|p|) / p_i,

where |p| = Σ_i p_i, δ_{ij} is the Kronecker delta, and A and C are C^{∞} functions on (0, ∞), with C > 0 and A + C/|p| > 0.

The metrics from Campbell’s theorem also define metrics on the probability simplices

^{(m)} to a Riemannian metric
^{(m)} is a multiple of the Fisher metric. Such metric extensions can be defined as follows. Consider the diffeomorphism:

Any tangent vector
_{p}_{r}∂_{r}_{p}_{*} maps the tangent vector
_{*}u = f_{*}u_{p}_{r}∂_{r}

is a metric on

In what follows, we will focus on positive matrices. In order to define a natural Riemannian metric, we can use the identification

where

^{(1)},…, ^{(k)}}. The map:

is called a congruent embedding by a Markov morphism of

that is, the ^{(a)}.

In a Lebanon map, each row of the input matrix ^{(i)}, and each resulting row is copied and scaled by an entry of

Contrary to what is stated in [

The main result in Lebanon’s work [15, Theorems 1 and 2] is the following.

^{(k,m)}

^{(k,m)}

Lebanon does not study the question under which assumptions on

The class of metrics

The special case with

Furthermore, if we restrict to

where

^{(k,m)}

Observe that these metrics agree with (a multiple of) the Fisher metric only if ^{⊤} ⊗ ^{⊤}

According to Chentsov’s theorem (Theorem 2), a natural metric on the probability simplex can be characterized by requiring the isometry of natural embeddings. Lebanon follows this axiomatic approach to characterize metrics on products of positive measures (Theorem 6). However, the maps considered by Lebanon dissolve the row-normalization of conditional distributions. In general, they do not map conditional polytopes to conditional polytopes. Therefore, we will consider a slight modification of Lebanon maps, in order to obtain maps between conditional polytopes.

A matrix of conditional distributions _{km}_{−1} with conditional distribution

In information theory, stochastic matrices are also viewed as channels. For any distribution of

Channels can be combined, provided the cardinalities of the state spaces fit together. If we take the output _{km}_{−1}) is transformed similarly; that is, the joint distribution of the pair (

More general maps result from compositions where the choice of the second channel depends on the input of the first channel. In other words, we have a first channel that takes as input ^{(i)}}_{i}

We can also consider transformations of the first random variable ^{⊤}

To sum up, if we combine the transformations due to ^{⊤} (

Finally, we will also consider the special case where the partition of

for a _{1},…, _{k}

For example, the 3 × 5 partition indicator matrix corresponding to
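A partition indicator matrix is the 0/1 analogue of a row-partition matrix: entry (a, j) equals 1 exactly when j belongs to block A_a, so every column contains exactly one 1. The sketch below constructs such a matrix; the partition of [5] used here is our own illustrative choice, since the partition of the original example is not specified:

```python
import numpy as np

def partition_indicator(blocks, n):
    # 0/1 matrix with P[a, j] = 1 iff column j belongs to block a;
    # each column then has exactly one nonzero entry.
    P = np.zeros((len(blocks), n))
    for a, block in enumerate(blocks):
        P[a, block] = 1.0
    return P

# Illustrative partition {1,2}, {3}, {4,5} of [5] (0-based indices):
P = partition_indicator([[0, 1], [2], [3, 4]], 5)
```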

a conditional embedding of

Conditional embeddings preserve the 1-norm of the matrix rows; that is, the elements of

Considering the conditional embeddings discussed in the previous section, we obtain the following metric characterization.

^{(k,m)}

^{(k,m)}

The proof of Theorem 11 is similar to the proof of the theorems of Chentsov, Campbell and Lebanon. Due to its technical nature, we defer it to the Appendix.

Now, for the restriction of the metric ^{(k,m)} to

This metric is a specialization of the metric

The statement of Theorem 11 becomes false if we consider general conditional embeddings instead of homogeneous ones:

^{(k,m)}

This negative result will become clearer from the perspective of Section 6: as we will show in Theorem 17, although there are no metrics that are invariant under all conditional embeddings, there are families of metrics (depending on a parameter,

In the previous section, we obtained distinguished Riemannian metrics on

Let
_{x}_{i}_{A,ν}

with the normalization function
_{i}_{A}.

A direct calculation shows that the Fisher information matrix of the exponential family E_{A,ν} is the covariance matrix of the observables, g_{ij}(θ) = cov_θ(A_i, A_j).

Here, cov_θ denotes the covariance with respect to the distribution with parameter θ.
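This identity can be checked numerically: for a small exponential family (with randomly chosen observables, as an illustration), the covariance of the observables agrees with the Hessian of the log-partition function, which equals the Fisher information matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2
A = rng.normal(size=(d, n))       # observables (sufficient statistics)
nu = np.ones(n)                   # reference measure
theta = np.array([0.3, -0.7])

def logZ(th):
    # Log-partition function of the exponential family.
    return np.log(np.sum(nu * np.exp(th @ A)))

def p(th):
    w = nu * np.exp(th @ A)
    return w / w.sum()

# Fisher information = covariance of the observables under p_theta:
pt = p(theta)
mean = A @ pt
cov = (A * pt) @ A.T - np.outer(mean, mean)

# Compare with the Hessian of log Z via central differences:
eps = 1e-5
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei = np.eye(d)[i] * eps
        ej = np.eye(d)[j] * eps
        H[i, j] = (logZ(theta + ei + ej) - logZ(theta + ei - ej)
                   - logZ(theta - ei + ej) + logZ(theta - ei - ej)) / (4 * eps**2)
```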

The convex support of _{A},_{ν}

where conv _{A}_{,ν}. The inverse of

Let
_{1},…, _{n}._{1},…, _{n}_{P}. We can use the inverse of the moment map, ^{−1}, to pull back geometric structures on

^{−1}.

Some obvious questions are: Why is this a natural construction? Which maps between polytopes are isometries between their Fisher metrics? Can we find a characterization of Chentsov type for this metric?

Affine maps are natural maps between polytopes. However, in order to obtain isometries, we need to put some additional constraints. Consider two polytopes
^{−1} = μ ○ ^{′−1}. Then, the following diagram commutes:

It follows that ϕ^{−1} is an isometry from _{n}_{−1}, then the upper moment map ^{−1} is the identity map, and ^{−1} equals the inverse moment map ^{−1} of

The constraint of mapping vertices to vertices bijectively is very restrictive. In order to consider a larger class of affine maps, we need to generalize our construction from polytopes to weighted point configurations.

_{1},…, _{n}_{i}_{A},_{ν}

The (

We recover Definition 13 as follows. For a polytope P, let _{P} _{A},_{ν}

The following are natural maps between weighted point configurations:

Consider a morphism (ϕ,

Then,

By Chentsov’s theorem (Theorem 2), ^{−1} also induces an isometric embedding. This shows the first part of the following theorem:

^{−1}: (conv

^{A},^{ν} be a Riemannian metric on^{−1}: (convA′)° → (conv ^{A},^{ν} is equal to α times the

^{P}^{−1} is itself a morphism.

Observe that ∆_{n}_{−1} = conv _{n}_{1},…, _{n}

Let _{j}_{i}^{−1} is an isometric embedding
^{−1} is equal to the Markov map

Theorem 16 defines a natural metric on

Consider _{1}],…, [n_{k}

where
_{1},…, _{k}_{k}

We can write any tangent vector

Just as the convex support of the independence model is the Cartesian product of probability simplices, the Fisher metric on the independence model is the product metric of the Fisher metrics on the probability simplices of the individual variables. If _{1} = … = _{k}

The Fisher metric on the product of simplices is equal to the product of the Fisher metrics on the factors. More generally, if _{1} × _{2} is a Cartesian product, then the Fisher metric on
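A minimal numerical sketch of this product structure (the matrices below are our own illustrative values): the Fisher metric of a stochastic matrix, viewed as a point in a product of simplices, is the sum of the row-wise simplex Fisher metrics:

```python
import numpy as np

def product_fisher(M, U, V):
    # Product Fisher metric on a k x m stochastic matrix M:
    # g_M(U, V) = sum_i sum_j U_ij V_ij / M_ij,
    # i.e. the sum over rows of the simplex Fisher metric.
    return float(np.sum(U * V / M))

M = np.array([[0.6, 0.4], [0.3, 0.7]])      # rows are points of the simplex
U = np.array([[0.1, -0.1], [0.05, -0.05]])  # tangent vector: rows sum to zero

row_metrics = [np.sum(U[i] * U[i] / M[i]) for i in range(2)]
total = product_fisher(M, U, U)
```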

Therefore, the pull-back by ^{−1} factorizes through the pull-back by

Let us compare the metric

where _{i}_{i}_{i}_{i}^{K}^{K}

In this section, we consider metrics on spaces of stochastic matrices defined as weighted sums of the Fisher metrics on the spaces of the matrix rows, similar to

Consider the following weighted product Fisher metric:

g^{K}_M(U, V) = Σ_{i∈[k]} K_i Σ_{j∈[m]} U_{ij} V_{ij} / M_{ij},

where K = (K_1, …, K_k) is a vector of strictly positive weights.

In the following, we will try to illuminate the properties of polytope embeddings that yield the metric ^{K}

There are two direct ways of embedding

If ρ ∈ ∆°_{k−1} is a fixed marginal distribution of X, then M ↦ ρ ∘ M, with (ρ ∘ M)_{ij} = ρ_i M_{ij}, embeds the polytope of stochastic matrices into the simplex of joint distributions.

The pull-back of the Fisher metric on the joint simplex along this embedding is g^{ρ,m}_M(U, V) = Σ_{i∈[k]} ρ_i Σ_{j∈[m]} U_{ij} V_{ij} / M_{ij}.

This recovers the weighted sum of Fisher metrics from
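The identity behind this computation can be checked numerically (ρ, M and U below are illustrative): the pull-back of the joint Fisher metric along M ↦ (ρ_i M_ij) equals the ρ-weighted sum of the row Fisher metrics:

```python
import numpy as np

rho = np.array([0.25, 0.75])                # fixed marginal distribution
M = np.array([[0.6, 0.4], [0.3, 0.7]])      # stochastic matrix
U = np.array([[0.1, -0.1], [-0.2, 0.2]])    # tangent vector: rows sum to zero

# Embedding into the joint simplex and its push-forward on tangents:
P_joint = rho[:, None] * M                  # joint distribution p(i, j) = rho_i M_ij
U_joint = rho[:, None] * U                  # push-forward of U

# Fisher metric of the joint simplex evaluated on the embedded vectors:
pullback = float(np.sum(U_joint * U_joint / P_joint))

# rho-weighted sum of the row-wise simplex Fisher metrics:
weighted = float(np.sum(rho * np.sum(U * U / M, axis=1)))
```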

Are there natural maps that leave the metrics ^{ρ},^{m}^{(a)}}_{a}_{∈[}_{k}_{]} be a collection of stochastic partition matrices. The corresponding conditional embedding

Let
_{ρ}^{⊤}(

The preceding discussion implies the first statement of the following result:

^{ρ},^{m} on

^{(ρ,m)} ^{(ρ,m)} ^{(ρ,m)} = ^{ρ},^{m}

^{k}^{l}^{k}R.

A general distribution
^{(ρ,m)} is assumed to be continuous, it suffices to prove the statement for rational ^{(ρ,m)} is also of the desired form. □

In this section, we use gradient fields in order to compare Riemannian metrics on the space

We start with gradient fields on the simplex ∆°_{n−1}. Given a differentiable function F on ∆°_{n−1} and the Fisher metric g^{(n)} as the Riemannian metric, we obtain the gradient in the following way. First consider a differentiable extension of F to an open neighborhood of ∆°_{n−1}, with partial derivatives ∂_i F. Then:

(grad F)_i(p) = p_i (∂_i F(p) − Σ_j p_j ∂_j F(p)).

Note that the expression on the right-hand side of

We now apply this gradient formula to functions that have the structure of an expectation value. Given real numbers F_1, …, F_n, consider the mean fitness F(p) = Σ_i p_i F_i.

Replacing the partial derivatives ∂_i F by the constants F_i, the gradient field becomes the replicator equation: dp_i/dt = p_i (F_i − Σ_j p_j F_j).

This equation has the solution: p_i(t) = p_i(0) e^{F_i t} / Σ_j p_j(0) e^{F_j t}.

Clearly, the mean fitness will increase along this solution of the gradient field. The rate of increase can be easily calculated: (d/dt) Σ_i p_i(t) F_i = Σ_i p_i(t) F_i² − (Σ_i p_i(t) F_i)² = Var_{p(t)}(F) ≥ 0.

As limit points of this solution, we obtain: for t → +∞, the restriction of p(0) to the set of indices where F_i is maximal, normalized to a probability distribution; and, for t → −∞, the corresponding normalized restriction to the set of indices where F_i is minimal.
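The closed-form solution and its limit behavior can be checked against a direct Euler integration of the gradient field; the fitness values and initial distribution below are our own illustrative choice:

```python
import numpy as np

F = np.array([1.0, 2.0, 3.0])       # fitness values F_i (illustrative)
p0 = np.array([0.5, 0.3, 0.2])      # initial distribution

def p_closed_form(t):
    # Solution p_i(t) = p_i(0) exp(F_i t) / sum_j p_j(0) exp(F_j t).
    w = p0 * np.exp(F * t)
    return w / w.sum()

# Integrate the replicator field dp_i/dt = p_i (F_i - <F>_p) by Euler steps:
p, dt, T = p0.copy(), 1e-5, 1.0
for _ in range(int(T / dt)):
    p = p + dt * p * (F - p @ F)
p_exact = p_closed_form(T)
```

For large t, the solution concentrates on the indices with maximal fitness, matching the limit behavior described above.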

Now, we come to the corresponding considerations of gradient fields in the context of stochastic matrices

One way to deal with this is to consider for each

Obviously, this is the gradient field that one obtains by using the product Fisher metric on

If we replace the metric by the weighted product Fisher metric considered by Kakade (

then we obtain
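The following sketch illustrates this computation under stated assumptions: for a reward of expectation form F(M) = Σ_ij ρ_i M_ij F_ij and the weighted product Fisher metric with weights K_i, the gradient is row-wise (grad F)_ij = (ρ_i / K_i) M_ij (F_ij − Σ_l M_il F_il), so choosing K = ρ (a Kakade-type weighting) cancels the prefactor. The values of ρ, F_ij and M below are hypothetical:

```python
import numpy as np

rho = np.array([0.3, 0.7])                  # marginal distribution (illustrative)
Fmat = np.array([[1.0, 0.0], [0.2, 0.8]])   # rewards F_ij (illustrative)
M = np.array([[0.6, 0.4], [0.5, 0.5]])      # stochastic matrix

def nat_grad(M, K):
    # Gradient of F(M) = sum_ij rho_i M_ij F_ij with respect to the
    # weighted product Fisher metric g_M(U, V) = sum_i K_i sum_j U_ij V_ij / M_ij:
    # row-wise, (grad F)_ij = (rho_i / K_i) * M_ij * (F_ij - sum_l M_il F_il).
    row_mean = np.sum(M * Fmat, axis=1, keepdims=True)
    return (rho / K)[:, None] * M * (Fmat - row_mean)

G_kakade = nat_grad(M, K=rho)        # weights equal to the marginal: prefactor cancels
G_plain = nat_grad(M, K=np.ones(2))  # unweighted product Fisher metric
```

Each gradient row sums to zero, so the flow stays inside the polytope of stochastic matrices.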

Next, we want to study how the gradient flows with respect to different metrics compare. We restrict to the class of metrics ^{ρ},^{m}_{i}._{i}

With a probability distribution
_{ij}

With

The corresponding solutions are given by:

Since
_{i}

and:

This is consistent with the fact that the critical points of gradient fields are independent of the chosen Riemannian metric. However, the speed of convergence does depend on the metric:

For each _{i}_{j} F_{ij}_{ij}

Therefore,

Thus, in the long run, the rate of convergence is given by
_{i}, i.e.,_{i}_{i}

Consider, for example, the case that the differences _{i}_{i}_{i}_{i}_{i}_{i}.

So, which Riemannian metric should one use in practice on the set of stochastic matrices,

Which metric performs best obviously depends on the concrete application. The first observation is that in order to use the metric ^{ρ},^{m}^{ρ},^{m}.

On the other hand, there may be situations where there is no natural choice of the weights ^{ρ},^{m}

For example, consider a utility function of the form
^{ρ},^{m}^{ρ},^{m}^{ρ},^{m}.^{(k,m)}.

The authors are grateful to Keyan Zahedi for discussions related to policy gradient methods in robotics applications. Guido Montúfar thanks the Santa Fe Institute for hosting him during the initial work on this article. Johannes Rauh acknowledges support by the VW Foundation. This work was supported in part by the DFG Priority Program, Autonomous Learning (DFG-SPP 1527).

All authors contributed to the design of the research. The research was carried out by all authors, with main contributions by Guido Montúfar and Johannes Rauh. The manuscript was written by Guido Montúfar, Johannes Rauh and Nihat Ay. All authors read and approved the final manuscript.

The authors declare no conflict of interest.

^{(k,m)}

is strictly positive for all non-zero

We can derive necessary conditions on the functions _{ab}_{ab}_{a}|,_{a}|_{ab}_{ab}_{a}

For any given
^{⊤}_{A}_{B}_{C}_{n}^{−1}

Let us consider a leading square block _{ab},_{cd}_{a}

Since _{C}

The matrix in the second term of

By Sylvester’s determinant theorem, we have:

where
_{a}

This shows that the matrix

The following lemma follows directly from the definition and contains all the technical details we need for the proofs.

^{(l,n)}

^{(a)} has the

This implies that

Using the second type of map, we get:

which implies
_{M}

_{zw}

and therefore,

which implies that

For a rational matrix _{M}

which implies:

such that the left-hand side is a constant C, and
_{M}

Summarizing, we found:

which proves the first statement. The second statement follows by plugging

Observe that:

In fact,

Again, we may choose

If we choose
^{k},^{m}

An interpretation for Lebanon maps and conditional embeddings. The variable

An illustration of different embeddings of the conditional polytope