On the Fisher Metric of Conditional Probability Polytopes

We consider three different approaches to define natural Riemannian metrics on polytopes of stochastic matrices. First, we define a natural class of stochastic maps between these polytopes and give a metric characterization of Chentsov type in terms of invariance with respect to these maps. Second, we consider the Fisher metric defined on arbitrary polytopes through their embeddings as exponential families in the probability simplex. We show that these metrics can also be characterized by an invariance principle with respect to morphisms of exponential families. Third, we consider the Fisher metric resulting from embedding the polytope of stochastic matrices in a simplex of joint distributions by specifying a marginal distribution. All three approaches result in slight variations of products of Fisher metrics. This is consistent with the nature of polytopes of stochastic matrices, which are Cartesian products of probability simplices. The first approach yields a scaled product of Fisher metrics; the second, a product of Fisher metrics; and the third, a product of Fisher metrics scaled by the marginal distribution.


Introduction
The Riemannian structure of a function's domain has a crucial impact on the performance of gradient optimization methods, especially in the presence of plateaus and local maxima. The natural gradient [1] gives the steepest increase direction of functions on a Riemannian space. For example, artificial neural networks can often be trained by following some function's gradient on a space of probabilities. In this context, it has been observed that following the natural gradient with respect to the Fisher information metric, instead of the Euclidean metric, can significantly alleviate the plateau problem [1,10]. The Fisher information metric, which is also called Shahshahani metric [18] in biological contexts, is broadly recognized as the natural metric of probability spaces. An important argument was given by Chentsov [8], who showed that the Fisher information metric is the only metric on probability spaces for which certain natural statistical embeddings, called Markov morphisms, are isometries. More generally, Chentsov's theorem characterizes the Fisher metric and α-connections of statistical manifolds uniquely (up to a multiplicative constant) by requiring invariance with respect to Markov morphisms. Campbell [7] gave another proof that characterizes invariant metrics on the set of non-normalized positive measures, which restrict to the Fisher metric in the case of probability measures (up to a multiplicative constant). In this paper, we explore ways of defining distinguished Riemannian metrics on spaces of stochastic matrices.
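To make the role of the metric concrete: with respect to the Fisher metric, the natural gradient of a function f on the open probability simplex takes the well-known replicator form p_i(∂_i f − Σ_j p_j ∂_j f). The following minimal Python sketch (with an illustrative reward vector, not an example from this paper) follows this natural-gradient flow for the linear function f(p) = Σ_i p_i r_i:

```python
import numpy as np

def natural_gradient_simplex(p, grad):
    """Natural (Fisher/Shahshahani) gradient of a function on the open
    probability simplex, given its Euclidean gradient `grad` at p.
    The result is tangent to the simplex (components sum to zero)."""
    avg = np.dot(p, grad)          # expectation of the gradient under p
    return p * (grad - avg)        # replicator-type ascent direction

# Example: maximize the expectation E_p[r] of a fixed reward vector r.
r = np.array([1.0, 2.0, 4.0])
p = np.array([0.5, 0.3, 0.2])
for _ in range(2000):
    p += 0.01 * natural_gradient_simplex(p, r)   # gradient ascent step
    p /= p.sum()                                  # guard against numerical drift
print(p.round(3))   # the flow concentrates on the state with maximal reward
```

The update stays inside the simplex because the natural-gradient direction has zero coordinate sum and vanishes on the boundary.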
In learning theory, when modeling the policy of a system, it is often preferred to consider stochastic matrices instead of joint probability distributions. For example, in robotics applications, policies are optimized over a parametric set of stochastic matrices by following the gradient of a reward function [19,13]. The set of stochastic matrices can be parametrized in many ways, e.g., in terms of feedforward neural networks, Boltzmann machines [14] or projections of exponential families [3]. The information geometry of policy models plays an important role in these applications and has been studied by Kakade [10], Peters and co-workers [16,15,17], and Bagnell and Schneider [4], among others. A stochastic matrix is a tuple of probability distributions, and therefore, the space of stochastic matrices is a Cartesian product of probability simplices. Accordingly, in applications, usually a product metric is considered, with the usual Fisher metric on each factor. On the other hand, Lebanon [12] takes an axiomatic approach, following the ideas of Chentsov and Campbell, and characterizes a class of invariant metrics of positive matrices that restricts to the product of Fisher metrics in the case of stochastic matrices. We will consider three different approaches discussed in the following.
In the first part, we take another look at Lebanon's approach for characterizing a distinguished metric on polytopes of stochastic matrices. However, since the maps considered by Lebanon do not map stochastic matrices to stochastic matrices, we will use different maps. We show that the product of Fisher metrics can be characterized by an invariance principle with respect to natural maps between stochastic matrices.
In the second part, we consider an approach that allows us to define Riemannian structures on arbitrary polytopes. Any polytope can be identified with an exponential family by using the coordinates of the polytope vertices as observables. The inverse of the moment map then defines an embedding of the polytope in a probability simplex. This embedding can be used to pull back geometric structures from the probability simplex to the polytope, including Riemannian metrics, affine connections, divergences, etc. This approach has been considered in [3] as a way to define low-dimensional families of conditional probability distributions. More general embeddings can be defined by identifying each exponential family with a point configuration, B, together with a weight function, ν. Given B and ν, the corresponding exponential family defines geometric structures on the set (conv B) • , which is the relative interior of the convex support of the exponential family. Moreover, we can define natural morphisms between weighted point configurations as surjective maps between the point sets, which are compatible with the weight functions. As it turns out, the Fisher metric on (conv B) • can be characterized by invariance under these maps.
In the third part, we return to stochastic matrices. We study natural embeddings of conditional distributions in probability simplices as joint distributions with a fixed marginal. These embeddings define a Fisher metric equal to a weighted product of Fisher metrics. This result corresponds to the definitions commonly used in robotics applications.
All three approaches give very similar results. In all cases, the identified metric is a product metric. This is a sensible result, since the set of k × m stochastic matrices is a Cartesian product of probability simplices ∆_{m−1} × · · · × ∆_{m−1} = ∆^k_{m−1}, which suggests using the product metric of the Fisher metrics defined on the factor simplices ∆_{m−1}. Indeed, this is the result obtained from our second approach. The first approach yields the same result with an additional scaling factor of 1/k; the two approaches differ only when stochastic matrices of different sizes are compared. The third approach yields a product of Fisher metrics scaled by the marginal distribution that defines the embedding. Which metric to use depends on the concrete problem and on whether a natural marginal distribution is defined and known. In Section 7, we present a case study using a reward function that is given as an expectation value over a joint distribution. In this simple example, the weighted product metric gives the best asymptotic rate of convergence, provided that the weights are chosen optimally. In Section 8, we sum up our findings.
The paper is organized as follows. Section 2 contains basic definitions concerning the Fisher metric and concepts of differential geometry. In Section 3, we discuss the theorems of Chentsov, Campbell and Lebanon, which characterize natural geometric structures on the probability simplex, on the set of positive measures and on the cone of positive matrices, respectively. In Section 4, we study metrics on polytopes of stochastic matrices, which are invariant under natural embeddings. In Section 5, we define a Riemannian structure for polytopes, which generalizes the Fisher information metric of probability simplices and conditional models in a natural way. In Section 6, we study a class of weighted product metrics. In Section 7, we study the gradient flow with respect to an expectation value. Section 8 contains concluding remarks. In Appendix A, we investigate restrictions on the parameters of the metrics characterized in Sections 3 and 4 that make them positive definite. Appendix B contains the proofs of the results from Section 4.

Preliminaries
We will consider the simplex of probability distributions on [m] := {1, . . . , m}, m ≥ 2, which is given by ∆_{m−1} := {(p_i)_i ∈ R^m : p_i ≥ 0, Σ_i p_i = 1}. The relative interior of ∆_{m−1} consists of all strictly positive probability distributions on [m] and will be denoted by ∆°_{m−1}. This is a subset of R^m_+, the cone of strictly positive vectors. The set of k × m row-stochastic matrices is given by ∆^k_{m−1} := {(K_{ij})_{ij} ∈ R^{k×m} : K_{ij} ≥ 0, Σ_j K_{ij} = 1 for all i ∈ [k]}. Its relative interior (∆^k_{m−1})° is a subset of R^{k×m}_+, the cone of strictly positive matrices.
Given two random variables X and Y taking values in the finite sets [k] and [m], respectively, the conditional probability distribution of Y given X is the stochastic matrix K = (P(y|x))_{x∈[k], y∈[m]} with rows (P(y|x))_{y∈[m]} ∈ ∆_{m−1} for all x ∈ [k]. Therefore, the polytope of stochastic matrices ∆^k_{m−1} is called a conditional polytope.

The tangent space of R^n_+ at a point p ∈ R^n_+, denoted by T_p R^n_+, is the real vector space spanned by the vectors ∂_1, . . . , ∂_n of partial derivatives with respect to the n components. The tangent space of ∆°_{n−1} at p is the subspace T_p ∆°_{n−1} = {u ∈ T_p R^n_+ : Σ_i u_i = 0}.

The Fisher metric on the positive probability simplex ∆°_{n−1} is the Riemannian metric given by:

g^{(n)}_p(u, v) := Σ_{i=1}^n u_i v_i / p_i, for all u, v ∈ T_p ∆°_{n−1}. (2)

The same formula (2) also defines a Riemannian metric on R^n_+, which we will denote by the same symbol. This, however, is not the only way in which the Fisher metric can be extended from ∆°_{n−1} to R^n_+. We will discuss other extensions in the next section (see Campbell's theorem in Section 3).

Given a parametrized model M = {p(·; θ) : θ ∈ Ω} ⊆ ∆°_{n−1} with parameter space Ω ⊆ R^d, the Fisher information matrix at θ has entries:

g^M_θ(∂_{θ_i}, ∂_{θ_j}) := Σ_{x=1}^n p(x; θ) (∂ log p(x; θ)/∂θ_i)(∂ log p(x; θ)/∂θ_j).

Here, it is not necessary to assume that the parameters θ_i are independent. In particular, the dimension of M may be smaller than d, in which case the matrix is not positive definite. If the map Ω → M, θ ↦ p(·; θ), is an embedding (i.e., a smooth injective map that is a diffeomorphism onto its image), then g^M_θ defines a Riemannian metric on Ω, which corresponds to the pull-back of g^{(n)}.

Consider an embedding f: E → E′. The pull-back of a metric g′ on E′ through f is defined as:

(f^* g′)_p(u, v) := g′_{f(p)}(f_* u, f_* v), for all p ∈ E and u, v ∈ T_p E,

where f_* denotes the push-forward of T_p E through f, which in coordinates is given by (f_* u)_j = Σ_i u_i ∂f_j/∂x_i. An embedding f: E → E′ between two Riemannian manifolds (E, g) and (E′, g′) is an isometry iff:

g_p(u, v) = g′_{f(p)}(f_* u, f_* v), for all p ∈ E and u, v ∈ T_p E.
In this case, we say that the metric g is invariant with respect to f (and g ).
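As a numerical illustration of these definitions (a sketch with an assumed one-parameter family, not taken from the paper), the pull-back of the Fisher metric (2) through a one-parameter exponential family can be compared with the variance of the observable, which is the Fisher information of such a family:

```python
import numpy as np

def fisher_metric(p, u, v):
    """Fisher metric g^(n)_p(u, v) = sum_i u_i v_i / p_i on the open simplex."""
    return np.sum(u * v / p)

# Pull back g^(n) through a one-parameter family p(x; t) ~ exp(t * a_x)
# (a toy model chosen for illustration).
a = np.array([0.0, 1.0, 3.0])

def p_of(t):
    w = np.exp(t * a)
    return w / w.sum()

t, h = 0.7, 1e-6
dp = (p_of(t + h) - p_of(t - h)) / (2 * h)      # push-forward of d/dt
pullback = fisher_metric(p_of(t), dp, dp)

# For an exponential family, the Fisher information equals the
# variance of the observable a under p(.; t).
p = p_of(t)
var = np.sum(p * a**2) - np.sum(p * a)**2
print(abs(pullback - var) < 1e-6)   # the two expressions agree
```

This checks the statement above that g^M_θ is the pull-back of g^{(n)} through the parametrization.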

The Results of Campbell and Lebanon
One of the theoretical motivations for using the Fisher metric is provided by Chentsov's characterization [8], which states that the Fisher metric is uniquely specified, up to a multiplicative constant, by an invariance principle under a class of stochastic maps, called Markov morphisms. Later, Campbell [7] considered the characterization problem on the space R n + instead of ∆ • n−1 . This simplifies the computations, since R n + has a more symmetric parametrization.
Definition 1. Let 2 ≤ m ≤ n. A (row) stochastic partition matrix (or just row-partition matrix) is a matrix Q ∈ R^{m×n} with non-negative entries which satisfies Σ_{j∈A_i} Q_{i′j} = δ_{ii′} for an m-block partition {A_1, . . . , A_m} of [n]. The linear map defined by:

p ∈ R^m_+ ↦ p · Q ∈ R^n_+

is called a congruent embedding by a Markov mapping of R^m_+ into R^n_+, or just a Markov map, for short.
An example of a 3 × 5 row-partition matrix (with blocks A_1 = {1, 2}, A_2 = {3, 4}, A_3 = {5}) is:

    ( 1/2  1/2   0    0    0 )
Q = (  0    0   1/3  2/3   0 )    (8)
    (  0    0    0    0    1 )

Markov maps preserve the 1-norm and restrict to embeddings of ∆°_{m−1} into ∆°_{n−1}.

Theorem 2 (Chentsov's theorem).
• Let g^{(m)} be a Riemannian metric on ∆°_{m−1} for each m ∈ {2, 3, . . .}. Let this sequence of metrics have the property that every congruent embedding by a Markov mapping is an isometry. Then, there is a constant C > 0 such that:

g^{(m)}_p(u, v) = C Σ_{i=1}^m u_i v_i / p_i. (9)

• Conversely, for any C > 0, the metrics given by Equation (9) define a sequence of Riemannian metrics under which every congruent embedding by a Markov mapping is an isometry.
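The isometry property in Chentsov's theorem is easy to check numerically. The sketch below (with illustrative matrices, not from the paper) verifies that a congruent embedding by a Markov mapping p ↦ pQ, with push-forward u ↦ uQ, preserves the Fisher metric:

```python
import numpy as np

# A 3x5 row-partition (stochastic) matrix: each row is supported on one
# block of the partition {1,2}, {3,4}, {5} and sums to one.
Q = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1/3, 2/3, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])

def fisher(p, u, v):
    """Fisher metric on the open simplex."""
    return np.sum(u * v / p)

p = np.array([0.2, 0.3, 0.5])
u = np.array([0.1, -0.1, 0.0])   # tangent vectors: components sum to 0
v = np.array([0.0, 0.2, -0.2])

# The congruent embedding p -> pQ and its push-forward u -> uQ
lhs = fisher(p, u, v)
rhs = fisher(p @ Q, u @ Q, v @ Q)
print(abs(lhs - rhs) < 1e-12)    # Markov maps are isometries
```

The computation works because, within each partition block, the image distribution is a rescaled copy of one coordinate of p, so the per-block contributions collapse to the original terms u_i v_i / p_i.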
The main result in Campbell's work [7] is the following variant of Chentsov's theorem.
Theorem 3 (Campbell's theorem).
• Let g^{(m)} be a Riemannian metric on R^m_+ for each m ∈ {2, 3, . . .}. Let this sequence of metrics have the property that every embedding by a Markov mapping is an isometry. Then:

g^{(m)}_p(∂_i, ∂_j) = A(|p|) + δ_{ij} C(|p|) / p_i, (10)

where |p| = Σ_{i=1}^m p_i, δ_{ij} is the Kronecker delta, and A and C are C^∞ functions on R_+ satisfying C(α) > 0 and A(α) + C(α) > 0 for all α > 0.
• Conversely, if A and C are C ∞ functions on R + satisfying C(α) > 0, A(α) + C(α) > 0 for all α > 0, then Equation (10) defines a sequence of Riemannian metrics under which every embedding by a Markov mapping is an isometry.
The metrics from Campbell's theorem also define metrics on the probability simplices ∆°_{m−1} by restriction. In this case, the choice of A is immaterial, and the metric becomes Chentsov's metric.

Remark 4.
Observe that Chentsov's theorem is not a direct implication of Campbell's theorem. However, it can be deduced from it by the following arguments. Suppose that we have a family of Riemannian simplices (∆°_{m−1}, g^{(m)}) for m ∈ {2, 3, . . .}, and suppose that they are isometric with respect to Markov maps. If we can extend every g^{(m)} to a Riemannian metric g̃^{(m)} on R^m_+ in such a way that the resulting spaces (R^m_+, g̃^{(m)}) are still isometric with respect to Markov maps, then Campbell's theorem implies that g^{(m)} is a multiple of the Fisher metric. Such metric extensions can be defined as follows. Consider the diffeomorphism:

R^m_+ → ∆°_{m−1} × R_+, p ↦ (p/|p|, |p|),

and write (p, r) for the coordinates of a point of R^m_+ under this identification. Any tangent vector u ∈ T_{(p,r)} R^m_+ can be written uniquely as u = u_p + u_r ∂_r, where u_p is tangent to r ∆°_{m−1}. Since each Markov map f is linear and preserves the one-norm |·|, it takes the form f(p, r) = (f(p), r) in these coordinates, and its push-forward f_* maps the tangent vector ∂_r ∈ T_{(p,r)} R^m_+ to the corresponding tangent vector ∂_r at the image point. Therefore, the product metric g̃^{(m)} := g^{(m)} ⊕ dr² on ∆°_{m−1} × R_+ ≅ R^m_+ extends g^{(m)}, and every Markov map remains an isometry with respect to these extensions.

In what follows, we will focus on positive matrices. In order to define a natural Riemannian metric, we can use the identification R^{k×m}_+ ≅ R^{km}_+ and apply Campbell's theorem. This leads to metrics of the form:

g^{(k,m)}_M(∂_{ij}, ∂_{ab}) = A(|M|) + δ_{ia} δ_{jb} C(|M|) / M_{ij}, (13)

where ∂_{ij} = ∂/∂M_{ij} and |M| = Σ_{ij} M_{ij}. However, a disadvantage of this approach is that the action of general Markov maps on R^{km}_+ has no natural interpretation in terms of the matrix structure. Therefore, Lebanon [12] considered a special class of Markov maps defined as follows.
Definition 5. Consider a k × l row-partition matrix R and a collection Q = {Q^{(a)}}_{a=1}^k of m × n row-partition matrices. The map:

M ∈ R^{k×m}_+ ↦ R^⊤ (M ⊗ Q) ∈ R^{l×n}_+

is called a congruent embedding by a Markov morphism of R^{k×m}_+ into R^{l×n}_+ in [11]. We will refer to such an embedding as a Lebanon map. Here, the row product M ⊗ Q is defined by:

(M ⊗ Q)_{aj} = Σ_{b=1}^m M_{ab} Q^{(a)}_{bj};

that is, the a-th row of M is multiplied by the matrix Q^{(a)}.
In a Lebanon map, each row of the input matrix M is mapped by an individual Markov mapping Q^{(i)}, and each resulting row is copied and scaled by an entry of R. This kind of map preserves the sum of all matrix entries. Therefore, with the identification R^{k×m}_+ ≅ R^{km}_+, a Lebanon map restricts to a map from ∆°_{km−1} to ∆°_{ln−1}. The set ∆°_{km−1} can be identified with the set of positive joint distributions of two random variables with state spaces [k] and [m]. Lebanon maps can be regarded as special Markov maps that incorporate the product structure present in the set of joint probability distributions of a pair of random variables. In Section 4, we will give an interpretation of these maps.
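Assuming the matrix form M ↦ R^⊤(M ⊗ Q) of a Lebanon map, the following sketch (with illustrative matrices, not from the paper) implements the row product and checks that the sum of all matrix entries is preserved:

```python
import numpy as np

def row_product(M, Qs):
    """Row product M (x) Q: row a of M is multiplied by the matrix Qs[a]."""
    return np.vstack([M[a] @ Qs[a] for a in range(M.shape[0])])

# Row-partition matrices: rows supported on partition blocks, summing to 1
R = np.array([[0.3, 0.7, 0.0],
              [0.0, 0.0, 1.0]])                       # 2x3
Q0 = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])     # 2x3
Q1 = np.array([[0.25, 0.75, 0.0], [0.0, 0.0, 1.0]])   # 2x3
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])                            # positive 2x2 matrix

image = R.T @ row_product(M, [Q0, Q1])                # a Lebanon map, 2x2 -> 3x3
print(np.isclose(image.sum(), M.sum()))               # total mass is preserved
```

Because the row product preserves each row sum (the Q^{(a)} are row-stochastic) and the rows of R sum to one, the total mass |M| is unchanged, so the map restricts to the simplices of joint distributions, as stated above.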
Contrary to what is stated in [11], a Lebanon map does not map (∆^k_{m−1})° to (∆^l_{n−1})°, unless k = l. Therefore, we will later provide a characterization of the metrics on (∆^k_{m−1})° in terms of invariance under other maps (which are neither Markov nor Lebanon maps).
The main result in Lebanon's work [11, Theorems 1 and 2] is the following.
Theorem 6 (Lebanon's theorem).
• For each k ≥ 1, m ≥ 2, let g^{(k,m)} be a Riemannian metric on R^{k×m}_+, in such a way that every Lebanon map is an isometry. Then:

g^{(k,m)}_M(u, v) = A(|M|) (Σ_{ij} u_{ij})(Σ_{ab} v_{ab}) + B(|M|) Σ_{i=1}^k (Σ_j u_{ij})(Σ_j v_{ij}) / |M_i| + C(|M|) Σ_{ij} u_{ij} v_{ij} / M_{ij}, (16)

where |M_i| = Σ_j M_{ij} denotes the i-th row sum of M, for some differentiable functions A, B, C ∈ C^∞(R_+).
• Conversely, let {(R k×m + , g (k,m) )} be a sequence of Riemannian manifolds, with metrics g (k,m) of the form (16) for some A, B, C ∈ C ∞ (R + ). Then, every Lebanon map is an isometry.
Lebanon does not study the question under which assumptions on A, B, C ∈ C^∞(R_+) the formula (16) does indeed define a Riemannian metric. This question has a simple answer, which we prove in Appendix A. The class of metrics (16) is larger than the class of metrics (13) derived in Campbell's theorem. The reason is that Campbell's metrics are required to be invariant with respect to a larger class of embeddings.
The special case with A(|M|) = 0, B(|M|) = 0 and C(|M|) = 1 is called the product Fisher metric:

g^{(k,m)}_M(u, v) = Σ_{ij} u_{ij} v_{ij} / M_{ij}. (17)

Furthermore, if we restrict to (∆^k_{m−1})°, the functions A and B do not play any role. In this case, |M| = k, and we obtain the scaled product Fisher metric:

g^{(k,m)}_K(u, v) = C(k) Σ_{ij} u_{ij} v_{ij} / K_{ij}, (18)

where the positive constants C(k) > 0 may depend on k. As mentioned before, Lebanon's theorem does not give a characterization of invariant metrics of stochastic matrices, since Lebanon maps do not preserve the stochasticity of the matrices. However, Lebanon maps are natural maps on the set ∆°_{km−1} of positive joint distributions. In the same way as Chentsov's theorem can be derived from Campbell's theorem (see Remark 4), we obtain the following corollary:

Corollary 8.
• Let (∆°_{km−1}, g^{(k,m)}), for k ≥ 1 and m ≥ 2, be a double sequence of Riemannian manifolds with the property that every Lebanon map is an isometry. Then:

g^{(k,m)}_P(u, v) = B Σ_{i=1}^k (Σ_j u_{ij})(Σ_j v_{ij}) / |P_i| + C Σ_{ij} u_{ij} v_{ij} / P_{ij}, (19)

where |P_i| = Σ_j P_{ij}, for some constants B, C ∈ R with C > 0 and B + C > 0.
• Conversely, let {(∆°_{km−1}, g^{(k,m)})} be a sequence of Riemannian manifolds with metrics g^{(k,m)} of the form of Equation (19) for some B, C ∈ R. Then, every Lebanon map is an isometry.
Observe that these metrics agree with (a multiple of) the Fisher metric only if B = 0. The case B = 0 can also be characterized; note that Lebanon maps do not treat the two random variables symmetrically. Switching the two random variables corresponds to transposing the joint distribution matrix P . When exchanging the role of the two random variables, the Lebanon map becomes P → (P ⊗ Q) R. We call such a map a dual Lebanon map. If we require invariance under both Lebanon maps and their duals in Theorem 6 or Corollary 8, the statements remain true with the additional restriction that B = 0 (as a function or constant, respectively).

Invariance Metric Characterizations for Conditional Polytopes
According to Chentsov's theorem (Theorem 2), a natural metric on the probability simplex can be characterized by requiring that natural embeddings be isometries. Lebanon follows this axiomatic approach to characterize metrics on products of positive measures (Theorem 6). However, the maps considered by Lebanon destroy the row-normalization of conditional distributions; in general, they do not map conditional polytopes to conditional polytopes. Therefore, we will consider a slight modification of Lebanon maps, in order to obtain maps between conditional polytopes.

Stochastic Embeddings of Conditional Polytopes
A matrix of conditional distributions P (Y |X) in ∆ k m−1 can be regarded as the equivalence class of all joint probability distributions P (X, Y ) ∈ ∆ km−1 with conditional distribution P (Y |X). Which Markov maps of probability simplices are compatible with this equivalence relation? The most obvious examples are permutations (relabelings) of the state spaces of X and Y .
In information theory, stochastic matrices are also viewed as channels. If we input a distribution of X into the channel, the stochastic matrix determines the joint distribution of the pair (X, Y) and, hence, the marginal distribution of the output Y.
Channels can be combined, provided that the cardinalities of the state spaces fit together. If we take the output Y of the first channel P(Y|X) and feed it into another channel P(Y′|Y), then we obtain a combined channel P(Y′|X). The composition of channels corresponds to ordinary matrix multiplication: if the first channel is described by the stochastic matrix K and the second channel by Q, then the combined channel is described by K · Q. Observe that in this case, the joint distribution P (considered as a normalized matrix P ∈ ∆_{km−1}) is transformed similarly; that is, the joint distribution of the pair (X, Y′) is given by P · Q.
More general maps result from compositions where the choice of the second channel depends on the input of the first channel. In other words, we have a first channel that takes as input X and gives as output Y, and we have another channel that takes as input (X, Y) and gives as output Y′; we are interested in the resulting channel from X to Y′. The second channel can be described by a collection of stochastic matrices Q = {Q^{(i)}}_i. If K describes the first channel, then the combined channel is described by the row product K ⊗ Q (see Definition 5). Again, the joint distribution of (X, Y′) arises in a similar way, as P ⊗ Q.
We can also consider transformations of the first random variable X. Suppose that we use X as the input to a channel described by a stochastic matrix R. In this case, the joint distribution of the channel output X′ and Y is described by R^⊤ P. However, in general, there is not much that we can say about the conditional distribution of Y given X′; the result depends in an essential way on the original distribution of X. This changes in the special case that the channel is "not mixing", that is, in the case that R is a stochastic partition matrix. In this case, the conditional distribution P(Y|X′) is described by R̄^⊤ K, where R̄ is the corresponding partition indicator matrix, obtained from R by replacing all non-zero entries by one. In other words, each state of X corresponds to several states of X′, and the corresponding row of K is copied a corresponding number of times.
To sum up, if we combine the transformations due to Q and R, then the joint probability distribution transforms as P ↦ R^⊤(P ⊗ Q), and the conditional distribution transforms as K ↦ R̄^⊤(K ⊗ Q). In particular, for the joint distribution, we recover the definition of a Lebanon map. Figure 1 illustrates the situation.

Figure 1: An interpretation for Lebanon maps and conditional embeddings. The variable X′ is computed from X by R, and Y′ is computed from X and Y by Q. The joint distribution transforms as P′ = R^⊤(P ⊗ Q); the conditional distribution transforms as K′ = R̄^⊤(K ⊗ Q).
Finally, we will also consider the special case where the partition of R (and R) is homogeneous, i.e., such that all blocks have the same size. For example, this describes the case where there is a third random variable Z that is independent of Y given X. In this case, the conditional distribution satisfies P (Y |X) = P (Y |X, Z), and R describes the conditional distribution of (X, Z) given X.
Definition 9. A (row) partition indicator matrix is a matrix R ∈ {0, 1}^{k×l} that satisfies:

R_{ij} = 1 if and only if j ∈ A_i, (20)

for a k-block partition {A_1, . . . , A_k} of [l]. For example, the 3 × 5 partition indicator matrix corresponding to Equation (8) is:

    ( 1 1 0 0 0 )
R = ( 0 0 1 1 0 )
    ( 0 0 0 0 1 )

Definition 10. Consider a k × l partition indicator matrix R and a collection Q = {Q^{(i)}}_{i=1}^k of m × n stochastic partition matrices. We call the map:

f: M ∈ R^{k×m}_+ ↦ R^⊤ (M ⊗ Q) ∈ R^{l×n}_+ (22)

a conditional embedding of R^{k×m}_+ in R^{l×n}_+. We denote the set of all such maps by F̃^{l,n}_{k,m}. If R is the partition indicator matrix of a homogeneous partition (with partition blocks of equal cardinality), then we call f a homogeneous conditional embedding. We denote the set of all such homogeneous conditional embeddings by F^{l,n}_{k,m}; in this case, l is necessarily a multiple of k.
Conditional embeddings preserve the 1-norms of the matrix rows; that is, the elements of F̃^{l,n}_{k,m} map (∆^k_{m−1})° into (∆^l_{n−1})°. On the other hand, they do not preserve the 1-norm of the entire matrix, unless k = l. Conditional embeddings are Markov maps only when k = l, in which case they are also Lebanon maps.
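A small sketch (with illustrative matrices, assuming the matrix form K ↦ R^⊤(K ⊗ Q) of a conditional embedding) checks that such a map sends stochastic matrices to stochastic matrices:

```python
import numpy as np

def row_product(K, Qs):
    """Row product K (x) Q: row i of K is multiplied by the matrix Qs[i]."""
    return np.vstack([K[i] @ Qs[i] for i in range(K.shape[0])])

# Partition indicator matrix for the homogeneous partition {1,2}, {3,4} of [4]
R = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])                       # 2x4, entries in {0,1}
# Stochastic partition matrices, one per row of K
Q0 = np.array([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])  # 2x3
Q1 = np.array([[1.0, 0.0, 0.0], [0.0, 0.4, 0.6]])

K = np.array([[0.8, 0.2],
              [0.1, 0.9]])                         # a 2x2 stochastic matrix

image = R.T @ row_product(K, [Q0, Q1])             # conditional embedding, 4x3
print(np.allclose(image.sum(axis=1), 1.0))         # rows remain stochastic
```

Each row of K is first mapped by a Markov mapping (which preserves its 1-norm) and then copied, unscaled, into the rows of one partition block; this is exactly why the row normalization survives while the total 1-norm grows from k to l.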

Invariance Characterization
Considering the conditional embeddings discussed in the previous section, we obtain the following metric characterization.

Theorem 11.
• Let g^{(k,m)} denote a metric on R^{k×m}_+ for each k ≥ 1 and m ≥ 2. If every homogeneous conditional embedding f ∈ F^{l,n}_{k,m} is an isometry with respect to these metrics, then:

g^{(k,m)}_M(∂_{ij}, ∂_{ab}) = A / |M|² + δ_{ia} B / |M| + δ_{ia} δ_{jb} C / (|M| M_{ab}), (23)

for some constants A, B, C ∈ R, where ∂_{ab} = ∂/∂M_{ab} and |M| = Σ_{ab} M_{ab}.
• Conversely, given the metrics defined by Equation (23) for any non-degenerate choice of constants A, B, C ∈ R, each homogeneous conditional embedding f ∈ F^{l,n}_{k,m}, k ≤ l, m ≤ n, is an isometry.
• Moreover, Appendix A specifies for which choices of A, B and C the tensors g^{(k,m)} from Equation (23) are positive definite and, hence, Riemannian metrics.

The proof of Theorem 11 is similar to the proofs of the theorems of Chentsov, Campbell and Lebanon. Due to its technical nature, we defer it to Appendix B. Now, for the restriction of the metric g^{(k,m)} to (∆^k_{m−1})°, we have the following. In this case, |M| = k and Σ_b u_{ab} = 0 for all a, so the constants A and B become immaterial, and the metric can be written as:

g^{(k,m)}_K(u, v) = (C/k) Σ_{ab} u_{ab} v_{ab} / K_{ab}. (24)

This metric is a specialization of the metric (18), with C(k) = C/k.

Theorem 12. There is no family of Riemannian metrics on the spaces R^{k×m}_+, k ≥ 1, m ≥ 2, for which every conditional embedding f ∈ F̃^{l,n}_{k,m} is an isometry.

This negative result will become clearer from the perspective of Section 6: as we will show in Theorem 17, although there are no metrics that are invariant under all conditional embeddings, there are families of metrics (depending on a parameter, ρ) that transform covariantly (that is, in a well-defined manner) with respect to the conditional embeddings. We defer the proof of Theorem 12 to Appendix B.

The Fisher Metric on Polytopes and Point Configurations
In the previous section, we obtained distinguished Riemannian metrics on R k×m + and (∆ k m−1 ) • by postulating invariance under natural maps. In this section, we take another viewpoint based on general considerations about Riemannian metrics on arbitrary polytopes. This is achieved by embedding each polytope in a probability simplex as an exponential family. We first recall the necessary background. In Section 5.2, we then present our general results, and in Section 5.3, we discuss the special case of conditional polytopes.

Exponential Families and Polytopes
Let X be a finite set and A ∈ R^{d×X} a matrix with columns a_x indexed by x ∈ X. It will be convenient to consider the rows A_i, i ∈ [d], of A as functions A_i: X → R. Finally, let ν: X → R_+. The exponential family E_{A,ν} is the set of probability distributions on X given by:

p(x; θ) = exp(θ^⊤ a_x + log ν(x) − log Z(θ)), for all x ∈ X,

for all θ ∈ R^d, with the normalization function Z(θ) = Σ_{x′∈X} exp(θ^⊤ a_{x′} + log ν(x′)). The functions A_i are called the observables, and ν is the reference measure of the exponential family. When the reference measure ν is constant, ν(x) = 1 for all x ∈ X, we omit the subscript and write E_A.
A direct calculation shows that the Fisher information matrix of E_{A,ν} at a point θ ∈ R^d has coordinates:

g_{ij}(θ) = cov_θ(A_i, A_j).

Here, cov_θ denotes the covariance computed with respect to the probability distribution p(·; θ). The convex support of E_{A,ν} is defined as:

cs(E_{A,ν}) := conv{a_x : x ∈ X} = conv A,

where conv S is the set of all convex combinations of points in S. The moment map µ: p ∈ ∆_{|X|−1} ↦ A · p ∈ R^d restricts to a homeomorphism from Ē_{A,ν} to conv A; see [5]. Here, Ē_{A,ν} denotes the Euclidean closure of E_{A,ν}. The inverse of µ will be denoted by µ^{−1}: conv A → Ē_{A,ν} ⊆ ∆_{|X|−1}. This gives a natural embedding of the polytope conv A in the probability simplex ∆_{|X|−1}. Note that the convex support is independent of the reference measure ν. See [6] for more details.
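The identity between the Fisher information matrix and the covariance of the observables is easy to verify numerically. The sketch below (using the unit square as an assumed example configuration) compares cov_θ(A_i, A_j) with the pull-back expression Σ_x (∂_i p)(∂_j p)/p:

```python
import numpy as np

# A weighted point configuration: the columns a_x of A are the observables.
A = np.array([[0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])      # vertices of the unit square
nu = np.ones(4)                           # constant reference measure

def p_theta(theta):
    """Point of the exponential family E_{A,nu} at parameter theta."""
    w = nu * np.exp(theta @ A)
    return w / w.sum()

def fisher_info(theta):
    """Fisher information matrix, computed as cov_theta(A_i, A_j)."""
    p = p_theta(theta)
    mean = A @ p                          # the moment map mu(p) = A p
    centred = A - mean[:, None]
    return (centred * p) @ centred.T

theta = np.array([0.3, -0.5])
G = fisher_info(theta)

# Cross-check against the pull-back expression sum_x (d_i p)(d_j p)/p.
h = 1e-5
J = np.stack([(p_theta(theta + h*e) - p_theta(theta - h*e)) / (2*h)
              for e in np.eye(2)])        # rows: d p / d theta_i
G2 = (J / p_theta(theta)) @ J.T
print(np.allclose(G, G2, atol=1e-8))
```

The agreement reflects that the Fisher information of an exponential family is the covariance matrix of its sufficient statistics.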

Invariance Fisher Metric Characterizations for Polytopes
Let P ⊆ R^d be a polytope with n vertices a_1, . . . , a_n. Let A = (a_1, . . . , a_n) be the matrix with columns a_i ∈ R^d for all i ∈ [n]. Then, E_A ⊆ ∆°_{n−1} is an exponential family with convex support P. We will also denote this exponential family by E_P. We can use the inverse of the moment map, µ^{−1}, to pull back geometric structures on ∆°_{n−1} to the relative interior P° of P.
Definition 13. The Fisher metric on P • is the pull-back of the Fisher metric on E A ⊆ ∆ • n−1 by µ −1 .
Some obvious questions are: Why is this a natural construction? Which maps between polytopes are isometries between their Fisher metrics? Can we find a characterization of Chentsov type for this metric?
Affine maps are natural maps between polytopes. However, in order to obtain isometries, we need to impose additional constraints. Consider two polytopes P ⊆ R^d and P′ ⊆ R^{d′} and an affine map φ: R^d → R^{d′} that satisfies φ(P) ⊆ P′. A natural condition in the context of exponential families is that φ restricts to a bijection between the set vert(P) of vertices of P and the set vert(P′) of vertices of P′. In this case, both polytopes have n vertices, and E_{P′} ⊆ E_P ⊆ ∆°_{n−1}. Moreover, the moment map µ′ of P′ factorizes through the moment map µ of P, µ′ = φ ∘ µ; that is, the following diagram commutes:

            µ
  ∆°_{n−1} ——→ P°
      ‖              ↓ φ
  ∆°_{n−1} ——→ P′°          (28)
            µ′

Let φ^{−1} := µ ∘ µ′^{−1}. It follows that φ^{−1} is an isometry from P′° to its image in P°. Observe that the inverse moment map itself arises in this way: in the diagram (28), if P is equal to ∆_{n−1}, then the upper moment map µ is the identity map, and φ^{−1} equals the inverse moment map µ′^{−1} of P′. The constraint of mapping vertices to vertices bijectively is very restrictive. In order to consider a larger class of affine maps, we need to generalize our construction from polytopes to weighted point configurations.

Definition 14.
A weighted point configuration is a pair (A, ν) consisting of a matrix A ∈ R d×n with columns a 1 , . . . , a n and a positive weight function ν : {1, . . . , n} → R + assigning a weight to each column a i . The pair (A, ν) defines the exponential family E A,ν .
The (A, ν)-Fisher metric on (conv A) • is the pull-back of the Fisher metric on ∆ • n−1 through the inverse of the moment map.
We recover Definition 13 as follows. For a polytope P, let A be the point configuration consisting of the vertices of P. Moreover, let ν be a constant function. Then, E P = E A,ν , and the two definitions of the Fisher metric on P • coincide.
The following are natural maps between weighted point configurations:

Definition 15. Let (A, ν) and (A′, ν′) be weighted point configurations with A ∈ R^{d×n} and A′ ∈ R^{d′×n′}. A morphism from (A, ν) to (A′, ν′) is a pair (φ, σ) consisting of an affine map φ: R^d → R^{d′} and a surjective map σ: [n] → [n′], such that φ(a_i) = a′_{σ(i)} for all i ∈ [n] and ν′(j) = Σ_{i∈σ^{−1}(j)} ν(i) for all j ∈ [n′].

Given a morphism (φ, σ), let Q ∈ R^{n′×n} be the matrix with entries Q_{ji} = ν(i)/ν′(j) if σ(i) = j, and Q_{ji} = 0 otherwise. Then, Q is a Markov mapping, and the following diagram commutes:

               Q
   E_{A′,ν′} ——→ E_{A,ν}
       ↑ µ′^{−1}          ↓ µ
  (conv A′)° ——→ (conv A)°
             φ^{−1}

where φ^{−1} := µ ∘ Q ∘ µ′^{−1}. By Chentsov's theorem (Theorem 2), Q is an isometric embedding. It follows that φ^{−1} is an isometric embedding of (conv A′)° into (conv A)° with respect to the (A′, ν′)- and (A, ν)-Fisher metrics. This shows the first part of the following theorem:

Theorem 16.
• For every weighted point configuration (A, ν), equip (conv A)° with the (A, ν)-Fisher metric. Then, for every morphism (φ, σ) from (A, ν) to (A′, ν′), the induced map φ^{−1} is an isometric embedding.
• Conversely, suppose that every weighted point configuration (A, ν) is assigned a Riemannian metric g_{(A,ν)} on (conv A)° such that the induced map of every morphism is an isometric embedding. Then, there is a constant C > 0 such that g_{(A,ν)} is C times the (A, ν)-Fisher metric, for every (A, ν).

Proof. The first statement follows from the discussion before the theorem. For the second statement, we show that under the given assumptions, all Markov maps are isometric embeddings. By Chentsov's theorem (Theorem 2), this implies that the metrics g_P agree with the Fisher metric, up to a constant, whenever P is a simplex. The statement then follows from the two facts that the metric on P° or (conv A)° is the pull-back of the Fisher metric through the inverse of the moment map and that µ^{−1} is itself a morphism.
Observe that ∆_{n−1} = conv I_n = conv{e_1, . . . , e_n} is a polytope, and ∆°_{n−1} is the corresponding exponential family. Consider a Markov embedding Q: ∆°_{n′−1} → ∆°_{n−1}. Let ν(i) = Σ_j Q_{ji} be the value of the unique non-zero entry of Q in the i-th column. This defines a morphism and an embedding as follows: let A be the matrix that arises from Q by replacing each non-zero entry by one. We define φ as the linear map represented by the matrix A, and define σ: [n] → [n′] by σ(j) = i if and only if a_j = e_i; that is, σ(j) indicates the row i in which the j-th column of A is non-zero. Then, (φ, σ) is a morphism from (I_n, ν) to (I_{n′}, 1), and by assumption, the induced map φ^{−1} is an isometric embedding of ∆°_{n′−1} into ∆°_{n−1}. However, φ^{−1} is equal to the Markov map Q. This shows that all Markov maps are isometric embeddings, and so, by Chentsov's theorem, the statement holds true on the simplices.

Theorem 16 defines a natural metric on (∆^k_{m−1})°, which we want to discuss in more detail next.

Independence Models and Conditional Polytopes
Consider k random variables with finite state spaces [n_1], . . . , [n_k]. The independence model consists of all joint distributions p ∈ ∆_{n_1···n_k − 1} of these variables that factorize as:

p(y_1, . . . , y_k) = p_1(y_1) · · · p_k(y_k), (31)

where p_i ∈ ∆_{n_i−1} for all i ∈ [k]. Assuming fixed n_1, . . . , n_k, we denote the independence model by E_k. It is the Euclidean closure of an exponential family (with observables of the form δ_{iy_i}). The convex support of E_k is equal to the product of simplices P_k := ∆_{n_1−1} × · · · × ∆_{n_k−1}. The parametrization (31) corresponds to the inverse of the moment map. We can write any tangent vector u ∈ T_{(p_1,...,p_k)} P°_k of this open product of simplices as a sum u = u_1 + · · · + u_k with u_i ∈ T_{p_i} ∆°_{n_i−1}. Given two such tangent vectors, the Fisher metric is given by:

g_{(p_1,...,p_k)}(u, v) = Σ_{i=1}^k Σ_{y_i∈[n_i]} u_i(y_i) v_i(y_i) / p_i(y_i). (32)

Just as the convex support of the independence model is the Cartesian product of probability simplices, the Fisher metric on the independence model is the product metric of the Fisher metrics on the probability simplices of the individual variables. If n_1 = · · · = n_k =: n, then P_k = ∆^k_{n−1} can be identified with the set of k × n stochastic matrices.
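The product structure of the Fisher metric on the independence model, as in Equation (32), can be verified numerically (the distributions and tangent vectors below are illustrative):

```python
import numpy as np

def fisher(p, u, v):
    """Fisher metric on the open probability simplex."""
    return np.sum(u * v / p)

# Two factors: p1 in the interior of Delta_2, p2 in the interior of Delta_1
p1, p2 = np.array([0.2, 0.3, 0.5]), np.array([0.6, 0.4])
u1, u2 = np.array([0.1, -0.1, 0.0]), np.array([0.05, -0.05])
v1, v2 = np.array([0.0, 0.2, -0.2]), np.array([-0.1, 0.1])

# Push-forward of (u1, u2) under the parametrization p = p1 (x) p2
push_u = np.outer(u1, p2) + np.outer(p1, u2)
push_v = np.outer(v1, p2) + np.outer(p1, v2)

joint = np.outer(p1, p2)                       # point of the independence model
lhs = fisher(joint.ravel(), push_u.ravel(), push_v.ravel())
rhs = fisher(p1, u1, v1) + fisher(p2, u2, v2)  # product metric as in (32)
print(np.isclose(lhs, rhs))
```

The cross terms between the factors vanish because tangent vectors of a simplex have zero coordinate sum, which is exactly why the pulled-back metric splits as a product.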
The Fisher metric on the product of simplices is equal to the product of the Fisher metrics on the factors. More generally, if P = Q_1 × Q_2 is a Cartesian product, then the Fisher metric on P° is equal to the product of the Fisher metrics on Q°_1 and Q°_2. In fact, in this case, the inverse of the moment map of P can be expressed in terms of the two moment map inverses μ_1^{−1} and μ_2^{−1} of the factors. Therefore, the pull-back by μ^{−1} factorizes through the pull-back by μ̄^{−1}, and since the independence model carries a product metric, the product of polytopes also carries a product metric.

Let us compare the metric g^{(k,m)}_K from Equation (24) with the Fisher metric g^{P_k}_{(K_1,…,K_k)} from Equation (32) on the product of simplices P° = (∆^k_{m−1})°. In both cases, the metric is a product metric; that is, it has the form:

g_{(K_1,…,K_k)}(u, v) = Σ_{i∈[k]} g_i(u_i, v_i),

where g_i is a metric on the i-th factor ∆°_{m−1}. For g^{P_k}, g_i is equal to the Fisher metric on ∆°_{m−1}. However, for g^{(k,m)}_K, g_i is equal to 1/k times the Fisher metric on ∆°_{m−1}. Since this factor only depends on k, it only plays a role when stochastic matrices of different sizes are compared.

The additional factor of 1/k can be interpreted as the uniform distribution on k elements. This is related to another, more general class of Riemannian metrics that are used in applications; namely, given a map K ↦ ρ_K assigning to each K ∈ ∆^k_{m−1} a weight vector ρ_K ∈ R^k_+, it is common to use product metrics of the form:

g_K(u, v) = Σ_{i∈[k]} ρ_K(i) g^{(m)}_{K_i}(u_i, v_i), (34)

that is, with g_i equal to ρ_K(i) times the Fisher metric on ∆°_{m−1}. When K has the interpretation of a channel, or when K describes the policy by which a system reacts to some sensor values, a natural possibility is to let ρ_K be the stationary distribution of the channel input or of the sensor values, respectively. We will discuss this approach in Section 6.

Weighted Product Metrics for Conditional Models
In this section, we consider metrics on spaces of stochastic matrices defined as weighted sums of the Fisher metrics on the spaces of the matrix rows, similar to Equation (34). This kind of metric was used initially by Amari [1] in order to define a natural gradient in the supervised learning context. Later, in the context of reinforcement learning, Kakade [10] defined a natural policy gradient based on this kind of metric, which has been further developed by Peters et al. [16]. Related applications within unsupervised learning have been pursued by Zahedi et al. [20].
Consider the following weighted product Fisher metric:

g^{ρ,m}_K(u, v) = Σ_{a∈[k]} ρ_K(a) g^{(m)}_{K_a}(u_a, v_a), (35)

where g^{(m)}_{K_a} denotes the Fisher metric of ∆°_{m−1} at the a-th row K_a of K, and ρ_K ∈ ∆°_{k−1} is a probability distribution over a associated with each K ∈ (∆^k_{m−1})°. For example, the distribution ρ_K could be the stationary distribution of sensor values observed by an agent when operating under a policy described by K. In the following, we will try to illuminate the properties of polytope embeddings that yield the metric (35) as the pull-back of the Fisher information metric on a probability simplex. We will focus on the case where ρ_K = ρ is independent of K.
There are two direct ways of embedding ∆^k_{m−1} in a probability simplex. In Section 5, we used the inverse of the moment map of an exponential family, possibly with some reference measure. This embedding is illustrated in the left panel of Figure 2. If we are given a fixed probability distribution ρ ∈ ∆°_{k−1}, there is a second natural embedding ψ_ρ: ∆^k_{m−1} → ∆_{k·m−1}, defined by:

ψ_ρ(K) := (ρ(a) K(a, b))_{a∈[k], b∈[m]}.

If ρ is the distribution of a random variable X and K ∈ ∆^k_{m−1} is the stochastic matrix describing the conditional distribution of another variable Y given X, then ψ_ρ(K) is the joint distribution of X and Y. Note that ψ_ρ is an affine embedding. See the right panel of Figure 2 for an illustration.
The pull-back of the Fisher metric on ∆°_{km−1} through ψ_ρ is given by:

(ψ*_ρ g^{(km)})_K(u, v) = Σ_{a∈[k]} Σ_{b∈[m]} (ρ(a) u(a,b)) (ρ(a) v(a,b)) / (ρ(a) K(a,b)) = Σ_{a∈[k]} ρ(a) g^{(m)}_{K_a}(u_a, v_a).

This recovers the weighted sum of Fisher metrics from Equation (35) with ρ_K = ρ.
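This pull-back computation is easy to verify numerically. A minimal sketch (plain Python; the fixed ρ and the random stochastic matrix are our test values) compares the Fisher metric of the joint distribution ψ_ρ(K) with the weighted sum of row metrics:

```python
import random

def fisher(p, u, v):
    # Fisher metric on the open simplex: g_p(u, v) = sum_i u_i v_i / p_i
    return sum(ui * vi / pi for pi, ui, vi in zip(p, u, v))

def tangent(m):
    w = [random.gauss(0, 1) for _ in range(m)]
    mean = sum(w) / m
    return [x - mean for x in w]

random.seed(1)
k, m = 3, 4
rho = [0.2, 0.3, 0.5]                   # fixed input distribution
K = []
for _ in range(k):
    w = [random.random() + 0.1 for _ in range(m)]
    s = sum(w)
    K.append([x / s for x in w])        # rows of a stochastic matrix

U = [tangent(m) for _ in range(k)]      # tangent vector: one simplex tangent per row
V = [tangent(m) for _ in range(k)]

# psi_rho(K)(a, b) = rho_a K_ab is affine, so its differential acts entrywise
joint   = [rho[a] * K[a][b] for a in range(k) for b in range(m)]
joint_u = [rho[a] * U[a][b] for a in range(k) for b in range(m)]
joint_v = [rho[a] * V[a][b] for a in range(k) for b in range(m)]

pullback = fisher(joint, joint_u, joint_v)
weighted = sum(rho[a] * fisher(K[a], U[a], V[a]) for a in range(k))
assert abs(pullback - weighted) < 1e-10
```

One factor of ρ(a) in the numerator cancels against the denominator, which is exactly how the weight ρ(a) survives in Equation (35).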
Are there natural maps that leave the metrics g^{ρ,m} invariant? Let us reconsider the stochastic embeddings from Definition 10. Let R̄ be a k × l indicator partition matrix and R a stochastic partition matrix with the same block structure as R̄. Observe that to each indicator partition matrix R̄ there are many compatible stochastic partition matrices R, but the indicator partition matrix R̄ for any stochastic partition matrix R is unique. Furthermore, let Q = {Q^{(a)}}_{a∈[k]} be a collection of stochastic partition matrices. The corresponding conditional embedding f maps K ∈ ∆^k_{m−1} to f(K) := R̄(K ⊗ Q) ∈ ∆^l_{n−1}. Let ρ ∈ ∆°_{k−1}. Suppose that K describes the conditional distribution of Y given X and that ψ_ρ(K) describes the joint distribution of Y and X. As explained in Section 4.1, the matrix f̄(P) := R(P ⊗ Q) describes the joint distribution of a pair of random variables (X′, Y′), and the conditional distribution of Y′ given X′ is given by f(K). In this situation, the marginal distribution of X′ is given by ρ′ = ρR. Therefore, the following diagram commutes:

ψ_{ρR} ∘ f = f̄ ∘ ψ_ρ. (38)

The preceding discussion implies the first statement of the following result: Theorem 17.
Proof. The first statement follows from the commutative diagram (38). For the second statement, denote by ρ_k the uniform distribution on a set of k elements. If f: K ↦ R̄(K ⊗ Q) is a homogeneous conditional embedding of ∆^k_{m−1} in ∆^l_{n−1}, then R = (k/l) R̄ is a stochastic partition matrix corresponding to the partition indicator matrix R̄. Observe that ρ_l = ρ_k R. Therefore, the family of Riemannian metrics g^{ρ_k,m} on ∆^k_{m−1} satisfies the assumptions of Theorem 11, and so there is a constant A > 0 for which g^{ρ_k,m} equals A/k times the product Fisher metric. This proves the statement for uniform distributions ρ.
A general distribution ρ ∈ ∆°_{k−1} can be approximated by distributions with rational probabilities. Since g^{ρ,m} is assumed to be continuous, it suffices to prove the statement for rational ρ. In this case, there exists a stochastic partition matrix R for which ρ′ := ρR is a uniform distribution, and so, g^{ρ′,n} is of the desired form. Equation (39) then shows that g^{ρ,m} is also of the desired form.
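The bookkeeping in the homogeneous case of the proof can be illustrated concretely. A minimal sketch (plain Python; the particular partition of [l] is our choice) for k = 2, l = 4 checks that R = (k/l) R̄ is stochastic and that it maps the uniform distribution on [k] to the uniform distribution on [l]:

```python
k, l = 2, 4
# Indicator partition matrix R_bar for the homogeneous partition {1,2}, {3,4} of [l]
R_bar = [[1, 1, 0, 0],
         [0, 0, 1, 1]]
# R = (k/l) R_bar is then a stochastic partition matrix with the same block structure
R = [[(k / l) * x for x in row] for row in R_bar]
assert all(abs(sum(row) - 1.0) < 1e-12 for row in R)

# The uniform distribution on [k] is mapped to the uniform distribution on [l]
rho_k = [1.0 / k] * k
rho_l = [sum(rho_k[a] * R[a][b] for a in range(k)) for b in range(l)]   # rho_k R
assert all(abs(x - 1.0 / l) < 1e-12 for x in rho_l)
```

Homogeneity (all blocks of size l/k) is what makes the single scalar factor k/l work for every row at once.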

Gradient Fields and Replicator Equations
In this section, we use gradient fields in order to compare Riemannian metrics on the space (∆^k_{n−1})°.

Replicator Equations
We start with gradient fields on the simplex ∆ • n−1 . A Riemannian metric g on ∆ • n−1 allows us to consider gradient fields of differentiable functions F : ∆ • n−1 → R.
To be more precise, consider the differential d_pF: T_p∆°_{n−1} → R, the linear form which maps each tangent vector u to d_pF(u) = ∂F/∂u (p) ∈ R. Using the map u ↦ g_p(u, ·), this linear form can be identified with a tangent vector in T_p∆°_{n−1}, which we denote by grad_p F. If we choose the Fisher metric g^{(n)} as the Riemannian metric, we obtain the gradient in the following way. First, consider a differentiable extension of F to the positive cone R^n_+, which we will denote by the same symbol F. With the partial derivatives ∂_iF of F, the Fisher gradient of F on the simplex ∆°_{n−1} is given as:

(grad_p F)_i = p_i (∂_iF(p) − Σ_{j∈[n]} p_j ∂_jF(p)), i ∈ [n]. (40)

Note that the expression on the right-hand side of Equation (40) does not depend on the particular differentiable extension of F to R^n_+. The corresponding differential equation is well known in theoretical biology as the replicator equation; see [9,2].
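Equation (40) can be verified directly from the defining property g_p(grad_p F, u) = d_pF(u). A small sketch (plain Python; the test function F(p) = Σ_i c_i p_i² is an arbitrary choice of ours):

```python
import random

random.seed(2)
n = 5
c = [random.gauss(0, 1) for _ in range(n)]
w = [random.random() + 0.1 for _ in range(n)]
s = sum(w)
p = [x / s for x in w]                      # a point in the open simplex

# Test function F(p) = sum_i c_i p_i^2, extended to the positive cone; its partials:
dF = [2 * c[i] * p[i] for i in range(n)]

# Fisher gradient, Equation (40): (grad_p F)_i = p_i (dF_i - sum_j p_j dF_j)
avg = sum(p[j] * dF[j] for j in range(n))
grad = [p[i] * (dF[i] - avg) for i in range(n)]
assert abs(sum(grad)) < 1e-12               # grad is tangent to the simplex

# Defining property of the gradient: g_p(grad_p F, u) = d_p F(u) for any tangent u
u = [random.gauss(0, 1) for _ in range(n)]
mean_u = sum(u) / n
u = [x - mean_u for x in u]
g_grad_u = sum(grad[i] * u[i] / p[i] for i in range(n))
dF_u = sum(dF[i] * u[i] for i in range(n))
assert abs(g_grad_u - dF_u) < 1e-10
```

The subtraction of the average Σ_j p_j ∂_jF is exactly what makes the gradient tangent to the simplex, and it is also why the formula is independent of the chosen extension of F.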
We now apply this gradient formula to functions that have the structure of an expectation value. Given real numbers F_i, i ∈ [n], referred to as fitness values, we consider the mean fitness:

F̄(p) = Σ_{i∈[n]} p_i F_i.

Replacing the p_i by any positive real numbers leads to a differentiable extension of F̄, also denoted by F̄. Obviously, we have ∂_iF̄ = F_i, which leads to the following replicator equation:

ṗ_i = p_i (F_i − Σ_{j∈[n]} p_j F_j), i ∈ [n].

This equation has the solution:

p_i(t) = p_i(0) e^{F_i t} / Σ_{j∈[n]} p_j(0) e^{F_j t}, i ∈ [n].

Clearly, the mean fitness will increase along this solution of the gradient field. The rate of increase can be easily calculated:

d/dt F̄(p(t)) = Σ_{i∈[n]} p_i(t) (F_i − F̄(p(t)))², (45)

which is the variance of the fitness values with respect to p(t). As limit points of this solution, we obtain:

lim_{t→+∞} p_i(t) = p_i(0) / Σ_{j∈argmax_l F_l} p_j(0) if i ∈ argmax_l F_l, and 0 otherwise,

and:

lim_{t→−∞} p_i(t) = p_i(0) / Σ_{j∈argmin_l F_l} p_j(0) if i ∈ argmin_l F_l, and 0 otherwise.
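That the closed-form expression really solves the replicator equation can be checked by differentiating it numerically. A sketch (plain Python; random fitness values, central finite differences):

```python
import math, random

random.seed(3)
n = 4
F = [random.gauss(0, 1) for _ in range(n)]          # fitness values
w = [random.random() + 0.1 for _ in range(n)]
s = sum(w)
p0 = [x / s for x in w]                             # initial condition p(0)

def p(t):
    # closed-form solution p_i(t) = p_i(0) e^{F_i t} / sum_j p_j(0) e^{F_j t}
    z = [p0[i] * math.exp(F[i] * t) for i in range(n)]
    zs = sum(z)
    return [x / zs for x in z]

t, h = 1.0, 1e-5
pt = p(t)
mean_fitness = sum(pt[i] * F[i] for i in range(n))
for i in range(n):
    dpdt = (p(t + h)[i] - p(t - h)[i]) / (2 * h)    # numerical derivative
    rhs = pt[i] * (F[i] - mean_fitness)             # replicator right-hand side
    assert abs(dpdt - rhs) < 1e-7
```

The normalizing denominator is what turns the unconstrained exponential growth e^{F_i t} into a flow on the simplex.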
One way to deal with this is to consider for each i ∈ [k] the corresponding replicator equation:

K̇_{ij} = K_{ij} (∂_{ij}F(K) − Σ_{l∈[n]} K_{il} ∂_{il}F(K)), j ∈ [n].

Obviously, this is the gradient field that one obtains by using the product Fisher metric on (∆^k_{n−1})° (Equation (17)). If we replace the metric by the weighted product Fisher metric considered by Kakade (Equation (35)), then we obtain:

K̇_{ij} = (1/ρ_i) K_{ij} (∂_{ij}F(K) − Σ_{l∈[n]} K_{il} ∂_{il}F(K)), j ∈ [n].

The Example of Mean Fitness
Next, we want to study how the gradient flows with respect to different metrics compare. We restrict to the class of metrics g^{ρ,m} (Equation (35)), where ρ ∈ ∆°_{k−1} is a probability distribution. In principle, one could drop the normalization condition Σ_i ρ_i = 1 and allow arbitrary positive coefficients ρ_i. However, it is clear that the rate of convergence can always be increased by scaling all of the values ρ_i with a common positive factor. Therefore, some normalization condition is needed for ρ.
With a probability distribution p ∈ ∆°_{k−1} and fitness values F_{ij}, let us consider again the example of an expectation value function:

F̄(K) = Σ_{i∈[k]} p_i Σ_{j∈[n]} K_{ij} F_{ij}.

With ∂_{ij}F̄(K) = p_i F_{ij}, this leads to:

K̇_{ij} = (p_i/ρ_i) K_{ij} (F_{ij} − Σ_{l∈[n]} K_{il} F_{il}), j ∈ [n].

The corresponding solutions are given by:

K_{ij}(t) = K_{ij}(0) e^{(p_i/ρ_i) F_{ij} t} / Σ_{l∈[n]} K_{il}(0) e^{(p_i/ρ_i) F_{il} t}.

Since argmax_j ((p_i/ρ_i) F_{ij}) and argmin_j ((p_i/ρ_i) F_{ij}) are independent of ρ_i > 0, the limit points are given independently of the chosen ρ as:

lim_{t→+∞} K_{ij}(t) = K_{ij}(0) / Σ_{l∈argmax F_{i·}} K_{il}(0) if j ∈ argmax_l F_{il}, and 0 otherwise,

and:

lim_{t→−∞} K_{ij}(t) = K_{ij}(0) / Σ_{l∈argmin F_{i·}} K_{il}(0) if j ∈ argmin_l F_{il}, and 0 otherwise.

This is consistent with the fact that the critical points of gradient fields are independent of the chosen Riemannian metric. However, the speed of convergence does depend on the metric: For each i, let G_i = max_j F_{ij} and g_i = max_{j∉argmax_l F_{il}} F_{ij} be the largest and second-largest values in the i-th row of (F_{ij}), respectively. Then, as t → ∞,

1 − Σ_{j∈argmax_l F_{il}} K_{ij}(t) = O(e^{−(p_i/ρ_i)(G_i − g_i) t}).

Therefore,

max_{ij} |K_{ij}(t) − lim_{s→∞} K_{ij}(s)| = O(e^{−inf_i {(p_i/ρ_i)(G_i − g_i)} t}).

Thus, in the long run, the rate of convergence is given by inf_i {(p_i/ρ_i)(G_i − g_i)}, which depends on the parameter ρ of the metric. As a result, in this case study, the optimal choice of ρ, i.e., the one with the largest convergence rate, can be computed if the numbers G_i and g_i are known.
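The ρ-independence of the limit points can be seen numerically from the closed-form solution. A sketch (plain Python; the fitness matrix is our choice, with a deliberate tie in the second row to show how the initial mass is split):

```python
import math

k, m = 2, 3
F = [[0.5, 2.0, 1.0],
     [1.5, 0.2, 1.5]]                 # fitness values; the second row has tied maxima
p = [0.4, 0.6]
K0 = [[1.0 / m] * m for _ in range(k)]

def K(t, rho):
    # closed-form solution K_ij(t) for constant weights rho
    out = []
    for i in range(k):
        z = [K0[i][j] * math.exp((p[i] / rho[i]) * F[i][j] * t) for j in range(m)]
        zs = sum(z)
        out.append([x / zs for x in z])
    return out

# The limit does not depend on rho: each row concentrates on argmax_j F_ij,
# with the initial mass renormalized over the tied maxima.
for rho in ([0.5, 0.5], [0.4, 0.6], [0.9, 0.1]):
    Kt = K(60.0, rho)
    assert abs(Kt[0][1] - 1.0) < 1e-6      # unique maximum at j = 1 in the first row
    assert abs(Kt[1][0] - 0.5) < 1e-6      # tied maxima at j = 0 and j = 2
    assert abs(Kt[1][2] - 0.5) < 1e-6      # share the (equal) initial mass
```

What does change with ρ is how fast each row approaches its limit, in line with the rate inf_i {(p_i/ρ_i)(G_i − g_i)}.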
Consider, for example, the case where the differences G_i − g_i are of comparable sizes for all i. Then, we need to find the choice of ρ that maximizes inf_i {p_i/ρ_i}. Clearly, inf_i {p_i/ρ_i} ≤ 1 (since there is always an index i with p_i ≤ ρ_i). Equality is attained for the choice ρ_i = p_i. Thus, we recover the choice of Kakade.

Conclusions
So, which Riemannian metric should one use in practice on the set of stochastic matrices, (∆^k_{m−1})°? The results provided in this manuscript give different answers, depending on the approach. In all cases, the characterized Riemannian metrics are products of Fisher metrics with suitable factor weights. Theorem 11 suggests using a factor weight proportional to 1/k, and Theorem 16 suggests using a constant weight independent of k. In many cases, it is possible to work within a single conditional polytope (∆^k_{m−1})° with a fixed k, and then these two results are essentially equivalent. On the other hand, Theorem 17 gives an answer that allows arbitrary factor weights ρ.
Which metric performs best obviously depends on the concrete application. The first observation is that in order to use the metric g^{ρ,m} of Theorem 17, it is necessary to know ρ. If the problem at hand suggests a natural marginal distribution ρ, then it is natural to make use of it and choose the metric g^{ρ,m}. Even if ρ is not known at the beginning, a learning system might try to learn it in order to improve its performance.
On the other hand, there may be situations where there is no natural choice of the weights ρ. Observe that ρ breaks the symmetry of permuting the rows of a stochastic matrix. This is also expressed by the structural difference between Theorems 11 and 16 on the one side and Theorem 17 on the other. While the first two theorems provide an invariance characterization of the metric, Theorem 17 provides a "covariance" classification; that is, the metrics g^{ρ,m} are not invariant under conditional embeddings, but they transform in a controlled manner. This again illustrates that the choice of a metric should depend on which mappings are natural to consider, e.g., which mappings describe the symmetries of a given problem.
For example, consider a utility function of the form F = Σ_i ρ_i Σ_j K_{ij} F_{ij}. Row permutations do not leave g^{ρ,m} invariant (for a general ρ), but they are not symmetries of the utility function F either, and hence, they are not very natural mappings to consider. However, row permutations transform the metric g^{ρ,m} and the utility function in a controlled manner, in such a way that the two transformations match. Therefore, in this case, it is natural to use g^{ρ,m}. On the other hand, when studying problems that are symmetric under all row permutations, it is more natural to use the invariant metric g^{(k,m)}.

matrices is open, these inequalities have to be strictly satisfied. In the following, we study sufficient conditions.
For any given M ∈ R^{k×m}_+, we can write Equation (A2) as a product V^⊤GV, for all V ∈ R^{km}, where G = G_A + G_B + G_C ∈ R^{km×km} is the sum of a matrix G_A with all entries equal to A(|M|), a block diagonal matrix G_B whose a-th block has all entries equal to (|M|/|M_a|) B(|M|), and a diagonal matrix G_C with diagonal entries equal to (|M|/M_ab) C(|M|). The matrix G is obviously symmetric, and by Sylvester's criterion, it is positive definite if and only if all of its leading principal minors are positive. We can evaluate the minors using Sylvester's determinant theorem, which states that for any invertible m × m matrix X, an m × n matrix Y and an n × m matrix Z, one has the equality det(X + Y Z) = det(X) det(I_n + Z X^{−1} Y).
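Sylvester's determinant theorem is easy to confirm numerically. A minimal sketch (plain Python; the determinant and inversion helpers are simple, unoptimized implementations of ours):

```python
import random

def det(M):
    # determinant by Laplace expansion along the first row (fine for small matrices)
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def inv(M):
    # Gauss-Jordan inversion with partial pivoting
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        d = A[col][col]
        A[col] = [x / d for x in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

random.seed(5)
m, n = 4, 2
X = [[random.gauss(0, 1) + (3.0 if i == j else 0.0) for j in range(m)] for i in range(m)]
Y = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]   # m x n
Z = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]   # n x m

YZ = matmul(Y, Z)
lhs = det([[X[i][j] + YZ[i][j] for j in range(m)] for i in range(m)])  # det(X + YZ)
ZXinvY = matmul(matmul(Z, inv(X)), Y)                                  # n x n
rhs = det(X) * det([[(1.0 if i == j else 0.0) + ZXinvY[i][j] for j in range(n)]
                    for i in range(n)])                                # det(X) det(I_n + Z X^-1 Y)
assert abs(lhs - rhs) < 1e-6 * (1.0 + abs(rhs))
```

The payoff of the theorem is that it trades an m × m determinant for an n × n one, which is exactly how the low-rank structure of G_A + G_B is exploited below.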
Let us consider a leading square block G′, consisting of all entries G_{ab,cd} of G with row-index pairs (a, b) satisfying b ∈ [m] for all a < a′ and b ≤ b′ for a = a′, for some a′ ≤ k and b′ ≤ m, and with the same restriction for the column-index pairs. The corresponding block G′_A + G′_B can be written as the rank-a′ matrix Y Z, with Y consisting of the columns 1_a for all a ≤ a′ and Z consisting of the rows A + 1_a (|M|/|M_a|) B for all a ≤ a′. Hence, the determinant of G′ is equal to:

det(G′) = det(G′_C) det(I_{a′} + Z (G′_C)^{−1} Y). (A3)

Since G′_C is diagonal, the first term is just:

det(G′_C) = ∏_{(a,b)} (|M|/M_ab) C(|M|),

where the product runs over the index pairs (a, b) of G′. The matrix in the second term of Equation (A3) is given by:

(Z (G′_C)^{−1} Y)_{a,a″} = (A_{a″} + δ_{a,a″} B_{a″}) / C,

where A_a = (|M_a|/|M|) A for a < a′ and A_{a′} = Σ_{b≤b′} (M_{a′b}/|M|) A, and B_a = B for a < a′ and B_{a′} = Σ_{b≤b′} (M_{a′b}/|M_{a′}|) B. By Sylvester's determinant theorem, we have:

det(I_{a′} + Z (G′_C)^{−1} Y) = (∏_{a≤a′} (C + B_a)/C) (1 + Σ_{a≤a′} A_a/(C + B_a)).

This shows that the matrix G is positive definite for all M if and only if C > 0, C + B > 0 and 1 + Σ_{a≤a′} A_a/(C + B_a) > 0 for all a′ and b′. The latter inequality is satisfied whenever A + B + C > 0. This completes the proof.
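The positivity criterion can be spot-checked by assembling G for a concrete M and testing its leading principal minors directly. A sketch (plain Python; the constants A, B, C and the random M are our test values, chosen to satisfy C > 0, C + B > 0 and A + B + C > 0 with A negative):

```python
import random

def det(M):
    # determinant by Laplace expansion along the first row
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

random.seed(6)
A, B, C = -0.5, 0.3, 0.4          # C > 0, C + B > 0 and A + B + C > 0
k, m = 2, 2
M = [[random.random() + 0.05 for _ in range(m)] for _ in range(k)]
row_sums = [sum(row) for row in M]    # |M_a|
total = sum(row_sums)                 # |M|

# Assemble G = G_A + G_B + G_C on flattened index pairs (a, b)
idx = [(a, b) for a in range(k) for b in range(m)]
G = []
for (a, b) in idx:
    row = []
    for (c, d) in idx:
        g = A                                   # G_A: every entry equals A(|M|)
        if a == c:
            g += (total / row_sums[a]) * B      # G_B: block-diagonal part
            if b == d:
                g += (total / M[a][b]) * C      # G_C: diagonal part
        row.append(g)
    G.append(row)

# Sylvester's criterion: all leading principal minors must be positive
minors = [det([r[:s] for r in G[:s]]) for s in range(1, k * m + 1)]
assert all(mi > 0 for mi in minors)
```

Setting A negative is the interesting case: positive definiteness then genuinely depends on the compensation condition A + B + C > 0 rather than on each summand separately.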

B Proofs of the Invariance Characterization
The following lemma follows directly from the definition and contains all the technical details we need for the proofs.