A Geometric Approach to Average Problems on Multinomial and Negative Multinomial Models

This paper is concerned with the formulation and computation of average problems on the multinomial and negative multinomial models. We show that the multinomial and negative multinomial models admit complementary geometric structures. First, we investigate these structures by deriving explicit expressions for some fundamental geometric quantities, such as the Fisher-Riemannian metrics, α-connections and α-curvatures. We then consider several averaging methods based on these geometric structures; specifically, we study the formulation and computation of the midpoint of two points and of the Karcher mean of multiple points. In conclusion, we find a number of parallel results for the average problems on these two complementary models.


Introduction
The concept of an average of a set of points within a given geometric structure arises throughout mathematical research. Significant development has occurred since the introduction of the Karcher mean [1]. Among the various structures to be studied, the standard simplex presents itself as an interesting framework, since it can be directly connected to the parameter spaces of various probability distributions, such as the two models discussed below: the multinomial and negative multinomial models. There are already some recent works involving the statistical modeling of the probability simplex [2]. In the present work, we provide alternative modelings by considering the multinomial and negative multinomial models with classical methods of information geometry.
As a relatively young theory, information geometry has supplied us with various measures of the discrepancy between any two probability distributions. In addition to some standard distance functions, divergence functions are intended for measuring the asymmetric proximity of probability distributions on an appropriate statistical model. Some geometric quantities, such as Riemannian metrics and pairs of dual connections, can be readily induced from a divergence function by its higher-order derivatives [3,4]. In this way, the geometric structures of various parametric statistical models can be studied specifically [5][6][7], and investigations of some particular parametric models can also be found in [8,9]. Among these models, we find the multinomial model and negative multinomial model especially interesting, in the sense that they can be seen as a pair of complementary model spaces. The multinomial model is well known as a spherical space of positive constant curvature [10], while the negative multinomial model is found to be a hyperbolic space of negative constant curvature [11].
To be more specific, the motivation of the present paper is twofold. Firstly, we aim at clarifying the complementary geometric structures of the multinomial and negative multinomial models. The main results about these geometric structures involving geometric quantities, such as Fisher-Riemannian metrics, α-connections and α-curvatures, are collected in Section 3, most of which can be derived in a standard way. In particular, this paper extends the isometric representation results about the multinomial model to those about the negative multinomial model, obtaining new insight into the complementary structures of these two models as illustrated by Table 1. Secondly, the original purpose of the formulation and computation of average problems is approached by utilizing these geometric structures. To this end, we propose a generalized concept of midpoints for two points and a computation scheme of Karcher mean for multiple points. For the midpoints, we generalize the Chernoff points in the literature [12] to some wider parametrized classes. For the Karcher mean, as there are many algorithmic results [13,14] for general manifolds, this paper mainly contributes to addressing some practical issues, such as initial point choice and iteration computation, which yields effective solving methods via the geometric structures within the multinomial model and negative multinomial model. The results about these average methods are presented in Section 4.

Preliminaries
For the sake of clarity, we summarize some preliminary knowledge about information divergence functions in this section (more details can be found in [15]).
Given a particular parametric statistical model M = {p_θ | θ ∈ Θ}, for our purpose, here we mainly consider invariant divergences that satisfy the property of information monotonicity, as mentioned in [15]. A typical kind of invariant divergence is given in the form of the well-defined f-divergence as

D_f(p ‖ q) = ∑_x q(x) f( p(x)/q(x) ),

where f is a convex function satisfying f(1) = 0 and f″(1) = 1.
A commonly used class of f-divergences is given by the α-divergence D^(α), with

D^(α)(p ‖ q) = (4/(1−α²)) ( 1 − ∑_x p(x)^{(1−α)/2} q(x)^{(1+α)/2} ),  α ≠ ±1,

and the cases α = ±1 understood as limits. Particularly, for α = −1, the divergence D^(−1) is usually called the Kullback-Leibler divergence, which we denote by D_KL; for α = 0, D^(0) is usually called the squared Hellinger distance, which we denote by H²; and for α = 3, D^(3) is often related to the chi-square statistic.
While our later results are mainly related to the α-divergence, for comparison, we briefly mention another f-divergence called the exponential divergence (see [6] §12.3.6), which we denote by E, given by f(u) = (1/2) log² u.

For any divergence function D on a statistical manifold, we can construct another divergence function D*, called the dual divergence, by swapping the arguments as ([16])

D*(p ‖ q) = D(q ‖ p).

To be mentioned, as a divergence is generally not symmetric, a symmetric divergence D_s can be constructed from an asymmetric D by averaging it with its dual:

D_s(p ‖ q) = (1/2) ( D(p ‖ q) + D*(p ‖ q) ).

A Riemannian metric g can be induced by a divergence D as

g_ij(θ) = − ∂/∂θ_i ∂/∂θ′_j D(p_θ ‖ p_θ′) |_{θ′=θ},    (1)

which is equivalent to the usual Fisher-Riemannian metric in the case when D is an f-divergence. Furthermore, an affine connection ∇ is induced by the divergence D with connection coefficients

Γ_{ij,k}(θ) = − ∂/∂θ_i ∂/∂θ_j ∂/∂θ′_k D(p_θ ‖ p_θ′) |_{θ′=θ}.    (2)

Similarly, another affine connection ∇* can be obtained by replacing D with the dual divergence D* in the above formula. Thus, with the primal connection ∇ and dual connection ∇*, the statistical manifold admits a dual structure (M, g, ∇, ∇*). Furthermore, the structure (M, g, ∇, ∇*) is called dually flat if both ∇ and ∇* are flat.
As a well-known result [15], the primal connection induced by an f-divergence D_f is the same as the usual α-connection ∇^(α) with α = 3 + 2f‴(1), while the induced dual connection is ∇^(−α). Particularly, one can check that the primal connection induced by the α-divergence D^(α) is exactly the α-connection ∇^(α).
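To make these conventions concrete, the following Python sketch implements the α-divergence for discrete distributions under the convention stated above (the function name and test vectors are ours) and checks the limiting Kullback-Leibler case and the duality D^(α)(p ‖ q) = D^(−α)(q ‖ p):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """alpha-divergence between discrete distributions (Amari convention);
    the values alpha = -1 and alpha = 1 are treated as the KL limits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, -1.0):        # D^(-1)(p||q) = KL(p||q)
        return float(np.sum(p * np.log(p / q)))
    if np.isclose(alpha, 1.0):         # D^(1)(p||q) = KL(q||p)
        return float(np.sum(q * np.log(q / p)))
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return float(4.0 / (1.0 - alpha ** 2) * (1.0 - np.sum(p ** a * q ** b)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# alpha close to -1 approaches the Kullback-Leibler divergence
assert np.isclose(alpha_divergence(p, q, -0.999), np.sum(p * np.log(p / q)), atol=1e-2)
# duality: D^(alpha)(p||q) = D^(-alpha)(q||p)
assert np.isclose(alpha_divergence(p, q, 0.3), alpha_divergence(q, p, -0.3))
# alpha = 0 equals 2 * sum((sqrt(p) - sqrt(q))^2), a multiple of the Hellinger form
assert np.isclose(alpha_divergence(p, q, 0.0), 2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Note that the exact scaling of D^(0) relative to the squared Hellinger distance depends on the chosen convention.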

Basic Information Geometric Structure
In this section, we present the parametric formulations of the multinomial and negative multinomial models, respectively. Then some basic results about divergences and geometric structures are derived for both models.

Multinomial Model
Consider the n-dimensional multinomial model M_N consisting of (n+1)-nomial distributions with probability mass function given by

p(x; θ) = ( N! / (x_0! x_1! ⋯ x_n!) ) ∏_{i=0}^n θ_i^{x_i},    (3)

where ∑_{i=0}^n x_i = N and the parametrization is given by θ = (θ_1, …, θ_n) with θ_0 = 1 − ∑_{i=1}^n θ_i. We can rewrite Equation (3) as

p(x; θ) = exp( ∑_{i=1}^n x_i log(θ_i/θ_0) + N log θ_0 + log( N!/(x_0! ⋯ x_n!) ) ).

By some general knowledge about exponential distribution families [17], we see that the multinomial model M_N admits the natural parameters

β_i = log(θ_i/θ_0), i = 1, …, n,    (4)

and the potential function ψ(β) = −N log θ_0 = N log(1 + ∑_{i=1}^n e^{β_i}), from which we also obtain the expectation parameters

η_i = ∂ψ/∂β_i = N θ_i, i = 1, …, n.    (5)

Next, we have the following result by direct calculation.

Proposition 1. The divergences introduced in Section 2 are obtained for M_N as follows: the α-divergence (α ≠ ±1)

D^(α)(p_θ ‖ p_ξ) = (4/(1−α²)) ( 1 − ( ∑_{i=0}^n θ_i^{(1−α)/2} ξ_i^{(1+α)/2} )^N ),

the Kullback-Leibler divergence

D_KL(p_θ ‖ p_ξ) = N ∑_{i=0}^n θ_i log(θ_i/ξ_i),

and the squared Hellinger distance

H²(p_θ, p_ξ) = 4 ( 1 − ( ∑_{i=0}^n √(θ_i ξ_i) )^N ).

From any one of these expressions, by using Equation (1), we obtain the Fisher-Riemannian metric matrix as

g_ij = N ( δ_ij/θ_i + 1/θ_0 ),    (6)

where δ_ij = 1 if i = j and 0 otherwise. Then, the inverse matrix of g can also be obtained as

g^{ij} = (1/N) ( θ_i δ_ij − θ_i θ_j ).    (7)

Furthermore, by direct verification with the metric expression of Equation (6), we have the following well-known result.

Theorem 1 ([18]). An isometry is established between the multinomial model M_N and the n-sphere of radius 2√N within the non-negative orthant of the Euclidean space R^{n+1} by the parametric mapping:

F(θ) = 2√N ( √θ_0, √θ_1, …, √θ_n ).    (8)

Consequently, via this isometry, the Fisher-Riemannian geodesic distance between two parameters θ and ξ of M_N is given by

d(θ, ξ) = 2√N arccos( ∑_{i=0}^n √(θ_i ξ_i) ).    (9)

Referring to some basic differential geometric concepts (see [19]), we can derive some further consequent results as follows.

Corollary 1. The n-dimensional multinomial model M_N with the Fisher-Riemannian metric is a Riemannian manifold of constant sectional curvature K = 1/(4N) and scalar curvature S = n(n−1)/(4N). Furthermore, for a unit-speed geodesic γ in M_N, the normal Jacobi fields along γ are precisely linear combinations of the vector fields of the form J(t) = sin( t/(2√N) ) E(t), where E is any parallel normal vector field along γ.
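For concreteness, the spherical representation of Theorem 1 can be sketched numerically as follows (a sketch with N = 1; the helper names are ours, and θ denotes the full probability vector including θ_0):

```python
import numpy as np

def sphere_embed(theta, N=1.0):
    """Isometry F: theta -> 2*sqrt(N)*(sqrt(theta_0),...,sqrt(theta_n)),
    mapping M_N onto the sphere of radius 2*sqrt(N)."""
    return 2.0 * np.sqrt(N) * np.sqrt(np.asarray(theta, float))

def fisher_distance(theta, xi, N=1.0):
    """Closed-form geodesic distance 2*sqrt(N)*arccos(sum_i sqrt(theta_i*xi_i))."""
    s = np.clip(np.sum(np.sqrt(np.asarray(theta, float) * np.asarray(xi, float))), -1.0, 1.0)
    return 2.0 * np.sqrt(N) * np.arccos(s)

theta = np.array([0.2, 0.3, 0.5])   # full probability vectors (theta_0, ..., theta_n)
xi = np.array([0.25, 0.25, 0.5])

# the closed form agrees with the angle between the embedded points on the sphere
x, y = sphere_embed(theta), sphere_embed(xi)
angle = np.arccos(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
assert np.isclose(fisher_distance(theta, xi), 2.0 * angle)
```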

Negative Multinomial Model
Consider the n-dimensional negative multinomial model NM_M consisting of negative (n+1)-nomial distributions with probability mass function given by

p(x; θ) = ( Γ(M + ∑_{i=1}^n x_i) / ( Γ(M) x_1! ⋯ x_n! ) ) θ_0^M ∏_{i=1}^n θ_i^{x_i},    (10)

where M > 0, x_1, …, x_n ≥ 0 and the parametrization is given by θ = (θ_1, …, θ_n) with θ_i > 0 and θ_0 = 1 − ∑_{i=1}^n θ_i > 0. With the rewritten form of Equation (10) as

p(x; θ) = exp( ∑_{i=1}^n x_i log θ_i + M log θ_0 + log( Γ(M + ∑_{i=1}^n x_i)/(Γ(M) x_1! ⋯ x_n!) ) ),

we find that the negative multinomial model NM_M admits the natural parameters β_i = log θ_i, i = 1, …, n, and the potential function ψ(β) = −M log θ_0 = −M log(1 − ∑_{i=1}^n e^{β_i}), from which we also obtain the expectation parameters η_i = ∂ψ/∂β_i = M θ_i/θ_0. Again, we derive the following result by direct calculation.

Proposition 2.
The divergences introduced in Section 2 are obtained for NM_M as follows: the α-divergence (α ≠ ±1)

D^(α)(p_θ ‖ p_ξ) = (4/(1−α²)) ( 1 − ( θ_0^{(1−α)/2} ξ_0^{(1+α)/2} / ( 1 − ∑_{i=1}^n θ_i^{(1−α)/2} ξ_i^{(1+α)/2} ) )^M ),

the Kullback-Leibler divergence

D_KL(p_θ ‖ p_ξ) = M ( log(θ_0/ξ_0) + ∑_{i=1}^n (θ_i/θ_0) log(θ_i/ξ_i) ),

and the squared Hellinger distance

H²(p_θ, p_ξ) = 4 ( 1 − ( √(θ_0 ξ_0) / ( 1 − ∑_{i=1}^n √(θ_i ξ_i) ) )^M ).

Next, applying Equation (1) to the divergences in the foregoing proposition, we obtain the Fisher-Riemannian metric matrix as

g_ij = (M/θ_0) ( δ_ij/θ_i + 1/θ_0 ),    (13)

and its inverse matrix as

g^{ij} = (θ_0/M) ( θ_i δ_ij − θ_i θ_j ).    (14)

Furthermore, by direct verification with the metric expression of Equation (13), we have the following result parallel to Theorem 1.

Theorem 2.
An isometry is established between the negative multinomial model NM_M and the n-hyperbola ⟨x, x⟩_m = 4M within the non-negative orthant of the Minkowski space R^{1,n} by the parametric mapping:

F(θ) = (2√M/√θ_0) ( 1, √θ_1, …, √θ_n ).    (15)

Consequently, via this isometry, the Fisher-Riemannian geodesic distance between two parameters θ and ξ of NM_M is given by

d(θ, ξ) = 2√M arccosh( ( 1 − ∑_{i=1}^n √(θ_i ξ_i) ) / √(θ_0 ξ_0) ).

Again, some further consequent results are obtained as follows.
Corollary 2. The n-dimensional negative multinomial model NM_M with the Fisher-Riemannian metric is a Riemannian manifold of constant sectional curvature K = −1/(4M) and scalar curvature S = −n(n−1)/(4M). Furthermore, for a unit-speed geodesic γ in NM_M, the normal Jacobi fields along γ are precisely linear combinations of the vector fields of the form J(t) = sinh( t/(2√M) ) E(t), where E is any parallel normal vector field along γ.
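A corresponding numerical sketch for the hyperbolic representation follows, assuming the hyperboloid embedding x = 2√(M/θ_0)(1, √θ_1, …, √θ_n) with ⟨x, x⟩_m = 4M, for M = 1 (the function names are ours):

```python
import numpy as np

def hyperbola_embed(theta, M=1.0):
    """Embedding of (theta_1,...,theta_n) into Minkowski space R^{1,n}:
    x = 2*sqrt(M/theta_0)*(1, sqrt(theta_1),..., sqrt(theta_n)), with <x,x>_m = 4M."""
    theta = np.asarray(theta, float)
    theta0 = 1.0 - theta.sum()
    return 2.0 * np.sqrt(M / theta0) * np.concatenate(([1.0], np.sqrt(theta)))

def minkowski(x, y):
    """Minkowski inner product <x,y>_m = x_0*y_0 - sum_{i>=1} x_i*y_i."""
    return x[0] * y[0] - np.dot(x[1:], y[1:])

def fisher_distance_nm(theta, xi, M=1.0):
    """Geodesic distance 2*sqrt(M)*arccosh(<F(theta),F(xi)>_m / (4M))."""
    s = minkowski(hyperbola_embed(theta, M), hyperbola_embed(xi, M)) / (4.0 * M)
    return 2.0 * np.sqrt(M) * np.arccosh(max(s, 1.0))

theta = np.array([0.1, 0.2])    # theta_1, theta_2, so theta_0 = 0.7
xi = np.array([0.15, 0.25])

x = hyperbola_embed(theta)
assert np.isclose(minkowski(x, x), 4.0)   # the image lies on the hyperboloid
```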

Dual Structures
In this section, we derive the α-connection coefficients and curvatures of the multinomial and negative multinomial model. While some basic results are equivalent to those in [11], our calculation is performed directly with the original parameters in order to give a clear presentation of the results.
The α-connection coefficients can be obtained by applying the calculation of Equation (2) to the α-divergence D^(α), as mentioned in Section 2. However, an easier derivation of the α-connection ∇^(α) is given in the form of a linear combination of the mixture connection ∇^(m) = ∇^(−1) and the exponential connection ∇^(e) = ∇^(1). With the mixture connection coefficients obtained from the Kullback-Leibler divergence as

Γ^(m)_{ij,k}(θ) = − ∂_i ∂_j ∂′_k D_KL(p_θ ‖ p_θ′) |_{θ′=θ},    (17)

and the exponential connection coefficients obtained from the dual Kullback-Leibler divergence as

Γ^(e)_{ij,k}(θ) = − ∂_i ∂_j ∂′_k D*_KL(p_θ ‖ p_θ′) |_{θ′=θ},    (18)

the α-connection coefficients are given by

Γ^(α)_{ij,k} = ((1−α)/2) Γ^(m)_{ij,k} + ((1+α)/2) Γ^(e)_{ij,k}.    (19)

Next, the α-curvature tensor R^(α) is defined by

R^(α)(X, Y)Z = ∇^(α)_X ∇^(α)_Y Z − ∇^(α)_Y ∇^(α)_X Z − ∇^(α)_{[X,Y]} Z,

where [X, Y] denotes the Lie bracket of X and Y. By the duality condition (see [20])

Z g(X, Y) = g(∇^(α)_Z X, Y) + g(X, ∇^(−α)_Z Y),

one can check that the following identity holds:

g( R^(α)(∂_i, ∂_j)∂_k, ∂_l ) = − g( R^(−α)(∂_i, ∂_j)∂_l, ∂_k ),    (20)

where ∂/∂θ_i is shortened as ∂_i. At last, the α-sectional curvature spanned by two tangent vectors ∂_i and ∂_j (i ≠ j) is determined by

K^(α)(∂_i, ∂_j) = g( R^(α)(∂_i, ∂_j)∂_j, ∂_i ) / ( g_ii g_jj − g_ij² ).    (21)

Multinomial Model

For the multinomial model M_N, applying Equations (17)-(19) with the Kullback-Leibler divergence in Proposition 1, we have the mixture connection coefficients Γ^(m)_{ij,k} = 0 (the expectation parameters η_i = Nθ_i are affine in θ) and the exponential connection coefficients

Γ^(e)_{ij,k} = N ( 1/θ_0² − δ_ijk/θ_i² ),

where δ_ijk = 1 if i = j = k and 0 otherwise, so that Γ^(α)_{ij,k} = ((1+α)/2) Γ^(e)_{ij,k}. Furthermore, by Equations (6), (7) and (20), we obtain

g( R^(α)(∂_i, ∂_j)∂_k, ∂_l ) = ((1−α²)/(4N)) ( g_il g_jk − g_ik g_jl ).

Thus, via Equation (21), we recover the following result.

Theorem 3 ([10]). The multinomial model M_N admits constant α-sectional curvature K^(α) = (1−α²)/(4N).

Negative Multinomial Model
For the negative multinomial model NM_M, by applying Equations (17)-(19) with the Kullback-Leibler divergence in Proposition 2, we have the mixture connection coefficients

Γ^(m)_{ij,k} = M ( (δ_ik + δ_jk)/(θ_k θ_0²) + 2/θ_0³ ),

and hence the α-connection coefficients. Furthermore, by Equations (13), (14) and (20), we have

g( R^(α)(∂_i, ∂_j)∂_k, ∂_l ) = − ((1−α²)/(4M)) ( g_il g_jk − g_ik g_jl ).

Again, via Equation (21), we recover another parallel result.

Theorem 4 ([11]). The negative multinomial model NM_M admits constant α-sectional curvature K^(α) = −(1−α²)/(4M).

For clarity, we summarize these results about the complementary geometric structures of the multinomial and negative multinomial models in Table 1.

Geometric Average Methods on Multinomial and Negative Multinomial Models
In this section, we present some average methods induced by the geometry of the multinomial and negative multinomial models.
Firstly, we consider the particular case when the to-be-averaged set consists of only two points. In this case, the problem is to find a method of computing a midpoint in some geometric sense. Next, via some techniques related to the Karcher mean, we consider the general case with a set of multiple points.

Midpoints of Two Points
In this subsection, again within the multinomial and negative multinomial model, we study a particular class of midpoints named Chernoff points. The original Chernoff point, which is motivated by the application of computing the best error exponent for the Bayesian hypothesis testing problem, is determined as the intersection point of an exponential geodesic and a mixture bisector [21]. Furthermore, there are three other generalized Chernoff points proposed by [12].
To present a further generalization, here we formulate the concepts of the α-geodesic and the α-bisector determined by two probability distributions p_θ′ and p_θ″ of a parametric statistical model M.

Definition 1. The α-geodesic is determined by the geodesic equation of the α-connection ∇^(α) as

θ̈^k(t) + Γ^(α)k_{ij}(θ(t)) θ̇^i(t) θ̇^j(t) = 0, θ(0) = θ′, θ(1) = θ″,

where θ̇(t) denotes the velocity vector of a curve θ(t).
Particularly, since an exponential family model is ±1-flat, as can be directly seen from Equations (22) and (23) for our case, the exponential geodesic for α = 1 can be determined by the linear interpolation of the natural parameters, while the mixture geodesic for α = −1 can be determined by the linear interpolation of the expectation parameters.
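The linear interpolations just described can be sketched for the multinomial model as follows (N = 1; the function names are ours):

```python
import numpy as np

def e_geodesic(theta1, theta2, t):
    """Exponential geodesic on M_N: linear interpolation of the natural
    parameters, i.e. a normalized geometric mean of the two distributions."""
    w = np.asarray(theta1, float) ** (1.0 - t) * np.asarray(theta2, float) ** t
    return w / w.sum()

def m_geodesic(theta1, theta2, t):
    """Mixture geodesic on M_N: linear interpolation of the expectation
    parameters eta_i = N*theta_i, i.e. of the probability vectors themselves."""
    return (1.0 - t) * np.asarray(theta1, float) + t * np.asarray(theta2, float)

theta1 = np.array([0.2, 0.3, 0.5])
theta2 = np.array([0.4, 0.4, 0.2])

for geo in (e_geodesic, m_geodesic):
    assert np.allclose(geo(theta1, theta2, 0.0), theta1)  # both curves start at theta1
    assert np.allclose(geo(theta1, theta2, 1.0), theta2)  # and end at theta2
    assert np.isclose(geo(theta1, theta2, 0.5).sum(), 1.0)
```

Normalizing the geometric interpolation of the probabilities is equivalent to interpolating the natural parameters β_i = log(θ_i/θ_0), since the common term log θ_0 is absorbed by the normalization.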

Definition 2. The α-bisector is determined by the equi-divergence identity of the α-divergence D^(α) as

D^(α)(p_θ ‖ p_θ′) = D^(α)(p_θ ‖ p_θ″).
Particularly, the exponential bisector for α = 1 is determined in terms of the dual Kullback-Leibler divergence, while the mixture bisector for α = −1 is determined in terms of the Kullback-Leibler divergence.
Then, we can generalize the notion of Chernoff points suggested by [12] as follows.
Definition 3. Two types of generalized Chernoff points are given by the intersection points with parameter α:

CP^(α)_I(p_θ′, p_θ″) := (α-geodesic of θ′, θ″) ∩ ((−α)-bisector of θ′, θ″),
CP^(α)_II(p_θ′, p_θ″) := (α-geodesic of θ′, θ″) ∩ (α-bisector of θ′, θ″).

Thus, the Chernoff points already proposed in previous works can be recovered by setting α = ±1 in Definition 3.
The existence of these intersection points is assured by the intermediate value property of the determining Equation (24): replacing θ by θ′ and θ″, respectively, we get two opposite inequalities due to the non-negativity of the α-divergence.
While the uniqueness can be proven for an exponential family model if α = ±1 (see [12]), we conjecture it still holds for general cases but this is not pursued in the present paper.
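A minimal sketch of the bisection procedure suggested in [12], here for the original Chernoff point CP^(1)_I on the multinomial model (N = 1; the function names are ours):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence on M_N with N = 1: sum p_i log(p_i/q_i)."""
    return float(np.sum(p * np.log(p / q)))

def e_geodesic(th1, th2, t):
    """Exponential geodesic (normalized geometric interpolation)."""
    w = th1 ** (1.0 - t) * th2 ** t
    return w / w.sum()

def chernoff_point(th1, th2, tol=1e-12):
    """Bisection on h(t) = KL(gamma(t)||th1) - KL(gamma(t)||th2),
    which satisfies h(0) <= 0 <= h(1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g = e_geodesic(th1, th2, mid)
        if kl(g, th1) - kl(g, th2) < 0.0:
            lo = mid
        else:
            hi = mid
    return e_geodesic(th1, th2, 0.5 * (lo + hi))

th1 = np.array([0.2, 0.3, 0.5])
th2 = np.array([0.4, 0.4, 0.2])
star = chernoff_point(th1, th2)
# at the Chernoff point the two Kullback-Leibler divergences agree
assert np.isclose(kl(star, th1), kl(star, th2))
```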
To elucidate what we have mentioned earlier about the application of the original Chernoff point (CP^(1)_I in our notation) to the binary Bayesian hypothesis testing problem, we present here the upper bound of the probability of error of the Bayesian decision suggested by [21]:

P_e ≤ exp( − D_KL(p_θ* ‖ p_θ′) ),    (25)

where p_θ* denotes the Chernoff point CP^(1)_I(p_θ′, p_θ″).

The overlapping case of the two classes of generalized Chernoff points is given by α = 0. For the multinomial and negative multinomial models, by comparing the 0-divergence, i.e., the squared Hellinger distance, with the Fisher-Riemannian geodesic distance presented in Section 3.1, we find that the generalized Chernoff point CP^(0)_II(p_θ′, p_θ″) is exactly the unique Fisher-Riemannian geodesic midpoint between p_θ′ and p_θ″ within both models.
Next, we summarize some specific results about Chernoff points for both models as follows.

Multinomial Model
For the multinomial model M_N, although the geodesic equation of the α-geodesic can be given explicitly in general cases, there are simple closed-form geodesic expressions at least for α = ±1.

Proposition 3. The exponential and mixture geodesics connecting two probability distributions p_θ′ and p_θ″ of the multinomial model M_N are given respectively by

θ_i(t) = (θ′_i)^{1−t} (θ″_i)^t / ∑_{j=0}^n (θ′_j)^{1−t} (θ″_j)^t (exponential), and
θ_i(t) = (1−t) θ′_i + t θ″_i (mixture), t ∈ [0, 1].

Proof. As already mentioned, the exponential and mixture geodesics can be easily obtained by the linear interpolation of the natural and expectation parameters of Equations (4) and (5), respectively.
By using the expressions of α-divergences presented in Proposition 1, the α-bisectors are directly obtained as follows.

Proposition 4.
The α-bisectors between p_θ′ and p_θ″ within the multinomial model M_N are given by the following equations:

∑_{i=0}^n θ_i^{(1−α)/2} ( (θ′_i)^{(1+α)/2} − (θ″_i)^{(1+α)/2} ) = 0, α ≠ ±1;
∑_{i=0}^n θ_i log(θ″_i/θ′_i) = 0, α = −1;
∑_{i=0}^n ( θ′_i log(θ′_i/θ_i) − θ″_i log(θ″_i/θ_i) ) = 0, α = 1.

Combining the previous two propositions, we have the determining equations for the four particular Chernoff points with α = ±1.

For CP^(±1)_I and CP^(±1)_II, substituting the geodesic expressions of Proposition 3 into the corresponding bisector equations of Proposition 4 yields one-dimensional determining equations in the geodesic parameter t, which are to be solved by numerical methods such as simple bisection, as suggested in [12].
For the Fisher-Riemannian geodesic midpoint, we have the following result.
Theorem 6. The Fisher-Riemannian geodesic midpoint p_θ* between p_θ′ and p_θ″ in the multinomial model M_N is determined by

θ*_i = ( √θ′_i + √θ″_i )² / ∑_{j=0}^n ( √θ′_j + √θ″_j )², i = 0, …, n.

Proof. Let F be the isometry given by Equation (8). Denote the linear midpoint (F(θ′) + F(θ″))/2 of the two image points by x* ∈ R^{n+1}. Then we normalize x* to the n-sphere as u* = 2√N x*/‖x*‖_e, where the Euclidean norm ‖x‖_e = (∑_{i=0}^n x_i²)^{1/2} is used. At last, the required midpoint is obtained as the inverse image point θ* = F^{−1}(u*).
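The construction in the proof amounts to a few lines of code (a sketch with N = 1; the function names are ours):

```python
import numpy as np

def geodesic_midpoint(theta1, theta2):
    """Fisher-Riemannian midpoint on M_N: average the square-root
    representations on the sphere and project back to the simplex."""
    m = (np.sqrt(np.asarray(theta1, float)) + np.sqrt(np.asarray(theta2, float))) ** 2
    return m / m.sum()

def dist(a, b):
    """Geodesic distance with N = 1: 2*arccos(sum sqrt(a_i*b_i))."""
    return 2.0 * np.arccos(np.clip(np.sum(np.sqrt(a * b)), -1.0, 1.0))

theta1 = np.array([0.2, 0.3, 0.5])
theta2 = np.array([0.4, 0.4, 0.2])
mid = geodesic_midpoint(theta1, theta2)

# the midpoint is equidistant from both endpoints and lies on the geodesic
assert np.isclose(dist(mid, theta1), dist(mid, theta2))
assert np.isclose(dist(mid, theta1) + dist(mid, theta2), dist(theta1, theta2))
```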
To illustrate these notions for the multinomial model M_N, we present a numerical example as follows. The two parameters θ′ and θ″ are taken as the empirical probability vectors of the first and second 100 decimal digits of π, respectively. The parameters of these two points and the resulting Chernoff points are summarized in Table 2. As we can see, the pair of points CP^(1)_I and CP^(1)_II admits a certain similarity, as both of them lie on the same exponential geodesic, and so does the pair of points CP^(−1)_I and CP^(−1)_II, as both of them lie on the same mixture geodesic. In contrast, the Fisher-Riemannian geodesic midpoint CP^(0) can be considered as a medium version among these Chernoff points.
In particular, the upper bound of the probability of error given by Equation (25) is obtained via CP^(1)_I as being equal to 0.9862^N. Thus, we can choose a sufficiently large N so that the probability of error is less than any given threshold value.

Negative Multinomial Model
For the negative multinomial model N M M , again we can give the geodesic expressions for the exponential and mixture cases.
The exponential geodesic is given by the interpolation θ_i(t) = (θ′_i)^{1−t} (θ″_i)^t of the natural parameters, while the mixture geodesic is given by the linear interpolation of the expectation parameters η_i = Mθ_i/θ_0. As can be seen, the two points of the second type, CP^(1)_II and CP^(−1)_II, are then determined by combining these geodesic expressions with the corresponding bisector equations obtained from Proposition 2.

For the Fisher-Riemannian geodesic midpoint, we have the following result parallel to Theorem 6: the midpoint p_θ* between p_θ′ and p_θ″ in NM_M is obtained by normalizing the linear midpoint of the isometric images back to the n-hyperbola.

Proof. Let F be the isometry given by Equation (15). Denote the linear midpoint (F(θ′) + F(θ″))/2 of the two image points by x* ∈ R^{1,n}. Then we normalize x* to the n-hyperbola as u* = 2√M x*/‖x*‖_m, where the Minkowski norm ‖x‖_m = (x_0² − ∑_{i=1}^n x_i²)^{1/2} is used. At last, the required midpoint is obtained as the inverse image point θ* = F^{−1}(u*).
A numerical illustration for the negative multinomial model NM_M is presented as follows. The two parameters θ′ and θ″ are taken as the empirical probability vectors of the decimal digits of π within the first and second 10 appearances of "0", respectively. The parameters of these two points and the resulting Chernoff points are summarized in Table 3. Again, the pairs of points CP^(1)_I, CP^(1)_II and CP^(−1)_I, CP^(−1)_II lie on the same exponential and mixture geodesics, respectively, with the Fisher-Riemannian geodesic midpoint CP^(0) as a medium version among them.

Karcher Means of Multiple Points
A natural generalization of the Fisher-Riemannian geodesic midpoint between two points is given by the Karcher mean among multiple points.
Let M be a metric space and S be a set of points on M. Define a criterion function f : M → R by

f(x) = (1/(2|S|)) ∑_{s∈S} d(x, s)²,

where d(·,·) is the distance function and |S| is the number of points of S. If the minimizer of the function f exists and is unique, then it is called the Karcher mean of S on M. If d is the distance induced by a Riemannian metric on M, then the negative gradient vector field of f is found to be the usual average of the corresponding points in the tangent space ([1]):

−∇f(x) = (1/|S|) ∑_{s∈S} exp_x^{−1}(s),    (26)

where exp_x^{−1} is the inverse of the Riemannian exponential map at x. In view of this, the Karcher mean can be alternatively understood as a point at which the above vector field vanishes.
The Karcher mean may not be unique unless all points are located in a geodesically convex region. For example, there are infinitely many geodesic midpoints between two antipodal points on a sphere. However, for model spaces such as the open half-sphere and hyperbolic space, there are existing results that assure the existence and uniqueness of the Karcher mean ([22]). Thus, by virtue of Theorems 1 and 2, we conclude that the concept of a Karcher mean is well defined on the multinomial and negative multinomial models. Now, we focus on the computation of the Karcher mean on these two models. The Karcher mean of two points admits a closed-form expression as the Fisher-Riemannian geodesic midpoint presented previously, but for multiple points, we can only expect to obtain a numerical solution of the Karcher mean.
By virtue of Equation (26), there is a Riemannian gradient iteration algorithm with a locally superlinear convergence in general ([23]):

x_{k+1} = exp_{x_k}( (1/|S|) ∑_{s∈S} exp_{x_k}^{−1}(s) ).

However, this general algorithm is difficult to apply in practice unless proper representations of the models are derived. In our case, as we have prepared enough geometric representation results within the multinomial and negative multinomial models in Section 3, we still have to address two practical issues: the choice of initial points and the computation of the Riemannian exponential map exp_{x_i} and its inverse exp_{x_i}^{−1}.

Initial Points
Let S be a set of parameters to be averaged in either the model M_N or NM_M. We present a heuristic approach, motivated by the proofs of Theorems 1 and 2, to provide an initial point choice. The main procedure is as follows (here N = M = 1 is assumed, as the basic ideas are unchanged up to scale):
1. Set the average of the isometry images x* := (1/|S|) ∑_{θ∈S} F(θ);
2. Set the normalized vector u* := 2x*/‖x*‖; the parameter of the initial point is given by θ* := F^{−1}(u*).
For the model M_N, the isometry F is given by Equation (8), and the norm ‖·‖ is the Euclidean norm ‖·‖_e in the proof of Theorem 1. For the model NM_M, the isometry F is given by Equation (15), and the norm ‖·‖ is the Minkowski norm ‖·‖_m in the proof of Theorem 2.
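The procedure above can be sketched for the model M_N as follows (N = 1; the function name is ours):

```python
import numpy as np

def initial_point(S):
    """Heuristic initial point on M_N (N = 1): average the sphere images
    F(theta) = 2*sqrt(theta), renormalize to the radius-2 sphere, pull back."""
    X = 2.0 * np.sqrt(np.asarray(S, float))          # isometry images, one row per point
    x_star = X.mean(axis=0)                          # step 1: average of the images
    u_star = 2.0 * x_star / np.linalg.norm(x_star)   # step 2: normalize
    return (u_star / 2.0) ** 2                       # inverse isometry theta = (u/2)^2

S = [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4], [0.25, 0.35, 0.4]]
theta_init = initial_point(S)
assert np.isclose(theta_init.sum(), 1.0)   # a valid probability vector
```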

Computation of Riemannian Exponential and its Inverse
Within each of the models M_N and NM_M (again N = M = 1 is assumed), the Riemannian exponential map and its inverse can be computed in an easy-to-manipulate way via the isometric representations. Besides the isometry F and the norm ‖·‖ given as before, we also need to set the inner product ⟨·,·⟩ as the Euclidean inner product ⟨x, y⟩ = ∑_{i=0}^n x_i y_i for the model M_N, and as the Minkowski inner product ⟨x, y⟩ = x_0 y_0 − ∑_{i=1}^n x_i y_i for the model NM_M.

Given a point θ and a tangent vector v at θ, let x = F(θ) and let u = dF(v) be the image of v under the differential of F. For the model M_N, the exponential map on the sphere of radius 2 is computed as

exp_x(u) = cos(‖u‖/2) x + 2 sin(‖u‖/2) u/‖u‖,

and its inverse for two points x = F(θ) and y = F(ξ) as

exp_x^{−1}(y) = ( d(x, y)/‖P‖ ) P, with P = y − (⟨x, y⟩/4) x,

where d is the geodesic distance. For the model NM_M, the corresponding hyperbolic formulas are obtained by replacing cos and sin with cosh and sinh and by using the Minkowski inner product. Thus, the resulting parameter ξ is obtained as exp_θ(v) = F^{−1}(exp_x(u)).
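Putting the pieces together, the following sketch implements the exponential map, its inverse, and the gradient iteration of Equation (26) on the multinomial model (N = 1, using the standard formulas for a radius-2 sphere; the function names are ours):

```python
import numpy as np

def embed(theta):     # isometry F for N = 1 (radius-2 sphere)
    return 2.0 * np.sqrt(np.asarray(theta, float))

def unembed(x):       # inverse isometry
    return (x / 2.0) ** 2

def log_map(x, y):
    """Inverse exponential map exp_x^{-1}(y) on the radius-2 sphere."""
    c = np.dot(x, y) / 4.0                        # cosine of the angle
    d = 2.0 * np.arccos(np.clip(c, -1.0, 1.0))    # geodesic distance
    p = y - c * x                                 # tangential component of y at x
    n = np.linalg.norm(p)
    return np.zeros_like(x) if n < 1e-15 else (d / n) * p

def exp_map(x, u):
    """Exponential map exp_x(u) on the radius-2 sphere."""
    n = np.linalg.norm(u)
    if n < 1e-15:
        return x
    return np.cos(n / 2.0) * x + 2.0 * np.sin(n / 2.0) * u / n

def karcher_mean(S, iters=50):
    """Gradient iteration x <- exp_x(mean of exp_x^{-1}(s)), cf. Equation (26)."""
    x = embed(np.mean(S, axis=0))
    x = 2.0 * x / np.linalg.norm(x)               # project the start onto the sphere
    for _ in range(iters):
        grad = np.mean([log_map(x, embed(s)) for s in S], axis=0)
        x = exp_map(x, grad)
    return unembed(x)

S = np.array([[0.2, 0.3, 0.5], [0.3, 0.3, 0.4], [0.25, 0.35, 0.4]])
mu = karcher_mean(S)
# at the Karcher mean the tangent-space average (negative gradient) vanishes
x = embed(mu)
grad = np.mean([log_map(x, embed(s)) for s in S], axis=0)
assert np.linalg.norm(grad) < 1e-8
```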

Numerical Example
Now, we test the above algorithm for solving the Karcher mean by a numerical example. The data set S is chosen as containing 10 empirical probability vectors from the first to the tenth 100 decimal digits of π.
To illustrate the quality of each iteration, we present the norm of the negative gradient vector field at each iteration point via Equation (26), as shown in Table 4. Within each model, column (a) gives the iteration results starting with the initial points chosen in the aforementioned way, while column (b) gives, for comparison, the iteration results with initial points chosen as the usual Euclidean means. As we can see, all four iterations shown here converge rapidly within the first two steps, and our aforementioned choice of initial points is apparently better than the usual choice of Euclidean means. In conclusion, this example shows, to some extent, the effectiveness of our computation scheme for the Karcher mean within the multinomial and negative multinomial models.

Conclusions
In this paper, we have studied various information geometric properties based on divergence functions for the multinomial and negative multinomial models. The derived expressions of fundamental geometric quantities, such as the Fisher-Riemannian metrics, isometric representations and α-curvatures, have made it clear that these two models can be put together into a complementary view. With the aid of these geometric structures, we investigated the average problems on these two models. We proposed the concept of generalized Chernoff points as midpoints of two points and presented determining equations for them. We then provided an effective computation scheme for the Karcher mean of multiple points on the multinomial and negative multinomial models.