New Riemannian Priors on the Univariate Normal Model

The current paper introduces new prior distributions on the univariate normal model, with the aim of applying them to the classification of univariate normal populations. These new prior distributions are entirely based on the Riemannian geometry of the univariate normal model, so that they can be thought of as “Riemannian priors”. Precisely, if {p_θ; θ ∈ Θ} is any parametrization of the univariate normal model, the paper considers prior distributions G(θ̄, γ) with hyperparameters θ̄ ∈ Θ and γ > 0, whose density with respect to Riemannian volume is proportional to exp(−d²(θ, θ̄)/2γ²), where d(θ, θ̄) is Rao’s Riemannian distance. The distributions G(θ̄, γ) are termed Gaussian distributions on the univariate normal model. The motivation for considering a distribution G(θ̄, γ) is that this distribution gives a geometric representation of a class or cluster of univariate normal populations. Indeed, G(θ̄, γ) has a unique mode θ̄ (precisely, θ̄ is the unique Riemannian center of mass of G(θ̄, γ), as shown in the paper), and its dispersion away from θ̄ is given by γ. Therefore, one thinks of members of the class represented by G(θ̄, γ) as being centered around θ̄ and lying within a typical distance determined by γ. The paper defines rigorously the Gaussian distributions G(θ̄, γ) and describes an algorithm for computing maximum likelihood estimates of their hyperparameters. Based on this algorithm and on the Laplace approximation, it describes how the distributions G(θ̄, γ) can be used as prior distributions for Bayesian classification of large univariate normal populations. In a concrete application to texture image classification, it is shown that this leads to an improvement in performance over the use of conjugate priors.

Entropy 2014, 16, 4016


Introduction
In this paper, a new class of prior distributions is introduced on the univariate normal model. The new prior distributions, which will be called Gaussian distributions, are based on the Riemannian geometry of the univariate normal model. The paper introduces these new distributions, uncovers some of their fundamental properties and applies them to the problem of the classification of univariate normal populations. It shows that, in the context of a real-life application to texture image classification, the use of these new prior distributions leads to improved performance in comparison with the use of more standard conjugate priors.
To motivate the introduction of the new prior distributions considered in the following, it is helpful to recall some general facts about the Riemannian geometry of parametric models.
In information geometry [1], it is well known that a parametric model {p_θ; θ ∈ Θ}, where Θ ⊂ R^p, can be equipped with a Riemannian geometry, determined by Fisher's information matrix, say I(θ). Indeed, assuming I(θ) is strictly positive definite for each θ ∈ Θ, a Riemannian metric on Θ is defined by the length element:

ds² = Σ_{i,j} I_ij(θ) dθ^i dθ^j (1)

The fact that the length element Equation (1) is invariant to any change of parametrization was realized by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.
Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian geometry becomes available for application to statistical problems relevant to the parametric model {p_θ; θ ∈ Θ}. This includes the notion of Riemannian distance between two distributions, p_θ and p_θ′, which is known as Rao's distance, say d(θ, θ′), the notion of Riemannian volume, which is exactly the same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical optimization and coincides with the so-called natural gradient of Amari [4].
It is quite natural to apply Rao's distance to the problem of classifying populations that belong to the parametric model {p_θ; θ ∈ Θ}. In the case where this parametric model is the univariate normal model, this approach to classification is implemented in [5]. For more general parametric models, beyond the univariate normal model, similar applications of Rao's distance to problems of image segmentation and statistical tests can be found in [6–8].
The idea of [5] is quite elegant. In general, it requires that some classes {S_L; L = 1, . . ., C} (based on a learning sequence) have been identified with "centers" θ̄_L ∈ Θ. Then, in order to assign a test population, given by the parameter θ_t, to a class L*, it is proposed to choose the L* which minimizes the squared Rao's distance d²(θ_t, θ̄_L), over L = 1, . . ., C. In the specific context of the classification of univariate normal populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
The present paper is also concerned with the case where the parametric model {p_θ; θ ∈ Θ} is a univariate normal model. It starts from the idea that a class S_L should be identified not only with a center θ̄_L, as in [5], but also with a kind of "variance", say γ², which will be called a dispersion parameter.
Accordingly, assigning a test population given by the parameter θ_t to a class L should be based on a tradeoff between the square of Rao's distance d²(θ_t, θ̄_L) and the dispersion parameter γ².
Of course, this idea has a strong Bayesian flavor. It proposes to give more "confidence" to classes that have a smaller dispersion parameter. Thus, in order to implement it in a concrete way, the paper starts by introducing prior distributions on the univariate normal model, which it calls Gaussian distributions. By definition, a Gaussian distribution G(θ̄, γ²) has a probability density function, with respect to Riemannian volume, given by:

p(θ | θ̄, γ) ∝ exp( −d²(θ, θ̄)/2γ² ) (2)

Given this definition of a Gaussian distribution (which is developed in a detailed way in Section 3), classification of univariate normal populations can be carried out by associating to each class S_L of univariate normal populations a Gaussian distribution G(θ̄_L, γ²_L) and by assigning any test population with parameter θ_t to the class L*, which maximizes the likelihood p(θ_t | θ̄_L, γ_L), over L = 1, . . ., C.
The present paper develops in a rigorous way the general approach to the classification of univariate normal populations which has just been described. It proceeds as follows.
Section 2, which is basically self-contained, provides the concepts regarding the Riemannian geometry of the univariate normal model which will be used throughout the paper.
Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some of their general properties. In particular, Section 3.2 gives a Riemannian gradient descent algorithm for computing maximum likelihood estimates of the parameters z̄ and γ of a Gaussian distribution.
Section 4 states the general approach to classification of univariate normal populations proposed in this paper. It deals with two problems: (i) given a class S of univariate normal populations S_i, how to fit a Gaussian distribution G(z̄, γ) to this class; and (ii) given a test univariate normal population S_t and a set of classes {S_L; L = 1, . . ., C}, how to assign S_t to a suitable class S_L*.
In the present paper, the chosen approach for resolving these two problems is marginalized likelihood estimation, in the asymptotic framework where each univariate normal population contains a large number of data points. In this asymptotic framework, the Laplace approximation plays a major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class of univariate normal populations, to the problem of maximum likelihood estimation, covered in Section 3.2.
The final result of Section 4 is the decision rule Equation (37). This generalizes the rule developed in [5] and explained above, by taking into account the dispersion parameter γ, in addition to the center z̄, for each class.
In Section 5, the formalism of Section 4 is applied to texture image classification, using the VisTex image database [10]. This database is used to compare the performance obtained using Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is shown that the Gaussian distributions proposed in the current paper lead to a significant improvement in performance.
Before going on, it should be noted that probability density functions of the form (2), on general Riemannian manifolds, were considered by Pennec in [11]. However, they were not specifically used as prior distributions, but rather as a representation of uncertainty in medical image analysis and directional or shape statistics.

Riemannian Geometry of the Univariate Normal Model
The current section presents in a self-contained way the results on the Riemannian geometry of the univariate normal model which are required for the remainder of the paper. Section 2.1 recalls the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3 presents, in general form, the Riemannian gradient descent algorithm.

Derivation of the Fisher Metric
This paper considers the Riemannian geometry of the univariate normal model, as based on the Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space Θ = {θ = (µ, σ) | µ ∈ R, σ > 0} and is given by:

p_θ(s) = (2πσ²)^{−1/2} exp( −(s − µ)²/2σ² ) (3)

where each p_θ is a probability density function with respect to the Lebesgue measure on R. The Fisher information matrix, obtained from Equation (3), is the following:

I(θ) = diag( 1/σ², 2/σ² )

As in [12], this expression can be made more symmetric by introducing the parametrization z = (x, y), where x = µ/√2 and y = σ. This yields the Fisher information matrix:

I(z) = (2/y²) diag( 1, 1 )

It is suitable to drop the factor two in this expression and introduce the following Riemannian metric for the univariate normal model,

ds² = (dx² + dy²)/y² (4)

This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout the following. The resulting Rao's distance and Riemannian geometry are given in the following paragraph.
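As a quick sanity check on the derivation above, the Fisher information matrix diag(1/σ², 2/σ²) can be recovered numerically. The sketch below is our own illustration, not part of the paper: it approximates I(θ) = E[(∂_θ log p_θ)(∂_θ log p_θ)^T] by quadrature over s, with finite-difference scores.

```python
# Numerical check that the Fisher information of N(mu, sigma^2)
# is diag(1/sigma^2, 2/sigma^2), the matrix the metric (4) is derived from.
import math

def log_density(s, mu, sigma):
    """Log of the univariate normal density p_theta(s)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (s - mu) ** 2 / (2 * sigma ** 2)

def fisher_information(mu, sigma, half_width=12.0, n=20001):
    """I(theta) = E[ score * score^T ], approximated by a Riemann sum."""
    eps = 1e-5
    lo = mu - half_width * sigma
    h = 2 * half_width * sigma / (n - 1)
    I = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        s = lo + k * h
        w = h * math.exp(log_density(s, mu, sigma))  # p(s) ds
        # central finite differences of the log-density (the score)
        d_mu = (log_density(s, mu + eps, sigma) - log_density(s, mu - eps, sigma)) / (2 * eps)
        d_sig = (log_density(s, mu, sigma + eps) - log_density(s, mu, sigma - eps)) / (2 * eps)
        g = (d_mu, d_sig)
        for a in range(2):
            for b in range(2):
                I[a][b] += w * g[a] * g[b]
    return I
```

For µ = 1, σ = 2, this returns a matrix close to diag(1/4, 1/2), with a negligible off-diagonal term.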

Distance, Geodesics and Volume
The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known object in differential geometry. Precisely, the parameter space H = {z = (x, y) | y > 0}, equipped with the metric (4), is known as the Poincaré upper half plane and is a basic model of a two-dimensional hyperbolic space [13].
Rao's distance between two points z₁ = (x₁, y₁) and z₂ = (x₂, y₂) in H can be expressed as follows (for the results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry):

d(z₁, z₂) = acosh( 1 + [ (x₁ − x₂)² + (y₁ − y₂)² ] / 2y₁y₂ ) (5)

where acosh denotes the inverse hyperbolic cosine.
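Rao's distance (5) is straightforward to implement. The following sketch (an illustration of ours) treats points of H as complex numbers x + iy:

```python
# Rao's distance (5) on the Poincare upper half-plane,
# cosh d(z1, z2) = 1 + |z1 - z2|^2 / (2 y1 y2).
import math

def rao_distance(z1, z2):
    """Rao's distance between z1, z2 in H, given as complex numbers x + iy."""
    return math.acosh(1.0 + abs(z1 - z2) ** 2 / (2.0 * z1.imag * z2.imag))
```

Along a vertical half-line (x₁ = x₂), the formula reduces to |ln(y₂/y₁)|; for example, d(i, e·i) = 1.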
Starting from z₁, in any given direction, it is possible to draw a unique geodesic ray γ : R₊ → H. This is a curve with the property that γ(0) = z₁ and, for any point z₂ = γ(t) on the curve, d(z₁, z₂) = t. In other words, the length of γ between z₁ and z₂ is equal to the distance between z₁ and z₂.
The equation of a geodesic ray starting from z ∈ H is conveniently written down in complex notation (that is, by treating points of H as complex numbers). To begin, consider the case of z = i (which stands for x = 0 and y = 1). The geodesic in the direction making an angle ψ with the y-axis is the curve:

γ_i(t, ψ) = [ cos(ψ/2) e^t i + sin(ψ/2) ] / [ −sin(ψ/2) e^t i + cos(ψ/2) ] (6)

In particular, ψ = 0 gives γ_i(t) = e^t i and ψ = π gives γ_i(t) = e^{−t} i. If ψ is not a multiple of π, γ_i(t) traces out a portion of a circle, which is parallel to the y-axis in the limit t → ∞. For a general starting point z, the geodesic ray in the direction making an angle ψ with the y-axis can be written:

γ_z(t, ψ) = x + y γ_i(t, ψ) (7)

where z = (x, y) and γ_i(t, ψ) is given by Equation (6). A more detailed treatment of Rao's distance (5) and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be found in [5].
The Riemannian volume (or area, since H is of dimension 2) element corresponding to the Riemannian metric (4) is dA(z) = dx dy / y². Accordingly, the integral of a function f : H → R with respect to dA is given by:

∫_H f dA = ∫∫ f(x, y) (dx dy / y²) (8)

In many cases, the analytic computation of this integral can be greatly simplified by using polar coordinates (r, ϕ) defined with respect to some "origin" z̄ ∈ H. Polar coordinates (r, ϕ) map to the point z(r, ϕ) given by:

z(r, ϕ) = γ_z̄(r, ϕ) (9)

where the right-hand side is defined according to Equation (7). The polar coordinates (r, ϕ) do indeed define a global coordinate system of H, in the sense that the application that takes a complex number r e^{iϕ} to the point z(r, ϕ) in H is a diffeomorphism. The standard notation from differential geometry is:

z(r, ϕ) = exp_z̄( r e^{iϕ} ) (10)

In these coordinates, the Riemannian metric (4) takes on the form:

ds² = dr² + sinh²(r) dϕ² (11)

The integral Equation (8) can be computed in polar coordinates using the formula [13]:

∫_H f dA = ∫₀^∞ ∫₀^{2π} (f ∘ exp_z̄)(r e^{iϕ}) sinh(r) dϕ dr (12)

where exp_z̄ was defined in Equation (10) and ∘ denotes composition. This is particularly useful when f ∘ exp_z̄ does not depend on ϕ.

Riemannian Gradient Descent
In this paper, the problem of minimizing, or maximizing, a differentiable function f : H → R will play a central role. A popular way of handling the minimization of a differentiable function defined on a Riemannian manifold (such as H) is through Riemannian gradient descent [14].
Here, the definition of the Riemannian gradient is reviewed, and a generic description of Riemannian gradient descent is provided. The Riemannian gradient of f is here defined as a mapping ∇f : H → C with the following property: for any complex number h,

Re[ (∇f(z))* h ] / y² = Re[ (df(z))* h ] (13)

where Re denotes the real part, * denotes conjugation and df is the "derivative", df = (∂f/∂x) + (∂f/∂y) i. Equivalently, ∇f(z) = y² df(z). For example, if f(z) = y, it follows from Equation (13) that ∇f(z) = y² i. Riemannian gradient descent consists in following the direction of −∇f at each step, with the length of the step (in other words, the step size) being determined by the user. The generic algorithm is, up to some variations, the following:

INPUT z₀ ∈ H % initial guess
k = 0
WHILE ‖∇f(z_k)‖ > ε % ε is a small tolerance
    z_{k+1} = exp_{z_k}( −λ ∇f(z_k) ) % step of size λ along the negative gradient
    k = k + 1
END WHILE
OUTPUT ẑ = z_k

Here, in the condition for the while loop, ‖∇f(z_k)‖ is the Riemannian norm of the gradient ∇f(z_k). In other words,

‖∇f(z)‖ = |∇f(z)| / y

Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in following the direction of the negative gradient −∇f(z_k), in order to define a new estimate. This is repeated as long as the gradient is sensibly nonzero, in the sense of the loop condition. The generic algorithm described above has no guarantee of convergence. Convergence and behavior near limit points depend on the function f, on the initialization of the algorithm and on the step sizes λ. For these aspects, the reader may consult [14] (Chapter 4).
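The generic scheme above can be sketched in a few lines. In this illustration (ours, with arbitrary step size and tolerance), the differential df is approximated by finite differences, the gradient is ∇f = y² df, and the update follows the geodesic exp_z(−λ ∇f):

```python
# A minimal sketch of Riemannian gradient descent on H.
import cmath, math

def _exp_i(v):
    """Riemannian exponential at i for a tangent vector v (our convention)."""
    t = abs(v)
    if t == 0.0:
        return 1j
    psi = cmath.phase(v) - math.pi / 2.0
    a, b = math.cos(psi / 2.0), math.sin(psi / 2.0)
    w = cmath.exp(t) * 1j
    return (a * w + b) / (-b * w + a)

def _exp_z(z, v):
    """Exponential at z; |v| / y is the Riemannian length of v at z."""
    return z.real + z.imag * _exp_i(v / z.imag)

def riemannian_gradient(f, z, eps=1e-6):
    """grad f = y^2 df, with df taken by central finite differences."""
    dfx = (f(z + eps) - f(z - eps)) / (2 * eps)
    dfy = (f(z + eps * 1j) - f(z - eps * 1j)) / (2 * eps)
    return (z.imag ** 2) * complex(dfx, dfy)

def gradient_descent(f, z0, step=0.1, tol=1e-8, max_iter=10000):
    z = z0
    for _ in range(max_iter):
        g = riemannian_gradient(f, z)
        if abs(g) / z.imag < tol:   # Riemannian norm of the gradient
            break
        z = _exp_z(z, -step * g)
    return z

# Example: minimizing the squared Rao distance to z0 recovers z0.
z0 = 0.5 + 2j
f_example = lambda z: math.acosh(1 + abs(z - z0) ** 2 / (2 * z.imag * z0.imag)) ** 2
z_min = gradient_descent(f_example, 1j)
```

The example converges to z0 from the starting point i; the constant step size is purely illustrative.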

Riemannian Prior on the Univariate Normal Model
The current section introduces new prior distributions on the univariate normal model. These may be referred to as "Riemannian priors", since they are entirely based on the Riemannian geometry of this model, and will also be called "Gaussian distributions", when viewed as probability distributions on the Poincaré half plane.
Here, Section 3.1 defines in a rigorous way Gaussian distributions on H (based on the intuitive Formula (2)). A Gaussian distribution G(z̄, γ) has two parameters, z̄ ∈ H, called the center of mass, and γ > 0, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm of Section 2.3 to provide an algorithm for computing maximum likelihood estimates of z̄ and γ. Finally, Section 3.3 proves that z̄ is the Riemannian center of mass, or Karcher mean, of the distribution G(z̄, γ) (historically, it is more correct to speak of the "Fréchet mean", since this concept was proposed by Fréchet in 1948 [15]), and that γ is uniquely related to the mean square Rao's distance from z̄.
The reader may wish to note that the results of Section 3.3 are not used in the following, so that Section 3.3 may be skipped on a first reading.

Gaussian Distributions on H
A Gaussian distribution G(z̄, γ) on H is a probability distribution with the following probability density function:

p(z | z̄, γ) = (1/Z(γ)) exp( −d²(z, z̄) / 2γ² ) (14)

Here, z̄ ∈ H is called the center of mass and γ > 0 the dispersion parameter of the distribution G(z̄, γ).
The squared distance d²(z, z̄) refers to Rao's distance (5). The probability density function (14) is understood with respect to the Riemannian volume element dA(z). In other words, the normalization constant Z(γ) is given by:

Z(γ) = ∫_H exp( −d²(z, z̄) / 2γ² ) dA(z) (15)

Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so, let (r, ϕ) be polar coordinates whose origin is z̄. Then, d²(z, z̄) = r² when z = z(r, ϕ), as in Equation (9), so that the integrand does not depend on ϕ. According to Equation (12), the integral Z(γ) reduces to:

Z(γ) = 2π ∫₀^∞ exp( −r²/2γ² ) sinh(r) dr

which is readily calculated:

Z(γ) = 2π (π/2)^{1/2} γ e^{γ²/2} erf( γ/√2 ) (16)

where erf denotes the error function. Formula (16) completes the definition of the Gaussian distribution G(z̄, γ). This definition is the same as suggested in [11], with the difference that, in the present work, it has been possible to compute exactly the normalization constant Z(γ).
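The closed form (16) can be checked against a direct numerical evaluation of the polar-coordinate integral. The sketch below uses the expression 2π √(π/2) γ e^{γ²/2} erf(γ/√2) as reconstructed above; the truncation point and tolerance are our own choices.

```python
# Numerical check of the normalization constant Z(gamma):
# polar integral  Z = 2*pi * int_0^inf exp(-r^2 / 2 gamma^2) sinh(r) dr
# vs. the closed form  2*pi * sqrt(pi/2) * gamma * e^{gamma^2/2} * erf(gamma/sqrt(2)).
import math

def Z_closed_form(gamma):
    return (2 * math.pi * math.sqrt(math.pi / 2) * gamma
            * math.exp(gamma ** 2 / 2) * math.erf(gamma / math.sqrt(2)))

def Z_quadrature(gamma, n=200000):
    """Trapezoidal approximation of the polar-coordinate integral."""
    r_max = gamma ** 2 + 40 * gamma   # the integrand is negligible beyond this
    h = r_max / n
    total = 0.0
    for k in range(n + 1):
        r = k * h
        v = math.exp(-r * r / (2 * gamma ** 2)) * math.sinh(r)
        total += v if 0 < k < n else v / 2
    return 2 * math.pi * h * total
```

For moderate values of γ, the two agree to high relative accuracy.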
It is noteworthy that the normalization constant Z(γ) depends only on γ and not on z̄. This shows that the shape of the probability density function (14) does not depend on z̄, which only plays the role of a location parameter. At a deeper mathematical level, this reflects the fact that H is a homogeneous Riemannian space [13].
The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal) probability density function. Indeed, both are proportional to the exponential of minus the "square distance", but in one case, the distance is interpreted as Euclidean distance and, in the other (that of Equation (14)), as Rao's distance.

Maximum Likelihood Estimation of z and γ
Consider the problem of computing maximum likelihood estimates of the parameters z̄ and γ of the Gaussian distribution G(z̄, γ), based on independent samples {z_i}, i = 1, . . ., N, from this distribution. Given the expression (14) of the density p(z | z̄, γ), the log-likelihood function ℓ(z̄, γ) can be written:

ℓ(z̄, γ) = −N log Z(γ) − (1/2γ²) Σ_{i=1}^N d²(z_i, z̄) (17)

Since z̄ only appears in the second term, the maximum likelihood estimate of z̄, say ẑ, can be computed first. It is given by the minimization problem:

ẑ = argmin_{z ∈ H} Σ_{i=1}^N d²(z, z_i) (18)

In other words, the maximum likelihood estimate ẑ minimizes the sum of squared Rao distances to the samples z_i. This exhibits ẑ as the Riemannian center of mass, also called the Karcher or the Fréchet mean [16], of the samples z_i. The notion of Riemannian center of mass is currently a widely popular one in signal and image processing, with applications ranging from blind source separation and radar signal processing [17,18] to shape and motion analysis [19,20]. The definition of Gaussian distributions proposed in the present paper shows how the notion of Riemannian center of mass is related to maximum likelihood estimation, thereby giving it a statistical foundation.
An original result, due to Cartan and cited in [16], states that ẑ, as defined in Equation (18), exists and is unique, since H, with the Riemannian metric (4), has constant negative curvature. Here, ẑ is computed using Riemannian gradient descent, as described in Section 2.3. The cost function f to be minimized is given by (the factor N^{−1} is conventional):

f(z) = (1/N) Σ_{i=1}^N d²(z, z_i) (19)

Its Riemannian gradient ∇f(z) is easily found by noting the following fact. Let f_w(z) = d²(z, w) for a fixed w ∈ H. Then, the Riemannian gradient of this function is (see [21] (page 407)):

∇f_w(z) = −2 log_z(w) (20)

where log_z : H → C is the inverse of exp_z : C → H. It follows from Equation (20) that:

∇f(z) = −(2/N) Σ_{i=1}^N log_z(z_i) (21)

The analytic expression of log_z, for any z ∈ H, will be given below (see Equation (23)).
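The minimization (18), with the gradient (21), can be sketched numerically: iterating z ← exp_z((1/N) Σ_i log_z(z_i)) is the gradient step with λ = 1/2. The exp/log maps below follow our own direction conventions; only their metric properties matter for the iteration.

```python
# Riemannian center of mass (Karcher/Frechet mean) on H, a sketch of (18)/(21).
import cmath, math

def _dist(z1, z2):
    return math.acosh(1 + abs(z1 - z2) ** 2 / (2 * z1.imag * z2.imag))

def _log_i(z):
    """Tangent vector at i pointing to z, of Riemannian length d(i, z)."""
    if abs(z.real) < 1e-15:
        return math.log(z.imag) * 1j
    x, y = z.real, z.imag
    c = (x * x + y * y - 1) / (2 * x)      # center of the geodesic half-circle
    u = (1 + c * 1j) / math.hypot(1, c)    # unit tangent at i along that circle
    if x < 0:
        u = -u
    return _dist(1j, z) * u

def log_map(z, w):
    """log_z(w), via the isometry moving z to i."""
    return z.imag * _log_i((w - z.real) / z.imag)

def exp_map(z, v):
    """exp_z(v); |v| / Im(z) is the Riemannian length of v at z."""
    t = abs(v) / z.imag
    if t == 0.0:
        return z
    psi = cmath.phase(v) - math.pi / 2
    a, b = math.cos(psi / 2), math.sin(psi / 2)
    gi = (a * cmath.exp(t) * 1j + b) / (-b * cmath.exp(t) * 1j + a)
    return z.real + z.imag * gi

def karcher_mean(points, n_iter=100):
    """Fixed-point iteration z <- exp_z( mean of log_z(z_i) )."""
    z = points[0]
    for _ in range(n_iter):
        v = sum(log_map(z, w) for w in points) / len(points)
        z = exp_map(z, v)
    return z
```

For two points on the vertical axis, the mean is the geodesic midpoint; for a symmetric pair, it lands on the axis of symmetry, with the gradient (21) vanishing there.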
Here, the gradient descent algorithm for computing ẑ is described. This algorithm uses a constant step size λ, which is fixed manually.
Once the maximum likelihood estimate ẑ has been computed, using the gradient descent algorithm, the maximum likelihood estimate of γ, say γ̂, is found by solving the equation:

F(γ̂) = (1/N) Σ_{i=1}^N d²(ẑ, z_i) (22)

where

F(γ) = γ³ (d/dγ) log Z(γ)

The gradient descent algorithm for computing ẑ is the following:

INPUT z₀ ∈ H % initial guess
k = 0
WHILE ‖∇f(z_k)‖ > ε
    z_{k+1} = exp_{z_k}( −λ ∇f(z_k) ) % ∇f given by Equation (21); step size λ is constant
    k = k + 1
END WHILE
OUTPUT ẑ = z_k % near Riemannian center of mass

Application of Formula (21) requires computation of log_ẑ(z_i) for i = 1, . . ., N. Fortunately, this can be done analytically, as follows. In general, for ẑ = (x̄, ȳ),

log_ẑ(z) = ȳ log_i( (z − x̄)/ȳ ) (23)

where log_i is found by inverting Equation (6). Precisely,

log_i(z) = d(i, z) u(z) (24)

where, for z = (x, y) with x ≠ 0, u(z) is the unit vector

u(z) = sign(x) (1 + c i) / (1 + c²)^{1/2}

and:

c = (x² + y² − 1) / 2x

and, for z = (0, y), log_i(z) = ln(y) i, with ln denoting the natural logarithm.
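The dispersion estimate γ̂ of Equation (22) can then be obtained by a one-dimensional root search. In the sketch below, the closed form F(γ) = γ² + γ⁴ + γ³ √(2/π) e^{−γ²/2} / erf(γ/√2) is our own differentiation of the reconstructed Z(γ) and should be treated as an assumption; note that F(γ) ≈ 2γ² for small γ, the two-dimensional Euclidean limit.

```python
# Solving F(gamma) = mean squared Rao distance (Equation (22)) by bisection.
import math

def F(gamma):
    """gamma^3 * d/dgamma log Z(gamma), differentiated by hand (assumption)."""
    g = gamma
    return (g ** 2 + g ** 4
            + g ** 3 * math.sqrt(2 / math.pi) * math.exp(-g * g / 2)
              / math.erf(g / math.sqrt(2)))

def solve_gamma(mean_sq_dist, lo=1e-6, hi=50.0, iters=200):
    """Bisection for F(gamma) = mean_sq_dist; F is increasing in gamma."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < mean_sq_dist:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A round trip through F recovers the dispersion parameter: solve_gamma(F(γ)) returns γ.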

Significance of z and γ
The parameters z̄ and γ of a Gaussian distribution G(z̄, γ) have been called the center of mass and the dispersion parameter. In the present paragraph, it is proven that:

z̄ = argmin_{z ∈ H} ∫_H d²(z, z′) p(z′ | z̄, γ) dA(z′) (25)

and also that:

∫_H d²(z̄, z′) p(z′ | z̄, γ) dA(z′) = F(γ) (26)

where F(γ) was defined in Equation (22) and p(z′ | z̄, γ) is the probability density function of G(z̄, γ), given in Equation (14). Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed, Equations (25) and (26) can be written:

z̄ = argmin_{z ∈ H} E_{z̄,γ} d²(z, z′)    F(γ) = E_{z̄,γ} d²(z̄, z′) (27)

where E_{z̄,γ} denotes the expectation with respect to G(z̄, γ), carried out on the variable z′. Now, these two formulae are the same as Equations (18) and (22), but with expectation instead of empirical mean. Note, moreover, that Equations (25) and (26) can be interpreted as follows. If z′ is distributed according to the Gaussian distribution G(z̄, γ), then Equation (25) states that z̄ is the unique point, out of all z ∈ H, which minimizes the expectation of squared Rao's distance to z′. Moreover, Equation (26) states that the expectation of squared Rao's distance between z̄ and z′ is equal to F(γ), so that F(γ) is the least possible expected squared Rao's distance between a point z ∈ H and z′. This interpretation justifies calling z̄ the center of mass of G(z̄, γ) and shows that γ is uniquely related to the expected dispersion, as measured by squared Rao's distance, away from z̄.
In order to prove Equation (25), consider the log-likelihood function:

ℓ_{z′}(z) = log p(z′ | z, γ) = −log Z(γ) − d²(z, z′)/2γ² (28)

The score function with respect to z is, by definition:

u(z′ | z) = ∇_z ℓ_{z′}(z) (29)

where ∇_z indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with respect to the variable z. Under certain regularity conditions, which are here easily verified, the expectation of the score function is identically zero:

E_{z̄,γ} u(z′ | z̄) = 0 (30)

Let f(z) be defined by:

f(z) = E_{z̄,γ} d²(z, z′) (31)

with the expectation carried out on the variable z′. Clearly, f(z) is the expression to be minimized in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging Riemannian gradient and expectation,

∇f(z̄) = E_{z̄,γ} ∇_z d²(z̄, z′) = −2γ² E_{z̄,γ} u(z′ | z̄) = 0

where the last equality follows from Equation (30). It has just been proved that z̄ is a stationary point of f (a point where the gradient is zero). Theorem 2.1 in [16] states that the function f has one and only one stationary point, which is, moreover, a global minimizer. This concludes the proof of Equation (25).
The proof of Equation (26) follows exactly the same method, defining the score function with respect to γ and noting that its expectation is identically zero.

Classification of Univariate Normal Populations
The previous section studied Gaussian distributions on H, "as they stand", focusing on the fundamental issue of maximum likelihood estimation of their parameters. The present Section considers the use of Gaussian distributions as prior distributions on the univariate normal model.
The main motivation behind the introduction of Gaussian distributions is that a Gaussian distribution G(z̄, γ) can be used to give a geometric representation of a cluster or class of univariate normal populations. Recall that each point (x, y) ∈ H is identified with a univariate normal population with mean µ = √2 x and standard deviation σ = y. The idea is that populations belonging to the same cluster, represented by G(z̄, γ), should be viewed as centered on z̄ and lying within a typical distance determined by γ.
In the remainder of this Section, it is shown how the maximum likelihood estimation algorithm of Section 3.2 can be used to fit the hyperparameters z̄ and γ to data consisting of a class S = {S_i; i = 1, . . ., K} of univariate normal populations. This is then applied to the problem of the classification of univariate normal populations. The whole development is based on marginalized likelihood estimation, as follows.
Assume each population S_i contains N_i points, S_i = {s_j; j = 1, . . ., N_i}, and that the points s_j, within any given population, are drawn from a univariate normal distribution with mean µ and standard deviation σ. The focus will be on the asymptotic case where the number N_i of points in each population S_i is large.
In order to fit the hyperparameters z̄ and γ to the data S, assume, moreover, that the distribution of z = (x, y), where (x, y) = (µ/√2, σ), is a Gaussian distribution G(z̄, γ). Then, the distribution of S can be written in integral form:

p(S | z̄, γ) = Π_{i=1}^K ∫_H p(S_i | z) p(z | z̄, γ) dA(z) (32)

where p(z | z̄, γ) is the probability density of a Gaussian distribution G(z̄, γ), defined in Equation (14). Moreover, expressing p(S_i | z) as a product of univariate normal densities p(s_j | z), it follows that:

p(S | z̄, γ) = Π_{i=1}^K ∫_H ( Π_{j=1}^{N_i} p(s_j | z) ) p(z | z̄, γ) dA(z)

This expression, given the data S, is to be maximized over (z̄, γ). Using the Laplace approximation, this task is reduced to the maximum likelihood estimation problem addressed in Section 3.2.
The Laplace approximation will here be applied in its "basic form" [9], that is, up to terms of order N_i^{−1}. To do so, write each of the integrals in Equation (32) using Equation (8) of Section 2.2. These integrals then take on the form:

∫ Π_{j=1}^{N_i} p(s_j | z) p(z | z̄, γ) (dx dy / y²) (33)

where the univariate normal density p(s_j | z) is given by its full expression. Now, the product Π_j p(s_j | z) can be written exp[ −N_i h(x, y) ], where:

h(x, y) = ln( (2π)^{1/2} y ) + (B_i² + V_i²) / 2y²

Here, B_i² and V_i² are the squared empirical bias and the empirical variance within the population S_i,

B_i = Ŝ_i − √2 x    V_i² = (1/N_i) Σ_{j=1}^{N_i} (s_j − Ŝ_i)²

where Ŝ_i is the empirical mean of the population, Ŝ_i = (1/N_i) Σ_{j=1}^{N_i} s_j. The expression h(x, y) is minimized when x = x̂_i and y = ŷ_i, where ẑ_i = (x̂_i, ŷ_i) is the pair of maximum likelihood estimates of the parameters (x, y), based on the population S_i.
According to the Laplace approximation, the integral Equation (33) is equal to:

(2π/N_i) |∂²h(x̂_i, ŷ_i)|^{−1/2} exp( −N_i h(x̂_i, ŷ_i) ) p(ẑ_i | z̄, γ) (1/ŷ_i²)

up to terms of order N_i^{−1}, where ∂²h(x̂_i, ŷ_i) is the matrix of second derivatives of h and | · | denotes the determinant. Now, since h is essentially the logarithm of p(s_j | z), a direct calculation shows that ∂²h(x̂_i, ŷ_i) is the same as the Fisher information matrix derived in Section 2.1 (where it was denoted I(z)). Thus, the first factor in the above expression equals (2π/N_i) ŷ_i², and the factor ŷ_i² cancels out with the last factor. Finally, the Laplace approximation of the integral Equation (33) reads:

(2π/N_i) exp( −N_i h(x̂_i, ŷ_i) ) p(ẑ_i | z̄, γ) (34)

and the resulting approximation of the distribution of S, as given by Equation (32), can be written:

p(S | z̄, γ) ≈ α Π_{i=1}^K p(ẑ_i | z̄, γ)

where α is a factor which does not depend on the hyperparameters (z̄, γ), and p(ẑ_i | z̄, γ) has the expression (14). Accepting this expression to give the distribution of the data S, conditionally on the hyperparameters (z̄, γ), the task of estimating these hyperparameters becomes the same as the maximum likelihood estimation problem described in Section 3.2.
In conclusion, if one assumes the populations S_i belong to a single cluster or class S and wishes to fit the hyperparameters z̄ and γ of a Gaussian distribution representing this cluster, it is enough to start by computing the maximum likelihood estimates x̂_i and ŷ_i for each population S_i and then to consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.
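The reduction just described is easy to sketch: each population is summarized by its maximum likelihood estimates ẑ_i = (x̂_i, ŷ_i), which then serve as input samples for the algorithm of Section 3.2. The code below (with synthetic data and names of our own) shows the summarization step:

```python
# Summarizing a univariate normal population by its point in H.
import math, random

def population_to_point(samples):
    """Map a population to z = x + iy, with x = mean/sqrt(2), y = MLE std."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n   # MLE (biased) variance
    return complex(mean / math.sqrt(2), math.sqrt(var))

# Illustration on a synthetic population drawn from N(3, 2^2).
random.seed(0)
pop = [random.gauss(3.0, 2.0) for _ in range(100000)]
z_hat = population_to_point(pop)
```

For a large population, z_hat is close to (µ/√2, σ), here (3/√2, 2).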
The same reasoning just carried out, using the Laplace approximation, can be generalized to the problem of classification of univariate normal populations. Indeed, assume that classes {S_L; L = 1, . . ., C}, each containing some number K_L of univariate normal populations, have been identified based on some training sequence. Using the Laplace approximation and the maximum likelihood estimation approach of Section 3.2, it is possible to fit to each one of these classes the hyperparameters (z̄_L, γ_L) of a Gaussian distribution G(z̄_L, γ_L) on H.
For a test population S_t, the maximum likelihood rule for deciding to which of the classes S_L this test population belongs requires finding the following maximum:

L* = argmax_L p(S_t | z̄_L, γ_L) (35)

and assigning the test population S_t to the class with label L*. If the number of points N_t in the population S_t is large, the Laplace approximation, in the same way as used above, approximates the maximum in Equation (35) by:

L* = argmax_L p(ẑ_t | z̄_L, γ_L) (36)

where ẑ_t = (x̂_t, ŷ_t) is the pair of maximum likelihood estimates computed based on the test population S_t and where p(ẑ_t | z̄_L, γ_L) is given by Equation (14). Now, writing out Equation (14), the decision rule becomes:

L* = argmax_L [ −log Z(γ_L) − d²(ẑ_t, z̄_L) / 2γ_L² ] (37)

Under the homoscedasticity assumption that all of the γ_L are equal, this decision rule essentially becomes the same as the one proposed in [5], which requires S_t to be assigned to the "nearest" cluster, in terms of Rao's distance. Indeed, if all the γ_L are equal, then Equation (37) is the same as:

L* = argmin_L d²(ẑ_t, z̄_L) (38)

This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes into account the uncertainty associated with each cluster, as measured by its dispersion parameter γ_L.
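The decision rule (37) itself is a few lines of code. The sketch below (ours) scores each class by −log Z(γ_L) − d²(ẑ_t, z̄_L)/2γ_L², using the normalization constant as reconstructed in Section 3.1:

```python
# Decision rule (37): assign the test point z_t to the class maximizing
# log p(z_t | z_bar_L, gamma_L).
import math

def _d2(z1, z2):
    """Squared Rao's distance."""
    return math.acosh(1 + abs(z1 - z2) ** 2 / (2 * z1.imag * z2.imag)) ** 2

def _log_Z(gamma):
    """log of the normalization constant (16), as reconstructed above."""
    return (math.log(2 * math.pi) + 0.5 * math.log(math.pi / 2) + math.log(gamma)
            + gamma ** 2 / 2 + math.log(math.erf(gamma / math.sqrt(2))))

def classify(z_t, classes):
    """classes: list of (z_bar_L, gamma_L) pairs; returns the index L*."""
    scores = [-_log_Z(g) - _d2(z_t, zb) / (2 * g * g) for zb, g in classes]
    return max(range(len(classes)), key=lambda L: scores[L])
```

With equal dispersion parameters, the rule reduces to the nearest-center rule (38); with unequal ones, a class with a larger γ_L can win even when the two centers are equidistant from the test point.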

Application to Image Classification
In this section, the framework proposed in Section 4 for the classification of univariate normal populations is applied to texture image classification using Gabor filters. Several authors have found that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor energy sub-bands, resulting from three scales and eight orientations. Hence, each texture image can be decomposed into a collection of those 24 sub-bands. For more information concerning the implementation, the interested reader is referred to [22].
Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was formed by choosing randomly eight subimages out of each image. To each subimage in the training sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale s and orientation o to a subimage i belonging to an image L is a univariate normal population S_{i,s,o} of 128 × 128 points (one point for each pixel, after the filter is applied). If S_t is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining independent univariate normal populations S_{t,s,o}, and then compute for each population the pair of maximum likelihood estimates ẑ_{t,s,o} = (x̂_{t,s,o}, ŷ_{t,s,o}). The decision rule Equation (37) of Section 4 requires that S_t should be assigned to the image L*, which realizes the maximum:

L* = argmax_L Σ_{s,o} [ −log Z(γ_{L,s,o}) − d²(ẑ_{t,s,o}, z̄_{L,s,o}) / 2γ²_{L,s,o} ]

where the sum over (s, o) reflects the independence of the sub-bands. Under the homoscedasticity assumption, i.e., γ_{L,s,o} = γ_{s,o} for all L, this decision rule becomes:

L* = argmin_L Σ_{s,o} d²(ẑ_{t,s,o}, z̄_{L,s,o}) / 2γ²_{s,o}

For this concrete application to the VisTex database, it is pertinent to compare the rate of successful classification (or overall accuracy) obtained using the Riemannian prior, based on the framework of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma distribution: a normal prior on the mean µ = √2 x, conditionally on the standard deviation σ = y, combined with an inverse gamma prior on σ². Using this conjugate prior, instead of a Riemannian prior, and following the same procedure of applying the Laplace approximation, a different decision rule is obtained, in which L* maximizes the corresponding marginal likelihood. Recall that the overall accuracy is the ratio of the number of successfully classified subimages to the total number of subimages. The results show that the use of a Riemannian prior, even under a homoscedasticity assumption, yields a significant improvement upon the use of a conjugate prior.

Conclusions
Motivated by the problem of the classification of univariate normal populations, this paper introduced a new class of prior distributions on the univariate normal model. With the univariate normal model viewed as the Poincaré half plane H, these new prior distributions, called Gaussian distributions, were meant to reflect the geometric picture (in terms of Rao's distance) that a cluster or class of univariate normal populations can be represented as having a center z̄ ∈ H and a "variance" or dispersion γ². Precisely, a Gaussian distribution G(z̄, γ) has a probability density function p(z), with respect to the Riemannian volume of the Poincaré half plane, which is proportional to exp(−d²(z, z̄)/2γ²). Using Gaussian distributions as prior distributions in the problem of the classification of univariate normal populations was shown to lead to a new, more general and more efficient decision rule. This decision rule was implemented in a real-world application to texture image classification, where it led to a significant improvement in performance, in comparison to decision rules obtained by using conjugate priors.
The general approach proposed in this paper contains several simplifications and approximations, which could be improved upon in future work. First, it is possible to use different prior distributions, which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal populations. For example, it may be helpful to replace Gaussian distributions that are "isotropic", in the sense of having a scalar dispersion parameter γ, by non-isotropic distributions, with a dispersion matrix Γ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of representing it by a single Gaussian distribution.
These variants, which would allow classes with a more complex geometric structure to be taken into account, can be integrated into the general framework proposed in the paper, based on: (i) fitting a prior distribution (non-isotropic Gaussian, mixture of Gaussians) to each class; and (ii) choosing, for a test population, the most adequate class, based on a suitable decision rule. These two steps can be realized as above, through the Laplace approximation and maximum likelihood estimation, or through alternative techniques, based on Markov chain Monte Carlo stochastic optimization.
In addition to generalizing the approach of this paper and improving its performance, a further important objective for future work will be to extend it to other parametric models, beyond univariate normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian, elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the present approach may be helpful.

Figure 1. Forty images of the VisTex database.
Here, x̂_{t,s,o} and ŷ_{t,s,o} are the maximum likelihood estimates computed for the population S_{t,s,o}. Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is summarized in the following table.