Approximation of Densities on Riemannian Manifolds

Finding an approximate probability distribution best representing a sample on a measure space is one of the most basic operations in statistics. Many procedures were designed for that purpose when the underlying space is a finite dimensional Euclidean space. In applications, however, such a simple setting may not be adapted and one has to consider data living on a Riemannian manifold. The lack of unique generalizations of the classical distributions, along with theoretical and numerical obstructions require several options to be considered. The present work surveys some possible extensions of well known families of densities to the Riemannian setting, both for parametric and non-parametric estimation.


Introduction
In probability and statistics, random variables whose law admits a probability density (with respect to e.g., the Lebesgue measure) are more tractable than general ones, both from the theoretical and the algorithmic point of view. When dealing with experimental data, the density is generally unknown and must be estimated.
In many cases, it is a function belonging to a given family, defined on the image of the random variable. When the family depends on a finite number of parameters, the estimation problem boils down to finding a point in the parameter space minimizing a goodness-of-fit criterion. Methods pertaining to this class are referred to as parametric. On the other hand, when prior information on the true density is lacking or when the parametric approach is too complicated or expensive from a computational point of view, it may be more pertinent to use another class of methods, the non-parametric ones, that does not rely on fitting parameters. In this case, the true density is computed from the samples themselves, often by summing copies of a model density known as a kernel function.
On the other hand, some applications require an approximation rather than a proper estimation of the density of a given dataset. When dealing with large datasets, it can be interesting to search for a summary of the empirical distribution in the form of a discrete probability measure with a given number of supporting points. This is known as the quantization problem and has received much attention from the signal processing community. It is worth noticing that finding an optimal quantization is closely related to clustering: each supporting point in the discrete distribution can be thought as a cluster center, with a membership function that associates with a sample the closest supporting point.
In all the above cases, density approximation is central. When the considered random variables are defined on a finite dimensional real vector space, the problem has been extensively studied [1,2]. However, in many applications, the data are best modeled as elements of a Riemannian manifold, X is called the tangent vector to α at p. Using Hadamard's lemma in a chart, one can show that X can be in fact represented by a vector in R d , hence the name. Please note that X as a derivation is coordinate-free, while it depends on the local coordinates when represented in R d .
A tangent vector at p is a tangent vector to a certain curve passing through p at time zero. The set of all tangent vectors at p defines the tangent space T p M, a vector space of same dimension as M, and the collection of all tangent spaces defines the tangent bundle TM = ∪ p∈M T p M. Given a coordinate chart ϕ = (x 1 , . . . , x d ), the tangent vectors defining partial derivation according to coordinates x i are denoted by ∂ ∂x 1 (p), . . . , ∂ ∂x d (p) and define a basis of T p M. As any vector space, the tangent space at p admits a dual space T * p M called the cotangent space, and composed of linear forms z p : T p M → R, also called cotangent vectors. The basis of T p M * in local coordinates is denoted by dx 1 (p), . . . , dx d (p).

Pullback and Pushforward
Associated with the dual notions of tangent and cotangent vectors are the dual notions of pushforward and pullback. Given a smooth map F : M → N between two smooth manifolds, the pushforward of F is a linear map F * : TM → TN that maps tangent vectors X p at point p ∈ M to tangent vectors F * X p at the image point F(p) ∈ N. The vector F * X p can be defined as acting on real-valued functions f : N → R or equivalently, as the velocity vector of a curve α : (− , ) → M passing through p at time zero with speedα(0) = X p , The pushforward is also called the differential of F and can also be denoted d p F := (F * ) p . Symmetrically, the pullback maps cotangent vectors z F(p) at F(p) ∈ N to cotangent vectors at p ∈ M, acting on tangent vectors X p ∈ T p M as (F * z F(p) )(X p ) = z F(p) (F * X p ).

Vector Fields and Covariant Derivatives
A vector field is a mapping X : M → TM that associates with each point p a tangent vector X p ∈ T p M. Just as tangent vectors, it acts on differentiable functions f : M → R in a way that can be written, in local coordinates, as It is possible to take the derivative of a vector field with respect to another using an affine connection, that is, a functional ∇ that acts on pairs of vector fields (X, Y) → ∇ X Y according to the following rules This action is referred to as covariant derivative. Vector fields V : (− , ) → TM can also be defined along a curve α(t), that is, V(t) is an element of T α(t) M) for all t. Then, covariant derivative is sometimes denoted by DV dt := ∇˙α (t) V.

Riemannian Metric and Geodesics
The possibility to compute angles and lengths in a differentiable manifold is given by a Riemannian metric, i.e., a smoothly varying inner product g p : T p M × T p M → R defined on each tangent space T p M at p ∈ M (recall that T p M is a vector space). The subscript p in the metric will often be omitted and the associated norm will be denoted by · := g(·, ·). There is only one affine connection that is symmetric, meaning XY − YX = ∇ X Y − ∇ Y X, and compatible with the Riemannian metric, which is d dt g(U, V) = g DU dt , V + g U, DV dt , for any vector fields U, V along a curve α. It is called the Levi-Civita connection associated with g and will be denoted by ∇ from now on. Just as the Euclidean distance can be measured as the length of a straight line, distances in a Riemannian manifold are computed through the length of minimizing geodesics. The geodesics of M are the curves γ satisfying the relation ∇γγ = 0, which implies that their speed has constant norm γ(t) = cst. They are also the local minimizers of the arc length functional l: if curves are assumed, without loss of generality, to be defined over the interval [0, 1]. When it exists, the length of the shortest geodesic linking two points defines their geodesic distance. The cut locus of p ∈ M is the set of points where the geodesics starting at p stop being minimizing, and the injectivity radius at p is its distance to the cut locus. The global injectivity radius of the manifold is the infimum of the injectivity radii over all points of M.

Exponential and Logarithm Maps
From the geodesics of M, we can now define the exponential map at point p, a diffeomorphism (i.e., a differentiable, bijective map of differentiable inverse) denoted by exp p , which maps a tangent vector v of an open ball B(0, r) ⊂ T p M centered in 0 to the endpoint γ (1) Intuitively, the exponential map moves the point p along the geodesic starting from p at speed v and stops after covering the length v . Conversely, the inverse of the exponential map log p (q) := exp −1 p (q) gives the vector that maps p to q. The image by the exponential map of the open ball B(0, r) ⊂ T p M, with r less than the injectivity radius at p, is called the geodesic ball of radius r > 0 centered in p.

Curvature and Jacobi Fields
The curvature tensor of M associates with any pair of vector fields (X, Y) on M a linear mapping R(X, Y) on the space of vector fields, defined for all vector field Z by where [X, Y] := XY − YX denotes the Lie bracket. Another way to characterize the curvature of M is through the sectional curvature, which is defined for any two-dimensional subspace σ ⊂ T p M of the tangent space at p by if u, v are two linearly independent vectors that span σ. Due to the curvature of a manifold, geodesics spreading out from a given point p can either diverge (negative curvature) or converge (positive curvature). The way these geodesics spread out is described by the Jacobi fields. If {t → exp p (tv(s)), s ∈ (− , )} is a sheave of geodesics starting from the same point p at speeds v(s) ∈ T p M, and γ(t) := exp p (tv(0)) denotes the one at the center of the sheave, the vector field along γ defined for all t by is a Jacobi field along γ. It is the only one with initial conditions J(0) = v(0) and DJ dt (0) =v(0) (where we identify the two vector spaces T p M ≈ T v(0) T p M) verifying the Jacobi equation D 2 dt 2 J + R(J,γ)γ = 0.

Measures and Integration over a Riemannian Manifold
A differentiable k-form ω on an orientable d-dimensional manifold M associates with all p ∈ M an alternating multilinear function ω p : (T p M) d → R (i.e., ω p associates zero to any d-tuple with a repetition). If ω is a differentiable k-form on a manifold N, then any smooth map F : M → N induces by pullback a k-form on M acting on k-tuples of tangent vectors at p ∈ M as The volume forms of M are the differential forms of maximal degree d (the dimension of the manifold), and are the only ones that can be integrated over M. If (U, ϕ) is a local chart such that supp ω ⊂ U, then (ϕ −1 ) * ω is a d-form on R d , and so it admits a density f : ϕ −1 (U) → R with respect to the volume element defined in local coordinates by the exterior product dx 1 ∧ . . . ∧ dx d . The integral of the volume form ω is then defined by Every volume form defines a measure on M, which is written by extension in local coordinates dµ = f dx 1 ∧ . . . ∧ dx d or dµ = f dx for short. The Riemannian measure is the volume form which density is given by the square root of the determinant of the Riemannian metric, i.e., The Riemannian measure will play the role of the Lebesgue measure for integrals defined on M.

The Laplace-Beltrami Operator
Finally, in order to make this work self-contained, we introduce the generalization of the Laplacian to manifolds, namely, the Laplace-Beltrami operator. Let X be a vector field on M and φ X (t, x), (t, x) ∈ (− , ) × U its local flow in a neighborhood U of p ∈ M, i.e., such that for all Then, the Lie derivative of the volume form along the vector field X is given by the derivative of its pullback by the flow of X Intuitively, it measures the way infinitesimal volume is transported by X. Since L X vol is a d-form, it admits a density with respect to the Riemannian volume form, which is defined to be the divergence of X, i.e., L X vol = (divX)vol. Then, the Laplace-Beltrami operator of a function f : M → R is, just as the Euclidean case, defined as the divergence of its gradient where the gradient is linked to the differential (or pushforward) as g p (grad p f, X p ) = d p f(X p ) for any tangent vector X p ∈ T p M.
The Laplace-Beltrami operator can defined alternatively using the Levi-Civita connection. Let X, Y be vector fields and f : M → R and f : M → R as above. The Hessian of f is the symmetric 2-tensor: The Laplacian of f is then defined as the trace of the Hessian with respect to the metric: where g ij stands for the i, j element of the inverse metric matrix and ∂ i is the i-th coordinate vector field.

Parametric Estimation
Let (E, F , µ) be a measure space. In the sequel, all distributions are assumed to be absolutely continuous with respect to µ and all densities will thus implicitly refer to it. In parametric density estimation, one wants to approximate an unknown distribution with density f by a member of a given parameterized family { f θ : E → R + , θ ∈ Θ} of densities. Most of the time, the optimal θ * is found using a maximum likelihood procedure: if (X i ) i=1...N is an iid sample with common distribution f , then The only requirement on the domain E of the family { f θ , θ ∈ Θ} is to be a measure space, which of course encompasses the Riemannian manifold case, with µ = vol, the Riemannian volume.
Being able to obtain a meaningful parameterized distribution family on a general manifold is not an easy task in general. Some clues will be given at the end of this section. In special cases, some well known distributions were introduced. Some of them will be presented now.

Directional Statistics
Directional statistics [5] deals with inference on unit vectors samples and introduces ad hoc distributions. Since unit vectors can be seen as points on the unit sphere of the underlying vector space, it yields a basic, yet extremely useful example of parameterized families of distributions on a Riemannian manifold.
Since the unit sphere S d−1 ⊂ R d has rotational invariance, it is expected that the parameterized families ( f θ ) θ∈Θ exhibit the same behavior, i.e., Please note that, to be able to write such a covariance property, it is required to have an action of the group SO(d) on Θ. The case Θ = S d−1 is, up to our knowledge, the only one that has been considered by the directional statistics community. A common choice is the von Mises-Fisher (vMF) distribution on S d−1 , denoted M(µ, κ), given by the density [6] f where is a normalization constant with I r (λ) denoting the modified Bessel function of the first kind at order r. The vMF density is unimodal, parameterized by the mean µ and the concentration parameter κ > 0 that controls the dispersion of the distribution around the mean. The limiting, degenerate case κ = 0 yields the uniform distribution on the sphere.
When the expectations of the projections on a fixed orthonormal basis (e 1 , . . . , e d ) are given, it is a maximum entropy distribution. This fact can be easily seen by writing the associated variational problem with linear constraints: with σ the solid angle measure on the sphere. Using Lagrange multipliers λ 1 , . . . λ d for the first d constraints and c for the last one yields a general form for the solution: with λ = (λ 1 , . . . , λ d ). The c is a normalization constant. The remaining ones can be interpreted as mean parameters by normalization, provided λ = 0: The vMF density extends readily to the Stiefel manifold O(d, p) of p-dimensional orthonormal families in R d using the same maximum entropy approach, but using projections on the elementary matrices of dimension d × p. The general form of the distribution is then: As in the spherical vMF case, M cannot be interpreted directly as a mean on O(d, p) and some kind of normalization is needed. In order to better understand the behavior of M, it is useful to use its singular values decomposition (SVD) decomposition [7]. Let M = UΣV t with U ∈ M d,d , V ∈ M p,p , Σ ∈ M d,p and U, V orthogonal matrices. Then: Let C j , j = 1 . . . p, be the columns of the d × p matrix XV. It comes: where σ j is the j-th diagonal element of Σ and U j is the j-th row of U. The net result is thus a product of vMF, after a change of basis given by the matrix V. A rank deficiency in the matrix M indicates a uniform distribution on a subspace, much like in the standard vMF case with κ = 0. In the matrix case, the limiting case can occur on full subspaces. A further extension to the Grassmann manifold can be done by quotienting out the density with respect to the O(p) group action [8].
A generalization of maximum entropy distributions with moment constraints to Riemannian manifolds M can be found in [9], where an analogous of the normal law is obtained. The constraints chosen are, in normal coordinates around the mean value: where Σ is a fixed symmetric, positive definite matrix. The resulting density is parameterized by a mean µ and concentration matrix Γ and is expressed as: As mentioned in [9], the distribution may not be differentiable or even continuous on the cut locus. Finally, a different construction [10,11] yields a directional distribution on the hyperbolic d-dimensional space. Only the approach of [11] will be detailed here, as it introduces another way to obtain distributions on manifolds using exit points of a Brownian motion with drift. The general idea underlying this approach is to use a submanifold of a well-known model manifold on which the Brownian motion with drift can be constructed. Starting at a fixed origin, a Brownian motion path will intersect the submanifold for the first time at a point, called the exit point. The distribution of the exit points will yield a generalized directional distribution. The original motivation of this construction comes from the exit distribution of a Brownian motion with drift starting at the origin on the unit circle in R 2 . The resulting density turns out to be exactly the vMF.
In the hyperbolic space, the Brownian motion is a diffusion with infinitesimal generator: where all the coordinates are given in the half-space model of H d : It is convenient to represent the half-space model of H 2 in C, with z = x + iy, y > 0. The two-dimensional hyperboloid embedded in R 3 associated with H 2 is given by: It admits hyperbolic coordinates: that transforms to the unit disk model as: where θ and r are the angular and radius coordinates. Finally, using a complex representation z = u + iv and the Möbius mapping z → i(1 − z)/(1 + z), it comes the expression of the half-plane coordinates: The hyperbolic von Mises distribution is then defined, for a given r > 0, as the density of the first exit on the circle of center i and radius r of the hyperbolic Brownian motion starting at i. Its expression is given in [11] (Section 2.2, Propostion 2) as: where P 0 −ν is the Legendre function of the first kind with parameters 0, −ν, which acts as a normalizing constant to get a true probability density. The parameter ν is similar to the concentration used in the classical von Mises distribution.

Gaussian-Like Distributions
The maximum entropy distributions introduced above are not the only possible choice for probabilities on manifolds. Another approach may be to mimic the multivariate normal density using the geodesic distance on the manifold. For the space of symmetric positive definite matrices, one can refer to [12], and to [13] for the general case of symmetric spaces.
Let M be a symmetric space [14]. The Gaussian-like density on M with mean µ and variance σ 2 > 0 is given by: where the normalizing constant Z is independent from µ. This last fact is one of the motivations to use the above definition: the density computation does not require anything more than the geodesic distance evaluation. The basic facts about Gaussian-like distributions on symmetric spaces are given below.

Definition 1.
A Riemannian symmetric space M is a complete Riemannian manifold on which geodesic symmetries exist everywhere and are isometric.
The above geometric definition fits within a Lie theoretic construction. For the details on such objects, the reader can refer to [15] or for a more complete reference to [16].

Definition 2. A Riemannian symmetric space M is diffeomorphic to a quotient G/H where G is a Lie connected group and H is a compact Lie subgroup.
The Lie group view enables the use of integral formulas [17] (pp. 203-209).

Proposition 1.
Let M be a symmetric space of non-compact type, diffeomorphic to G/K and let g = h + a + n be the Iwasawa decomposition of the Lie algebra g. For any f ∈ L 1 (M): with da the Lebesgue measure on a, dh the normalized Haar measure on H and: where the product is taken over the set of positive roots Σ + and m α is the dimension of the root space at α. o is a reference point.
Applying the previous formula to the mapping: with an origin at µ yields: with B the Killing form. Since the Haar measure dh is normalized and the integrand does not depend on h, it comes: thus proving that the normalizing constant of the Gaussian-like distribution is independent from µ. Ref. [13] also proposes a maximum-likelihood estimator (MLE) suitable for the estimation of the parameters µ, σ.

Wrapped Distributions
Among numerous properties, the Gaussian densities in R d are known to be solutions of the heat kernel. When the manifold of interest M is obtained as a quotient of a model space H by a discrete group, which is the case for example with Riemann surfaces, the heat kernel on M can be obtained by wrapping the one on H along the orbits of the group action.
The most basic distribution arising that way is the so-called wrapped Gaussian density on the unit circle in R 2 . It is defined as: and clearly exhibits a period 2π. The parameter σ controls the concentration. It is worth noticing that f wg (θ; σ) is in fact the heat kernel on the circle. Evaluating the density involves finding the sum of a convergent series, which may be costly when the computation is done numerically. In the case of the wrapped Gaussian density, the very fast decay at infinity of the usual normal distribution limits the number of terms to be taken into account.
In [18], the heat kernel of simply connected Riemann surfaces is given by wrapping one of the only three possible model spaces: the Euclidean plane, the hyperbolic plane and the sphere. The respective heat kernels are: where the expression for the hyperbolic plane comes from [19] (p. 360). The distances d(·, ·) are the geodesic distances.
Theorem 1. Let M be a Riemann surface, U its universal cover and G its covering group. Let K U be the heat kernel on U. Then, the heat kernel on M is obtained by wrapping K U along the orbits of the covering group action: wherex,ỹ are fixed pre-images of respectively x, y for the covering map.
The proof can be found in [18] (pp. 7-8). In principle, Theorem 1 yields a density similar to a directional one, but on a more general class of manifolds. Unfortunately, while a closed-form solution for the kernel K U is known, and is one of equations (19)- (21), the wrapped kernel is generally only computable numerically, after truncation to a finite number of terms in the sum.
In the case of surfaces with covering space H 2 , it is possible to obtain a more convenient description. In this case, the genus g of the surface is strictly larger than 1, and the fundamental region in H 2 is a hyperbolic polygon with 4g sides. For any g ∈ G, its length is defined to be l(g) = inf x d(x, gx), or using the conjugacy class of g: l(g) = inf x d(x, kgk −1 ) where k runs over G. Elements of G with non zero length are conjugate to hyperbolic elements in SL(2, R) and are thus conjugate to a scaling x → λ 2 x. Furthermore, a conjugacy class represents a free homotopy class of closed curves, which contains a unique minimal geodesic whose length is l(g), where g is a representative element. For p a primitive element, let G p denote its centralizer in G. The conjugacy classes in G are all of the form gp n g −1 , g ∈ G/G p with p primitive and n ∈ Z. The wrapped kernel can then be rewritten as: where p runs through the primitive elements of G. It indicates that the kernel K M can be understood as a sum of elementary wrapped kernels associated to primitive elements, namely thosek p defined by: with p primitive. Finally, p being hyperbolic, it is conjugate to a scaling, so it is enough to consider kernels of the form:k with λ > 1 a real number. To each primitive element p, a simple closed minimal geodesic loop is associated, which projects onto the axis of the hyperbolic transformation p. In the Poincaré half-plane model, such a loop unwraps onto the segment of the imaginary axis that lies between i and iλ 2 . It is easily seen that the action of the elements p n , n ∈ Z will give rise to a tiling of the positive imaginary axis with segments of the form [λ 2n , λ 2(n+1) [. This representation allows a simple interpretation of the elementary wrapped kernelsk p , where the wrapping is understood as a winding. Once again, the computational cost involved in the summation may be high. An approximation of the true wrapped kernel is given in [20]. It is similar to the vMF that approximates the wrapped Gaussian density in the circular case.

Exponential Families Arising from Group Actions
In many physical systems, some quantities must be invariant under the action of a group. This is a consequence of the celebrated Noether theorem [21]. Looking at little bit deeper at this theorem reveals the importance of a mapping, called the momentum map, that turns out to be constant under the evolution of the system. Turning back to densities but keeping the physical framework in mind, it seems natural to find families with a maximum entropy property, with constraints based on the momentum map. This approach and its relationship with information geometry has been thoroughly studied in [22] and will be presented later.
Another view at the same problem is to start from natural exponential families and try to impose group invariance [23]. Let E be a finite dimensional vector space and let E * be its dual.

Definition 3.
The set E is the subset of positive Radon measures µ on E such that: • µ is not concentrated on a proper affine subspace of E.
• The set of the θ ∈ E * such that: has non empty interior, hereafter denoted Θ(µ).
Any measure in E gives rise to a natural exponential family.

Definition 4.
Let µ ∈ E . The natural exponential family with base measure µ is the parameterized family Given a group G acting on E, a natural exponential family P θ , θ ∈ Θ(µ) with base µ is said to be invariant if for any g ∈ G and any p θ , g.p θ = p θ for a θ in Θ(µ). In the original work, groups of affinities were considered, namely: g.
where A g is the linear part of the affinity and v g the translation part. The main theorem characterizing the group action invariance for a given natural exponential family is [23]: Theorem 2. Let P θ , θ ∈ Θ(µ) be a natural exponential family and G a group of affinities. The family P θ is invariant under the action of G iff it exist an application a : G → E * and an application b : G → R such that: When G is a Lie group, a, b are differentiable mappings. With a group theoretic view, a is a cocycle for the action g : x ∈ E * → g · x = t g −1 · x. Theorem 2 was given in [24] and improved in [23] (p. 4). As an example of use, natural exponential families on R d are characterized by the next theorem. • ν is a measure on [0, +∞], with ν([0, 1]) < +∞ and it exists k > 0 such that: [1,+∞] x −(d−1)/2 exp(kx)ν(dx) < +∞, is the polar coordinates mapping: (r, u) → ru.
Going back to the mechanical formalism and the momentum map, invariant exponential families can be put into a wider framework. The underlying object is a symplectic manifold (M, ω) where ω is a closed, non degenerate two-form on M. Let G be a connected Lie group acting on M.
Definition 5. The action of G on M is said to be symplectic if for all g ∈ G, g · ω = ω.
The group action defines canonical vector fields on TM. Definition 6. Let ξ ∈ g. The vector field X ξ is defined by: where e denotes the identity of G.
The vector field X ξ can be interpreted as the infinitesimal action of g on the points of M.

Definition 7.
A mapping U : M → g * is said to be a momentum map for the G-action if for all ξ ∈ g: where α ξ is the 0-form defined as: ∀x ∈ M, α ξ (x) = U(x), ξ .
As noticed by Souriau [25], the momentum map allows a definition of exponential densities. For a symplectic manifold, the Liouville form is the canonical one and will be denoted vol as in the Riemannian case. It is invariant by symplectomorphisms, which is a key ingredient in the definition of exponential families on M.
Definition 9. Let (M, ω) be a symplectic manifold. Let G be a connected Lie group acting on M with momentum map U. If the set of ξ ∈ g such that: has non empty interior, hereafter denoted by Ξ, the exponential family associated with the group action is defined to be: The momentum map may be used to define the analog of the usual moments and will in turn allow constraints definition in a maximum entropy approach.

Definition 10.
With the hypothesis of Definition 9, the nth moment of a probability density f on M is defined as: Following [25], the exponential distributions are maximal entropy ones. Assuming that the first and second moments are defined for the exponential family, the next proposition holds.

Proposition 3.
Under the assumptions of Definition 9, the exponential distributions are the one with the largest entropy under the constraint E n = K, with K a fixed vector in g * .
The Souriau approach to invariant Gibbs measure has the obvious advantage of being intrinsic and adapted to a given group action. The parameters of the exponential families are elements of the Lie algebra and must be understood as a general way of fixing a generalized location (please note that it encompasses 'scale' parameters also). It requires a symplectic base manifold, that is very natural in mechanics, but may be a little bit tricky to obtain in a more general setting.

The Euclidean Case
The intuitive idea behind the projection approach to density estimation on measure set (E, F , µ) is to use an orthonormal Hilbert basis (φ n ) n∈N of the space L 2 (E, µ) to construct an approximation of the unknown density f ∈ L 2 (E, µ) from its projections. Namely, where E f denotes expectation taken with respect to the density f . The reconstruction formula in L 2 (E, µ) then reads as: To turn the expansion into a density estimator, it is necessary to have an estimator of the coefficients α n . Furthermore, in applications, the series (29) has to be truncated to a finite number of terms. It is thus advisable to have a fast decay of the expansion coefficients when n goes to infinity. For the first point, an empirical estimator of the expectation is generally used. Assuming an iid sample (X i ) i=1...N , the n-th projection is estimated as: and the density estimator isf Taking the expectation shows that the estimator is unbiased. It is worth to notice that the projection method does not use more than the measure space structure and is thus easy to use on many spaces, including manifolds. It is also very fast to evaluate provided the expansion functions are known and can be computed easily. It nevertheless suffers from two flaws:

•
The estimated densityf N is not necessarily non-negative as the expansion functions are generally not.

•
Depending on the underlying measure space, a countable Hilbert basis may not exist and, even if this holds, the expansion functions may not be expressed in a closed form.
Most of the usual Hilbert spaces are countable. However, the Besocovitch space B 2 of almost periodic functions [26] is a classical example for which an uncountable Hilbert basis exists.

The Riemannian Case
When the underlying measure space is a compact Riemannian manifold (M, g) of dimension d, equipped with its volume form, the Laplace-Beltrami operator ∆ naturally gives rise to a suitable Hilbert basis of L 2 (M, vol), which we will denote by L 2 (M) for short. Indeed, there exists a sequence (λ n , φ n ) n∈N such that for all n ∈ N, ∆φ n = λ n φ n , and the φ n 's form a Hilbert basis of L 2 (M). The seminal work [27] details the theory of non-parametric projection based estimation in this case. Unfortunately, only in a few cases are the eigenfunctions of ∆ known, limiting the applicability of the method. In the sequel, the estimatorf N with expansion truncated to the Q lowest terms will be denoted asf N Q . Two main theorems describe the behavior of the estimator.
The second theorem gives a L ∞ rate: The proofs are quite technical and the interested reader may refer to [27] for the details. When the eigenfunctions of the Laplacian are known, the method is quite effective, since the evaluation of the estimated density at a point does not depend on the sample size. In most cases, however, a closed form for the eigenfunctions is not available, thus limiting the practical usability of this estimator.
A possible workaround is to estimate the true eigenfunctions and eigenvalues from an approximate discrete problem. In the approach of [28], a weighted graph is constructed from a net of points on the manifold M, with weights given by a function of the geodesic distance between vertices. For the graph, extracting the eigenfunctions and eigenvalues boils down to a standard linear algebra problem and can thus be solved efficiently. The result of the procedure is a finite set of eigenvectors, which represent discrete measures on the manifold. The projection estimator in such a case yields a quantization (i.e., a discrete approximation) of the estimated measure. Going back to a density can be done using a smoothing procedure, or simply by turning the discrete approximation to a piecewise constant one. In both cases, evaluation at point requires the computation of the geodesic distance to all the samples, thus making the overall procedure far less efficient.
Finally, it is worth mentioning that Laplacian eigenfunctions and representations are closely related when the underlying space is a Lie group. The reader may refer to [29] for the special case SO(d).

Non-Parametric Kernel Estimation
Non-parametric estimation of densities is performed using a sum of elementary bell-shaped functions known as kernels most of the time. It was introduced in the 1960s by Parzen in its seminal work [30] and Rosenblatt [31].

The Euclidean Case
Assuming that the unknown probability density f is univariate, defined on R, it can be estimated using an iid sample X i , i = 1 . . . N as: where K : R → R + is a symmetric kernel, i.e., a measurable mapping verifying K(x) = K(−x) and integrating to 1, The parameter h is a strictly positive real number called the bandwidth of the estimator. It controls the degree of smoothing, and has to be tuned to get the best compromise between smoothness and accuracy. Kernel density estimators are biased. Their bias can be controlled in the following way.
Proof. Let X be a random variable with law the common law of the sample. Taking the expectation of f gives: comes. Then, since the kernel integrates to 1, hence proving the claim.
When the density is Lipschitz and the kernel satisfies R K(u)|u|du < +∞, the bound can be improved to: where C is the Lispchitz constant. As expected, the bias vanishes as h goes to 0 but is non zero when h > 0. The variance of the estimator can be computed pretty much the same way. Since the X i 's are independent, Thus, with the above Lipschitz condition, It then appears that the upper bound of the variance off goes to infinity as h goes to 0 due to the term This fact is an expression of the bias-variance dilemma.
When the density f is of class C 2 , the bias can be expressed more conveniently as: Please note that V K is the variance of the kernel, considered as a probability distribution. It is further assumed in the sequel that the kernel is square summable. The following holds: so that the variance of the kernel estimator is The mean square error (MSE) of the estimatorf is the sum of the variance and the squared bias: (37) The asymptotic MSE is thus: There exists an optimal value of h, minimizing the previous expression, This relation is very classical in density estimation and yields a pointwise convergence rate in o(n −4/5 ), slower than the usual o(n −1 ) in the parametric case. In the multivariate case, two approaches are of common use. In the first one, the multivariate kernel in R d is just an d-fold tensor product of univariate kernels: The tensor product kernel integrates to 1 by Fubini's theorem; and the density estimator writes as: where x ∈ R d . Apart from the slower convergence rates, things are similar to the univariate case. Another option is to let the kernel depend on the norm of the difference between x and the X i . In such a case, starting again with a univariate kernel K, the density estimator is: where C is a normalizing constant, such that: In this framework, the multivariate kernel pertains to the class of radial basis functions, which has been thoroughly studied in approximation theory [32].

The Riemannian Case
Extending the previous derivations to manifolds is not direct, since several problems must be addressed. In the case of Riemannian manifolds, the geodesic distance can be used in place of the Euclidean norm, thus making the radial basis kernels more natural than the tensor product ones. A density estimator based on that is due to Pelletier [33]. His approach is briefly described in the sequel.
Let (M, g) be an orientable Riemannian manifold of dimension d, equipped with its volume form vol, and let K : R + → R + be a mapping such that: The main problem is to define what a radial basis function is in the context of manifolds. A similar question was addressed in [34] for the definition of the Laplacian in radial coordinates. The idea is to use the exponential mapping exp p at a fixed point p ∈ M in a ball centered at the origin in T p M and of radius less than the injectivity radius r > 0 at p and to compose K with the distance function to p to get an equivalent of a radial basis function in R d . However, the volume form is not invariant under the action of the exponential map unless the manifold is flat. In order to keep the integral of the kernel equal to 1, a multiplicative corrective term has to be added.
By abuse of notations, we will also write θ p (x) = θ p (log p x).
The above definition with varying p extends readily to a mapping θ : TM → R. The exponential map exp p : T p M → M induces by pullback a volume form exp * p vol on T p M, and it is worth noticing that θ p is its density with respect to the Lebesgue measure of the Euclidean structure on T p M. That is, in normal coordinates, Definition 12. Let K be a kernel function in the sense of equation (41) and R > 0 be the injectivity radius of M. The radial kernel at p ∈ M with bandwidth 0 < r < R is defined as the mapping K p,r : Since K has a support in [0, 1] by assumption, the above expression will vanish for x / ∈ B(p, r), where B(p, r) is the geodesic ball of center p and radius r. We will thus never have to bother about a possible vanishing of θ. Proposition 5. The radial kernel satisfies: Proof. In normal coordinates at p and from (42), the following comes: The next proposition in [33] shows that the kernel K p,r is centered at p in a probabilistic sense.

Proposition 6.
Let p ∈ M and δ be the supremum of the sectional curvatures of M. Let µ be a probability measure, absolutely continuous with respect to the measure vol, and admitting K p,r as density. If r < inj p (M)/2 and, when δ > 0, also r < π 4 √ δ then p is the unique minimizer of the function The proof can be found in [33] (Propostion 2.2, p. 5) and relies on a computation of the gradient of E and the convexity of a geodesic ball of radius r satisfying the above conditions. Based on the above results, it is natural to choose the following definition for the Riemannian kernel estimator. Definition 13. Let X i , i = 1 . . . N, be an iid sample on the manifold M with common density function f . Let 0 < r < inj(M). The kernel density estimator on N points of f with kernel K and bandwidth r is: Using Propostion 5, it is easy to see thatf n,K,r is a probability density on M. The estimatorf n,K,r is consistent and behaves roughly as in the Euclidean case.
Theorem 6. Let f be a C 2 probability density on M with bounded first and second derivative. If 0 < r < inj(M), then there exists a constant C f such that: The proof that will be given here differs slightly from the one given in [33], and relies on the next lemma. Lemma 1. Let γ : [0, 1] → M be a smooth curve of M and let f : M → R be of class C k+1 . Letting p = γ(0), u = γ (0), the Taylor expansion at order k of f along γ with integral remainder at p is given by: where ∇ is an affine connection and ∇ u f = u( f ) = d f p · u.
Proof. The mapping α = f • γ is defined from [0, 1] to R and is of class C k+1 . It thus admits a Taylor expansion with integral remainder For a smooth function φ : where C f is a constant depending on f and (γ) is the length of γ.
The term ∇ k γ (y) d f is k-linear in γ and involves derivatives of f up to order k + 1 along γ. Since f is of class C k+1 by assumption, it comes Since γ is a geodesic, γ = d, where d is the length of γ. The claim follows by integration.
Proof of Theorem 6. The proof is essentially an adaptation of the Euclidean case. The bias at x ∈ M is given by and since the kernel integrates to 1, Using normal coordinates at x, it comes Using lemma 1, f (exp x (u)) − f (x) = ∇ u f (0) + R f (u). In coordinates, u = u i ∂ i and by linearity of the connexion ∇ and symmetry assumption on the kernel Since the first and second derivatives of f are assumed to be bounded on M, it exists, by Lemma 2, a When the manifold is of finite volume, then the integral of the squared bias over M is bounded by: The case of unbounded volume is still tractable provided the first and second derivatives of f are square-integrable over M. The computation of the variance is very close to what has been done in the Euclidean case. Since the kernel has vanishing first moment, the only term remaining in the variance is the integral of the squared kernel, which equals In normal coordinates around p, In contrast with the bias computation, the θ p does not cancel. If (p, x) → θ p (p, x) is bounded below by a constant L > 0 and using the fact that K is bounded above by K(0), the variance is bounded above by Integrating the variance over M yields M var(p)dvol ≤ K 2 (0) Lr d vol(B(0, 1)).
Summing the variance and the squared bias completes the proof.
It is worth noticing that, while the final result is essentially the same as in the Euclidean case, assumptions are somewhat stronger: the kernel used must be of compact support and the θ function has to be bounded below. Furthermore, bandwidth has to be small enough to ensure that the exponential mapping is one to one.

Computing the Kernel in the Riemannian Case
A important feature of the non-parametric kernel estimation in the Euclidean case is the ease of computation. Estimating the density at a given point is done easily using inner products and function evaluations (polynomials in most cases). Furthermore, when the kernel is compactly supported, it is quite simple to avoid summing vanishing terms: a k-d tree [35] structure can be used to allow a quick distance evaluation, thus excluding points that will not contribute to the estimator.
In the Riemannian case, evaluation the kernel K r,p at a point x requires much more computation. First of all, the geodesic distance from x to p must be found, along with the θ p function. If a shooting algorithm is used for approximating d(x, p), the derivative of the exponential mapping, needed for θ p , is generally also computed: the overall cost is thus not really different from the geodesic distance evaluation alone. Nevertheless, unless a closed form for the geodesics is known, the computational cost associated with the process is much higher than with Euclidean data.
In some special cases, however, the θ p function can be obtained in a closed form. It is the case for symmetric spaces, when the roots system of the base Lie group is known: in such a case, the integral formulas in [17] yields directly θ p .

Discrete Density Estimation through Quantization
Arguably the simplest form of estimate, one can choose for an unknown probability measure µ is a discrete densityμ with finite support, i.e., a linear combination of Dirac distributionŝ with weights w i , i = 1, . . . , n that sum up to 1. Finding such an approximation is the purpose of quantization. The theory was originally developed for signal compression purposes in the middle of the 20th century, in order to find appropriate ways to discretize a signal. Quantization was founded for probability distributions on Euclidean spaces, but its generalization to Riemannian manifolds presents no particular difficulty (with the necessary assumptions), and so we will present it directly in that more general setting. For further details, we refer the reader to [2] or the survey paper [36].

Optimal Quantization
Let µ be a probability measure with compact support on a Riemannian manifold (M, g). We assume that M is complete, i.e., that the exponential map at x is defined on the whole tangent space T x M. Then, by the Hopf-Rinow theorem, M is also geodesically complete, that is, any two points x, y ∈ M can be joined by a geodesic of shortest length and the geodesic distance is everywhere well defined. Optimal quantization addresses the problem of approximating a random variable X with distribution µ by a quantized version q(X) where q has an image Γ of cardinal at most n. More precisely, defining Q n = {q : M → Γ ⊂ M measurable, |Γ| ≤ n}, the optimal quantization problem is to find q ∈ Q that minimizes the L p distance between X and q(X), with the following error q * = argmin A solution to the above minimization problem is called an optimal n-quantizer, and the minimum error is denoted by e n,p (µ) = inf

Proposition 7.
The search for optimal n-quantizers can be limited to nearest-neighbor projections, i.e., where q Γ : M → Γ = {a 1 , . . . , a n } is given by and C i (Γ) denotes the i th Voronoi cell associated with Γ, Proof. Any n-quantizer q of image Γ ⊂ M verifies for all x ∈ M, d(x, q(x)) ≥ inf a∈Γ d(x, a), with equality if and only if q(x) = argmin a∈Γ d(x, a). Therefore, the optimal quantizer is the projection to the nearest neighbor of Γ. Moreover, if |Γ| < n and | supp µ| ≥ n, one easily checks that q can always be improved, in the sense of criteria (46), by adding an element to its image. This means that an optimal n-quantizer has an image of exactly |Γ| = n points, and is of the given form.
The optimal approximation of X is then given by the imageX = q Γ (X), while the optimal approximation of its distribution µ is given by the pushforward where the atoms a 1 , . . . , a n are chosen to minimize the distorsion function, simply obtained by evaluating the cost function of (46) at q = q Γ , F n,p (a 1 , . . . , a n ) = E µ min Notice that if we seek to approximate µ by a single point a ∈ M (i.e., n = 1) with respect to an L 2 criteria (p = 2), we retrieve the definition of the Riemannian center of mass, also called the Fréchet mean [37].
It is worth noting that the optimal quantization problem coincides with the optimal transport problem of approximating µ by the closest discrete measure with at most n supporting points, with respect to the L p -Wasserstein distance. Here, the infimum is taken over all measures P on M × M with marginals µ and ν.
The proof in the Euclidean case can be found in [2], and it applies verbatim to measures on manifolds. We restitute it here for the sake of completeness.
An important question that arises now is the existence of a minimizer of (48). The proof of the following claim can be found in [38].

Proposition 9.
Let M be a complete Riemannian manifold and µ a probability distribution on M with density and a compact support. Then, the distorsion function F n,p is continuous and admits a minimizer.
The minimizer α = (a 1 , . . . , a n ), referred to as optimal n-centers, is in general not unique: any symmetry of µ, if it exists, will transform a minimizer into another minimizer of F n,p . For example, any rotation of the optimal n-centers of the uniform distribution on the sphere conserves optimality.
The second question that comes naturally is: how does the error e n,p (µ) one makes by approximating µ by (47) evolve when the number n of points grows? In the vector case, Zador's theorem [2] (Theorem 6.2) tells us that it decreases to zero as n −p/d , and that the limit of n p/d e n,p (µ) is proportional to the p th quantization coefficient, i.e., the limit (which is also an infimum) when µ is the uniform distribution on the unit square of R d Moreover, when µ is absolutely continuous with density h, the asymptotic empirical distribution of the optimal n-centers is proportional to h d/(d+p) .
In the case of a Riemannian manifold M, the moment condition of the flat case generalizes to a condition involving the curvature of M. The following term measures the maximal variation of the exponential map at x ∈ M when restricted to a (d − 1)-dimensional sphere S ρ ⊂ T x M of radius ρ The following generalization of Zador's theorem to Riemannian quantization was proposed by Iacobelli [39] (Theorem 1.4 and Corollary 1.5) .

Theorem 7.
Let M be a complete Riemannian manifold without boundary, and let µ = h dvol + µ s be a probability measure on M, where dvol denotes the Riemannian volume form and µ s the singular part of µ. Assume there exist x 0 ∈ M and δ > 0 such that Then, lim n→∞ n p/d e n,p (µ) where · r denotes the L r -norm. In addition, if µ s = 0 and (a 1 , . . . , a n ) are optimal n-centers, then where D → denotes convergence in distribution and λ is the appropriate normalizing constant.

A Numerical Scheme
In practice, to compute the optimal n-centers α = (a 1 , . . . , a n ) from potentially large, manifold-valued datasets, one can search for the critical points of the distorsion function. Assume that the only knowledge that we have of the probability measure µ that we want to approximate is through an online sequence of i.i.d. observations X 1 , X 2 , . . . sampled from µ. A classical algorithm used for quadratic (p = 2) vector quantization, and easily generalized to the Riemannian setting, is the Competitive Learning Vector Quantization algorithm, a stochastic gradient descent method based on the differentiability of the distorsion function F n,2 .
Proposition 10. Let α = (a 1 , . . . , a n ) ∈ M n be an n-tuple of pairwise distinct components and p > 1. Then, F n,p is differentiable and its gradient in α is whereC i (α) is the interior of the i th Voronoi cell of α and −→ xy := exp −1 x (y) denotes the vector that sends x on y through the exponential map. In particular, the gradient of the quadratic distorsion function is given by (50) The proof can be found in [38]. Notice that optimal n-centers are Riemannian centers of mass of their Voronoi cells, in the sense of (49). More generally, for any value of p, each a i , i = 1, . . . , n, is the p-mean of its Voronoi cell, i.e., the minimizer of a → Therefore, the optimal n-centers are always contained in the compact support of µ. Notice also that the opposite direction of the gradient is, on average, given by the vectors inside the expectation. Competitive learning quantization consists in following this direction at each step k, that is, updating only the center a i corresponding to the Voronoi cell of the new observation X k , in the direction of that new observation. In the Riemannian setting, instead of moving along straight lines, we simply follow geodesics using the exponential map. This gives a convergent algorithm, as shown in [38], which is particularly adapted to large datasets as it is online, i.e., it processes one data point at a time. Moreover, unlike kernel based methods, it requires few distance computations: of order n × N instead of N 2 if N is the size of the dataset and n N the size of the summary. Finding the right number of centers may be a difficult question when there is no a priori knowledge on the distribution to be approximated. In practice, it is mainly a trial and error procedure, unless the problem itself allows an initial guess of the value. An alternative approach is to use the fact that any quantization defines a clustering by Voronoï cells: the quality of the later can be used to assess the performance of the former. Quite a lot of standard indicators exist for that purpose [40]. Just to mention a simple one, the Silhouette [41] is easy to compute and does not require the knowledge of the membership labels.

Some Open Problems
Density estimation on manifolds is still an open area of research and as such has several questions that are not yet been answered. Some of them are given below, which may serve as a starting basis for further research.

Parametric Estimation and Symplectic Structure
A non-trivial aspect of the parametric estimation on manifold is that the parameter space is generally different from the base manifold. In fact, it is already the case for location-scale models in R d , since the scale parameter is an element of R + , but the underlying abelian group structure makes the location part an element of R d , which is also the base space. If one tries to extend the concept of location parameters to manifolds, two approaches can be used: • Use local coordinates as parameters, and mimic the vector case as in [9]. • Replace the abelian group underlying R d by a Lie group acting on the base manifold.
In view of the general results mentioned in Section 3.4 for exponential families defined on symplectic manifolds, the second setting is the most natural one, but is not as general as the first one. However, given a Riemannian manifold M, its cotangent bundle TM * has a canonical symplectic structure, obtained from the so-called tautological one form [42] (pp. 9-14).
The tautological one-form gives rise to a symplectic form: Proposition 11. The two-form ω = −dα provides TM * with a symplectic struture. In local coordinates (x , . . . x d , ξ 1 , . . . , ξ d ), it reads as: Given the symplectic structure on TM * , one can derive exponential families provided a group action by symplectomorphisms exists. This construction may be a starting point for a quite general definition of parameterized families on manifolds, by moving from the base manifold to its cotangent bundle. It is worth noticing that location models may also be constructed using generating functions [42].

Manifolds with Boundaries
All the density estimators presented before where defined on manifolds without boundaries. However, in several settings, boundaries arise naturally as limiting cases: as an example, the space of symmetric positive semi-definite matrices of dimension d × d has stratified boundaries, namely the manifolds of matrices of rank 0 ≤ p < d. Densities localized on the boundaries are degenerate versions of the one defined in the interior, but must be taken into account as they carry a non-vanishing mass. Apart from the projection estimator that fits in the manifolds with boundaries framework, all the other methods must be adapted. In particular, distributions defined on matrix spaces must be parameterized in such a way that rank deficiency is allowed in the parameter space. For manifolds with corners [43], this is easily obtained from the particular structure of the local charts that map open subsets of the manifold to open subsets of [0, +∞[×R d−k , 0 ≤ k ≤ d and it turns out that all examples of matrix manifolds fall within this frame.
Extending parameterized density estimators or kernel estimators to manifolds with corners will have many practical applications, especially when dealing with matrix statistics. The same applies to optimal quantization, where dirac densities located on the boundaries must be added to the initial model.

Constrained Quantization
In the Riemannian quantization problem, an approximate distribution in the form ∑ n i=1 µ i δ a i , with a 1 , . . . a n the optimal centers located on the base manifold M and µ 1 , . . . , µ n be positive real numbers summing to 1. While such a representation is very natural both from a optimal approximation and clustering point of view, some specific applications require putting additional constraints on the weights µ i , i = 1 . . . n [44]. A common one is to impose that they are all equal, which in a clustering application means that all classes will have an equal expected number of members. The stochastic gradient algorithms presented above is no longer valid for dealing with this problem and has to be adapted. It is still an open question to find a suitable procedure.

Conclusions
Probability densities approximation and estimation on Riemannian manifolds are topics receiving an increasing interest in the statistics community. In many applications, data are living on non-Euclidean spaces and adapted procedures must be designed.
In Euclidean spaces, parametric and non-parametric estimation procedures have been intensively studied and are well understood. In the Riemannian manifold setting, several non-equivalent extensions are possible, making the task to find the right one quite tricky. The most obvious way of dealing with manifold valued data is to use a local linearization, like the use of normal coordinates. When applied to densities, it may be used to derive maximum entropy distributions with fixed moments, but some care must be taken with the cut locus, which may be charged by the defined distribution. Other approaches rely on a direct use of the geodesic distance, both in the parametric and non-parametric estimation framework. The computational cost associated with this operation may be high, as a differential problem with boundary conditions has to be solved. When dealing with kernel estimation, Jacobi fields must also be evaluated. While it involves only a classical Ordinary differential equation (ODE). integration, it is an increase in the overall complexity of the procedure.
Parametric estimation using exponential families based on group action invariance are theoretically appealing and makes use of the underlying information on the data. This a especially important when data is highly structured. A closed form momentum map is nevertheless a prerequisite to derive an efficient implementation.
Finally, projection based estimation is in principle very efficient, provided one can obtain the eigenfunctions of the Laplace-Beltrami operator in a closed form, or at least efficiently approximate them. In low dimension, numerical schemes may be used for that purpose, but do not scale well.
Classical directional densities and their generalizations are numerically appealing and offer a sound framework for some manifolds. They are designed on an ad hoc basis, and may not be adapted to all cases. Most of them are maximal entropy based and often exhibit a group invariance. An interesting question is whether approximate directional densities can be found for a wrapped distribution arising from a model heat kernel.
As a general conclusion, extending the usual distributions to general manifolds is by no way an elementary procedure. Furthermore, where in the Euclidean case some distributions satisfy many equivalent defining properties, it will not be the case for manifolds. Maximum entropy is generally a good criterion, provided the fixed moments are defined in a simple and natural way. Finally, an often overlooked issue with densities defined on Riemannian manifolds is the associated computational cost when closed forms are unknown. Since all distance computations require solving a boundary value problem, the complexity of the manifold algorithms may be some order of magnitudes higher than their Euclidean counterparts. This also limits the practical dimension of the problems.