Riemannian Laplace Distribution on the Space of Symmetric Positive Definite Matrices

The Riemannian geometry of the space P m of m × m symmetric positive definite matrices has provided effective tools to the fields of medical imaging, computer vision and radar signal processing. Still, an open challenge remains: extending these tools to correctly handle the presence of outliers (or abnormal data), arising from excessive noise or faulty measurements. The present paper tackles this challenge by introducing new probability distributions, called Riemannian Laplace distributions, on the space P m. First, it shows that these distributions provide a statistical foundation for the concept of the Riemannian median, which offers improved robustness in dealing with outliers (in comparison to the more popular concept of the Riemannian center of mass). Second, it describes an original expectation-maximization algorithm for estimating mixtures of Riemannian Laplace distributions. This algorithm is applied to the problem of texture classification in computer vision, considered in the presence of outliers, and is shown to give significantly better performance than other recently proposed approaches.


Introduction
Data with values in the space P m of m × m symmetric positive definite matrices play an essential role in many applications, including medical imaging [1,2], computer vision [3][4][5][6][7] and radar signal processing [8,9]. In these applications, the location where a dataset is centered has a special interest. While several definitions of this location are possible, its meaning as a representative of the set should be clear. Perhaps the best-known and most widely used quantity to represent the center of a dataset is the Fréchet mean. Given a set of points Y_1, …, Y_N in P m, their Fréchet mean is defined to be:

Ŷ_N = argmin_{Y ∈ P_m} Σ_{n=1}^N d²(Y, Y_n)    (1)

where d is Rao's Riemannian distance on P m [10,11]. Statistics on general Riemannian manifolds have been powered by the development of different tools for geometric measurements and new probability distributions on manifolds [12,13]. On the manifold (P m, d), the major advances in this field have been achieved by the recent papers [14,15], which introduce the Riemannian Gaussian distribution on (P m, d). This distribution depends on two parameters Ȳ ∈ P m and σ > 0, and its density with respect to the Riemannian volume form dv(Y) of P m (see Formula (13) in Section 2) is:

p(Y | Ȳ, σ) = (1/Z_m(σ)) exp( −d²(Y, Ȳ) / 2σ² )    (2)

where Z_m(σ) is a normalizing factor depending only on σ (and not on Ȳ).
For the Gaussian distribution Equation (2), the maximum likelihood estimate (MLE) of the parameter Ȳ based on observations Y_1, …, Y_N corresponds to the mean Equation (1). In [15], a detailed study of statistical inference for this distribution was given and then applied to the classification of data in P m, showing that it yields better performance in comparison to recent approaches [2].
When a dataset contains extreme values (or outliers), because of the impact of these values on d², the mean becomes less useful. It is usually replaced with the Riemannian median:

Ŷ_N = argmin_{Y ∈ P_m} Σ_{n=1}^N d(Y, Y_n)    (3)

Definition Equation (3) corresponds to that of the median in statistics, based on ordering of the values of a sequence. However, this interpretation does not carry over to P m. In fact, the Riemannian distance on P m is not associated with any norm, and it is therefore only possible to compare distances of a set of matrices to a reference matrix.
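The robustness gain behind this replacement can already be seen in the scalar case. The following minimal NumPy sketch (illustrative only, not part of the original text) compares how far a single outlier drags the mean, which minimizes the sum of squared distances, versus the median, which minimizes the sum of distances:

```python
import numpy as np

# A tight cluster of observations, then the same cluster plus one outlier.
data = np.array([0.9, 1.0, 1.0, 1.1, 1.2])
with_outlier = np.append(data, 100.0)

# The mean minimizes the sum of squared distances and is dragged far away;
# the median minimizes the sum of distances and barely moves.
mean_shift = abs(with_outlier.mean() - data.mean())
median_shift = abs(np.median(with_outlier) - np.median(data))
print(mean_shift, median_shift)
```

The mean shifts by more than 16 while the median shifts by 0.05, which is the behavior that Equation (3) transfers to the manifold setting.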
In the presence of outliers, the Gaussian distribution on P m also loses its robustness properties. The main contribution of the present paper is to remedy this problem by introducing the Riemannian Laplace distribution, while maintaining the same one-to-one relation between the MLE and the Riemannian median. This will be shown to offer considerable improvement in dealing with outliers.
This paper is organized as follows.
Section 2 reviews the Riemannian geometry of P m, when this manifold is equipped with the Riemannian metric known as the Rao-Fisher or affine-invariant metric [10,11]. In particular, it gives analytic expressions for geodesic curves and the Riemannian distance, and recalls the invariance of Rao's distance under affine transformations.
Section 3 introduces the Laplace distribution L(Ȳ, σ) through its probability density function with respect to the volume form dv(Y). Here, σ lies in an interval (0, σ_max) with σ_max < ∞, because the normalizing constant ζ_m(σ) becomes infinite for σ ≥ σ_max. It will be shown that ζ_m(σ) depends only on σ (and not on Ȳ) for all σ < σ_max. This important fact leads to simple expressions for the MLEs of Ȳ and σ; in particular, the MLE of Ȳ based on a family of observations is their Riemannian median. Section 4 introduces mixtures of Riemannian Laplace distributions, whose density Equation (4) with respect to the volume form dv(Y) involves a number M of mixture components. A new EM (expectation-maximization) algorithm that computes maximum likelihood estimates of the mixture parameters (ϖ_µ, Ȳ_µ, σ_µ), 1 ≤ µ ≤ M, is provided. The problem of selecting the order M in Equation (4) is also discussed and solved using the Bayesian information criterion (BIC) [16].
Section 5 is an application of the previous material to the classification of data with values in P m which contain outliers (abnormal data points). Assume to be given a training sequence Y_1, …, Y_N ∈ P m. Using the EM algorithm developed in Section 4, it is possible to subdivide this sequence into disjoint classes. To classify new data points, a classification rule is proposed. The robustness of this rule lies in the fact that it is based on the distances between new observations and the respective medians of the classes, instead of the means [15]. This rule will be illustrated by an application to the problem of texture classification in computer vision. The obtained results show improved performance with respect to recent approaches which use the Riemannian Gaussian distribution [15] and the Wishart distribution [17].

Riemannian Geometry of P m
The geometry of Siegel homogeneous bounded domains, such as Kähler homogeneous manifolds, has been studied by Felix A. Berezin [18] and P. Malliavin [19]. The structure of Kähler homogeneous manifolds has been used in [20,21] to parameterize (Toeplitz-)Block-Toeplitz matrices. This led to a Hessian metric from information geometry theory, with a Kähler potential given by entropy, and to an algorithm to compute medians of (Toeplitz-)Block-Toeplitz matrices by Karcher flow on the Mostow/Berger fibration of a Siegel disk. Optimal numerical schemes of this algorithm in a Siegel disk have been studied, developed and validated in [22][23][24].
This section introduces the necessary background on the Riemannian geometry of P m, the space of symmetric positive definite matrices of size m × m. Precisely, P m is equipped with the Riemannian metric known as the affine-invariant metric. First, analytic expressions are recalled for geodesic curves and the Riemannian distance. Then, two properties are stated, which are fundamental to the following. These are the affine-invariance of the Riemannian distance and the existence and uniqueness of Riemannian medians.
The affine-invariant metric, called the Rao-Fisher metric in information geometry, has the following expression:

g_Y(A, B) = tr( Y⁻¹ A Y⁻¹ B )    (5)

where Y ∈ P_m and A, B ∈ T_Y P_m, the tangent space to P_m at Y, which is identified with the vector space of m × m symmetric matrices. The Riemannian metric Equation (5) induces a Riemannian distance on P_m as follows. The length of a smooth curve c : [0, 1] → P_m is given by:

L(c) = ∫₀¹ [ g_{c(t)}( ċ(t), ċ(t) ) ]^{1/2} dt    (6)

where ċ(t) = dc/dt. For Y, Z ∈ P_m, the Riemannian distance d(Y, Z), called Rao's distance in information geometry, is defined to be:

d(Y, Z) = inf { L(c) : c(0) = Y, c(1) = Z }    (7)

This infimum is achieved by a unique curve c = γ, called the geodesic connecting Y and Z, which has the following equation [10,25]:

γ(t) = Y^{1/2} ( Y^{-1/2} Z Y^{-1/2} )^t Y^{1/2}

Here, and throughout the following, all matrix functions (for example, square root, logarithm or power) are understood as symmetric matrix functions [26]. By definition, d(Y, Z) coincides with L(γ), which turns out to be:

d(Y, Z) = [ tr ( log( Y^{-1/2} Z Y^{-1/2} ) )² ]^{1/2}    (8)

Equipped with the affine-invariant metric Equation (5), the space P_m enjoys two useful properties, which are the following.
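In practice, Equation (8) reduces to computing the eigenvalues of Y^{-1/2} Z Y^{-1/2}. A minimal NumPy sketch (illustrative; the function names are our own, not from the original text) is:

```python
import numpy as np

def spd_power(Y, p):
    """Symmetric matrix power of an SPD matrix, via eigendecomposition."""
    w, U = np.linalg.eigh(Y)
    return (U * w**p) @ U.T

def rao_distance(Y, Z):
    """Affine-invariant (Rao) distance, Equation (8):
    d(Y, Z) = || log(Y^{-1/2} Z Y^{-1/2}) ||_F."""
    iY = spd_power(Y, -0.5)
    lam = np.linalg.eigvalsh(iY @ Z @ iY)   # eigenvalues of Y^{-1/2} Z Y^{-1/2}
    return np.sqrt(np.sum(np.log(lam) ** 2))

# Example: d(I, e*I) in P_2 equals sqrt(log(e)^2 + log(e)^2) = sqrt(2).
print(rao_distance(np.eye(2), np.e * np.eye(2)))
```

One can also verify numerically the affine-invariance stated below: d(A† Y A, A† Z A) agrees with d(Y, Z) for any invertible A.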
The first property is invariance under affine transformations [10,25]. Recall that an affine transformation of P_m is a mapping Y ↦ Y·A, where A is an invertible real matrix of size m × m:

Y·A = A† Y A    (9)

and † denotes the transpose. Denote by GL(m) the group of m × m invertible real matrices. Then, the action (9) of GL(m) on P_m is transitive. This means that for any Y, Z ∈ P_m, there exists A ∈ GL(m) such that Y·A = Z. Moreover, the Riemannian distance Equation (8) is invariant under affine transformations, in the sense that for all Y, Z ∈ P_m:

d(Y·A, Z·A) = d(Y, Z)    (10)

where Y·A and Z·A are defined by Equation (9). The transitivity of the action Equation (9) and the isometry property Equation (10) make P_m a Riemannian homogeneous space. The affine-invariant metric Equation (5) turns P_m into a Riemannian manifold of negative sectional curvature [10,27]. As a result, P_m enjoys the property of the existence and uniqueness of Riemannian medians. The Riemannian median of Y_1, …, Y_N ∈ P_m is defined to be:

Ŷ_N = argmin_{Y ∈ P_m} Σ_{n=1}^N d(Y, Y_n)    (11)

where d(Y, Y_n) is the Riemannian distance Equation (8). If Y_1, …, Y_N do not belong to the same geodesic, then Ŷ_N exists and is unique [28]. More generally, for any probability measure π on P_m, the median of π is defined to be:

Ŷ_π = argmin_{Y ∈ P_m} ∫_{P_m} d(Y, Z) dπ(Z)    (12)

Note that Equation (12) reduces to Equation (11) for π = (1/N) Σ_{n=1}^N δ_{Y_n}. If the support of π is not carried by a single geodesic, then again, Ŷ_π exists and is unique by the main result of [28].
To end this section, consider the Riemannian volume associated with the affine-invariant Riemannian metric [10]:

dv(Y) = det(Y)^{-(m+1)/2} ∏_{i≤j} dY_{ij}    (13)

where the indices denote matrix elements. The Riemannian volume is used to define the integral of a function f : P_m → R as:

∫_{P_m} f(Y) dv(Y)    (14)

where the integral on the right-hand side is a multiple integral over the m(m+1)/2 variables Y_{ij} with i ≤ j. The integral Equation (14) is invariant under affine transformations. Precisely:

∫_{P_m} f(Y·A) dv(Y) = ∫_{P_m} f(Y) dv(Y)    (15)

where Y·A is the affine transformation given by Equation (9). It takes on a simplified form when f(Y) only depends on the eigenvalues of Y. Precisely, let the spectral decomposition of Y be given by Y = U diag(e^{r_1}, …, e^{r_m}) U†, where U is an orthogonal matrix and e^{r_1}, …, e^{r_m} are the eigenvalues of Y. Assume that f(Y) = f(r_1, …, r_m); then, the invariant integral Equation (14) reduces to:

∫_{P_m} f(Y) dv(Y) = c_m ∫_{R^m} f(r_1, …, r_m) ∏_{i<j} sinh( |r_i − r_j| / 2 ) dr_1 … dr_m    (16)

where the constant c_m depends only on m, and its expression involves the multivariate gamma function Γ_m given in [29]. See Appendix A for the derivation of Equation (16) from Equation (14).

Riemannian Laplace Distribution on P m
The Riemannian Laplace distribution on P_m is defined by analogy with the well-known Laplace distribution on R. Recall the density of the Laplace distribution on R:

p(x | x̄, σ) = (1/2σ) exp( −|x − x̄| / σ )

where x, x̄ ∈ R and σ > 0. This is a density with respect to the length element dx on R. The density of the Riemannian Laplace distribution on P_m is given by:

p(Y | Ȳ, σ) = (1/ζ_m(σ)) exp( −d(Y, Ȳ) / σ )    (17)

Here, Ȳ ∈ P_m, σ > 0, and the density is with respect to the Riemannian volume element Equation (13) on P_m. The normalizing factor ζ_m(σ) appearing in Equation (17) is given by the integral:

ζ_m(σ) = ∫_{P_m} exp( −d(Y, Ȳ) / σ ) dv(Y)    (18)

Assume for now that this integral is finite for some choice of Ȳ and σ. It is possible to show that its value does not depend on Ȳ. To do so, recall that the action of GL(m) on P_m is transitive. As a consequence, there exists A ∈ GL(m) such that Ȳ = I·A, where I·A is defined as in Equation (9). From Equation (10), d(Y, Ȳ) = d(Y·A⁻¹, I), so from the invariance property Equation (15):

ζ_m(σ) = ∫_{P_m} exp( −d(Y, I) / σ ) dv(Y)

The integral on the right does not depend on Ȳ, which proves the above claim. The last integral representation and Formula (16) lead to the following explicit expression:

ζ_m(σ) = c_m ∫_{R^m} exp( −|r| / σ ) ∏_{i<j} sinh( |r_i − r_j| / 2 ) dr    (19)

where |r| = (r_1² + … + r_m²)^{1/2} and c_m is the same constant as in Equation (16) (see Appendix B for more details on the derivation of Equation (19)).
A distinctive feature of the Riemannian Laplace distribution on P_m, in comparison to the Laplace distribution on R, is that there exist certain values of σ for which it cannot be defined. This is because the integral Equation (19) diverges for certain values of this parameter. This leads to the following definition.
Definition 1. Let σ_m = sup { σ > 0 : ζ_m(σ) < ∞ }. Then, for Ȳ ∈ P_m and σ ∈ (0, σ_m), the Riemannian Laplace distribution on P_m, denoted by L(Ȳ, σ), is defined as the probability distribution on P_m whose density with respect to dv(Y) is given by Equation (17), where ζ_m(σ) is defined by Equation (19).
The constant σ m in this definition satisfies 0 < σ m < ∞ for all m and takes the value √ 2 for m = 2 (see Appendix C for proofs).
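The threshold for m = 2 can be observed numerically. Along the direction r_1 = −r_2, the integrand of Equation (19) behaves like exp(t(1 − √2/σ)) with t = |r_1|, and the exponent changes sign exactly at σ = √2. The following sketch (an illustrative brute-force grid sum, not a method from the paper) compares truncated integrals for σ on each side of the threshold:

```python
import numpy as np

def truncated_integral(sigma, R, h=0.1):
    """Truncated version (over the square [-R, R]^2) of the integral in
    Equation (19) for m = 2, up to the constant c_m."""
    r = np.arange(-R, R, h)
    r1, r2 = np.meshgrid(r, r)
    f = np.exp(-np.hypot(r1, r2) / sigma) * np.sinh(np.abs(r1 - r2) / 2)
    return f.sum() * h * h

# sigma = 1.0 < sqrt(2): the truncated integrals stabilize as R grows.
# sigma = 2.0 > sqrt(2): they keep growing, reflecting divergence.
for sigma in (1.0, 2.0):
    print(sigma, truncated_integral(sigma, 30), truncated_integral(sigma, 60))
```

Doubling the truncation radius leaves the σ = 1 value essentially unchanged, while the σ = 2 value grows by several orders of magnitude.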

Sampling from L( Ȳ, σ)
The current section presents a general method for sampling from the Laplace distribution L(Ȳ, σ). This method relies in part on the following transformation property.
Proposition 2. Let Y be a random variable in P_m with distribution L(Ȳ, σ). For all A ∈ GL(m), the random variable Y·A has distribution L(Ȳ·A, σ), where Y·A is given by Equation (9).
Proof. Denote by p the density of Y. For any test function f on P_m,

E[ f(Y·A) ] = ∫_{P_m} f(Y·A) p(Y) dv(Y) = ∫_{P_m} f(X) p(X·A⁻¹) dv(X)

where the equality is a result of Equation (15). However, p(X·A⁻¹) is proportional to exp( −d(X·A⁻¹, Ȳ)/σ ) = exp( −d(X, Ȳ·A)/σ ) by Equation (10), which proves the proposition.
The following algorithm describes how to sample from L(Ȳ, σ), where 0 < σ < σ_m. For this, it is first required to sample from the density p on R^m defined by:

p(r) ∝ exp( −|r| / σ ) ∏_{i<j} sinh( |r_i − r_j| / 2 )

This can be done by a usual Metropolis algorithm [30].
It is also required to sample from the uniform distribution on O(m), the group of real orthogonal m × m matrices. This can be done by generating A, an m × m matrix whose entries are i.i.d. with normal distribution N(0, 1); then the orthogonal matrix U, in the decomposition A = UT with T upper triangular, is uniformly distributed on O(m) [29] (p. 70). Sampling from L(Ȳ, σ) can now be described as follows:

Step 1. Sample r = (r_1, …, r_m) from the density p above.
Step 2. Sample U from the uniform distribution on O(m).
Step 3. Set X = U diag(e^{r_1}, …, e^{r_m}) U†.
Step 4. Set Y = X·Ȳ^{1/2} = Ȳ^{1/2} X Ȳ^{1/2}.
Note that the law of X in Step 3 is L(I, σ); the proof of this fact is given in Appendix D. Finally, the law of Y in Step 4 is L(I·Ȳ^{1/2} = Ȳ, σ) by Proposition 2.
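The sampling procedure can be sketched in NumPy as follows (an illustrative implementation; the Metropolis step size and chain length are ad hoc choices, not tuned values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(r, sigma):
    """Log of the unnormalized target density p(r) of Step 1."""
    m = len(r)
    s = -np.linalg.norm(r) / sigma
    for i in range(m):
        for j in range(i + 1, m):
            s += np.log(np.sinh(abs(r[i] - r[j]) / 2) + 1e-300)
    return s

def sample_r(m, sigma, n_steps=2000, step=0.5):
    """Step 1: Metropolis random walk targeting p(r)."""
    r = rng.normal(size=m)
    for _ in range(n_steps):
        prop = r + step * rng.normal(size=m)
        if np.log(rng.random()) < log_p(prop, sigma) - log_p(r, sigma):
            r = prop
    return r

def sample_laplace(Y_bar, sigma):
    m = Y_bar.shape[0]
    r = sample_r(m, sigma)                          # Step 1
    Q, T = np.linalg.qr(rng.normal(size=(m, m)))    # Step 2: uniform on O(m)
    U = Q * np.sign(np.diag(T))                     # sign fix for uniformity
    X = (U * np.exp(r)) @ U.T                       # Step 3: X ~ L(I, sigma)
    w, V = np.linalg.eigh(Y_bar)                    # Step 4: Ybar^{1/2} X Ybar^{1/2}
    S = (V * np.sqrt(w)) @ V.T
    return S @ X @ S

Y = sample_laplace(np.array([[2.0, 0.5], [0.5, 1.0]]), sigma=0.5)
print(np.allclose(Y, Y.T), np.linalg.eigvalsh(Y).min() > 0)
```

Each draw is symmetric positive definite by construction, since it is a congruence transformation of the SPD matrix X.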

Estimation of Ȳ and σ
The current section considers maximum likelihood estimation of the parameters Ȳ and σ, based on independent observations Y_1, …, Y_N from the Riemannian Laplace distribution L(Ȳ, σ). The main results are contained in Propositions 3 and 5 below.
Proposition 3 states the existence and uniqueness of the maximum likelihood estimates Ŷ_N and σ̂_N of Ȳ and σ. In particular, the maximum likelihood estimate Ŷ_N of Ȳ is the Riemannian median of Y_1, …, Y_N, defined by Equation (11). Numerical computation of Ŷ_N will be considered and carried out using a Riemannian sub-gradient descent algorithm [8].
Proposition 5 states the convergence of the maximum likelihood estimate Ŷ_N to the true value of the parameter Ȳ. It is based on Lemma 4, which states that the parameter Ȳ is the Riemannian median of the distribution L(Ȳ, σ), in the sense of definition Equation (12).

Proposition 3 (MLE and median).
The maximum likelihood estimate of the parameter Ȳ is the Riemannian median Ŷ_N of Y_1, …, Y_N. Moreover, the maximum likelihood estimate of the parameter σ is the solution σ̂_N of:

σ² (d/dσ) log ζ_m(σ) = (1/N) Σ_{n=1}^N d(Y_n, Ŷ_N)    (20)

Both Ŷ_N and σ̂_N exist and are unique for any realization of the samples Y_1, …, Y_N.
Proof of Proposition 3. The log-likelihood function of the parameters Ȳ and σ can be written as:

L(Ȳ, σ) = −N log ζ_m(σ) − (1/σ) Σ_{n=1}^N d(Y_n, Ȳ)    (21)

As the first term in this expression does not contain Ȳ, maximizing over Ȳ amounts to minimizing Σ_{n=1}^N d(Y_n, Ȳ). The minimizer is exactly Ŷ_N by Equation (11). This proves the first claim. Now, consider the function F of the variable η = −1/σ obtained by substituting into the log-likelihood Equation (21) evaluated at Ŷ_N. This function is strictly concave, its second derivative being expressed through the logarithm of the moment generating function of a positive measure. By the strict concavity of F, there exists a unique η̂_N < −1/σ_m (the maximum of F) such that F′(η̂_N) = 0. It follows that σ̂_N = −1/η̂_N lies in (0, σ_m) and satisfies Equation (20). The uniqueness of σ̂_N is a consequence of the uniqueness of η̂_N. To complete the proof, it remains to check that lim_{η→−∞} F(η) = −∞. This follows from a bound of the form log ζ_m(σ) ≤ A_m + B_m σ for small σ, where A_m and B_m are two constants depending only on m, together with a lower bound d(Y, Ȳ) ≥ C_m for some constant C_m depending only on m; combining these bounds with Equation (21) yields the desired limit, so the equation F′(η) = 0 indeed has a solution.

Consider now the numerical computation of the maximum likelihood estimates Ŷ_N and σ̂_N given by Proposition 3. Computation of Ŷ_N consists in finding the Riemannian median of Y_1, …, Y_N, defined by Equation (11). This can be done using the Riemannian sub-gradient descent algorithm of [8]. The k-th iteration of this algorithm produces an approximation Ŷ_N^k of Ŷ_N in the following way. For k = 1, 2, …, let Δ_k be the symmetric matrix:

Δ_k = (1/N) Σ_{n=1}^N Log_{Ŷ_N^k}(Y_n) / ‖ Log_{Ŷ_N^k}(Y_n) ‖    (22)

Here, Log is the Riemannian logarithm mapping, inverse to the Riemannian exponential mapping:

Exp_Y(A) = Y^{1/2} exp( Y^{-1/2} A Y^{-1/2} ) Y^{1/2}

and ‖Log_a(b)‖ = [ g_a( Log_a(b), Log_a(b) ) ]^{1/2}. Then, Ŷ_N^{k+1} is defined to be:

Ŷ_N^{k+1} = Exp_{Ŷ_N^k}( τ_k Δ_k )

where τ_k > 0 is a step size, which can be determined using a backtracking procedure. Computation of σ̂_N requires solving a non-linear equation in one variable. This is readily done using Newton's method.
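This sub-gradient scheme can be sketched compactly in NumPy (with a simple fixed step size instead of backtracking — an illustrative simplification, not the paper's tuned procedure):

```python
import numpy as np

def spd_map(Y, fun):
    """Apply a scalar function to an SPD matrix through its eigenvalues."""
    w, U = np.linalg.eigh(Y)
    return (U * fun(w)) @ U.T

def riem_exp(Y, A):
    """Exp_Y(A) = Y^{1/2} exp(Y^{-1/2} A Y^{-1/2}) Y^{1/2}."""
    S, iS = spd_map(Y, np.sqrt), spd_map(Y, lambda w: 1 / np.sqrt(w))
    return S @ spd_map(iS @ A @ iS, np.exp) @ S

def riemannian_median(Ys, n_iter=300, tau=0.05):
    """Fixed-step version of the sub-gradient iteration around Equation (22)."""
    Y = Ys[0].copy()
    N = len(Ys)
    for _ in range(n_iter):
        S, iS = spd_map(Y, np.sqrt), spd_map(Y, lambda w: 1 / np.sqrt(w))
        delta = np.zeros_like(Y)
        for Yn in Ys:
            M = spd_map(iS @ Yn @ iS, np.log)   # Log_Y(Yn) in normal coordinates
            d = np.linalg.norm(M, 'fro')        # equals d(Y, Yn)
            if d > 1e-9:                        # skip data points at the iterate
                delta += (S @ M @ S) / d / N
        Y = riem_exp(Y, tau * delta)
    return Y
```

As a sanity check: diagonal matrices form a flat, totally geodesic submanifold on which the median reduces to the Euclidean geometric median of the log-eigenvalue vectors, so four symmetric diagonal points give the center of their log-coordinate square.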
It is shown now that the empirical Riemannian median Ŷ_N converges almost surely to the true median Ȳ. This means that Ŷ_N is a consistent estimator of Ȳ. The proof of this fact requires a few notations and a preparatory lemma.
Lemma 4. Let Y be a random variable in P_m with distribution L(Ȳ, σ). Then: (i) Ȳ is the Riemannian median of L(Ȳ, σ), in the sense of Equation (12); (ii) σ is given by σ = Φ( E[ d(Y, Ȳ) ] ), where Φ denotes the inverse of the function σ ↦ σ² (d/dσ) log ζ_m(σ).

Proof. (i) The median of L(Ȳ, σ) is the minimizer of the function E(Y*) = ∫_{P_m} d(Y*, Z) p(Z | Ȳ, σ) dv(Z). According to Theorem 2.1 in [28], this function has a unique global minimum, which is also its unique stationary point. Thus, to prove that Ȳ is the minimum point of E, it will suffice to check that for any geodesic γ starting from Ȳ, the derivative (d/dt) E(γ(t)) at t = 0 vanishes. This follows by differentiating under the integral, using the known expression of the derivative of the distance function Z ↦ d(γ(t), Z), valid for all Z ≠ Ȳ [32], and evaluating the resulting integral in Equation (26). (ii) Differentiating the identity ∫_{P_m} exp( −d(Y, Ȳ)/σ ) dv(Y) = ζ_m(σ) with respect to σ, it comes that E[ d(Y, Ȳ) ] = σ² (d/dσ) log ζ_m(σ), which proves (ii).

Proposition 5 now follows: the empirical median of independent samples from a probability measure π converges almost surely to the median of π, in the sense of Equation (12). Applying this result to π = L(Ȳ, σ) and using Ŷ_π = Ȳ, which follows from item (i) of Lemma 4, shows that Ŷ_N converges almost surely to Ȳ.

Mixtures of Laplace Distributions
There are several motivations for considering mixtures of distributions in general. The most natural approach is to envisage a dataset as constituted of several subpopulations. Another motivation is that there is support for the argument that mixtures of distributions provide a good approximation to most distributions, in a spirit similar to wavelets.
The present section introduces the class of probability distributions that are finite mixtures of Riemannian Laplace distributions on P m .These constitute the main theoretical tool, to be used for the target application of the present paper, namely the problem of texture classification in computer vision, which will be treated in Section 5.
A mixture of Riemannian Laplace distributions is a probability distribution on P_m whose density with respect to the Riemannian volume element Equation (13) has the following expression:

p(Y) = Σ_{µ=1}^M ϖ_µ × p(Y | Ȳ_µ, σ_µ)    (27)

where the ϖ_µ are nonzero weights whose sum is equal to one, Ȳ_µ ∈ P_m and σ_µ ∈ (0, σ_m) for all 1 ≤ µ ≤ M, and the parameter M is called the number of mixture components. Section 4.1 describes a new EM algorithm, which computes the maximum likelihood estimates of the mixture parameters (ϖ_µ, Ȳ_µ, σ_µ), 1 ≤ µ ≤ M, based on independent observations Y_1, …, Y_N from the mixture distribution Equation (27).
Section 4.2 considers the problem of order selection for mixtures of Riemannian Laplace distributions. Precisely, this consists of finding the number M of mixture components in Equation (27) that realizes the best representation of a given set of data Y_1, …, Y_N. This problem is solved by computing the BIC criterion, which is here found in explicit form for the case of mixtures of Riemannian Laplace distributions on P_m.

Estimation of the Mixture Parameters
In this section, Y_1, …, Y_N are i.i.d. samples from Equation (27). Based on these observations, an EM algorithm is proposed to estimate (ϖ_µ, Ȳ_µ, σ_µ), 1 ≤ µ ≤ M. The derivation of this algorithm can be carried out similarly to [15].
Based on the current estimate θ̂ = (ϖ̂_µ, Ŷ_µ, σ̂_µ) of the mixture parameters, define for each observation Y_n the posterior probabilities:

ω_µ(Y_n, θ̂) = ϖ̂_µ p(Y_n | Ŷ_µ, σ̂_µ) / Σ_{s=1}^M ϖ̂_s p(Y_n | Ŷ_s, σ̂_s)    (28)

and let N_µ(θ̂) = Σ_{n=1}^N ω_µ(Y_n, θ̂).

• Update for ϖ̂_µ: Based on the current value of θ̂, assign to ϖ̂_µ the new value ϖ̂_µ = N_µ(θ̂)/N.
• Update for Ŷ_µ: Based on the current value of θ̂, assign to Ŷ_µ the value:

Ŷ_µ = argmin_{Y ∈ P_m} Σ_{n=1}^N ω_µ(Y_n, θ̂) d(Y_n, Y)    (29)

• Update for σ̂_µ: Based on the current value of θ̂, assign to σ̂_µ the new value:

σ̂_µ = Φ( N_µ(θ̂)⁻¹ Σ_{n=1}^N ω_µ(Y_n, θ̂) d(Y_n, Ŷ_µ) )    (30)

where Φ denotes the inverse of the function σ ↦ σ² (d/dσ) log ζ_m(σ), as in Lemma 4.
These three update rules should be performed in the above order. Realization of the update rules for ϖ̂_µ and σ̂_µ is straightforward. The update rule for Ŷ_µ is realized using a slight modification of the sub-gradient descent algorithm described in Section 3.2. More precisely, the factor 1/N appearing in Equation (22) is replaced with ω_µ(Y_n, θ̂) at each iteration.
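The posterior weights ω_µ(Y_n, θ̂) driving these updates can be computed as in the sketch below (illustrative; `dists[n, mu]` is assumed to hold the precomputed distances d(Y_n, Ŷ_µ), and `log_zeta` stands for a hypothetical routine evaluating log ζ_m(σ), e.g. by numerical integration of Equation (19)):

```python
import numpy as np

def responsibilities(dists, weights, sigmas, log_zeta):
    """E-step: omega_mu(Y_n) proportional to w_mu * p(Y_n | Yhat_mu, sigma_mu),
    normalized over the mixture components mu."""
    sigmas = np.asarray(sigmas, dtype=float)
    log_num = (np.log(weights)[None, :]
               - dists / sigmas[None, :]
               - np.array([log_zeta(s) for s in sigmas])[None, :])
    log_num -= log_num.max(axis=1, keepdims=True)   # numerical stability
    num = np.exp(log_num)
    return num / num.sum(axis=1, keepdims=True)

# Toy example with a constant (hypothetical) normalizer:
omega = responsibilities(np.array([[0.5, 2.0], [3.0, 0.2]]),
                         weights=[0.5, 0.5], sigmas=[1.0, 1.0],
                         log_zeta=lambda s: 0.0)
print(omega)
```

Each row sums to one, and with equal weights and dispersions, each observation is assigned mostly to its nearest centroid.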
In practice, the initial conditions (ϖ̂_µ^0, Ŷ_µ^0, σ̂_µ^0) in this algorithm were chosen in the following way. The weights (ϖ̂_µ^0) are uniform and equal to 1/M; (Ŷ_µ^0) are M different observations from the set {Y_1, …, Y_N}, chosen randomly; and (σ̂_µ^0) is computed from (ϖ̂_µ^0) and (Ŷ_µ^0) according to the rule Equation (30). Since the convergence of the algorithm depends on the initial conditions, the EM algorithm is run several times, and the best result is retained, i.e., the one maximizing the log-likelihood function.

The Bayesian Information Criterion
The BIC was introduced by Schwarz to find the appropriate dimension of a model that will fit a given set of observations [16]. Since then, the BIC has been used in many Bayesian modeling problems where priors are hard to set precisely. In large sample settings, the fitted model favored by the BIC ideally corresponds to the candidate model that is a posteriori most probable, i.e., the model that is rendered most plausible by the data at hand. One of the main features of the BIC is its easy computation, since it is only based on the empirical log-likelihood function.
Given a set of observations {Y_1, …, Y_N} arising from Equation (27) where M is unknown, the BIC consists of choosing the parameter:

M* = argmax_M BIC(M)    (31)

where:

BIC(M) = LL − (1/2) × DF × log N    (32)

Here, LL is the log-likelihood given by:

LL = Σ_{n=1}^N log Σ_{k=1}^M ϖ̂_k p(Y_n | Ŷ_k, σ̂_k)

and DF is the number of degrees of freedom of the statistical model:

DF = M × m(m+1)/2 + M + (M − 1)    (33)

In Formula (32), (ϖ̂_k, Ŷ_k, σ̂_k), 1 ≤ k ≤ M, are obtained from an EM algorithm as stated in Section 4.1, assuming the exact dimension is M. Finally, note that in Formula (33), M × m(m+1)/2 is the number of degrees of freedom associated with the M centroids, M with the dispersion parameters and M − 1 with the weights.
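In code, the criterion takes the following form (a sketch; the degrees-of-freedom count follows the decomposition above: M − 1 free weights, M centroids with m(m+1)/2 entries each, and M dispersion parameters):

```python
import numpy as np

def degrees_of_freedom(M, m):
    """DF of an M-component Laplace mixture on P_m, as in Equation (33)."""
    return M * (m * (m + 1) // 2) + M + (M - 1)

def bic(log_likelihood, M, m, N):
    """Equation (32): penalized log-likelihood; the best order M maximizes this."""
    return log_likelihood - 0.5 * degrees_of_freedom(M, m) * np.log(N)

# With equal log-likelihoods, the BIC prefers the smaller model.
print(degrees_of_freedom(1, 2), degrees_of_freedom(3, 2))
print(bic(-100.0, 1, 2, 50) > bic(-100.0, 3, 2, 50))
```

In practice, `bic` would be evaluated for each candidate M after running the EM algorithm of Section 4.1 to convergence at that order.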

Application to Classification of Data on P m
Recently, several approaches have used the Riemannian distance as the main innovation in image or signal classification problems [2,15,34]. It turns out that the use of this distance leads to more accurate results (in comparison, for example, with the Euclidean distance). This section proposes an application that follows a similar approach, but in addition to the Riemannian distance, it also relies on a statistical approach. It considers the application of the Riemannian Laplace distribution (RLD) to the classification of data in P_m and gives an original Laplace classification rule, which can be used to carry out the task of classification, even in the presence of outliers. It also applies this classification rule to the problem of texture classification in computer vision, showing that it leads to improved results in comparison with the recent literature.
Section 5.1 considers, from the point of view of statistical learning, the classification of data with values in P_m. Given data points Y_1, …, Y_N ∈ P_m, this proceeds in two steps, called the learning phase and the classification phase, respectively. The learning phase uncovers the class structure of the data by estimating a mixture model using the EM algorithm developed in Section 4.1. Once training is accomplished, data points are subdivided into disjoint classes. Classification consists of associating each new data point to the most suitable class. For this, a new classification rule will be established and shown to be optimal. Section 5.2 is the implementation of the Laplace classification rule, together with the BIC criterion, for texture classification in computer vision. It highlights the advantage of the Laplace distribution in the presence of outliers and shows its better performance compared to recent approaches.

Classification Using Mixtures of Laplace Distributions
Assume to be given a set of training data Y_1, …, Y_N. These are now modeled as a realization of a mixture of Laplace distributions:

p(Y) = Σ_{µ=1}^M ϖ_µ × p(Y | Ȳ_µ, σ_µ)    (34)

In this section, the order M in Equation (34) is considered as known. The training phase consists of learning the structure of these data as a family of M disjoint classes C_µ, µ = 1, …, M. To be more precise, depending on the family (ϖ_µ), some of these classes may be empty. Training is done by applying the EM algorithm described in Section 4.1. As a result, each class C_µ is represented by a triple (ϖ̂_µ, Ŷ_µ, σ̂_µ) corresponding to maximum likelihood estimates of (ϖ_µ, Ȳ_µ, σ_µ). Each observation Y_n is then associated with the class C_µ*, where µ* = argmax_µ ω_µ(Y_n, θ̂) (recall the definition from Equation (28)). In this way, {Y_1, …, Y_N} is subdivided into M disjoint classes. The classification phase requires a classification rule. Following [15], the optimal rule (in the sense of a Bayesian risk criterion given in [35]) consists of associating any new data point Y_t to the class C_µ*, where:

µ* = argmax_µ N̂_µ × p(Y_t | Ŷ_µ, σ̂_µ)    (35)

Here, N̂_µ is the number of elements in C_µ. Replacing N̂_µ with N × ϖ̂_µ, Equation (35) becomes argmax_µ ϖ̂_µ × p(Y_t | Ŷ_µ, σ̂_µ). Note that when the weights ϖ_µ in Equation (34) are assumed to be equal, this rule reduces to a maximum likelihood classification rule max_µ p(Y_t | Ŷ_µ, σ̂_µ). A quick look at the expression Equation (17) shows that Equation (35) can also be expressed as:

µ* = argmin_µ { −log ϖ̂_µ + log ζ_m(σ̂_µ) + d(Y_t, Ŷ_µ) / σ̂_µ }    (36)

The rule Equation (36) will be called the Laplace classification rule. It favors clusters C_µ having a larger number of data points (the minimized quantity contains −log ϖ̂_µ) or a smaller dispersion away from the median (the minimized quantity contains log ζ_m(σ̂_µ)). When choosing between two clusters with the same number of points and the same dispersion, this rule favors the one whose median is closer to Y_t. If the number of data points inside the clusters and the respective dispersions are neglected, then Equation (36) reduces to the nearest-neighbor rule involving only the Riemannian distance, introduced in [2].
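The rule in Equation (36) is straightforward to implement once the per-class distances are available. In the sketch below (illustrative; `log_zeta` again stands for a hypothetical routine evaluating log ζ_m(σ)), a sparsely populated class can lose to a farther but much heavier class, which is exactly the weighting behavior described above:

```python
import numpy as np

def laplace_classify(dists, weights, sigmas, log_zeta):
    """Laplace classification rule, Equation (36): argmin over mu of
    -log(w_mu) + log(zeta(sigma_mu)) + d(Y_t, Yhat_mu)/sigma_mu."""
    cost = (-np.log(weights)
            + np.array([log_zeta(s) for s in sigmas])
            + np.asarray(dists) / np.asarray(sigmas))
    return int(np.argmin(cost))

lz = lambda s: 0.0                     # hypothetical constant normalizer
# Equal weights and dispersions: the nearest median wins.
print(laplace_classify([1.0, 2.0], np.array([0.5, 0.5]), [1.0, 1.0], lz))   # -> 0
# A much heavier second class overturns the distance advantage.
print(laplace_classify([1.0, 2.0], np.array([0.05, 0.95]), [1.0, 1.0], lz)) # -> 1
```

Dropping the weight and dispersion terms reduces this to the nearest-neighbor rule on the Riemannian distance.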
The analogous rules to Equation (36) for the Riemannian Gaussian distribution (RGD) [15] and the Wishart distribution (WD) [17] on P_m can be established by replacing p(Y_t | Ŷ_µ, σ̂_µ) in Equation (35) with the RGD and WD densities and then following the same reasoning as before. Recall that a WD depends on an expectation Σ ∈ P_m and a number of degrees of freedom n [29]. For the WD, Equation (36) becomes a rule of the same form, with the Wishart log-density in place of the Laplace one. Here, ϖ̂(µ), Σ̂(µ) and n̂(µ) denote maximum likelihood estimates of the true parameters ϖ(µ), Σ(µ) and n(µ), which define the mixture model (these estimates can be computed as in [36,37]).

Application to Texture Classification
This section presents an application of the mixture of Laplace distributions to the context of texture classification on the MIT Vision Texture (VisTex) database [38]. The purpose of this experiment is to classify the textures, taking into consideration the within-class diversity. In addition, the influence of outliers on the classification performance is analyzed. The results obtained for the RLD are compared to those given by the RGD [15] and the WD [17].
The VisTex database contains 40 images, considered as being 40 different texture classes. The database used for the experiment is obtained after several steps. First of all, each texture is decomposed into 169 patches of 128 × 128 pixels, with an overlap of 32 pixels, giving a total number of 6760 textured patches. Next, some patches are corrupted, in order to introduce abnormal data into the dataset. Their intensity is modified by applying a gradient of luminosity. For each class, between zero and 60 patches are modified in order to become outliers. An example of a VisTex texture, with one of its patches and an outlier patch, is shown in Figure 1. Once the database is built, it is equally and randomly divided 15 times, in order to obtain the training and testing sets that are further used in the supervised classification algorithm. Then, for each patch in both databases, a feature vector has to be computed. The luminance channel is first extracted and then normalized in intensity. The grayscale patches are filtered using the stationary wavelet transform with the Daubechies db4 filter (see [39]), with two scales and three orientations. To model the wavelet sub-bands, various stochastic models have been proposed in the literature. Among them, the univariate generalized Gaussian distribution has been found to accurately model the empirical histogram of wavelet sub-bands [40]. Recently, it has been proposed to model the spatial dependency of wavelet coefficients. To this aim, the wavelet coefficients located in a p × q spatial neighborhood of the current spatial position are clustered in a random vector. The realizations of these vectors can be further modeled by elliptical distributions [41,42], copula-based models [43,44], etc. In this paper, the wavelet coefficients are considered as being realizations of zero-mean multivariate Gaussian distributions. In addition, for this experiment, the spatial information is captured by using a vertical (2 × 1) and a horizontal (1 × 2)
neighborhood. Next, the 2 × 2 sample covariance matrices are estimated for each wavelet sub-band and each neighborhood. Finally, each patch is represented by a set of F = 12 covariance matrices (2 scales × 3 orientations × 2 neighborhoods). The estimated covariance matrices are elements of P_m, with m = 2, and therefore, they can be modeled by Riemannian Laplace distributions. More precisely, in order to take into consideration the within-class diversity, each class in the training set is viewed as a realization of a mixture of Riemannian Laplace distributions (Equation (27)) with M mixture components, characterized by (ϖ_µ, Ȳ_{µ,f}, σ_{µ,f}), with Ȳ_{µ,f} ∈ P_2, µ = 1, …, M and f = 1, …, F. Since the sub-bands are assumed to be independent, the probability density function is given by:

p(Y_1, …, Y_F) = ∏_{f=1}^F Σ_{µ=1}^M ϖ_µ p(Y_f | Ȳ_{µ,f}, σ_{µ,f})    (37)

The learning step of the classification is performed using the EM algorithm presented in Section 4, and the number of mixture components is determined using the BIC criterion recalled in Equation (31). Note that for the considered model given in Equation (37), the number of degrees of freedom is expressed as:

DF = M × F × ( m(m+1)/2 + 1 ) + (M − 1)

since one centroid and one dispersion parameter should be estimated per feature and per component of the mixture model. In practice, the number of mixture components M varies between two and five, and the M yielding the highest BIC criterion is retained. As mentioned earlier, the EM algorithm is sensitive to the initial conditions. In order to minimize this influence, for this experiment, the EM algorithm is repeated 10 times, and the result maximizing the log-likelihood function is retained. Finally, the classification is performed by assigning each element in the testing set, represented by its covariance matrices (Y_t^1, …, Y_t^F) in P_2, to the class of the closest cluster µ*, given by:

µ* = argmin_µ { −log ϖ̂_µ + Σ_{f=1}^F [ log ζ_m(σ̂_{µ,f}) + d(Y_t^f, Ŷ_{µ,f}) / σ̂_{µ,f} ] }

This expression is obtained starting from Equations (36) and (37), knowing that F features are extracted for each patch.
The classification results of the proposed model (solid red line), expressed in terms of overall accuracy and shown in Figure 2, are compared to those given by a fixed number of mixture components (that is, M = 3, dashed red line) and to those given when the within-class diversity is not considered (that is, M = 1, dotted red line). In addition, the classification performances given by the RGD model (displayed in black) proposed in [15] and the WD model (displayed in blue) proposed in [17] are also considered. For each of these models, the number of mixture components is first computed using the BIC, and next, it is fixed to M = 3 and M = 1. For all of the considered models, the classification rate is given as a function of the number of outliers, which varies between zero and 60 for each class. It is shown that, as the number of outliers increases, the RLD gives progressively better results than the RGD and the WD. The results are further improved by using the BIC criterion to choose the suitable number of clusters. In conclusion, the mixture of RLDs, combined with the BIC criterion to estimate the best number of mixture components, can minimize the influence of abnormal samples present in the dataset, illustrating the relevance of the proposed method.

Conclusions
Motivated by the problem of outliers in statistical data, this paper introduces a new distribution on the space P_m of m × m symmetric positive definite matrices, called the Riemannian Laplace distribution. Denoted throughout the paper by L(Ȳ, σ), where Ȳ ∈ P_m and σ > 0 are the indexing parameters, this distribution may be thought of as specifying the law of a family of observations on P_m concentrated around the location Ȳ and having dispersion σ. If d denotes Rao's distance on P_m and dv(Y) its associated volume form, the density of L(Ȳ, σ) with respect to dv(Y) is proportional to exp(−d(Y, Ȳ)/σ). Interestingly, the normalizing constant depends only on σ (and not on Ȳ). This allows exact expressions to be deduced for the maximum likelihood estimates of Ȳ and σ, relying on the Riemannian median on P_m. These estimates are also computed numerically by means of sub-gradient algorithms. The estimation of parameters in mixture models of Laplace distributions is also considered and performed using a new expectation-maximization algorithm. Finally, the main theoretical results are illustrated by an application to texture classification. The proposed experiment consists of introducing abnormal data (outliers) into a set of images from the VisTex database and analyzing their influence on the classification performance. Each image is characterized by a set of 2 × 2 covariance matrices, modeled as mixtures of Riemannian Laplace distributions in the space P_2. The number of mixture components is estimated using the BIC criterion. The obtained results are compared to those given by the Riemannian Gaussian distribution, showing the better performance of the proposed method.

This proves the proposition (the factor m!2^m comes from the fact that the correspondence between Y and (r, U) is not unique: m! corresponds to all possible reorderings of r_1, …, r_m, and 2^m corresponds to the orientation of the columns of U).
To check (i), note that ∏_{i<j} sinh( |r_i − r_j| / 2 ) ≤ exp(C|r|) for some constant C. Thus, for σ small enough, the integral I_m(σ) = ∫_{R^m} exp( −|r|/σ ) ∏_{i<j} sinh( |r_i − r_j| / 2 ) dr appearing in Equation (19) is finite, and consequently, σ_m > 0.

Figure 1 .
Figure 1. Example of a texture from the VisTex database (a), one of its patches (b) and the corresponding outlier (c).