Fisher Vector Coding for Covariance Matrix Descriptors Based on the Log-Euclidean and Afﬁne Invariant Riemannian Metrics

: This paper presents an overview of coding methods used to encode a set of covariance matrices. Starting from a Gaussian mixture model (GMM) adapted to the Log-Euclidean (LE) or afﬁne invariant Riemannian metric, we propose a Fisher Vector (FV) descriptor adapted to each of these metrics: the Log-Euclidean Fisher Vectors (LE FV) and the Riemannian Fisher Vectors (RFV). Some experiments on texture and head pose image classiﬁcation are conducted to compare these two metrics and to illustrate the potential of these FV-based descriptors compared to state-of-the-art BoW and VLAD-based descriptors. A focus is also applied to illustrate the advantage of using the Fisher information matrix during the derivation of the FV. In addition, ﬁnally, some experiments are conducted in order to provide fairer comparison between the different coding strategies. This includes some comparisons between anisotropic and isotropic models, and a estimation performance analysis of the GMM dispersion parameter for covariance matrices of large dimension


Introduction
In supervised classification, the goal is to tag an image with one class name based on its content. In the beginning of the 2000s, the leading approaches were based on feature coding. Among the most employed coding-based methods, there are the bag of words model (BoW) [1], the vector of locally aggregated descriptors (VLAD) [2,3], the Fisher score (FS) [4] and the Fisher vectors (FV) [5][6][7]. The success of these methods is based on their main advantages. First, the information obtained by feature coding can be used in a wide variety of applications, including image classification [5,8,9], text retrieval [10], action and face recognition [11], etc. Second, combined with powerful local handcrafted features, such as SIFT, they are robust to transformations like scaling, translation, or occlusion [11].
Nevertheless, in 2012, the ImageNet Large Scale Visual Recognition Challenge has shown that Convolutional Neural Networks [12,13] (CNNs) can outperform FV descriptors. Since then, in order to take advantage of both worlds, some hybrid classification architectures have been proposed to combine FV and CNN [14]. For example, Perronnin et al. have proposed to train a network of fully connected layers on the FV descriptors [15]. Another hybrid architecture is the deep Fisher network composed by stacking several FV layers [16]. Some authors have proposed to extract convolutional features from different layers of the network, and then to use VLAD or FV encoding to encode features

•
First, based on the conventional multivariate GMM, we introduce the log-Euclidean Fisher score (LE FS). This descriptor can be interpreted as the FS computed on the log-Euclidean vector representation of the covariance matrices set. • Second, we have recently introduced a Gaussian distribution on the space P m : the Riemannian Gaussian distribution [39]. This latter allows the definition of a GMM on the space of covariance matrices and an Expectation Maximization (EM) algorithm can hence be considered to learn the codebook [32]. Starting from this observation, we define the Riemannian Fisher score (RFS) [40] which can be interpreted as an extension of the RVLAD descriptor proposed in [11].  Fourth, all these coding methods will be compared on two image processing applications consisting of texture and head pose image classification. Some experiments will also be conducted in order to provide fairer comparison between the different coding strategies. It includes some comparisons between anisotropic and isotropic models. An estimation performance analysis of the dispersion parameter for covariance matrices of large dimension will also be studied.
As previously mentioned, hybrid architectures can be employed to combine FV with CNN. The adaptation of the proposed FV descriptors to these architecture is outside the scope of this paper but will remain one of the perspective of this work.
The paper is structured as follows. Section 2 introduces the workflow presenting the general idea of feature coding-based classification methods. Section 3 presents the codebook generation on the manifold of SPD covariance matrices. Section 4 introduces a theoretical study of the feature encoding methods (BoW, VLAD, FS and FV) based on the LE and affine invariant Riemannian metrics. Section 5 shows two applications of these descriptors to texture and head pose image classification. In addition, finally, Section 6 synthesizes the main conclusions and perspectives of this work.

General Framework
The general workflow is presented in Figure 1 and it consists of the following steps:

1.
Patch extraction is the starting step of the classification algorithm. At the beginning, the images are divided in patches, either in a dense way, by means of fixed grids, or in a non-dense way, based on representative points such as SIFT for example.

2.
A low level feature extraction step is then applied in order to extract some characteristics (such as spatial gradient components). These low-level handcrafted features capture the information contained in each patch. 3.
The covariance matrix of these features are then computed. As a result, each image is represented as a set of covariance matrices which compose the signature of an image. 4.
The codebook generation starts from the previously extracted covariance matrices. The purpose of this step is to identify the features containing the significant information. Usually, this procedure is performed by means of clustering algorithms, such as the k-means or expectation-maximization (EM) algorithm. Knowing that the features are covariance matrices, one of the following approaches can be chosen. The first one considers the LE metric. It consists of projecting the covariance matrices in the LE space [33,35] and then standard clustering algorithms for multivariate Gaussian distributions are used. The second approach considers the affine invariant Riemannian metric to measure the similarity between two covariance matrices. In this context, the conventional k-means or EM algorithm should be readapted to this metric [11,36,40]. For both approaches, the dataset is partitioned into a predefined number of clusters, each of them being described by parameters, such as the cluster's centroid, the dispersion and the associated weight. The obtained features are called codewords and they are grouped in a codebook, also called a dictionary. 5.
Feature encoding is based on the created codebook and it consists in projecting the extracted covariance matrices onto the codebook space. For this purpose, approaches like BoW, VLAD and FV can be employed, for both the LE and affine invariant Riemannian metrics. According to [41], these are global coding strategies, that describe the entire set of features, and not the individual ones. Essentially, this is accomplished using probability density distributions to model the feature space. More precisely, they can be viewed either as voting-based methods depending on histograms, or as Fisher coding-based methods by using Gaussian mixture models adapted to the considered metric [39,42]. 6.
Post-processing is often applied after the feature encoding step, in order to minimize the influence of background information on the image signature [6] and to correct the independence assumption made on the patches [7]. Therefore, two types of normalization are used, namely the power [7] and 2 [6] normalizations. 7.
Classification is the final step, achieved by associating the test images to the class of the most similar training observations. In practice, algorithms such as k-nearest neighbors, support vector machine or random forest can be used.
As shown in Figure 1, the codebook generation along with the feature encoding are the two central steps in this framework. The next two sections present a detailed analysis of how these steps are adapted to covariance matrix features.

Codebook Generation in P m
This section focuses on the codebook generation. At this point, the set of extracted low-level features, i.e., the set of covariance matrices, is used in order to identify the ones embedding the set's significant characteristics. In this paper, two metrics are considered to compute the codebook which are respectively the LE and the affine invariant Riemannian metric. The next two subsections describe these two strategies.

Log-Euclidean Codebook
Let M = {M n } n=1:N , with M n ∈ P m , be a sample of N training SPD matrices of size m × m. The LE codebook is obtained by considering the LE metric as similarity measure between two covariance matrices. For such a purpose, each training covariance matrix M n is first mapped on the LE space by applying the matrix logarithm M LE n = log M n [33,43,44]. Next, a vectorization operator is applied to obtain the LE vector representation. To sum up, for a given SPD matrix M, its LE vector representation, m ∈ R m(m+1) 2 , is defined as m = Vec(log(M)) where Vec is the vectorization operator defined as: with X ij the elements of X.
Once the SPD matrices are mapped on the LE metric space, all the conventional algorithms developed on the Euclidean space can be considered. In particular, the LE vector representation of M, i.e., {m n } n=1:N , can be assumed to be independent and identically distributed (i.i.d.) samples from a mixture of K multivariate Gaussian distributions, whose probability density function is p(m n |θ) = K ∑ k=1 k p(m n |m k , Σ k ) (2) where θ = {( k ,m k , Σ k ) 1≤k≤K } is the parameter vector. For each cluster k, k represent the mixture weight,m k the mean vector and M k the covariance matrices. It yields: where (·) T is the transpose operator, | · | is the determinant,m k ∈ R m(m+1) 2 , Σ k ∈ P m(m+1)/2 and k ∈ R.
In addition, the covariance matrix is assumed to be diagonal, i.e., σ 2 is the variance vector. For such a model, the classical k-means or EM algorithm can be applied to estimate the mixture parameters. The estimated parameters of each mixture component (m k , σ 2 k and k ) represent the codewords and the set composed by the K codewords gives the LE codebook.

New Riemannian Codebook
In this section, we present the construction of the Riemannian codebook which is based on the affine invariant Riemannian metric. We recall some properties of the manifold of SPD matrices and introduce the Riemannian Gaussian mixture model.

Riemannian Geometry of the Space of SPD Matrices
The space P m of m × m real SPD matrices M satisfies the following conditions: and x T Mx > 0, (5) ∀x ∈ R m and x = 0. In this space, the Rao-Fisher metric defines a distance, called the Rao's geodesic distance [45,46], given by the length of the shortest curve connecting two points in P m . Mathematically, this definition can be stated as follows [32]. Let M 1 , M 2 be two points in P m and c : [0, 1] −→ P m a differentiable curve, with c(0) = M 1 and c(1) = M 2 . Thus, the length of curve c, denoted by L(c) is defined as: The geodesic distance d : P m × P m −→ R + between M 1 and M 2 is the infimum of L(c) with respect to all differentiable curves c. Based on the properties of Rao-Fisher metric, it has been shown that the unique curve γ fulfilling this condition is [45,46]: called the geodesic connecting M 1 and M 2 . Moreover, the distance between two points in P m can be expressed as [47]: with λ i , i = 1, . . . , m being the eigenvalues of M −1 1 M 2 .
The affine invariant Riemannian (Rao-Fisher) metric can be also used to define the Riemannian volume element [45]: For each point on the manifold M 1 ∈ P m , the tangent space at M 1 , denoted by T M 1 can be defined. This space contains the vectors V T that are tangent to all possible curves passing through M 1 . The correspondence between a point on the manifold and its tangent space can be achieved by using two operators: the Riemannian exponential mapping and the Riemannian logarithm mapping [48,49].
More precisely, the Riemannian exponential mapping for a point M 1 ∈ P m and the tangent vector V T is given by [48,49]: where exp(·) is the matrix exponential. By this transformation, the tangent vector V T can be mapped on the manifold. Further on, the inverse of the Riemannian exponential mapping is the Riemannian logarithm mapping. For two points M 1 , M 2 ∈ P m , this operator is given by [48,49]: where log(·) is the matrix logarithm. In practice, this operation gives the tangent vector V T , by transforming the geodesic γ in a straight line in the tangent space. In addition, the geodesic's length between M 1 and M 2 is equal to the norm of the tangent vector V T .

Riemannian Gaussian model
To model the space P m of SPD covariance matrices, a generative model has been introduced in [39,42]: the Riemannian Gaussian distribution (RGD). For this model, the probability density function with respect to the Riemannian volume element given in (9) is defined as follow [39,42]: whereM and σ are the distribution parameters, representing respectively the central value (centroid) and the dispersion. d(·) is the Riemannian distance given in (8) and Z(σ) is a normalization factor independent ofM [39,50].
with Γ m the multivariate Gamma function [51]. In practice, for m = 2, the normalization factor admits a closed-form expression [32], while for m > 2 the normalization factor can be computed numerically as the expectation of the product of sinh functions with respect to the multivariate normal distribution N (0, σ 2 I m ) [39]. Afterwards, a cubic spline interpolation can be used to smooth this function [52].

Mixture model for RGDs
As for the LE codebook, a generative model is considered for the construction of the Riemannian codebook. For the former, a mixture of multivariate Gaussian distribution was considered since the SPD matrices were projected on the LE space. For the construction of the Riemannian codebook, we follow a similar approach by considering that M = {M n } n=1:N , are i.i.d. samples from a mixture of K RGDs. In this case, the likelihood of M is given by: where p(M n |M k , σ k ) is the RGD defined in (12) and θ = {( k ,M k , σ k ) 1≤k≤K } is the parameter vector containing the mixture weight k , the central valueM k and the dispersion parameter σ k . Once estimated, the parameters of each mixture component represent the codewords, and the set of all K codewords gives the Riemannian codebook. Regarding the estimation, the conventional intrinsic k-means clustering algorithm can be considered [36,53]. Nevertheless, it implies the homoscedasticity assumption, for which the clusters have the same dispersion. To relax this assumption, we consider in the following the maximum likelihood estimation with the expectation maximization algorithm defined in [32].

Maximum likelihood estimation
First, let us consider the following two quantities that are defined for each mixture component k, k = 1, . . . , K: and Then, the estimated parametersθ = {(ˆ k , M k ,σ k ) 1≤k≤K } are iteratively updated based on the current value ofθ:

•
The estimated mixture weightˆ k is given by: • The estimated central value M k is computed as: In practice, (18) is solved by means of a gradient descent algorithm [54].

•
The estimated dispersionσ k is obtained as: where Φ is the inverse function of σ → σ 3 × d dσ log Z(σ).
Practically, the estimation procedure is repeated for a fixed number of iterations, or until convergence, that is until the estimated parameters remain almost stable for successive iterations. Moreover, as the estimation with the EM algorithm depends on the initial parameter setting, the EM algorithm is run several times (10 in practice) and the best result is kept (i.e., the one maximizing the log-likelihood criterion).
Based on the extracted (LE or Riemannian) codebook, the next section presents various strategies to encode a set of SPD matrices. These approaches are based whether on the LE metric or on the affine invariant Riemannian metric. In the next section, three kinds of coding approaches are reviewed, namely the bag of words (BoW) model, the vector of locally agregated descriptors (VLAD) [2,3] and the Fisher vectors (FV) [5][6][7]. Here, the main contribution is the proposition of coding approaches based on the FV model: the Log-Euclidean Fisher vectors (LE FV) and the Riemannian Fisher vectors (RFV) [40].

Feature Encoding Methods
Given the extracted codebook, the purpose of this part is to project the feature set of SPD matrices onto the codebook elements. In other words, the initial feature set is expressed using the codewords contained in the codebook. Figure 2 draws an overview of the relation between the different approaches based on the BoW, VLAD and FV models. The LE-based metric approaches appear in red while the affine invariant ones are displayed in blue. The E-VLAD descriptor is displayed in purple since it considers the Riemannian codebook combined with LE representation of the features.

Bag of Words Descriptor
One of the most common encoding methods is represented by the BoW model. With this model, a set of features is encoded in an histogram descriptor obtained by counting the number of features which are closest to each codeword of the codebook. In the beginning, this descriptor has been employed for text retrieval and categorization [10,55], by modeling a text with an histogram containing the number of occurrences of each word. Later on, the BoW model has been extended to visual categorization [56], where images are described by a set of descriptors, such as SIFT features. In such case, the "words" of the codebook are obtained by considering a clustering algorithm with the standard Euclidean metric. Recently, the BoW model has been extended to features lying in a non-Euclidean space, such as SPD matrices. In this context, two approaches have been proposed based respectively on the LE and affine invariant Riemannian metrics: • the log-Euclidean bag of words (LE BoW) [33,35]. • the bag of Riemannian words (BoRW) [36].
These two descriptors have been employed successfully for different applications, including texture and human epithelial type 2 cells classification [36], action recognition [33,35].

Log-Euclidean Bag of Words (LE BoW)
The LE BoW model has been considered in [33,35]. First, the space of covariance matrices is embedded into a vector space by considering the LE vector representation m given in (1). With this embedding, the LE BoW model can be interpreted as the BoW model in the LE space. This means that codewords are elements of the log-Euclidean codebook detailed in Section 3.1. Next, each observed SPD matrix M n is assigned to cluster k of closest codewordm k to compute the histogram descriptor. The vicinity is evaluated here as the Euclidean distance between the LE vector representation m n and the codewordm k .
The LE BoW descriptor can also be interpreted by considering the Gaussian mixture model recalled in (2). In such case, each feature m n is assigned to the cluster k, for k = 1, . . . , K according to: where p(m n |m k , Σ k ) is the multivariate Gaussian distribution given in (3). In addition, two constraints are assumed ∀k = 1, . . . , K: • the homoscedasticity assumption: • the same weight is given to all mixture components:

Bag of Riemannian Words (BoRW)
This descriptor has been introduced in [36]. Contrary to the LE BoW model, the BoRW model exploits the affine invariant Riemannian metric. For that, it considers the Riemannian codebook detailed in Section 3.2. Then, the histogram descriptor is computed by assigning each SPD matrix to the cluster k of the closest codebook elementM k , the proximity being measured with the geodesic distance recalled in (8).
As for the LE BoW descriptor, the definition of the BoRW descriptor can be obtained by the Gaussian mixture model, except that the RGD model defined in (12) is considered instead of the multivariate Gaussian distribution. Each feature M n is assigned to the cluster k, for k = 1, . . . , K according to: In addition, the two previously cited assumptions are made, that are the same dispersion and weight are given to all mixture components.
It has been shown in the literature that the performance of BoW descriptors depends on the codebook size, best results being generally obtained for large dictionaries [5]. Moreover, BoW descriptors are based only on the number of occurrences of each codeword from the dataset.
In order to increase the classification performances, second order statistics can be considered. This is the case of VLAD and FV that are presented next.

Vectors of Locally Aggregated Descriptors
VLAD descriptors have been introduced in [2] and represent a method of encoding the difference between the codewords and the features. For features lying in a Euclidean space, the codebook is composed by cluster centroids {(x k ) 1≤k≤K } obtained by clustering algorithm on the training set. Next, to encode a feature set {(x n ) 1≤n≤N }, vectors v k containing the sum of differences between codeword and feature samples assigned to it are computed for each cluster: The final VLAD descriptor is obtained as the concatenation of all vectors v k : To generalize this formalism to features lying in a Riemannian manifold, two theoretical aspects should be addressed carefully, which are the definition of a metric to describe how features are assigned to the codewords, and the definition of subtraction operator for these kind of features. By addressing these aspects, three approaches have been proposed in the literature: • the log-Euclidean vector of locally aggregated descriptors (LE VLAD) [11]. • the extrinsic vector of locally aggregated descriptors (E-VLAD) [37]. • the intrinsic Riemannian vector of locally aggregated descriptors (RVLAD) [11].

Log-Euclidean Vector of Locally Aggregated Descriptors (LE VLAD)
this descriptor has been introduced in [11] to encode a set of SPD matrices with VLAD descriptors. In this approach, VLAD descriptors are computed in the LE space. For this purpose, (24) is rewritten as: where the LE representation m n of M n belongs to the cluster c k if it is closer tom k than any other element of the LE codebook. The proximity is measured here according to the Euclidean distance between the LE vectors.

Extrinsic Vector of Locally Aggregated Descriptors (E-VLAD)
The E-VLAD descriptor is based on the LE vector representation of SPD matrices. However, contrary to the LE VLAD model, this descriptor uses the Riemannian codebook to define the Voronoï regions. It yields that: where M n belongs to the cluster c k if it is closer toM k according to the affine invariant Riemannian metric. Note also that herem k is the LE vector representation of the Riemannian codebook elementM k .
To speed-up the processing time, Faraki et al. have proposed in [37] to replace the affine invariant Riemannian metric by the Stein metric [57]. For this latter, computational cost to estimate the centroid of a set of covariance matrices is less demanding than with the affine invariant Riemannian metric since a recursive computation of the Stein center from a set of covariance matrices has been proposed in [58].
Since this approach exploits two metrics, one for the codebook creation (with the affine invariant Riemannian or Stein metric) and another for the coding step (with the LE metric), we referred it as an extrinsic method.

Riemannian Vector of Locally Aggregated Descriptors (RVLAD)
this descriptor has been introduced in [11] to propose a solution for the affine invariant Riemannian metric. More precisely, the geodesic distance [47] recalled in (8) is considered to measure similarity between SPD matrices. The affine invariant Riemannian metric is used to define the Voronoï regions) and the Riemannian logarithm mapping [48] is used to perform the subtraction on the manifold. It yields that for the RVLAD model, the vectors v k are obtained as: where LogM k (·) is the Riemannian logarithm mapping defined in (11). Please note that the vectorization operator Vec(·) is used to represent v k as a vector.
As explained in [2], the VLAD descriptor can be interpreted as a simplified non probabilistic version of the FV. In the next section, we give an explicit relationship between these two descriptors which is one of the main contribution of the paper.

Fisher Vector Descriptor
Fisher vectors (FV) are descriptors based on Fisher kernels [59]. FV measures how samples are correctly fitted by a given generative model p(X|θ). Let X = {X n } n=1:N , be a sample of N observations. The FV descriptor associated to X is the gradient of the sample log-likelihood with respect to the parameters θ of the generative model distribution, scaled by the inverse square root of the Fisher information matrix (FIM).
First, the gradient of the log-likelihood with respect to the model parameter vector θ, also known as the Fisher score (FS) U X [59], should be computed: log p(X n |θ). (29) As mentioned in [5], the gradient describes the direction in which parameters should be modified to best fit the data. In other words, the gradient of the log-likelihood with respect to a parameter describes the contribution of that parameter to the generation of a particular feature [59]. A large value of this derivative is equivalent to a large deviation from the model, suggesting that the model does not correctly fit the data.
Second, the gradient of the log-likelihood can be normalized by using the FIM I θ [59]: where E X [·] denotes the expectation over p(X |θ). It yields that the FV representation of X is given by the normalized gradient vector [5]: As reported in previous works, exploiting the FIM I θ in the derivation of FV yields to excellent results with linear classifiers [6,7,9]. However, the computation of the FIM might be quite difficult. It does not admit a close-form expression for many generative models. In such case, it can be approximated empirically by carrying out a Monte Carlo integration, but this latter can be costly especially for high dimensional data. To solve this issue, some analytical approximations can be considered [5,9]. The next part explains how the FV model can be used to encode a set of SPD matrices. Once again, two approaches are considered by using respectively the LE and the affine invariant Riemannian metrics: It yields that, the elements of the LE Fisher score (LE FS) are obtained as: wherem d k (resp. σ d k ) is the dth element of vectorm k (resp. σ k ). Please note that to ensure the constraints of positivity and sum-to-one for the weights k , the derivative of the log-likelihood with respect to this parameter is computed by taking into consideration the soft-max parametrization as proposed in [9,60]: Under the assumption of nearly hard assignment, that is the soft assignment distribution γ k (m n ) is sharply peaked on a single value of k for any observation m n , the FIM I θ is diagonal and admits a close-form expression [9]. It yields that the LE FV of M is obtained as:

Riemannian Fisher Vectors (RFV)
Ilea et al. have proposed in [40] an approach to encode a set of SPD matrices with FS based on the affine invariant Riemannian metric: the Riemannian Fisher score (RFS). In this method, the generative model is a mixture of RGDs [39] as presented in Section 3.2.2. By following the same procedure as before, the RFS is obtained by computing the derivatives of the log-likelihood function with respect to the distribution parameters θ = {( k ,M k , σ k ) 1≤k≤K }. It yields that [40]: where LogM k (·) is the Riemannian logarithm mapping in (11) and Z (σ k ) is the derivative of Z(σ k ) with respect to σ k . The function Z (σ) can be computed numerically by a Monte Carlo integration, in a similar way to the one for the normalization factor Z(σ) (see Section 3.2.2).
In these expressions, γ k (M n ) represents the probability that the feature M n is generated by the kth mixture component, computed as: By comparing (33)-(35) with (40)-(42), one can directly notice the similarity between the LE FS and the RFS. In these equations, vector difference in the LE FS is replaced by log map function in the RFS. Similarly, Euclidean distance in the LE FS is replaced by geodesic distance in the RFS.
In [40], Ilea et al. have not exploited the FIM. In this paper, we propose to add this term in order to define the Riemannian Fisher vectors (RFV). To derive the FIM, the same assumption as the one given in Section 4.3.1 should be made, i.e., the assumption of nearly hard assignment, that is the soft assignment distribution γ k (M n ) is sharply peaked on a single value of k for any observation M n . In that case, the FIM is block diagonal and admits a close-form expression detailed in [61]. In this paper, Zanini et al. have used the FIM to propose an online algorithm for estimating the parameters of a Riemannian Gaussian mixture model. Here, we propose to add this matrix in another context which is the derivation of a descriptor: the Riemannian FV.
First, let's recall some elements regarding the derivation of the FIM. This block diagonal matrix is composed of three terms, one for the weight, one for the centroid and one for the dispersion.

•
For the weight term, the same procedure as the one used in the conventional Euclidean framework can employed [9]. In [61], they proposed another way to derive this term by using the notation s = [ √ 1 , . . . , √ K ] and observing that s belongs to a Riemannian manifold (more precisely the (K − 1)-sphere S K−1 ). These two approaches yield exactly to the same final result. • For the dispersion parameter, the notation η = − 1 2σ 2 is considered to ease the mathematical derivation. Since this parameter is real, the conventional Euclidean framework is employed to derive the FIM. The only difference is that the Euclidean distance is replaced by the geodesic one.
For more information on the derivation of the FIM for the Riemannian Gaussian mixture model, the interested reader is referred to [61]. To summarize, the elements of the block-diagonal FIM for the Riemannian Gaussian mixture model are defined by: where I K is the K × K identity matrix, ψ(η) = log (Z(σ)) and ψ (·) (resp. ψ (·)) are the first (resp. the second) order derivatives of the ψ(·) function with respect to η. ψ 2 (η) = ψ (η) + 1 2η . Now that the FIM and the FS score are obtained for the Riemannian Gaussian mixture model, we can define the RFV by combining (40) to (42) and (44) to (47) in (31). It yields that: Unsurprisingly, this definition of the RFV can be interpreted as a direct extension of the FV computed in the Euclidean case to the Riemannian case. In particular (37)-(39) are retrieved when the normalization factor Z(σ) is set to σ √ 2π in (48), (50) and (51). In the end, the RFVs are obtained by concatenating some, or all of the derivatives in (48)- (51). Note also that since (49) is a matrix, the vectorization operator Vec(·) is used to represent it as a vector.

Relation with VLAD
As stated before, the VLAD descriptor can be retrieved from the FV model. In this case, only the derivatives with respect to the central element (m d k orM k ) are considered. Two assumptions are also made: • the hard assignment scheme, that is: where M ∈ c k are the elements assigned to cluster c k and k = 1, . . . , K, • the homoscedasticity assumption, that is σ k = σ , ∀k = 1, . . . , K.
By taking into account these hypotheses, it can be noticed that (33) reduces to (26), confirming that LE FV are a generalization of LE VLAD descriptors. The same remark can be done for the approach exploiting the affine invariant Riemannian metric where the RFV model can be viewed as an extension of the RVLAD model. The proposed RFV gives a mathematical explanation of the RVLAD descriptor which has been introduced in [11] by an analogy between the Euclidean space (for the VLAD descriptor) and the Riemannian manifold (for the RVLAD descriptor).

Post-Processing
Once the set of SPD matrices is encoded by one of the previously exposed coding methods (BoW, VLAD, FS or FV), a post-processing step is classically employed. In the framework of feature coding, the post-processing step consists in two possible normalization steps: the power and 2 normalization. These operations are detailed next.

Power Normalization
The purpose of this normalization method is to correct the independence assumption that is usually made on the image patches [7]. For the same vector v, its power-normalized version v power is obtained as: where 0 < ρ ≤ 1, and sign(·) is the signum function and | · | is the absolute value. In practice, ρ is set to 1 2 , as suggested in [9].

2 Normalization
This normalization method has been proposed in [6] to minimize the influence of the background information on the image signature. For a vector v, its normalized version v L 2 is computed as: where · 2 is the L 2 norm. Depending on the considered coding method, one or both normalization steps are applied. For instance, for VLAD, FS and FV-based methods, both normalizations are used [36,40], while for BoW based methods only the 2 normalization is considered [33]. Table 1 draws an overview of the different coding methods. As seen before, two metrics can be considered, namely the LE and the affine invariant Riemannian metrics. This yields to two Gaussian mixture models: a mixture of multivariate Gaussian distributions and a mixture of Riemannian Gaussian distributions. These mixture models are the central point in the computation of the codebook which are further used to encode the features. In this table and in the following ones, the proposed coding methods are displayed in gray.

Synthesis
withM k ∈ P m , σ k ∈ R and k ∈ R. and k ∈ R.

Bag of Words (BoW)
Log-Euclidean BoW (LE BoW) [33,35] Bag of Riemannian Words (BoRW) [36] Histogram based on the decision rule Histogram based on the decision rule arg max k k p(m n |m k , Σ k ) arg max k k p(M n |M k , σ k )

Log-Euclidean Fisher Vectors (LE FV) Riemannian Fisher Vectors (RFV)
As observed, a direct parallel can be drawn between the different coding methods (BoW, VLAD, FS and FV). More precisely, it is interesting to note how the conventional coding methods used for descriptors lying in R m(m+1) 2 are adapted to covariance matrix descriptors.

Application to Image Classification
This section introduces some applications to image classification. Two experiments are conducted, one for texture image classification and one for head pose image classification. The aim of these experiments is three-fold. The first objective is to compare two Riemannian metrics: the log-Euclidean and the affine invariant Riemmannian metrics. The second objective is to analyze the potential of the proposed FV-based methods compared to the recently proposed BoW and VLAD-based models. In addition, finally, the third objective is to evaluate the advantage of including the FIM in the derivation of the FVs, i.e., comparing the performance between FS and FV.

Image Databases
To answer these questions, a first experiment is conducted on four conventional texture databases, namely the VisTex [62], Brodatz [63], Outex-TC-00013 [64] and USPtex [65] databases. Some examples of texture images issued from these four texture databases are displayed in Figure 3.
The VisTex database is composed of 40 texture images of size 512 × 512 pixels. In the following, each texture image is divided into 64 non-overlapping images of size 64 × 64 pixels, yielding to a database of 2560 images. The grayscale Brodatz database contains 112 textures images of size 640 × 640 pixels which represent a large variety of natural textures. Each one is divided into 25 non-overlapping images of size 128 × 128 pixels, thus creating 2800 images in total (i.e., 112 classes with 25 images/class). The Outex database consists of a dataset of 68 texture classes (canvas, stone, wood, . . . ) with 20 image samples per class of size 128 × 128 pixels. In addition, finally, The USPtex database is composed of 191 texture classes with 12 image samples of size 128 × 128 pixels. Table 2 summarizes the main characteristics of each of these four databases.

. Context
As shown in Figure 1, the first stage is the feature extraction step which consists in representing each texture image by a set of covariance matrices. Since the experiment purpose is not to find the best classification accuracies on these databases, but rather to compare the different strategies (choice of the metric, influence of the coding model) on the same features, we have adopted the simple but effective region covariance descriptors (RcovD) used in [34]. The extracted RCovD are the estimated covariance matrices of vectors v(x, y) computed on sliding patches of size 15 × 15 pixels where: v(x, y) = I(x, y), ∂I(x,y) ∂x , ∂I(x,y) ∂y In this experiment, the patches are overlapped by 50%. The fast covariance matrix computation algorithm based on integral images presented in [34] is adopted to speed-up the computation time of this feature extraction step. It yields that each texture class is composed by a set {M 1 , . . . , M N } of N covariance matrices, that are elements in P 5 .
For each class, codewords are represented by the estimated parameters of the mixture of K Gaussian distributions. For this experiment, the number of modes K is set to 3. In the end, the codebook is obtained by concatenating the previously extracted codewords (for each texture class). Please note that the same number of modes K has been considered for each class and has been set experimentally to 3 which represents a good trade-off between the model complexity and the within-class diversity. This parameter has been fixed for all these experiments since the aim is to fairly compare the different coding strategies for the same codebook.
Once the codebook is created, the covariance matrices of each image are encoded by one of the previously described method (namely BoW, VLAD, FS or FV) adapted to the LE or affine invariant Riemannian metric. Then after some post-processing (power and/or 2 normalization), the obtained feature vectors are classified. Here, the SVM classifier with Gaussian kernel is used. The parameter of the Gaussian kernel is optimized by using a cross validation procedure on the training set.
The whole procedure is repeated 10 times for different training and testing sets. Each time, half of the database is used for training while the remaining half is used for testing. Tables 3-6 show the classification performance in term of overall accuracy (mean ± standard deviation) on the VisTex, Brodatz, Outex and USPtex databases. In these tables, the best classification results are displayed.
As the FS and FV descriptors are obtained by deriving the log-likelihood function with respect to the weight, dispersion and centroid parameters, the contribution of each term to the classification accuracy can be analyzed. Therefore, different versions of the FS and FV descriptors can be considered to analyze separately the contribution of each term or by combining these different terms. For example, the row "LE FS/RFS:M" indicates the classification results when only the derivatives with respect to the centroid are considered to derive the FS (see (33) and (40)). In the following, only the results employing the mean are presented since the state-of-the-art have already proved that the mean provides the most significant information [6,7].
Please note that the use of the FIM in the derivation of the FV allows to improve the classification accuracy. As observed for the four considered databases, a gain of about 1% to 3% is obtained when comparing "LE FV/RFV:M" with "LE FS/RFS:M".
For these four experiments on texture image classification, the proposed FV descriptors outperform the state-of-the-art BoW and VLAD-based descriptors. Classifying with the best FV descriptor yields to a gain of about 1% to 4% compared to the best BoW and VLAD-based descriptors.

Comparison between Anisotropic and Isotropic Models
As observed in Tables 3-6, the performance for the LE metric are generally better than that with the affine invariant Riemannian metric. However, both approaches are not directly comparable since an anisotropic model is considered for the LE metric while an isotropic model is used for the affine invariant Riemannian metric. Indeed, for the former the dispersion for the Gaussian mixture model is a diagonal matrix Σ k while for the latter the dispersion σ k is a scalar. To provide a fairer comparison between these two approaches, an experiment is conducted to illustrate if the observed gain with the LE metric comes from the metric or from the fact that the Gaussian model is anisotropic.
For the LE metric, an isotropic model can be built by considering that Σ k = σ 2 k I m(m+1) This model will not be considered in the following. Table 7 shows the classification results obtained on the four considered texture databases. Here, the performances are displayed for the FV descriptor computed by using the derivative with respect to the centroid (i.e., LE FV/RFV:M). It can be noticed that for the LE metric, an anisotropic model yields to a significant gain of about 4% to 7% compared to an isotropic model. More interestingly, for an istropic model, descriptors based on the affine invariant Riemannian metric yield to better performances than that obtained with the LE metric. A gain of about 2% to 6% is observed. These experiments clearly illustrate that the gain observed in Tables 3-6 for the LE metric comes better from the anistropicty of the Gaussian mixture model than from the metric definition. According to these observations, it is expected that classifying with FV issued from anistropic Riemannian Gaussian mixture model will improve the performance. This point will be subject of future research works including the derivation of normalization factor of the anistropic Riemannian Gaussian model and the computation of the FIM.

Context
The aim of this second experiment is to illustrate how the proposed framework can be used for classifying a set of covariance matrices of larger dimension. Here, the head pose classification problem is investigated on the HOCoffee dataset [67]. This dataset contains 18,117 head images of size 50 × 50 pixels with six head pose classes (front left, front, front right, left, rear and right). Some examples of images of each class (one class per row) are displayed in Figure 4. It has a predefined experiment protocol where 9522 images are used for training and the remaining 8595 images are used for testing. We follow the same experiment protocol as in [11]. The extracted RCovD are the estimated covariance matrices of vectors v(x, y) computed on sliding patches of size 15 × 15 pixels where: v(x, y) = I L (x, y), I a (x, y), I b (x, y), I 2 x (x, y) + I 2 y (x, y), arctan with I c (x, y), c ∈ {L, a, b} are the CIELab color information for the pixel at coordinate (x, y), I x (x, y) and I y (x, y) are the first order luminance derivatives, and G i (x, y) denotes the response of the i-th Difference Of Offset Gaussian (DOOG) filter-bank centered at position (x, y) of I L . An overlap of 50% is considered to compute the covariance matrices. Hence, each image in the database is represented by a set of 25 covariance matrices of size 13 × 13. As for the previous experiment, 3 atoms per class are considered to compute the codebook.  Table 8 shows the classification accuracy on the HOCoffee dataset. Similar conclusions can be drawn with the previous experiment on texture image classification. The use of the FIM in the derivation of the FV still allows to improve the classification accuracy. The best performances are obtained for the LE metric compared to the affine invariant Riemannian metric. Nevertheless, for this latter, the performance are quite low, especially for the FV obtained by deriving with respect to the dispersion parameter. Please note that for this experiment the RVLAD descriptor allows to obtain better classification accuracy than the best RFV (70.6% vs. 67.9%). To understand why the performance with RFV are relatively low for the HOCoffee dataset, an experiment is conducted to see if the dispersion parameter can be considered with confidence.

Estimation Performance
This section presents simulation results to evaluate the performance of the estimator of the dispersion parameter for Gaussian models based on the LE and affine invariant Riemannian metrics. For all these experiments,M ij = ρ |i−j| for i, j ∈ 0, m − 1 . (57) ρ is set to 0.7 in the following. For the LE metric, N i.i.d. vector samples (x 1 , . . . , x N ) are generated according to a multivariate Gaussian distribution N (m, Σ), with Σ = σ 2 I m(m+1) 2 . For the affine invariant Riemannian metric, N i.i.d. covariance matrix samples are generated according the Riemannian Gaussian distribution defined in Section 3.2.2. In the following, 1000 Monte Carlo runs have been used to evaluate the performance of the estimation algorithm. Figure 5 draws the evolution of the root mean square error (RMSE) of the dispersion parameter σ for Gaussian models based on the LE and affine invariant Riemannian metrics as a function of σ. The red curve corresponds to an experiment with covariance matrices of dimension 5 × 5, while the blue one is for 13 × 13 covariance matrices. In this figure, 1000 (resp. 10,000) covariance matrices samples issued from the Gaussian model are generated to plot the solid (resp. the dashed) curve. This yields that the texture classification experiment of Section 5 is mimicked with the solid red curve while the head pose classification experiment is mimicked with the dashed blue one. As observed in Figure 5a for the LE m etric, the RMSE of the dispersion parameter is mainly influenced by the number of generated samples N. For this LE metric, the dimension of the covariance matrices has less importance, since the red and blue curves are superposed. Nevertheless, for the affine invariant Riemannian metric in Figure 5b, the RMSE of the dispersion parameter is greatly influenced by the dimension of the covariance matrices, especially for large values of σ. For the five databases, Figure 6 shows the boxplots of the dispersion parameter for the LE (Figure 6a) and Riemannian (Figure 6b) codebooks. Please note that since two different metrics are considered, the amplitude value of the dispersion parameter are not directly comparable between Figure 6a,b. However, for a given metric, it is possible to analyze the variability of the dispersion parameter for the five experiments. As observed in Figure 6b, the estimated dispersion parameter σ k for the Riemannian codebook takes larger values for the HOCoffee dataset than that for the four texture datasets. For the former, the estimated dispersion parameters of the Riemannian codebook are larger than 0.4 which corresponds to the area in Figure 5b where the RMSE of σ increases greatly. This explains why the performance with the RFV (especially when the dispersion is considered) are relatively low compared to the LE FV. Indeed, as observed in Figure 5a for the LE codebook, the dispersion parameters are much more comparable for the five datasets and the dimension m of the observed covariance matrix has less impact on the RMSE of σ for the LE metric.

Computation Time
The computation time can be separated in two parts: • The first one concerns the time used in learning stage to generate the codebook.

•
The second one concerns the time used to encode an image.
Obviously the codebook generation step requires much more time than the coding step. However, this codebook generation step can be done offline. This is similar to a deep learning approach where the estimation of the model takes much more time than the classification itself. Table 9 summarizes these computation times for the experiment on the VisTex database. For the coding method, the LE FV and RFV descriptors with only the derivative with respect to the centroidm orM are considered. All the implementations are carried out using MATLAB 2017 on a PC machine Core i7-4790 3.6GHz, 16GB RAM. As expected, the LE metric allows to faster the computation time compared to the affine invariant Riemannian metric. A gain of a factor of 6 is observed for the coding time with the log-Euclidean metric for 5 × 5 covariance matrices.

Conclusions
Starting from the Gaussian mixture model (for the LE metric) and the Riemannian Gaussian mixture model (for the affine invariant Riemannian metric), we have proposed a unified view of coding methods. The proposed LE FV and RFV can be interpreted as a generalization of the BoW and VLAD-based approaches. The experimental results have shown that: (i) the use of the FIM in the derivation of the FV allows to improve the classification accuracy, (ii) the proposed FV descriptors outperform the state-of-the-art BoW and VLAD-based descriptors, and (iii) the descriptors based on the LE metric lead to better classification results than those based on the affine invariant Riemannian metric. For this latter observation, the gain observed with the LE metric comes better from the anistropicty of the Gaussian mixture model than on the metric itself. For isotropic models, FV described issued from the affine invariant Riemannian metric leads to better results than those obtained with the LE metric. It is hence expected that the definition of a FV issued from an anistropic Riemannian Gaussian mixture model will improve the performance. This point represents one of the main perspective of this research work.
For larger covariance matrices, the last experiment on head pose classification has illustrated the limits of the RFV issued from the Riemannian Gaussian mixture model. It has been shown that the root mean square error of the dispersion parameter σ can be large for high value of σ (σ > 0.4). In that case, the LE FV are a good alternative to the RFV.
Future works will include the use of the proposed FV coding for covariance matrices descriptors in a hybrid classification architecture which will combine them with convolutional neural networks [17][18][19].
Author Contributions: All the authors contributed equally for the mathematical development and the specification of the algorithms. I.I. and L.B. conducted the experiments and wrote the paper. Y.B. gave the central idea of the paper and managed the main tasks and experiments. All the authors read and approved the final manuscript.