Measure of Similarity between GMMs by Embedding of the Parameter Space that Preserves KL Divergence

: In this work, we deliver a novel measure of similarity between Gaussian mixture models (GMMs) by neighborhood preserving embedding (NPE) of the parameter space, that projects components of GMMs, which by our assumption lie close to lower dimensional manifold. By doing so, we obtain a transformation from the original high-dimensional parameter space, into a much lower-dimensional resulting parameter space. Therefore, resolving the distance between two GMMs is reduced to (taking the account of the corresponding weights) calculating the distance between sets of lower-dimensional Euclidean vectors. Much better trade-off between the recognition accuracy and the computational complexity is achieved in comparison to measures utilizing distances between Gaussian components evaluated in the original parameter space. The proposed measure is much more efﬁcient in machine learning tasks that operate on large data sets, as in such tasks, the required number of overall Gaussian components is always large. Artiﬁcial, as well as real-world experiments are conducted, showing much better trade-off between recognition accuracy and computational complexity of the proposed measure, in comparison to all baseline measures of similarity between GMMs tested in this paper.


Introduction
The Gaussian Mixture Models have been used for many years in pattern recognition, computer vision, and other machine learning systems, due to their vast capability to model arbitrary distributions and their simplicity. The comparison between two GMMs plays an important role in many classification problems in the areas of machine learning and pattern recognition, due to the fact that arbitrary pdf could be successfully modeled by a GMM, knowing the exact number of "modes" of that particular pdf. Those problems include, but are not limited to speaker verification and/or recognition [1], content-based image matching and retrieval [2,3] (also classification [4], segmentation, and tracking), texture recognition [2,3,[5][6][7][8], genre classification, etc. In the area of Variational Auto-encoders (VAE), extensively used in emerging field of deep learning, GMMs have recently found their gateway (see [9]) with promising results. Many authors considered the problem of developing the efficient similarity measures between GMMs to be applied in such tasks (see for example [1][2][3]7,10]). The first group of those measures utilize informational distances. In some early works, Chernoff distance, Bhattacharyya distance, and Matusita distance were explored (see [11][12][13]). Nevertheless, Kullback-Leibler (KL) divergence [14] emerged as the most natural and effective informational distance measure. It is actually an informational distance between two probability distributions p and q. While the solution for the KL divergence between two Gaussian components exists in the analytic, i.e., closed-form, there is no analytic solution for the KL divergence between arbitrary GMMs, which is very important for various applications. The straight-forward solution of the mentioned problem is to calculate KL divergence between two GMMs via the Monte-Carlo method (see [10]). However, it is almost always an unacceptably computationally expensive solution, especially when dealing with a huge amount of data and large dimensionality of the underlying feature space. Thus, many researchers proposed different approximations for the KL divergence, trying to obtain acceptable precision in recognition tasks of interest. In [2], one such approximation is proposed and applied in image retrieval task as a measure of similarity between images. In [10], lower and upper approximation bounds are delivered by the same authors. Experiments are conducted on synthetic data, as well as in speaker verification task. In [1], accurate approximation built upon Unscented Transform is delivered and applied within a speaker recognition task in a computationally efficient manner. In [15], the authors proposed a novel approach to online estimation of pdf's, based on kernel density estimation. The second group of measures utilize informational geometry. In [16], the authors proposed a metric on the space of multivariate Gaussians by parameterizing that space as the Riemannian symmetric space. In [3], motivated by the mentioned paper and the efficient application of vector-based Earth-Movers Distance (EMD) metrics (see [17]) applied in various recognition tasks (see for example [18]), and their extension to GMMs in texture classification task proposed in [6], the authors proposed sparse EMD methodology for Image Matching based on GMMs. An unsupervised sparse learning methodology is presented in order to construct EMD measure, where the sparse property of the underlying problem is assumed. In experiments, it proved to be more efficient and robust than the conventional EMD measure. Their EMD approach utilizes information geometry based ground distances between component Gaussians, introduced in [16]. On the other hand, their supervised sparse EMD approach uses an effective pair-wise-based method in order to learn GMM EMD metric among GMMs. Both of these methods were evaluated using synthetic as well as real data, as part of texture recognition and image retrieval tasks. Higher recognition accuracy is obtained in comparison to some state-of-the-art methods. In [7], the method proposed in [3] was expanded. A study concerning ground distances and image features such as Local Binary Pattern (LBP) descriptor, SIFT, high-level features generated by deep convolution networks, covariance descriptor, and Gabor filter is also presented.
One of the main issues in pattern recognition and machine learning as a whole is that data are represented in high-dimensional spaces. This problem appears in many applications, such as information retrieval (and especially image retrieval), text categorization, texture recognition, and appearance-based object recognition. Thus, the goal is to develop the appropriate representation for complex data. The variety of dimensionality reduction techniques are designed in order to cope with this issue, targeting problems such as "curse of dimensionality" and computational complexity in the recognition phase of ML task. They tend to increase discrimination of the transformed features, which now lie either on a subspace of the original high dimensional feature space, or more generally, on some lower dimensional manifold embedded into it. Those are the so called manifold learning techniques. Some of the most commonly used subspace techniques, such as Linear Discriminant Analysis (LDA) [19] and maximum margin criterion (MMC) [3,20], trained in a supervised manner, or for example Principal Component Analysis (PCA) [21], trained in an unsupervised manner, handle this issue by trying to increase discrimination of the transformed features, and to decrease computational complexity during recognition. Some of the frequently used manifold learning techniques are Isomap [22], Laplacian Eigenmaps (LE) [23], Locality Preserving Projections (LPP) [24] (approach based on LE), and Local Linear Embedding (LLE) [25]. The LE method explores the connection between the graph Laplasian and the Laplace Beltrami operator, in order to project features in a locally-preserving manner. Nevertheless, it is only to be used in various spectral clustering applications, as it cannot deal with unseen data. An approach based on LE, called Locality Preserving Projections (LPP) (see [24]), manages to resolve the previous problem by learning linear projective map which best "fits" in the manifold, therefore preserving local properties of the data in the transformed space. In this way, we can transform any unseen data into a low-dimensional space, which can be applied in a number of pattern recognition and machine learning tasks. In [26], the authors proposed the Neighborhood Preserving Embedding (NPE) methodology that, similarly to LPP, aims to preserve the local neighborhood structure on data manifold, but it learns not only the projective matrix which projects the original features to lower-dimensional Euclidean feature space, but also, as an intermediate optimization step, the weights that extract the neighborhood information in the original feature space. In [27], some of the previously mentioned methods, such as LE and LLE, are generalized. An example of LE is given for the Riemannian manifold of positive-definite matrices, and applied as part of image segmentation task. Note that the mentioned dimensionality reduction techniques are applicable in many recent engineering and scientific fields, such as social network analysis and intelligent communications (see for example [28,29] , published within a special issue presented in an editorial article [30]).
In many machine learning systems, the trade-off between recognition accuracy and computational efficiency is very important for those to be applicable in real-life. In this work, we construct a novel measure of similarity between arbitrary GMMs, with an emphasis on lowering the complexity of the representation of all GMMs used in a particular system. Our aim is to investigate the assumption that the parameters of full covariance Gaussians, i.e., the components of GMMs, lie close to each other in a lower-dimensional surface embedded in the cone of positive definite matrices for the particular recognition task. Note that this is contrary to the assumption that data themselves lie on the lowerdimensional manifold embedded in the feature space. We actually use the NPE-based idea in order to reduce the projection matrix A, but we apply it on the parameter space of Gaussian components. The matrix A projects the parameters of Gaussian components to a lower-dimensional space. Local neighborhood information from the original parameter space is preserved. Let N (µ i , Σ i ), i = 1, . . . , M be a set of all Gaussian components, and M is the number of Gaussians for the particular task. We assume that parameters of any multivariate Gaussian component N (µ i , Σ i ), given as vectorized pair (µ i , Σ i ), live in a high-dimensional parameter space. Each Gaussian component is then assigned to a node of undirected weighted graph. The graph weights W ij are learned in the intermediate optimization step, forming the weight matrix W, where instead of the Euclidean distance figuring in the particular cost functional that is used in baseline NPE operating on feature space, we use a specified measure of similarity between Gaussian components and plug it into the cost functional. The ground distances between Gaussians N(µ i , Σ i ) and N(µ j , Σ j ), proposed in [3,16], are based on information geometry. We name the proposed GMM similarity measure as GMM-NPE.

GMM Similarity Measures
KL divergence is the most natural measure between probability distributions p and q. The measure is defined as KL(p||q) = R d p(x) log p(x) q(x) dx. However, as mentioned in the previous section, in the case of GMMs, it cannot be expressed as the closed-form solution.
The straightforward, but at the same time the most expensive, is a computation calculated by using the standard Monte-Carlo method (see [31]). The idea is to sample the probability distribution f by using i.i.d. samples g(x) = KL( f ||g). It is given by: Although it is the most accurate, the Monte-Carlo approximation (1) is computationally unacceptably expensive in real world applications, especially in recent years, when there is a huge amount of data present (big data) in almost all potential areas of interest. In order to cope with the mentioned problem, i.e., to obtain fast, but at the same time accurate, approximation, various approximations of the KL-divergence between two GMMs are proposed in . The roughest approximation is based on the convexity of the KL-divergence [32] and for two GMMs f = ∑ n i=1 α i f i and g = ∑ m j=1 β j g j , it holds where f i = N (Σ i , µ i ) and g j = N (Σ j , µ j ) are Gaussian components of the corresponding mixtures, while α i > 0, β j > 0 are corresponding weights, satisfying ∑ i α i = 1, ∑ j β j = 1.
The "roughest" approximation by upper bound (2), yielding the weighted average version given by plays special role in the case when Gaussians from different GMMs stand far from each other. On the other hand, KL divergence KL( f i ||g j ) between corresponding Gaussians exists in the closed-form given by so that (3) is computationally much cheaper than the Monte-Carlo approximation (1). Various approximations of the KL divergence between two GMMs were proposed in [1,2,10] and efficiently applied in real world problems, such as speech recognition, image retrieval, or speaker identification. For example, in [2], the Matching-based Approximation given by is proposed, based on the assumption that the element g j , i.e., the one that is most proximate to f i , dominates the integral f i log g. Motivated by (5), more efficient matching based approximation is given by showing good performances when the Gaussians figuring in f and those figuring in g are mostly far apart, but shows inappropriate if there is significant overlapping among Gaussian components of f and g. The authors proposed the Unscented Transform-based approximation as a way to deal with those overlapping situations. The Unscented Transformation is a mathematical function used to estimate the statistics of a random variable to which a nonlinear transformation is applied (see [33]). If it holds that KL( f ||g) = where √ Σ i k is the k-th column of the matrix square root of Σ i . Integrals R d f log f and R d f log f are now approximated in the previous manner, so that for second integral we have and similarly for the first. Thus, the KL UC ( f ||g) is obtained as above. GMM distance which utilizes KL divergence KL( f i ||g j ) between Gaussian components in order to obtain an approximate KL divergence between full GMMs is Variational approximation is proposed in [31] (see also [10]), given by Earth-Movers Distance (EMD) methodology motivated various recognition tasks (see for example [17,18]). Based on that, the authors in [6] proposed EMD to measure the distributional similarity by sets of the Gaussian components representing texture classes. We denote it as EMD-KL measure. In [3], the authors incorporate ground distances between component Gaussians into the unsupervised sparse EMD-based distance metrics between GMMs, using the perspective from the Riemannian geometry and the work delivered in [16]. The first one is based on Lie Groups and it performs better when incorporated into the sparse EMD-based measure of similarity between GMMs than the second one, based on the products of Lie groups. We denote it as SR-EMD measure in the rest of the text.

NPE Dimensionality Reduction on Euclidean Data
Unlike PCA, which aims to preserve the global Euclidean structure, and similarly to LPP (see [24]), the nonlinear dimensionality reduction technique NPE [26] aims to preserve the local manifold structure of the input data. Given an embedded set of data points in the configuration space (they lie on a low dimensional manifold, i.e., it is assumed that the samples from the same class probably lie close to each other in the input space), we first build a weight matrix W ∈ R m×d , which describes the relationship between data points. Namely, if we assume that data are embedded in the Euclidean R d space, each data point x i ∈ R d is represented as the linear combination of neighboring data points, where for the neighboring data point x j , the coefficients w ij ∈ R in the weight matrix represent the "local proximity" of those two points in the configuration space. The goal is to find the optimal embedding in order to preserve the neighborhood structure in the reduced space. The NPE procedure consists of the following steps:

1.
Constructing an adjacency graph: Let us consider a graph with m ∈ N nodes, where the i-th node corresponds to the data point x i . One way to construct the adjacency graph is to use K nearest neighbors (KNN), where we direct an edge from node i to j if x j is among the K nearest neighbors of x i . The other one is neighborhood: Put an edge between nodes i and j if Computing the weights: Let W denote the weight matrix with W ij > 0 if there is an edge from node i to node j, and W ij = 0 if there is no such edge. The weights on the edges can be computed by solving the following minimization problem: 3.
Computing the projections: In order to compute the projections, we need to solve the following optimization problem: which, by imposing constraint a T XX T a = 1 and by using the Lagrange multipliers, reduces to the following eigenvalue problem: Since M is symmetric and positive semi-definite, its eigenvalues are real and nonnegative. By taking the largest l ∈ N, l << d eigenvalues λ 0 , . . . , λ l−1 , and the corresponding l eigenvectors a 0 , . . . , a l−1 , we obtain the projection matrix A = [a 1 | · · · |a l−1 ] ∈ R l×d and the embedding x i → y i = Ax i , now projecting from the high-dimensional R d to the low-dimensional R l Euclidean space. Readers can find more details on the subject in [26].

GMM Similarity Measure by the KL Divergence Preserving NPE Embedding of the Parameter Space
We propose a novel measure of similarity between arbitrary GMMs by utilizing the NPE-based technique and the KL divergence type ground distance between the Gaussian embedded components, i.e., their parameters, instead of the Euclidean distance between some observations, as in the standard NPE procedure used as a feature dimensionality reduction technique.
The first step is to learn the projective matrix A in the neighborhood preserving manner with respect to informational ground distance, i.e., the (non-symmetric) KL divergence between Gaussian components of GMMs used, and to project those (vectorized) parameters into the low-dimensional Euclidean parameter space. Our goal is to preserve the local neighborhood information which exists in the original parameter space, while dealing with much lower-dimensional space of transformed parameters. The aim is to obtain the best possible trade-off between the recognition precision and computational efficiency in a particular pattern recognition task. We call it the NPE-based measure of similarity between GMMs and denote it further by GMM-NPE.
The second step is to aggregate the non-negative real value which represents a measure between two particular GMMs. For that purpose, we compare the transformed "clouds" of lower dimensional Euclidean parameter vectors corresponding to the original Gaussian components of GMMs used, pondered by their belonging weights. The first, the simpler technique that we use is based on aggregation operators (the weighted max-min operator and maximum of the weighted sums operator in particular), which we apply on "clouds" of lower dimensional Euclidean parameter vectors in order to aggregate value representing the final measure between two GMMs. Note that, regardless of the usage of non-symmetric KL divergence in the first step, i.e., in the calculation of the projective matrix A, the properties of the invoked measure in terms of the symmetry, satisfying the triangle inequality, etc., depends on the second step, i.e., on the type of aggregation of value of the measure. We will comment later on those properties.

KL Divergence Type Ground Distance, Forming the NPE-Type Weights and the Projection Matrix
The goal is to use the NPE-like approach in order to obtain the projection matrix A which transforms vectorized representatives of P i ∈ Sym + (d + 1) corresponding to Gaussian components g i = N (Σ i , µ i ), i = 1, . . . , M featuring in GMMs, where M represents the overall number of components and d is the dimension of the underlying feature space. Then, as explained previously, the measure of similarity comparing the "clouds" of pondered Euclidean vectors is to be used in order to obtain the final value of GMM measure.
To apply an NPE-like approach, we start from the fact that a set of multivariate Gaussians is a Riemannian manifold and that d-dimensional multivariate Gaussian components g = N (µ, Σ) can be embedded into Sym + (d + 1), i.e., a cone embedded in n = d(d + 1)/2 + d Euclidean dimensional space and also a Riemannian manifold [3,16]. It can be conducted as follows: |Σ| > 0 denotes the determinant of the covariance matrix of Gaussian component g. For the detailed mathematical theory behind the embedding (12), one can refer to [16]. We invoke the assumption that any representative P i ∈ Sym + (d + 1) can be approximated as the non-negative weighted sum of neighbors P j in the following way: where N (i) is the set of indices of neighboring representatives, i.e., the representatives is given by the expression (4). Thus, we obtain the following optimization problem: which reduces to M independent optimization problems given below, for i = 1, . . . , M: where the constraint 0 P i P i ensures that the residual is positive semi-definite, i.e., E i = P i −P i 0. By using (4), we have the following considerations: and thus, A more efficient way to achieve that only a few "neighbors" effect P i is to include sparsity constrain in the form of l 1 norm of the weight matrix W (which is the convex relaxation of the l 0 norm). Thus, we include the additional term λ W 1 in the penalty function (14), where λ > 0 is a parameter representing the trade-off between sparser representation and closer approximation. The following sparse convex problem is obtained (similar as in [34]): which is the final problem that we solve in order to obtain the weight matrix W. Note that W ij ≥ 0 ensures that the following condition is satisfied ∑ M j=1 W ijP (i) j 0. The above formulation of tensor sparse coding is associated with the general class of optimization problems denoted as determinant maximization problems, or MAXDET [35], while semidefinite programming (SDP) and linear programming (LP) are its special cases. These problems are convex and could be solved by a class of interior-point methods (see for example [36]). In order to implement the actual optimization, we used CVX [37].
Forming the projection matrix A which projects the vectorized parameters (corresponding to P i , i.e., the Gaussian representatives p i ),ṽ i = (P i ) ∈ R n , n = d(d + 1)/2, i = 1, . . . , M, into the lower l-dimensional Euclidean parameter space, with n l, is the next step. It is similar to step 3 from Section 3, and thus includes solving the spectral problem (11).

Constructing the GMM-NPE Similarity Measure
The remaining task in constructing the final GMM-NPE similarity measure is to aggregate the non-negative real value which represents the measure of similarity between two particular GMMs. Actually, we have to compare the transformed "clouds" of lower l-dimensional Euclidean parameter vectors, with l n, n = d(d + 1)/2, corresponding to the original Gaussian components of GMMs used. We also have to encounter the belonging weights into final result. In all approaches that we utilize, for the particular . . , M, with P i defined by (12), where we plug µ i and Σ i , and where A is the projection matrix obtained as explained in the previous section. Using the above-given representation, the similarity measure between two GMMs given by f = ∑ m f i=1 α i f i , and g = ∑ m g i=1 β j g j can be invoked by simply comparing the corresponding representatives F = v 1 , . . . , v m f , α 1 , . . . , α m f and G = u 1 , . . . , u m g , β 1 , . . . , β m g in the transformed space, i.e., by comparing them as weighted low-dimensional Euclidean vectors. Various approaches can be applied to aggregate a single positive scalar value in order to represent a "distance" between F and G and therefore implicitly a "measure" between GMMs f and g. In this work, we use two essentially different approaches. The first one is simpler and utilizes the arbitrary fuzzy union or intersection in order to extract the mentioned value, given, for example, by various aggregation operators (see, e.g., [38]). The second approach utilizes EMD distance on F and G, and it is based on the work proposed in [17].
For the first approach, we use types of fuzzy aggregation operators, operating on v i − u j 2 , using α i and β j as weights. For the above-mentioned representatives, we apply the weighted max-min operator in the following way: as well as the maximum of the positive weighted sums We denote the previously invoked GMM measure induced by D 1 by GMM-NPE 1 , while we denote the GMM measure induced by D 2 by GMM-NPE 2 . Note that the choice of the particular fuzzy aggregation operator, i.e., the fuzzy measure, determines all the distance-wise properties of the final GMM similarity measure. Those are in our case the properties of D 1 and D 2 . It is also interesting to discuss which properties of the KL divergence do GMM-NPE 1 and GMM-NPE 2 satisfy. Both of them satisfy self similarity and positivity, for arbitrary GMMs f and g, while self-identity is not satisfied. Furthermore, the measures D 1 and D 2 are both symmetric, while KL divergence is not. Nevertheless, note that we could easily obtain non-symmetry by, for example, letting D 1 (F, G) = a in (16), and D 2 (F, G) = a in (17), but we leave those considerations for some future work.
For the second, i.e., the EMD distance approach, the representatives F and G are interpreted as pondered "clouds" of Euclidean low-dimensional vectors. Thus, the final measure of similarity between GMMs f and g is given (see [17]) as follows: where the flow [ζ ij ] is given as one that solves the following LP type minimization problem: where [d ij ] is the matrix of Euclidean distances between v i and u j , i.e., d ij = v i − u j . Note that the constant 1 which appears in the right hand side of the constraint (19) is due to the fact that α i , as well as β j , sum to one. Thus, the term D EMD (F, G) is actually interpreted as the work necessary in order to move, by flow [ζ ij ], the maximum amount of supplies possible, from the "cloud" F to the "cloud" G. Furthermore, note that the fact that EMD distance is a metric (see [17]) implies that the measure of similarity between GMMs D EMD defined by (18) is also a metric. Thus, similarly to the case of D 1 and D 2 , it is symmetric. We denote the GMM measure induced by D EMD by GMM-NPE 3 .

Computational Complexity
In the given analysis, the computational efficiency of a measure is defined as the efficiency obtained in the testing (not the learning) phase. Let us, for the sake of simplicity and without loss of generality, further assume that GMMs f and g have the same number, n = m, and that we treat the full covariance case. Let d denotes the dimension of the original feature space. Let us first elaborate on baseline measures that we use.
The complexity of KL-based measures of similarity between GMMs KL WE , KL MB , and KL VAR (see [10]) given by (3) where N is the number of samples. The estimate is then obtained using the arguments described above. Furthermore, in order to obtain an efficient approximation, the number of samples N has to be large, i.e., N >> m.
For the state-of-the-art EMD-based measures of similarity between GMMs proposed in [3], the computational complexity for SR-EMD measure can be estimated as O(8m 5 d 3 ), as LARS/Homotopy algorithms that are usually used to find a numerical solution of the optimization problem elaborated in SR-EMD converge in about 2m iterations (see [39]). Namely, as (19) is a LP problem, in one iteration, the computational complexity is of order O(n const n var ), where n const is a number of constraints and n var is a number of variables for the particular problem. As it holds n const = n var = m and the complexity of the inversion of d × d matrix is of order O(d 3 ), since there are m 2 such inversions at each iteration, we obtain the previously mentioned estimate.
For the proposed similarity measures D 1 and D 2 given by (16) and (17), the analysis is as follows: the computational complexity of comparing F and G rise linearly with l and is given as O(m 2 l), where l d 3 is delivered a priory on the base of the analysis of the eigenvalues, as explained at the end of Section 4.1. Nevertheless, if we encounter the computational complexity required to transform the parameters of GMMs to the l dimensional space, there is an additional term O(md 2 l). One observes that for small l (l ∼ d in our experiments), the overall complexity of the proposed D 1 and D 2 is much smaller then all the baseline measures, and especially for large number of components m. For the EMD-based approach, i.e., the D EMD given by (18), the computational complexity is estimated as the sum of O(k iter m 4 l) term and the mentioned term O(md 2 l), making it significantly more efficient in comparison to EMD-KL and SR-EMD-M [3], as it holds l d d 2 . Instead of calculating the KL divergence between two d-variate Gaussians, we calculate the Euclidian distance between two vectors of length l, which is of complexity O(l).

Experimental Results
In this section, we present experiments comparing the proposed GMM-NPE measures with the baseline measures presented in Section 2. The experiments were conducted on synthetic as well as real data sets (texture recognition task). For the first case, synthetic data are constructed, satisfying specific assumptions, so that the proposed GMM-NPE measures could demonstrate their effectiveness over the baseline measures in such controlled conditions. In both synthetic and real data case, for the baseline measures, we chose KL WE , KL MB , and KL VAR , defined by (3)

Experiments on Synthetic Data
In order to demonstrate the effectiveness of the proposed method, we use toy examples consisting of two scenarios.
In the first scenario, we set the parameters of the Gaussians to lie on the low dimensional surface embedded in the cone SP D + (d + 1) ⊂ R n , n = (d + 1)(d + 2)/2, where the covariance matrix is of dimension d × d, with various dimensions d (d is also the dimension of the corresponding centroid), as it is given by (12). Dimensions of the surfaces containing data used in experiments are l = 1 and l = 2.
Mentioned surfaces are formed as follows: For the l = 1 case, we randomly generate positive-definite matrices A 1 , A 2 , both of dimension d × d, in a following way: let i = 1 (the procedure is identical for i = 2). Firstly we generate a matrixÃ 1 containing independent, identically distributed (i.i.d.) elements, where we set pdf to be U ([0, 1]). After symmetriza-tionÃ sym 1 = 1 2 Ã 1 +Ã T 1 , we obtain matrixÂ 1 by replacing only the diagonal elements of A sym 1 = ã ij d×d with the sum of off-diagonal elements of matrixÃ sym 1 , i.e.,â ii ← ∑ d j=1 j =iã ij (note thatã ij > 0) andâ ij ←ã ij for i = j. Thus, asÂ 1 is a symmetric and diagonally dominant matrix, it is positive semi-definite (see [40]). Finally, we obtain A 1 =Ã 1 + εI, for some small ε > 0 (thus A 1 is positive definite), where we chose 0.00001 for all experiments. The same stands for matrix A 2 . Finally, the l = 1 dimensional manifold in formed in the form of parabolic curve given by: and embedded into R n . For simplicity purposes, a = 1, b, c = 0 in all our experiments. For the case l = 2, we form the l = 2 dimensional surface given by embedded into R n . For the same reasons as in the case (20), we chose a = 1, b, c = 0 for all experiments.
We uniformly sample N = 800 Gaussians directly from the curve (20) for the l = 1 or (21) for the l = 2 case. From that pool, also by uniform sampling, we obtain M number of GMMs with the predefined size K, where we set all mixture weights to be 1/K. For the acquired set of GMMs, we conduct "leave 10 percent out" cross-validation for every trial. We find that the estimated number of nonzero eigenvalues in all experimentsl is fully coherent with the dimension l of the underlying manifolds, i.e.,l = 1 in the l = 1 case andl = 2 in the l = 2 case, where the threshold for neglecting the eigenvalues was set to T = 10 −3 . In all experiments, as the proposed method, we use GMM-NPE 1 , GMM-NPE 2 , or GMM-NPE 3 . We vary the parameter K representing the size of a particular GMM used in the training as well as dimension d. We use different values for K, namely K = 1 and K = 5. In the case l = 1, we first set the means of the Gaussians to be zero vectors, where the results of experiments are presented for [r 1 , r 2 ] = [−3, 5] and [r 1 , r 2 ] = [0, 5] in Tables 1 and 2, respectively. Next, we make the means of Gaussians used in GMMs to be d dimensional vectors (we have d ∈ {10, 20, 30, 50} in all experiments), by setting all means belonging to the first class equal to some fix m 1 ∈ R d , and all means belonging to the second class equal to some fix m 2 ∈ R d . We set  Tables 3 and 4, respectively. The same settings as previously described are kept for the l = 2 case. The experiments for the case where the means of the Gaussians are set to zero are presented in Tables 5 and 6, while those where the means of Gaussians are non-zero are presented in Tables 7 and 8, respectively.      Table 6. Results in the form of recognition accuracy, obtained on the synthetic data: l = 2, t 1 , t 2 ∈ [0, 5], N = 800, M = 200, m 1 , m 2 = 0.    For the second scenario, N = 800 positive-definite matrices A i are sampled, each one formed in a similar way, previously described for A 1 . Thus, we control the sampling process in order to obtain positive-definite matrices "uniformly" distributed in the cone SP D + (d), i.e., not lying on any lower dimensional embedded sub-manifold. The set of Gaussians is formed using the set of positive-definite matrices, while all means are set to zero vectors.Ñ different GMMs of size K are formed, their components sampled uniformly from the above-mentioned set of Gaussians (N = 800,Ñ = 200 and K = 5 in the experiment). The proposed GMM-NPE i (i = 1, 2, 3) performs equally well, concerning the recognition precision as well as computational efficiency in comparison to all baseline methods. Estimated number of the non-negligible characteristic values were equal to the dimension of the full space. All the above-mentioned confirms that if data do not lye on the lower dimensional manifold embedded in the cone SP D + (d), the proposed method does not provide any benefits in comparison to the baseline methods.

Experiments on Real Data
In this section, the performances of the proposed method described in Section 4.2, evaluated on real data (texture recognition task), are presented in comparison to baseline methods. As the baseline, we use KL-based KL WE , KL MB , and KL VAR GMM similarity measures, all described in Section 2. As the baseline, we also use the unsupervised sparse EMD-based measure proposed in [3], denoted by SR-EMD-M measure as well as the supervised sparse EMD-based measure, also proposed in [3], denoted by SR-EMD-M-L. For a texture recognition task, we conducted experiments on the following databases: UMD [41], containing 25 classes (1000 images); CUReT [42], containing 61 classes (5612 images); KTH-TIPS [43], containing 10 classes (8010 images).
We used covariance descriptors as texture features (see [34,44,45]) in the experiments, as they showed excellent performance in the texture recognition task. We briefly explain how they were formed: For any given textured image, the row features are calculated in a form [I, |I x |, |I y |, |I xx |, |I yy |](x, y) (the actual dimension of the vector isd = 5), from whom, extracted at the R × R patch (we used R = 30 in all experiments), centered in (x, y), we estimate the covariance matrix, and then finally vectorize its upper triangular into one d =d(d + 1)/2 = 15 dimensional feature vector. For that particular textured image, the parameters of GMMs are estimated using EM [46] on the pool of feature vectors obtained as previously explained. We note that for every train or test image example, we uniformly divide it into four sub-images and those are used for training/testing. Hence, each image is represented by four GMMs and compared to all GMMs in the training set, while its label is determined using the kNN algorithm (k = 3 and class label is obtained by voting). Recognition accuracy of the proposed GMM-NPE measures, in comparison to all baseline measures, for the above-mentioned texture databases, are presented in Figures 1-3. For all databases used, we vary from l = 30 to l = 100 in order to analyze the trade-off between accuracy and computational efficiency. We kept the number of Gaussian components fixed and equal to K = 5. For each class, a fixed number of N examples from the training set is randomly selected (by uniform distribution), keeping the rest for testing. We vary the mentioned number of training instances N across experiments. Final results are averaged over 20 trials. In all experiments, we obtained slightly better results using the GMM-NPE 2 measure defined by (18), (19) in comparison to the GMM-NPE 1 , so we present only the GMM-NPE 2 and GMM-NPE 3 in our results.
Recall that (see the analysis presented in Section 4.3) the computational complexity of the proposed GMM-NPE 1 (as well as GMM-NPE 2 ) is roughly O(K 2 l) + O(Kd 2 l), and for all the KL-based baseline algorithms, i.e., the KL WE , KL MB , and KL VAR , the computational complexity is estimated roughly as O(K 2 d 3 ). Furthermore, (see Section 4.3), for the EMDbased baseline algorithms, i.e., the EMD-KL and SR-EMD-M, the computational complexity is estimated roughly as O(8K 5 d 3 ) and O(k iter K 4 d 3 ), respectively, with k iter >> K (see [3]). It follows that the ratio between the computational complexity of the proposed GMM-NPE 1 and GMM-NPE 2 , and any mentioned baseline KL-based method is estimated roughly as l/d 3 + l/(Kd), while the ratios between the computational complexity of GMM-NPE 1 and GMM-NPE 2 , and the baseline EMD-based measures are estimated as l max k iter K 2 d 3 + l max 8K 4 d and l max 8K 3 d 3 + l max k iter K 3 d , respectively. Considering l max d and k iter >> K, it can be seen that the computational efficiency is largely in favor of the proposed GMM-NPE 1 , GMM-NPE 2 in comparison to all baseline measures in all experimental cases. Concerning the proposed EMD-based GMM-NPE 3 measure with its computational complexity roughly estimated as O(k iter K 4 l) + O(Kd 2 l) (see Section 4.3), we compare its computational complexity with the corresponding EMD-based baseline EMD-KL and SR-EMD-M measures. The complexity ratios are estimated as k iter l max 8Kd + l max 8K 3 d and l max d + l max k iter K 3 d , again largely in favor of the proposed GMM-NPE 3 measure, in comparison to the EMD-KL and SR-EMD-M measures, and especially for smaller values of l. Thus, we conclude that the trade-off between the recognition accuracy and the computational efficiency is in favor of the proposed GMM-NPE measures. In Table 9, CPU processing times are presented for the proposed GMM-NPE 1 and GMM-NPE 2 measures, in comparison to the baseline KL WE , KL MB , KL VAR , and EMD-KL measures. The results are obtained as CPU processing times needed for the evaluation of measures of similarity between two GMMs in 100 trials. All GMMs are learned using randomly chosen example images from KTH-TIPS texture classification database. For all the experiments, we set k iter = 20, d = 15 and l max = 30, l max = 70, or l max = 100. It can be seen that the proposed GMM-NPE measure provides significantly lower CPU processing times in comparison to all baseline measures when there is a significant reduction in dimensionality of the original parameter space, i.e., l max = 30 and l max = 60, where the original Euclidian parameter space is of dimension n = 120. However, in the case of a relatively insignificant reduction in dimensionality, i.e., l max = 100, the performances in terms of computational complexity deteriorate significantly for the GMM-NPE measures. These results are consistent concerning the computational bounds given for the proposed and the baseline measures given in Section 4.3. The experiments were conducted on a workstation equipped with one 2.3 GHz CPU and 6 GB RAM. Table 9. Average processing CPU times for the proposed GMM-NPEs, in comparison to the baseline measures, as a function of number of GMM components K used, as well as dimension of the reduced space l max (unit: [ms]). The proposed methodology could also be applied in realistic personalization and recommendation application scenarios presented in [47]. Namely, user profile features obtained in this process could store history over time, and therefore, the covariance matrix could be estimated in the learning phase. The transformation matrix could be formed as presented in Sections 3 and 4.1, and the covariance which represents any particular user could be projected and represented by low dimensional vector representatives. In the exploitation phase, stored features collected from users in some predefined period of time could also be used in order to form covariances which could then be projected. The measure of similarity between a user and item could then be computed by using similarity measures and the procedure proposed in Section 4.2.