Gaussian Mixture Models (GMMs) have been used for many years in pattern recognition, computer vision, and other machine learning systems, due to their ability to model arbitrary distributions and their simplicity. The comparison of two GMMs plays an important role in many classification problems in machine learning and pattern recognition, since an arbitrary pdf can be successfully modeled by a GMM, provided that the number of “modes” of that particular pdf is known. Those problems include, but are not limited to, speaker verification and/or recognition [1], content-based image matching and retrieval [2,3] (as well as classification [4], segmentation, and tracking), texture recognition [2,3,5,6,7,8], genre classification, etc. GMMs have also recently found their way into Variational Auto-Encoders (VAEs), extensively used in the emerging field of deep learning, with promising results (see [9]). Many authors have considered the problem of developing efficient similarity measures between GMMs to be applied in such tasks (see, for example, [1,2,3,7,10]). The first group of such measures utilizes informational distances. In some early works, the Chernoff distance, Bhattacharyya distance, and Matusita distance were explored (see [11,12,13]). Nevertheless, the Kullback–Leibler (KL) divergence [14] emerged as the most natural and effective informational distance measure; it is an informational distance between two probability distributions p and q. While the KL divergence between two Gaussian components exists in analytic, i.e., closed form, there is no analytic solution for the KL divergence between arbitrary GMMs, which is important for various applications. A straightforward solution to this problem is to compute the KL divergence between two GMMs via the Monte Carlo method (see [10]). However, this is almost always unacceptably computationally expensive, especially when dealing with huge amounts of data and a high-dimensional underlying feature space. Thus, many researchers have proposed various approximations of the KL divergence, trying to obtain acceptable precision in the recognition tasks of interest. In [2], one such approximation is proposed and applied in an image retrieval task as a measure of similarity between images. In [10], lower and upper approximation bounds are derived by the same authors, with experiments conducted on synthetic data as well as in a speaker verification task. In [1], an accurate approximation built upon the Unscented Transform is presented and applied within a speaker recognition task in a computationally efficient manner. In [15], the authors proposed a novel approach to online estimation of pdfs based on kernel density estimation. The second group of measures utilizes information geometry. In [16], the authors proposed a metric on the space of multivariate Gaussians by parameterizing that space as a Riemannian symmetric space. In [3], motivated by that paper, by the efficient application of the vector-based Earth Mover’s Distance (EMD) metric (see [17]) in various recognition tasks (see, for example, [18]), and by its extension to GMMs in a texture classification task proposed in [6], the authors proposed a sparse EMD methodology for image matching based on GMMs. An unsupervised sparse learning methodology is presented in order to construct the EMD measure, where the sparsity of the underlying problem is assumed; in experiments, it proved to be more efficient and robust than the conventional EMD measure. Their EMD approach utilizes information-geometry-based ground distances between component Gaussians, introduced in [16]. On the other hand, their supervised sparse EMD approach uses an effective pair-wise-based method in order to learn the EMD metric among GMMs. Both of these methods were evaluated on synthetic as well as real data, as part of texture recognition and image retrieval tasks, and higher recognition accuracy was obtained in comparison to some state-of-the-art methods. In [7], the method proposed in [3] was extended, and a study concerning ground distances and image features, such as the Local Binary Pattern (LBP) descriptor, SIFT, high-level features generated by deep convolutional networks, the covariance descriptor, and Gabor filters, is also presented.
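To make the computational issue concrete, the following minimal sketch (in Python with NumPy/SciPy; the function and variable names are illustrative and not taken from the cited works) contrasts the closed-form KL divergence between two full-covariance Gaussian components with the Monte Carlo estimate that is required once whole GMMs are compared.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_gauss(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two full-covariance Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def gmm_logpdf(x, weights, means, covs):
    """Log-density of a GMM evaluated at the rows of x."""
    comp = np.stack([multivariate_normal.logpdf(x, m, c)
                     for m, c in zip(means, covs)], axis=1)
    # a logsumexp formulation would be numerically safer; kept simple here
    return np.log(np.exp(comp) @ np.asarray(weights))

def kl_gmm_monte_carlo(gmm_p, gmm_q, n_samples=10000, seed=None):
    """Monte Carlo estimate of KL(p || q) for two GMMs.

    Each GMM is a (weights, means, covs) triple. Samples are drawn from p,
    so the estimate is E_p[log p(x) - log q(x)]; its cost grows with the
    number of samples, the number of components, and the dimensionality.
    """
    rng = np.random.default_rng(seed)
    w, means, covs = gmm_p
    # draw component indices according to the mixture weights,
    # then sample from the selected Gaussian components
    idx = rng.choice(len(w), size=n_samples, p=np.asarray(w))
    x = np.stack([rng.multivariate_normal(means[i], covs[i]) for i in idx])
    return np.mean(gmm_logpdf(x, *gmm_p) - gmm_logpdf(x, *gmm_q))
```

Even this simple estimator requires thousands of density evaluations per pair of GMMs, which is what motivates both the closed-form approximations and the information-geometric alternatives discussed above.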
One of the main issues in pattern recognition and machine learning as a whole is that data are represented in high-dimensional spaces. This problem appears in many applications, such as information retrieval (especially image retrieval), text categorization, texture recognition, and appearance-based object recognition. Thus, the goal is to develop an appropriate representation for complex data. A variety of dimensionality reduction techniques have been designed to cope with this issue, targeting problems such as the “curse of dimensionality” and the computational complexity of the recognition phase of an ML task. They tend to increase the discrimination of the transformed features, which then lie either in a subspace of the original high-dimensional feature space or, more generally, on some lower-dimensional manifold embedded into it; techniques of the latter kind are the so-called manifold learning techniques. Some of the most commonly used subspace techniques, such as Linear Discriminant Analysis (LDA) [19] and the Maximum Margin Criterion (MMC) [3,20], trained in a supervised manner, or, for example, Principal Component Analysis (PCA) [21], trained in an unsupervised manner, handle this issue by trying to increase the discrimination of the transformed features and to decrease the computational complexity during recognition. Some of the frequently used manifold learning techniques are Isomap [22], Laplacian Eigenmaps (LE) [23], Locality Preserving Projections (LPP) [24] (an approach based on LE), and Locally Linear Embedding (LLE) [25]. The LE method exploits the connection between the graph Laplacian and the Laplace–Beltrami operator in order to project features in a locality-preserving manner. Nevertheless, it is mainly applicable in various spectral clustering applications, as it cannot deal with unseen data. An approach based on LE, called Locality Preserving Projections (LPP) (see [24]), resolves this problem by learning a linear projective map which best “fits” the manifold, thereby preserving the local properties of the data in the transformed space. In this way, any unseen data can be transformed into a low-dimensional space, which is useful in a number of pattern recognition and machine learning tasks. In [26], the authors proposed the Neighborhood Preserving Embedding (NPE) methodology which, similarly to LPP, aims to preserve the local neighborhood structure on the data manifold; however, it learns not only the projection matrix which maps the original features to a lower-dimensional Euclidean feature space, but also, as an intermediate optimization step, the weights that encode the neighborhood information in the original feature space. In [27], some of the previously mentioned methods, such as LE and LLE, are generalized; an instance of LE is given for the Riemannian manifold of positive-definite matrices and applied as part of an image segmentation task. Note that the mentioned dimensionality reduction techniques are applicable in many recent engineering and scientific fields, such as social network analysis and intelligent communications (see, for example, [28,29], published within a special issue presented in the editorial article [30]).
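Since the method proposed below adapts the two-step structure of NPE to the space of Gaussian parameters, a minimal sketch of NPE in its original, feature-space setting may be helpful. It assumes a k-nearest-neighbor graph and the standard NPE objective of [26]; all names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def npe(X, n_neighbors=5, n_components=2, reg=1e-3):
    """Neighborhood Preserving Embedding (illustrative sketch).

    X: (n, d) data matrix. Returns a (d, n_components) projection matrix A,
    so that the low-dimensional features are given by X @ A.
    """
    n, d = X.shape
    # pairwise Euclidean distances and k nearest neighbors (excluding the point itself)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]

    # Step 1: reconstruction weights, one constrained least-squares problem per point
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[knn[i]] - X[i]                            # neighbors centered at x_i
        G = Z @ Z.T                                     # local Gram matrix
        G += reg * (np.trace(G) + 1e-12) * np.eye(n_neighbors)  # regularization
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, knn[i]] = w / w.sum()                      # weights sum to one

    # Step 2: projection matrix from the generalized eigenvalue problem
    # minimize a^T X^T M X a subject to a^T X^T X a = 1, with M = (I - W)^T (I - W)
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    A_mat = X.T @ M @ X
    B_mat = X.T @ X + reg * np.eye(d)
    vals, vecs = eigh(A_mat, B_mat)                     # ascending eigenvalues
    return vecs[:, :n_components]                       # eigenvectors with smallest eigenvalues
```

The first step captures the local neighborhood structure through the reconstruction weights, and the second step finds the linear projection that best preserves that structure in the low-dimensional space.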
In many machine learning systems, the trade-off between recognition accuracy and computational efficiency is crucial for their applicability in real-life scenarios. In this work, we construct a novel measure of similarity between arbitrary GMMs, with an emphasis on lowering the complexity of the representation of all GMMs used in a particular system. Our aim is to investigate the assumption that the parameters of full-covariance Gaussians, i.e., the components of the GMMs, lie close to a lower-dimensional surface embedded in the cone of positive definite matrices for the particular recognition task. Note that this is contrary to the assumption that the data themselves lie on a lower-dimensional manifold embedded in the feature space. We actually use the NPE-based idea in order to learn the projection matrix A, but we apply it to the parameter space of Gaussian components. The matrix A projects the parameters of Gaussian components to a lower-dimensional space, while the local neighborhood information from the original parameter space is preserved. Let G = {g_1, ..., g_M} be the set of all Gaussian components, where M is the number of Gaussians for the particular task. We assume that the parameters of any multivariate Gaussian component g_i, given as the vectorized pair (μ_i, Σ_i), live in a high-dimensional parameter space. Each Gaussian component is then assigned to a node of an undirected weighted graph. The graph weights w_ij are learned in an intermediate optimization step, forming the weight matrix W, where, instead of the Euclidean distance figuring in the cost functional of the baseline NPE operating on the feature space, we plug a specified measure of similarity between Gaussian components into the cost functional. The ground distances between Gaussians g_i and g_j, proposed in [3,16], are based on information geometry. We name the proposed GMM similarity measure GMM-NPE.
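The following is only a rough sketch of the construction just described, not the exact formulation developed later in the paper: it vectorizes the parameters of each Gaussian component and builds the ground-distance neighborhoods that drive the NPE-style weight learning. As a stand-in for the information-geometric ground distances of [3,16], it uses the symmetrized KL divergence; all names are illustrative.

```python
import numpy as np

def gauss_param_vector(mu, cov):
    """Vectorized parameter representation of one Gaussian component:
    the mean stacked with the vectorized covariance (illustrative choice)."""
    return np.concatenate([mu, cov.reshape(-1)])

def sym_kl(mu0, cov0, mu1, cov1):
    """Symmetrized KL divergence between two Gaussians, used here only as a
    stand-in ground distance; [3,16] employ information-geometric distances."""
    def kl(m0, c0, m1, c1):
        d = m0.shape[0]
        c1_inv = np.linalg.inv(c1)
        diff = m1 - m0
        return 0.5 * (np.trace(c1_inv @ c0) + diff @ c1_inv @ diff - d
                      + np.log(np.linalg.det(c1) / np.linalg.det(c0)))
    return kl(mu0, cov0, mu1, cov1) + kl(mu1, cov1, mu0, cov0)

def gaussian_neighborhood_graph(means, covs, n_neighbors=5):
    """For each Gaussian component, find its nearest neighbors under the
    chosen ground distance; these neighborhoods then drive the NPE-style
    weight learning on the stacked parameter vectors."""
    M = len(means)
    dist = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            dist[i, j] = dist[j, i] = sym_kl(means[i], covs[i],
                                             means[j], covs[j])
    knn = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    params = np.stack([gauss_param_vector(m, c) for m, c in zip(means, covs)])
    return params, knn   # inputs to an NPE-style solver that yields A
```

The projection matrix A is then obtained by an NPE-style optimization over these parameter vectors, with the ground-distance neighborhoods replacing the Euclidean ones, as detailed in the remainder of the paper.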