A Gaussian Process Decoder with Spectral Mixtures and a Locally Estimated Manifold for Data Visualization
Abstract
1. Introduction
- Kernel function. We focus on the spectral mixture (SM) kernel [30] and introduce it into a GP-LVM. GP-LVMs typically employ the SE or Matérn kernel, both of which depend only on the distance between latent features. In contrast, the SM kernel is a broader class of kernel that depends not only on the distance but also on the periodicity of the input features. Although the SM kernel is considerably more expressive than the SE and Matérn kernels, it has so far been applied to regression models rather than to GP-LVMs. In this study, we develop the SM kernel for GP-LVMs and demonstrate its strong ability for pattern discovery and visualization.
- Prior distribution. We focus on the use of a locally estimated manifold in the prior distribution. Local structures, such as the cluster structure, are crucial for improving the visibility of a low-dimensional representation. We therefore regularize the low-dimensional embedding with a local estimate of the data manifold: we approximate the manifold locally with a neighborhood graph and use that graph to regularize the low-dimensional representation. Through this strategy, we account for local structures and improve the visibility of the low-dimensional representation. Regularization based on the graph Laplacian has previously been studied in the supervised setting as the Gaussian process latent random field (GPLRF) [33]; we apply the GPLRF scheme in an unsupervised manner. A minimal sketch of both the SM kernel and this graph-Laplacian prior follows this list.
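The NumPy sketch below is illustrative only: the function names, parameterization, and the hyperparameter `beta` are our own shorthand rather than the paper's notation. It follows the standard SM kernel form of Wilson and Adams [30] and a GPLRF-style graph-Laplacian log-prior [33].

```python
import numpy as np

def spectral_mixture_kernel(X1, X2, weights, means, variances):
    """Spectral mixture (SM) kernel in the form of Wilson & Adams (2013).

    X1: (n, d) and X2: (m, d) latent inputs.
    weights:   (Q,)   mixture weights w_q.
    means:     (Q, d) spectral means mu_q (control periodicity).
    variances: (Q, d) spectral variances v_q (control length-scale decay).
    Returns the (n, m) Gram matrix.
    """
    tau = X1[:, None, :] - X2[None, :, :]              # pairwise differences, (n, m, d)
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for w, mu, v in zip(weights, means, variances):
        decay = np.exp(-2.0 * np.pi**2 * tau**2 * v).prod(axis=-1)
        periodic = np.cos(2.0 * np.pi * tau * mu).prod(axis=-1)
        K += w * decay * periodic
    return K

def laplacian_log_prior(X, W, beta=1.0):
    """GPLRF-style log-prior term, up to an additive constant.

    X: (n, q) latent coordinates; W: (n, n) symmetric neighborhood-graph weights.
    Returns -beta/2 * trace(X^T L X), which rewards graph neighbors staying close.
    """
    L = np.diag(W.sum(axis=1)) - W                      # unnormalized graph Laplacian
    return -0.5 * beta * np.trace(X.T @ L @ X)

# Tiny usage example: one mixture component (Q = 1) in a 2-D latent space.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = spectral_mixture_kernel(X, X, weights=np.array([1.0]),
                            means=np.full((1, 2), 0.1),
                            variances=np.full((1, 2), 1.0))
print(K.shape)  # (5, 5)
```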
2. Related Works
2.1. Graph-Based Approaches
2.2. PCA-Based Approaches
3. Preliminary
3.1. Gaussian Process Latent Variable Model
3.2. Kernel Function
3.3. Uniform Manifold Approximation and Projection
4. Proposed Method
4.1. Model Formulation
4.2. Lower Bound of Log-Posterior Distribution
4.3. Optimization
5. Experiment
5.1. Experimental Setup
5.1.1. Dataset
- MNIST (http://yann.lecun.com/exdb/mnist/ (accessed on 25 July 2022)) contains 70K images of hand-written digits and their labels. We randomly selected 20K images and colored the embedded points by digit label. MNIST has a cluster structure over the digits, and the embedding should preserve this cluster structure.
- COIL-20 [54] contains 1440 grayscale images of rotated objects. We used the first ten objects and colored the embedded points by object. Each object in COIL-20 traces a rotational structure, and the embedding should preserve it.
- DBPedia (https://wiki.dbpedia.org/ (accessed on 29 July 2022)) contains 530K Wikipedia articles classified into 14 categories. We used 20K random articles and extracted feature vectors with FastText following [37]. We colored the embedded points by category, and the low-dimensional representation should preserve the cluster structure.
- Fashion MNIST (FMNIST) [55] contains 70K images of 10 kinds of fashion items and their labels. We randomly selected 20K images and colored the embedded points by item label. Although FMNIST has a cluster structure similar to that of MNIST, its categories are more correlated than those of MNIST, so separating the clusters in a low-dimensional representation is more difficult. A minimal preprocessing sketch follows this list.
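For concreteness, the sketch below fetches MNIST from OpenML and draws a random 20K subset; the data source, random seed, and pixel scaling are illustrative assumptions, not the exact preprocessing used in the experiments.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def load_mnist_subset(n_samples=20_000, seed=0):
    """Fetch MNIST (70K images, 784 pixels each) and draw a random subset.

    Labels are returned only to color the 2-D embedding; they are not used
    during training.
    """
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_samples, replace=False)
    return X[idx].astype(np.float64) / 255.0, y[idx]

X, labels = load_mnist_subset()
print(X.shape, labels.shape)  # (20000, 784) (20000,)
```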
5.1.2. Comparative Methods
- PCA [3] is a classical method for dimensionality reduction and linearly derives its embedding. We compared our method with PCA as a benchmark method. Note that GP-LVM-based methods typically use PCA as the initial values of their latent variables.
- LE [34] is a classical graph-based approach that derives its low-dimensional embedding from the eigenvectors of the graph Laplacian (a minimal sketch follows this list). We used LE as the benchmark among the graph-based approaches.
- Bayesian GP-LVM (BGP-LVM) [16,25] is a GP-LVM that performs Bayesian inference over the latent variables. We used BGP-LVM as the baseline method and visualized the mean vectors as the low-dimensional representation. Following the original work [16], we used the RBF kernel in Equation (6) and a standard normal density as the prior distribution.
- Potential of Heat-diffusion for Affinity-based Transition Embedding (PHATE) [41] is a recently proposed graph-based approach that builds its neighborhood graph through a diffusion operation [56], enabling preservation of global structure. We used PHATE as a state-of-the-art representative of methods that aim to preserve global structures.
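To make the LE baseline concrete, the sketch below embeds data using the generalized eigenvectors of the graph Laplacian associated with the smallest nonzero eigenvalues. The k-nearest-neighbor graph with symmetrized binary weights and the neighborhood size are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, n_components=2, n_neighbors=10):
    """Laplacian eigenmaps: embed with the eigenvectors of the graph Laplacian
    corresponding to the smallest nonzero generalized eigenvalues."""
    # Symmetrized binary k-NN adjacency matrix.
    W = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity")
    W = (0.5 * (W + W.T)).toarray()
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # unnormalized graph Laplacian
    # Generalized eigenproblem L v = lambda D v; skip the constant eigenvector.
    eigvals, eigvecs = eigh(L, D)
    return eigvecs[:, 1:n_components + 1]

# Example on random data; the experiments use MNIST, COIL-20, DBPedia, and FMNIST.
X = np.random.default_rng(0).normal(size=(200, 50))
Y = laplacian_eigenmaps(X)
print(Y.shape)  # (200, 2)
```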
5.1.3. Evaluation
5.1.4. Training Procedure
5.2. Ablation Study
5.3. Qualitative Results
5.4. Quantitative Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Van der Maaten, L.; Postma, E.; Van den Herik, J. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 2009, 10, 66–71.
- Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441.
- Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
- Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 2256–2265.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- Van der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 2014, 15, 3221–3245.
- McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
- Kobak, D.; Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 2019, 10, 5416.
- Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.A.; Kwok, I.W.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2019, 37, 38–44.
- Lawrence, N.D. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J. Mach. Learn. Res. 2005, 6, 1783–1816.
- Titsias, M.; Lawrence, N.D. Bayesian Gaussian process latent variable model. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, 13–15 May 2010; pp. 844–851.
- Neal, R.M. Bayesian Learning for Neural Networks; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 118.
- Lee, J.; Bahri, Y.; Novak, R.; Schoenholz, S.S.; Pennington, J.; Sohl-Dickstein, J. Deep neural networks as Gaussian processes. arXiv 2017, arXiv:1711.00165.
- Märtens, K.; Campbell, K.; Yau, C. Decomposing feature-level variation with covariate Gaussian process latent variable models. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 4372–4381.
- Jensen, K.; Kao, T.C.; Tripodi, M.; Hennequin, G. Manifold GPLVMs for discovering non-Euclidean latent structure in neural data. Adv. Neural Inf. Process. Syst. 2020, 33, 22580–22592.
- Liu, Z. Visualizing single-cell RNA-seq data with semisupervised principal component analysis. Int. J. Mol. Sci. 2020, 21, 5797.
- Jørgensen, M.; Hauberg, S. Isometric Gaussian process latent variable model for dissimilarity data. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 5127–5136.
- Lalchand, V.; Ravuri, A.; Lawrence, N.D. Generalised GPLVM with stochastic variational inference. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Virtual Event, 28–30 March 2022; pp. 7841–7864.
- Wang, J.M.; Fleet, D.J.; Hertzmann, A. Gaussian process dynamical models for human motion. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 283–298.
- Damianou, A.C.; Titsias, M.K.; Lawrence, N.D. Variational inference for latent variables and uncertain inputs in Gaussian processes. J. Mach. Learn. Res. 2016, 17, 1–62.
- Ferris, B.; Fox, D.; Lawrence, N. WiFi-SLAM using Gaussian process latent variable models. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 6–12 January 2007; pp. 2480–2485.
- Zhang, G.; Wang, P.; Chen, H.; Zhang, L. Wireless indoor localization using convolutional neural network and Gaussian process regression. Sensors 2019, 19, 2508.
- Lu, C.; Tang, X. Surpassing human-level face verification performance on LFW with GaussianFace. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Austin, TX, USA, 25–30 January 2015.
- Cho, Y.; Saul, L. Kernel methods for deep learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 7–10 December 2009.
- Wilson, A.; Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 1067–1075.
- Lloyd, J.; Duvenaud, D.; Grosse, R.; Tenenbaum, J.; Ghahramani, Z. Automatic construction and natural-language description of nonparametric regression models. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Quebec City, QC, Canada, 27–31 July 2014.
- Urtasun, R.; Darrell, T. Discriminative Gaussian process latent variable model for classification. In Proceedings of the International Conference on Machine Learning (ICML), Corvallis, OR, USA, 20–24 June 2007; pp. 927–934.
- Zhong, G.; Li, W.J.; Yeung, D.Y.; Hou, X.; Liu, C.L. Gaussian process latent random field. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Atlanta, GA, USA, 11–15 July 2010; pp. 679–684.
- Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003, 15, 1373–1396.
- Belkin, M.; Niyogi, P. Convergence of Laplacian eigenmaps. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 4–7 December 2006.
- Carreira-Perpinán, M.A. The elastic embedding algorithm for dimensionality reduction. In Proceedings of the International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 167–174.
- Fu, C.; Zhang, Y.; Cai, D.; Ren, X. AtSNE: Efficient and robust visualization on GPU through hierarchical optimization. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 176–186.
- Böhm, J.N.; Berens, P.; Kobak, D. Attraction-repulsion spectrum in neighbor embeddings. J. Mach. Learn. Res. 2022, 23, 1–32.
- Amid, E.; Warmuth, M.K. TriMap: Large-scale dimensionality reduction using triplets. arXiv 2019, arXiv:1910.00204.
- Wang, Y.; Huang, H.; Rudin, C.; Shaposhnik, Y. Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization. J. Mach. Learn. Res. 2021, 22, 9129–9201.
- Moon, K.R.; van Dijk, D.; Wang, Z.; Gigante, S.; Burkhardt, D.B.; Chen, W.S.; Yim, K.; Elzen, A.v.d.; Hirn, M.J.; Coifman, R.R.; et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 2019, 37, 1482–1492.
- Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 1999, 61, 611–622.
- Hofmann, T.; Schölkopf, B.; Smola, A.J. Kernel methods in machine learning. Ann. Stat. 2008, 36, 1171–1220.
- Rasmussen, C.E.; Williams, C.K. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006.
- Lawrence, N.D.; Moore, A.J. Hierarchical Gaussian process latent variable models. In Proceedings of the International Conference on Machine Learning (ICML), Corvallis, OR, USA, 20–24 June 2007; pp. 481–488.
- Quinonero-Candela, J.; Rasmussen, C.E. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 2005, 6, 1939–1959.
- Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, FL, USA, 16–18 April 2009; pp. 567–574.
- Lawrence, N.D. Learning for larger datasets with the Gaussian process latent variable model. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico, 21–24 March 2007; pp. 243–250.
- Damianou, A.; Lawrence, N.D. Deep Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 207–215.
- Dai, Z.; Damianou, A.; González, J.; Lawrence, N. Variational auto-encoded deep Gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016; pp. 1–11.
- Shewchuk, J.R. An Introduction to the Conjugate Gradient Method without the Agonizing Pain; Carnegie Mellon University: Pittsburgh, PA, USA, 1994.
- Zhu, C.; Byrd, R.H.; Lu, P.; Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 1997, 23, 550–560.
- GPy. GPy: A Gaussian Process Framework in Python. 2012. Available online: http://github.com/SheffieldML/GPy (accessed on 26 June 2020).
- Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (COIL-20); Technical Report CUCS-006-96; Columbia University: New York, NY, USA, 1996.
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
- Coifman, R.R.; Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 2006, 21, 5–30.
- Venna, J.; Kaski, S. Neighborhood preservation in nonlinear projection methods: An experimental study. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Vienna, Austria, 21–25 August 2001; pp. 485–491.
- Espadoto, M.; Martins, R.M.; Kerren, A.; Hirata, N.S.; Telea, A.C. Toward a quantitative survey of dimension reduction techniques. IEEE Trans. Vis. Comput. Graph. 2019, 27, 2153–2173.
- Zu, X.; Tao, Q. SpaceMAP: Visualizing high-dimensional data by space expansion. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 27707–27723.
- Joia, P.; Coimbra, D.; Cuminato, J.A.; Paulovich, F.V.; Nonato, L.G. Local affine multidimensional projection. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2563–2571.
- Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
| Dataset | Samples | Dimensions | Categories | Type | Features |
|---|---|---|---|---|---|
| MNIST | 20,000 | 784 | 10 | Image | Pixels |
| COIL-20 | 720 | 16,384 | 10 | Image | Pixels |
| DBPedia | 20,000 | 100 | 14 | Text | FastText |
| FMNIST | 20,000 | 784 | 10 | Image | Pixels |
| Dataset | PCA | LE | t-SNE | BGP-LVM | PHATE | PM |
|---|---|---|---|---|---|---|
| MNIST | 0.738 | 0.756 | 0.994 | 0.823 | 0.871 | 0.881 |
| COIL-20 | 0.898 | 0.882 | 0.997 | 0.984 | 0.931 | 0.974 |
| DBPedia | 0.883 | 0.956 | 0.998 | 0.989 | 0.986 | 0.990 |
| FMNIST | 0.912 | 0.927 | 0.994 | 0.909 | 0.958 | 0.953 |
| Dataset | PCA | LE | t-SNE | BGP-LVM | PHATE | PM |
|---|---|---|---|---|---|---|
| MNIST | 0.503 | 0.431 | 0.349 | 0.464 | 0.368 | 0.512 |
| COIL-20 | 0.818 | 0.633 | 0.611 | 0.525 | 0.355 | 0.687 |
| DBPedia | 0.778 | 0.490 | 0.339 | 0.594 | 0.361 | 0.694 |
| FMNIST | 0.876 | 0.692 | 0.579 | 0.883 | 0.615 | 0.862 |