Laplacian Eigenmaps Dimensionality Reduction Based on Clustering-Adjusted Similarity

Euclidean distance between instances is widely used to capture the manifold structure of data and for graph-based dimensionality reduction. However, in some circumstances, the basic Euclidean distance cannot accurately capture the similarity between instances; some instances from different classes but close to the decision boundary may be close to each other, which may mislead the graph-based dimensionality reduction and compromise the performance. To mitigate this issue, in this paper, we propose an approach called Laplacian Eigenmaps based on Clustering-Adjusted Similarity (LE-CAS). LE-CAS first performs clustering on all instances to explore the global structure and discrimination of instances, and quantifies the similarity between cluster centers. Then, it adjusts the similarity between each pair of instances by multiplying it by the similarity between the centers of the clusters to which the two instances respectively belong. In this way, if two instances are from different clusters, the similarity between them is reduced; otherwise, it is unchanged. Finally, LE-CAS performs graph-based dimensionality reduction (via Laplacian Eigenmaps) based on the adjusted similarity. We conducted comprehensive empirical studies on UCI datasets and show that LE-CAS not only performs better than other relevant comparing methods, but also is more robust to input parameters.


Introduction
Dimensionality reduction is a typical data preprocessing step in data mining and pattern recognition [1][2][3]. It aims to project the original high-dimensional data into a low-dimensional subspace while preserving their geometric structure as much as possible. These low-dimensional representations of the original data can be used for different follow-up tasks, such as visualization, clustering, and classification. Dimensionality reduction has been studied for several decades [4][5][6]. By exploring and exploiting the geometric structure of samples from different perspectives, various unsupervised dimensionality reduction methods have been proposed [7]. Principal Component Analysis (PCA) is a representative unsupervised method [8]. It seeks to retain as much of the information in the data as possible after dimensionality reduction, and measures the importance of a projection direction by the variance of the data along that direction. However, such a projection does little to separate the data, and may make data points indistinguishable by mixing them together. To address this, locally linear embedding (LLE) [9] and its variants [10][11][12][13][14] were proposed to seek the low-dimensional embedding of high-dimensional samples by preserving the local geometric structure of samples. Furthermore, to make good use of valuable labeled samples, some semi-supervised dimensionality reduction methods have also been introduced [15][16][17][18].
In this paper, we focus on Graph Embedding-based Dimensionality Reduction (GEDR), which can unify most dimensionality reduction solutions [19]. GEDR highlights that the main difference between existing dimensionality reduction solutions is the adopted graph structure. GEDR methods typically rely on the adopted graph structures to capture the geometric relation between samples in the high-dimensional space. This kind of graph is usually called an affinity graph [20], since its edge set conveys information about the proximity of the data in the input space. Once the affinity graph is constructed, these methods derive the low-dimensional samples by requiring that certain graph properties be preserved in the reduced low-dimensional subspace. This typically results in an optimization problem, whose solution provides the reduced data, or a mechanism to project data from the original space to the low-dimensional space. For example, Belkin et al. [21] introduced Laplacian Eigenmaps (LE), which constructs a neighborhood graph (normally a k-nearest-neighbor (kNN) or ε-nearest-neighbor (εNN) graph [22]) to capture the local structure of samples. Tenenbaum et al. [23] proposed Isomap, which estimates the geodesic distance between samples and then uses multidimensional scaling to induce a low-dimensional representation. Weinberger et al. [24,25] introduced an approach called Maximum Variance Unfolding (MVU) to preserve both the local distances and angles between the samples. He et al. [13] extended LE to linear dimensionality reduction, which can output a linear projection matrix to project new samples into the low-dimensional subspace. In these studies, researchers have recognized that the graph constructed on the instances determines the performance of GEDR. However, how to construct a graph that correctly reflects the similarity between instances remains an open problem [2]. The distances between instances tend to become nearly equal as the dimensionality of instances increases [26], and many traditional similarity metrics are distorted by noisy or redundant features of high-dimensional data. Thus, researchers have developed several graph construction methods to improve the performance of GEDR.
Some efforts have been made to improve the performance of LE, a classical and representative GEDR method. Zeng et al. [27] proposed the geodesic distance-based generalized Gaussian Laplacian Eigenmap (GGLE) method, which uses different generalized Gaussian functions to measure the similarity between high-dimensional data points. Raducanu et al. [28] introduced Self-regulation of neighborhood parameter for Laplacian Eigenmaps (S-LE), which measures the similarity between instances via the ratio of geodesic distance and Euclidean distance between the samples and their neighborhood nodes, and adaptively adjusts and optimizes the neighborhood parameters. Ning et al. [29] developed Supervised Cluster Preserving Laplacian Eigenmap (SCPLE), which constructs an intra-class graph and an inter-class graph, and determines the edge weights by class label information and adaptive thresholds. By maximizing the weighted neighbor distances between heterogeneous samples and minimizing the weighted neighbor distances between homogeneous samples, SCPLE maps homogeneous samples closer together and heterogeneous samples farther apart in the low-dimensional space.
Although these approaches improve the performance of LE, they still have problems when embedding instances close to the decision boundary. For example, as shown in Figure 1(a), instances from different clusters close to the decision boundary may be much closer than instances from the same cluster. As a result, they will be placed nearby in the reduced low-dimensional space, which distorts the data distribution and compromises the learning performance. To remedy the issue illustrated in Figure 1, we propose Laplacian Eigenmaps dimensionality reduction based on Clustering-Adjusted Similarity (LE-CAS). LE-CAS applies a clustering technique to the original instances to explore the underlying data distribution and global structure of instances. At this stage, we initially intended to use k-means clustering [22] to obtain the decision boundaries. However, as our feature space is large in scale and complex in structure, the clusters produced by k-means clustering depend largely on the distribution of samples and may not be related to the structure of the boundaries. To solve this problem, we decided to employ an extension of k-means clustering called kernel k-means [30]. Kernel k-means maps the data to a higher-dimensional feature space through a nonlinear mapping and performs cluster analysis in that feature space. Mapping the data to a high-dimensional space can highlight the feature differences between sample classes and yield more accurate clustering results. After that, LE-CAS uses the cluster structure to adjust the similarity between instances based on the Gaussian heat kernel. In particular, if a pair of instances belongs to different clusters, the similarity between them is reduced relative to the original value; otherwise, the similarity is unchanged. As shown in Figure 1(b), the similarities between pairwise samples from different clusters are reduced, while the similarities between samples from the same cluster remain high. In this way, the global structure information revealed by clustering is embedded into the adjusted similarity, which can better capture the structure between samples. Finally, LE-CAS executes Laplacian Eigenmaps dimensionality reduction based on this clustering-adjusted similarity. Extensive experimental results on UCI datasets from different domains show that LE-CAS significantly outperforms other approaches that aim to improve LE by different techniques.
The rest of this paper is organized as follows. In Section 2, we give the details of how to adjust the similarity between samples and list the procedures of LE-CAS. The experimental setup is introduced in Section 3, experimental results and analysis on UCI datasets are presented in Section 4, and conclusions and future work follow in Section 5.

Methodology
Suppose $X = [x_1; x_2; \ldots; x_n] \in \mathbb{R}^{n \times d}$ is a set of $n$ instances in the $d$-dimensional space. LE-CAS aims to project $X$ into a low-dimensional subspace with a new representation $Y \in \mathbb{R}^{n \times c}$, where $c \ll d$. We first introduce the basic idea behind our method, called Clustering-Adjusted Similarity (CAS) [31], and then briefly describe the procedure of LE-CAS.

Clustering-Adjusted Similarity
GEDR methods often resort to a typical similarity metric to capture the similarity between samples and the structure among them. In this paper, we start with the widely used Gaussian heat kernel similarity $W_{ij}$ between $x_i$ and $x_j$ (Equation (1)), where $\sigma_m > 0$ is the Gaussian heat kernel width. $W_{ij} \in (0, 1]$, and $W_{ii} = 1$. $W \in \mathbb{R}^{n \times n}$ can be viewed as the weighted adjacency matrix of a graph, which stores the pairwise similarity between the $n$ samples. This way of graph construction has proved to be a simple and effective solution in previous studies [13]. However, pairs of instances that are located close to the decision boundary but come from different classes have a high similarity (see the two instances in different clusters but with a high similarity of 0.9674 in Figure 1(a)), which misleads the embedding into placing them close together in the projected subspace. To mitigate this problem, we aim to reduce the similarity between instances that are close to the decision boundary and come from different clusters, while retaining the high similarity between samples of the same cluster. To reach that target, we apply CAS as described below.
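As an illustration of this graph construction, the following minimal sketch computes a Gaussian heat kernel similarity matrix with NumPy. The exact scaling inside the exponential is an assumption: the sketch uses $\exp(-\|x_i - x_j\|^2 / \sigma_m)$, and the function name `heat_kernel_similarity` is hypothetical.

```python
import numpy as np

def heat_kernel_similarity(X, sigma_m):
    """Pairwise Gaussian heat kernel similarity for the rows of X.

    Assumed form (not confirmed by the text): W_ij = exp(-||x_i - x_j||^2 / sigma_m).
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b, clipped at 0
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-sq_dists / sigma_m)
    np.fill_diagonal(W, 1.0)  # W_ii = 1 by definition
    return W
```

With this form, every entry lies in (0, 1] and the matrix is symmetric, which matches the properties of $W$ stated above.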
CAS is based on the idea that samples in the same cluster are similar and those in different clusters should be as dissimilar as possible. To explore the clusters of the original instances, we perform kernel k-means clustering on all instances. Suppose the set of final clusters is $C = \{C_1, C_2, \ldots, C_{k_1}\}$, the centroid of the $h$-th cluster is $u_h$, and $|C_h|$ is the number of instances placed into the $h$-th cluster. Then the similarity between two cluster centroids is defined analogously, based on the Gaussian heat kernel: $S_{h_1 h_2}$ is the Gaussian heat kernel similarity between $u_{h_1}$ and $u_{h_2}$. Obviously, the similarity between two cluster centroids is small when the Euclidean distance between them is large.
To facilitate the clustering-based adjustment, we define $l$ as follows: $l \in \mathbb{R}^{n \times d}$ is the label indicator matrix, in which each row corresponds to an instance and each column to a distinct label; here $d$ represents the number of clusters over all instances. If instance $x_i$ belongs to the $h$-th cluster, $l_i = h$; otherwise, $l_i = 0$. Finally, the original Gaussian heat kernel similarity matrix $W$ is adjusted into $\tilde{W}$ as follows:

$$\tilde{W}_{ij} = S_{l_i l_j} W_{ij}$$

From the definitions of $l$ and $S$, we can see that for two samples ($i$ and $j$) placed into the same cluster, $\tilde{W}_{ij} = W_{ij}$, since $l_i = l_j$ and $S_{l_i l_j} = 1$. On the other hand, for two samples from different clusters, $\tilde{W}_{ij}$ shrinks to $S_{l_i l_j} W_{ij}$, since $l_i \neq l_j$ and $S_{l_i l_j} < 1$.
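The adjustment itself is a single element-wise product. The sketch below is a minimal illustration under stated assumptions: cluster labels are 0-based integers (unlike the 1-based $l_i$ in the text), $S$ has ones on its diagonal, and the function name `clustering_adjusted_similarity` is hypothetical.

```python
import numpy as np

def clustering_adjusted_similarity(W, S, labels):
    """Scale each W_ij by the similarity between the clusters of i and j.

    W: (n, n) instance similarity matrix.
    S: (k, k) centroid similarity matrix, assumed to satisfy S_hh = 1.
    labels: length-n integer cluster indices in 0..k-1 (0-based by assumption).
    """
    W = np.asarray(W, dtype=float)
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    # S[np.ix_(labels, labels)][i, j] equals S[labels[i], labels[j]]
    return S[np.ix_(labels, labels)] * W
```

Pairs within one cluster keep their similarity because the corresponding diagonal entry of `S` is 1, while pairs across clusters are scaled down by the centroid similarity.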

Laplacian Eigenmap-Based Clustering-Adjusted Similarity
Based on the clustering-adjusted similarity, we present the procedure of LE-CAS as follows, and the influence of our improved method is also shown in Figure 1.
1. Carry out kernel k-means clustering on the original dataset $X$, so that the original data are clustered into $k_1$ clusters. During the clustering, we first use a nonlinear mapping function $\phi$ to map the instances from the original space $\mathbb{R}^{n \times d}$ to a higher-dimensional feature space $F$, and then perform clustering in this space. The instances of the original space become $\phi(x_1), \phi(x_2), \ldots, \phi(x_n)$. On this basis, kernel clustering minimizes the criterion function $J = \sum_{i=1}^{k_1} \sum_{x \in C_i} \|\phi(x) - m_i^{\phi}\|^2$, where $m_i^{\phi}$ is the class-mean of the $i$-th cluster. After clustering, we obtain the $k_1$ cluster groups $C_1, C_2, \ldots, C_{k_1}$.
2. Construct a graph with the edge weight between $x_i$ and $x_j$ specified as in Equation (1) ($x_i$ and $x_j$ are instances in $X$). Set up edges between each point and its $k_2$ nearest points via the kNN method, where $k_2$ is a preset value. This graph continues to be used in the following steps.
3. Determine the weights between points. Unlike the typical LE method, the adjusted weight matrix $\tilde{W}$, calculated according to Equation (11), is selected as the final weight matrix of the previous graph. Here $\sigma_m > 0$ is the Gaussian heat kernel width, $u_{h_1}$ and $u_{h_2}$ represent the centroids of the clusters to which $x_i$ and $x_j$ belong, and $l_i$, $l_j$ are the cluster indicators.
4. Construct the graph Laplacian matrix $L = D - \tilde{W}$, where $D$ is a diagonal matrix whose $(i, i)$-element equals the sum of the $i$-th row of $\tilde{W}$.
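The graph steps above, followed by the standard Laplacian Eigenmaps eigen-decomposition of the generalized problem $Ly = \lambda Dy$, can be sketched with NumPy as follows. This is an illustration under assumptions rather than the authors' implementation: neighbors are chosen by largest original similarity (equivalent to smallest Euclidean distance for a heat kernel), the kNN graph is symmetrized, the graph is assumed connected so that exactly one eigenvalue is zero, and the function name `laplacian_eigenmaps` is hypothetical.

```python
import numpy as np

def laplacian_eigenmaps(W, W_adj, n_neighbors, n_components):
    """Embed n samples using a kNN graph weighted by the adjusted similarity.

    W: (n, n) original similarity, used only to pick neighbors.
    W_adj: (n, n) adjusted similarity, used as edge weights.
    Returns the (n, n_components) embedding and the matching eigenvalues.
    """
    W = np.asarray(W, dtype=float)
    W_adj = np.asarray(W_adj, dtype=float)
    n = W.shape[0]

    # kNN graph: each sample is linked to its n_neighbors most similar samples
    sim = W.copy()
    np.fill_diagonal(sim, -np.inf)  # never select a sample as its own neighbor
    nn_idx = np.argsort(-sim, axis=1)[:, :n_neighbors]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), n_neighbors), nn_idx.ravel()] = True
    mask |= mask.T  # symmetrize (an assumption of this sketch)

    # Edge weights come from the adjusted similarity
    A = np.where(mask, W_adj, 0.0)

    # Graph Laplacian L = D - A
    deg = A.sum(axis=1)
    L = np.diag(deg) - A

    # Solve L y = lambda D y through the symmetric form D^-1/2 L D^-1/2
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # ascending eigenvalues

    # Skip the first (zero) eigenvalue, keep the next n_components
    keep = slice(1, n_components + 1)
    return d_inv_sqrt[:, None] * eigvecs[:, keep], eigvals[keep]
```

Rescaling the eigenvectors of the symmetric form by $D^{-1/2}$ yields solutions of the generalized problem that satisfy $Y^T D Y = I$.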

Datasets
We conduct experiments on ten publicly available UCI datasets to quantitatively evaluate the performance of LE-CAS. The statistics of these datasets are listed in Table 1. These datasets have different numbers of features and samples. Msplice is a dataset on molecular biology splice online learning, which is used for multiclass clustering. The W1a dataset records information from a TV series called W1A. Soccer-sub1 stores information on players registered in FIFA. Madelon is an artificial dataset that was part of the NIPS 2003 feature selection challenge; it is a two-class classification problem with continuous input variables. FG-NET is a dataset used for face recognition. ORL contains 400 images of 40 different people and was created by the Olivetti Research Laboratory in Cambridge, England, between April 1992 and April 1994. Musk describes a set of 102 molecules, of which 39 are judged by human experts to be musks and the remaining 63 are judged to be non-musks. CNAE-9 contains 1080 documents of free-text business descriptions of Brazilian companies, categorized into a subset of 9 categories cataloged in a table called the National Classification of Economic Activities. SECOM contains data from a semiconductor manufacturing process. DrivFace contains image sequences of subjects while driving in real scenarios. All datasets are available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Comparing Methods
We compare LE-CAS against four different GEDR methods: the original LE [21], GGLE [27], S-LE [28] and SCPLE [29]. The latter three comparing methods try to improve LE from different aspects and are closely connected with our work. These comparing methods were introduced in Section 1. We first apply these dimensionality reduction methods to project the high-dimensional samples into a low-dimensional subspace. After that, the widely adopted k-means clustering [22] is applied to cluster the samples projected into the subspace by the respective comparing method.
In the experiments, we specify (or optimize) the input parameters of these comparing methods as the authors suggested in the original papers. As for LE-CAS, $k_1$ is determined by calculating the silhouette coefficient to obtain a better initial clustering, and $k_2 = 10$ is used to construct the kNN graph. The sensitivity to $k_2$ of the kNN graph-based methods (LE [21], GGLE [27], and LE-CAS) and to $\sigma_m$ of the Gaussian heat kernel function-based methods (GGLE [27], LE-CAS) will be studied later.

Evaluation Metrics
To evaluate the performance of different dimensionality reduction methods, we adopt three frequently used clustering evaluation indices: the Fowlkes and Mallows Index (FMI) [32], F-measure [33] and Purity (PU) [34].
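For concreteness, the three indices can be sketched in plain Python. The definitions used here are assumptions based on common usage: FMI and F-measure are computed from pair counts (pairs grouped together or apart in the predicted and reference clusterings), the F-measure is the pairwise $F_1$ variant, and Purity takes the majority reference class in each predicted cluster. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def pair_counts(pred, true):
    """Count sample pairs by agreement: a (same/same), b (same/different),
    c (different/same), d (different/different), as predicted/reference."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = true[i] == true[j]
        if same_pred and same_true:
            a += 1
        elif same_pred:
            b += 1
        elif same_true:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fmi(pred, true):
    """Fowlkes and Mallows Index: geometric mean of pairwise precision and recall."""
    a, b, c, _ = pair_counts(pred, true)
    if a == 0:
        return 0.0
    return sqrt((a / (a + b)) * (a / (a + c)))

def pairwise_f_measure(pred, true):
    """Harmonic mean of pairwise precision and recall (assumed F1 variant)."""
    a, b, c, _ = pair_counts(pred, true)
    if a == 0:
        return 0.0
    precision, recall = a / (a + b), a / (a + c)
    return 2 * precision * recall / (precision + recall)

def purity(pred, true):
    """Fraction of samples that carry the majority reference label of their cluster."""
    clusters = {}
    for p, t in zip(pred, true):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in clusters.values()) / len(pred)
```

All three values lie in [0, 1], and higher values indicate closer agreement with the reference labels.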
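For dataset $X = [x_1; x_2; \ldots; x_n]$, it is assumed that the clustering obtained through clustering is $C = \{C_1, C_2, \ldots, C_m\}$, and the clustering given by the reference model is $C^*$. Accordingly, let $\lambda$ and $\lambda^*$ respectively represent the cluster marker vectors corresponding to $C$ and $C^*$. We consider the samples pairwise, and define $a = |SS|$, $b = |SD|$, $c = |DS|$ and $d = |DD|$, where

$$SS = \{(x_i, x_j) \mid \lambda_i = \lambda_j,\ \lambda^*_i = \lambda^*_j,\ i < j\}$$
$$SD = \{(x_i, x_j) \mid \lambda_i = \lambda_j,\ \lambda^*_i \neq \lambda^*_j,\ i < j\}$$
$$DS = \{(x_i, x_j) \mid \lambda_i \neq \lambda_j,\ \lambda^*_i = \lambda^*_j,\ i < j\}$$
$$DD = \{(x_i, x_j) \mid \lambda_i \neq \lambda_j,\ \lambda^*_i \neq \lambda^*_j,\ i < j\}$$

FMI is then defined as

$$\mathrm{FMI} = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}$$

F-measure is based on the precision and recall of the clustering results. Purity represents the proportion of correctly clustered samples among all samples, where $N$ represents the total number of samples, $C = \{C_1, C_2, \ldots, C_m\}$ is the collection of clusters, $C_h$ represents the $h$-th cluster, $X = \{X_1, X_2, \ldots, X_n\}$ is the collection of samples, and $X_i$ represents the $i$-th sample.

Results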

Results on Different Datasets
Figure 2 (FMI), Figure 3 (F-measure) and Figure 4 (PU) show the results with respect to different fixed target dimensionalities after applying the different dimensionality reduction solutions followed by the same k-means clustering. To avoid random effects, we repeated the experiment 30 times and calculated the mean and variance of the results of LE-CAS and the comparing methods with respect to the evaluation metrics on the respective datasets, which are shown in Table 2.
According to the table and figures, we can observe that the performance of all methods increases as the target dimensionality rises, and LE-CAS always outperforms the other comparing methods across all the evaluation metrics and six datasets. LE-CAS and LE both use Laplacian Eigenmaps, while LE-CAS achieves better performance than LE on the evaluation indices. That is because LE-CAS changes the way similarity is calculated for instances near the decision boundary, so that adjacent instances remain close enough and the clusters they belong to remain the same after dimensionality reduction. This fact indicates that the clustering-adjusted similarity (CAS) can improve the performance of GEDR methods.
LE-CAS and GGLE both use Gaussian kernel functions (LE-CAS calculates the similarity between clusters and between instances by the Gaussian kernel function, whereas GGLE measures the similarity between instances via different generalized Gaussian kernel functions), while the performance of GGLE is much lower than that of LE-CAS. This fact shows that the CAS used by LE-CAS can better reflect the real relationship between instances than the generalized Gaussian kernel functions used by GGLE.
The performance of LE-CAS is better than that of S-LE and SCPLE, which use the ratio of Euclidean distance to geodesic distance between two instances to adjust the number of nearest neighbors. That is because the clustering-adjusted similarity can better explore the geometric structure of instances than adaptive adjustment of neighborhood parameters. SCPLE performs better than S-LE, which verifies that SCPLE improves on S-LE by taking into account the neighborhood parameters of instances from the same or different clusters.
As for stability, LE-CAS performs well: its variance is basically stable and its clustering performance remains stable across different target dimensions, while the variance of the other improved LE methods fluctuates and their clustering performance varies considerably across different target dimensions. That is because the similarity without CAS can be easily destroyed by noisy features. After detailed observation of the results on different datasets, we found that the FMI results on several datasets are much lower than on others (see the FMI values on the Soccer-sub1, ORL and DrivFace datasets). Since FMI is an evaluation index that measures the consistency between the clustering results and the real labels, as mentioned before, there are good reasons to believe that the clustering on these datasets is imperfect, which means that the clusters identified by the clustering method used in LE-CAS (kernel k-means) do not match the true decision boundaries well. Fortunately, even though the clusters we obtained are not fully satisfactory, the performance of LE-CAS is still better than that of GGLE, S-LE, SCPLE and LE, and we can still obtain relatively good results.

Parameter Sensitivity Analysis
Because we adopt kernel k-means only as a data preprocessing step in our approach, we determine the value of $k_1$ as follows: we vary $k_1$ from 2 to 10, repeat kernel k-means several times for each $k_1$ value (to avoid local optima), and calculate the average silhouette coefficient. Finally, the $k_1$ corresponding to the maximum silhouette coefficient is selected as the final number of clusters.
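The selection rule can be sketched as follows, assuming the standard silhouette definition with Euclidean distances and assuming that candidate labelings for each $k_1$ have already been produced by some clustering routine. The function names `mean_silhouette` and `select_k1` are hypothetical, and every labeling must contain at least two clusters.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette coefficient over all samples (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # the silhouette of a singleton cluster is taken as 0
        # Mean distance to the other members of the own cluster
        a = dists[i, own].sum() / (n_own - 1)
        # Smallest mean distance to any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

def select_k1(X, labelings):
    """Pick the candidate k1 whose labeling has the largest mean silhouette.

    labelings: dict mapping each candidate k1 to cluster labels for X.
    """
    return max(labelings, key=lambda k: mean_silhouette(X, labelings[k]))
```

In practice, `labelings[k]` would hold the best of several clustering runs for each candidate $k_1$, so that a poor local optimum does not distort the comparison.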
LE-CAS, LE, and GGLE make use of a kNN graph to set up the adjacency matrix and the weights between samples for graph-based dimensionality reduction. To study the sensitivity of these methods to the input value of $k_2$, we increase $k_2$ from 5 to 12, and report the FMI values of these algorithms under each input value of $k_2$ (number of neighbors) in Figure 5. From this figure, we can observe that no matter how the $k_2$ value changes, the results of LE-CAS on these datasets are better than those of the other two comparing algorithms. Fluctuations mainly occur in LE and GGLE, while LE-CAS produces a relatively smooth curve. GGLE adopts the kNN graph constructed in the original high-dimensional space to explore the local geometric structure of samples. It uses the $k_2$ nearest neighbors of a sample to seek the linear relationships among these neighbors. Therefore, it still focuses on the local manifold, and is sensitive to $k_2$. As $k_2$ increases, the linear relationship between them becomes increasingly complicated, which increases the error rate. LE and LE-CAS also adopt the kNN graph, but LE-CAS additionally uses the global cluster structure of samples to adjust the similarity between neighborhood samples. The performance margin between LE-CAS and LE, and the better stability of LE-CAS, demonstrate the effectiveness of the clustering-adjusted similarity. From these results, we can conclude that LE-CAS is robust to noise and can work well under a wide range of input values of $k_2$.
Both LE-CAS and GGLE specify edge weights by the Gaussian heat kernel function. From Equation (11), we can see that they both rely on a suitable Gaussian heat kernel width $\sigma_m$, which should be neither too small nor too large. If $\sigma_m$ is too small, the similarity in Equation (11) will be close to 0; if $\sigma_m$ is too large, the similarity in Equation (11) will be close to 1. In our previous experiments, we set $\sigma_m$ to the mean of the squared Euclidean distances between all training instances for both LE-CAS and GGLE. To investigate the sensitivity of LE-CAS and GGLE to $\sigma_m$, we increase $\sigma_m$ from $1 \times 10^{-2}$ to $1 \times 10^{5}$ for both methods. Other parameter settings are kept the same as in the previous experiments. Similarly, we run 30 independent experiments for each fixed $\sigma_m$ and report the FMI under each $\sigma_m$ in Figure 6. From Figure 6, we can see that LE-CAS outperforms GGLE over a wide range of $\sigma_m$. Both GGLE and LE-CAS depend on a suitable $\sigma_m$. The performance of the two methods becomes relatively stable as $\sigma_m$ increases. The FMI of LE-CAS is similar to that of GGLE when $\sigma_m$ is too small. This is because, if $\sigma_m$ is too small, the clustering-adjusted similarity is similar to the original Gaussian heat kernel similarity. The performance of both LE-CAS and GGLE becomes relatively stable when $\sigma_m \geq 100$ in our experiments, and LE-CAS performs significantly better than GGLE. The reason is that the clustering-adjusted similarity takes effect. From the above results, we can conclude that LE-CAS effectively improves the performance of GEDR methods and achieves stable performance over a wide range of $\sigma_m$.

Robustness Analysis
Since our method combines clustering with dimensionality reduction, the performance of LE-CAS may seem to depend heavily on the clustering method used in the earlier steps. To examine the robustness of our method, we employed two different clustering methods, k-means and kernel k-means, as the initial clustering method of LE-CAS. The performance of these variants is shown in Figure 7. From Figure 7, we can see that the performances of the two clustering methods are very close to each other on the ten datasets and each variant performs better than the LE method, which indicates that the performance of LE-CAS is not influenced much by the clustering method used in the initial steps.
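For reference, kernel k-means differs from k-means only in that distances to cluster means are evaluated in the feature space through a kernel matrix. The sketch below is a minimal illustration under assumptions: the kernel matrix `K` is precomputed, the initial labels are supplied by the caller, and the function name `kernel_kmeans` is hypothetical. With a linear kernel ($K = XX^T$) it reduces to ordinary k-means.

```python
import numpy as np

def kernel_kmeans(K, init_labels, n_clusters, max_iter=100):
    """Kernel k-means on a precomputed (n, n) kernel matrix K.

    Uses ||phi(x_i) - m_h||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl,
    where j and l range over the current members of cluster h.
    """
    K = np.asarray(K, dtype=float)
    labels = np.asarray(init_labels).copy()
    n = K.shape[0]
    for _ in range(max_iter):
        dist = np.full((n, n_clusters), np.inf)
        for h in range(n_clusters):
            members = labels == h
            size = members.sum()
            if size == 0:
                continue  # an empty cluster stays unreachable
            cross = K[:, members].sum(axis=1) / size
            within = K[members][:, members].sum() / size ** 2
            dist[:, h] = np.diag(K) - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

Swapping the kernel matrix is then enough to switch between the two initial clustering methods compared in this section.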

Conclusions
In this paper, we introduced Laplacian Eigenmaps dimensionality reduction based on Clustering-Adjusted Similarity (LE-CAS), which leverages the local manifold structure and the global cluster structure to adjust the similarity between neighborhood samples. In particular, the adjusted similarity can reduce the similarity between pairwise samples from different clusters while maintaining the similarity between samples of the same cluster. Experimental results on public benchmark datasets show that the clustering-adjusted similarity improves the performance of classical LE and outperforms other related competitive solutions.
In our future work, we want to explore other principled ways to refine the similarity between instances and further improve the performance of GEDR methods. In addition, we will pay more attention to weakly supervised graph-based dimensionality reduction. Moreover, for problems arising near the decision boundary, fuzzy set theory may also help, and we intend to compare our method with fuzzy set theory in the future.
(a) Euclidean distance-based similarity between samples (b) Cluster-adjusted similarity between samples

Figure 1. Comparisons between two types of similarities between samples, shown as (a) and (b). The clustering-adjusted similarity clearly tunes down the Euclidean-based similarity between two instances from different clusters.

5. The objective function of the Laplacian Eigenmaps optimization is as follows:

$$\min \sum_{i,j} \|y_i - y_j\|^2 \tilde{W}_{ij}$$

where $y_i$ and $y_j$ are the representations of instances $i$ and $j$ in the target $c$-dimensional subspace $Y$. After transformation, the objective function becomes:

$$\min \operatorname{trace}(Y^T L Y), \quad \text{s.t. } Y^T D Y = I \quad (14)$$

where the constraint $Y^T D Y = I$ guarantees that the optimization problem has solutions.
6. Perform the feature mapping by calculating the eigenvectors and eigenvalues of the generalized eigenvalue problem

$$L y = \lambda D y \quad (15)$$

The column vectors of $Y$ that minimize the objective are the eigenvectors corresponding to the $c$ smallest non-zero eigenvalues (including multiple roots) of this problem, and they are used as the output after dimensionality reduction.

$SS$ contains the sample pairs that belong to the same cluster in $C$ and also belong to the same cluster in $C^*$; $SD$ contains the sample pairs that belong to the same cluster in $C$ but not to the same cluster in $C^*$; $DS$ contains the sample pairs that do not belong to the same cluster in $C$ but belong to the same cluster in $C^*$; and $DD$ contains the sample pairs that belong to the same cluster in neither $C$ nor $C^*$. $a$, $b$, $c$ and $d$ represent the numbers of sample pairs in the sets $SS$, $SD$, $DS$ and $DD$, respectively. Thus, FMI can be computed from $a$, $b$ and $c$.

Table 1. Statistics of datasets used for experiments.