A Nonparametric Model for Multi-Manifold Clustering with Mixture of Gaussians and Graph Consistency

Multi-manifold clustering is among the most fundamental tasks in signal processing and machine learning. Although existing multi-manifold clustering methods are quite powerful, learning the cluster number automatically from data remains a challenge. In this paper, a novel unsupervised generative clustering approach within the Bayesian nonparametric framework is proposed. Specifically, our manifold method automatically selects the cluster number with a Dirichlet Process (DP) prior. Then, a DP-based mixture model with a constrained Mixture of Gaussians (MoG) is constructed to handle manifold data. Finally, we integrate our model with a k-nearest neighbor graph to capture the manifold geometric information. An efficient optimization algorithm is also derived to perform model inference and learning. Experimental results on synthetic datasets and real-world benchmark datasets demonstrate the effectiveness of the new DP-based manifold method.


Introduction
Over the past decades, clustering has been among the most fundamental tasks in computer vision and data mining applications [1,2], e.g., image/motion segmentation [3,4], community detection [5], feature selection [6] and biological/network information analysis [7,8]. However, most conventional clustering methods assume that data samples are scattered in the feature space, ignoring the intrinsic underlying structure that many real datasets possess [3,9]. To overcome this problem, various manifold-based clustering (multi-manifold clustering) methods have been proposed and developed. Compared to conventional clustering, which regards a cluster as a set of points with small mutual distances or as a dense area of the feature space, the multi-manifold approach aims to gather the given data points into disparate groups that come from different underlying submanifolds [10].
Unlike the conventional clustering methods [11,12], multi-manifold clustering can be classified into two categories, linear methods and nonlinear methods [13]. In the first category, linear methods (also known as subspace clustering) assume that the underlying clusters can be well approximated by a union of low-dimensional linear manifolds [14]. For example, Gholami [14] and Vidal [15] used a linear function to fit each underlying submanifold and separated the clusters with a mixture model. Sparse Subspace Clustering (SSC)- [16], Low-Rank Representation (LRR)- [17] and Least Squares Regression (LSR)-based [18] methods approach the linear manifold clustering problem by finding a sparse representation of each point in terms of the other data points. After forming a similarity graph with the learned sparse representation, spectral clustering methods are used to partition the data into distinct clusters. As an extension of the linear multi-manifold clustering framework, nonlinear algorithms can be applied to both linear and nonlinear manifolds. For example, K-manifold clusters nonlinear subspace datasets by extending conventional K-means with the geodesic distance [19]. Spectral Multi-Manifold Clustering (SMMC) integrates local geometric information within the subspace clustering framework to handle manifold structure [13]. Multi-Manifold Matrix Decomposition for Co-clustering (M3DC) handles manifold datasets by considering the geometric structures of both the sample manifold and the feature manifold simultaneously [20]. Recently, the state-of-the-art methods are arguably deep subspace clustering approaches, which combine deep networks with conventional subspace clustering [21,22].
However, a drawback of most conventional manifold clustering methods is that the clustering accuracy depends on the cluster number, which is usually unavailable in advance [23]. To overcome this model selection problem, one widely-studied strategy is to equip conventional methods with a Dirichlet process prior, e.g., Dirichlet Process Mixture (DPM) models [24,25], Multilevel Clustering with Context (MC2) [26] and Dirichlet Process Variable Clustering (DPVC) [27]. Since the distributions adopted in these nonparametric models are defined in the Euclidean space, such conventional Dirichlet process clustering methods have difficulty dealing with manifold data. To overcome this problem, many manifold DP clustering models have been proposed. Wang [28] and Gholami [14] assumed that each submanifold is linear and can be fitted with a hyperplane. Straub et al. [29,30] defined a Gaussian distribution on the sphere surface and introduced an auxiliary indicator vector z with a DP prior. Beyond the sphere manifold, Simo et al. [31] extended the distribution to the manifold space with logarithmic and exponential mappings. Although these models are quite powerful and have been widely studied in many applications, they have drawbacks when the manifold structure is not prespecified [31]. For example, the DP-space and temporal subspace clustering models extend linear manifold clustering and therefore lack the capability to handle nonlinear manifold datasets.
In the geodesic mixture model, the logarithmic and exponential mapping algorithms [32,33] depend mainly on a pre-defined geometric structure, which is often unavailable. As for the sphere mixture model, it has not been extended from the sphere to arbitrary manifolds [31].
In this paper, we investigate a manifold clustering method that requires neither a prespecified manifold structure nor a prespecified cluster number, within the DPM framework. In order to model the complicated manifold cluster distributions, we integrate the original DPM with the conventional Mixture of Gaussians (MoG) [34,35] to handle the manifold distribution (Figure 1a). Furthermore, we notice that an unconstrained MoG distribution fails to capture the manifold geometrical information (Figure 1b). Inspired by previous studies [23,36], we regularize our model with a k-nearest neighbor graph. To form a meaningful cluster, in which samples from the same cluster are close and related, we constrain the MoG means with a Mahalanobis distance. The main contributions are as follows:

• A constrained MoG distribution is applied to model the non-Gaussian manifold distribution.

• We integrate graph theory with DPM to capture the manifold geometrical information.

• A variational inference-based optimization framework is proposed to carry out the model inference and learning.
The organization of our paper proceeds as follows. In Section 2, we review the background knowledge of the Dirichlet process mixture model. Simultaneously, we present the generation procedure of the proposed manifold Dirichlet process mixture model and give the variational expectation maximization inference algorithm. Experimental comparisons will be presented in Section 3. In Section 4, we give the detailed discussions and present the limitations and advantages. Section 5 concludes the paper.

Materials and Methods
In this section, we first review the basic concept of the Dirichlet Process Mixture (DPM) model. Then, we propose the multi-manifold clustering method by equipping DPM with MoG and a k-nearest neighbor graph. In our method, the Dirichlet process is used to generate a suitable cluster number, while MoG and the k-nearest neighbor graph are applied to model the non-Gaussian manifold distribution and capture the manifold geometric information. Finally, variational inference is derived to perform model inference and learning.

Dirichlet Process Mixture Model
The Dirichlet Process Mixture (DPM) model extends the mixture model by introducing a Dirichlet process prior within the Bayesian framework. In DPM, we first sample a prior distribution G from the Dirichlet process and then sample the likelihood parameters {θ_n}_{n=1}^N from G. With the sampled likelihood parameters, each observation x_n is generated from the likelihood distribution F(x|θ_n). This procedure can be summarized as follows:

G ~ DP(α, G_0),    θ_n ~ G,    x_n ~ F(x|θ_n),    n = 1, ..., N,    (1)

where F(x|θ_n) is a likelihood distribution, G_0 is a base distribution and x_n is an observation sample. By integrating out G, the joint distribution of the likelihood parameters {θ_n}_{n=1}^N exhibits a clustering effect. Suppose that we have N − 1 parameters {θ_n}_{n=1}^{N−1} sampled from the Dirichlet process. We then have the following predictive distribution for the N-th value of θ:

p(θ_N | θ_1, ..., θ_{N−1}) = α/(α + N − 1) · G_0 + Σ_{i=1}^{I} n_i/(α + N − 1) · δ(θ_i*),    (2)

where n_i denotes the number of occurrences of the unique value θ_i* in {θ_n}_{n=1}^{N−1}, δ(·) represents the delta function and I denotes the number of unique values in {θ_n}_{n=1}^{N−1}.

Equation (2) reveals that a new sample θ_N is either generated from the base distribution G_0 with probability α/(α + N − 1) or takes one of the existing values θ_i* with probability n_i/(α + N − 1).
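The predictive rule of Equation (2) can be simulated directly: each new sample joins an existing cluster in proportion to its size or opens a new one with probability proportional to α. The sketch below (function name and structure are ours, not from the paper) draws cluster assignments from the implied Chinese restaurant process.

```python
import random
from collections import Counter

def crp_assignments(n_samples, alpha, rng=None):
    """Sample cluster assignments following Equation (2): sample n joins
    existing cluster i with probability n_i / (alpha + n - 1) or starts a
    new cluster with probability alpha / (alpha + n - 1)."""
    rng = rng or random.Random(0)
    assignments = [0]                       # first sample opens cluster 0
    for n in range(1, n_samples):
        counts = Counter(assignments)
        clusters = sorted(counts)
        # unnormalized weights: existing cluster sizes, then alpha for a new one
        weights = [counts[c] for c in clusters] + [alpha]
        choice = rng.choices(clusters + [len(clusters)], weights=weights)[0]
        assignments.append(choice)
    return assignments

labels = crp_assignments(100, alpha=2.0)
print(len(labels))  # 100
```

The number of distinct clusters grows slowly with the sample size, which is the clustering effect that lets DPM select the cluster number from data.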

Our Proposed Method
In this section, we extend the original DPM model to a multi-manifold clustering framework named the Similarity Dirichlet Process Mixture (SimDPM) model. The main notations and descriptions used in our method are summarized in Table 1. As discussed above, DPM is unable to model manifold datasets, since the conventional likelihood distribution F(x|θ_n) is defined in the Euclidean space or on a prespecified manifold. To overcome this problem, we approximate the manifold distribution with MoG (Figure 1a). We then construct the sample generation process in two phases, a single Gaussian distribution and a mixture of Gaussians distribution. The reason we generate the data with both the single Gaussian distribution and the MoG distribution is that simple submanifolds and non-manifold clusters can already be modeled by a single Gaussian distribution.
Suppose that we are given N observation samples X = {x_n}_{n=1}^N, where x_n ∈ R^D. Given the additional parameters of the MoG distribution, we assume the following generative process. For every data point i:

v_k ~ Beta(1, α),    k = 1, 2, ...,
z_i ~ mult(π(v)),
x_i ~ F(x | θ_{z_i}),

where Beta(1, α) is a beta distribution with parameters one and α, mult(π(v)) is a categorical distribution parameterized by π(v), and v and π(v) are vectors with π_k(v) = v_k ∏_{j=1}^{k−1} (1 − v_j). To form a meaningful cluster (samples from the same cluster are closely related) and respect the manifold geometrical information, we constrain each MoG mean ũ_{k,m} with a Mahalanobis distance to its predecessor ũ_{k,m−1}, and we use a k-nearest neighbor graph to regularize the posterior probability, inspired by [23], in which the graph Laplacian captures the geometric information missed by the MoG distribution:

R = Σ_k p(k|X)^T L p(k|X),

where p(k|X) = {p(k|x_n)}_{n=1}^N is the posterior probability and L is the graph Laplacian constructed from the k-nearest neighbor graph [13]. Note that the constraint on ũ_{k,m} depends only on the previous ũ_{k,m−1}, not on ũ_{k,m+1}. Below, we characterize the k-nearest neighbor graph L_k. Given the unlabeled data X, for any point x_i, we sort the remaining samples and select the top-k nearest neighbors. If node x_j is among the top-k nearest points of node x_i, we set (L_k)_{ij} = 1; otherwise, (L_k)_{ij} = 0. We then define the Laplacian as L = D_k − L_k, where D_k is a diagonal matrix whose entries are the column (or row, since the symmetrized L_k is symmetric) sums of L_k. For convenience, the neighbor number used in our graph is denoted as r.
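The graph construction above can be sketched in a few lines. The snippet below (names are ours) builds the k-nearest neighbor adjacency and the unnormalized Laplacian L = D_k − L_k; we symmetrize the adjacency explicitly, which the text implies by referring to a symmetric matrix.

```python
import numpy as np

def knn_graph_laplacian(X, r):
    """Build a symmetric r-nearest-neighbor adjacency and its unnormalized
    graph Laplacian L = D - A, as used to regularize p(k|X).
    X is (N, D); r is the neighbor count."""
    N = X.shape[0]
    # pairwise squared Euclidean distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)            # exclude self-neighbors
    A = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(sq[i])[:r]        # top-r nearest neighbors of x_i
        A[i, nbrs] = 1.0
    A = np.maximum(A, A.T)                  # symmetrize the adjacency
    D = np.diag(A.sum(axis=1))              # degree matrix
    return D - A                            # graph Laplacian

Lap = knn_graph_laplacian(np.random.RandomState(0).randn(30, 2), r=5)
print(np.allclose(Lap.sum(axis=1), 0))  # True: Laplacian rows sum to zero
```

The quadratic form p^T L p then penalizes posterior probabilities that differ across neighboring samples, which is how the graph injects the manifold geometry.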

Variational Expectation Maximization Inference
Our scheme for estimating the data cluster depends mainly on our capability to infer the posterior distribution. We solve this using variational expectation maximization inference.
Unlike the conventional expectation maximization algorithm, the posterior probability in our model is estimated via variational inference; we then optimize the MoG parameters by maximizing the lower bound with the variational parameters fixed. Following the general variational inference framework, we first give the Evidence Lower BOund (ELBO) for SimDPM with the truncated stick-breaking process (in which the maximum cluster number ∞ is truncated to K) [37].
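The truncated stick-breaking construction used here is easy to sample directly. The sketch below (a minimal illustration, names ours) draws the weights π(v) with v_k ~ Beta(1, α) and fixes the last stick to one so that the K truncated weights sum to one.

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng=None):
    """Draw mixture weights pi(v) from a stick-breaking process truncated
    at K: v_k ~ Beta(1, alpha), pi_k = v_k * prod_{j<k}(1 - v_j), with
    v_K set to 1 so the remaining stick is fully assigned."""
    rng = rng or np.random.default_rng(0)
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                      # truncation: last stick takes the rest
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

pi = truncated_stick_breaking(alpha=20.0, K=30)
print(abs(pi.sum() - 1.0) < 1e-9)  # True: the weights form a distribution
```

With a large α (such as the default α = 20 used in the experiments), the mass spreads over many sticks, allowing the model to activate as many of the K candidate clusters as the data support.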
Here, z^(1) = {z_n^(1)}_{n=1}^N and z^(2) = {z_n^(2)}_{n=1}^N are the indicator variables sampled from the categorical distribution p(z_n|v). Following the factorized (mean-field) family of variational inference [37], which makes the posterior distribution computable, q can be expressed as:

q(v, θ, z) = ∏_{k=1}^{K−1} q(v_k | γ_k) ∏_{k=1}^{K} q(θ_k | τ_k) ∏_{n=1}^{N} q(z_n | φ_n),

where γ, τ and Φ = {φ_n}_{n=1}^N are the variational parameters. For derivation convenience, we denote the ELBO as L(γ, τ, Φ, Θ̃). By using this inequality relaxation, learning the model and estimating the model parameters amount to maximizing the regularized objective:

arg max_{γ, τ, Φ, Θ̃}  L(γ, τ, Φ, Θ̃) − λ_R R.

We also note that, since we have truncated the maximum cluster number to K, the penalty term R becomes R = Σ_{k=1}^K p(k|X)^T L p(k|X).

Variational E-step: In the variational inference framework, the variational parameters are estimated by maximizing the lower bound of the likelihood function log p(X|α, λ) with the coordinate ascent algorithm.
For φ_{n,k} in {φ_{n,k}}_{k=1}^K, note that this is a constrained maximization, since Σ_{k=1}^K φ_{n,k} = 1, and the probability p(k|X) can be approximated by the variational parameter Φ_{·,k}. To solve this problem, we introduce an auxiliary variable A_k = Φ_{·,k}, where A_k ∈ R^N, and form the Lagrangian by isolating the terms in the ELBO that contain φ_{n,k}, where λ_L is a Lagrange multiplier and λ_R is a penalty parameter.

Fixing A_k, we update Φ_{·,k}: the updating rule for φ_{n,k} is obtained by taking the derivative. Fixing Φ_{·,k}, we then update A_k by minimizing the corresponding penalized subproblem, where λ_A is the penalty parameter. For the other variational parameters, we obtain closed-form solutions by taking the derivative of the proposed ELBO and setting it to zero; the required statistics N_k, S_k and x̄_k are estimated from the current responsibilities. For the prior parameters u_0, c_0, W_0, v_0, we set them in a non-informative manner so that they influence the inference of the variational posterior distributions as little as possible. The other variational parameters are initialized randomly.

Variational M-step: To optimize the lower bound over the parameters Θ̃, we apply the EM framework again, introducing an auxiliary posterior variable q(k, m|x_n) and Jensen's inequality [37].
This gives the following lower bound:

Σ_{n=1}^N Σ_m (1 − s) φ_{n,k} q(k, m|x_n) log [ π̃_{k,m} N(x_n | ũ_{k,m}, δ̃_{k,m}) / q(k, m|x_n) ] + C,

where C is a constant with respect to Θ̃ in L(γ, τ, Φ, Θ̃). By using this inequality relaxation, the variational M-step can be reformulated as the constrained optimization problem:

max_{Θ̃}  Σ_{n=1}^N Σ_m (1 − s) φ_{n,k} q(k, m|x_n) log [ π̃_{k,m} N(x_n | ũ_{k,m}, δ̃_{k,m}) / q(k, m|x_n) ],

subject to the mean constraint, where λ_L and H are the Lagrange multipliers. We then obtain closed-form solutions by taking the derivative of the lower bound in (14) and setting it to zero. For m greater than one, the updating rules for ũ_{k,m} and δ̃_{k,m} incorporate the constraint term; since the constraint does not exist for the components with m < 2, the updating rule for m = 1 is slightly different. The computation of π̃_{k,m} follows the standard mixture weight update, and the computation of q(k, m|x_n) is identical to that of the standard mixture of Gaussians learning algorithm [37].
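Since q(k, m|x_n) is computed as in the standard mixture of Gaussians algorithm, it can be sketched concretely. The snippet below (a minimal, log-space implementation; function and variable names are ours) computes the responsibilities of the M inner Gaussian components for one DP cluster k.

```python
import numpy as np

def mog_responsibilities(X, weights, means, covs):
    """Compute q(k,m|x_n) for one DP cluster k: standard MoG posterior
    responsibilities over M inner Gaussian components.
    X: (N, D), weights: (M,), means: (M, D), covs: (M, D, D)."""
    N, D = X.shape
    M = len(weights)
    log_r = np.empty((N, M))
    for m in range(M):
        diff = X - means[m]
        prec = np.linalg.inv(covs[m])
        # Mahalanobis distance of every sample to component m
        mahal = np.einsum('nd,de,ne->n', diff, prec, diff)
        _, logdet = np.linalg.slogdet(covs[m])
        log_r[:, m] = (np.log(weights[m])
                       - 0.5 * (D * np.log(2 * np.pi) + logdet + mahal))
    log_r -= log_r.max(axis=1, keepdims=True)   # for numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

X = np.random.default_rng(0).normal(size=(5, 2))
q = mog_responsibilities(X, np.array([0.5, 0.5]),
                         np.zeros((2, 2)), np.stack([np.eye(2)] * 2))
print(np.allclose(q.sum(axis=1), 1.0))  # True: each row is a distribution
```

Working in log space before normalizing avoids the underflow that plagues direct density products when D is moderately large, which matters here since the model is noted to be sensitive to the data dimension.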

Algorithm
The full learning and inference algorithm is summarized in Algorithm 1. The flowchart of our proposed framework is demonstrated in Figure 3. Below, we analyze the computational complexity.

Algorithm complexity: Suppose that we have N samples, each with D dimensions, and that the maximum cluster number is K. The expectation step converges after T_e iterations, and the whole algorithm converges after T iterations. From the derivation, the main computation lies in Equations (7), (8) and (11), in which we need to calculate the inverse and determinant of a matrix. Equations (7) and (11) require O(K · D^3), and Equation (8) requires O(N^3). Another major computation is Equations (16) and (17), which take O((M − 1) · K · N · D^2). Altogether, the computational complexity of the whole algorithm is O(T · (T_e · (N^3 + K · D^3) + (M − 1) · K · N · D^2)).
For the space complexity, the main cost is the variational parameters, which take O(K · D^2 + N · K). Another cost is the MoG parameters, which need O(M · K · D^2). The total space complexity is therefore O(K · D^2 + N · K + M · K · D^2).

Results
To demonstrate the usefulness of the proposed manifold model, we tested our method on both synthetic and real-world datasets and compared it with the following methods:

One category comprises methods that need no prespecified cluster number (e.g., DPM, DP-space, CFSFDP and AP); the other comprises clustering methods that require a prespecified class number: K-means, LRR [17] and LatLRR [41].
Clustering accuracy in our experiments was measured with Normalized Mutual Information (NMI) [32]. Suppose U = {U_1, U_2, U_3, ..., U_|U|} denotes the real cluster labels obtained from the ground truth and V = {V_1, V_2, V_3, ..., V_|V|} the labels obtained from a clustering algorithm, where |U| and |V| denote the cluster numbers. Then, a mutual information metric between U and V can be defined as:

MI(U, V) = Σ_{i=1}^{|U|} Σ_{j=1}^{|V|} P(i, j) log [ P(i, j) / (P(i) P(j)) ],    (20)

where P(i) and P(j) are the probabilities that a sample picked at random falls into class U_i or V_j, and P(i, j) denotes the probability that a sample falls into both classes U_i and V_j. The Normalized Mutual Information (NMI) is then defined as:

NMI(U, V) = MI(U, V) / sqrt(H(U) H(V)),    (21)

where H(U) and H(V) denote the entropies of U and V.

Experimental setup: In our experiments, we ran every algorithm 10 times and report the average accuracy. The parameters of the SimDPM algorithm were selected according to the clustering accuracy using less than 40% of the ground-truth labels. The default values for α and the maximum cluster number K were set to 20 and 30, respectively. The other variational parameters were initialized randomly, except u_k and W_k, which were initialized with the mean and covariance of the observation data. All algorithms were implemented in MATLAB R2016a on a DELL Precision workstation with 8.00 GB RAM and a Xeon(R) E3 CPU.
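The NMI metric can be computed directly from co-occurrence counts. The sketch below (names ours) assumes the common square-root normalization MI(U, V) / sqrt(H(U) H(V)); other normalizations (e.g., arithmetic mean of the entropies) exist and would give slightly different values.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information between two labelings: MI(U, V)
    normalized by sqrt(H(U) * H(V))."""
    u_vals, u_idx = np.unique(labels_true, return_inverse=True)
    v_vals, v_idx = np.unique(labels_pred, return_inverse=True)
    n = len(labels_true)
    # joint distribution P(i, j) from co-occurrence counts
    joint = np.zeros((len(u_vals), len(v_vals)))
    for a, b in zip(u_idx, v_idx):
        joint[a, b] += 1.0 / n
    pu, pv = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pu, pv)[nz])).sum()
    hu = -(pu[pu > 0] * np.log(pu[pu > 0])).sum()
    hv = -(pv[pv > 0] * np.log(pv[pv > 0])).sum()
    return mi / np.sqrt(hu * hv) if hu > 0 and hv > 0 else 0.0

print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # → 1.0
```

Note that NMI is invariant to label permutation, as the example shows: the two partitions use opposite label names but describe the same grouping, so the score is 1.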
For the original DPM, we used α = 20 and set the other variational parameters randomly. When running DP-space, we chose λ and s from the value ranges suggested in the original code, selected using 30% of the ground-truth labels. The parameters used in LatLRR and LRR were α = 1, β = 1.4 and λ = 4. For CFSFDP, we chose the decision points that were significantly different from the other points in the decision graph. For AP, we set the preference value to a scalar. Both CFSFDP and AP used the k-nearest neighbor graph as the similarity matrix.

Synthetic Dataset
In this section, we evaluate our SimDPM model on a synthetic dataset; the results are shown in Figure 4. Clearly, the data contain two patterns.
Visual comparison on the synthetic hybrid data shows that SimDPM performed better than the traditional DPM model. Figure 4b shows the result of the original variational DPM: it tended to partition the synthetic dataset in a hard manner, whereas our manifold method yielded an ideal clustering result. The reason is that our model handles the dataset with an expanded Gaussian distribution (MoG) and preserves the local geometrical structure of the data space through the k-nearest neighbor graph.

Real Dataset
(a) Motion segmentation: Motion segmentation refers to the task of separating the movements of multiple rigid-body objects in video sequences. Linear manifold clustering methods are popular for this task [42]. In our experiment, we used the Hopkins155 dataset [43] and cast it as a general multi-manifold clustering task. We show some samples in Figure 5. Following the dataset itself, we divided the universal set into checkerboard and others [43], which contain 26 and 9 subsets, respectively. We further separated the checkerboard dataset into a Linear manifold dataset (L) and a Non-Linear manifold dataset (Non-L) according to the 3D PCA projection. When applying our algorithm, we projected the point trajectories into 10D features. The clustering results and the estimated cluster numbers are presented in Tables 2 and 3. As can be seen, our proposed method performed best on the Non-L dataset. On the others and L datasets, DP-space was the best and our method was the second best. For the estimated cluster number, our model produced a suitable cluster size compared with the ground truth.

Table 3. The estimated cluster number on the Hopkins155 dataset with 3 motions. L, Linear.

Method                     Checkerboard   Others   L      Non-L
Ground truth               3.00           3.00     3.00   3.00
Estimated cluster number   3.33           3.60     3.10   3.09

(b) Coil20 image dataset: The coil20 [44] image database is a popular manifold database containing 20 objects from the Columbia University image library. Some image samples are shown in Figure 6. The images of each object were taken five degrees apart as the object was rotated on a turntable; thus, each object in coil20 has 72 images. Each image is 128 × 128, with 256 grey levels per pixel. In our experiment, each image was first represented as a 128 × 128 = 16,384-dimensional vector and then projected into a 10D feature using PCA. To test the general clustering performance, we used five coil20 subsets; for overall testing, we also used the universal dataset (Dataset 20). The clustering results and the estimated cluster numbers are shown in Tables 4 and 5. From the results, our method consistently outperformed the DP-based algorithms DP-space and DPM. Compared with the other methods, our method was the first or second best, especially among the approaches that do not need a prespecified cluster number.

(c) Swedish leaf image dataset: The Swedish dataset introduced in [45] consists of 1125 leaves of 15 species, with 75 images per species. For this dataset, we first extracted the outer contour and then obtained contour features by applying the Fourier transform [46]; every leaf in our experiments was represented as a 10-dimensional feature. Some samples are shown in Figure 7.
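The 10D PCA projection used for the trajectories and the vectorized coil20 images is a standard preprocessing step. A minimal sketch (names ours, using an SVD of the centered data rather than any particular toolbox routine):

```python
import numpy as np

def pca_project(X, d=10):
    """Project samples onto the top-d principal components.
    X is (N, D); returns the (N, d) scores along the principal axes."""
    Xc = X - X.mean(axis=0)                 # center the data
    # economy SVD: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

# e.g., 72 vectorized 128x128 images of one coil20 object reduced to 10D
imgs = np.random.default_rng(0).normal(size=(72, 128 * 128))
feats = pca_project(imgs, d=10)
print(feats.shape)  # (72, 10)
```

Reducing to 10 dimensions before clustering also mitigates the dimension sensitivity of the Gaussian components noted in the Discussion, since each component's covariance then has only 10 × 10 entries to estimate.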
Similar to the coil20 dataset, we demonstrate the efficiency on five subsets and the universal dataset. We ran every algorithm 10 times and averaged the 10 results. The experimental results are shown in Tables 6 and 7, which yield two observations: (1) compared with the original DPM, the improvement in clustering accuracy on the leaf dataset (average 0.05) was lower than the improvement on the coil20 dataset (average 0.07); (2) the estimated cluster number consistently increased as the ground-truth cluster number increased.

Table 7. The estimated cluster number on the leaf dataset.

From Tables 2-7, we can draw the following conclusions.

• The proposed method obtained the highest clustering accuracy, especially on the Non-L and coil20 datasets, compared with the methods that do not prespecify the cluster number, which validates the effectiveness of our non-linear assumption.

• DP-space performed better than our method on the L and others datasets; the reason is that DP-space makes a prior structure assumption, which introduces additional manifold geometric information.

• LRR, LatLRR and K-means outperformed our algorithm on some coil20 and leaf subdatasets; the reason is that our method must estimate the cluster number along with the clustering, which makes our algorithm harder to optimize.

• Compared to the coil20 and leaf datasets, our method achieved an all-around performance boost on the motion segmentation dataset; this is because the simple clustering task (the linear manifold has only three classes) was easy for our algorithm to optimize and model.

• Compared to the leaf dataset, our method achieved a better performance boost on the coil20 dataset. The reason is that coil20 is a well-defined manifold dataset, in which the structure among samples is easily captured by the graph Laplacian.

• Our manifold model consistently produced a suitable cluster number as the data cluster size increased, which indicates that our model can provide a flexible model size when fitting different datasets.

The Effect of the Algorithm Parameters
In this section, we investigate the effects of the parameters λ_A and s on the Non-L dataset. In these experiments, while one parameter was being tuned, the value of the other parameter was fixed. The parameters λ_A and s were selected from {1000, 100, 80, 60, 40, 20} and {1, 0.9, 0.8, 0.7, 0.6, 0.5}, respectively. We show the clustering accuracy in Figure 8.
As can be seen in Figure 8, the proposed model was sensitive to λ_A; empirically, the best clustering accuracy was achieved at λ_A = 100. We also observed that our method achieved the best clustering accuracy at s = 0.7 or 0.8, which reveals that the MoG improved the clustering accuracy. In addition, we measured the clustering accuracy for different values of M and of the neighbor number r used in the k-nearest neighbor graph; the results are shown in Figures 9 and 10. As can be seen, the clustering accuracy reached its best performance on the leaf subdataset and the coil20 dataset when the neighbor number r ∈ {5, 6, 7, 8, 9, 10}. The clustering accuracy increased with M on the subdatasets of coil20 and leaf. Unlike on the subdatasets, M and r had little effect on the full coil20 and leaf datasets; the reason is that our model is non-convex, and the more complicated datasets lead to a much harder optimization.

Discussion
Compared to the previous linear manifold and geodesic mixture models, our theoretical analysis shows that our method is free of both a prespecified manifold and a prespecified cluster number. This is because we use a DP prior to generate the cluster indicator with a suitable cluster number, and we use the MoG and the k-nearest neighbor graph to capture the submanifolds rather than relying on a predefined manifold. Additionally, compared with multi-manifold clustering methods with prespecified manifolds or cluster numbers, such as DP-space, LRR and LatLRR, our method shows superior performance on the general manifold clustering task (the coil20 and leaf datasets). This indicates that our method fills the research gap mentioned in the Introduction.
Although our method addresses the problems mentioned above (estimating the cluster number and handling the general manifold clustering task), limitations remain. Our method is not a fully Bayesian framework, so some parameters must be tuned manually, which may be unacceptable in some real applications. Moreover, the MoG and the Gaussian distribution are sensitive to the dimension of the data. In future work, we will explore a fully generative model in which the parameters are generated from Bayesian priors. Since our approach is sensitive to the dimension, we will also explore methods to integrate dimension reduction with manifold clustering.

Conclusions
In this paper, we have proposed a nonparametric generative model to handle manifold datasets without a prespecified cluster number or manifold distribution. Through theoretical and experimental analysis, we have demonstrated that MoG extends the application scope of the original DPM and can significantly improve the clustering accuracy compared to previously proposed methods. However, the proposed method only partially handles the problem stated in the Introduction, because: (1) the MoG, the mean constraint and the k-nearest neighbor graph are hard to optimize when incorporated into the DP framework, as observed on the coil20 and full leaf datasets; (2) the DP prior has limitations when generating a suitable cluster number.