Multi-Partitions Subspace Clustering

: In model based clustering, it is often supposed that only one clustering latent variable explains the heterogeneity of the whole dataset. However, in many cases several latent variables could explain the heterogeneity of the data at hand. Finding such class variables could result in a richer interpretation of the data. In the continuous data setting, a multi-partition model based clustering is proposed. It assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data with respect to some clustering subspace. It allows to simultaneously ﬁnd the multi-partitions and the related subspaces. Parameters of the model are estimated through an EM algorithm relying on a probabilistic reinterpretation of the factorial discriminant analysis. A model choice strategy relying on the BIC criterion is proposed to select to number of subspaces and the number of clusters by subspace. The obtained results are thus several projections of the data, each one conveying its own clustering of the data. Model’s behavior is illustrated on simulated and real data.


Introduction
In exploratory data analysis, the statistician often uses clustering and visualization in order to improve his knowledge on the data. In visualization he looks for some principal components explaining some major characteristics of the data. For example in principal component analysis (PCA) the goal is to find a linear combination of the variables explaining the major variability of the data. In cluster analysis, the goal is to find some clusters explaining the major heterogeneity of the data. In this article we suppose that the data can contain several clustering latent variables, that is, we are in the multiple partition setting, and we are simultaneously looking for clustering subspaces, that is, linear projections of the data each one related to some clustering latent variable thus the developed model is later called multi-partitions subspace clustering. A solution to perform multi-partition subspace clustering is to use a probabilistic model on the data such as a mixture model [1], it allows to perform the parameters estimation, and model selection such as the choice of the number of subspaces and the number of clusters per subspace using standard model choice criteria such as BIC [2]. Thus the main fields related to our work are model based subspace clustering and multi-partitions clustering.
In the model based subspace clustering framework, let first notice that PCA can be re-interpreted in a probabilistic way by considering a parsimonious version of a multivariate Gaussian distribution [3] and that the k-means algorithm can be re-interpreted as a particular parsimonious Gaussian mixture model estimated using a classification EM algorithm [4]. A re-interpretation of the probabilistic PCA has also been used in clustering by Bouveyron et al. [5] in order to cluster high-dimensional data. Although the proposed high dimensional mixture does not performs dimension reduction, it rather operates a class per class dimension reduction which does not allow to have a global model-based data visualization. Thus Bouveyron and Brunet [6] proposed the so called Fisher-EM algorithm which simultaneously performs clustering and dimension reduction. This is performed through a modified version of the EM algorithm [7] by including a Fisher step between the E and the M step. This approach allows the same projection to be applied to all data, but does not guarantee the increasing of the likelihood at each iteration of the algorithm.
In the context of multi-partitions clustering, Galimberti and Soffritti [8] assumed that the variables can be partitioned into several independent blocks, each one following a full-covariance Gaussian mixture model. The model selection was done by maximizing the BIC criterion by a forward/backward approach. Then, Galimberti et al. [9] generalized thier previous work by relaxing the assumption of block independence. The proposed extension takes into account three types of variables, classifying variables, redundant variables and non-classifying variables. In this context, the choice of the model is difficult because several roles have to be taken into account for each variable, which requires a lot of calculations, even for the reallocation of only one variable. Poon et al. [10] also took into account the multi-partition setting, called as facet determination in their article. The model considered is similar to that of Galimberti and Soffritti [8], but it also allows tree dependency between latent class variables, resulting in the Pouch Latent Tree Models (PLTM). Model selection is performed by a greed search to maximize the BIC criterion. The resulting model allows a broad understanding of the data, but the tree structure search makes estimation even more difficult as the number of variables increases. More recently, Marbac and Vandewalle [11] proposed a tractable muti-partition clustering algorithm not limited to continuous data; in the Gaussian setting it can be seen as particular case of Galimberti and Soffritti [8] where they assume a diagonal covariance matrix allowing a particularly efficient search of the partition of the variables in sub-vectors.
In this article we suppose that the data can contain several clustering latent variables, that is we are in the multiple partition setting. But contrary to Marbac and Vandewalle [11] where it is assumed that variables are divided into blocks each one related to some clustering of the data, we are looking for clustering subspaces, i.e., linear projections of the data each one related to some particular clustering latent variable thus replacing the combinatorial question of finding the partition of the variables in independent sub-vectors by the question of finding the coefficients of the linear combinations. The proposed approach can be related to the independent factor analysis [12] where the author deals with source separation, in our framework a source can be interpreted as some specific clustering subspace; however, their approach becomes intractable as the numbers of sources increases and does not allow to consider multivariate subspaces. Moreover, it is not invariant up to a rotation and rescaling of the data, where our proposed methodology is.
The organisation of the paper is the following, in Section 2 we present a reinterpretation of the factorial discriminant analysis as a search of discriminant components and of independent non-discriminant components. In Section 3, the multi-partitions subspace clustering model and the EM algorithm to estimate the parameters of the model will be presented. In Section 4, results on simulated and real data will show the interest of the method in practice. In Section 5, a conclusion and discussion of future extension of the paper will be made.

Linear Discriminant Analysis (LDA)
It is supposed that n quantitative data in dimension d are available, the data number i will be denoted by x i = (x i1 , . . . , x id ) T , where x ij is the value of variable j of data i. The whole dataset will be denoted by x = (x 1 , . . . , x n ) T . Let assume that the data is clustered in K clusters, the class label of data i will be denoted by z i = (z i1 , . . . , z iK ) T , with z ik equals to 1 if data i belongs to cluster k and 0 otherwise. Let also denote by z = (z 1 , . . . , z n ) the partition of x. In this section z is supposed to be known. For sake of simplicity the random variables and their realisations will be denoted in lower case, and p will be used as a generic notation to denote a probability distribution function (p.d.f.) which will be interpreted according to its arguments.
In the context of linear discriminant analysis [13], it is supposed that the distribution x i given the cluster follows a d-variate Gaussian distribution with common covariance matrices: with µ k the vector of means in cluster k and Σ the common class conditional covariance matrix. Let also denote by π k = p(z ik = 1) the prior weights of each cluster.
The posterior cluster membership probabilities can be computed using the Bayes formula: where φ d (·; µ k , Σ) stands for the p.d.f. of the d-variable Gaussian distribution with expectation µ k and covariance matrix Σ. Letx k denote the class conditional mean in cluster k: with n k = ∑ n i=1 z ik the number of data in cluster k, and byx the unconditional mean. Let also denote by W the empirical intra-class covariance matrix: and by B the empirical between class covariance matrix: If the data are supposed to be independent, the likelihood can simply be written as: (π 1 , . . . , π K , µ 1 , . . . , µ K , Σ; x, z) = − n 2 log(det(Σ)) − 1 2 The maximum likelihood estimators of the parameters of π k , µ k and Σ areπ k = n k n ,μ k =x k and Σ = W. A new data point can be then classified by plugin the estimated values of the parameters in Equation (1) the resulting classification boundary being linear in this case.

Factorial Discriminant Analysis (FDA)
Let us note that, from a descriptive viewpoint, one can be interested in dimension reduction in order to visualize the data. This could be done by using PCA, but from a classification perspective the component explaining the largest variability in the data are often not the same that the components providing the best separation between the clusters.
The goal of factorial discriminant analysis (FDA) is to find the component maximizing the variance explained by the cluster above the intra-class variance. The coefficients of the first discriminant It is well known that v 1 is the eigen vector associated with the highest eigen value λ 1 of W −1 B [14]. The remaining discriminant components are obtained through the remaining eigen vectors of W −1 B. Let denote by λ 1 , . . . , λ K−1 the eigen values of W −1 B sorted in decreasing order and by v 1 , . . . , v K−1 the associated eigen vectors. Moreover, if each component is constrained to have an intra-class variance equal to one (i.e., v k T W v k = 1, ∀k ∈ {1, . . . , K − 1}), the classification obtained using the Mahalonobis distance can simply be obtained by using the Euclidean distance on the data projected on the discriminant components.

Equivalence between LDA and FDA
As proved in Campbell [15] and detailed in Trevor Hastie [16], the FDA can be interpreted in a probabilistic way as an LDA where the rank of {µ 1 , . . . , µ K } is constrained to be equal to p with p ≤ K − 1 under the common class covariance matrix assumption. This allows us to reparametrize the probabilistic model in the following way: . Let notice that this new parametrization at this step is not unique but the model can easily be made identifiable by imposing some constraints on the parameters. Let y i ∈ R p and u i ∈ R d−p two random variables, the new parametrization can be reinterpreted in the following generative framework: • Draw z i : z i ∼ M(1; π 1 , . . . , π K ) where M stands for the multinomial distribution • Draw y i |z i : • Compute x i based on y i and u i : Thus the p.d.f. of x i can be factorized in the following way: From a graphical angle the model can be reinterpreted as in Figure 1, where u i and y i are latent random variables. In practice we are interested in finding y i and u i from x i . We will denote by V ∈ M p,d (R) and R ∈ M d−p,d (R) the matrices which allow this computation: for the rest of the article only the parametrization in terms of V and R will be used.
The main interest of this parametrization is that It means that only V x i is required to compute the posterior class membership probabilities: Parameters are estimated by maximum likelihood, using the variable change formulae the likelihood can be written: The first term is related to the variable change, the second term is related to the discriminant components and the third term is related to prior class membership probabilities and the fourth term is related to the non-discriminant components. In practice, this decomposition is the corner stone of the proposed approach since it separates the clustering part from the non-clustering part.
As stated in Campbell [15] the maximum likelihood estimator of V are the first p eigen vectors v 1 , . . . , v p of W −1 B in rows: The maximum likelihood estimator of R is obtained such that : As stated in Trevor Hastie [16] the link between the two parametrizations is as follows: From this formula we can clearly see that the reduced rank constraint operates some regularization on the parameters estimation. Moreover, from a practical perspective the reparametrization Equation (2) is more efficient for computing the posterior class membership probabilities.

Application in the Clustering Setting
As in Trevor Hastie [16], the model can easily be used in the clustering setting using the EM algorithm to maximise the likelihood. Thus W, B,x k are recomputed at each iteration by using their version weighted by their posterior membership probabilities.
The EM algorithm is now presented. The algorithm is first initialized with some starting value of the parameters or of the partition. Then the E step and the M step are iterated until convergence. Let (r) denote the value of the parameters at iteration r, the E and the M steps are the followings: • E step: compute the posterior class membership probabilities.
and γ (r+1) as in the previous section using the As noticed in Trevor Hastie [16], this approach is not equivalent to performing a standard EM algorithm and then performing FDA at the end of the EM algorithm. FDA must be computed at each iteration of the EM algorithm since the posterior membership probabilities are only computed based on the p first clustering projections.
Let us also notice that the M step can be interpreted in a Fisher-M step since FDA is required. In this sense it can be interpreted as a particular version of the Fisher-EM algorithm of Bouveyron and Brunet [6]. Although the homoscedasticity could be seen as particularly constraining, it is the best framework for introducing our model in the next section, since it is easily interpretable and allows for efficient computation owing to the closed form of FDA. This limitation could be easily overcome by using rigorous extensions of the FDA in the heteroscedastic setting as in Kumar and Andreou [17]; however, the computation would be much more intensive, since in this case no closed form formula is available and an iterative algorithm would be required even for finding the best projection.

Presentation of the Model
Let us now suppose that instead of having only one class variable z i for data i, we now have H class variables Let also denote by y h i the variables related to the clustering variable z h i such that: and that we will denote by p • = ∑ H h=1 p h . Let us still denote by u i the non clustering variables Let us also define x i by: Thus, The Figure 2 illustrates the model in the case of H = 2. Let us notice that this model allows us to visualize many clustering viewpoints in a low dimensional space since x i can be summarized by y 1 i , . . ., y H i . For instance, one can suppose that p 1 = . . . = p H = 1. In this case each clustering variable can be visualized on one component. We will denote by θ = (V 1 , . . . , V H , R, γ, ν 1 1 , ..., ν H K H ) the parameters of the model to be estimated.

Discussion about the Model
The Cartesian product of cluster spaces results in ∏ H h=1 K h clusters, which can be very large without needing many parameters. Thus the proposed model can be interpreted as being a very sparse Gaussian mixture model allowing the possibility to deal with a very large number of clusters, the resulting conditional means and covariances matrices are given in the following formulas: Thus, the expectation of x i given in all the clusters is a linear combination of the cluster specific means which can be referred to as a multiple-way MANOVA setting. On the one hand, as a particular homoscedastic Gaussian mixture, our model is more prone to model bias than free homoscedastic Gaussian mixture, and in the case when our model would be well-specified the homoscedastic Gaussian mixture would give a similar clustering for a large sample size (i.e., the same partitions with respect to the partition resulting from the product space of our multi-partitions model). On the other hand, our approach produces a factorised version of the partition space as well as the related clustering subspaces which is not a standard output of clustering methods, and it can deal with a large number of clusters in a sparse way which can be particularly useful for a moderated sample size. In practice, the choice between our model and an other mixture model can simply be performed through the BIC criterion.
In some sense our model can be linked with the mixture of factor analyzers [18]. In mixture of factor analyzers the model is of the type: where A is a low rank matrix. But here we have chosen a model of the type which allows us to deal with the noise in a different way. Actually, our model is invariant up to a bijective linear transformation of the data which is not the case for the mixtures of factor analyzers. On the other hand, our model can only deal with data with moderated dimension with respect to the number of statistical units; it assumes that the sources y i can be recovered from the observed data x i .

Estimation of the Parameters of the Model in the Supervised Setting
The likelihood of the model can be written: The likelihood cannot be maximized directly. However, in the case of H = 1, it reduces to the problem of Section 2. Let notice that if all the parameters are fixed except V h and R, ν h k and γ, the optimisation can be easily performed by constraining V h and R (r) . Thus the likelihood will be optimized by using an alternate optimization algorithm.
where M 1 is the sub-matrix containing the p h first rows of M and M 2 the matrix containing the last Thus, the increase of the likelihood when all the parameters are fixed except V h , R, ν h k and γ becomes: By denoting we have to maximize: Consequently, M and the others parameters can be obtained by applying a simple FDA on . In order to optimise over all the parameters, we can loop over all the clustering dimensions. Thus, in the case of mixed continuous and categorical data, this model can be used to visualize the clustering behavior of the categorical variables with respect to the quantitative ones.

Estimation of the Parameters of the Model in the Clustering Setting
Here our main goal is to consider the clustering setting, that is, when z 1 i , . . . , z H i are unknown. Consequently we will use an EM algorithm to "reconstitute the missing label" in order to maximize the likelihood. Thus the algorithm stays the same as in the supervised setting, except that the data at each iteration are now weighted by t h ik (r+1) instead of z h ik . The algorithm is the following: • Until convergence, for h ∈ {1, . . . , H} iterates the following steps: , R (r+1) , γ (r+1) and ν h k (r+1) based on formulas given in the supervised setting.

Parsimonious Models and Model Choice
The proposed model needs the user to define the number of clustering subspaces H, the number of cluster in each clustering subspace K 1 , . . . , K H , and the dimensionality p 1 , . . . , p H of each subspace. The constraints are that H < d, that p h ≤ K h − 1 and p • = p 1 + · · · + p H < d. It is clear that the number of possible models can become very high. To limit the combinatorial aspect, one can impose K 1 = · · · = K H = K and/or p 1 = · · · = p h = p. In practice the choice of p = 1 enforces to find clustering which could be visualized in one dimension, which can help the practitioner. Moreover, choosing K = 2 is the minimal requirement in order to investigate a clustering structure. However, if possible we recommend to explore the largest possible number of models and choosing the best one with the BIC. Let us define the following parsimonious models and their related number of parameters: where the number of clusters is the same for each subspace where the dimensionalities are the same for each subspaces where the dimensionalities are equals to one for each subspaces where the number of clusters is the same for each subspace and the dimensionalities are the same for each subspace where the number of clusters is the same for each subspace and the dimensionalities are equals to one for each subspaces For a given model m the BIC is computed as: where ν m is the number of parameters of the model detailed above. Thus the model choice consists of choosing the model maximising the BIC. BIC enjoys good theoretical consistency properties, thus providing a guarantee to select the true model as the number of data increases. The ICL criterion [19] could also be used to enforce the choice of well separated clusters, since from a classification perspective BIC is known to over-estimate the number of clusters if model assumption are violated. Let us however notice that in practice the user could be mainly interested by a low value of H, since even H = 2 can provide him with new insights about his data, focusing on finding several clustering view points.

Experiments on Simulated Data
We now present a tutorial example. Let us consider n = 100 data, with H = 2 clustering subspaces each one of dimension one (p 1 = p 2 = 1) and containing each of two clusters (K 1 = K 2 = 2). Let us draw y 1 i , the first clustering variable, according to a mixture of two Gaussian univariate distributions: 5N (4, 1) and draw independently y 2 i , the second clustering variable, according to the same mixture distribution: y 2 i ∼ 0.5N (0, 1) + 0.5N (4, 1), then draw u i the non-classifying components according to a multivariate Gaussian distribution in R 4 , u i ∼ N 4 (0, I 4 ). Thus the proposed model, based on the observed data x, aims at recovering the clustering variables y 1 and y 2 as well as the associated cluster variables z 1 and z 2 . The initial data x are presented on Figure 3, we see that the clustering structure of the data is not apparent from these scatter plots. The underlying clustering variables are presented on Figure 4, where we see the separation of the colors on the first clustering variable Y 1 and separation of the shapes on the second clustering variable Y 2 . Let us notice that such factorization gives a more synthetic view of the clustering than seeing these clusters as four clusters. Using standard dimension reduction techniques such as PCA does not succeed in recovering the clustering subspace see Figure 5. Performing a factorial discriminant analysis considering the four clusters in the supervised setting we get Figure 6, it finds good separation between the clusters, however we do not obtain the factorised interpretation of the clustering as the Cartesian product of two independent clusterings. Finally by performing the estimation of the model parameters in an unsupervised setting based on the data x we get Figure 7. We see thatŶ 1 succeeds in recovering Y 1 up to a sign and thatŶ 2 succeeds in recovering Y 2 .
Moreover, supposing p 1 = p 2 = 1 we can choose K 1 and K 2 according to the BIC criterion. Values of the BIC criterion are presented in Table 1, where we show that the selected model is the true one with K 1 = K 2 = 2. The lower diagonal of the table is not presented for symmetry reasons, and we limited ourselves to K 1 , K 2 ∈ {1, . . . , 5}.

Experiments on Real Data
Let us consider the crabs dataset [20]. It consists of 200 crabs morphological data, each crab has two categorical (cluster) attributes-the species, orange or blue, and the sex, male or female. The dataset is composed of 50 males orange, 50 males blue, 50 females orange, 50 females blue for which 5 numerical attributes have been measured: the frontal lobe size, the rear width, the carapace length, the carapace width and the body depth. We can see the PCA of the data in Figure 8. We see that component two separates males and females well, whereas component three separates orange and blue subspecies. However, we will see that by applying our model we obtained a better separation of the clusters.
Like in the tutorial example we will take p 1 = p 2 = 1 and K 1 , K 2 ∈ {1, . . . , 5}. The resulting BIC tabular is given in Table 2, it suggests the choice of K 1 = 3 and K 2 = 4. The resulting visualization of the clustering variables in given Figure 9. Let us notice that Y 2 is divided in four clusters however, however we only see three since two of them have the same mean. We can see that even if the numbers of clusters do not correspond, the first clustering subspace finds the subspecies, whereas the second clustering subspace finds the sex. We could also look at the solution provided by K 1 = K 2 = 2 on Figure 10, this one has a lower BIC but seems more natural for the problem at hand. We see that the obtained map is in fact quite similar the map obtained Figure 9; however, we notice that from a density approximation point of view we obtain a lower fit. In fact, if we look at the correlations between Y 1 Figure 9 and Y 2 Figure 10 we have a correlation of −0.97, and a similar correlation is obtained between Y 2 Figure 9 and Y 1 Figure 10. Thus, the produced subspace are finally quite similar.    Figure 9. Scatter plots of the clustering subspace on the crabs data for K 1 = 3 and K 2 = 4, 95% isodensity is given for each component resulting of the Cartesian product.  Figure 10. Scatter plots of the clustering subspace on the crabs data for K 1 = K 2 = 2, 95% isodensity is given for each component resulting of the Cartesian product.

Conclusions and Perspectives
We have proposed a model which allows us to combine visualization and clustering with many clustering viewpoints. Moreover, we have shown the possibility of performing model choice by using the BIC criterion. The proposed model can provide new information on the structure present in the data by trying to reinterpret the cluster as a result of the Cartesian product of several clustering variables.
The proposed model is limited to the homoscedatic setting, which could be seen as a limitation; however, from our point of view this is more robust than the heteroscedastic setting, which is known to be jeopardized by the degeneracy issue [21]. However, the extension of our work on the heteroscedastic setting can easily be performed from the modeling point of view; the main issue in this case would be the parameters estimation where an extension of the FDA to the heteroscedastic setting would be needed, as presented in Kumar and Andreou [17]. Another difficult issue is the choice of H, K 1 . . . , K H and p 1 , . . . , p H , which is very combinatorial. Here we have proposed an estimation strategy for all these tuning parameters being fixed, and then performed a selection of the best tuning according to BIC. However, in future work, a model selection strategy to perform the model selection through a modified version of the EM algorithm will also be investigated as in Green [22]; it would thus limit the combinatorial aspect of the global model search through EM-wise local model searches.
Funding: This research received no external funding.