Entropy-based Incomplete Cholesky Decomposition for a Scalable Spectral Clustering Algorithm: Computational Studies and Sensitivity Analysis

Spectral clustering methods allow datasets to be partitioned into clusters by mapping the input datapoints into the space spanned by the eigenvectors of the Laplacian matrix. In this article, we make use of the incomplete Cholesky decomposition (ICD) to construct an approximation of the graph Laplacian and reduce the size of the related eigenvalue problem from N to m, with m ≪ N. In particular, we introduce a new stopping criterion based on the normalized mutual information between consecutive partitions, which terminates the ICD when the change in the cluster assignments falls below a given threshold. Compared with existing ICD-based spectral clustering approaches, the proposed method reduces the number m of selected pivots (i.e., it yields a sparser model) while maintaining high clustering quality. The method scales linearly with the number of input datapoints N and has low memory requirements, because only matrices of size N × m and m × m are computed (in contrast to standard spectral clustering, where the full N × N similarity matrix must be constructed). Furthermore, we show that the number of clusters can be reliably selected using the eigengap heuristic computed on a small matrix R of size m × m instead of the entire graph Laplacian. The effectiveness of the proposed algorithm is tested on several datasets.


Introduction
In this paper, we deal with the data clustering problem. Clustering refers to a technique for partitioning unlabeled data into natural groups, where data points that are related to each other are grouped together and points that are dissimilar are assigned to different groups [1]. In this context, spectral clustering [2][3][4][5] has been shown to be among the most successful methods in many application domains, mainly due to its ability to discover nonlinear clustering boundaries. The algorithm is based on computing the eigendecomposition of a matrix derived from the data, called the Laplacian. The eigenvectors of the Laplacian represent an embedding of the input data which reveals the underlying clustering structure. A major drawback of spectral clustering is its computational and memory cost. If we denote the number of datapoints by N, solving the eigenvalue problem has complexity O(N³), the construction of the Laplacian matrix has cost O(N²), and the Laplacian may not fit into main memory when N is large. A number of algorithms have been devised to make spectral clustering feasible for large-scale applications, including power iteration clustering [6], spectral clustering in conjunction with the Nyström approximation [7], incremental spectral clustering techniques [8][9][10], kernel spectral clustering [11][12][13], parallel spectral clustering [14], consensus spectral clustering [15], vector quantization-based approximate spectral clustering [16], and approximate pairwise clustering [17]. In this article, we introduce a spectral clustering algorithm that exploits the incomplete Cholesky decomposition to reduce the size of the eigenvalue problem. The idea behind the proposed method is similar to [18], but a number of novelties are introduced. First, a new stopping criterion based on normalized mutual information is devised, which allows us to decrease the number m of selected pivots and, hence, the computational complexity. Second, the number of clusters is automatically selected by means of the eigengap heuristic computed on a small similarity matrix of size m × m. Third, a sensitivity analysis shows how to select specific threshold values in order to achieve the desired cluster quality, sparsity, and computational time. The rest of this paper is organized as follows. Section 2 summarizes the spectral clustering method and the incomplete Cholesky decomposition. In Section 3, the proposed algorithm is introduced. Section 4 describes the results of the experiments, Section 5 analyzes the computational cost of the proposed algorithm, and finally Section 6 concludes the article.

Spectral Clustering
Spectral clustering solves a relaxation of the graph partitioning problem. In its most basic formulation, one is provided with an unweighted/weighted graph and is asked to split it into k non-overlapping groups A_1, ..., A_k so as to minimize the cut size, which is the number of edges running between the groups (or the sum of their weights). This idea is formalized via the mincut problem [19], that is, the objective of finding k subgraphs such that a minimal number of edges are cut off and the sum of the weights of these cut edges is minimal. Furthermore, in order to favour balanced clusters, the normalized cut problem can be defined as follows:

min_G trace(G^T L_n G) subject to G^T G = I_k,   (1)

where:
• L_n = D^{-1/2}(D − S)D^{-1/2} is called the normalized Laplacian;
• S is the similarity matrix, which describes the topology of the graph;
• D, with D_ii = Σ_{j=1}^N S_ij, denotes the degree matrix;
• G = [g_1, ..., g_k] is the matrix containing the normalized cluster indicator vectors g_l = D^{1/2} y_l / (y_l^T D y_l)^{1/2};
• y_l, with l = 1, ..., k, is the cluster indicator vector for the l-th cluster. It has a 1 in the entries corresponding to the nodes in the l-th cluster and 0 otherwise. Moreover, the cluster indicator matrix can be defined as Y = [y_1, ..., y_k].

Since this is an NP-hard problem, a good approximate solution can be obtained in polynomial time by allowing G to take continuous values, i.e., G ∈ R^{N×k}. In this case, it can be shown that solving problem (1) is equivalent to finding the solution of the following eigenvalue problem:

L_n g_l = λ_l g_l,  l = 1, ..., k,   (2)

where λ_1, ..., λ_k are the k smallest eigenvalues of the normalized Laplacian L_n, whose corresponding eigenvectors contain the clustering information.
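As a point of reference, the following is a minimal sketch of the direct, O(N³) realization of this relaxation in Python/NumPy. It is our own illustration, not code from the paper: it builds the full normalized Laplacian and returns the spectral embedding (the paper's rounding step uses the pivoted LQ of Section 3 rather than k-means).

```python
import numpy as np

def baseline_spectral_embedding(S, k):
    """Solve problem (2) directly: k smallest eigenpairs of the normalized
    Laplacian L_n = I - D^{-1/2} S D^{-1/2}. Costs O(N^3) time, O(N^2) memory."""
    d = S.sum(axis=1)                          # degrees D_ii = sum_j S_ij
    Dm12 = 1.0 / np.sqrt(d)
    L_n = np.eye(S.shape[0]) - Dm12[:, None] * S * Dm12[None, :]
    lam, G = np.linalg.eigh(L_n)               # eigenvalues in ascending order
    return G[:, :k], lam[:k]                   # embedding and its eigenvalues
```

This is exactly the cost the rest of the paper avoids: both the N × N matrix and the dense eigendecomposition become infeasible for large N.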

Incomplete Cholesky Decomposition
A Cholesky decomposition [20] of a matrix A ∈ R^{N×N} is a factorization of a symmetric positive definite matrix into the product of a lower triangular matrix and its transpose, i.e., A = CC^T, and it is widely used to solve linear systems. The incomplete Cholesky decomposition (ICD) [21] reduces the computational time required by the Cholesky decomposition by computing a low-rank approximation of A with accuracy τ in O(m²N) operations, such that ||A − CC^T||_F < τ, with C ∈ R^{N×m} and m ≪ N. In fact, the ICD selects the rows and columns of A in an appropriate manner, such that the rank of the approximation is close to the rank of the original matrix. In other words, the selected rows and columns, also called pivots, are related to certain data points, and this sparse set of data points is a good representation of the full data set. As discussed in [22], the ICD leads to a small numerical error only when the eigenvalues decay quickly. However, as pointed out in [18], this condition is not always met. Therefore, the ICD stopping criterion based on the low-rank assumption is not optimal.
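To make the procedure concrete, here is a minimal sketch of a pivoted ICD in Python/NumPy (the paper's implementation is in Matlab, so this is our own illustration). It accesses the similarity matrix only through its diagonal and single columns, which is what keeps the memory footprint at O(Nm), and it uses the classic low-rank tolerance on the residual trace as the stopping rule, i.e., the rule that Section 3 replaces. The callback names `kernel_diag` and `kernel_col` are our placeholders.

```python
import numpy as np

def icd(kernel_diag, kernel_col, N, tol=1e-6, max_pivots=None):
    """Pivoted incomplete Cholesky: returns C (N x m) with S ~= C @ C.T.

    kernel_diag() -> diag(S) as a length-N vector; kernel_col(j) -> S[:, j].
    Stops when the residual trace falls below tol (classic low-rank rule)."""
    m_max = max_pivots or N
    d = kernel_diag().astype(float).copy()   # residual diagonal of S - C C^T
    C = np.zeros((N, m_max))
    pivots = []
    for s in range(m_max):
        if d.sum() < tol:                    # low-rank stopping criterion
            break
        p = int(np.argmax(d))                # greedy pivot: largest residual
        pivots.append(p)
        c = (kernel_col(p) - C[:, :s] @ C[p, :s]) / np.sqrt(d[p])
        C[:, s] = c
        d = np.maximum(d - c**2, 0.0)        # update residual diagonal
    return C[:, :len(pivots)], pivots
```

Each iteration touches one column of S and all m previously computed columns of C, which gives the O(m²N) total cost quoted above.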

A Reduced Eigenvalue Problem
As described in [18,22], the ICD technique briefly summarized in the previous section can be used to speed up the solution of the spectral clustering eigenvalue problem (2). In this section, we review the main linear algebra operations that are needed for this purpose. Let us start by considering the following eigenvalue problem:

L̃_n g_l = λ̃_l g_l,   (3)

with L̃_n = D^{-1/2} S D^{-1/2}, whose eigenvalues λ̃_l are related to the eigenvalues of L_n by the relation λ̃_l = 1 − λ_l (the eigenvectors are the same). Therefore, the clustering information is contained in the eigenvectors corresponding to the largest eigenvalues of L̃_n. If we replace the similarity matrix S with its ICD, we obtain L̃_n ≈ D̃^{-1/2} C C^T D̃^{-1/2}, where D̃ denotes the degree matrix computed from the approximated similarities. In order to reduce the size of the eigenvalue problem involving L̃_n, we can replace D̃^{-1/2} C with its QR factorization D̃^{-1/2} C = QR and substitute R with its singular value decomposition R = U_R Σ_R V_R^T to obtain:

L̃_n ≈ Q R R^T Q^T = Q U_R Σ_R² U_R^T Q^T,   (4)

where Q ∈ R^{N×m} has orthonormal columns, R ∈ R^{m×m} is upper triangular, and U_R and Σ_R collect the left singular vectors and singular values of R. Notice that now we have to solve an eigenvalue problem of size m × m involving the matrix RR^T, which can be much smaller than the size N × N of the original problem (3). Furthermore, the eigenvectors of problem (3) can be estimated as ĝ_l = Q u_{R,l}, with related eigenvalues λ̃_l = σ²_{R,l}. Finally, the extraction of the cluster indicator matrix from the top k eigenvectors can be achieved by computing a pivoted LQ decomposition of the eigenvector matrix D^{-1/2} Ĝ, as proposed in [23]:

D^{-1/2} Ĝ = P L Q,   (5)

where Ĝ = [ĝ_1, ..., ĝ_k], P ∈ R^{N×N} is a permutation matrix, L ∈ R^{N×k} is a lower triangular matrix, and Q ∈ R^{k×k} denotes a unitary matrix. In real-world scenarios, the clusters present a certain amount of overlap. Therefore, the matrix Ŷ becomes real-valued and the cluster assignment for point x_i is computed as:

j_i = arg max_{l=1,...,k} Ŷ_il,   (6)

with Ŷ = P L̃, where L̃ = [L_11^T L_21^T]^T L_11^{-1} and L_11 ∈ R^{k×k} is the leading lower triangular block of L (see Algorithm 1).
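The following is a minimal NumPy/SciPy sketch of this pipeline, assuming S ≈ CC^T has been produced by the ICD above. The row-pivoted LQ factorization is realized here through SciPy's column-pivoted QR of the transpose; the helper names and this particular realization are our own assumptions, not the paper's Matlab code.

```python
import numpy as np
from scipy.linalg import qr

def reduced_eigenproblem(C, maxk):
    """Approximate top eigenpairs of ~L_n = ~D^{-1/2} C C^T ~D^{-1/2}
    via an m x m problem (Equations (3)-(4))."""
    d = C @ (C.T @ np.ones(C.shape[0]))          # approximate degrees ~d = C C^T 1_N
    Q, R = np.linalg.qr(C / np.sqrt(d)[:, None])  # ~D^{-1/2} C = Q R
    U_R, sig, _ = np.linalg.svd(R)                # R = U_R Sigma_R V_R^T
    return Q @ U_R[:, :maxk], sig[:maxk]**2, d    # g_hat_l = Q u_l, lam_l = sig_l^2

def assign_clusters(G_hat, d, k):
    """Cluster assignments via pivoted LQ of D^{-1/2} G_hat (Equations (5)-(6)),
    implemented as a column-pivoted QR of the transpose."""
    A = G_hat[:, :k] / np.sqrt(d)[:, None]        # D^{-1/2} G_hat
    Qt, Rt, perm = qr(A.T, pivoting=True)         # A.T[:, perm] = Qt @ Rt
    L = Rt.T                                      # so A[perm] = L @ Qt.T (LQ form)
    Y = np.empty_like(A)
    Y[perm] = L @ np.linalg.inv(L[:k, :k])        # Y_hat = P [L11; L21] L11^{-1}
    return np.argmax(np.abs(Y), axis=1)           # Equation (6)
```

Only the m × m matrix R enters the SVD, so the dominant per-call costs are the thin QR, O(Nm²), and the SVD, O(m³).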

Proposed Algorithm
As explained previously, the classic ICD algorithm is based on the assumption that the spectrum of the Laplacian matrix is characterized by a fast decay. Since this property does not hold in some cases [18], in this article we introduce a novel stopping criterion, which is explained next.

New Stopping Criterion
The new stopping condition only assumes that the cluster assignments tend to converge as pivots are added. In particular, given the cluster assignments j_s at step s and j_{s−1} at step s − 1, with j = [j_1, ..., j_N], we can compute the normalized mutual information (NMI) [24] as follows:

nmi_s = NMI(j_s, j_{s−1}).   (7)
The value nmi_s measures the statistical information shared between the cluster assignments j_s and j_{s−1}, and takes values in the range [0, 1]. It tells us how much knowing one of these clusterings reduces our uncertainty about the other. The higher the NMI, the better the information in j_s helps us predict the cluster memberships in j_{s−1}, and vice versa. In practice, this means that when the cluster assignments in two consecutive iterations are the same (up to the labelling), nmi_s = 1. On this basis, we propose to terminate the ICD algorithm when |nmi_s − 1| < THR_stop, where THR_stop is a user-specified threshold value. Furthermore, to speed up the procedure, we start to check the convergence of the cluster assignments only when the approximation of the similarity matrix is good enough. An approximation of the matrix S implies that the degree of each datapoint is also approximated. Therefore, the ratio r_deg = min(d̃)/max(d̃) can be used to gauge the quality of the approximation [18], where d̃ = CC^T 1_N, and min and max denote the minimum and maximum element of a vector. In particular, the convergence of the cluster assignments begins to be monitored when min(d̃)/max(d̃) > THR_deg. From our experience, THR_stop = THR_deg = 10^{−6} represents a good choice, which prevents the ICD algorithm from terminating too early (with poor clustering performance) but also not too late (by selecting more pivots than needed). In this realm, a sensitivity analysis of the proposed algorithm with respect to different threshold settings is depicted in Figure 1 (the study relates to the Three 2D Gaussians dataset). In Figure 2, the trend of nmi_s as a function of the number of selected pivots is shown for the synthetic datasets analysed in this paper.
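A sketch of this check could look as follows; it uses scikit-learn's NMI implementation, which is an assumption on our part (the paper's Matlab code may compute NMI differently).

```python
from sklearn.metrics import normalized_mutual_info_score

def should_stop(j_prev, j_curr, d_approx, thr_stop=1e-6, thr_deg=1e-6):
    """Stopping rule of Section 3.1: monitor assignment convergence only once
    the approximate degrees ~d = C C^T 1_N are reliable."""
    r_deg = d_approx.min() / d_approx.max()
    if r_deg <= thr_deg:        # similarity approximation not good enough yet
        return False
    nmi_s = normalized_mutual_info_score(j_prev, j_curr)
    return abs(nmi_s - 1.0) < thr_stop
```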

Choosing the Number of Clusters
The number of clusters k present in the data is not known beforehand and must be chosen carefully to ensure meaningful results. To tackle this issue, we exploit the theoretical fact that the multiplicity of the eigenvalue 1 of L̃_n (equivalently, of the eigenvalue 0 of L_n) equals the number of connected components (i.e., clusters) in the graph. In other words, we use the eigengap heuristic [4], which, unlike in standard spectral clustering, is computed in our case using the m × m matrix R (see Equation (4)) rather than the original N × N Laplacian matrix. Furthermore, we consider an eigenvalue to have converged if |λ̃_l − 1| < THR_eig. For simplicity, we set THR_eig = THR_stop = 10^{−6}, which turned out to be a good choice in our experiments. In Figure 3, we show through several examples that a meaningful value for k can be detected using only the information provided by the m selected pivots. However, it should be pointed out that the eigengap heuristic can fail in real situations when, due to some overlap between the clusters, the sharp eigenvalue decay of the ideal case quickly deteriorates. In this case, the Gershgorin circle theorem, which provides upper bounds on the eigenvalues of the Laplacian matrix, can be utilized to determine a meaningful interval for the number of clusters [25]. Another alternative for selecting the number of clusters, although more computationally demanding, is to use an internal cluster validity criterion in conjunction with the current cluster assignments.
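In code, this count of converged eigenvalues is a one-liner on the squared singular values of R returned by the sketch above (the clamping to [1, maxk] is our own defensive choice):

```python
import numpy as np

def select_k(lam, thr_eig=1e-6, maxk=None):
    """Number of approximate eigenvalues of ~L_n that have converged to 1."""
    k = int(np.sum(np.abs(lam - 1.0) < thr_eig))
    return max(1, min(k, maxk or lam.size))
```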
The complete clustering algorithm proposed in this paper, which we call ICD-NMI, is summarized in Algorithm 1. Furthermore, a Matlab implementation can be downloaded from [26].

Algorithm 1: ICD-NMI algorithm
Data: data set D = {x_i}_{i=1}^N, positive semi-definite similarity measure W(x_i, x_r) = S_ir, thresholds THR_stop, THR_deg, and THR_eig, maximum number of clusters to search for maxk, maximum number of pivots (i.e., maximum number of iterations) d_C.
Result: selected number of clusters k, vector of cluster assignments j, matrix of soft cluster memberships F.

/* Initialize variables */ For s = 1, ..., d_C:
• Find the new pivot element r* = arg max_{r ∈ [s,N]} h_r (h denotes the residual diagonal of the current ICD approximation).
• Update the permutation matrix P such that P_ss = P_{r*r*} = 0 and P_{sr*} = P_{r*s} = 1.
• Permute elements s and r* in S̃ as S̃_{1:N,s} ↔ S̃_{1:N,r*} and S̃_{s,1:N} ↔ S̃_{r*,1:N}; update C accordingly.
• Compute d̃ = CC^T 1_N and r_deg = min(d̃)/max(d̃).
• If r_deg > THR_deg then:
  – Compute the QR decomposition D̃^{-1/2} C = QR.
  – Compute the singular value decomposition of R as R = U_R Σ_R V_R^T.
  – Obtain the approximated eigenvectors via Ĝ = Q U_{R,1:maxk}.
  /* Select current number of clusters */
  – Count the eigenvalues approximating 1, i.e., such that |λ̃_l − 1| < THR_eig, and set this number as the current number of clusters k_s.
  /* Check stopping condition */
  – Set Ĝ = Ĝ_{1:k_s}.
  – Compute the LQ factorization with row pivoting D̃^{-1/2} Ĝ = P L Q_Ĝ.
  – Put Ŷ = P L̃, with L̃ = [L_11^T L_21^T]^T L_11^{-1}, where L_11 ∈ R^{k_s × k_s} is lower triangular.
  – Compute the cluster assignment for each point x_i according to Equation (6), where k = k_s.
  – Store the current assignments for the N datapoints in vector j_s.
  – Compute nmi_s according to Equation (7); if |nmi_s − 1| < THR_stop, stop.

Calculate the soft cluster membership matrix F:
• normalize each column of matrix F between 0 and 1;
• normalize each row of matrix F such that Σ_r F_ir = 1.
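Putting the pieces together, the following is a hedged end-to-end sketch of the main loop in Python/NumPy, composed from the helper functions sketched earlier in this section (`reduced_eigenproblem`, `assign_clusters`, `should_stop`, `select_k`). It is our reading of Algorithm 1, not the authors' Matlab implementation: the RBF column is computed on the fly so that no N × N matrix is ever formed, and the soft-membership step is omitted for brevity.

```python
import numpy as np

def icd_nmi(X, sigma, maxk=10, d_C=500,
            thr_stop=1e-6, thr_deg=1e-6, thr_eig=1e-6):
    """Sketch of ICD-NMI: pivoted ICD of the RBF similarity combined with the
    NMI-based stopping rule of Section 3.1."""
    N = X.shape[0]
    res_diag = np.ones(N)                 # RBF kernel: S_ii = 1 (residual diagonal)
    C = np.zeros((N, d_C))
    k_s, j_prev = 1, None
    for s in range(d_C):
        p = int(np.argmax(res_diag))      # greedy pivot selection
        if res_diag[p] <= 1e-12:          # similarity numerically exhausted
            break
        col = np.exp(-np.sum((X - X[p])**2, axis=1) / (2 * sigma**2))  # S[:, p]
        c = (col - C[:, :s] @ C[p, :s]) / np.sqrt(res_diag[p])
        C[:, s] = c
        res_diag = np.maximum(res_diag - c**2, 0.0)
        d_approx = C[:, :s + 1] @ (C[:, :s + 1].T @ np.ones(N))  # ~d = C C^T 1_N
        if d_approx.min() / d_approx.max() <= thr_deg:
            continue                      # approximation not yet reliable
        G_hat, lam, _ = reduced_eigenproblem(C[:, :s + 1], maxk)
        k_s = select_k(lam, thr_eig, maxk)
        j_curr = assign_clusters(G_hat, d_approx, k_s)
        if j_prev is not None and should_stop(j_prev, j_curr, d_approx,
                                              thr_stop, thr_deg):
            return k_s, j_curr, s + 1     # clusters, assignments, pivots used
        j_prev = j_curr
    return k_s, j_prev, d_C               # hit the pivot budget without converging
```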

Experimental Results
In this section, the outcomes of the experiments are presented. Given a dataset {x_i}_{i=1}^N, with x_i ∈ R^d, we start by constructing a graph G = (V, E), where the N data points represent the vertices V = {v_i}_{i=1}^N and the pairwise similarities S_ir give the edge weights, E = {S_ir}_{i,r=1}^N. Throughout this paper, the radial basis function with parameter σ is taken as the similarity measure between two data points x_i and x_r, i.e., S_ir = W(x_i, x_r) = exp(−||x_i − x_r||²/(2σ²)). For simplicity, the parameter σ is chosen based on the Silverman rule of thumb [27] (the issue of tuning the bandwidth parameter is outside the scope of this paper). To understand the behaviour of the proposed algorithm, we have performed simulations on a number of synthetic datasets that are commonly used to benchmark clustering algorithms (the majority of the datasets have been downloaded from http://cs.joensuu.fi/sipu/datasets/). In Figure 3, the eigenvalues λ̃_l estimated after selecting the last pivot are illustrated. We can notice how, in general, the number of eigenvalues that have converged to 1 up to the threshold THR_eig reflects the true number of clusters present in the data. Figure 2 shows the working principle of the stopping criterion introduced in Section 3.1: the value nmi_s converges to 1 (up to the threshold THR_stop) after a certain number of iterations. In Figure 4, the detected clusters are depicted together with the selected pivots. In all the datasets, a meaningful clustering result has been obtained, and the pivot elements represent the clustered structure of the related data distribution well.
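Since the paper does not spell out which multivariate variant of Silverman's rule is used, the following sketch shows one common choice (the averaged per-dimension rule of thumb); treat the exact formula as our assumption.

```python
import numpy as np

def silverman_sigma(X):
    """Silverman-style rule of thumb for the RBF bandwidth (one common
    multivariate variant; the paper's exact choice is not specified)."""
    N, d = X.shape
    return np.mean(X.std(axis=0, ddof=1)) * (4.0 / ((d + 2) * N))**(1.0 / (d + 4))
```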
Table 1 reports a comparison between the proposed method and two other clustering algorithms based on the incomplete Cholesky decomposition [18,22] (for a fair comparison, we report the results of the variant that does not use the L1 regularization). The comparison concerns both the number of selected pivots and the match between the detected clusters and the true groupings, as measured by the Adjusted Rand Index [28]. The results indicate that the proposed algorithm, ICD-NMI, requires fewer pivots, resulting in a sparser spectral clustering method with comparable or higher accuracy. Figure 5 shows the convergence of the cluster assignments in the analysis of a number of large real-life databases. The results confirm the main observation made in the case of the synthetic datasets; that is, the proposed technique requires a limited number of pivots to perform the clustering (in the case of the poker dataset, the method stopped because it reached the maximum number of iterations without converging). Table 2 reports a comparison with the k-means algorithm [1], used as a baseline, and an alternative low-rank method based on the Nyström approximation (the size of the subset had to be kept below 500 points to avoid out-of-memory errors). The cluster quality is measured in terms of the Silhouette index [29] and the Davies-Bouldin (DB) criterion [30], and the computational burden in terms of runtime. The results indicate that the proposed approach, although slower than k-means, in general reaches a higher clustering performance than the alternative approaches.

Computational Complexity and Memory Requirements
In this section, the computational burden of the proposed method is analysed in more detail. Algorithm 1 has two main parts: (i) the computation of the current ICD approximation at each step, which scales linearly with the total number of data points N; and (ii) the steps involved in the calculation of the NMI between the current and previous cluster assignments, all of which also depend linearly on N. This reasoning is supported by Figure 6, where the runtime of the proposed approach is analysed using one of the synthetic datasets mentioned earlier. It can be seen that the ICD-NMI algorithm scales linearly, i.e., it has complexity O(N).
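A quick way to reproduce this kind of scaling check is to time the sketched pipeline on growing subsamples; the data generator below is a hypothetical stand-in for the Three rings dataset of Figure 6, and `icd_nmi` and `silverman_sigma` are the sketches from the previous sections.

```python
import time
import numpy as np

def three_rings(N, seed=0):
    """Hypothetical stand-in for the 'Three rings' dataset."""
    rng = np.random.default_rng(seed)
    r = rng.choice([1.0, 2.0, 3.0], size=N)    # three concentric radii
    t = rng.uniform(0.0, 2.0 * np.pi, size=N)
    X = np.c_[r * np.cos(t), r * np.sin(t)]
    return X + 0.05 * rng.normal(size=(N, 2))  # small radial noise

for N in (1000, 2000, 4000, 8000):
    X = three_rings(N)
    t0 = time.perf_counter()
    icd_nmi(X, sigma=silverman_sigma(X))       # pipeline sketched in Section 3
    print(N, f"{time.perf_counter() - t0:.2f}s")  # expect roughly linear growth
```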
Regarding the memory load, the proposed method has low requirements, because only matrices of size N × m and m × m need to be constructed. Furthermore, as we have shown, the algorithm generally selects a low number of pivots, even in the case of very large datasets, meaning that m ≪ N.

Conclusions
In this paper, we have introduced a new stopping criterion for the incomplete Cholesky decomposition (ICD). The proposed criterion terminates the ICD when the change in the cluster assignments is below a given threshold, as measured by the normalized mutual information between consecutive partitions. This allows the selection of a limited number of pivots compared to existing techniques while achieving good clustering quality, as shown for a number of synthetic and real-world datasets. Furthermore, the number of clusters is selected efficiently based on the eigengap heuristic computed on a small m × m matrix, with m ≪ N. Finally, a sensitivity analysis demonstrated how specific threshold values influence cluster quality, sparsity, and computational burden. Future work may exploit memory mapping to handle bigger datasets that do not fit in memory.


Figure 4. Clustering results. The clusters detected by the proposed approach are shown in different colors; the selected pivots are indicated with red stars. We can notice how the detected partitions are meaningful and the distribution of the pivots is representative of the distribution of the related dataset. (a) Aggregation; (b) Compounds; (c) D31; (d) Flames; (e) Jain; (f) R15; (g) Three 2D Gaussians; (h) Three rings; (i) Two spirals.

Figure 6. Computational complexity. Scalability of the proposed algorithm with the number N of datapoints; the Three rings dataset has been used to perform this analysis. The CPU complexity is O(N), which makes the method suitable for handling large-scale clustering problems. Furthermore, the memory requirements are low compared to standard spectral clustering, because the full N × N similarity matrix is never constructed.

Table 1. Comparison with other incomplete Cholesky decomposition (ICD)-based methods on toy datasets. Different algorithms are contrasted in terms of the number of selected pivots and clustering quality in the case of three synthetic datasets. We can notice how the proposed technique obtains a high clustering performance in terms of the Adjusted Rand Index (ARI) while requiring a small number of pivots. (Boldface numbers indicate the best performance.)

Table 2. Comparison with k-means and the Clustered Nyström method on three large real-life datasets. Performance of different approaches in terms of runtime, Silhouette index, and DB criterion (average values over ten randomizations). (Boldface numbers indicate the best performance.)