Improving Scalable K-Means++

Abstract: Two new initialization methods for K-means clustering are proposed. Both proposals are based on applying a divide-and-conquer approach for the K-means‖ type of initialization strategy. The second proposal also uses multiple lower-dimensional subspaces produced by the random projection method for the initialization. The proposed methods are scalable and can be run in parallel, which makes them suitable for initializing large-scale problems. In the experiments, a comparison of the proposed methods to the K-means++ and K-means‖ methods is conducted using an extensive set of reference and synthetic large-scale datasets. Concerning the latter, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state-of-the-art by improving clustering accuracy and the speed of convergence. We also observe that the currently most popular K-means++ initialization behaves like the random one in the very high-dimensional cases.


Introduction
Clustering is one of the core techniques in data mining. Its purpose is to form groups from data in such a way that the observations within one group, the cluster, are similar to each other and dissimilar to observations in other groups. Prototype-based clustering algorithms, such as the popular K-means [1], are known to be sensitive to initialization [2,3], i.e., to the selection of the initial prototypes. A proper set of initial prototypes can improve the clustering result and decrease the number of iterations needed for the convergence of an algorithm [3,4]. The initialization of K-means was remarkably improved by the work of Arthur and Vassilvitskii [5], where they proposed the K-means++ method. There, the initial prototypes are determined by favoring distinct prototypes, which with high probability are not similar to the already selected ones.
A drawback of K-means++ is that the initialization phase requires K inherently sequential passes over the data, since the selection of a new initial prototype depends on the previously selected prototypes. Bahmani et al. [6] proposed a parallel initialization method called K-means‖ (Scalable K-means++). K-means‖ speeds up initialization by sampling each point independently and by updating the sampling probabilities less frequently. Independent sampling of the points enables parallelization of the initialization, thus providing a speedup over K-means++. However, for example, a MapReduce-based implementation of K-means‖ needs multiple MapReduce jobs for the initialization. The MapReduce K-means++ method [7] tries to address this issue, as it uses one MapReduce job to select the K initial prototypes, which speeds up the initialization compared to K-means‖. Suggestions for parallelizing the second, search phase of K-means have been given in several papers (see, e.g., [8,9]). On a single machine, distance pruning approaches can be used to speed up K-means without affecting the clustering results [10][11][12][13][14]. Besides parallelization and distance pruning, data summarization is also a viable option for speeding up the K-means clustering [15,16].
In this paper, our aim is to improve the quality of the scalable initialization with better or equal running time. The proposed initialization method reduces data processing with sampling, subsampling, and dimensional reduction, solving the K-means clustering problem in a coarse fashion. Moreover, from the perspective of parallel computing, using a parallelizable clustering method in the subset clustering allows fixing the number of subsets and treating each subset locally in parallel, hence improving the scalability.
For quantified testing and comparison of the methods, we introduce a novel clustering problem generator for high-dimensional spaces (see Section 3.4). Currently, challenging simulated datasets for high-dimensional clustering problems are difficult to find. For instance, the experiments with the DIM datasets of tens or hundreds of dimensions in [4] were inconclusive: all clustering results and cluster validation index comparisons behaved perfectly, without any errors. Therefore, better experimental datasets are needed, and they can be produced with the proposed algorithm.

Existing Algorithms
In this section, we introduce the basic composition of the existing algorithms.

K-Means Clustering Problem and the Basic Algorithms
Let X = {x_1, x_2, ..., x_N} be a dataset such that x_i ∈ R^M for all 1 ≤ i ≤ N, where M denotes the dimension, and let C = {c_1, c_2, ..., c_K} be a set of prototypes, where each prototype also belongs to R^M. The goal of the K-means clustering algorithm is to find a partition of X into K disjoint subsets by minimizing the sum-of-squares error (SSE) defined as

SSE(C) = ∑_{i=1}^{N} min_{1≤k≤K} ‖x_i − c_k‖².   (1)

An approximate solution to the minimization problem with (1) is typically computed by using Lloyd's K-means algorithm [1]. Its popularity is based on simplicity and scalability. Even if the cost function in (1) is mathematically nondifferentiable because of the min-operator, it is easy to show that, after the initialization, the K-means type of iterative relocation algorithm converges in finitely many steps [4].
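As an illustration, the cost function (1) can be evaluated directly with NumPy. This is a minimal sketch; the function name `sse` is ours and not part of the reference implementation:

```python
import numpy as np

def sse(X, C):
    """Sum-of-squares error of formula (1): each point contributes its
    squared Euclidean distance to the nearest prototype in C."""
    # Pairwise squared distances, shape (N, K).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # Minimum over prototypes, summed over all points.
    return d2.min(axis=1).sum()
```

For small K and N this direct broadcasting form is sufficient; for large-scale data, chunked or parallel distance computations would be used instead.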
Prototype-based clustering algorithms, such as K-means, are initialized before the prototype relocation (search) phase. The classical initialization algorithm, proposed already in [33], is to randomly generate the initial set of prototypes. A slight refinement of this strategy is to select, instead of random points (from appropriate value ranges), random indices and use the corresponding observations in the data as the initialization [34]. Because of this choice, there cannot be empty clusters in the first iteration. Bradley and Fayyad [35] proposed an initialization method where J randomly selected subsets of the data are first clustered with K-means. Next, a superset of the J × K prototypes obtained from the subset clusterings is formed. Finally, the initial prototypes are obtained as the result of K-means clustering of this superset.
Arthur and Vassilvitskii [5] introduced the K-means++ algorithm, which improves the initialization of K-means clustering. The algorithm selects the first prototype at random, and then the remaining K − 1 prototypes are sampled using probabilities based on the squared distances to the already selected set, thus favoring distant prototypes. The generalized form of such an algorithm, with different l_p-distance functions and the corresponding cluster location estimates, was presented in [4].
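The K-means++ seeding described above can be sketched as follows; this is an illustrative serial version (the function name is ours), showing the K sequential passes that motivate the parallel variants discussed next:

```python
import numpy as np

def kmeanspp_init(X, K, rng):
    """K-means++ seeding: first prototype uniformly at random, the rest
    sampled with probability proportional to the squared distance to the
    already selected set (D^2 sampling)."""
    N = X.shape[0]
    C = [X[rng.integers(N)]]
    for _ in range(K - 1):
        # Squared distance of every point to its nearest chosen prototype.
        d2 = ((X[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next prototype proportionally to d2.
        C.append(X[rng.choice(N, p=d2 / d2.sum())])
    return np.array(C)
```

Note that each of the K − 1 sampling steps needs the distances to the prototypes chosen so far, which is exactly the sequential dependency that K-means‖ relaxes.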
The parallelized K-means++ method, called K-means‖, was proposed by Bahmani et al. [6] (see Algorithm 1). In Algorithm 1, and from here onwards, the symbol "#" denotes 'number of'. In the algorithm, sampling from X is conducted in a slightly different fashion compared to K-means++. More precisely, the sampling probabilities are multiplied by the over-sampling factor l and the sampling is done independently for each data point. The initial SSE ψ for the first sampled point determines the number of sampling iterations: K-means‖ runs O(log(ψ)) sampling iterations. For each iteration, the expected number of sampled points is l. Hence, after O(log(ψ)) iterations, the expected number of points added to C is O(l log(ψ)). Finally, weights representing the accumulation of data around the sampled points are set, and the result of the weighted clustering then provides the K initial prototypes. K-means++ can be used to cluster the weighted data (see Algorithm 1 in [36]). Selecting r = 5 rounds instead of O(log(ψ)) and setting the over-sampling factor to 2K were demonstrated to be sufficient in [6]. Recently, Bachem et al. [36] proved theoretically that a small r instead of O(log(ψ)) iterations is sufficient in K-means‖. A modification of K-means‖ for initializing robust clustering was described and tested in [37].

Algorithm 1: K-means‖
Input: Dataset X, #clusters K, and over-sampling factor l.
1: C ← select point c_1 uniformly at random from X.
2: ψ ← SSE of X with respect to C.
3: for O(log(ψ)) rounds do
4: C′ ← sample each point x ∈ X independently with probability p_x = l · d²(x, C)/SSE(X, C).
5: C ← C ∪ C′
6: For each x in C attach a weight defined as the number of points in X closer to x than to any other point in C.
7: Do a weighted clustering of C into K clusters.
Output: K initial prototypes.
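The over-sampling phase of Algorithm 1 can be sketched serially in Python as follows. This is an illustrative version with our own helper names: it uses a fixed number of rounds r (as suggested in [6,36]) instead of O(log(ψ)), clips the sampling probability at 1, and omits the final weighted clustering into K clusters, which would follow on the returned candidates and weights:

```python
import numpy as np

def _d2_to_set(X, C):
    """Squared Euclidean distance from each point to its nearest prototype."""
    C = np.asarray(C)
    return np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)

def kmeans_par_oversample(X, l, r, rng):
    """Serial sketch of the K-means|| over-sampling phase: r rounds in which
    each point is kept independently with probability l * d2(x, C) / SSE."""
    C = [X[rng.integers(len(X))]]
    for _ in range(r):
        d2 = _d2_to_set(X, C)
        cost = d2.sum()
        if cost == 0.0:
            break
        # Independent Bernoulli sampling of every point in parallel.
        keep = rng.random(len(X)) < np.minimum(1.0, l * d2 / cost)
        C.extend(X[keep])
    C = np.asarray(C)
    # Weight each candidate by the number of points closest to it
    # (Step 6); the weighted clustering of C into K clusters follows.
    nearest = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
    return C, np.bincount(nearest, minlength=len(C))
```

Because each point is sampled independently, the per-round work distributes naturally over workers, which is the source of the parallel speedup over K-means++.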

Random Projection
The background for RP [22] comes from the Johnson-Lindenstrauss lemma [38]. The lemma states that points in a high-dimensional space can be projected to a lower-dimensional space while approximately preserving the pairwise distances of the points, when the projection is done with a matrix whose elements are randomly generated. Hence, for an N × M dataset X, let R be a random M × P matrix. Then, the random projected data matrix X̃ is given by X̃ = (1/√P) X R. The random matrix R consists of independent random elements r_ij which can be drawn from one of the following probability distributions [22]: r_ij = +1 with probability 1/2, or −1 with probability 1/2; or r_ij = +1 with probability 1/6, 0 with probability 2/3, or −1 with probability 1/6.
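The projection X̃ = (1/√P) X R with either of the two element distributions above can be sketched in a few lines (the function name and the `sparse` flag are ours):

```python
import numpy as np

def random_project(X, P, rng, sparse=False):
    """Project N x M data to N x P with a random matrix R scaled by
    1/sqrt(P), using one of the two element distributions from [22]."""
    M = X.shape[1]
    if sparse:
        # Entries +1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6.
        R = rng.choice([1.0, 0.0, -1.0], size=(M, P), p=[1 / 6, 2 / 3, 1 / 6])
    else:
        # Entries +1 or -1, each with probability 1/2.
        R = rng.choice([1.0, -1.0], size=(M, P))
    return X @ R / np.sqrt(P)
```

The sparse variant zeroes two thirds of R, so the matrix product touches fewer entries, which is useful when M is very large.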

New Algorithms
Next we introduce the novel initialization algorithms for K-means, their parallel implementations, and the novel dataset generator algorithm.

SK-Means‖
The first new initialization method for K-means clustering, Subset K-means‖ (SK-means‖), is described in Algorithm 2. The method is based on S randomly sampled nondisjoint subsets {X_1, X_2, ..., X_S} of X of approximately equal size, such that X = ∪_{i=1}^S X_i. First, K-means‖ is applied to each subset, which gives the corresponding set of initial prototypes C_i. Next, each initial prototype set C_i in X_i is refined with T_init Lloyd's iterations. T_init is assumed to be significantly smaller than the number of Lloyd's iterations needed for convergence. Then, SSE is computed locally for each C_i in X_i. Differently from the earlier work [30], this locally computed SSE is now used as the selection criterion for the initial prototypes instead of the global SSE. Computation of SSE for X_i in Step 3 is obviously much faster than computing it for the whole X. However, a drawback is that if the subsets are too small to characterize the whole data, the selection of the initial prototypes might fail. Therefore, S should be selected such that the subsets are sufficiently large. For example, if S is close to the number of samples in the smallest cluster, then this cluster will appear as an anomaly for most of the subsets. On the other hand, this property can also be beneficial for excluding anomalous clusters already in the initialization phase. Currently, there are no systematic comparisons in the literature on the size of the subsets in sampling-based clustering approaches [39]. In [35], the number of subsets was set to 10. Based on [30,35], this selection appears reasonable for sampling-based clustering initialization approaches. The convergence rate of K-means is fast and the most significant improvements in the clustering error are achieved during the first few iterations [40,41]. Therefore, for initialization purposes, T_init can be restricted, e.g., to 5 iterations.
Moreover, since the number of Lloyd's iterations needed for convergence might vary significantly (e.g., [4]), a restriction on the number of Lloyd's iterations helps in synchronization, when a parallel implementation of the SK-means method is used.
The computational complexity of the K-means‖ method is of the order O(rlNM), where r is the number of initialization rounds. Therefore, SK-means‖ also has a complexity of the order O(rlNM) for the subset clusterings, since the S subsets together contain approximately N points; the T_init Lloyd's refinement iterations and the local SSE computations add O(T_init KNM) operations.
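The SK-means‖ idea can be sketched as follows. This is an illustrative serial version with our own helper names and two simplifications: a plain K-means++ seeding stands in for K-means‖ inside each subset, and the subsets are drawn disjoint (the paper samples nondisjoint subsets):

```python
import numpy as np

def _d2(X, C):
    """Pairwise squared Euclidean distances, shape (N, K)."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def _kmeanspp(X, K, rng):
    """Simple K-means++ seeding (stand-in for K-means|| on a subset)."""
    C = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = _d2(X, np.array(C)).min(axis=1)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(C)

def _lloyd(X, C, iters):
    """A fixed number of Lloyd's relocation iterations."""
    C = C.copy()
    for _ in range(iters):
        a = _d2(X, C).argmin(axis=1)
        for k in range(len(C)):
            if np.any(a == k):
                C[k] = X[a == k].mean(axis=0)
    return C

def sk_means_init(X, K, S, T_init, rng):
    """Sketch of SK-means||: cluster S subsets, refine each with T_init
    Lloyd's iterations, keep the prototypes with the smallest local SSE."""
    subsets = np.array_split(rng.permutation(X), S)
    best, best_sse = None, np.inf
    for Xi in subsets:
        Ci = _lloyd(Xi, _kmeanspp(Xi, K, rng), T_init)
        local_sse = _d2(Xi, Ci).min(axis=1).sum()
        if local_sse < best_sse:
            best, best_sse = Ci, local_sse
    return best
```

Since each subset is processed independently, the loop over subsets is exactly the part that the parallel implementation in Section 3.3 distributes over groups of workers.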

SRPK-Means‖
The second novel proposal, Subset Random Projection K-means‖ (SRPK-means‖), adds RPs to SK-means‖. Since SK-means‖ mainly spends time computing distances in Steps 1 and 2, it is reasonable to speed up the distance computation with RP. The RP-based method is presented in Algorithm 3. Generally, SRPK-means‖ computes a set of candidate initial prototypes in a lower-dimensional space and then evaluates these in the original space. As with Algorithm 2, the best set of prototypes based on the local SSE is selected.
Algorithm 3: SRPK-means‖
Input: Dataset X, #clusters K, #subsets S, projection dimension P, and #refinement iterations T_init.
1: R_i ← generate a random M × P matrix for each subset X_i.
2: X̃_i ← for each X_i compute the random projection (1/√P) X_i R_i.
3: C̃_i ← for each X̃_i run K-means‖.
4: I_i ← for each X̃_i run T_init Lloyd's iterations initialized with C̃_i.
5: For each partitioning I_i compute the prototypes C_i in the original space X_i.
6: Compute the local SSE for each C_i in X_i.
7: C ← select the prototypes corresponding to the smallest local SSE.
The proposal first computes a unique random matrix for each subset X_i. Then, the P-dimensional random projected subset X̃_i is computed for each subset X_i. Steps 3-4 are otherwise the same as Steps 1-2 in Algorithm 2, but these steps are applied to the lower-dimensional subsets {X̃_1, X̃_2, ..., X̃_S}. Next, the labels I_i for the partitioning of each subset are used to compute C_i in the original space X_i. Finally, the local SSEs are computed, and the best set of prototypes is returned as the initial prototypes. Please note that the last two steps in Algorithm 3 are the same as Steps 3-4 in Algorithm 2. SRPK-means‖ computes the projected data matrices, which requires O(PNM) operations (naive multiplication) [17]. Execution of K-means‖ in the lower-dimensional space requires O(rlNP) operations, and the T_init Lloyd's iterations require O(T_init KNP) operations.
Step 6 requires O(KNM) operations, since it computes the local SSEs in the original space, so that the total computational complexity of the SRPK-means‖ method is O(PNM + rlNP + T_init KNP + KNM). Typically, applications of RP are based on the assumption P ≪ M. Thus, when the dimension M of the data is increased, the contribution of the second and the third term of the total computational complexity starts to diminish. Moreover, when both M and K are large compared to P, the last term dominates the overall computational complexity. Therefore, in terms of running time, SRPK-means‖ is especially suited for clustering large-scale data with very high dimensionality into a large number of clusters.
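Steps 1-7 of Algorithm 3 can be condensed into the following serial sketch. As before, this is illustrative only: the names are ours, a simple K-means++ seeding stands in for the K-means‖ subset clustering, and the subsets are drawn disjoint for brevity:

```python
import numpy as np

def _d2(X, C):
    """Pairwise squared Euclidean distances, shape (N, K)."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def srpk_means_init(X, K, S, P, T_init, rng):
    """Sketch of SRPK-means||: cluster each subset in a P-dimensional
    random projection, map the resulting partition back to the original
    space, and keep the prototypes with the smallest local SSE."""
    subsets = np.array_split(rng.permutation(X), S)
    best, best_sse = None, np.inf
    for Xi in subsets:
        # Steps 1-2: unique random matrix and projected subset.
        R = rng.choice([1.0, -1.0], size=(Xi.shape[1], P))
        Zi = Xi @ R / np.sqrt(P)
        # Step 3: seeding in the low-dimensional space.
        C = [Zi[rng.integers(len(Zi))]]
        for _ in range(K - 1):
            d2 = _d2(Zi, np.array(C)).min(axis=1)
            C.append(Zi[rng.choice(len(Zi), p=d2 / d2.sum())])
        C = np.array(C)
        # Step 4: T_init Lloyd's iterations in the low-dimensional space.
        for _ in range(T_init):
            lab = _d2(Zi, C).argmin(axis=1)
            for k in range(K):
                if np.any(lab == k):
                    C[k] = Zi[lab == k].mean(axis=0)
        # Steps 5-6: prototypes and local SSE in the original space.
        lab = _d2(Zi, C).argmin(axis=1)
        Corig = np.array([Xi[lab == k].mean(axis=0) if np.any(lab == k)
                          else Xi[rng.integers(len(Xi))] for k in range(K)])
        local_sse = _d2(Xi, Corig).min(axis=1).sum()
        # Step 7: keep the best candidate set.
        if local_sse < best_sse:
            best, best_sse = Corig, local_sse
    return best
```

Note that all distance-heavy work happens in P dimensions; only the final prototype averaging and SSE evaluation touch the original M-dimensional data, matching the complexity analysis above.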
Fern and Brodley [18] noted that clustering with RP produces highly unstable and diverse clustering results. However, this can be exploited in clustering to find different candidate structures of the data, which can then be combined into a single result [18]. The initialization method proposed in this paper uses a similar idea, as it tries to find structures from multiple lower-dimensional spaces that minimize the local SSE. In addition, selecting the result that gives the smallest local SSE excludes bad structures, which could be caused by an inappropriate R_i or C̃_i.

Parallel Implementation of the Proposed K-Means Initialization Algorithms
Bahmani et al. [6] implemented K-means‖ with the MapReduce programming model. It can also be implemented with the Single Program Multiple Data (SPMD) programming model with message passing. Then, all the steps of the parallelized Algorithms 1-3 are executed inside an SPMD block. Next, a parallel implementation of K-means‖ as depicted in Algorithm 1 is briefly described, using the Matlab Parallel Computing Toolbox (PCT), SPMD blocks, and message passing functions (see [42] for a detailed description of PCT). First, the data X is split into Q subsets of approximately equal size and the subsets are distributed to Q workers. Step 1 picks a random point from a random worker and broadcasts this point to all other workers. In Step 2, each worker calculates the distances and the SSE for its local data. Next, the points are aggregated by calling the gplus-function, after which the aggregation distributes this sum to the other workers. In Steps 4 and 5, each worker samples points from its local data, the new points are aggregated to C by calling the gop-function, and then C is broadcasted to all workers. Again, distances and SSE are calculated similarly as in Step 2. In Step 6, each worker assigns weights based on its local data, after which the weights are aggregated with the gop-function. Finally, Step 7 is computed sequentially.
As with the parallel K-means‖ implementation, a parallel implementation of Algorithm 2 with SPMD and message passing is described next. First, each subset X_i of the S subsets is split into J approximately equal-sized parts and these parts are distributed to J × S workers, e.g., subset X_i is distributed to workers (i − 1)J + 1, ..., (i − 1)J + J. In Steps 1-3, each group of workers runs the steps for its subset X_i in parallel, similarly as described in the previous paragraph. For the parallel Lloyd's iterations, a strategy similar to the one proposed in [8] can be used in Step 2. Steps 1-3 require calling the gop-function and the gplus-function for a group of workers; these functions were modified to support this requirement. Finally, the prototypes corresponding to the smallest local SSE from the workers (i − 1)J + 1, ..., (i − 1)J + J allocated to subset i are returned as the initialization.
The parallel SRPK-means‖ in Algorithm 3 can be implemented in a highly similar fashion to the parallel SK-means‖. More precisely, in Step 1, each worker (i − 1)J + 1, where i ∈ {1, 2, ..., S}, generates the random matrix R_i and broadcasts it to the workers (i − 1)J + 1, ..., (i − 1)J + J. In Step 2, each worker computes the random projected data for its local data. Steps 3-4 are otherwise computed similarly to the parallel SK-means‖ Steps 1 and 2, except that these steps are executed for the projected subsets. In Step 5, the prototypes are computed in the original space in parallel. Finally, Steps 6 and 7 are the same as Steps 3 and 4 in Algorithm 2. The parallel implementations of the proposed methods and K-means‖ are available at https://github.com/jookriha/Scalable-K-means.

M-Spheres Dataset Generator
The novel dataset generator is given in Algorithm 4. Based on the method proposed in [43] (p. 586), Algorithm 5 generates a random point that lies on the M-dimensional sphere centered at c with radius d. The core principle is to draw M independent values from the standard normal distribution and transform these with the corresponding M direction cosines. The obtained M-dimensional vector is then scaled with the radius d and relocated with the center c. The generated points are uniformly distributed on the surface of the sphere because of the known properties of the standard normal distribution (see [43] (p. 587) and the articles therein).

Algorithm 4: M-spheres dataset generator
Input: #clusters K, #dimensions M, #points per cluster N_K, nearest center distance d_c, radius of M-sphere d_r.

The generator uses Algorithm 5 for generating K cluster centers so that ‖c_i − c_j‖ = d_c for i ≠ j, where d_c is the given distance between the centers and both c_i and c_j belong to the set of centers C. Finally, N_K data points for each cluster are generated by applying Algorithm 5 with a uniformly random radius from the interval (0, d_r]. This means, in particular, that the points in a cluster are nonuniformly distributed and approximately 100 · (a/d_r)% of the points are within an M-dimensional sphere with radius a, for 0 ≤ a ≤ d_r. In Algorithm 5, N(0, 1) denotes the standard normal distribution. For simplicity, generation of the cluster centers starts from the origin. Since the new centers are randomly located at the fixed distance and the clusterwise data are then expanded in R^M, the generator algorithm does not restrict the actual values of the generated data and centers. Hence, depending on the input, the data range can be large. However, this can be alleviated with min-max scaling as part of the clustering process.
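The core sampling step of Algorithm 5 and the clusterwise point generation can be sketched as follows (function names are ours; `rng.uniform` draws from [0, d_r), which differs from the interval (0, d_r] only on a set of probability zero):

```python
import numpy as np

def point_on_sphere(c, d, rng):
    """Uniform random point on the M-sphere of radius d centered at c:
    draw M standard normals, normalize them to direction cosines, then
    scale by the radius and relocate by the center (Algorithm 5)."""
    x = rng.standard_normal(len(c))
    return c + d * x / np.linalg.norm(x)

def m_spheres_cluster(c, d_r, n, rng):
    """n points of one cluster: each lies on a sphere around c whose
    radius is drawn uniformly at random from (0, d_r]."""
    return np.array([point_on_sphere(c, rng.uniform(0.0, d_r), rng)
                     for _ in range(n)])
```

The uniform radius makes the within-cluster point density nonuniform, as described above: the expected fraction of points inside radius a is a/d_r.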

Empirical Evaluation of Proposed Algorithms
In this section, an empirical comparison between K-means++, K-means‖, SK-means‖, and SRPK-means‖ is presented using 21 datasets. In Section 4.1, the results are given for 15 reference datasets. The performance of the methods was evaluated by analyzing the SSE, the number of iterations needed for convergence, and the running time. Finally, in Section 4.2, we analyze the final clustering accuracy for six novel synthetic datasets that highlight the effects of the curse of dimensionality on the K-means++ type of initialization strategies. The simulated clustering problems were formed with the novel generator described in Algorithm 4. The MATLAB implementation of the algorithm is available at https://github.com/jookriha/M_Spheres_Dataset_Generator.

Experiments with Reference Datasets
In this section, the results are shown and analyzed for 15 publicly available reference datasets by considering separately the accuracy (Section 4.1.2), efficiency (Section 4.1.3), and scalability (Section 4.1.4) of the algorithms.

Experimental Setup
Basic information about the datasets is shown in Table 1. The parallel implementations of the proposed methods and K-means‖ (omitting K-means++, readily tested in [6]) were applied to the seven largest datasets, and serial implementations were used otherwise. The BIR dataset [44] was selected to test SK-means‖ for low-dimensional data. With the OXB dataset, we used the transformed dataset with 128-dimensional SIFT descriptors extracted from the original dataset. For the TIN dataset, we sampled a 20% subset from the Gist binary file (tinygist80million.bin), where 79,302,017 images are characterized with 384-dimensional Gist descriptors. The highest-dimensional dataset was SVH, where we combined the training, testing, and validation subsets into a single dataset. We excluded the attack type feature (class label) from the KDD dataset, used the Twitter data for the BSM dataset, and restricted to the training part of the HAR dataset. For the RCV dataset, we used the full industries test set (350 categories) and selected 1000 out of 47,236 features with the same procedure as in [45]. For the M8M and the RCV datasets, we used the scaled datasets given at http://cs.joensuu.fi/sipu/datasets/; all other datasets were min-max scaled into [−1, 1]. In Section 3.1, we discussed the selection of the parameter S. Because of the results in [30,35], and because computing nodes typically have a number of cores that is a power of two, we fixed S = 8 in the experiments. This means that each dataset was randomly divided into 8 subsets of roughly equal size. The Matlab 2018a environment was used in the serial experiments with the K-means++, K-means‖, SK-means‖, and SRPK-means‖ methods. In the Matlab 2018a version, we were not able to modify the gop-function and the gplus-function (see Section 3.3); in the Matlab R2014a environment, these modifications were possible for the proposed methods.
To follow good practices of running time evaluation [46], all the parallel experiments with the K-means‖, SK-means‖, and SRPK-means‖ methods were run in the Matlab R2014a environment. The parallel algorithms were implemented with the Matlab Parallel Computing Toolbox using SPMD blocks and message passing functions, as discussed in Section 3.3. The parallel experiments were run in a computer cluster using Sandy Bridge nodes with 16 cores and 256 GB of memory. A parallel pool of 32 workers was used in the experiments; therefore, 4 workers were allocated for each subset. In the parallel experiments, each worker had a random disjoint 1/4 partition of the subset in its local workspace.
For all datasets we used the following settings: (i) for K-means‖: l = 2K and r = 5; (ii) for SK-means‖ and SRPK-means‖: T_init = 5 and S = 8; and (iii) for SRPK-means‖: P ∈ {5, 10, 20, 40} and R with r_ij = ±1. Please note that the occurrence of an empty cluster is possible for SRPK-means‖ in the rare case when all S subsets produce an empty cluster.
(For instance, empty clusters appeared seven times in all the experiments with the synthetic datasets, as reported in Section 4.2.) In these cases, we repeated the whole clustering initialization from the start. After the initialization, Lloyd's algorithm was executed until the number of new assignments between consecutive iterations was below or equal to a threshold. For the five largest datasets (SVH, RCV, OXB, M8M, and TIN) we set this threshold to 1% of N, and otherwise to zero. In the parallel experiments, the runs were repeated 10 times for each setting. In the serial experiments, the runs were repeated 100 times for each setting. The values for the number of clusters, K, are given in the last column of Table 1. Since the MNIST8M dataset is constructed artificially from the original MNIST dataset, we set K for MNIST8M based on the optimal value for MNIST used by Gallego et al. [47]. Otherwise, the selection is either based on the known number of classes or fixed arbitrarily (indicated with * in Table 1).
The quality of the clustering results between the methods was compared using SSE. The SSE values were computed with formula (1) for the whole data. Finally, a statistical comparison between the methods was performed with the nonparametric Kruskal-Wallis test [48,49], since in most cases the clustering errors were not normally distributed. The significance level was set to 0.05.

SSE values after the initialization (initial SSE) and after the Lloyd's iterations (final SSE) are summarized in Table 2. For the initial SSE, we did not include any results from the statistical testing because of differences in the variances. Moreover, note that the assumption of equal variances of the final SSEs underlying the Kruskal-Wallis test, as tested with the Brown-Forsythe test, was only satisfied for SVH, RCV, USC, KDD, OXB, and M8M. For most datasets (HAR, ISO, LET, GFE, MNI, BIR, BSM, FCT, and TIN), this assumption was not satisfied.

Clearly, SK-means‖ and SRPK-means‖ outperform K-means‖ and K-means++ in terms of the initial SSE. SRPK-means‖ with P = 40 reaches almost the same initialization SSE level as SK-means‖. For the six largest datasets, SRPK-means‖ with P = 20 always had a smaller max-value of SSE after initialization than the min-value of K-means‖. K-means++ has about two times larger initial SSE than SK-means‖ and SRPK-means‖ (P = 40).
For all datasets other than SVH and TIN, the final clustering accuracy (SSE) was statistically significantly different between the four methods. Overall, in terms of the final clustering error, SRPK-means‖ achieved a better final SSE than K-means‖ and K-means++. Moreover, one can note that for 11 out of 15 datasets, the random projection-based initialization was better than the main baseline K-means‖. In many cases, SK-means‖ gives a better final SSE than the baseline methods, but for the high-dimensional datasets, the results are equally good compared to the baseline methods. One can notice, based on the statistical testing, that the final SSE is highly similar for K-means++ and K-means‖. Please note that in Table 2 the min-values of all methods for the final SSE are equal for a small number of clusters (K ≤ 10). This is probably due to the fact that the smaller number of possible partitions [50] implies a smaller number of local minima compared to higher values of K.

Table 2. Clustering accuracy using SSE. The statistically significant differences of the final SSE according to the Kruskal-Wallis test are indicated with ** in the first column. The symbols +, *, ‡, and †_P indicate that the method has a statistically significantly better SSE in a pairwise comparison with respect to K-means++ (K++), K-means‖ (K‖), SK-means‖ (SK‖), and SRPK-means‖ (SRPK‖) for P = P, respectively. The coefficient under the name of the data in the first column is the data-specific multiplier which scales the SSE to the true level. The best performances are highlighted in bold.

Results for Running Time and Convergence
The running time of the initialization (median of 10 runs) for the parallel experiments is shown in Table 3. The running time of the initialization taken by K-means‖ is around 60-80% of the running time of SK-means‖. SRPK-means‖ runs clearly faster than SK-means‖ for the datasets with dimensionality above 100, and for the four highest-dimensional datasets, SRPK-means‖ runs clearly faster than K-means‖. Please note that the differences between P = 5 and P = 40 are small for SRPK-means‖. The median numbers of Lloyd's iterations needed for convergence after the initialization phase are summarized in Table 4, where the statistically significant differences are denoted similarly as in Table 2. The assumption of equal variances was satisfied for all datasets except FCT. In general, SK-means‖ seems to require a smaller number of Lloyd's iterations than K-means++ and K-means‖, which directly translates to a faster running time of the K-means search. Based on the statistical testing, SRPK-means‖ is better than or equal to the baseline methods in terms of the number of iterations. Therefore, SRPK-means‖ can also speed up the search phase of the K-means clustering method. Increasing the RP dimension from 5 to 40 further improved the speed of convergence for SRPK-means‖. Out of the parameter values used in the experiments, selecting P = 40 gives the best tradeoff between the running time and the clustering accuracy for SRPK-means‖. Furthermore, note that there is no statistical difference between K-means++ and K-means‖ with respect to the number of Lloyd's iterations.

Table 4. The number of iterations needed for convergence. The statistically significant differences according to the Kruskal-Wallis test are indicated with ** after the dataset acronym. The symbols +, *, ‡, and †_P indicate that the method has statistically significantly faster convergence in a pairwise comparison with respect to K-means++ (K++), K-means‖ (K‖), SK-means‖ (SK‖), and SRPK-means‖ (SRPK‖) for P = P, respectively.
Median values are on a gray background. The corresponding median absolute deviation values are below the median values. The best performances are highlighted in bold.

Results for Scalability
We conducted scalability tests for TIN and SVH to show how the running time varies as a function of #processing elements (Matlab workers) and to demonstrate the benefits of using SRPK-means‖ for a very high-dimensional dataset (SVH) when K is increased.
We concentrated on the running time of the initialization and the corresponding SSE. We performed scalability experiments in two parts: (1) Tests with TIN: #processing elements was varied from 8 to 64 and K = 100 was fixed; (2) Tests with SVH: The number of clusters was varied as K ∈ {100, 200, 400, 800} and #processing elements was fixed to 32. Otherwise, we used the same parameter settings as in the previous experiments.
The median running time and SSE curves out of 10 runs are shown in Figure 1. The results for experiment 1 are shown in Figure 1a. In terms of Amdahl's law [51], K-means‖ and SK-means‖ perform equally well: the running time is nearly halved when #processing elements is doubled from 8 to 16 and from 16 to 32. From this perspective, the performance of SRPK-means‖ is slightly worse than that of K-means‖ and SK-means‖. The results for experiment 2 are shown in Figure 1b,c. Clearly, for very high-dimensional data, SRPK-means‖ runs much faster compared to K-means‖ and SK-means‖. As analyzed in Section 3.2, the speedup for SRPK-means‖ increases when K is increased. A similar observation was made between K-means++ and RPK-means++ in [28]. Furthermore, when K = 800, the speedup for SRPK-means‖ with respect to K-means‖ is 7-8 and with respect to SK-means‖ 9-10. Moreover, according to Figure 1c, SRPK-means‖ (when P = 40) and SK-means‖ sustain their accuracy relative to K-means‖ when K is increased.

Experiments with High-Dimensional Synthetic Datasets
Finally, to strengthen the evaluation, we next summarize the experiments with the novel synthetic datasets, where symmetric, spherical clusters are hidden in a very high-dimensional space. A comparison of K-means initialization methods for datasets with an assured spherical shape is clearly relevant, because K-means restores such geometries during the clustering process [4,6,44]. Please note that using SSE for high-dimensional data can be ambiguous [52]. As we will demonstrate, the SSE difference between good and bad clustering results in a high-dimensional space can be surprisingly small. To show this, we also analyze the final clustering accuracy with normalized mutual information (NMI) [53] with the actual entropies. Please note that it would have been uninformative to use NMI as a quality measure for the datasets in Section 4.1, because there we had no information on the cluster geometry. Mutual Information (MI) can be defined as

MI(I, L) = ∑_{k=1}^{K} ∑_{l=1}^{K*} (n_{kl}/N) log2( (N · n_{kl}) / (n_k n_l) ),

where I refers to the cluster labels, L to the ground truth labels, K* to the number of unique ground truth labels, and n_{kl}, n_k, and n_l denote the labels' frequency counts. Then, NMI can be defined as

NMI(I, L) = 2 MI(I, L) / (H(I) + H(L)),

where H(I) = log2(K) and H(L) = log2(K*) denote the entropies of the cluster labels and the ground truth labels. Details of the six datasets generated with Algorithm 4 are summarized in Table 5. The generated datasets are referred to as M-spheres. For each dataset, we set N_K = 10,000, K = 10, and d_r = 1. To demonstrate interesting effects in the clustering initialization, we varied the nearest cluster center distance as d_c ∈ {0.05, 0.1, 0.2} and the data dimension as M ∈ {1000, 10,000}. We set the d_c values much smaller than d_r in order to increase the difficulty of the clustering problems. In Figure 2, PCA projections on the three largest principal components show that the clusters are more separated for M = 10,000 than for M = 1000.
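The NMI definition above can be computed directly from two label vectors; the following is a minimal sketch (the function name is ours) that follows the normalization with H(I) = log2(K) and H(L) = log2(K*) as defined in the text:

```python
import numpy as np

def nmi(labels, truth):
    """NMI as defined in the text: MI normalized by
    H(I) = log2(K) and H(L) = log2(K*)."""
    I = np.unique(labels, return_inverse=True)[1]
    L = np.unique(truth, return_inverse=True)[1]
    K, Kstar, n = I.max() + 1, L.max() + 1, len(I)
    mi = 0.0
    for k in range(K):
        for j in range(Kstar):
            nkj = np.sum((I == k) & (L == j))
            if nkj > 0:
                # (n_kj / n) * log2(n * n_kj / (n_k * n_j))
                mi += (nkj / n) * np.log2(n * nkj / (np.sum(I == k) * np.sum(L == j)))
    return 2.0 * mi / (np.log2(K) + np.log2(Kstar))
```

A perfect clustering (labels matching the ground truth up to renaming) gives NMI = 1, while labels that are independent of the ground truth give NMI close to 0.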
For the M-spheres datasets, we used the serial implementations of the initialization methods with the same settings as before. Results for the final clustering accuracy using NMI with 100 repeats are shown in Figure 3. Clearly, SRPK-means‖ outperforms the other methods in terms of final clustering accuracy for all the synthetic datasets. Moreover, if we compare the results between the datasets with M = 1000 and M = 10,000, we observe that the clustering accuracy of SRPK-means‖ improves when the dimensionality increases. The most significant difference is obtained for the M-spheres-M10k-d_c0.05 dataset, where K-means++ suffers a total breakdown of accuracy, while SRPK-means‖ is able to find a near-optimal clustering result out of the 100 repeats. Moreover, the accuracy of K-means++ is clearly worse than that of K-means‖ and SK-means‖ for the very high-dimensional datasets. We also tested K-means with random initialization for this dataset and observed from the statistical testing that K-means++, K-means‖, and SK-means‖ are no better than the random initialization in terms of NMI. These results demonstrate that the use of distances in K-means++ type initialization strategies can become meaningless in very high-dimensional spaces.
In Figure 4, scatter plots of SSE and NMI values show that the relative SSE difference between the worst possible clustering result (NMI = 0) and the optimal clustering result (NMI = 1) can be surprisingly small for very high-dimensional data. Therefore, in terms of how well the spherical clusters are found for the high-dimensional datasets, the improvements in the final clustering accuracy in Table 2 can be much more significant than the impression given by SSE. Figures 3 and 4 illustrate the deteriorating behavior of the currently most popular K-means++ initialization method in high dimensions. In particular, we observe that the K-means++ initialization behaves like (i.e., is not better than) the random one in the very high-dimensional cases. This finding also suggests further experiments in which the emergence of such behavior is studied as a function of the data dimension, in order to identify the most appropriate random projection dimensions for restoring the quality of the initialization and of the whole clustering algorithm.
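The degeneration of distance-based seeding can be illustrated with the well-known distance concentration effect: as the dimension grows, the relative contrast between the nearest and farthest points shrinks, so the D^2 sampling probabilities of K-means++ become nearly uniform. A minimal sketch with random Gaussian data (illustrative only; this is not our M-spheres generator):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(M, n=200):
    """(d_max - d_min) / d_min over the distances from a random
    query point to n random points in M dimensions."""
    X = rng.standard_normal((n, M))
    q = rng.standard_normal(M)
    d = np.linalg.norm(X - q, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as the dimension grows: distances become
# nearly equal, and distance-based seeding loses its signal.
for M in (10, 1000, 10000):
    print(M, relative_contrast(M))
```

This mirrors the observation that K-means++ degenerates toward random initialization on the M = 10,000 datasets.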

Conclusions
In this paper, we proposed two parallel initialization methods for large-scale K-means clustering, together with a new high-dimensional clustering data generation algorithm to support their empirical evaluation. The methods are based on a divide-and-conquer type of K-means‖ approach and on random projections. The proposed initialization methods are scalable and fairly easy to implement with different parallel programming models.
The experimental results for an extensive set of benchmark and novel synthetic datasets showed that the proposed methods improve clustering accuracy and the speed of convergence compared to state-of-the-art approaches. Moreover, the deteriorating behavior of the K-means++ and K-means‖ initialization methods in high dimensions can be remedied with the proposed RP-based approach, which provides accurate initialization also for high-dimensional data. Our experiments also confirmed the finding (e.g., [52]) that the difference between the errors (SSE) of good and bad clustering results in high-dimensional spaces can be surprisingly small, which also challenges cluster validation and cluster validation indices (see [4] and references therein) in such cases.
The experiments with the SRPK-means‖ method demonstrate that the use of RP and K-means‖ is beneficial for clustering large-scale, high-dimensional datasets. In particular, SRPK-means‖ is an appealing approach as a standalone algorithm for clustering very high-dimensional, large-scale datasets. In future work, it would be interesting to test an RP-based local SSE selection for SRPK-means‖, which would use the same RP matrix in each subset for the initial prototype selection. In this case, the use of sparse RP variants [54] or the mailman algorithm [17] for the matrix multiplication could be beneficial, particularly in applications where K is close to P. Furthermore, integrating the proposed methods into the robust K-means [37] would be beneficial for clustering noisy data, because the clustering problems in these cases are especially challenging.
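For context, the sparse RP variant mentioned above can be sketched with Achlioptas' database-friendly construction, where the matrix entries are drawn from {+sqrt(3), 0, -sqrt(3)} with probabilities {1/6, 2/3, 1/6}. A minimal sketch under these assumptions (the dimension values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def sparse_rp_matrix(M, p):
    """M x p sparse random projection matrix (Achlioptas-style):
    entries are sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6},
    scaled by 1/sqrt(p) so squared norms are preserved in expectation."""
    R = rng.choice([-1.0, 0.0, 1.0], size=(M, p), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0 / p) * R

# Project 1000-dimensional data to p = 40 dimensions; pairwise
# distances are approximately preserved (Johnson-Lindenstrauss),
# and two-thirds of the matrix entries are zero on average.
X = rng.standard_normal((100, 1000))
Xp = X @ sparse_rp_matrix(1000, 40)
print(Xp.shape)  # (100, 40)
```

Because two-thirds of the entries vanish, the projection costs roughly one-third of the multiplications of a dense Gaussian RP, which is the motivation for considering it in the subset-local initialization.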