Efficient High-Dimensional Kernel k-Means++ with Random Projection

Abstract: Using random projection, a method to speed up both kernel k-means and centroid initialization with k-means++ is proposed. Motivated by upper error bounds, we approximate the kernel matrix and distances in a lower-dimensional space R^d before kernel k-means clustering. For random projections, previous work on bounds for dot products and an improved bound for kernel methods are considered for kernel k-means. The complexities of kernel k-means with Lloyd's algorithm and of centroid initialization with k-means++ are known to be O(nkD) and Θ(nkD), respectively, where n is the number of data points, D is the dimensionality of the input feature vectors, and k is the number of clusters. The proposed method reduces the computational complexity of the kernel computation in kernel k-means from O(n^2 D) to O(n^2 d), and that of the subsequent k-means with Lloyd's algorithm and of the centroid initialization from O(nkD) to O(nkd). Our experiments demonstrate that the speed-up of the clustering method with reduced dimensionality d = 200 is 2 to 26 times, with very little performance degradation (less than one percent) in general.


Introduction
The clustering problem is one of the oldest and most important problems in machine learning, and k-means clustering has attracted increasing attention recently. It is used in many fields, such as market segmentation, social network analysis, and astronomical data analysis. k-means clustering is an NP-hard problem. In this problem, we are given n data points X in the D-dimensional space. The goal is to separate the points X into k clusters C, where C_i denotes the i-th cluster, so as to minimize the potential function, which is the total sum of squared distances between each point and its closest center. That is,

φ = ∑_{i=1}^{k} ∑_{x ∈ C_i} ‖x − c_i‖^2,

where c_i denotes the center point of the i-th cluster. High-dimensional data have become increasingly frequent in machine learning. Memory use and computational cost would be reduced if the dimensionality could be reduced. To deal with this issue, plenty of techniques have been proposed, such as locally linear embedding, principal component analysis (PCA), ISOMAP, and multidimensional scaling. However, many of the proposed methods involve high computational training overheads and complex procedures, especially with very high-dimensional data. Because of this, there has been a surge of interest in new, simple, and more computationally efficient techniques using random projection for very high-dimensional data, for example, [3][4][5].
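As a concrete illustration, the potential can be evaluated directly from cluster assignments (a minimal NumPy sketch with toy data of our own; not from the paper):

```python
import numpy as np

def kmeans_potential(X, labels, centers):
    """Sum of squared distances from each point to its assigned cluster center."""
    return sum(float(np.sum((X[labels == i] - c) ** 2))
               for i, c in enumerate(centers))

# Toy example: two well-separated 2-D clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.5], [10.0, 10.5]])
```

Here each cluster contributes 0.5 to the potential, so the total is 1.0; k-means seeks the assignment and centers minimizing this quantity.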
In addition, kernel k-means is a popular method because k-means clustering uses the Euclidean distance metric to calculate the similarity between data points, implying that only clusters that are linearly separable in the original input space can be found. Obviously, this measure is not suitable for every dataset.
With kernel methods, data in the original input space are implicitly mapped into a different feature space through a user-specified kernel function. Data points are mapped to a higher-dimensional feature space. Working with data points after they have been mapped into a higher-dimensional space is a reasonable approach when there is no assumption that the data points are linearly separable, an assumption that underlies traditional k-means clustering. This advantage has made kernel k-means popular in recent years [6,7]. Kernel k-means, an extension of k-means, computes kernels instead of Euclidean distances for point-wise distances: it projects data points into a high-dimensional kernel space and performs k-means clustering in the mapped feature space to allow nonlinearly separable clusters. In kernel k-means, an n × n kernel matrix must be computed from the feature vectors in the original space, and the computation cost is a function of n and D.
To improve the performance of centroid initialization, the k-means++ seeding algorithm [8] is used to obtain the k initial centers, which gives the expected O(log k) approximation ratio, as shown by Arthur and Vassilvitskii. In [8], the k-means++ algorithm is introduced as a seeding algorithm for k-means that replaces the uniform sampling of Lloyd's initialization with a probability distribution in which the points' squared distances to the closest already-chosen centers serve as the weights, i.e., the weight for data point x is Dist(x)^2 / ∑_{x ∈ X} Dist(x)^2, with Dist(x) denoting the shortest distance from x to the closest center already chosen. The performance of k-means++ is guaranteed: it is proven to be O(log k)-competitive with the optimal clustering. However, the computational effort of the initialization still depends on the number of points n, the dimensionality D, and the number of clusters k, with a complexity of O(Dnk).
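The D²-weighted seeding described above can be sketched as follows (our own minimal illustration, not the authors' code; the function name is ours):

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """k-means++ seeding: sample centers with probability Dist(x)^2 / sum Dist(x)^2."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # first center: uniform at random
    for _ in range(k - 1):
        # Dist(x)^2: squared distance to the closest center chosen so far
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])  # two tight clusters
seeds = kmeanspp_seed(X, 2, rng)
```

On this toy input the second center is always drawn from the opposite cluster, since points coinciding with the first center have Dist(x)^2 = 0 and hence zero sampling weight.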
The remainder of the paper describes our contributions in applying kernel k-means with random projection and efficient initialization in the section "Our Method". A number of experimental results are shown in the Experiment Section. The optimal dimensionality, an ablation study, and a number of accuracy measures for our method are discussed before the Conclusion Section.

Contributions
In this work, there are two main contributions to the kernel k-means clustering method. The first is a sped-up version of k-means++: by using this centroid initialization prior to kernel k-means, the initial centers are selected carefully by probability in a projected lower-dimensional subspace. The second is a method that speeds up the computation of the Gaussian kernel by projecting the data matrix into a lower-dimensional subspace. Once the dimensionality of the data is reduced, the computation of kernel k-means is effectively sped up.

Related Work
In [9], random projection is adopted with iteratively increasing dimensionality for k-means clustering, which successfully obtains increasingly detailed resolutions and avoids local minima. Different methods for dimensionality reduction with feature selection and feature extraction for k-means clustering have been investigated by Boutsidis and Zouzias et al. in [10][11][12]. In [13], it is shown that dimensionality could be reduced by the use of PCA for data clustering with the objective function of k-means. An enhanced k-means method using a new method for centroid initialization with PCA is proposed by Napoleon and Pavalakodi in [14].
To improve clustering performance with kernels, in [15], an explicit theoretical connection between kernel k-means and spectral clustering methods is presented by Dhillon et al. That connection leads to a weighted kernel k-means that monotonically decreases the normalized cut. In [16], a new clustering scheme is proposed by Zhang et al., which changes the clustering order from the sequence of samples to the sequence of kernels and employs a disk-based strategy to control the size of data. The new clustering scheme has been demonstrated to save ninety percent of the running time. A novel algorithm called approximate kernel k-means is proposed by Chitta et al. in [17], which tries to scale up the kernel-based clustering algorithm by using randomization to reduce both the computational complexity and the memory requirement.
However, with very large-scale datasets, memory requirements have always been a concern for kernel k-means. Incomplete Cholesky factorization is used to accelerate clustering and reduce the memory cost [18]. Its core idea is to use the product of a low-rank matrix and its transpose to approximate the complete kernel matrix. It is shown that, as the eigenvalues of the kernel matrix decrease, the incomplete Cholesky factorization converges exponentially. Compared to k-means, kernel k-means can capture nonlinear structures, at the cost of scaling poorly as the matrix size increases, because kernel methods always operate on the kernel matrix of the data. By employing approximately finite-dimensional feature maps based on spectral analysis, Amir Aradnia et al. [19] propose an approach combining the advantages of both the linear and nonlinear methods. They suggest applying approximately finite-dimensional feature maps in kernel k-means to reduce the huge kernel matrix stored in memory. Their explicit kernel Minkowski weighted k-means has been shown to be adaptive and capable of clustering in the new space. By adopting the Minkowski metric and feature weighting, the clustering performance improved and exhibited stronger robustness to noise features. Exploiting information from multiple views can also yield better performance than clustering on a single view. For multiple-view clustering, a cluster-weighted kernel k-means method has been proposed [20]. It demonstrates high efficiency by assigning reasonable weights to related clusters among different views. The weight accounts for the significance of each cluster from each view to the final solution and can be learned automatically based on the intra-cluster similarity between a cluster and its corresponding clusters in the other views.
In previous work on improved centroid initialization, k-means++, mentioned in the introduction, by Arthur and Vassilvitskii [8] is O(log k)-competitive with the optimal clustering. However, whether k-means++ yields an O(1)-approximation with probability 1/poly(k) is still an open question. In the work of Jaiswal and Garg [21], it is shown that the sampling algorithm gives an O(1) approximation with probability Ω(1/k) for any k-means problem instance where the data satisfy certain separation conditions. On the other hand, for the instances constructed in [22], k-means++ achieves an approximation ratio of (2/3 − ε) log k. In [23], by Bhattacharya, Jaiswal, and Ailon, a simple two-dimensional dataset is presented on which the seeding algorithm achieves an O(log k) approximation ratio with probability exponentially small in k. In [24], the presented seeding algorithm works well even when more than one center is chosen in a single iteration with k-means++ parallelized, which drastically reduces the number of passes needed to obtain a good initialization for k-means. In [25], it has been shown that, by mapping data points into a randomized lower-dimensional subspace, the point-wise distances are preserved as well as with conventional dimensionality reduction methods, such as principal component analysis (PCA).
Unlike the approaches in [18,20], our goal is to improve the computational efficiency of kernel k-means and of its initialization with k-means++, using random projection for dimensionality reduction so that all distances are computed in a much lower-dimensional space. We propose an efficient clustering method by (1) making k-means++ more efficient with random projection and (2) subsequently clustering the data using kernel k-means in the projected lower-dimensional space. The intuition of the proposed method is to reduce the computational complexity by performing dimensionality reduction prior to both centroid initialization and kernel k-means. One of the advantages of the proposed method is that it is simple to code and analyze. The induced error is theoretically bounded by the Johnson-Lindenstrauss lemma (the JL lemma) (i.e., ε = √(20 log n / d)). Compared to traditional dimensionality reduction approaches, such as feature selection, the feature extraction we propose considers all the available features. This reduces the risk that relevant features might be mistakenly overlooked. Furthermore, in the efficient initialization we propose, the user can strike a balance between accuracy and required learning time through the selection among kmFRP/kmIRP/kmBRP.

Kernel k-Means with Random Projection
With kernel k-means, similarities are determined by replacing inner products with kernel functions. The complexity of the kernel computation is O(n^2 D), which is a function of the dimensionality. The intuition of our method is to reduce D by performing dimensionality reduction with random projection prior to centroid initialization and kernel k-means clustering. The proposed method allows the kernel matrix and the distances for initialization to be computed in a much lower-dimensional space in order to speed up the clustering process.
As the complexity of the kernel method with conventional kernel matrix computation X · X^T is O(n^2 D), the computation of the kernel matrix can be expensive when the dimensionality D is very large, as with genomic data, for example. We propose to approximate the kernel matrix by using random projection, which maps the data points to a lower-dimensional space and computes φ(X) · φ(X)^T with a time complexity of O(n^2 d), which is independent of D. With the same rationale as the method we propose for k-means++, combining random projection with kernel k-means allows the clustering process to be carried out in a lower-dimensional space without significant performance degradation.
With random projection and D-dimensional data vectors x with n data points, we have X ∈ R^{n×D}, whose vectors are projected into a much lower-dimensional subspace R^d with d ≪ D, i.e.,

X_proj = (1/√d) X R,

where D denotes the original dimensionality, d denotes the reduced dimensionality, and R ∈ R^{D×d} is a random matrix whose entries are normal random variables. The complexity of this algorithm is of order O(Ddn). If the data matrix X is very sparse with s non-zero entries per dimension, the complexity is of order O(sdn).
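A minimal sketch of this projection step (the 1/√d scaling, standard for Gaussian random projection, preserves squared norms and distances in expectation; all names here are ours):

```python
import numpy as np

def random_project(X, d, rng):
    """Map n x D data to n x d with a Gaussian random matrix R (entries ~ N(0, 1))."""
    R = rng.standard_normal((X.shape[1], d))
    return X @ R / np.sqrt(d)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))      # n = 50, D = 1000
X_proj = random_project(X, 200, rng)     # d = 200 << D
```

For a fixed vector, the squared norm after projection follows a chi-squared distribution scaled so that its expectation equals the original squared norm, which is the mechanism behind the JL-type guarantees quoted above.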
By the JL lemma, for any pair of points u, v ∈ X and the projection f, with high probability

(1 − ε) ‖u − v‖^2 ≤ ‖f(u) − f(v)‖^2 ≤ (1 + ε) ‖u − v‖^2,

where ε is the error bound, determined as a function of the number of data points and the dimensionality of the projected matrix, i.e., n and d, as ε = √(20 log n / d). For kernel methods, by Lemma 1 of [1], there is a probability of at least 1 − δ that the difference between the dot product in the original feature space and the dot product in the projected space is bounded. In our method with kernel k-means using random projection, the kernel matrix is approximated in a lower-dimensional space with the above error bounds considered, as shown in Algorithm 1.

Algorithm 1:
Kernel k-means with random projection.
4 Choose initial centers c_1, c_2, . . . , c_k uniformly at random from X_proj.
5 do
6   for i ∈ 1, . . . , n do
7     for j ∈ 1, . . . , k do
        . . .

Since we propose to use random projection for both the center point initialization and the clustering, the same projected matrix in a lower-dimensional space is used as input to both k-means++ and kernel k-means. For kernel methods with random projection, the dot product is approximated in the same way as the Euclidean distance is approximated using the JL lemma for the preservation of point-wise Euclidean distances. In [2], an improved upper bound on the error of the dot product with random projection is provided. Thus, to justify the upper-bounded error of random projection for kernel k-means++, using both k-means++ and kernel k-means, the JL lemma as well as the improved bound for kernel methods from [2] are considered.
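The effect of projecting first and computing the kernel afterwards, as in Algorithm 1, can be checked numerically: the Gaussian kernel matrix built from projected data stays close to the one built from the original data (a sketch of our own; the sample sizes, sigma, and d are assumptions, not values from the paper):

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """n x n Gaussian kernel matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))                  # n = 100, D = 1000
R = rng.standard_normal((1000, 200))
K_full = gaussian_kernel(X, sigma=30.0)               # distances cost O(n^2 D)
K_proj = gaussian_kernel(X @ R / np.sqrt(200), 30.0)  # distances cost O(n^2 d)
```

Because the Gaussian kernel depends only on pairwise squared distances, the JL-style distance preservation carries over directly to the kernel entries.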
In addition, in [26], it has been shown that initial values from k-means++ with the Euclidean distance can also improve the kernel k-means clustering performance, measured by the within-cluster sum of squares (WCSS), and the time for convergence during training. In [27], it is also shown that selecting the k-means centroids in the instance space provides a good estimate of the kernel k-means centroids with the Gaussian kernel, and it is computationally cheaper than initialization performed in the kernel feature space.

Efficient Initialization
The k-means++ initialization algorithm is a popular initialization method for k-means, with performance bounded by the statistical guarantee in [8]. We use random projection to approximate distances in the k-means++ algorithm in order to reduce its O(nkD) time complexity. We let Dist(x) denote the shortest distance from a data point x to the closest center already chosen. Since the preservation of point-wise distances under random projection is guaranteed by the JL lemma [28], the dimensionality of the input data can be reduced while Dist(x) remains close to its original value. The computation of the initialization can thus be reduced significantly, especially when the original dimensionality is very high.
An expensive step in the k-means++ algorithm is computing the weights Dist(x)^2 when taking a new center c_i, chosen as x ∈ X with probability Dist(x)^2 / ∑_{x ∈ X} Dist(x)^2. As with kernel methods, after random projection we do not need direct access to x, since Dist(x)^2 only needs to be computed to choose the initial centers. We approximate Dist(x)^2 because the original dimensionality of x can be very high, making the exact computation of Dist(x)^2 expensive.
Unlike the work done by Boutsidis and Zouzias et al. [10][11][12], our idea is to find a good approximation to Dist(x)^2 that can be shown to be sufficiently close to the original value of Dist(x)^2 and, at the same time, computationally simple. By approximating Dist(x)^2 in the k-means++ algorithm, we obtain an approximate algorithm to k-means++ with a complexity of O(nk). The approximation to Dist(x)^2 can be obtained through random projections without perturbing the distances too much. We empirically show that these algorithms perform well on real-world datasets.
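The quality of this approximation is easy to check empirically: Dist(x)^2 computed entirely in the projected space stays close to its value in the original space (a sketch with synthetic data and names of our choosing; the dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 800))            # n = 50, D = 800
R = rng.standard_normal((800, 200))
X_proj = X @ R / np.sqrt(200)                 # d = 200

centers, centers_proj = X[:3], X_proj[:3]     # 3 already-chosen centers

def dist2_to_closest(points, cents):
    """Dist(x)^2: squared distance of each point to its closest center."""
    return np.min([np.sum((points - c) ** 2, axis=1) for c in cents], axis=0)

d2_orig = dist2_to_closest(X, centers)            # O(nkD) per evaluation
d2_proj = dist2_to_closest(X_proj, centers_proj)  # O(nkd) per evaluation
```

For points that are themselves centers, both quantities are exactly zero; for the rest, the projected values track the original ones within the JL-type relative error, which is what makes the D²-weighted sampling distribution nearly unchanged.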
It has been shown that k-means++ is O(log k)-competitive with the optimal clustering. The algorithm improves both the speed and the accuracy of Lloyd's algorithm for k-means. We take advantage of the performance guarantee of k-means++ provided by Theorem 3.1, which states that "if C is constructed with k-means++, then the corresponding potential function satisfies E[φ] ≤ 8(ln k + 2)φ_OPT" [8].
Our proposed method for initialization is shown in Algorithm 3, which calls Algorithm 2 as a subroutine. Algorithm 2 generates one of three types of random matrices for k-means++ with random projection, i.e., fixed random matrices, different random matrices for each iteration, and buffered random matrices (kmFRP, kmIRP, and kmBRP, respectively). Algorithm 3 takes the random matrix generated in Algorithm 2, computes the projected vectors with random projection, and finds initial centers using data points projected into a lower-dimensional space in the matrix X_proj.

Algorithm 2: Proposed function to generate random matrix.
Input: type is the type of the random matrix, which can be kmFRP, kmIRP or kmBRP
Input: b is the buffer size, required if type is kmBRP
Output: random matrix R ∈ R^{D×d}
1 Create buffer B the first time this subroutine is called, such that B := {R_1, R_2, . . . , R_b}, with the loop:
. . .

k-means++ with Fixed Random Projection (kmFRP)

kmFRP is the simplest variant of the initialization method. We approximate X by using X_proj such that Dist(x)^2 can be found with a time complexity of O(d) instead of O(D). X_proj ∈ R^{n×d}, with the projected vectors, is computed before the seeding stage of k-means++ with random projection.

k-means++ with Iterative Random Projection (kmIRP)
With kmIRP, we also approximate X by using X_proj. However, a new random projection is used at each iteration to produce X_proj, with the aim of reducing the performance degradation, measured in WCSS, introduced by the approximation with random projection. This is akin to stochastic gradient descent. With a different X_proj at each iteration, the approximation to Dist(x)^2 differs, so that over all iterations X_proj better approximates the original high-dimensional vectors x through averaging. The disadvantage, however, is that the random projection is computed at each iteration, which makes this variant computationally more expensive.

k-Means++ with Buffered Random Projection (kmBRP)
Buffering a fixed number of random matrices is also considered for the initial center selection. The aim is to combine the advantages of the previous two variants with a limited number of random projections for efficiency. Although multiple random projections can help reduce the bias in the approximation, a new random matrix at each iteration may not be necessary. At the initialization stage, we have a buffer B with b random matrices from which one is drawn at each iteration.

In summary, k-means++ with fixed random projection (kmFRP) is the simplest and most computationally efficient variant of our method, but it may lead to a larger degradation of performance, as discussed previously. k-means++ with iterative random projection (kmIRP) theoretically provides the least biased clustering across iterations, at the cost of more computation for random projections. k-means++ with buffered random projection (kmBRP) lies between kmFRP and kmIRP, striking a balance between good optimization performance and reasonable computational efficiency. With the free parameter b, the buffer size, the user can control the trade-off between optimization performance and computational efficiency.
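The three random-matrix policies of Algorithm 2 can be sketched as follows (our own illustration, not the authors' code; the function name and buffer handling are assumptions):

```python
import numpy as np

def make_random_matrix(type_, D, d, rng, buffer=None, b=4):
    """Sketch of Algorithm 2's three variants.

    kmFRP: one fixed matrix, generated once and reused at every iteration.
    kmIRP: a fresh random matrix on every call.
    kmBRP: a buffer of b pre-generated matrices; each call picks one of them.
    """
    if type_ == "kmIRP":
        return rng.standard_normal((D, d))
    if type_ == "kmFRP":
        if not buffer:                        # first call: generate and store
            buffer.append(rng.standard_normal((D, d)))
        return buffer[0]
    if type_ == "kmBRP":
        while len(buffer) < b:                # first call: fill the buffer
            buffer.append(rng.standard_normal((D, d)))
        return buffer[rng.integers(b)]        # reuse one of the b matrices
    raise ValueError(f"unknown type: {type_}")
```

The buffer makes the cost of matrix generation a one-off O(bDd) while still averaging over several projections across iterations.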

Algorithm 3: Proposed k-means++ approximation with different types of random projection.
Input: type is the type of the random matrix, which can be kmFRP, kmIRP or kmBRP
Input: b is the buffer size, required if type is kmBRP
Output: k initial centers selected
1 Choose the initial center c_1 uniformly at random from X
. . .
6 if type is either kmIRP or kmBRP then
7   R = call Algorithm 2 with type, b
8 end

Complexity
Random projection is known to have a time complexity of O(sdn) instead of O(Ddn) for sparse matrices, where s denotes the number of non-zero entries per dimension. Sparse data are common in various applications, including text classification and bioinformatics (e.g., gene expression data). k-means++ has a complexity of O(Dnk). With random projections, the complexity of k-means++ becomes of order O(dnk), as we can avoid computing over all D dimensions. In practice, with our application of the JL lemma in the center initialization of k-means++, the dimensionality can be reduced to a few hundred, e.g., 200. The lemma requires no assumption on the original data [28]. Therefore, d in O(dnk) can be treated as a constant, and O(dnk) becomes O(nk).
Datasets

MNIST is a ten-class database for handwritten digit recognition. There are 60,000 samples in the training set and 10,000 samples in the testing set. In the testing set, there are 1000 samples per class label. Every sample consists of 28 by 28 pixels (i.e., 784 features), each taking a value in the interval [0, 255] that represents the greyscale of that pixel. We evaluate clustering performance by comparing how well different methods optimize clustering with the WCSS measure. In our experiments, we sample a subset of 5000 data points from the training set. Thus, the dataset to be examined consists of 5000 data points with dimensionality 784 (i.e., n = 5000, D = 784).
The ImageNet [30] database consists of over 14.2 million images organized according to the WordNet hierarchy. Each node in this hierarchy represents a concept (known as a "synset"), and all the images are indexed by over 21,000 synsets. In our analysis, we chose 14,112 images from ten cat-like synsets (namely, Egyptian cat, foumart, lynx, Madagascar cat, mountain lion, Persian cat, Siamese cat, skunk, tabby cat, and tiger cat). They are all leaf nodes in the hierarchy. We call this dataset the ImageNet-Cat dataset. Figure 1 shows an example from each synset used to form this dataset. We used the bag-of-visual-words features of these images from image-net.org, which consist of 1000 visual-word features per photo. In our experiments, we sampled 2000 images from the ImageNet-Cat dataset for evaluation, with n = 2000 and D = 1000.

The 20-Newsgroups dataset contains a collection of newsgroup articles from 20 different newsgroups, with approximately 1000 documents from each. In our experiment, we clustered data from 5 of these newsgroups, including alt.atheism, os.ms-windows, sport.baseball, sci.space and religion.misc. We applied a number of pre-processing steps to the corpus, including converting characters to lower case, removing punctuation, removing white space, and removing English stop words. Stop words are 174 predefined common English words believed to contain very little information for text mining (such as "we", "your", "the", "any", and "only"). Terms in the corpus are columns, while documents are rows. Columns with a sparsity of 98% or above were removed, and 2000 data points were sampled before each iteration of the evaluation. This resulted in a subsampled set with 2000 data points and 929 dimensions, i.e., n = 2000, D = 929.
Reuters-21578 contains a collection of 21,578 news articles that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by staff at Reuters Ltd. and Carnegie Group, Inc. The Reuters-21578 dataset that we used is the well-formed dataset assembled in the text mining package (i.e., tm) in R [35]. In our experiment, we selected news only from the acq and earn topic categories, with 2362 and 3945 news articles, respectively. We removed the terms with sparsity 99.5% or above before clustering. Thus, there are 6307 data points with 974 dimensions (n = 6307, D = 974) in the document-term matrix.

Environments
Our experiments (the source code of the project would be provided upon publishing of the paper through: https://github.com/janchanyk/kmeans-plusplus-RP, accessed date 27 July 2021) were run on a 64-bit platform with an Intel 3.1 GHz Quad-Core CPU, equipped with 32 GB of memory. All the experiments were run on the same platform.

Performance Metrics
We evaluated the results by observing the performance and the execution time of each experiment. To evaluate performance, we used the standard potential function of the k-means problem, i.e., the within-cluster sum of squares (WCSS). We used it both to measure the distance of each data point to its closest initial center after initialization and to evaluate the overall clustering performance. Our objective is to test how well the proposed algorithms minimize the potential compared to the baseline. The potential, WCSS, is defined as

WCSS = ∑_{i=1}^{k} ∑_{x ∈ C_i} ‖x − c_i‖^2,

where C_i is the i-th cluster and c_i is the (initial) center point of the i-th cluster.
Since we used dimensionality reduction before our proposed initialization algorithm and clustering, the clustering performance in that lower-dimensional space should not be compared to that of the original input space. Therefore, we obtained the cluster label of each point and computed WCSS of these points to the center of mass of each cluster, back in the original input space.
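Concretely, labels obtained by clustering in the projected space are scored against centers of mass recomputed in the original input space (a minimal sketch; the names and toy data are ours):

```python
import numpy as np

def wcss_original_space(X, labels, k):
    """WCSS of a clustering, with centers of mass taken in the ORIGINAL space,
    even if the labels were produced by clustering projected data."""
    total = 0.0
    for i in range(k):
        pts = X[labels == i]
        if len(pts):
            total += float(np.sum((pts - pts.mean(axis=0)) ** 2))
    return total

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])    # e.g., labels found in a projected space
```

This keeps the evaluation comparable across methods: only the label assignment comes from the low-dimensional space, while the reported WCSS is always measured against original-space coordinates.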
Another evaluation metric is the execution time. For the experiments of our proposed algorithm on k-means++, we only counted for the time used in the initial center selection. Our objective is to see the time used by the algorithms for the center points initialization, instead of the time consumed by k-means. In the experiments on clustering, we also used this time metric to see the time reduced in our proposed method. The execution time of clustering includes the processing time of both random projection and kernel k-means clustering.

Experiments on the Efficient Centroid Initialization
On the two bioinformatics datasets, namely GLI-85 and SMK-187, we first established the baseline by running k-means++ in the original input space. Subsequently, we ran the efficient centroid initialization method (i.e., kmFRP). We iterated each of the above experiments ten times to measure its average performance (in WCSS). We compared the WCSS from our proposed method against the baseline.

Experiments on the Kernel k-Means Clustering
On each dataset, we first established the baseline by running kernel k-means and kernel k-means++ in the original input space. Subsequently, the data were projected into a lower-dimensional feature space, and kernel k-means and kernel k-means++ were evaluated in that projected space. The Gaussian kernel was adopted when the data were fitted to kernel k-means clustering. For the experiments in the projected space, we increased the number of dimensions from 2 to 500 to see the impact on the clustering performance. We iterated each experiment ten times to measure its average performance and execution time.

Performance of Our Method for Centroid Initialization
We first evaluated the initialization performance when k-means++ was run in the projected subspace with our method. We compared the average performance in WCSS of our proposed k-means++ approximation method with the average performance of the original version of k-means++.
Both Figures 2 and 3 show that the initialization performance in WCSS of our method is close to the baseline. The WCSS of our k-means++ approximation is always within the error bar (i.e., one standard deviation) of the WCSS distribution of the original k-means++. As k increases, the error bar of the WCSS distribution of the original k-means++ narrows, and the performance in WCSS of our k-means++ approximation method remains close to the baseline and falls within the error bar. We then evaluated the difference in initialization performance between our method and the baseline. We evaluated the degradation of the average performance of our proposed k-means++ approximation algorithms by comparing them with the original version of k-means++. The degradation ε of the average performance in our experiments is defined as

ε = (WCSS_rp − WCSS_orig) / WCSS_orig,

where WCSS_rp is the distance from our method after random projection and WCSS_orig is the distance from the original version of k-means++. The values and the execution times of the different methods are compared in Tables 1 and 2, respectively, to show the seeding performance on GLI-85, SMK-187 and the Reuters-21578 dataset. The larger the dataset, the bigger the time improvement. In general, the proposed method has a computational advantage, with reduced execution time, when k ≥ 20.
Throughout the experiments on the bioinformatics datasets GLI-85 and SMK-187 and on the Reuters-21578 dataset, we found that dimensionality reduction using random projection does not induce any significant increase in the degradation ε of the center point initialization. The figures also show that the dimensionality can be reduced to 10% of the original dimensionality while the center point initialization still performs well with a small ε. In general, we also found that the larger the number of clusters k, the smaller the ε induced by our approximation.
In the experiment on the Reuters-21578 dataset, since it is the largest dataset among the three (i.e., n = 21,578, D = 2003), the time improvement in the center point initialization exists in most of the experiments with different parameters. For the GLI-85 and SMK-187 datasets, the major time reduction only happens for large k. When k is small, the cost of random projection offsets the cost saved in the center point initialization. In other words, the proposed solution is especially fit for problems of high dimensionality with large k.

Clustering Performance of Our Method
We evaluated the clustering performance on six different real-world datasets. Figure 4 indicates that an overall execution time improvement in the projected space can be seen in all experiments. With dimensionality high enough that no performance degradation is observed, our method generally reduces the execution time by over 50%. With the bioinformatics datasets, kernel k-means++ in the projected space runs 10-20 times faster than ordinary kernel k-means in the original input data space. Table 3 summarizes the time reduced in our experiments (superscript 1 marks the execution time of our proposed method; superscript 2 marks that of the vanilla kernel k-means). The baseline execution times show that k-means++ does not always shorten the clustering process. The execution times on the bioinformatics datasets GLI-85 and SMK-187 indicate that, on a dataset with high dimensionality and limited data points (e.g., bioinformatics data), k-means++ takes much longer to find the initial centers than the kernel k-means clustering itself. Figure 5 illustrates the clustering performance of the proposed method against increasing dimensionality. There are two horizontal lines, in green and blue, corresponding to kernel k-means and kernel k-means++, respectively, in the input feature space (the baseline), together with the clustering performance of our methods in black and red against an increasing number of dimensions.
The results of the six experiments generally indicate that the clustering performance of kernel k-means++ is either better than or close to that of kernel k-means, whether in the original input space or the projected feature space. In the projected space, the WCSS is much higher than that of the baseline only when the dimensionality is very low; in other words, the performance deteriorates considerably only when the data are projected into an extremely low-dimensional space. The clustering performance converges to that of the baseline as the number of dimensions increases, generally for a dimensionality between 100 and 200.
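For reference, the WCSS reported throughout is the k-means potential from the introduction: the total squared distance from each point to the center of its assigned cluster. A minimal sketch in plain Python (points as coordinate tuples, integer cluster labels):

```python
def wcss(points, labels, centers):
    """Within-cluster sum of squares: the sum of squared Euclidean
    distances from each point to the center of its assigned cluster."""
    total = 0.0
    for point, label in zip(points, labels):
        total += sum((p - c) ** 2 for p, c in zip(point, centers[label]))
    return total
```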

Optimal Dimensionality of the Projected Space
We evaluated our methods on six datasets; the performance deteriorates considerably when the dimensionality is extremely low and converges to the result of the baseline once the optimal dimensionality is reached. This optimal dimensionality varies across datasets. In the experiment on the ImageNet-Cat dataset, picking synsets that are very different from each other (e.g., newspaper vs. tiger cat) yields data instances, represented by bag-of-visual-words scale-invariant feature transform (SIFT) features, that are highly separable; in such cases the problem can be solved even in a very low-dimensional (below 10) feature space. In practice, we may therefore start the algorithm in a low-dimensional space and increase the dimensionality, observing the clustering performance, until it saturates.
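The "increase the dimensionality until it saturates" heuristic can be sketched as follows. This is an illustrative procedure, not part of the evaluated method; `evaluate_wcss` is a hypothetical callback that runs the projected clustering at dimensionality d and returns the resulting WCSS.

```python
def find_saturating_dim(evaluate_wcss, d_start=10, d_max=400, rel_tol=0.01):
    """Double the projected dimensionality until the relative WCSS
    improvement falls below rel_tol, i.e., performance has saturated."""
    d = d_start
    prev = evaluate_wcss(d)
    while d * 2 <= d_max:
        d *= 2
        cur = evaluate_wcss(d)
        if prev - cur < rel_tol * prev:
            return d  # improvement has saturated
        prev = cur
    return d
```

On an easily separable dataset the loop terminates at a small d; otherwise it runs up to the d_max cap.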

Ablation Study on Our Method
The performance of our proposed method, kernel k-means++ with random projection (red lines in Figure 5), is close to that of kernel k-means in the original input space (green lines). The training times in Figure 4 show the time consumption of the different components of the method: for kernel k-means in the projected feature space, black lines correspond to experiments without k-means++ initialization and red lines to experiments with it. Both kernel k-means and kernel k-means++ take longer to complete without random projection, and the execution time increases linearly with the projected dimensionality. From the clustering-performance point of view, in all our results kernel k-means++ performs close to kernel k-means, whether clustering is performed in the original input space or in the projected low-dimensional space. Table 4 presents additional clustering performance measurements that help assess the reliability, stability, and robustness of the proposed method.

Reliability
Since our method is an approximation to kernel k-means, we assessed its accuracy by the percentage change in WCSS relative to vanilla kernel k-means. The percentage changes in Table 4 show that on all six datasets the WCSS from our method is very close to (within one percent of) that of vanilla kernel k-means. Some cases record negative changes in WCSS, meaning the clustering performance of our method is even better than that of vanilla kernel k-means.
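The percentage change used here is the standard relative difference; a one-line helper (our naming) makes the sign convention explicit:

```python
def wcss_pct_change(approx, baseline):
    """Percentage change in WCSS of the approximation relative to the
    vanilla kernel k-means baseline; a negative value means the
    approximation achieved a lower (better) WCSS."""
    return 100.0 * (approx - baseline) / baseline
```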

Robustness
For supervised learning, robustness usually refers to the gap between training error and test error. In our clustering setting, we instead assess robustness with respect to the noise introduced by the random projection. Theoretically, the smaller the projected dimensionality, the larger this noise. In our experiments on six datasets, when d > 200 the percentage change in WCSS is below 1% in most cases; in other words, the proposed method remains robust when the projected dimensionality d exceeds 200.
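The noise argument can be illustrated directly: a Gaussian random projection scaled by 1/sqrt(d) approximately preserves pairwise Euclidean distances, with distortion shrinking as d grows. A minimal pure-Python sketch with illustrative dimensions (D = 1000, d = 200 are chosen for the example, not taken from our experiments):

```python
import math
import random

def project(x, R):
    """Map x from R^D to R^d with a Gaussian matrix R (d rows of
    length D), scaled by 1/sqrt(d) so squared distances are
    preserved in expectation."""
    d = len(R)
    return [sum(r * xi for r, xi in zip(row, x)) / math.sqrt(d) for row in R]

random.seed(0)
D, d = 1000, 200
R = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(d)]
x = [random.gauss(0.0, 1.0) for _ in range(D)]
y = [random.gauss(0.0, 1.0) for _ in range(D)]

original = math.dist(x, y)
projected = math.dist(project(x, R), project(y, R))
distortion = abs(projected - original) / original  # small when d is large
```

Repeating this with a smaller d makes the typical distortion grow, which is the noise the robustness assessment measures.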

Stability
In our experiments, for each combination of dataset and d, we repeated the evaluation of our proposed method ten times and estimate stability via the fluctuation (i.e., stdev/mean) of the resulting WCSS. The fluctuation of the WCSS is below 5% of the mean, and in most cases below 2%, close to the fluctuation of the WCSS from vanilla kernel k-means. On the other hand, the fluctuation of the WCSS from our approximation is consistently higher (by less than one percentage point) than that from vanilla kernel k-means. In other words, the speed-up of our proposed solution comes at a cost, but this cost was relatively small in our experiments on six datasets.
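The fluctuation statistic above is the coefficient of variation; a small helper, assuming a list of WCSS values from the repeated runs:

```python
import statistics

def fluctuation(wcss_runs):
    """Coefficient of variation (sample stdev divided by mean) of the
    WCSS over repeated runs; smaller values mean a more stable method."""
    return statistics.stdev(wcss_runs) / statistics.mean(wcss_runs)
```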

Conclusions
We propose a method that applies dimensionality reduction before k-means++ centroid initialization and kernel k-means clustering. To justify its clustering performance under random projection for both k-means++ and kernel k-means, we consider the JL lemma and the improved bound for kernel methods from [2]. Experimental results indicate that the performance of our approach is close to that of kernel k-means with k-means++ initialization, without significant degradation. On a high-dimensional dataset in our experiments (n = 187, D = 19,993), the proposed algorithm achieves a 26-fold speed-up over vanilla kernel k-means. These results demonstrate that the proposed method speeds up kernel k-means, by means of dimensionality reduction using random projection, without significant performance degradation.

Managerial and Academic Implications
Clustering is one of the major approaches in unsupervised learning. It is heavily used in areas such as spam filtering, marketing, and document analysis, and k-means is one of the most popular clustering methods. However, k-means does not scale well when the dimensionality of a dataset is very high. Our proposed method tackles this weakness and improves the scalability of k-means++ and kernel k-means.

Limitations
The method we propose only reduces the dimensionality from D to d, so the complexity is reduced from O(nkD) to O(nkd) while the number of data points n remains unchanged. The execution time may therefore remain sensitive to n.

Future Studies and Recommendations
In the proposed method, we reduce the computational cost of k-means++ and kernel k-means using random projection. The random projection could be further sped up by some advanced approaches [36][37][38][39]. It could be interesting to consider these approaches to further speed up the k-means clustering algorithm.