1. Background
Given a set of n points in ℝ^d and an error parameter ε > 0, a coreset in this paper is a small set of weighted points in ℝ^d, such that the sum of squared distances from the original set of points to any set of k centers in ℝ^d can be approximated by the sum of weighted squared distances from the points in the coreset. Running an existing clustering algorithm on the coreset then yields an approximation to the output of running the same algorithm on the original data, by the definition of the coreset.
Coresets were first suggested by [1] as a way to improve the theoretical running time of existing algorithms. Moreover, a coreset is a natural tool for handling big data in all the computation models mentioned in the previous section. This is mainly due to the merge-and-reduce tree approach that was suggested by [2,3] and formalized by [4]: coresets can be computed independently for subsets of the input points, e.g., on different computers, and then be merged and re-compressed again. Such a binary compression tree can also be computed using one pass over a possibly unbounded stream of points, where at any moment only O(log n) coresets exist in memory for the n points streamed so far. Here the coreset is computed only on small chunks of points, so a possibly inefficient coreset construction still yields an efficient construction for large sets; see Figure 1. Note that the coreset guarantees are preserved under this technique, while no assumptions are made on the order of the streaming input points. These coresets can also be computed independently and in parallel via M machines (e.g., on the cloud), reducing the running time by a factor of M. The communication between the machines is also small, since each machine needs to send to the main server only the coreset of its data.
In practice, this technique can be implemented easily using the map-reduce approach of modern big-data software such as [5].
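To make the merge-and-reduce tree concrete, here is a minimal Python sketch, assuming a generic `reduce_fn(points, weights)` routine that stands in for any coreset construction (e.g., the one in this paper) and returns a smaller weighted set; the function and variable names are ours and only illustrate the bucketing logic.

```python
import numpy as np

def merge_and_reduce_stream(chunks, reduce_fn):
    """Maintain a binary coreset tree: bucket i holds a coreset of 2^i chunks."""
    buckets = []                         # buckets[i] is None or a (points, weights) pair
    for chunk in chunks:
        item = (np.asarray(chunk, dtype=float), np.ones(len(chunk)))  # unweighted leaf
        level = 0
        while level < len(buckets) and buckets[level] is not None:
            other = buckets[level]
            buckets[level] = None
            merged_pts = np.vstack([item[0], other[0]])               # merge
            merged_wts = np.concatenate([item[1], other[1]])
            item = reduce_fn(merged_pts, merged_wts)                  # re-compress
            level += 1
        if level == len(buckets):
            buckets.append(None)
        buckets[level] = item
    pts = np.vstack([b[0] for b in buckets if b is not None])         # union over levels
    wts = np.concatenate([b[1] for b in buckets if b is not None])
    return pts, wts
```

Each bucket level holds the coreset of a power-of-two number of chunks, which is why only O(log n) coresets are ever kept in memory for the stream seen so far.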
Coresets can also be used to support the dynamic model, where points are deleted or inserted. Here the storage is linear in n, since we need to save the tree in memory (practically, on the hard drive); however, the update time is only logarithmic in n, since we need to reconstruct only the O(log n) coresets along the tree that correspond to the deleted/inserted point. The first such coreset of size independent of d was introduced by [6]. See [1,2] for details.
  Constrained k-Means and Determining k
Since the coreset approximates every set of k centers, it can also be used to solve the k-means problem under different constraints (e.g., allowed areas for placing the centers) or given a small set of candidate centers. In addition, the set of centers may contain duplicate points, which means that the coreset can also approximate the sum of squared distances for fewer than k distinct centers. Hence, coresets can be used to choose the right number k of centers, up to a (1 + ε) factor, by minimizing the sum of squared distances plus a penalty term that is some function of k. Full open-source code is available [8].
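As an illustration of this model-selection use of a single coreset, the following hedged sketch scans candidate values of k and minimizes the weighted clustering cost on the coreset plus a user-supplied penalty. The penalty function and the use of scikit-learn's KMeans are our assumptions, not part of the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k(core_pts, core_wts, k_candidates, penalty):
    """Pick k minimizing (weighted k-means cost on the coreset) + penalty(k)."""
    best_k, best_score = None, np.inf
    for k in k_candidates:
        km = KMeans(n_clusters=k, n_init=10).fit(core_pts, sample_weight=core_wts)
        score = km.inertia_ + penalty(k)   # inertia_ is the (weighted) sum of squared distances
        if score < best_score:
            best_k, best_score = k, score
    return best_k

# e.g., choose_k(S, w, range(2, 20), penalty=lambda k: 1e3 * k)
```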
  2. Related Work
We summarize existing coreset constructions for k-means queries, as formally defined in Section 4.
  Importance Sampling
Following a decade of research, coresets of size polynomial in d were suggested by [9]. Ref. [10] suggested an improved construction, which is a special case of the algorithms proposed by [11]. The construction is based on computing an approximation to the k-means of the input points (with no constraints on the centers) and then sampling points proportionally to their distance to these centers. Each chosen point is then assigned a weight that is inversely proportional to this distance. The probability of failure in these algorithms decreases exponentially with the input size. Coresets of size linear in k were suggested in the work of [12]; however, the weight of a point may be negative or a function of the given query. For the special case k = 1, Inaba et al. [13] provided constructions of coresets using uniform sampling.
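The following toy sketch illustrates the distance-based importance sampling described above: sample points with probability roughly proportional to their squared distance from an approximate k-means solution, and weight each sample by the inverse of its sampling probability. The mixture with a uniform term and the constants are simplifications of ours; the cited constructions use more careful sensitivity bounds.

```python
import numpy as np
from sklearn.cluster import KMeans

def importance_sampling_coreset(P, k, size, rng=np.random.default_rng(0)):
    """Toy distance-based importance sampling (cf. [10,11]), not the exact construction."""
    centers = KMeans(n_clusters=k, n_init=5).fit(P).cluster_centers_
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    prob = 0.5 * d2 / (d2.sum() + 1e-12) + 0.5 / len(P)   # assumed mixture of distance and uniform
    prob = prob / prob.sum()
    idx = rng.choice(len(P), size=size, replace=True, p=prob)
    weights = 1.0 / (size * prob[idx])                    # inverse-probability reweighting
    return P[idx], weights
```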
Projection-based coresets. Data summarizations that are similar to coresets, but are based on projections onto low-dimensional subspaces (which diminish the sparsity of the input data), were suggested by [14] by improving the analysis of [4]. Recently, [15] improved on both [4,14] by applying the Johnson-Lindenstrauss Lemma [16] to the construction of [4]. However, due to the projections, the resulting summarizations of all the works mentioned above are not subsets of the input points, unlike the coreset definition of this paper. In particular, for sparse datasets such as the adjacency matrix of a graph, the document-term matrix of Wikipedia, or an image-object matrix, the sparsity of the data is lost, and a single point in the summarization might be larger than the complete sparse input data.
Another type, the weak coreset, approximates only the optimal solution, and not every set of k centers. Such randomized coresets of size independent of d and only polynomial in k were suggested by [11] and simplified by [12].
Deterministic Constructions. The first coresets for k-means [2,17] were based on partitioning the data into cells and taking a representative point from each cell into the coreset, as is done in hashing or the Hough transform [18]. However, these coresets are of size at least exponential in d, while still providing a result that is a subset of the input, in contrast to the previous work [17]. Our technique is most related to the deterministic construction suggested in [4], which recursively computes a k-means clustering of the input points. While the output set there has size independent of d, it is not a coreset as defined in this paper, since it is not a subset of the input and thus cannot handle sparse data, as explained above. Techniques such as uniform sampling for each cluster yield coresets whose probability of failure is linear in the input size, or whose size depends on d.
m-means is a coreset for k-means? A natural approach for coreset construction, which is strongly related to this paper, is to compute the m-means of the input set P, for a sufficiently large m, where the weight of each center is the number of points in its cluster. If the sum of squared distances to these m centers is sufficiently small compared to the k-means cost, we obtain an ε-coreset by (a weaker version of) the triangle inequality. Unfortunately, it was recently proved in [19] that there exist sets for which a number of centers that is exponential in d is needed in order to obtain such a small sum of squared distances.
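For reference, the naive construction discussed in this paragraph can be written in a few lines: take the m-means centers, weighted by their cluster sizes, as the candidate coreset. The scikit-learn call below is only a stand-in for any m-means solver.

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_m_means(P, m):
    """m-means centers weighted by their cluster sizes (the naive candidate coreset)."""
    km = KMeans(n_clusters=m, n_init=5).fit(P)
    centers = km.cluster_centers_
    weights = np.bincount(km.labels_, minlength=m).astype(float)
    return centers, weights
```

The negative result of [19] says that, for some inputs, no moderate m makes this candidate a good coreset, which is exactly what motivates the careful choice of m described in Section 3.1.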
   3. Our Contribution
We suggest the following deterministic algorithms:
An algorithm that computes a (1 + ε)-approximation for the k-means of a set P that is distributed (partitioned) among M machines, where each machine needs to send only a small coreset of its input points to the main server at the end of its computation.
A streaming algorithm that, after one pass over the data and using small memory, returns a (1 + ε)-approximation to the k-means of P. The algorithm can run “embarrassingly in parallel” [20] on data that is distributed among M machines, and supports insertion/deletion of points as explained in the previous section.
A description of how to use our algorithm to boost both the running time and the quality of any existing k-means heuristic, using only the heuristic itself, even in the classic off-line setting.
Extensive experimental results on real-world datasets. This includes the first k-means clustering with provable guarantees for the English Wikipedia, computed via 16 EC2 instances on Amazon’s cloud.
Open source code for fully reproducing our results and for the benefit of the community. To our knowledge, this is the first coreset code that can run on the cloud without additional commercial packages.
  3.1. Novel Approach: m-Means Is A Coreset for k-Means, for Smart Selection of m
One of our main technical results is that for every constant ε > 0 there exists an integer m such that the m-means of the input set (or its approximation) is an ε-coreset; see Theorem 2. However, simply computing the m-means of the input set for a sufficiently large m might yield an m that is exponential in d, as explained in [19] and the related work. Instead, Algorithm 1 carefully selects the right m, between k and a bound that is independent of both n and d, by checking the appropriate condition in each iteration.
  3.2. Solving k-Means Using k-Means
It might be confusing that we suggest solving the k-means problem by computing an m-means clustering for some m ≥ k. In fact, most coreset constructions actually solve the optimal problem in one of their first construction steps. The main observation is that we never run the coreset construction on the complete input of n (or an unbounded stream of) points, but only on small subsets. This is because our coresets are composable: they can be merged and then reduced back to m points using the merge-and-reduce tree technique. The composability follows from the fact that the coreset approximates the sum of squared distances to every set of k centers, and not just the k-means of the subset at hand.
  3.3. Running Time
Running time that is exponential in k is unavoidable for any (1 + ε)-approximation algorithm that solves k-means, even in the planar case [21] and in [4]. Our main contribution is a coreset construction that uses memory independent of d and running time that is near-linear in n. To our knowledge, this was an open problem even in very restricted cases. Nevertheless, for large values of k we may use existing constant-factor approximations that take time polynomial in k, and compute our coreset in time that is near-linear in n but also polynomial in k.
In practice, provable (1 + ε)-approximations for k-means are rarely used, due to the lower bounds on the running time mentioned above. Instead, heuristics are used. Indeed, in our experimental results we follow this approach instead of computing the optimal k-means, both during the coreset construction and on the resulting coreset itself.
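A hedged sketch of this practical recipe: run a standard Lloyd-style heuristic directly on the weighted coreset and return its centers. Scikit-learn's `sample_weight` argument is one possible way to pass the coreset weights; the function name is ours.

```python
from sklearn.cluster import KMeans

def kmeans_on_coreset(core_pts, core_wts, k):
    """Lloyd-style heuristic (k-means++ init) on the weighted coreset."""
    km = KMeans(n_clusters=k, init='k-means++', n_init=10)
    km.fit(core_pts, sample_weight=core_wts)
    return km.cluster_centers_
```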
  4. Notation and Main Result
The input to our problem is a set P of n points in ℝ^d, where each point p ∈ P has a multiplicative weight w(p) > 0. In addition, there is an additive weight c ≥ 0 for the set. Formally, a weighted set in ℝ^d is a tuple (P, w, c), where P ⊆ ℝ^d and w : P → (0, ∞). In particular, an unweighted set has a unit weight w(p) = 1 for each point and a zero additive weight.
  4.1. k-Means Clustering
For a given set Q of k centers (points) in ℝ^d, the Euclidean distance from a point p ∈ P to its closest center in Q is denoted by dist(p, Q). The sum of these weighted squared distances over the points of P is denoted by cost(P, Q):
        
If P is an unweighted set, this cost is just the sum of squared distances from each point in P to its closest center in Q.
For every center q in Q, let P_q denote the subset of points in P whose closest center in Q is q; ties are broken arbitrarily. This yields a partition of P by Q. More generally, the partition of a weighted set P by Q is defined analogously, with each part inheriting the weights of its points.
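The definitions above translate directly into code. The sketch below (with unit weights recovering the unweighted case) computes the weighted cost and the partition by Q; the names `cost` and `partition_by` are ours, and adding the additive weight to the cost is an assumption of this sketch.

```python
import numpy as np

def cost(P, w, Q, additive=0.0):
    """Sum over p of w(p) * dist^2(p, Q), plus the set's additive weight."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # |P| x |Q| squared distances
    return float((w * d2.min(axis=1)).sum() + additive)

def partition_by(P, Q):
    """Index of the closest center in Q for every point (ties broken by lowest index)."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```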
A set that minimizes this weighted sum over every set Q of k centers in ℝ^d is called the k-means of P. The 1-means of P is called the centroid, or the center of mass, since
        
We denote the cost of the k-means of P by .
  4.2. Coreset
Computing the k-means of a weighted set P is the main motivation of this paper. Formally, let ε ∈ (0, 1) be an error parameter. The weighted set S is an ε-coreset for P if, for every set Q of k centers, we have
        (1 − ε) · cost(P, Q) ≤ cost(S, Q) ≤ (1 + ε) · cost(P, Q).
To handle streaming data we will need to compute “coresets for unions of coresets”, which is the reason that we assume that both the input P and its coreset S are weighted sets.
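The coreset property can be sanity-checked empirically by comparing the two costs on random query sets, as in the sketch below. Note that it only tests sampled queries, whereas the definition quantifies over every set Q of k centers; the function names are ours.

```python
import numpy as np

def _cost(P, w, Q):
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return float((w * d2.min(axis=1)).sum())

def empirical_epsilon(P, wP, S, wS, k, trials=100, rng=np.random.default_rng(1)):
    """Max relative deviation |cost(S,Q)/cost(P,Q) - 1| over random query sets Q."""
    lo, hi = P.min(axis=0), P.max(axis=0)
    worst = 0.0
    for _ in range(trials):
        Q = rng.uniform(lo, hi, size=(k, P.shape[1]))      # random k centers in the bounding box
        worst = max(worst, abs(_cost(S, wS, Q) / _cost(P, wP, Q) - 1.0))
    return worst
```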
  4.3. Sparse Coresets
Unlike previous work, we add the constraint that if each point in P is sparse, i.e., has few non-zero coordinates, then the set S will also be sparse. Formally, the maximum sparsity of P is the maximum number of non-zero entries over all points p in P.
In particular, if each point in S is a linear combination of only a few points in P, then S is sparse as well. In addition, we would like the set S to have size independent of both n and d.
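For sparse inputs stored in CSR format, the maximum sparsity defined above is a one-liner; the helper name is ours and assumes SciPy's CSR layout.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def max_sparsity(X: csr_matrix) -> int:
    """Largest number of non-zero entries over the rows (points) of X."""
    return int(np.diff(X.indptr).max())

# e.g., max_sparsity(sparse_random(1000, 50000, density=1e-4, format='csr'))
```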
We can now state the main result of this paper.
Theorem 1 (Small sparse coreset). For every weighted set P in ℝ^d, every error parameter ε ∈ (0, 1) and every integer k ≥ 1, there is an ε-coreset S of P whose size is independent of n and d, and where each point in S is a linear combination of a bounded number of points from P. In particular, the maximum sparsity of S is bounded by that number times the maximum sparsity of P.
By plugging this result into the traditional merge-and-reduce tree of Figure 1, it is straightforward to compute a coreset using one pass over a stream of points.
Corollary 1. An ε-coreset S of the size and maximum sparsity stated in Theorem 1 can be computed for the set P of the n points seen so far in an unbounded stream, using memory and insertion time per point that grow only logarithmically with n. If the stream is distributed uniformly among M machines, then the amortized insertion time per point is reduced by a (multiplicative) factor of M. The coreset for the union of the streams can then be computed by communicating the M coresets to a main server.
   5. Coreset Construction
Our main coreset construction algorithm gets a set P as input and returns an ε-coreset S; see Algorithm 1.
To obtain running time that is linear in the input size, we assume without loss of generality that P is a small set and that the cardinality of the output S is smaller by a constant factor. This is thanks to the traditional merge-and-reduce approach: given a stream of n points, we apply the coreset construction only on small subsets of P during the streaming and reduce them by half. See Figure 1 and, e.g., [4,7] for details.
  Algorithm Overview
In Line 1 we compute the smallest integer m such that the cost of the m-means of P is sufficiently close to the cost of a clustering of P with more centers. In Line 3 we compute the corresponding partition of P by its m-means. In Line 5 a sparse coreset is computed for every cluster of this partition. This can be done deterministically, e.g., by taking the mean of the cluster as explained in Lemma 1, or by using a gradient-descent algorithm, namely Frank-Wolfe, as explained in [22], which also preserves the sparsity of the coreset as desired by our main theorem. The output of the algorithm is the union of the means of all these coresets, each with an appropriate weight, namely the size of its cluster.
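The following sketch mirrors this overview: grow m, partition P by its m-means, and output one weighted representative per cluster (here simply the cluster mean, as in Lemma 1, rather than the sparse Frank-Wolfe variant). The stopping rule comparing the m-means and (m + k)-means costs is an illustrative stand-in for the exact condition checked in Line 1 of Algorithm 1, and all names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def mmeans_coreset(P, k, eps, m_max=None):
    """Sketch: pick m, partition P by its m-means, return weighted cluster means."""
    m_max = m_max or len(P)
    m = k
    while m + k <= m_max:
        cost_m  = KMeans(n_clusters=m,     n_init=3).fit(P).inertia_
        cost_mk = KMeans(n_clusters=m + k, n_init=3).fit(P).inertia_
        if cost_m - cost_mk <= (eps ** 2) * cost_m:   # assumed stopping test
            break
        m += k
    km = KMeans(n_clusters=m, n_init=3).fit(P)
    weights = np.bincount(km.labels_, minlength=m).astype(float)  # cluster sizes
    return km.cluster_centers_, weights
```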
The intuition behind the algorithm stems from the assumption that the m-means clustering will have a lower cost than the k-means clustering, which is supported by a series of previous works [23,24]. In fact, our experiments in Section 7 give evidence that in practice the construction works even better than anticipated by the theoretical bounds.
  6. Proof of Correctness
The first lemma states the common fact that the sum of squared distances from a set of points to a center equals the sum of their squared distances to their center of mass (the variance of the set), plus the total weight of the set times the squared distance from the center of mass to the center.
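For reference, the identity behind Lemma 1 can be written out for a weighted set with weights w and weighted mean μ (the notation is ours; this is the standard bias-variance decomposition):

```latex
\sum_{p \in P} w(p)\,\lVert p - q \rVert^2
  \;=\; \sum_{p \in P} w(p)\,\lVert p - \mu \rVert^2
  \;+\; \Big(\sum_{p \in P} w(p)\Big)\,\lVert \mu - q \rVert^2,
\qquad
\mu \;:=\; \frac{\sum_{p \in P} w(p)\, p}{\sum_{p \in P} w(p)} .
```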
Proof. The cross term in the expansion equals zero, since the weighted deviations from the center of mass sum to zero, and thus
        
The second lemma shows that assigning all the points of P to the single closest center in Q yields a small multiplicative error if the 1-mean and the k-means of P have roughly the same cost. In that case we can approximate the cost of P to Q using only one center of the query; see Line 1 of Algorithm 1.
Lemma 2. For every set Q of k centers, the two-sided bound (1) holds. Proof. Let q denote a center that minimizes the cost over Q. The left inequality of (1) is then straightforward, since
        
It is left to prove the right inequality of (
1). Indeed, for every 
, let 
 denote the closest point to 
p in 
Q. Ties are broken arbitrarily. Hence,
        
Let 
 denote the partition of 
P by 
, where 
 are the closest points to 
 for every 
; see 
Section 4. For every 
, let 
. Hence,
        
        where in (4) and (5) we substituted 
 and 
 respectively in Lemma 1, and in (6) we use the fact that 
 and 
 for every 
. Summing (6) over 
 yields
        
To bound (8), we substitute 
 and then 
 in Lemma 1, and obtain that for every 
        where the last inequality is by the definition of 
. This implies that for every 
,
        
Plugging the last inequality in (8) yields
        
        where (11) is by the Cauchy-Schwarz inequality, and in (12) we use the fact that 
 for every 
.
To bound the left term of (13) we use the fact 
 and substitute 
, 
 in Lemma 1 for every 
 as follows.
        
To bound the right term of (13) we use 
 to obtain
        
Plugging (
14) and the last inequality in (13) yields
        
Together with (
2) this proves Lemma 2. □
Lemma 3. Let S be an ε-coreset for a weighted set P in ℝ^d, and let Q be a finite set of centers. Then (15) holds. Proof. Let 
 be a center such that 
, and let 
 be a center such that 
. The right side of (
15) is bounded by
        
        where the first inequality is by the optimality of the chosen center, and the second inequality holds since S is a coreset for P. Similarly, the left-hand side of (15) is bounded by
        
        where the last inequality follows from the assumption 
. □
Lemma 4. Let S be the output of a call to Algorithm 1. Then S is an ε-coreset for P.
 Proof.  By replacing 
P with 
 in Lemma 1 for each 
 it follows that
        
Summing the last inequality over each 
 yields
        
Since 
 is the partition of the 
m-means of 
P we have 
. By letting 
 be the 
m-means of 
 we have
        
Hence,
        
        where the second inequality is by Line 1 of the algorithm. Plugging the last inequality in (
16) yields
        
Using Lemma 3, for every 
By summing over 
 we obtain
        
Plugging the last inequality in (
17) yields
        
Hence, S is an ε-coreset for P. □
Lemma 5. There is an integer m such that (19) holds. Proof. Assume, by contradiction, that (19) does not hold for any integer m in the considered range. Hence,
        
This is a contradiction. □
Using the mean of each cluster in Line 5 of the algorithm yields an ε-coreset, as shown in Lemma 1. The resulting coreset is not sparse, but it gives the following result.
Theorem 2. There is an integer m such that the m-means of P is an ε-coreset for P.
Proof of Theorem 1. In Line 5 of Algorithm 1 we compute, for each cluster, a sparse coreset for its 1-mean by using a variation of the Frank-Wolfe algorithm [22]. It follows that each output point is a sparse linear combination of input points; therefore, the overall sparsity of S is bounded as claimed. This, together with Lemma 4, concludes the proof. □
   7. Comparison to Existing Approaches
In this section we provide experimental results for our main coreset construction algorithm. We compare the resulting clusterings with those obtained from existing coresets on small, medium, and large datasets. Unlike most coreset papers, we also run the algorithm in the distributed setting via a cloud service, as explained below.
  7.1. Datasets
For our experimental results we use three well-known datasets, as well as the English Wikipedia, as follows.
MNIST handwritten digits [25]. The MNIST dataset consists of grayscale images of handwritten digits. Each image is of size 28 × 28 pixels and was converted to a row vector of 784 dimensions.
Pendigits [26]. This dataset was downloaded from the UCI repository. It consists of handwritten digits from 44 writers, each asked to write 250 digits in a random order inside boxes of 500 by 500 tablet pixel resolution. The tablet sends x and y coordinates and pressure-level values of the pen at fixed time intervals (a sampling rate of 100 milliseconds). Each digit is represented as a constant-length feature vector.
NIPS dataset [27]. The OCR scanning of the NIPS proceedings over 13 years; it contains 15,000 pages and 1958 articles. For each author there is a corresponding word-count vector, where the ith entry in the vector is the number of times that the ith word was used in one of the author’s submissions.
English Wikipedia [28]. Unlike the previous datasets, which were loaded into memory and then compressed via streaming coresets, the English Wikipedia practically cannot be loaded completely into memory. The size of the dataset is 15 GB after conversion to a term-document matrix via gensim [29]. It has 4M document vectors, each of very high dimension (the vocabulary size) with an average of 200 non-zero entries, i.e., words per document.
   7.2. The Experiment
We applied our coreset construction to boost the performance of Lloyd’s k-means heuristic, as explained in Section 8 of the previous work [6]. We compared the results with the current data summarization algorithms that can handle sparse data: uniform and (non-uniform) importance sampling.
  7.3. On the Small/Medium Datasets
We evaluate both the offline computation and the streaming data model. For the offline computation we used the datasets above to produce coresets of a range of sizes, and then computed k-means for several values of k until convergence. For the streaming data model, we divided each dataset into small subsets and computed coresets via the merge-and-reduce technique to construct a coreset tree, as in Figure 1. Here the coresets are smaller, and the values of k for k-means are the same.
We computed the sum of squared distances to the original (full) set of points from each resulting set of k centers that was computed on a summarization, for uniform sampling, non-uniform sampling, and our coreset, respectively. The “ground truth” or “optimal solution” was computed by running k-means on the entire dataset until convergence. The empirical estimated error of each summarization was then defined via the ratio between its cost and the cost of the ground-truth centers. Note that, since Lloyd’s k-means is a heuristic, its performance on the reduced data might even be better than on the full data, i.e., the estimated error may be negative.
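One natural reading of this error measure, written out as a hedged sketch (the function name is ours): evaluate the sum of squared distances of the full data to the centers obtained from a summarization, divide by the cost of the ground-truth centers, and subtract one.

```python
import numpy as np

def empirical_error(P, centers_from_summary, centers_opt):
    """Relative deviation of the summarization's cost from the ground-truth cost on P."""
    def sse(C):
        d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return float(d2.min(axis=1).sum())
    return sse(centers_from_summary) / sse(centers_opt) - 1.0
```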
These experiments were run on a single common laptop machine with 2.2GHz quad-core Intel Core i7 CPU and 16GB of 1600MHz DDR3L RAM with 256GB PCIe-based onboard SSD.
  7.4. On the Wikipedia Dataset
We compared the three discussed data summarization techniques, where each was computed in parallel and in a distributed fashion on 16 EC2 virtual servers. We repeated the computation for several values of k up to 128, and for coreset sizes in the range of 256 to 2048 points.
This experiment was executed via Amazon Web Services (“the cloud”), using 16 EC2 virtual computation nodes of type c4.4xlarge with 8 vCPUs and 15 GiB of RAM. We repeated the distributed computation for coresets of 256, 512, 1024 and 2048 points, for k-means with the values of k above.
  7.5. Results
The results of the offline experiment on the small datasets are depicted in Figure 2, where it is evident that the error of the k-means computation fed by our coreset algorithm is lower than the error obtained with uniform and non-uniform sampling.
For the streaming computation model, our algorithm provides results that are better than the other two, as can be seen in Figure 3. In addition, the existing algorithms suffer from a “cold start”, as is common for random sampling techniques: they converge slowly to a small error, whereas our deterministic algorithm introduces a small error already after a small sample size.
Figure 4 presents the results of the experiment on the Wikipedia dataset for different values of k. As can easily be observed, the proposed coreset algorithm handles this big sparse dataset well and provides a lower clustering cost (energy) compared to the uniform and non-uniform approaches.
Figure 5, Figure 6 and Figure 7 show box plots of the error distribution for all three summarizations in the offline and streaming settings. Our algorithm shows little variance across all experiments, and its mean error is very close to its median error, indicating that it produces stable results.
Figure 8 shows the memory (RAM) footprint during the coreset construction, based on synthetically generated random data. The oscillations correspond to the number of coresets in the tree that each new subset needs to update. For example, the first point in a streaming tree is handled in constant time, whereas a later point may climb up through logarithmically many levels of the tree, in which case that many coresets are merged.
   8. Conclusions
We proved that any set of points in ℝ^d has an ε-coreset that consists of a weighted subset of the input points, whose size is independent of n and d and polynomial in the remaining problem parameters. Our algorithm carefully selects m such that the m-means of the input, with appropriate weights (the clusters’ sizes), yields such a coreset.
This allows us, finally, to compute coresets for sparse high-dimensional data, in both the streaming and the distributed settings. As a practical example, we computed the first coreset for the full English Wikipedia. We hope that our open source code will allow researchers in industry and academia to run these coresets on more databases, such as images, speech, or tweets.
The reduction to k-means allows us to use popular k-means heuristics (such as Lloyd-Max) and provable constant-factor approximations (such as k-means++) in practice. Our experimental results, on both a single machine and on the cloud, show that our coreset construction significantly improves over existing techniques, especially for small coresets, due to its deterministic approach.
We hope that this paper will also help the community to answer the following three open problems:
- (i)
 Can we simply compute the m-means for a specific value of m and obtain an ε-coreset without using our algorithm?
- (ii)
 Can we compute such a coreset (a subset of the input) of smaller size?
- (iii)
 Can we compute such a smaller coreset deterministically?