Deterministic Coresets for k-Means of Big Sparse Data

Abstract: Let P be a set of n points in R^d, k ≥ 1 be an integer and ε ∈ (0, 1) be a constant. An ε-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates any given set Q ⊆ R^d of k centers. That is, the sum of squared distances over every point in P to its closest point in Q is the same, up to a factor of 1 ± ε, as the weighted sum over C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to be in P, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic and distributed data using the classic merge-and-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension d. We suggest the first such coreset of size independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon's cloud.


Background
Given a set of n points in R^d and an error parameter ε > 0, a coreset in this paper is a small set of weighted points in R^d, such that the sum of squared distances from the original set of points to any set of k centers in R^d can be approximated by the sum of weighted squared distances from the points in the coreset. By the definition of the coreset, the output of running an existing clustering algorithm on the coreset then yields an approximation of the output of running the same algorithm on the original data.
Coresets were first suggested by [1] as a way to improve the theoretical running time of existing algorithms. Moreover, a coreset is a natural tool for handling Big Data using all the computation models that are mentioned in the previous section. This is mainly due to the merge-and-reduce tree approach that was suggested by [2,3] and formalized by [4]: coresets can be computed independently for subsets of input points, e.g., on different computers, and then be merged and re-compressed again. Such a binary compression tree can also be computed using one pass over a possibly unbounded streaming set of points, where at every moment only O(log n) coresets exist in memory for the n points streamed so far. Here the coreset is computed only on small chunks of points, so a possibly inefficient coreset construction still yields efficient coreset constructions for large sets; see Figure 1. Note that the coreset guarantees are preserved while using this technique, and no assumptions are made on the order of the streaming input points. These coresets can also be computed independently and in parallel via M machines (e.g., on the cloud), which reduces the running time by a factor of M.

Figure 1. Coreset construction from data streams [2]. The black arrows indicate "merge-and-reduce" operations. The intermediate coresets C_1, ..., C_7 are numbered in the order in which they would be generated in the streaming case. In the parallel case, C_1, C_2, C_4 and C_5 would be constructed in parallel, followed by C_3 and C_6, finally resulting in C_7. The figure is from [7].
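The merge-and-reduce scheme described above can be sketched as follows. The bucket logic is the general technique; `halve_uniform` is only a toy weight-preserving reducer standing in for a real coreset construction.

```python
import numpy as np

def halve_uniform(points, weights):
    """Toy reducer (NOT the paper's coreset): keep every other point,
    doubling its weight so the total weight is preserved."""
    idx = np.arange(0, len(points), 2)
    return points[idx], weights[idx] * 2.0

def merge_reduce_stream(chunks, reduce_fn):
    """One pass over a stream of chunks, keeping O(log n) buckets.
    The bucket at level i represents 2^i chunks, as in Figure 1."""
    buckets = {}  # level -> (points, weights)
    for chunk in chunks:
        current = (np.asarray(chunk, float), np.ones(len(chunk)))
        level = 0
        while level in buckets:           # merge two same-level coresets...
            old_pts, old_w = buckets.pop(level)
            merged_pts = np.vstack([old_pts, current[0]])
            merged_w = np.concatenate([old_w, current[1]])
            current = reduce_fn(merged_pts, merged_w)  # ...and reduce by half
            level += 1
        buckets[level] = current
    pts = np.vstack([b[0] for b in buckets.values()])
    w = np.concatenate([b[1] for b in buckets.values()])
    return pts, w
```

With n streamed points, at most O(log n) buckets are alive at any time, and the (possibly slow) reducer only ever sees small merged sets.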

Constrained k-Means and Determining k
Since the coreset approximates every set of k centers, it can also be used to solve the k-means problem under different constraints (e.g., allowed areas for placing the centers) or given a small set of candidate centers. In addition, the set of centers may contain duplicate points, which means that the coreset can approximate the sum of squared distances for k′ < k centers. Hence, coresets can be used to choose the right number k′ of centers up to k, by minimizing the sum of squared distances plus f(k′) for some function f of the number of centers. Full open source is available [8].
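As a concrete illustration, the rule "minimize cost plus a penalty f(k′)" can be sketched as below. The penalty function and the deterministic farthest-first solver are illustrative stand-ins, not the paper's method.

```python
import numpy as np

def kmeans_cost(P, w, centers):
    """Weighted sum of squared distances to the nearest center."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return float((w * d2.min(axis=1)).sum())

def farthest_first(P, k):
    """Deterministic stand-in solver: greedy farthest-point centers."""
    centers = [P[0]]
    while len(centers) < k:
        d2 = ((P[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(1)
        centers.append(P[int(np.argmax(d2))])
    return np.array(centers)

def choose_k(P, w, kmax, penalty):
    """Pick the k in 1..kmax minimizing clustering cost + penalty(k),
    evaluated on the (weighted) coreset instead of the full data."""
    costs = [kmeans_cost(P, w, farthest_first(P, k)) + penalty(k)
             for k in range(1, kmax + 1)]
    return 1 + int(np.argmin(costs))
```

On two well-separated 1-D clusters with penalty f(k) = k/2, the rule recovers k = 2.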

Related Work
We summarize existing coreset constructions for k-means queries, as will be formally defined in Section 4.

Importance Sampling
Following a decade of research, coresets of size polynomial in d were suggested by [9]. Ref. [10] suggested an improved version of size O(dk^2/ε^2), which is a special case of the algorithms proposed by [11]. The construction is based on computing an approximation to the k-means of the input points (with no constraints on the centers) and then sampling points proportionally to their distance to these centers. Each chosen point is then assigned a weight that is inversely proportional to this distance. The probability of failure in these algorithms reduces exponentially with the input size. Coresets of size O(dk/ε^2), i.e., linear in k, were suggested in the work of [12]; however, the weight of a point may be negative or a function of the given query. For the special case k = 1, Inaba et al. [13] provided constructions of coresets of size O(1/ε^2) using uniform sampling.
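The distance-proportional sampling described above can be sketched in a simplified form. Real constructions use full sensitivity bounds (which add a uniform term to the sampling probabilities); this toy version keeps only the distance term and the inverse-probability weights.

```python
import numpy as np

def importance_sample_coreset(P, centers, m, rng):
    """Toy distance-proportional sampling with inverse-probability weights."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    prob = d2 / d2.sum()                 # far points are sampled more often
    idx = rng.choice(len(P), size=m, replace=True, p=prob)
    weights = 1.0 / (m * prob[idx])      # keeps the cost estimator unbiased
    return P[idx], weights
```

Note that each sampled term w_i · d2_i equals d2.sum()/m, so the weighted sample reproduces the cost at `centers` exactly; at other query centers it is only an unbiased estimate, which is why these constructions carry a probability of failure.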
Projection-based coresets. Data summarizations that are similar to coresets of size O(k/ε), based on projections onto low-dimensional subspaces that diminish the sparsity of the input data, were suggested by [14] by improving the analysis of [4]. Recently, [15] improved both [4,14] by applying the Johnson-Lindenstrauss Lemma [16] to the construction from [4]. However, due to the projections, the resulting summarizations of all the works mentioned above are not subsets of the input points, unlike the coreset definition of this paper. In particular, for sparse datasets such as the adjacency matrix of a graph, the document-term matrix of Wikipedia, or an images-objects matrix, the sparsity of the data diminishes and a single point in the summarization might be larger than the complete sparse input data.
Another type, called weak coresets, approximates only the optimal solution, and not every set of k centers. Such randomized coresets of size independent of d and only polynomial in k were suggested by [11] and simplified by [12].
Deterministic Constructions. The first coresets for k-means [2,17] were based on partitioning the data into cells and taking a representative point from each cell into the coreset, as is done in hashing or the Hough transform [18]. However, these coresets are of size at least k/ε^{O(d)}, i.e., exponential in d, while still providing a result that is a subset of the input, in contrast to previous work [17]. Our technique is most related to the deterministic construction that was suggested in [4], which recursively computes k-means clusterings of the input points. While the output set has size independent of d, it is not a coreset as defined in this paper, since it is not a subset of the input and thus cannot handle sparse data, as explained above. Techniques such as uniform sampling for each cluster yield coresets with a probability of failure that is linear in the input size, or whose size depends on d.
m-means is a coreset for k-means? A natural approach for coreset construction, strongly related to this paper, is to compute the m-means of the input set P for a sufficiently large m, where the weight of each center is the number of points in its cluster. If the sum of squared distances to these m centers is within roughly an ε factor of the k-means cost, we obtain a (k, ε)-coreset by (a weaker version of) the triangle inequality. Unfortunately, it was recently proved in [19] that there exist sets for which m ∈ k^{Ω(d)} centers are needed in order to obtain this small sum of squared distances.
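The idea can be illustrated empirically: run m-means for m much larger than k, weight each center by its cluster size, and compare the weighted cost to the true cost at an arbitrary k-center query. The Lloyd's routine below is a generic stand-in for an m-means approximation, and the data and query are synthetic.

```python
import numpy as np

def lloyd(P, m, iters=50, seed=0):
    """Plain Lloyd's heuristic, a stand-in for an m-means approximation."""
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), m, replace=False)]
    for _ in range(iters):
        a = ((P[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        C = np.array([P[a == j].mean(0) if (a == j).any() else C[j]
                      for j in range(m)])
    return C, a

def cost(P, w, Q):
    return float((w * ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1).min(1)).sum())

# three tight clusters; weight each of the m = 30 centers by its cluster size
rng = np.random.default_rng(1)
P = np.vstack([rng.normal(c, 0.1, (50, 2)) for c in (0.0, 5.0, 10.0)])
C, a = lloyd(P, m=30)
w = np.bincount(a, minlength=30).astype(float)
Q = np.array([[20.0, 0.0], [0.0, 20.0], [20.0, 20.0]])   # an arbitrary k = 3 query
err = cost(C, w, Q) / cost(P, np.ones(len(P)), Q) - 1.0  # relative error at Q
```

For this well-clustered input the relative error at the query is tiny; the cited lower bound [19] says that in high dimensions some inputs force m to be exponential in d before this happens.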

Our contribution
We suggest the following deterministic algorithms:
1. An algorithm that computes a (1 + ε) approximation for the k-means of a set P that is distributed (partitioned) among M machines, where each machine needs to send only k^{O(1)} input points to the main server at the end of its computation.
2. A streaming algorithm that, after one pass over the data and using k^{O(1)} log n memory, returns an O(log n)-approximation to the k-means of P. The algorithm can run "embarrassingly in parallel" [20] on data that is distributed among M machines, and supports insertion/deletion of points as explained in the previous section.
3. A description of how to use our algorithm to boost both the running time and quality of any existing k-means heuristic using only the heuristic itself, even in the classic off-line setting.
4. Extensive experimental results on real-world datasets. This includes the first k-means clustering with provable guarantees for the English Wikipedia, via 16 EC2 instances on Amazon's cloud.
5. Open code for fully reproducing our results and for the benefit of the community. To our knowledge, this is the first coreset code that can run on the cloud without additional commercial packages.

Novel Approach: m-Means is a Coreset for k-Means, for Smart Selection of m
One of our main technical results is that for every constant ε > 0 there exists an integer m ∈ k^{O(1)} such that the m-means of the input set (or its approximation) is a (k, ε)-coreset; see Theorem 2. However, simply computing the m-means of the input set for a sufficiently large m might yield an m that is exponential in d, as explained by [19] and in the related work. Instead, Algorithm 1 carefully selects the right m between k and k^{O(1)} by checking appropriate conditions in each iteration.

Solving k-Means Using k-Means
It might be confusing that we suggest solving the k-means problem by computing m-means for m > k. In fact, most coreset constructions actually solve the optimization problem in one of their first construction steps. The main observation is that we never run the coreset construction on the complete input of n (or an unbounded stream of) points, but only on small subsets of size ∼2m. This is because our coresets are composable and can be merged (to 2m points) and reduced back to m using the merge-and-reduce tree technique. The composability follows from the fact that the coreset approximates the sum of squared distances to every set of k centers, and not just the k-means of the subset at hand.

Running Time
Running time that is exponential in k is unavoidable for any (1 + ε)-approximation algorithm that solves k-means, even in the planar case (d = 2) [21] and for ε < 0.1 [4]. Our main contribution is a coreset construction that uses memory independent of d and running time that is near-linear in n. To our knowledge, this is an open problem even for the case k = 2. Nevertheless, for large values of ε (e.g., ε > 10) we may use existing constant-factor approximations that take time polynomial in k to compute our coreset in time that is near-linear in n but also polynomial in k.
In practice, provable (1 + ε)-approximations for k-means are rarely used due to the lower bounds on the running times above; instead, heuristics are used. Indeed, in our experimental results we use heuristics instead of computing the optimal k-means, both during the coreset construction and on the resulting coreset itself.

Notation and Main Result
The input to our problem is a set P′ of n points in R^d, where each point p ∈ P′ has a multiplicative weight u(p) > 0. In addition, there is an additive weight ρ ≥ 0 for the set. Formally, a weighted set in R^d is a tuple P = (P′, u, ρ), where P′ ⊆ R^d and u : P′ → [0, ∞). In particular, an unweighted set has a unit weight u(p) = 1 for each point, and a zero additive weight.

k-Means Clustering
For a given set Q = {q_1, ..., q_k} of k ≥ 1 centers (points) in R^d, the Euclidean distance from a point p ∈ R^d to its closest center in Q is denoted by dist(p, Q) = min_{q ∈ Q} ‖p − q‖_2. The sum of these weighted squared distances over the points of a weighted set P = (P′, u, ρ) is denoted by cost(P, Q) = ρ + ∑_{p ∈ P′} u(p) · dist^2(p, Q). If P is an unweighted set, this cost is just the sum of squared distances over each point in P to its closest center in Q.
Let P_i denote the subset of points in P whose closest center in Q is q_i, for every i ∈ [k] = {1, ..., k}, where ties are broken arbitrarily. This yields a partition P_1, ..., P_k of P by Q. More generally, the partition of the weighted set P by Q is the set {P_1, ..., P_k} where P_i = (P′_i, u_i, ρ/k), and u_i(p) = u(p) for every p ∈ P′_i and every i ∈ [k].
A set Q_k that minimizes the weighted sum cost(P, Q) over every set Q of k centers in R^d is called the k-means of P. The 1-means µ(P) of P is called the centroid, or the center of mass, since µ(P) = (1/∑_{p′ ∈ P′} u(p′)) · ∑_{p ∈ P′} u(p) · p. We denote the cost of the k-means of P by opt(P, k) := cost(P, Q_k).
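The notation above translates directly into code; the sketch below computes the weighted cost of a query and the centroid, and relies on the standard fact that the centroid minimizes the 1-center cost.

```python
import numpy as np

def cost(P, u, rho, Q):
    """cost(P, Q) = rho + sum_p u(p) * dist^2(p, Q) for a weighted set (P, u, rho)."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1).min(axis=1)
    return rho + float((u * d2).sum())

def centroid(P, u):
    """mu(P): the weighted center of mass, which is the 1-means of P."""
    return (u[:, None] * P).sum(axis=0) / u.sum()
```

No set of candidate 1-centers can beat the centroid, which the test below checks against random perturbations.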

Coreset
Computing the k-means of a weighted set P is the main motivation of this paper. Formally, let ε > 0 be an error parameter. The weighted set S is a (k, ε)-coreset for P if it approximates cost(P, Q) up to a factor of 1 ± ε for every set Q of k centers in R^d. That is, (1 − ε) · cost(P, Q) ≤ cost(S, Q) ≤ (1 + ε) · cost(P, Q). To handle streaming data we will need to compute "coresets for unions of coresets", which is the reason that we assume that both the input P and its coreset S are weighted sets.

Sparse Coresets
Unlike previous work, we add the constraint that if each point in P is sparse, i.e., has few non-zero coordinates, then the set S will also be sparse. Formally, the maximum sparsity s(P) := max_{p ∈ P′} ‖p‖_0 of P is the maximum number of non-zero entries of a point in P. In particular, if each point in S is a linear combination of at most α points in P, then s(S) ≤ α · s(P). In addition, we would like the set S to be of size independent of both n = |P| and d.
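The sparsity bound is mechanical to check in code. Dense numpy arrays are used here only for illustration; real sparse data would live in a sparse matrix format.

```python
import numpy as np

def max_sparsity(P):
    """s(P): maximum number of non-zero entries over the points (rows) of P."""
    return int((P != 0).sum(axis=1).max())

# a linear combination of alpha input points has at most alpha * s(P) non-zeros
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
combo = 0.5 * P[0] + 0.25 * P[2]   # alpha = 2 points combined
```

Here s(P) = 1 and the combination has 2 ≤ 2 · s(P) non-zeros, matching the bound s(S) ≤ α · s(P).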
We can now state the main result of this paper.

Theorem 1. For every weighted set P in R^d, an integer k ≥ 1 and ε ∈ (0, 1/2), a (k, ε)-coreset S whose size is independent of n and d can be computed, where each point in S is a linear combination of O(1/ε^2) points from P. In particular, the maximum sparsity of S is s(P)/ε^2.
By plugging this result into the traditional merge-and-reduce tree in Figure 1, it is straightforward to compute a coreset using one pass over a stream of points.
A (k, ε)-coreset S of the above size and maximum sparsity s(P)/ε^2 can be computed for the set P of the n points seen so far in an unbounded stream, using |S| · s(P)/ε^2 memory words. The insertion time per point in the stream is log(n) · 2^{(k/ε)^{O(1)}}. If the stream is distributed uniformly among M machines, then the amortized insertion time per point is reduced by a (multiplicative) factor of M to (1/M) log(n) · 2^{(k/ε)^{O(1)}}. The coreset for the union of the streams can then be computed by communicating the M coresets to a main server.
To obtain a running time that is linear in the input size, we assume without loss of generality that P has |P| = k^{O(1/ε^2)} points, and that the cardinality of the output S is |S| ≤ |P|/2. This is thanks to the traditional merge-and-reduce approach: given a stream of n points, we apply the coreset construction only on subsets of size 2 · |S| from P during the streaming, and reduce them by half. See Figure 1 and, e.g., [4,7] for details.

Algorithm Overview
In Line 1 we compute the smallest integer m = k^t such that the cost opt(P, m) of the m-means of P is close to the cost opt(P, mk) of the (mk)-means of P. In Line 3 we compute the corresponding partition {P_1, ..., P_m} of P by its m-means Q_m = {q_1, ..., q_m}. In Line 5 a (1, ε)-sparse coreset S_i of size O(1/ε^2) is computed for every P_i, i ∈ [m]. This can be done deterministically, e.g., by taking the mean of P_i as explained in Lemma 1, or by using a gradient descent algorithm, namely Frank-Wolfe, as explained in [22], which also preserves the sparsity of the coreset as required by our main theorem. The output of the algorithm is the union of the means of all these coresets, each with an appropriate weight, namely the (weighted) size of its cluster.
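The overview can be sketched as follows. This is only a simplified reading of Algorithm 1: Lloyd's heuristic stands in for the m-means approximation, the stopping rule is a simplified version of the Line 1 condition, and the dense Lemma-1 variant of Line 5 (cluster means weighted by cluster sizes) is used instead of the sparsity-preserving Frank-Wolfe variant.

```python
import numpy as np

def lloyd(P, m, iters=40, seed=0):
    """Stand-in for the m-means approximation used in the overview."""
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), min(m, len(P)), replace=False)]
    for _ in range(iters):
        a = ((P[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        C = np.array([P[a == j].mean(0) if (a == j).any() else C[j]
                      for j in range(len(C))])
    cost = float(((P - C[a]) ** 2).sum())
    return C, a, cost

def kmeans_coreset_sketch(P, k, eps):
    """Line 1: smallest m = k^t whose cost is close to the (mk)-means cost.
    Line 5 (dense variant): each cluster becomes its mean, weighted by size."""
    t = 1
    while True:
        m = k ** t
        C, a, cost_m = lloyd(P, m)
        _, _, cost_mk = lloyd(P, m * k)
        if m * k >= len(P) or cost_mk >= (1 - eps) * cost_m:
            break
        t += 1
    S = C
    w = np.bincount(a, minlength=len(C)).astype(float)
    return S, w
```

The total weight of the output equals the number of input points, and the weighted cost at a far query stays close to the true cost.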
The intuition behind the algorithm stems from the fact that m-means clustering has a lower cost than k-means for m > k, which is supported by a series of previous works [23,24]. In fact, our experiments in Section 7 show that in practice the algorithm works even better than anticipated by the theoretical bounds.

Proof of Correctness
The first lemma states the common fact that the sum of squared distances from a set of points to a center equals the sum of their squared distances to their center of mass (the variance of the set), plus the squared distance from the center of mass to the center, multiplied by the total weight of the set.

Proof. We have
cost(P, {x}) = ∑_{p ∈ P} u(p)‖p − x‖^2 = ∑_{p ∈ P} u(p)‖p − µ(P)‖^2 + ∑_{p ∈ P} u(p)‖µ(P) − x‖^2 + 2(µ(P) − x) · ∑_{p ∈ P} u(p)(p − µ(P)).
The last term equals zero since µ(P) = (1/∑_{p′ ∈ P} u(p′)) · ∑_{p ∈ P} u(p) · p, and thus ∑_{p ∈ P} u(p)(p − µ(P)) = 0. Hence,
cost(P, {x}) = cost(P, {µ(P)}) + ‖µ(P) − x‖^2 · ∑_{p ∈ P} u(p).
The second lemma shows that assigning all the points of P to the closest center in Q yields a small multiplicative error if the 1-mean and the k-means of P have roughly the same cost. If t = 0, this means that we can approximate cost(P, Q) using only one center in the query; see Line 1 of Algorithm 1. Note that (1 + 2ε)/(1 − 2ε) ≤ 1 + 8ε for ε ≤ 1/4.
Proof. Let q* denote a center that minimizes cost(P, {q}) over q ∈ Q. The left inequality of (1) is then straightforward. It is left to prove the right inequality of (1). Indeed, for every p ∈ P′, let q_p ∈ Q denote the closest point to p in Q, where ties are broken arbitrarily. Let {P_1, ..., P_k} denote the partition of P by Q = {q_1, ..., q_k}, where P_i contains the closest points to q_i for every i ∈ [k]; see Section 4. For every p ∈ P_i, let q*_p = µ(P_i). Hence, in (4) and (5) we substituted x = q* and x = q_p, respectively, in Lemma 1, and in (6) we used the fact that q*_p = µ(P_i) and q_p = q_i for every p ∈ P_i. Summing (6) over i bounds (8), where we substitute x = q* and then x = q in Lemma 1 and obtain, for every q ∈ Q, a bound in which the last inequality is by the definition of q*. This implies a corresponding bound for every p ∈ P′. Plugging the last inequality into (8) yields (11) by the Cauchy-Schwarz inequality, and in (12) we use the fact that 2ab ≤ a^2 + b^2 for every a, b ≥ 0.
To bound the left term of (13) we use the fact that q*_p = µ(P_i) and substitute x = µ(P), P = P_i in Lemma 1 for every i ∈ [k], as follows.
To bound the right term of (13) we use (a weaker version of) the triangle inequality. Plugging (14) and the last inequality into (13) and rearranging, together with (2), proves Lemma 2. For Lemma 3, the first inequality is by the optimality of q_S, and the second inequality holds since S is a coreset for P. Similarly, the left hand side of (15) is bounded, where the last inequality follows from the assumption ε < 1.
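The identity of Lemma 1, used repeatedly in the proofs above, can be sanity-checked numerically on random weighted data:

```python
import numpy as np

# Numeric check of Lemma 1:
# cost(P, {x}) = cost(P, {mu(P)}) + ||mu(P) - x||^2 * sum_p u(p)
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 5))
u = rng.uniform(0.5, 2.0, size=20)
x = rng.normal(size=5)

mu = (u[:, None] * P).sum(axis=0) / u.sum()
lhs = float((u * ((P - x) ** 2).sum(axis=1)).sum())
rhs = float((u * ((P - mu) ** 2).sum(axis=1)).sum() + u.sum() * ((mu - x) ** 2).sum())
```

The two sides agree up to floating-point rounding for any choice of P, u and x.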

Lemma 4.
Let S be the output of a call to k-MEAN-CORESET(P, k, ε). Then S is a (k, 15ε)-coreset for P.
Proof. By replacing P with P_i in Lemma 1 for each i ∈ [m], and summing the resulting inequality over the sets P_i, we obtain a bound in terms of ∑_{i=1}^m (opt(P_i, 1) − opt(P_i, k)). (16) Since {P_1, ..., P_m} is the partition of P by its m-means, we have ∑_{i=1}^m opt(P_i, 1) = opt(P, m). Letting Q_i be the m-means of P_i, the second inequality is by Line 1 of the algorithm. Plugging the last inequality into (16), then using Lemma 3 for every i ∈ [m] together with Lemma 1, and plugging the result back into (17), yields the claim. Hence, S is a (k, 15ε)-coreset for P.
Using the mean of P_i in Line 5 of the algorithm yields a (1, ε)-coreset S_i, as shown in Lemma 1. The resulting coreset is not sparse, but gives the following result.

Theorem 2.
There is m ≤ k^{1/ε^2} such that the m-means of P is a (k, 15ε)-coreset for P.

Proof of Theorem 1:
We compute S_i, a (1, ε)-coreset for the 1-mean of P_i, in Line 5 of Algorithm 1 by using a variation of the Frank-Wolfe algorithm [22]. It follows that |S_i| = O(1/ε^2) for each i; therefore the overall sparsity of S is s(P)/ε^2. This, together with Lemma 4, concludes the proof.

Comparison to Existing Approaches
In this section we provide experimental results for our main coreset construction algorithm. We compare its clustering quality with existing coresets on small, medium and large datasets. Unlike most coreset papers, we also run the algorithm in the distributed setting via a cloud, as explained below.

Datasets
For our experimental results we use three well known datasets, and the English Wikipedia as follows.
MNIST handwritten digits [25]. The MNIST dataset consists of n = 60,000 grayscale images of handwritten digits. Each image is of size 28 × 28 pixels and was converted to a row vector of d = 784 dimensions.
Pendigits [26]. This dataset was downloaded from the UCI repository. It was produced by 44 writers, each of whom was asked to write 250 digits in a random order inside boxes of 500 by 500 tablet pixel resolution. The tablet sends x and y tablet coordinates and pressure-level values of the pen at fixed time intervals (sampling rate) of 100 milliseconds. Digits are represented as constant-length feature vectors of size d = 16; the number of digits in the dataset is n = 10,992.
NIPS dataset [27]. The OCR scanning of NIPS proceedings over 13 years, comprising 15,000 pages and 1958 articles. For each author there is a corresponding word-counter vector, where the ith entry is the number of times that the ith word was used in one of the author's submissions. There are overall n = 2865 authors and d = 14,036 words in this corpus.
English Wikipedia [28]. Unlike the previous datasets, which were loaded into memory and then compressed via streaming coresets, the English Wikipedia practically cannot be loaded completely into memory. The size of the dataset is 15 GB after converting it to a term-document matrix via gensim [29]. It has 4M vectors, each of 10^5 dimensions with an average of 200 non-zero entries, i.e., words per document.

The Experiment
We applied our coreset construction to boost the performance of Lloyd's k-means heuristic as explained in Section 8 of previous work [6]. We compared the results with the current data summarization algorithms that can handle sparse data: uniform and importance sampling.

On the small/medium Datasets
We evaluate both the offline computation and the streaming data model. For offline computation, we used the datasets above to produce coresets of size 100 ≤ m ≤ 1500, then computed k-means for k = 10, 15, 20, 25 until convergence. For the streaming data model, we divided each dataset into small subsets and computed coresets via the merge-and-reduce technique to construct a coreset tree as in Figure 1. Here the coresets are smaller, of size 10 ≤ m ≤ 500, and the values of k are the same.
We computed the sum of squared distances to the original (full) set of points from each resulting set of k centers that was computed on a coreset. These costs are denoted by C_1, C_2 and C_3 for uniform sampling, non-uniform sampling and our coreset, respectively. The "ground truth" or "optimal solution" C_k was computed by running k-means on the entire dataset until convergence. The empirical estimated error was then defined as ε_t = C_t/C_k − 1 for coreset number t = 1, 2, 3. Note that, since Lloyd's k-means is a heuristic, its performance on the reduced data might be better than on the full data, i.e., ε_t < 0 is possible.
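The error measure used above is simple to state in code; the function names are illustrative.

```python
import numpy as np

def empirical_error(P, centers_coreset, centers_opt):
    """eps_t = C_t / C_k - 1: relative excess cost of coreset-derived centers,
    where both costs are evaluated on the full point set P."""
    def sq_cost(Q):
        return float(((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1).min(axis=1).sum())
    return sq_cost(centers_coreset) / sq_cost(centers_opt) - 1.0
```

A value of 0 means the coreset-derived centers match the ground-truth cost exactly; negative values are possible when the heuristic happens to do better on the reduced data.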
These experiments were run on a single common laptop machine with 2.2GHz quad-core Intel Core i7 CPU and 16GB of 1600MHz DDR3L RAM with 256GB PCIe-based onboard SSD.

On the Wikipedia Dataset
We compared the three discussed data summarization techniques, where each one was computed in parallel and in a distributed fashion on 16 EC2 virtual servers. We repeated the computation for k = 16, 32, 64 and 128, and coreset sizes in the range [32, 1024].

Results
The results of the offline experiments for k = 15, 20, 25 on the small datasets are depicted in Figure 2: the error of k-means fed by our coreset algorithm is consistently lower than the error obtained with uniform and non-uniform sampling. In the streaming computation model, our algorithm again provides better results than the other two, as shown in Figure 3. In addition, the existing algorithms suffer from a "cold start", as is common in random sampling techniques: they converge slowly to a small error, whereas our deterministic algorithm introduces a small error already for small sample sizes. Figure 4 presents the results of the experiment on the Wikipedia dataset for k = 32, 64, 128; as can easily be observed, the proposed coreset algorithm handles this big sparse dataset well and provides a lower cost compared to the uniform and non-uniform approaches. Figures 5-7 show box plots of the error distribution for all three coresets in the offline and streaming settings. Our algorithm shows little variance across all experiments, and its mean error is very close to its median error, indicating that it produces stable results.

Conclusions
We proved that any set of points in R d has a (k, ε)-coreset which consists of a weighted subset of the input points whose size is independent of n and d, and polynomial in 1/ε. Our algorithm carefully selects m such that the m-means of the input with appropriate weights (clusters' size) yields such a coreset.
This allows us to finally compute coresets for sparse high-dimensional data, in both the streaming and distributed settings. As a practical example, we computed the first coreset for the full English Wikipedia. We hope that our open source code will allow researchers in industry and academia to run these coresets on more databases such as images, speech or tweets.
The reduction to k-means allows us to use popular k-means heuristics (such as Lloyd-Max) and provable constant-factor approximations (such as k-means++) in practice. Our experimental results on both a single machine and on the cloud show that our coreset construction significantly improves over existing techniques, especially for small coresets, due to its deterministic approach.
We hope that this paper will also help the community to answer the following three open problems: (i) Can we simply compute the m-means for a specific value m ∈ k^{O(1/ε)} and obtain a (k, ε)-coreset without using our algorithm? (ii) Can we compute such a coreset (a subset of the input) whose size is m ∈ (k/ε)^{O(1)}? (iii) Can we compute such a smaller coreset deterministically?
Author Contributions: Conceptualization and formal analysis, D.F. and A.B.; funding acquisition, D.F.; code writing, A.B. All authors have read and agreed to the published version of the manuscript.