Article

Deterministic Coresets for k-Means of Big Sparse Data †

Department of Computer Science, University of Haifa, Haifa 32000, Israel
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016.
These authors contributed equally to this work.
Algorithms 2020, 13(4), 92; https://doi.org/10.3390/a13040092
Submission received: 3 January 2020 / Revised: 15 March 2020 / Accepted: 1 April 2020 / Published: 14 April 2020
(This article belongs to the Special Issue Big Data Algorithmics)

Abstract

Let P be a set of n points in $\mathbb{R}^d$, let $k \geq 1$ be an integer and let $\varepsilon \in (0, 1)$ be a constant. An ε-coreset is a subset $C \subseteq P$ with appropriate non-negative weights (scalars) that approximates any given set $Q \subset \mathbb{R}^d$ of k centers: the sum of squared distances from every point in P to its closest point in Q is the same, up to a factor of $1 \pm \varepsilon$, as the weighted sum from C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to lie in P or in other restricted zones) on the small coreset and obtain faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic and distributed data using the classic merge-and-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing coresets of this type are randomized and their size has at least linear dependency on the dimension d. We suggest the first such coreset whose size is independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia, computed using Amazon’s cloud.

1. Background

Given a set of n points in $\mathbb{R}^d$ and an error parameter $\varepsilon > 0$, a coreset in this paper is a small set of weighted points in $\mathbb{R}^d$ such that the sum of squared distances from the original set of points to any set of k centers in $\mathbb{R}^d$ can be approximated by the sum of weighted squared distances from the points in the coreset. Running an existing clustering algorithm on the coreset then yields an approximation to the output of running the same algorithm on the original data, by the definition of the coreset.
Coresets were first suggested by [1] as a way to improve the theoretical running time of existing algorithms. Moreover, a coreset is a natural tool for handling Big Data using all the computation models that are mentioned in the previous section. This is mainly due to the merge-and-reduce tree approach that was suggested by [2,3] and formalized by [4]: coresets can be computed independently for subsets of the input points, e.g., on different computers, and then be merged and re-compressed again. Such a binary compression tree can also be computed using one pass over a possibly unbounded stream of points, where at any moment only $O(\log n)$ coresets exist in memory for the n points streamed so far. Here the coreset construction is applied only on small chunks of points, so a possibly inefficient coreset construction still yields efficient coreset constructions for large sets; see Figure 1. Note that the coreset guarantees are preserved under this technique, and no assumptions are made on the order of the streaming input points. These coresets can be computed independently and in parallel via M machines (e.g., on the cloud), reducing the running time by a factor of M. The communication between the machines is also small, since each machine needs to communicate to a main server only the coreset of its data.
In practice, this technique can be implemented easily using the map-reduce approach of modern software for Big Data such as [5].
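To make the merge-and-reduce idea concrete, the following is a minimal Python sketch of the streaming tree described above. The coreset() reduction used here is a uniform-subsampling placeholder (our own stand-in), not the paper's Algorithm 1, and all class and function names are ours.

```python
# Minimal sketch of the merge-and-reduce streaming tree described above.
import numpy as np

def coreset(points, weights, size, rng=np.random.default_rng(0)):
    """Placeholder reduction (uniform subsample with rescaled weights);
    the paper would use Algorithm 1 here instead (assumption)."""
    if len(points) <= size:
        return points, weights
    idx = rng.choice(len(points), size, replace=False)
    scale = weights.sum() / weights[idx].sum()
    return points[idx], weights[idx] * scale

class StreamingCoreset:
    def __init__(self, leaf_size, coreset_size):
        self.leaf_size = leaf_size          # points buffered before compression
        self.coreset_size = coreset_size    # size each reduced coreset is kept at
        self.buffer, self.levels = [], {}   # levels[i] holds at most one coreset

    def insert(self, p):
        self.buffer.append(p)
        if len(self.buffer) == self.leaf_size:
            P = np.array(self.buffer)
            C = coreset(P, np.ones(len(P)), self.coreset_size)
            self.buffer = []
            self._carry(0, C)               # merge up the binary tree

    def _carry(self, level, C):
        # Like binary addition: merge two coresets of the same level and push up.
        while level in self.levels:
            other = self.levels.pop(level)
            merged_P = np.vstack([C[0], other[0]])
            merged_w = np.concatenate([C[1], other[1]])
            C = coreset(merged_P, merged_w, self.coreset_size)
            level += 1
        self.levels[level] = C              # only O(log n) coresets are stored

stream = StreamingCoreset(leaf_size=64, coreset_size=32)
for p in np.random.rand(10_000, 5):
    stream.insert(p)
print("coresets kept in memory:", len(stream.levels))
```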
Coresets can also be used to support the dynamic model where points are deleted/inserted. Here the storage is linear in n, since we need to save the tree in memory (practically, on the hard drive); however, the update time is only logarithmic in n, since we need to reconstruct only the $O(\log n)$ coresets that correspond to the deleted/inserted point along the tree. The first such coreset of size independent of d was introduced by [6]. See [1,2] for details.

Constrained k-Means and Determining k

Since the coreset approximates every set of k centers, it can also be used to solve the k-means problem under different constraints (e.g., allowed areas for placing the centers) or given a small set of candidate centers. In addition, the set of centers may contain duplicate points, which means that the coreset can also approximate the sum of squared distances for $k' < k$ centers. Hence, coresets can be used to choose the right number $k' \leq k$ of centers, by minimizing the sum of squared distances plus $f(k')$ for some function f of the number of centers. Full open-source code is available [8].
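As an illustration of this model-selection use, here is a hedged sketch that picks the number of clusters on a weighted coreset (C, w) by minimizing the coreset cost plus a penalty f(k′); the specific penalty below is our own illustrative choice, not one prescribed by the paper.

```python
# Hedged sketch: choosing the number of clusters k' <= k_max on a weighted
# coreset (C, w) by minimizing coreset cost plus a penalty f(k').
import numpy as np
from sklearn.cluster import KMeans

def choose_k(C, w, k_max, penalty):
    best_k, best_score = None, np.inf
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10).fit(C, sample_weight=w)
        score = km.inertia_ + penalty(k)     # weighted SSE on the coreset + f(k)
        if score < best_score:
            best_k, best_score = k, score
    return best_k

C = np.random.rand(200, 10)                  # coreset points (toy data)
w = np.ones(200)                             # coreset weights
d = C.shape[1]
print(choose_k(C, w, k_max=8, penalty=lambda k: 0.5 * k * d))  # illustrative penalty
```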

2. Related Work

We summarize existing coreset constructions for k-means queries, as formally defined in Section 4.

Importance Sampling

Following a decade of research, coresets of size polynomial in d were suggested by [9]. Ref. [10] suggested an improved version of size $O(dk^2/\varepsilon^2)$, which is a special case of the algorithms proposed by [11]. The construction is based on computing an approximation to the k-means of the input points (with no constraints on the centers) and then sampling points proportionally to their distance to these centers. Each chosen point is then assigned a weight that is inversely proportional to this distance. The probability of failure in these algorithms decreases exponentially with the input size. Coresets of size $O(dk/\varepsilon^2)$, i.e., linear in k, were suggested in the work of [12]; however, the weight of a point may be negative or a function of the given query. For the special case k = 1, Inaba et al. [13] provided constructions of coresets of size $O(1/\varepsilon^2)$ using uniform sampling.
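For intuition, the following is a simplified sketch of the importance-sampling idea described above: sample points proportionally to their squared distance to an approximate k-means solution and reweight by the inverse sampling probability. The actual constructions in [10,11,12] use refined sensitivity bounds (e.g., with cluster-size terms), so this is only the basic mechanism, with our own helper names.

```python
# Simplified importance-sampling sketch (not the exact constructions of [10,11,12]).
import numpy as np
from sklearn.cluster import KMeans

def importance_sampling_coreset(P, k, m, rng=np.random.default_rng(0)):
    centers = KMeans(n_clusters=k, n_init=3).fit(P).cluster_centers_
    d2 = np.min(((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    prob = d2 / d2.sum()                       # sensitivity-like sampling distribution
    prob = 0.5 * prob + 0.5 / len(P)           # mix with uniform so no probability is 0
    idx = rng.choice(len(P), size=m, replace=True, p=prob)
    weights = 1.0 / (m * prob[idx])            # unbiased inverse-probability weights
    return P[idx], weights

P = np.random.rand(5000, 20)
C, w = importance_sampling_coreset(P, k=10, m=200)
print(C.shape, w.sum())                        # weights roughly sum to |P| in expectation
```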
Projection-based coresets. Data summarizations of size $O(k/\varepsilon)$ that are similar to coresets, but are based on projections onto low-dimensional subspaces (which destroy the sparsity of the input data), were suggested by [14] by improving the analysis of [4]. Recently, [15] improved on both [4,14] by applying the Johnson–Lindenstrauss Lemma [16] to the construction from [4]. However, due to the projections, the resulting summarizations of all the works mentioned above are not subsets of the input points, unlike the coreset definition of this paper. In particular, for sparse datasets such as the adjacency matrix of a graph, the document-term matrix of Wikipedia, or an image-object matrix, the sparsity of the data is lost and a single point in the summarization might be larger than the complete sparse input data.
Another type of coreset, known as a weak coreset, approximates only the optimal solution, and not every set of k centers. Such randomized coresets of size independent of d and only polynomial in k were suggested by [11] and simplified by [12].
Deterministic constructions. The first coresets for k-means [2,17] were based on partitioning the data into cells and taking a representative point from each cell into the coreset, as is done in hashing or the Hough transform [18]. However, these coresets have size at least $k/\varepsilon^{O(d)}$, i.e., exponential in d, although they do provide a result that is a subset of the input, in contrast to previous work [17]. Our technique is most related to the deterministic construction that was suggested in [4], which recursively computes a k-means clustering of the input points. While its output set has size independent of d, it is not a coreset as defined in this paper, since it is not a subset of the input and thus cannot handle sparse data, as explained above. Techniques such as uniform sampling for each cluster yield coresets whose probability of failure is linear in the input size, or whose size depends on d.
Is the m-means a coreset for k-means? A natural approach for coreset construction, which is strongly related to this paper, is to compute the m-means of the input set P for a sufficiently large m, where the weight of each center is the number of points in its cluster. If the sum of squared distances to these m centers is about an ε-factor of the k-means cost, we obtain a $(k, \varepsilon)$-coreset by (a weaker version of) the triangle inequality. Unfortunately, it was recently proved in [19] that there exist sets for which $m \geq k^{\Omega(d)}$ centers are needed in order to obtain this small sum of squared distances.

3. Our Contribution

We suggest the following deterministic algorithms:
  • An algorithm that computes a $(1 + \varepsilon)$-approximation for the k-means of a set P that is distributed (partitioned) among M machines, where each machine needs to send only $k^{O(1)}$ input points to the main server at the end of its computation.
  • A streaming algorithm that, after one pass over the data and using $k^{O(1)} \log n$ memory, returns an $O(\log n)$-approximation to the k-means of P. The algorithm can run “embarrassingly in parallel” [20] on data that is distributed among M machines, and supports insertion/deletion of points as explained in the previous section.
  • Description of how to use our algorithm to boost both the running time and quality of any existing k-means heuristic using only the heuristic itself, even in the classic off-line setting.
  • Extensive experimental results on real world data-sets. This includes the first k-means clustering with provable guarantees for the English Wikipedia, via 16 EC2 instances on Amazon’s cloud.
  • Open code for fully reproducing our results and for the benefit of the community. To our knowledge, this is the first coreset code that can run on the cloud without additional commercial packages.

3.1. Novel Approach: m-Means Is A Coreset for k-Means, for Smart Selection of m

One of our main technical results is that for every constant $\varepsilon > 0$, there exists an integer $m \leq k^{O(1)}$ such that the m-means of the input set (or its approximation) is a $(k, \varepsilon)$-coreset; see Theorem 2. However, simply computing the m-means of the input set for a sufficiently large m might yield an m that is exponential in d, as explained by [19] and in the related work. Instead, Algorithm 1 carefully selects the right m between k and $k^{O(1)}$ by checking the appropriate condition in each iteration.

3.2. Solving k-Means Using k-Means

It might be confusing that we suggest solving the k-means problem by computing an m-means clustering for m > k. In fact, most coreset constructions actually solve the original optimization problem in one of their first construction steps. The main observation is that we never run the coreset construction on the complete input of n (or an unbounded stream of) points, but only on small subsets of size 2m. This is because our coresets are composable: they can be merged (into 2m points) and reduced back to m points using the merge-and-reduce tree technique. The composability follows from the fact that the coreset approximates the sum of squared distances to every set of k centers, and not just the k-means of the subset at hand.

3.3. Running Time

Running time that is exponential in k is unavoidable for any $(1 + \varepsilon)$-approximation algorithm that solves k-means, even in the planar case (d = 2) [21] and for $\varepsilon < 0.1$ [4]. Our main contribution is a coreset construction that uses memory independent of d and running time that is near-linear in n. To our knowledge this is an open problem even for the case k = 2. Nevertheless, for large values of ε (e.g., ε > 10) we may use existing constant-factor approximations that take time polynomial in k to compute our coreset in time that is near-linear in n but also polynomial in k.
In practice, provable $(1 + \varepsilon)$-approximations for k-means are rarely used, due to the lower bounds on the running time mentioned above. Instead, heuristics are used. Indeed, in our experimental results we follow this approach and use a heuristic, instead of computing the optimal k-means, both during the coreset construction and on the resulting coreset itself.
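A hedged sketch of this boosting idea: run the heuristic on a small weighted coreset and, optionally, polish the resulting centers with a few Lloyd iterations on the full data. This is our reading of the approach rather than the exact procedure of [6]; the uniform-sampling coreset below is only a stand-in for a real coreset construction.

```python
# Boosting a k-means heuristic with a coreset (illustrative sketch, our names).
import numpy as np
from sklearn.cluster import KMeans

def boosted_kmeans(P, k, coreset_fn, coreset_size):
    C, w = coreset_fn(P, coreset_size)                       # any subset coreset
    km_small = KMeans(n_clusters=k, n_init=10).fit(C, sample_weight=w)
    # Polish on the full data, starting from the coreset solution.
    km_full = KMeans(n_clusters=k, init=km_small.cluster_centers_,
                     n_init=1, max_iter=5).fit(P)
    return km_full.cluster_centers_

def uniform_coreset(P, m, rng=np.random.default_rng(1)):     # baseline stand-in
    idx = rng.choice(len(P), m, replace=False)
    return P[idx], np.full(m, len(P) / m)

P = np.random.rand(20000, 30)
centers = boosted_kmeans(P, k=10, coreset_fn=uniform_coreset, coreset_size=500)
print(centers.shape)
```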

4. Notation and Main Result

The input to our problem is a set P of n points in $\mathbb{R}^d$, where each point $p \in P$ has a multiplicative weight $u(p) > 0$. In addition, there is an additive weight $\rho \geq 0$ for the set. Formally, a weighted set in $\mathbb{R}^d$ is a tuple $P = (P, u, \rho)$, where $P \subseteq \mathbb{R}^d$ and $u : P \to [0, \infty)$. In particular, an unweighted set has a unit weight $u(p) = 1$ for each point, and a zero additive weight.

4.1. k-Means Clustering

For a given set $Q = \{q_1, \ldots, q_k\}$ of $k \geq 1$ centers (points) in $\mathbb{R}^d$, the Euclidean distance from a point $p \in \mathbb{R}^d$ to its closest center in Q is denoted by $\mathrm{dist}(p, Q) = \min_{q \in Q} \lVert p - q \rVert_2$. The sum of these weighted squared distances over the points of P is denoted by
$$\mathrm{cost}(P, Q) := \sum_{p \in P} u(p) \cdot \mathrm{dist}^2(p, Q) + \rho.$$
If P is an unweighted set, this cost is just the sum of squared distances over each point in P to its closest center in Q.
Let $P_i$ denote the subset of points in P whose closest center in Q is $q_i$, for every $i \in [k]$, where we denote $[m] = \{1, \ldots, m\}$ for every integer $m \geq 1$. Ties are broken arbitrarily. This yields a partition $P_1, \ldots, P_k$ of P by Q. More generally, the partition of the weighted set $(P, u, \rho)$ by Q is the set of weighted sets $(P_1, u_1, \rho/k), \ldots, (P_k, u_k, \rho/k)$, where $u_i(p) = u(p)$ for every $p \in P_i$ and every $i \in [k]$.
A set $Q_k$ that minimizes this weighted sum $\mathrm{cost}(P, Q)$ over every set Q of k centers in $\mathbb{R}^d$ is called the k-means of P. The 1-means $\mu(P)$ of P is called the centroid, or the center of mass, since
$$\mu(P) = \frac{1}{\sum_{p \in P} u(p)} \sum_{p \in P} u(p) \cdot p.$$
We denote the cost of the k-means of P by $\mathrm{opt}(P, k) := \mathrm{cost}(P, Q_k)$.
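The definitions above translate directly into code; a minimal sketch (with our own helper names) of the weighted cost and the centroid:

```python
# Direct translation of the definitions above for a weighted set (P, u, rho):
# cost(P, Q) and the centroid mu(P). Written for clarity, not efficiency.
import numpy as np

def cost(P, u, rho, Q):
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # squared distances to each center
    return float((u * d2.min(axis=1)).sum() + rho)        # sum u(p)*dist^2(p,Q) + rho

def centroid(P, u):
    return (u[:, None] * P).sum(axis=0) / u.sum()         # mu(P)

P = np.random.rand(100, 3); u = np.ones(100); rho = 0.0
Q = np.random.rand(5, 3)
print(cost(P, u, rho, Q), centroid(P, u))
```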

4.2. Coreset

Computing the k-means of a weighted set P is the main motivation of this paper. Formally, let $\varepsilon > 0$ be an error parameter. The weighted set $S = (S, w, \phi)$ is a $(k, \varepsilon)$-coreset for P, if for every set $Q \subseteq \mathbb{R}^d$ of $|Q| = k$ centers we have
$$(1 - \varepsilon)\, \mathrm{cost}(P, Q) \leq \mathrm{cost}(S, Q) \leq (1 + \varepsilon)\, \mathrm{cost}(P, Q),$$
that is,
$$(1 - \varepsilon) \left( \sum_{p \in P} u(p)\, \mathrm{dist}^2(p, Q) + \rho \right) \leq \sum_{p \in S} w(p) \cdot \mathrm{dist}^2(p, Q) + \phi \leq (1 + \varepsilon) \left( \sum_{p \in P} u(p) \cdot \mathrm{dist}^2(p, Q) + \rho \right).$$
To handle streaming data we will need to compute “coresets for union of coresets”, which is the reason that we assume that both the input P and its coreset S are weighted sets.
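The coreset property is also easy to probe numerically. Below is a small sanity-check harness of our own (not from the paper) that tests the inequality above on random query sets; passing it does not prove the property, it only spot-checks it.

```python
# Empirical spot-check of the (k, eps)-coreset inequality on random queries Q.
import numpy as np

def cost(P, u, rho, Q):
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return float((u * d2.min(axis=1)).sum() + rho)

def check_coreset(P, u, rho, S, w, phi, k, eps, trials=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        Q = rng.uniform(P.min(0), P.max(0), size=(k, P.shape[1]))
        cp, cs = cost(P, u, rho, Q), cost(S, w, phi, Q)
        if not (1 - eps) * cp <= cs <= (1 + eps) * cp:
            return False
    return True

P = np.random.rand(500, 3); u = np.ones(500)
print(check_coreset(P, u, 0.0, P, u, 0.0, k=3, eps=0.1))   # a set is a coreset of itself
```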

4.3. Sparse Coresets

Unlike previous work, we add the constraint that if each point in P is sparse, i.e., has few non-zero coordinates, then the set S will also be sparse. Formally, the maximum sparsity $s(P) := \max_{p \in P} \lVert p \rVert_0$ of P is the maximum number of non-zero entries $\lVert p \rVert_0 = |\{ i \in [d] \mid p_i \neq 0 \}|$ over every point p in P.
In particular, if each point in S is a linear combination of at most α points in P, then $s(S) \leq \alpha \cdot s(P)$. In addition, we would like the set S to be of size independent of both n = |P| and d.
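For sparse inputs stored as a SciPy CSR matrix whose rows are the points (our representation choice), the maximum sparsity s(P) defined above can be computed as follows:

```python
# Computing the maximum sparsity s(P) of a sparse point set, as defined above.
import scipy.sparse as sp

def max_sparsity(P_sparse: sp.csr_matrix) -> int:
    # number of stored non-zero entries per row, maximized over all points
    return int(P_sparse.getnnz(axis=1).max())

P = sp.random(1000, 10**5, density=0.002, format="csr", random_state=0)
print(max_sparsity(P))
```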
We can now state the main result of this paper.
Theorem 1
(Small sparse coreset). For every weighted set $P = (P, u, \rho)$ in $\mathbb{R}^d$, every $\varepsilon > 0$ and every integer $k \geq 1$, there is a $(k, \varepsilon)$-coreset $S = (S, w, \phi)$ of size $|S| = k^{O(1/\varepsilon^2)}$ where each point in S is a linear combination of $O(1/\varepsilon^2)$ points from P. In particular, the maximum sparsity of S is $s(P)/\varepsilon^2$.
By plugging this result into the traditional merge-and-reduce tree in Figure 1, it is straightforward to compute a coreset using one pass over a stream of points.
Corollary 1.
A $(k, \varepsilon \log n)$-coreset $(S, w, \phi)$ of size $|S| = \log(n) \cdot k^{O(1/\varepsilon^2)}$ and maximum sparsity $s(P)/\varepsilon^2$ can be computed for the set P of the n points seen so far in an unbounded stream, using $|S| \cdot s(P)/\varepsilon^2$ memory words. The insertion time per point in the stream is $\log(n) \cdot 2^{(k/\varepsilon)^{O(1)}}$. If the stream is distributed uniformly among M machines, then the amortized insertion time per point is reduced by a (multiplicative) factor of M to $(1/M) \log(n) \cdot 2^{(k/\varepsilon)^{O(1)}}$. The coreset for the union of the streams can then be computed by communicating the M coresets to a main server.

5. Coreset Construction

Our main coreset construction algorithm k-MEAN-CORESET(P, k, ε) gets a set P as input, and returns a $(k, \varepsilon)$-coreset (S, w); see Algorithm 1.
To obtain a running time that is linear in the input size, we assume without loss of generality that P has $|P| = k^{O(1/\varepsilon^2)}$ points, and that the cardinality of the output S is $|S| \leq |P|/2$. This is thanks to the traditional merge-and-reduce approach: given a stream of n points, we apply the coreset construction only on subsets of size $2 \cdot |S|$ of P during the streaming, and reduce them by half. See Figure 1 and, e.g., [4,7] for details.

Algorithm Overview

In Line 1 we compute the smallest integer $m = k^t$ such that the cost $\mathrm{opt}(P, m)$ of the m-means of P is close to the cost $\mathrm{opt}(P, mk)$ of the (mk)-means of P. In Line 3 we compute the corresponding partition $P_1, \ldots, P_m$ of P by its m-means $Q_m = \{q_1, \ldots, q_m\}$. In Line 5 a $(1, \varepsilon)$-sparse coreset $S_i$ of size $O(1/\varepsilon^2)$ is computed for every $P_i$, $i \in [m]$. This can be done deterministically, e.g., by taking the mean of $P_i$ as explained in Lemma 1, or by using a gradient descent algorithm, namely Frank–Wolfe, as explained in [22], which also preserves the sparsity of the coreset, as required by our main theorem. The output of the algorithm is the union of the means of all these coresets, each with an appropriate weight, which is the size of the corresponding coreset.
The intuition behind the algorithm stems from the assumption that m-means clustering will have a lower cost than k-means, which is supported by a series of previous works [23,24]. In fact, our experiments in Section 7 show that in practice it works even better than anticipated by the theoretical bounds.
Algorithm 1: k-MEAN-CORESET(P, k, ε).
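The following is a hedged Python sketch of this flow, in the non-sparse variant where every cluster is reduced to its weighted mean (Lemma 1). Here opt is replaced by scikit-learn's KMeans (an approximation rather than the exact optimum), and the weight convention (each mean carries the total weight of its cluster) is our reading of the overview, so treat it as an illustration of Algorithm 1 rather than a faithful implementation.

```python
# Illustrative sketch of the k-MEAN-CORESET flow described in the overview above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def approx_opt(P, u, k):
    km = KMeans(n_clusters=k, n_init=3).fit(P, sample_weight=u)
    return km.inertia_, km

def kmean_coreset(P, u, k, eps):
    opt_k, _ = approx_opt(P, u, k)
    t = 1
    while True:                                   # Line 1: choose m = k^t
        m = k ** t
        opt_m, km_m = approx_opt(P, u, m)
        if m * k >= len(P):                       # safeguard for small inputs
            break
        opt_mk, _ = approx_opt(P, u, m * k)
        if opt_m - opt_mk <= (eps ** 2) * opt_k:  # stopping condition of Line 1
            break
        t += 1
    labels = km_m.labels_                         # Line 3: partition P by its m-means
    S, w = [], []
    for i in range(m):                            # Line 5: reduce each cluster to its mean
        mask = labels == i
        if not mask.any():
            continue
        ui = u[mask]
        S.append((ui[:, None] * P[mask]).sum(0) / ui.sum())   # cluster centroid
        w.append(ui.sum())                        # weight = total weight of the cluster
    return np.array(S), np.array(w)

P, _ = make_blobs(n_samples=2000, centers=5, cluster_std=0.3, random_state=0)
S, w = kmean_coreset(P, np.ones(len(P)), k=5, eps=0.5)
print(S.shape, w.sum())                           # total weight equals |P|
```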

6. Proof of Correctness

The first lemma states the common fact that the weighted sum of squared distances from a set of points to a center equals the weighted sum of squared distances to their center of mass (the variance of the set), plus the squared distance between the center of mass and the center, multiplied by the total weight.
Lemma 1.
For every $x \in \mathbb{R}^d$,
$$\mathrm{cost}(P, x) = \mathrm{cost}(P, \mu(P)) + \lVert \mu(P) - x \rVert^2 \sum_{p \in P} u(p).$$
Proof. 
We have
$$\mathrm{cost}(P, x) - \rho = \sum_{p \in P} u(p) \lVert p - x \rVert^2 = \sum_{p \in P} u(p) \lVert (p - \mu(P)) + (\mu(P) - x) \rVert^2 = \sum_{p \in P} u(p) \lVert p - \mu(P) \rVert^2 + \sum_{p \in P} u(p) \lVert \mu(P) - x \rVert^2 + 2 (\mu(P) - x) \cdot \sum_{p \in P} u(p) (p - \mu(P)).$$
The last term equals zero since $\mu(P) = \frac{1}{\sum_{p \in P} u(p)} \sum_{p \in P} u(p) \cdot p$, and thus
$$\sum_{p \in P} u(p) (p - \mu(P)) = \sum_{p \in P} u(p) \cdot p - \sum_{p \in P} u(p)\, \mu(P) = \sum_{p \in P} u(p) \cdot p - \sum_{p \in P} u(p) \cdot p = 0.$$
Hence,
$$\mathrm{cost}(P, x) = \rho + \sum_{p \in P} u(p) \lVert p - \mu(P) \rVert^2 + \sum_{p \in P} u(p) \lVert \mu(P) - x \rVert^2 = \mathrm{cost}(P, \mu(P)) + \lVert \mu(P) - x \rVert^2 \sum_{p \in P} u(p).$$
 □
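A quick numerical spot-check of Lemma 1 (our own sanity test, not part of the paper):

```python
# Verify cost(P, x) = cost(P, mu(P)) + ||mu(P) - x||^2 * sum_p u(p) numerically.
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 4)); u = rng.uniform(0.5, 2.0, size=50); rho = 0.3
x = rng.normal(size=4)
mu = (u[:, None] * P).sum(0) / u.sum()

def cost_single(P, u, rho, c):                 # cost(P, {c}) for a single center c
    return (u * ((P - c) ** 2).sum(1)).sum() + rho

lhs = cost_single(P, u, rho, x)
rhs = cost_single(P, u, rho, mu) + ((mu - x) ** 2).sum() * u.sum()
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```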
The second lemma shows that assigning all the points of P to a single closest center in Q yields a small multiplicative error if the 1-mean and the k-means of P have roughly the same cost. If t = 0, this means that we can approximate cost(P, Q) using only one center in the query; see Line 1 of Algorithm 1. Note that $(1 + 2\varepsilon)/(1 - 2\varepsilon) = 1 + O(\varepsilon)$ for $\varepsilon < 1/4$.
Lemma 2.
For every set $Q \subseteq \mathbb{R}^d$ of $|Q| = k$ centers we have
$$\mathrm{cost}(P, Q) \leq \min_{q \in Q} \mathrm{cost}(P, \{q\}) \leq \mathrm{cost}(P, Q) \cdot \frac{1 + 2\varepsilon}{1 - 2\varepsilon} + \frac{\mathrm{opt}(P, 1) - \mathrm{opt}(P, k)}{(1 - 2\varepsilon)\,\varepsilon}. \quad (1)$$
Proof. 
Let $q^*$ denote a center that minimizes $\mathrm{cost}(P, \{q\})$ over $q \in Q$. The left inequality of (1) is then straightforward since
$$\mathrm{cost}(P, Q) - \min_{q \in Q} \mathrm{cost}(P, \{q\}) = \sum_{p \in P} \min_{q \in Q} u(p) \lVert p - q \rVert^2 - \sum_{p \in P} u(p) \lVert p - q^* \rVert^2 = \sum_{p \in P} u(p) \left( \min_{q \in Q} \lVert p - q \rVert^2 - \lVert p - q^* \rVert^2 \right) \leq 0. \quad (2)$$
It is left to prove the right inequality of (1). Indeed, for every $p \in P$, let $q_p \in Q$ denote the closest point to p in Q. Ties are broken arbitrarily. Hence,
$$\min_{q \in Q} \mathrm{cost}(P, \{q\}) - \mathrm{cost}(P, Q) = \sum_{p \in P} u(p) \lVert p - q^* \rVert^2 - \sum_{p \in P} u(p) \lVert p - q_p \rVert^2.$$
Let $P_1, \ldots, P_k$ denote the partition of P by $Q = \{q_1, \ldots, q_k\}$, where $P_i$ is the set of points closest to $q_i$, for every $i \in [k]$; see Section 4. For every $p \in P_i$, let $q_p^* = \mu(P_i)$. Hence,
$$\sum_{p \in P_i} u(p) \lVert p - q^* \rVert^2 - \sum_{p \in P_i} u(p) \lVert p - q_p \rVert^2$$
$$= \sum_{p \in P_i} u(p) \lVert p - \mu(P_i) \rVert^2 + \lVert \mu(P_i) - q^* \rVert^2 \sum_{p \in P_i} u(p) \quad (4)$$
$$\;\; - \left( \sum_{p \in P_i} u(p) \lVert p - \mu(P_i) \rVert^2 + \lVert \mu(P_i) - q_i \rVert^2 \sum_{p \in P_i} u(p) \right) \quad (5)$$
$$= \sum_{p \in P_i} u(p) \left( \lVert q_p^* - q^* \rVert^2 - \lVert q_p^* - q_p \rVert^2 \right), \quad (6)$$
where in (4) and (5) we substituted $x = q^*$ and $x = q_p$, respectively, in Lemma 1, and in (6) we use the fact that $q_p^* = \mu(P_i)$ and $q_p = q_i$ for every $p \in P_i$. Summing (6) over $i \in [k]$ yields
$$\sum_{p \in P} u(p) \lVert p - q^* \rVert^2 - \sum_{p \in P} u(p) \lVert p - q_p \rVert^2 = \sum_{p \in P} u(p) \left( \lVert q_p^* - q^* \rVert^2 - \lVert q_p^* - q_p \rVert^2 \right)$$
$$= \sum_{p \in P} u(p) \left( \lVert (q_p^* - \mu(P)) + (\mu(P) - q^*) \rVert^2 - \lVert (q_p^* - \mu(P)) + (\mu(P) - q_p) \rVert^2 \right)$$
$$= \sum_{p \in P} u(p) \left( \lVert \mu(P) - q^* \rVert^2 - \lVert \mu(P) - q_p \rVert^2 \right) \quad (8)$$
$$\;\; - 2 \sum_{p \in P} u(p) (q_p^* - \mu(P)) \cdot (q^* - q_p).$$
To bound (8), we substitute $x = q^*$ and then $x = q$ in Lemma 1, and obtain that for every $q \in Q$
$$\left( \lVert \mu(P) - q^* \rVert^2 - \lVert \mu(P) - q \rVert^2 \right) \sum_{p \in P} u(p) = \left( \mathrm{cost}(P, \{q^*\}) - \mathrm{cost}(P, \mu(P)) \right) - \left( \mathrm{cost}(P, \{q\}) - \mathrm{cost}(P, \mu(P)) \right) = \mathrm{cost}(P, \{q^*\}) - \mathrm{cost}(P, \{q\}) \leq 0,$$
where the last inequality is by the definition of $q^*$. This implies that for every $p \in P$,
$$\lVert \mu(P) - q^* \rVert^2 - \lVert \mu(P) - q_p \rVert^2 \leq 0.$$
Plugging the last inequality in (8) yields
$$\sum_{p \in P} u(p) \lVert p - q^* \rVert^2 - \sum_{p \in P} u(p) \lVert p - q_p \rVert^2 \leq - 2 \sum_{p \in P} u(p) (q_p^* - \mu(P)) \cdot (q^* - q_p)$$
$$\leq 2 \sum_{p \in P} u(p) \lVert q_p^* - \mu(P) \rVert \cdot \lVert q^* - q_p \rVert = \sum_{p \in P} u(p) \cdot 2 \cdot \frac{\lVert q_p^* - \mu(P) \rVert}{\sqrt{\varepsilon}} \cdot \sqrt{\varepsilon}\, \lVert q^* - q_p \rVert \quad (11)$$
$$\leq \sum_{p \in P} u(p) \left( \frac{\lVert q_p^* - \mu(P) \rVert^2}{\varepsilon} + \varepsilon \lVert q^* - q_p \rVert^2 \right) \quad (12)$$
$$= \frac{1}{\varepsilon} \sum_{p \in P} u(p) \lVert q_p^* - \mu(P) \rVert^2 + \varepsilon \sum_{p \in P} u(p) \lVert q^* - q_p \rVert^2, \quad (13)$$
where (11) is by the Cauchy–Schwarz inequality, and in (12) we use the fact that $2ab \leq a^2 + b^2$ for every $a, b \geq 0$.
To bound the left term of (13), we use the fact that $q_p^* = \mu(P_i)$ and substitute $x = \mu(P)$, $P = P_i$ in Lemma 1 for every $i \in [k]$ as follows.
$$\sum_{p \in P} u(p) \lVert q_p^* - \mu(P) \rVert^2 = \sum_{i=1}^k \lVert \mu(P_i) - \mu(P) \rVert^2 \sum_{p \in P_i} u(p) = \sum_{i=1}^k \left( \sum_{p \in P_i} u(p) \lVert p - \mu(P) \rVert^2 - \sum_{p \in P_i} u(p) \lVert p - \mu(P_i) \rVert^2 \right) = \sum_{p \in P} u(p) \lVert p - \mu(P) \rVert^2 - \sum_{i=1}^k \sum_{p \in P_i} u(p) \lVert p - \mu(P_i) \rVert^2 \leq \mathrm{opt}(P, 1) - \mathrm{opt}(P, k). \quad (14)$$
To bound the right term of (13), we use $(a - b)^2 \leq a^2 + b^2 + 2|ab| \leq 2a^2 + 2b^2$ to obtain
$$\sum_{p \in P} u(p) \lVert q^* - q_p \rVert^2 = \sum_{p \in P} u(p) \lVert (q^* - p) + (p - q_p) \rVert^2 \leq \sum_{p \in P} u(p) \left( 2 \lVert q^* - p \rVert^2 + 2 \lVert p - q_p \rVert^2 \right) \leq 2 \left( \mathrm{cost}(P, \{q^*\}) + \mathrm{cost}(P, Q) \right).$$
Plugging (14) and the last inequality in (13) yields
$$\mathrm{cost}(P, \{q^*\}) - \mathrm{cost}(P, Q) = \sum_{p \in P} u(p) \lVert p - q^* \rVert^2 - \sum_{p \in P} u(p) \lVert p - q_p \rVert^2 \leq \frac{\mathrm{opt}(P, 1) - \mathrm{opt}(P, k)}{\varepsilon} + 2 \varepsilon \left( \mathrm{cost}(P, \{q^*\}) + \mathrm{cost}(P, Q) \right).$$
Rearranging,
$$\mathrm{cost}(P, \{q^*\}) \leq \mathrm{cost}(P, Q) \cdot \frac{1 + 2\varepsilon}{1 - 2\varepsilon} + \frac{\mathrm{opt}(P, 1) - \mathrm{opt}(P, k)}{(1 - 2\varepsilon)\,\varepsilon}.$$
Together with (2) this proves Lemma 2. □
Lemma 3.
Let S be a $(1, \varepsilon)$-coreset for a weighted set P in $\mathbb{R}^d$. Let $Q \subseteq \mathbb{R}^d$ be a finite set. Then
$$(1 - \varepsilon) \min_{q \in Q} \mathrm{cost}(P, \{q\}) \leq \min_{q \in Q} \mathrm{cost}(S, \{q\}) \leq (1 + \varepsilon) \min_{q \in Q} \mathrm{cost}(P, \{q\}). \quad (15)$$
Proof. 
Let $q_P \in Q$ be a center such that $\mathrm{cost}(P, \{q_P\}) = \min_{q \in Q} \mathrm{cost}(P, \{q\})$, and let $q_S \in Q$ be a center such that $\mathrm{cost}(S, \{q_S\}) = \min_{q \in Q} \mathrm{cost}(S, \{q\})$. The right side of (15) is bounded by
$$\min_{q \in Q} \mathrm{cost}(S, \{q\}) = \mathrm{cost}(S, \{q_S\}) \leq \mathrm{cost}(S, \{q_P\}) \leq (1 + \varepsilon)\, \mathrm{cost}(P, \{q_P\}) = (1 + \varepsilon) \min_{q \in Q} \mathrm{cost}(P, \{q\}),$$
where the first inequality is by the optimality of q S , and the second inequality is since S is a coreset for P. Similarly, the left hand side of (15) is bounded by
$$(1 - \varepsilon) \min_{q \in Q} \mathrm{cost}(P, \{q\}) = (1 - \varepsilon)\, \mathrm{cost}(P, \{q_P\}) \leq (1 - \varepsilon)\, \mathrm{cost}(P, \{q_S\}) \leq (1 - \varepsilon)(1 + \varepsilon)\, \mathrm{cost}(S, \{q_S\}) = (1 - \varepsilon^2) \min_{q \in Q} \mathrm{cost}(S, \{q\}) \leq \min_{q \in Q} \mathrm{cost}(S, \{q\}),$$
where the last inequality follows from the assumption ε < 1 . □
Lemma 4.
Let S be the output of a call to k-MEAN-CORESET(P, k, ε). Then S is a $(k, 15\varepsilon)$-coreset for P.
Proof. 
By replacing P with $P_i$ in Lemma 2 for each $i \in [m]$, it follows that
$$\mathrm{cost}(P_i, Q) \leq \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq \mathrm{cost}(P_i, Q) \cdot \frac{1 + 2\varepsilon}{1 - 2\varepsilon} + \frac{\mathrm{opt}(P_i, 1) - \mathrm{opt}(P_i, k)}{(1 - 2\varepsilon)\,\varepsilon}.$$
Summing the last inequality over each $P_i$ yields
$$\mathrm{cost}(P, Q) \leq \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq \mathrm{cost}(P, Q) \cdot \frac{1 + 2\varepsilon}{1 - 2\varepsilon} + \frac{1}{(1 - 2\varepsilon)\,\varepsilon} \sum_{i=1}^m \left( \mathrm{opt}(P_i, 1) - \mathrm{opt}(P_i, k) \right). \quad (16)$$
Since $P_1, \ldots, P_m$ is the partition of P by its m-means, we have $\sum_{i=1}^m \mathrm{opt}(P_i, 1) = \mathrm{opt}(P, m)$. By letting $Q_i$ be the k-means of $P_i$, we have
$$\sum_{i=1}^m \mathrm{opt}(P_i, k) = \sum_{i=1}^m \mathrm{cost}(P_i, Q_i) \geq \sum_{i=1}^m \mathrm{cost}\Big(P_i, \bigcup_{j=1}^m Q_j\Big) = \mathrm{cost}\Big(P, \bigcup_{j=1}^m Q_j\Big) \geq \mathrm{opt}(P, mk).$$
Hence,
$$\sum_{i=1}^m \left( \mathrm{opt}(P_i, 1) - \mathrm{opt}(P_i, k) \right) \leq \mathrm{opt}(P, m) - \mathrm{opt}(P, mk) \leq \varepsilon^2\, \mathrm{opt}(P, k) \leq \varepsilon^2\, \mathrm{cost}(P, Q),$$
where the second inequality is by Line 1 of the algorithm. Plugging the last inequality in (16) yields
$$\mathrm{cost}(P, Q) \leq \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq \mathrm{cost}(P, Q) \cdot \frac{1 + 3\varepsilon}{1 - 2\varepsilon}. \quad (17)$$
Using Lemma 3, for every $i \in [m]$,
$$(1 - \varepsilon) \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq \min_{q \in Q} \mathrm{cost}(S_i, \{q\}) \leq (1 + \varepsilon) \min_{q \in Q} \mathrm{cost}(P_i, \{q\}).$$
By summing over $i \in [m]$ we obtain
$$(1 - \varepsilon) \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(S_i, \{q\}) \leq (1 + \varepsilon) \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}).$$
By this and Lemma 1
$$(1 - \varepsilon) \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq \mathrm{cost}(S, Q) \leq (1 + \varepsilon) \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}).$$
Plugging the last inequality in (17) yields
$$(1 - \varepsilon)\, \mathrm{cost}(P, Q) \leq (1 - \varepsilon) \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq \mathrm{cost}(S, Q) \leq (1 + \varepsilon) \sum_{i=1}^m \min_{q \in Q} \mathrm{cost}(P_i, \{q\}) \leq (1 + \varepsilon)\, \mathrm{cost}(P, Q) \cdot \frac{1 + 3\varepsilon}{1 - 2\varepsilon} \leq (1 + 15\varepsilon)\, \mathrm{cost}(P, Q).$$
Hence, S is a $(k, 15\varepsilon)$-coreset for P. □
Lemma 5.
There is an integer $t < 1 + 1/\varepsilon^2$ such that
$$\mathrm{opt}(P, k^t) - \mathrm{opt}(P, k^{t+1}) \leq \varepsilon^2 \cdot \mathrm{opt}(P, k). \quad (19)$$
Proof. 
Assume, by contradiction, that (19) does not hold for any integer $t < 1 + 1/\varepsilon^2$. Hence,
$$\mathrm{opt}(P, k) - \mathrm{opt}\big(P, k^{\lceil 1/\varepsilon^2 \rceil + 1}\big) = \sum_{i=1}^{\lceil 1/\varepsilon^2 \rceil} \left( \mathrm{opt}(P, k^i) - \mathrm{opt}(P, k^{i+1}) \right) > \lceil 1/\varepsilon^2 \rceil \cdot \varepsilon^2\, \mathrm{opt}(P, k) \geq \mathrm{opt}(P, k).$$
A contradiction, since $\mathrm{opt}\big(P, k^{\lceil 1/\varepsilon^2 \rceil + 1}\big) \geq 0$. □
Using the mean of $P_i$ in Line 5 of the algorithm yields a $(1, \varepsilon)$-coreset $S_i$, as shown in Lemma 1. The resulting coreset is not sparse, but gives the following result.
Theorem 2.
There is $m \leq k^{1/\varepsilon^2}$ such that the m-means of P is a $(k, 15\varepsilon)$-coreset for P.
Proof of Theorem 1.
We compute $S_i$, a $(1, \varepsilon)$-coreset for the 1-mean of $P_i$, in Line 5 of Algorithm 1 by using a variation of the Frank–Wolfe algorithm [22]. It follows that $|S_i| = O(1/\varepsilon^2)$ for each i; therefore, the overall sparsity of S is $s(P)/\varepsilon^2$. This and Lemma 4 conclude the proof. □

7. Comparison to Existing Approaches

In this section we provide experimental results for our main coreset construction algorithm. We compare the clustering quality with existing coresets on small, medium and large datasets. Unlike most coreset papers, we also run the algorithm in the distributed setting via a cloud, as explained below.

7.1. Datasets

For our experimental results we use three well known datasets, and the English Wikipedia as follows.
MNIST handwritten digits [25]. The MNIST dataset consists of n = 60,000 grayscale images of handwritten digits. Each image is of size 28 × 28 pixels and was converted to a row vector of d = 784 dimensions.
Pendigits [26]. This dataset was downloaded from the UCI repository. It consists of digits written by 44 writers, each of whom was asked to write 250 digits in a random order inside boxes of 500 by 500 tablet pixel resolution. The tablet sends x and y tablet coordinates and pressure-level values of the pen at fixed time intervals (sampling rate) of 100 milliseconds. Digits are represented as constant-length feature vectors of size d = 16; the number of digits in the dataset is n = 10,992.
NIPS dataset [27]. The OCR scanning of the NIPS proceedings over 13 years. It contains 15,000 pages and 1958 articles. For each author there is a corresponding word-count vector, where the ith entry is the number of times the ith word was used in one of the author's submissions. There are overall n = 2865 authors and d = 14,036 words in this corpus.
English Wikipedia [28]. Unlike the previous datasets, which were loaded into memory and then compressed via streaming coresets, the English Wikipedia practically cannot be loaded completely into memory. The size of the dataset is 15 GB after converting it to a term-document matrix via gensim [29]. It has 4M vectors, each of $10^5$ dimensions with an average of 200 non-zero entries, i.e., words per document.

7.2. The Experiment

We applied our coreset construction to boost the performance of Lloyd’s k-means heuristic as explained in Section 8 of previous work [6]. We compared the results with the current data summarization algorithms that can handle sparse data: uniform and importance sampling.

7.3. On the Small/Medium Datasets

We evaluate both the offline computation and the streaming data model. For the offline computation we used the datasets above to produce coresets of size $100 \leq m \leq 1500$, and then computed k-means for k = 10, 15, 20, 25 until convergence. For the streaming data model, we divided each dataset into small subsets and computed coresets via the merge-and-reduce technique to construct a coreset tree as in Figure 1. Here the coresets are smaller, of size $10 \leq m \leq 500$, and the values of k are the same.
We computed the sum of squared distances to the original (full) set of points from each resulting set of k centers that was computed on a coreset. These sets of centers are denoted by $C_1$, $C_2$ and $C_3$ for uniform sampling, non-uniform sampling and our coreset, respectively. The “ground truth” or “optimal solution” $C_k$ was computed by running k-means on the entire dataset until convergence. The empirical estimated error was then defined to be $\varepsilon = \mathrm{cost}(P, C_t)/\mathrm{cost}(P, C_k) - 1$ for coreset number t = 1, 2, 3. Note that, since Lloyd's k-means is a heuristic, its performance on the reduced data might even be better, i.e., ε < 0.
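For reference, the empirical error above can be computed as follows (function and variable names are ours):

```python
# eps_t = cost(P, C_t) / cost(P, C_k) - 1, where C_t are the centers obtained
# from coreset t and C_k is the "ground truth" solution on the full data.
import numpy as np

def kmeans_cost(P, centers):
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def empirical_error(P, centers_from_coreset, centers_ground_truth):
    return kmeans_cost(P, centers_from_coreset) / kmeans_cost(P, centers_ground_truth) - 1
```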
These experiments were run on a single commodity laptop with a 2.2 GHz quad-core Intel Core i7 CPU, 16 GB of 1600 MHz DDR3L RAM and a 256 GB PCIe-based onboard SSD.

7.4. On the Wikipedia Dataset

We compared the three discussed data summarization techniques, where each one was computed in parallel and in a distributed fashion on 16 EC2 virtual servers. We repeated the computation for k = 16, 32, 64 and 128, and coreset sizes in the range [32, 1024].
This experiment was executed via Amazon Web Services (“the cloud”), using 16 EC2 virtual computation nodes of type c4.4xlarge, with 8 vCPUs and 15 GiB of RAM. We repeated the distributed computation, evaluating coresets of sizes 256, 512, 1024 and 2048 points for k-means with k = 16, 32, 64, 128.

7.5. Results

The results of the experiment for k = 15, 20, 25 on the small datasets in the offline computation model are depicted in Figure 2, where it is evident that the error of the k-means computation fed by our coreset algorithm outperforms the error of uniform and non-uniform sampling.
For the streaming computation model, our algorithm provides better results than the other two, as can be seen in Figure 3. In addition, the existing algorithms suffer from a “cold start”, as is common in random sampling techniques: they converge slowly to a small error, whereas our deterministic algorithm introduces a small error already after a small sample size.
Figure 4 presents the results of the experiment on the Wikipedia dataset for different values of k = 32, 64, 128. As can easily be observed, the proposed coreset algorithm handles this big sparse dataset well and provides a lower clustering cost (energy) compared to the uniform and non-uniform approaches.
Figure 5, Figure 6 and Figure 7 show box-plots of the error distribution for all three coresets in the offline and streaming settings. Our algorithm shows little variance across all experiments, and its mean error is very close to its median error, indicating that it produces stable results.
Figure 8 shows the memory (RAM) footprint during the coreset construction on synthetically generated random data. The oscillations correspond to the number of coresets in the tree that each new subset needs to update. For example, the first point in a streaming tree is handled in O(1) time, whereas the $2^i$-th point, for some $i \geq 1$, climbs up through i levels of the tree, so $O(i) = O(\log n)$ coresets are merged.

8. Conclusions

We proved that any set of points in R d has a ( k , ε ) -coreset which consists of a weighted subset of the input points whose size is independent of n and d, and polynomial in 1 / ε . Our algorithm carefully selects m such that the m-means of the input with appropriate weights (clusters’ size) yields such a coreset.
This allows us, finally, to compute coresets for sparse high-dimensional data, in both the streaming and the distributed settings. As a practical example, we computed the first coreset for the full English Wikipedia. We hope that our open-source code will allow researchers in industry and academia to run these coresets on more databases, such as images, speech or tweets.
The reduction to k-means allows us to use popular k-means heuristics (such as Lloyd–Max) and provable constant-factor approximations (such as k-means++) in practice. Our experimental results, on both a single machine and the cloud, show that our coreset construction significantly improves over existing techniques, especially for small coresets, due to its deterministic approach.
We hope that this paper will also help the community to answer the following three open problems:
(i)
Can we simply compute the m-means for a specific value $m \leq k^{O(1/\varepsilon)}$ and obtain a $(k, \varepsilon)$-coreset without using our algorithm?
(ii)
Can we compute such a coreset (a subset of the input) whose size is $m \leq (k/\varepsilon)^{O(1)}$?
(iii)
Can we compute such a smaller coreset deterministically?

Author Contributions

Conceptualization and formal analysis, D.F. and A.B.; funding acquisition, D.F.; code writing, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by BSF/NSF Grant Number 2014627 and by GIF 2408-407.6 Young Scientists' Program Contract No. I-1186-407.9-2014.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Agarwal, P.K.; Har-Peled, S.; Varadarajan, K.R. Approximating extent measures of points. J. ACM 2004, 51, 606–635. [Google Scholar] [CrossRef]
  2. Har-Peled, S.; Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–15 June 2004; ACM Press: New York, NY, USA, 2004. [Google Scholar]
  3. Bentley, J.L.; Saxe, J.B. Decomposable Searching Problems I: Static-to-Dynamic Transformation. J. Algorithms 1980, 1, 301–358. [Google Scholar] [CrossRef]
  4. Feldman, D.; Schmidt, M.; Sohler, C. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 6–8 January 2013; pp. 1434–1453. [Google Scholar]
  5. Apache Hadoop. Available online: http://hadoop.apache.org (accessed on 10 March 2020).
  6. Barger, A.; Feldman, D. k-means for Streaming and Distributed Big Sparse Data. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016; pp. 342–350. [Google Scholar]
  7. Feldman, D.; Faulkner, M.; Krause, A. Scalable training of mixture models via coresets. In Proceedings of the NIPS 2011—Advances in Neural Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 2142–2150. [Google Scholar]
  8. Barger, A.; Feldman, D. Source code for running streaming SparseKMeans coreset on the cloud 2017. (in process)
  9. Chen, K. On k-median clustering in High Dimensions. In Proceedings of the 17th Annu. ACM-SIAM Symposium on Discrete Algorithms (SODA), Barcelona, Spain, 5–7 July 2006; pp. 1177–1185. [Google Scholar]
  10. Langberg, M.; Schulman, L.J. Universal ε approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM symposium on Discrete Algorithms, Austin, TX, USA, 17–19 January 2010. [Google Scholar]
  11. Feldman, D.; Monemizadeh, M.; Sohler, C. A PTAS for k-means clustering based on weak coresets. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, Gyeongju, South Korea, 6–8 June 2007. [Google Scholar]
  12. Feldman, D.; Langberg, M. A Unified Framework for Approximating and Clustering Data. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC 2011), San Jose, CA, USA, 6–8 June 2011; arXiv:1106.1379. [Google Scholar]
  13. Inaba, M.; Katoh, N.; Imai, H. Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based k-Clustering. In Proceedings of the Tenth Annual Symposium on Computational Geometry, Stony Brook, NY, USA, 6–8 June 1994; pp. 332–339. [Google Scholar]
  14. Cohen, M.; Elder, S.; Musco, C.; Musco, C.; Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 14–17 June 2015. [Google Scholar]
  15. Becchetti, L.; Bury, M.; Cohen-Addad, V.; Grandoni, F.; Schwiegelshohn, C. Oblivious dimension reduction for k-means: Beyond subspaces and the Johnson-Lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, Phoenix, AZ, USA, 23–26 June 2019; pp. 1039–1050. [Google Scholar]
  16. Johnson, W.B.; Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 1984, 26, 189–206. [Google Scholar]
  17. Har-Peled, S.; Kushal, A. Smaller coresets for k-median and k-means clustering. Discret. Comput. Geom. 2007, 37, 3–19. [Google Scholar] [CrossRef] [Green Version]
  18. Ballard, D.H. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognit. 1981, 13, 111–122. [Google Scholar] [CrossRef]
  19. Bhattacharya, A.; Jaiswal, R. On the k-means/Median Cost Function. arXiv 2017, arXiv:1704.05232. [Google Scholar]
  20. Wilkinson, B.; Allen, M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers; Prentice-Hall: Upper Saddle River, NJ, USA, 1999. [Google Scholar]
  21. Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The planar k-means problem is NP-hard. In WALCOM; Springer: Berlin/Heidelberg, Germany, 2009; pp. 274–285. [Google Scholar]
  22. Feldman, D.; Volkov, M.V.; Rus, D. Dimensionality Reduction of Massive Sparse Datasets Using Coresets. arXiv 2015, arXiv:abs/1503.01663. [Google Scholar]
  23. Fichtenberger, H.; Gillé, M.; Schmidt, M.; Schwiegelshohn, C.; Sohler, C. BICO: BIRCH meets coresets for k-means clustering. In European Symposium on Algorithms; Springer: Berlin/Heidelberg, Germany, 2013; pp. 481–492. [Google Scholar]
  24. Ackermann, M.R.; Märtens, M.; Raupach, C.; Swierkot, K.; Lammersen, C.; Sohler, C. StreamKM++ A clustering algorithm for data streams. J. Exp. Algorithmics (JEA) 2012, 17, 2.1–2.30. [Google Scholar]
  25. LeCun, Y.; Cortes, C. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 10 March 2020).
  26. Alimoglu, F.; Doc, D.; Alpaydin, E.; Denizhan, Y. Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.6299&rep=rep1&type=pdf (accessed on 10 March 2020).
  27. LeCun, Y. Nips Online Web Site. 2001. Available online: http://nips.djvuzone.org (accessed on 10 March 2020).
  28. The Free Wikipedia. Encyclopedia. 2004. Available online: https://dumps.wikimedia.org/enwiki/20170220/ (accessed on 1 February 2017).
  29. Rehurek, R.; Sojka, P. Gensim—Statistical Semantics in Python. Available online: https://www.fi.muni.cz/usr/sojka/posters/rehurek-sojka-scipy2011.pdf (accessed on 10 March 2020).
Figure 1. Coreset construction from data streams [2]. The black arrows indicate “merge-and-reduce” operations. The intermediate coresets $C_1, \ldots, C_7$ are numbered in the order in which they would be generated in the streaming case. In the parallel case, $C_1$, $C_2$, $C_4$ and $C_5$ would be constructed in parallel, followed by $C_3$ and $C_6$, finally resulting in $C_7$. The figure is taken from [7].
Figure 2. Offline coreset computation for small datasets (uniform sampling, non-uniform sampling and our algorithm).
Figure 3. Streaming coreset computation for small datasets (uniform sampling, non-uniform sampling and our algorithm).
Figure 4. Comparison of uniform sampling, non-uniform sampling and our algorithm on Wikipedia in the distributed setting with 16 servers.
Figure 5. Error (y-axis) box-plots for the real-data sets, offline computation model.
Figure 6. Error (y-axis) box-plots for real-data sets, streaming computation model.
Figure 7. Error (y-axis) box-plots for the Wikipedia dataset, distributed computation for k = 32, 64 and 128.
Figure 8. Allocated memory (y-axis) grows logarithmically during the streaming coreset construction. The zig-zag pattern is caused by the binary merge-reduce tree in Figure 1.
