# Deterministic Coresets for k-Means of Big Sparse Data


## Abstract


## 1. Background

#### Constrained k-Means and Determining k

## 2. Related Work

#### Importance Sampling

**Projection-based coresets.** Data summarizations of size $O(k/\epsilon )$, similar to coresets, that are based on projections onto low-dimensional subspaces were suggested in [14], improving the analysis of [4]; these projections diminish the sparsity of the input data. Recently, [15] improved on both [4,14] by applying the Johnson-Lindenstrauss Lemma [16] to the construction from [4]. However, due to the projections, the resulting summarizations in all of the works mentioned above are not subsets of the input points, unlike the coreset definition of this paper. In particular, for sparse datasets such as the adjacency matrix of a graph, the document-term matrix of Wikipedia, or an image-object matrix, the sparsity of the data diminishes, and a single point in the summarization might be larger than the complete sparse input data.
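The sparsity loss described above can be seen in a minimal numpy sketch (synthetic data, not taken from any of the cited constructions): a single random projection of a sparse point is, almost surely, fully dense.

```python
import numpy as np

rng = np.random.default_rng(0)

# A sparse input point: d = 10,000 dimensions, only 5 non-zero entries,
# mimicking one row of a document-term matrix.
d, nnz = 10_000, 5
x = np.zeros(d)
x[rng.choice(d, size=nnz, replace=False)] = 1.0

# Project onto a random low-dimensional subspace (Johnson-Lindenstrauss style).
m = 200
P = rng.standard_normal((m, d)) / np.sqrt(m)
y = P @ x

print(np.count_nonzero(x))  # 5: the input is sparse
print(np.count_nonzero(y))  # 200: the projected point is fully dense
```

A dense 200-dimensional summary point here already holds 40 times more non-zeros than the original sparse input point, which is the effect the paragraph above warns about.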

**Deterministic Constructions.** The first coresets for k-means [2,17] were based on partitioning the data into cells and taking a representative point from each cell into the coreset, as is done in hashing or the Hough transform [18]. However, these coresets have size at least $k/{\epsilon}^{O\left(d\right)}$, i.e., exponential in d, while still providing a result that is a subset of the input, in contrast to previous work [17]. Our technique is most closely related to the deterministic construction suggested in [4], which recursively computes a k-means clustering of the input points. While the output set of [4] has size independent of d, it is not a coreset as defined in this paper, since it is not a subset of the input and thus cannot handle sparse data, as explained above. Techniques such as uniform sampling from each cluster yield coresets whose probability of failure is linear in the input size, or whose size depends on d.

**m-means is a coreset for k-means?** A natural approach for coreset construction, strongly related to this paper, is to compute the m-means of the input set P for a sufficiently large m, where the weight of each center is the number of points in its cluster. If the sum of squared distances to these m centers is about an $\epsilon $ factor of the k-means cost, we obtain a $(k,\epsilon )$-coreset by (a weaker version of) the triangle inequality. Unfortunately, it was recently proved in [19] that there exist sets for which $m\in {k}^{\mathsf{\Omega}\left(d\right)}$ centers are needed to obtain this small sum of squared distances.
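The natural approach above can be sketched in a few lines of numpy. This is an illustrative toy on synthetic data, using a plain Lloyd's heuristic (`lloyd` and `cost` below are helpers written for this sketch, not the paper's algorithm): summarize P by its weighted m-means, then solve k-means on the summary instead of on P.

```python
import numpy as np

def lloyd(points, k, weights=None, iters=50, seed=0):
    """Plain Lloyd's heuristic for (weighted) k-means; returns the k centers."""
    rng = np.random.default_rng(seed)
    if weights is None:
        weights = np.ones(len(points))
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # empty clusters keep their old center
                centers[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return centers

def cost(points, centers, weights=None):
    """(Weighted) sum of squared distances to the nearest center."""
    if weights is None:
        weights = np.ones(len(points))
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((weights * d2.min(axis=1)).sum())

rng = np.random.default_rng(1)
P = rng.standard_normal((500, 2))

k, m = 3, 60                      # summarize by m-means with m >> k
m_centers = lloyd(P, m)
d2 = ((P[:, None, :] - m_centers[None, :, :]) ** 2).sum(axis=2)
w = np.bincount(d2.argmin(axis=1), minlength=m).astype(float)  # cluster sizes as weights

# Solve k-means on the 60 weighted centers instead of the 500 input points.
keep = w > 0
k_centers = lloyd(m_centers[keep], k, weights=w[keep], seed=2)

print(cost(P, k_centers))            # cost of centers found via the summary
print(cost(P, lloyd(P, k, seed=2)))  # cost of centers found on the full data
```

On well-spread data the two costs are close; the result of [19] cited above says that, in the worst case, making this gap an $\epsilon$ factor can force $m \in k^{\Omega(d)}$.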

## 3. Our Contribution

- An algorithm that computes a $(1+\epsilon )$-approximation to the k-means of a set P that is distributed (partitioned) among M machines, where each machine needs to send only ${k}^{O\left(1\right)}$ input points to the main server at the end of its computation.
- A streaming algorithm that, after one pass over the data and using ${k}^{O\left(1\right)}\mathrm{log}n$ memory, returns an $O\left(\mathrm{log}n\right)$-approximation to the k-means of P. The algorithm can run “embarrassingly in parallel” [20] on data that is distributed among M machines, and supports insertion/deletion of points as explained in the previous section.
- A description of how to use our algorithm to boost both the running time and the quality of any existing k-means heuristic using only the heuristic itself, even in the classic off-line setting.
- Extensive experimental results on real-world datasets. This includes the first k-means clustering with provable guarantees for the English Wikipedia, via 16 EC2 instances on Amazon’s cloud.
- Open code for fully reproducing our results and for the benefit of the community. To our knowledge, this is the first coreset code that can run on the cloud without additional commercial packages.
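The communication pattern of the distributed setting in the first bullet can be sketched as follows. This is only the protocol shape: `summarize` here is plain uniform sub-sampling with weights, a deliberately crude placeholder for the paper's coreset construction, chosen to keep the sketch short.

```python
import numpy as np

def summarize(points, size, seed):
    """Placeholder local summary: a uniform weighted sample of fixed size."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=size, replace=False)
    weights = np.full(size, len(points) / size)  # each sample stands for n/size points
    return points[idx], weights

rng = np.random.default_rng(0)
data = rng.standard_normal((8000, 10))
M = 16
parts = np.array_split(data, M)  # each part plays the role of one machine's data

# Each "machine" sends a summary of fixed size, independent of its data size.
summaries = [summarize(part, size=50, seed=i) for i, part in enumerate(parts)]
server_points = np.vstack([s for s, _ in summaries])
server_weights = np.concatenate([w for _, w in summaries])

print(server_points.shape)   # (800, 10): all the server ever receives
print(server_weights.sum())  # 8000.0: the weights still account for every input point
```

The point of the sketch is the shape of the traffic: the server receives $M \cdot {k}^{O(1)}$ weighted points rather than the n input points, and then clusters that weighted union.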

#### 3.1. Novel Approach: m-Means Is A Coreset for k-Means, for Smart Selection of m

#### 3.2. Solving k-Means Using k-Means

#### 3.3. Running Time

## 4. Notation and Main Result

#### 4.1. k-Means Clustering

#### 4.2. Coreset

#### 4.3. Sparse Coresets

**Theorem 1.** For every weighted set $P=({P}^{\prime},u,\rho )$ in ${\mathbb{R}}^{d}$, $\epsilon >0$ and an integer $k\ge 1$, there is a $(k,\epsilon )$-coreset $S=({S}^{\prime},w,\varphi )$ of size $\left|S\right|={k}^{O(1/{\epsilon}^{2})}$, where each point in ${S}^{\prime}$ is a linear combination of $O(1/{\epsilon}^{2})$ points from ${P}^{\prime}$. In particular, the maximum sparsity of ${S}^{\prime}$ is $s\left(P\right)/{\epsilon}^{2}$.
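The sparsity claim in the theorem follows from a simple support argument, which the following numpy sketch illustrates on synthetic points (with `t` playing the role of $O(1/{\epsilon}^{2})$): a linear combination of t points, each with s non-zeros, has support contained in the union of their supports, hence at most $t\cdot s$ non-zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, t = 1_000, 4, 25   # dimension, sparsity per input point, points per combination

# t sparse input points, each with s non-zero coordinates.
points = np.zeros((t, d))
for row in points:
    row[rng.choice(d, size=s, replace=False)] = rng.standard_normal(s)

# A summary point formed as a linear combination of the t input points.
coefficients = rng.standard_normal(t)
combined = coefficients @ points

# Its support lies in the union of the t supports: at most t*s non-zeros.
print(np.count_nonzero(combined) <= t * s)  # True
```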

**Corollary 1.**

## 5. Coreset Construction

#### Algorithm Overview

## 6. Proof of Correctness

**Lemma 1.**

**Proof.**

**Lemma 2.**

**Proof.**

**Lemma 3.**

**Proof.**

**Lemma 4.**

**Proof.**

**Lemma 5.**

**Proof.**

**Theorem 2.**

**Proof of Theorem 1.**

## 7. Comparison to Existing Approaches

#### 7.1. Datasets

**MNIST handwritten digits [25].** The MNIST dataset consists of $n=\mathrm{60,000}$ grayscale images of handwritten digits. Each image is of size 28 × 28 pixels and was converted to a row vector of $d=784$ dimensions.

**Pendigits [26].** This dataset was downloaded from the UCI repository. It consists of digits written by 44 humans, each of whom was asked to write 250 digits in a random order inside boxes of 500 × 500 tablet pixel resolution. The tablet sends x and y coordinates and pressure-level values of the pen at fixed time intervals (a sampling rate of 100 milliseconds). Digits are represented as constant-length feature vectors of size $d=16$; the number of digits in the dataset is $n=\mathrm{10,992}$.

**NIPS dataset [27].** The OCR scanning of the NIPS proceedings over 13 years. It contains 15,000 pages and 1958 articles. For each author there is a corresponding word-counter vector, where the ith entry is the number of times that the ith word was used in one of the author’s submissions. There are overall $n=2865$ authors and $d=\mathrm{14,036}$ words in this corpus.

**English Wikipedia [28].** Unlike the previous datasets, which were loaded into memory and then compressed via streaming coresets, the English Wikipedia practically cannot be loaded completely into memory. The size of the dataset is 15 GB after converting it to a term-document matrix via gensim [29]. It has 4M vectors, each of ${10}^{5}$ dimensions with an average of 200 non-zero entries, i.e., words per document.

#### 7.2. The Experiment

#### 7.3. On the Small/Medium Datasets

#### 7.4. On the Wikipedia Dataset

#### 7.5. Results

## 8. Conclusions

- (i) Can we simply compute the m-means for a specific value $m\in {k}^{O(1/\epsilon )}$ and obtain a $(k,\epsilon )$-coreset without using our algorithm?
- (ii) Can we compute such a coreset (subset of the input) whose size is $m\in {(k/\epsilon )}^{O\left(1\right)}$?
- (iii) Can we compute such a smaller coreset deterministically?

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Agarwal, P.K.; Har-Peled, S.; Varadarajan, K.R. Approximating extent measures of points. J. ACM **2004**, 51, 606–635.
- Har-Peled, S.; Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–15 June 2004; ACM Press: New York, NY, USA, 2004.
- Bentley, J.L.; Saxe, J.B. Decomposable Searching Problems I: Static-to-Dynamic Transformation. J. Algorithms **1980**, 1, 301–358.
- Feldman, D.; Schmidt, M.; Sohler, C. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 6–8 January 2013; pp. 1434–1453.
- Apache Hadoop. Available online: http://hadoop.apache.org (accessed on 10 March 2020).
- Barger, A.; Feldman, D. k-means for Streaming and Distributed Big Sparse Data. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016; pp. 342–350.
- Feldman, D.; Faulkner, M.; Krause, A. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems (NIPS 2011), Granada, Spain, 12–14 December 2011; pp. 2142–2150.
- Barger, A.; Feldman, D. Source code for running the streaming SparseKMeans coreset on the cloud, 2017. (in process)
- Chen, K. On k-median clustering in high dimensions. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Barcelona, Spain, 5–7 July 2006; pp. 1177–1185.
- Langberg, M.; Schulman, L.J. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, Austin, TX, USA, 17–19 January 2010.
- Feldman, D.; Monemizadeh, M.; Sohler, C. A PTAS for k-means clustering based on weak coresets. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, Gyeongju, South Korea, 6–8 June 2007.
- Feldman, D.; Langberg, M. A Unified Framework for Approximating and Clustering Data. In Proceedings of STOC 2011; arXiv:1106.1379.
- Inaba, M.; Katoh, N.; Imai, H. Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based k-Clustering. In Proceedings of the Tenth Annual Symposium on Computational Geometry, Stony Brook, NY, USA, 6–8 June 1994; pp. 332–339.
- Cohen, M.; Elder, S.; Musco, C.; Musco, C.; Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 14–17 June 2015.
- Becchetti, L.; Bury, M.; Cohen-Addad, V.; Grandoni, F.; Schwiegelshohn, C. Oblivious dimension reduction for k-means: Beyond subspaces and the Johnson-Lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, Phoenix, AZ, USA, 23–26 June 2019; pp. 1039–1050.
- Johnson, W.B.; Lindenstrauss, J. Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. **1984**, 26, 189–206.
- Har-Peled, S.; Kushal, A. Smaller coresets for k-median and k-means clustering. Discret. Comput. Geom. **2007**, 37, 3–19.
- Ballard, D.H. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognit. **1981**, 13, 111–122.
- Bhattacharya, A.; Jaiswal, R. On the k-means/Median Cost Function. arXiv **2017**, arXiv:1704.05232.
- Wilkinson, B.; Allen, M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers; Prentice-Hall: Upper Saddle River, NJ, USA, 1999.
- Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The planar k-means problem is NP-hard. In WALCOM; Springer: Berlin/Heidelberg, Germany, 2009; pp. 274–285.
- Feldman, D.; Volkov, M.V.; Rus, D. Dimensionality Reduction of Massive Sparse Datasets Using Coresets. arXiv **2015**, arXiv:1503.01663.
- Fichtenberger, H.; Gillé, M.; Schmidt, M.; Schwiegelshohn, C.; Sohler, C. BICO: BIRCH meets coresets for k-means clustering. In European Symposium on Algorithms; Springer: Berlin/Heidelberg, Germany, 2013; pp. 481–492.
- Ackermann, M.R.; Märtens, M.; Raupach, C.; Swierkot, K.; Lammersen, C.; Sohler, C. StreamKM++: A clustering algorithm for data streams. J. Exp. Algorithmics **2012**, 17, 2.1–2.30.
- LeCun, Y.; Cortes, C. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 10 March 2020).
- Alimoglu, F.; Doc, D.; Alpaydin, E.; Denizhan, Y. Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.6299&rep=rep1&type=pdf (accessed on 10 March 2020).
- LeCun, Y. NIPS Online Web Site. 2001. Available online: http://nips.djvuzone.org (accessed on 10 March 2020).
- Wikipedia, The Free Encyclopedia. 2004. Available online: https://dumps.wikimedia.org/enwiki/20170220/ (accessed on 1 February 2017).
- Rehurek, R.; Sojka, P. Gensim—Statistical Semantics in Python. Available online: https://www.fi.muni.cz/usr/sojka/posters/rehurek-sojka-scipy2011.pdf (accessed on 10 March 2020).

**Figure 1.** Coreset construction from data streams [2]. The black arrows indicate “merge-and-reduce” operations. The intermediate coresets ${C}_{1},\dots ,{C}_{7}$ are numbered in the order in which they would be generated in the streaming case. In the parallel case, ${C}_{1},{C}_{2},{C}_{4}$ and ${C}_{5}$ would be constructed in parallel, followed by ${C}_{3}$ and ${C}_{6}$, finally resulting in ${C}_{7}$. The figure is reproduced from [7].
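The streaming order in the caption can be sketched as the carry chain of binary addition: keep at most one partial coreset per tree level, and whenever two coresets of the same level meet, merge and reduce them into the next level. The `reduce` below is a trivial placeholder (it merges and truncates), standing in for a real coreset construction; only the tree bookkeeping is the point.

```python
def reduce(a, b, size=4):
    """Placeholder merge-and-reduce step: merge two summaries, shrink to `size`."""
    return sorted(a + b)[:size]

def stream(chunks, size=4):
    """Process a stream of chunks with a binary merge-and-reduce tree."""
    levels = {}                            # tree level -> pending coreset
    for chunk in chunks:
        coreset, level = chunk[:size], 0   # a leaf coreset (C1, C2, C4, C5, ...)
        while level in levels:             # same-level pair exists: carry upward
            coreset = reduce(levels.pop(level), coreset, size)  # C3, C6, C7, ...
            level += 1
        levels[level] = coreset
    out = []                               # at most log n levels remain at the end
    for c in levels.values():
        out = reduce(out, c, size)
    return out

chunks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(stream(chunks))  # [1, 2, 3, 4]
```

Because at most one coreset is pending per level, the memory is proportional to the tree height, which matches the logarithmic memory growth reported for the streaming setting.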

**Figure 2.** Offline coreset computation for the small datasets (uniform sampling, non-uniform sampling, and our algorithm).

**Figure 3.** Streaming coreset computation for the small datasets (uniform sampling, non-uniform sampling, and our algorithm).

**Figure 4.** Comparison of uniform sampling, non-uniform sampling, and our algorithm on Wikipedia in the distributed setting with 16 servers.

**Figure 7.** Error (y-axis) box plots for the Wikipedia dataset; distributed computation for k = 32, 64, and 128.

**Figure 8.** Allocated memory (y-axis) grows logarithmically during streaming coreset construction. The zig-zag pattern is caused by the binary merge-reduce tree in Figure 1.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Barger, A.; Feldman, D.
Deterministic Coresets for *k*-Means of Big Sparse Data. *Algorithms* **2020**, *13*, 92.
https://doi.org/10.3390/a13040092
