k-Center Clustering with Outliers in Sliding Windows
Abstract
1. Introduction
1.1. Related Work
1.2. Our Contribution
- A sliding-window algorithm for general metric spaces that, at any time, is able to return a set of centers covering all but at most points of the current window W, within a radius that is an factor larger than the optimal radius for z outliers. The algorithm requires a working memory of size and processes each point in time linear in the working memory size. By setting , the number of uncovered points becomes at most z;
- An improved algorithm with the same coverage guarantee as above, featuring a radius that is only a factor larger than the optimal radius, at the expense of an extra factor in both the working memory size and update time, for a suitable constant c;
- A sliding-window algorithm for streams of bounded doubling dimension that, starting from a (possibly crude) lower bound on the ratio between the -effective and the full diameter of the window W, returns upper and lower bounds on the -effective diameter of W. The algorithm features accuracy–space tradeoffs akin to those of the improved algorithm for 1-center with outliers;
- Experimental evidence that both the improved k-center algorithm and the effective diameter algorithm perform well and provide accurate solutions.
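To make the coverage guarantee in the first contribution concrete, the following minimal sketch computes the objective being approximated: the smallest radius around a given set of centers that covers all but at most z points of a window. It assumes Euclidean points stored as tuples; the function name `radius_with_outliers` and the toy data are illustrative and do not come from the paper.

```python
import math

def radius_with_outliers(points, centers, z):
    """Smallest radius r such that balls of radius r around `centers`
    cover all but at most z of `points` (Euclidean distance assumed)."""
    # Distance from each point to its nearest center, in increasing order.
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    # Discarding the z farthest points leaves the (z+1)-th largest distance.
    keep = len(dists) - z
    return dists[keep - 1] if keep > 0 else 0.0

# Illustrative window: two tight groups plus one far-away point.
window = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (50.0, 50.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
print(radius_with_outliers(window, centers, z=0))  # dominated by the far point
print(radius_with_outliers(window, centers, z=1))  # far point treated as an outlier
```

Allowing a single outlier shrinks the radius from the distance of the far point down to the spread of the two groups, which is why the problem with outliers is the more robust formulation.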
1.3. Organization of the Paper
2. Preliminaries
2.1. Definition of the Problems
2.2. Doubling Dimension
3. k-Center with Outliers
3.1. Weighted Coreset Construction
3.1.1. Algorithm
- The first pair is kept in the histogram;
- If a pair is kept in the histogram, all subsequent pairs with and are deleted, except for the last such pair, if any.
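The two rules above can be sketched as a single pruning pass. The conditions that make a later pair redundant are not legible in the text, so the sketch takes them as a caller-supplied predicate `redundant(kept, other)`; that predicate, the function name `prune_histogram`, and the assumption that redundant pairs form a contiguous run after the kept pair are all hypothetical.

```python
def prune_histogram(pairs, redundant):
    """Keep the first pair; after each kept pair, drop the run of
    following pairs that `redundant(kept, other)` marks as redundant,
    except for the last pair of that run."""
    kept = []
    i = 0
    while i < len(pairs):
        kept.append(pairs[i])
        # Extend j over the contiguous run of redundant pairs after i.
        j = i
        while j + 1 < len(pairs) and redundant(pairs[i], pairs[j + 1]):
            j += 1
        # A non-empty run keeps only its last pair, which becomes the
        # next kept pair; otherwise simply move on to the next pair.
        i = j if j > i else i + 1
    return kept

# Illustrative data: (timestamp, value) pairs with non-increasing values.
# The 10% threshold below is a made-up stand-in for the elided conditions.
pairs = [(1, 100), (2, 95), (3, 92), (4, 91), (5, 60), (6, 58), (7, 30)]
close = lambda kept, other: other[1] >= 0.9 * kept[1]
print(prune_histogram(pairs, close))
```

With this pruning, any two pairs that are two positions apart in the histogram differ substantially, which is what keeps the number of stored pairs small in smooth-histogram-style structures.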
Algorithm 1: update(p, t)
Algorithm 2: insertAttraction(p, γ)
Algorithm 3: updateHistograms(L)
Algorithm 4: extractCoreset( )
3.1.2. Analysis
- 1. If , then .
- 2. .
- 1. For every , ;
- 2. For every , or ;
- 3. For every , ;
- 4. .
3.2. Computation of the Solution from the Coreset
Algorithm 5: outliersCluster(T, k, ρ, ε)
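Since only the signature of this procedure is shown, the following is a hedged sketch of a greedy routine with the same parameters, modeled on the greedy strategy of Charikar et al. for k-center with outliers applied to a weighted coreset. The ball radii (1 + 2ε)ρ and (3 + 4ε)ρ, the Euclidean distance, and the `(point, weight)` representation are assumptions of this sketch, not details confirmed by the text.

```python
import math

def outliers_cluster(coreset, k, rho, eps):
    """Greedy sketch. `coreset` is a list of (point, weight) pairs.
    Returns (centers, uncovered), where `uncovered` holds the weighted
    points left outside every selected ball."""
    uncovered = list(coreset)
    centers = []
    small = (1 + 2 * eps) * rho   # assumed radius used to score candidates
    large = (3 + 4 * eps) * rho   # assumed radius used to discard points
    while len(centers) < k and uncovered:
        # Score each coreset point by the uncovered weight in its small ball.
        def ball_weight(c):
            return sum(w for q, w in uncovered if math.dist(c, q) <= small)
        best = max((p for p, _ in coreset), key=ball_weight)
        centers.append(best)
        # Everything within the large ball of the new center is covered.
        uncovered = [(q, w) for q, w in uncovered if math.dist(best, q) > large]
    return centers, uncovered

# Illustrative weighted coreset on a line (second coordinate is zero).
coreset = [((0, 0), 3), ((1, 0), 2), ((10, 0), 4), ((11, 0), 1), ((100, 0), 1)]
centers, uncovered = outliers_cluster(coreset, k=2, rho=1.0, eps=0.0)
print(centers, sum(w for _, w in uncovered))
```

A caller would typically search over guesses of ρ and accept the smallest one for which the uncovered weight does not exceed the allowed number of outliers.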
- .
Algorithm 6: computeSolution( )
3.3. Obliviousness to and
3.4. Improved Approximation under Bounded Doubling Dimension
4. Effective Diameter Estimation
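As a point of reference for what is being estimated, the sketch below computes an effective diameter exactly by brute force. It assumes the common definition (the smallest distance within which a given fraction alpha of the point pairs fall), counts unordered pairs of distinct points, and uses Euclidean distance; the paper's exact pair-counting convention may differ, and the function name is illustrative.

```python
import math
from itertools import combinations

def effective_diameter(points, alpha):
    """Smallest distance d such that at least an alpha fraction of the
    unordered pairs of distinct points are within distance d."""
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    if not dists:
        return 0.0
    # Number of pairs that must fall within the returned distance.
    need = max(1, math.ceil(alpha * len(dists)))
    return dists[need - 1]

# Illustrative window: three nearby points and one distant point.
window = [(0, 0), (1, 0), (2, 0), (100, 0)]
print(effective_diameter(window, 0.5))  # insensitive to the distant point
print(effective_diameter(window, 1.0))  # coincides with the full diameter
```

The brute-force version inspects a quadratic number of pairs and stores the whole window, which is exactly the cost a sliding-window estimator aims to avoid.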
5. Experiments
5.1. k-Center with Outliers
5.2. Effective Diameter
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Dataset | Algorithm | Obj. Ratio | Memory ( Floats) | ||||
---|---|---|---|---|---|---|---|
Window Size | Window Size | ||||||
Higgs (z = 10) | our-sliding | ||||||
charikar | - | ||||||
samp-charikar | |||||||
gon | |||||||
Higgs (z = 50) | our-sliding | ||||||
charikar | - | - | |||||
samp-charikar | |||||||
gon | |||||||
Cover (z = 10) | our-sliding | ||||||
charikar | 1.02 * | ||||||
samp-charikar | |||||||
gon | |||||||
Cover (z = 50) | our-sliding | ||||||
charikar | - | ||||||
samp-charikar | |||||||
gon | |||||||
Higgs+ (z = 10) | our-sliding | ||||||
charikar | - | ||||||
samp-charikar | |||||||
gon | |||||||
Cover+ (z = 10) | our-sliding | ||||||
charikar | 1.02 * | ||||||
samp-charikar | |||||||
gon |
Dataset | Algorithm | Update Time (ms) | Query Time (s) | ||||
---|---|---|---|---|---|---|---|
Window Size | Window Size | ||||||
Higgs (z = 10) | our-sliding | ||||||
charikar | - | ||||||
samp-charikar | |||||||
Higgs (z = 50) | our-sliding | ||||||
charikar | - | - | |||||
samp-charikar | |||||||
Cover (z = 10) | our-sliding | ||||||
charikar | * | ||||||
samp-charikar | |||||||
Cover (z = 50) | our-sliding | ||||||
charikar | - | ||||||
samp-charikar | |||||||
Higgs+ (z = 10) | our-sliding | ||||||
charikar | - | ||||||
samp-charikar | |||||||
Cover+ (z = 10) | our-sliding | ||||||
charikar | * | ||||||
samp-charikar |
Dataset | Algorithm | Clustering Radius | Memory ( Floats) | ||||
---|---|---|---|---|---|---|---|
Window Size | Window Size | ||||||
Higgs (z = 10) | - | - | |||||
Cover (z = 10) | |||||||
Dataset | Algorithm | Update Time (ms) | Query Time (s) | ||||
---|---|---|---|---|---|---|---|
Window Size | Window Size | ||||||
Higgs (z = 10) | - | - | |||||
Cover (z = 10) | |||||||
Dataset | Algorithm | Diameter Ratio | Memory ( floats) | ||||
---|---|---|---|---|---|---|---|
Window Size | Window Size | ||||||
Higgs-eff | eff-sliding | ||||||
eff-sequential | - |
Dataset | Algorithm | Update Time (ms) | Query Time (s) | ||||
---|---|---|---|---|---|---|---|
Window Size | Window Size | ||||||
Higgs-eff | eff-sliding | ||||||
eff-sequential | - |
Dataset | Eff. Diameter | Memory ( floats) | Update Time (ms) | Query Time (s) |
---|---|---|---|---|
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pellizzoni, P.; Pietracaprina, A.; Pucci, G. k-Center Clustering with Outliers in Sliding Windows. Algorithms 2022, 15, 52. https://doi.org/10.3390/a15020052