Abstract
The k-means problem is to compute a set of k centers (points) that minimizes the sum of squared distances to a given set of n points in a metric space. Arguably, the most common algorithm to solve it is k-means++, which is easy to implement and provides a provably small approximation error in time that is linear in n. We generalize k-means++ to support outliers in two senses (simultaneously): (i) nonmetric spaces, e.g., M-estimators, where the distance between a point p and a center x is replaced by min{dist(p, x), c} for an appropriate constant c that may depend on the scale of the input; (ii) k-means clustering with m outliers, i.e., where the m farthest points from any given k centers are excluded from the total sum of distances. This is done via a simple reduction to (k+m)-means clustering (with no outliers).
1. Introduction
We first introduce the notion of clustering, the solution that is suggested by k-means++, and the generalization of the problem to support outliers in the input. We then describe our contribution in this context.
1.1. Clustering
For a given similarity measure, clustering is the problem of partitioning an input set of n objects into subsets, such that objects in the same group are more similar to each other than to objects in the other sets. As mentioned in [1], clustering problems arise in many different applications, including data mining and knowledge discovery [2], data compression and vector quantization [3], and pattern recognition and classification [4]. However, for most of its variants, it is an NP-Hard problem when the number k of clusters is part of the input, as elaborated and proved in [5,6]. In fact, the exponential dependency in k is unavoidable even for approximating the (regular) Euclidean k-means for clustering n points in the plane [7,8].
Hence, constant or near-constant (logarithmic in n or k) multiplicative factor approximations to the desired cost function were suggested over the years, whose running time is polynomial in both n and k. Arguably, the most common version in both academia and industry is the k-means cost function, where the input is a set of n points in a metric space, and the goal is to compute the set of k centers (points) that minimizes the sum of squared distances from each input point to its closest center.
The k-means++ method, suggested independently by [9,10], is an extremely common and simple solution with provable approximation bounds. It provides an O(log k) multiplicative approximation in time O(ndk) for the Euclidean d-dimensional space. It was also shown in [10] that the approximation factor is O(1) if the input data is well separated in some formal sense. Recently, Ref. [11] provided a constant factor approximation for points in the d-dimensional Euclidean space via exactly k centers, in time that remains polynomial in n, d, and k.
The k-means++ algorithm is based on the intuition that the centroids should be well spread out. Hence, it samples the k centers iteratively, via a distribution that is called D²-sampling, in which each input point is sampled with probability proportional to its squared distance to the centers that were already chosen. The first center is chosen uniformly at random from the input.
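For concreteness, the following is a minimal sketch of this seeding procedure for Euclidean points (our own illustration, not code from the paper; the function name, NumPy usage, and array layout are our choices).

```python
import numpy as np

def kmeans_pp_seeding(P, k, rng=None):
    """Classical k-means++ seeding: returns k rows of P chosen by D^2-sampling."""
    rng = rng or np.random.default_rng()
    n = P.shape[0]
    centers = [P[rng.integers(n)]]            # first center: uniform at random
    d2 = ((P - centers[0]) ** 2).sum(axis=1)  # squared distance to the closest chosen center
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())  # sample proportionally to the squared distance
        centers.append(P[idx])
        d2 = np.minimum(d2, ((P - P[idx]) ** 2).sum(axis=1))
    return np.stack(centers)
```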
The k-means++ algorithm supports raising the distance to any constant power, as explained in [9]. Deterministic and other versions for the sum of (nonsquared) distances were suggested in [12,13,14].
More general approximations, called bicriteria or (α, β)-approximations, guarantee a multiplicative α-approximation of the optimal cost, but the number of centers used (for approximating the optimal k centers) is βk for some β > 1. The factors α and β may depend on k and n, and different methods give different dependencies and running times. For example, Ref. [15] showed that sampling O(k log k) randomized centers yields an O(1)-approximation, and leveraged it to support streaming data. Similarly, it was proved in [16] that D²-sampling O(k) centers yields a constant factor approximation.
The analysis of [17] explores the value of α as a function of β.
A coreset for the k-median/mean problem is a small weighted set (usually a subset) of the input points that approximates the sum of distances or sum of squared distances from the original (big) set P to every given set of k centers, usually up to a multiplicative factor of 1 ± ε. In particular, we may compute an α-approximation on the coreset to obtain an α(1 + ε)-approximation for the original data.
For the special case of k-means in the Euclidean space, we can replace d by a value that depends only on k and ε, including for coreset constructions, as explained in [18]. Deterministic coresets for k-means of size independent of d were suggested in [19].
1.2. Seeding
As explained in [10], Lloyd [20,21,22] suggested a simple iterative heuristic that aims to minimize the clustering cost, assuming a solution to the case k = 1 is known. It is a special case of the Expectation Maximization (EM) heuristic for computing a local minimum. The algorithm is initialized with k random points (seeds, centroids). At each iteration, each of the input points is assigned to its closest centroid. A new set of k centroids is then constructed by taking the mean of each of the current k clusters (or solving the problem for k = 1 in the general case). This method is repeated until convergence or until a given stopping condition is met.
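The following is a minimal sketch of such a Lloyd iteration for the Euclidean case (our own illustration, not code from the paper):

```python
import numpy as np

def lloyd(P, centers, max_iter=100, tol=1e-9):
    """Assign each point to its closest center, then move each center to the
    mean of its cluster (the k=1 solution); repeat until the centers stop moving."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iter):
        d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                       # assignment step
        new_centers = np.array([P[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(len(centers))])
        if np.linalg.norm(new_centers - centers) < tol:  # stopping condition
            break
        centers = new_centers
    return centers, labels
```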
Due to its simplicity, and its convergence to a local minimum [23], this method is very common; see [3,24,25,26,27,28,29] and references therein. The method was further improved in [1,30,31,32].
The drawback of this approach is that it converges to a local minimum, namely the one closest to the initial centers that were chosen, which may be arbitrarily far from the global minimum. There is also no upper bound on the convergence rate or on the number of iterations. Therefore, a lot of research has been done on choosing good initial points, called “seeds” [33,34,35,36,37,38,39]. However, very few analytical guarantees of convergence were found.
A natural solution is to use provable approximation algorithms such as k-means++ above, and then apply Lloyd’s algorithm as a heuristic that hopefully improves the approximation in practice. Since Lloyd’s algorithm can only improve the initial solution, the provable upper bound on the approximation factor is still preserved.
1.3. Clustering with Outliers
In practice, data sets include some noisy measurements which do not reflect a real part of the data. These are called outliers, and even a single outlier may completely change the optimal solution that would be obtained without it. One option to handle outliers is to change the distance function to a function that is more robust to outliers, such as an M-estimator, e.g., where the distance dist(p, q) between a pair of points is replaced by min{dist(p, q), c} for some fixed c > 0 that may depend on the scaling or spread of the data. Another option is to compute the set of k centers that minimizes the objective function, excluding the farthest m points from the candidate k centers. Here, m ≥ 1 is a given parameter for the number of outliers. Of course, given the optimal k centers for this problem, the m outliers are simply the farthest m input points, and given these m outliers, the optimal solution is the k-means of the rest of the points. However, the main challenge is to approximate the global optimum, i.e., to compute the optimal centers and outliers simultaneously. Guessing m is less challenging since it is an integer between 1 and n, and the cost function is monotonic in m, which enables the usage of binary search.
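As a small illustration of the first option (our own sketch; the threshold c is a user-chosen constant, and the Euclidean distance is only one possible choice of dist):

```python
import numpy as np

def m_estimator_cost(P, centers, c):
    """Sum over all points of min{dist(p, X), c}: a far outlier contributes at most c."""
    d = np.sqrt(((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
    return np.minimum(d, c).sum()
```

The second option, excluding the m farthest points, is the cost function studied formally in Section 5.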
As explained in [40], detecting the outliers themselves is also an NP-hard problem [41]. Intensive research has been done on this problem, as explained in [42], since outliers arise in numerous applications [43,44]. In the context of data mining, Ref. [45] proposed a definition of a distance-based outlier, which is free of any distributional assumptions and can be generalized to multidimensional datasets. Following [45], further variations have been proposed [46,47]. Consequently, Ref. [48] introduced the paradigm of the local outlier factor (LOF). This paradigm has been extended in [43,49] in different directions.
As explained in [50], and following the discussion in [51], Ref. [52] provided an algorithm based on the Lagrangian relaxation technique. Several related algorithms [51,53,54] were also developed. The work of [55] gives a provable approximation factor and running time for the problem of k-means with m outliers. Another heuristic was developed by [56]. Finally, Ref. [50] provided a constant factor approximation in time polynomial in n, k, and m. In the context of k-means, Ref. [57] provided several such constant factor approximation algorithms. However, the number of points that are discarded as outliers is much larger than m and depends on the data, as does the algorithm's running time.
2. Our Contribution
A natural open question is: “Can we generalize the k-means++ algorithm to handle outliers?”
This paper answers this question affirmatively, in three senses that may be combined together:
- (i)
- A small modification of k-means++ that supports M-estimators for handling outliers, with similar provable guarantees for both the running time and approximation factor. This family of functions includes most of the M-estimators, including nonconvex functions such as min{dist(p, x), c} for some constant c, where dist(p, x) is the distance between an input point p and its closest center x. In fact, our version in Algorithm 1 supports any pseudodistance function or ρ-metric that approximates the triangle inequality, as formalized in Definition 1 and Theorem 1.
- (ii)
- A generalization of this solution to the k-means/median problem with m outliers that takes time linear in n for constants k and m. To our knowledge, this is the first nontrivial approximation algorithm for this problem that takes time linear or even near-linear in n. The algorithm supports all the pseudodistance functions and ρ-metric spaces, including the above M-estimators. See Corollary 4.
- (iii)
- A weak coreset of size k + m and a larger strong coreset for approximating the sum of distances to any k centers, ignoring their farthest m input points. For details and an exact definition see the next subsection and Theorem 2.
Novel reduction: from k-means with m outliers to (k+m)-means. While the first result is a natural generalization of the original k-means++, the second result uses a simple but powerful general reduction from k-means with m outliers to (k+m)-means (without outliers), which we did not find in previous papers. More precisely, in Section 5, we prove that an approximation to the (k+m)-median, with an appropriate positive weight for each center (the size of its cluster), can be used to approximate the sum of distances from P to any k centers, excluding their farthest m points in P; see Corollary 3. This type of reduction may be considered as a “coreset”, and we suggest two types of such coresets.
In the first coreset, the approximation error is additive, although the final result is a multiplicative constant factor. Nevertheless, the size of the suggested coreset is only k + m, i.e., independent of both n and d, for a constant ρ-metric such as the Euclidean space; see Theorem 2 for details. In particular, applying exhaustive search (in time that is exponential in k) on this coreset implies a constant factor approximation for the k-median with m outliers, for any ρ-metric, in time linear in n when the parameters m and k are constants; see Corollary 4.
As stated in the previous section, the exponential dependency on k is unavoidable even for the (regular) k-means in the plane [7,8]. Nevertheless, constant factor approximations that take time polynomial in both k and m may be doable by applying more involved approximation algorithms for the k-median with m outliers on our small “coreset”, which contains only k + m distinct points, e.g., the polynomial time algorithm of Ke Chen [50].
Theorem 2 also suggests a second “traditional coreset” that yields a (1+ε)-approximation for the k-median with m outliers, i.e., that obtains a (1+ε)-approximation for the sum of distances from any set of k centers, excluding their farthest m outliers in the original set P. The price is that we need a finer approximation to the (k+m)-median. As was proved in [58], this is doable in the Euclidean d-dimensional space by running k-means++ for more iterations (i.e., computing more centers) instead of only k + m. It was also proved in [58] that the exponential dependency on d is unavoidable in the worst case. See Section 5 for details.
For the special case of k-means in the Euclidean space, we can replace d by a value that depends only on k and ε, including for the coreset constructions above, as explained in [18]. A deterministic version of our coresets for k-median with m outliers may be obtained via [19].
3. Triangular Calculus
The algorithms in this paper support a large family of distance and nondistance functions. To define this family exactly, and its effect on both the approximation factors and the running times of the algorithms, we need to introduce the notion of a ρ-metric, which generalizes the definition of a metric space. For a point p and a set X, we define D(p, X) := min over x ∈ X of D(p, x). The minimum or sum over an empty set is defined to be zero.
Definition 1
(ρ-metric). Let ρ > 0, let P be a finite set, and let D : P × P → [0, ∞) be a symmetric function such that D(p, p) = 0 for every p ∈ P. The pair (P, D) is a ρ-metric if for every p, q, x ∈ P the following “approximated” triangle inequality holds:
D(p, q) ≤ ρ · (D(p, x) + D(x, q)). (1)
For example, every metric space (P, dist) is a ρ-metric for ρ = 1. If we define D(p, q) = dist²(p, q), as in the k-means case, it is easy to prove (1) for ρ = 2.
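For instance, the claim for squared distances follows from the triangle inequality together with the elementary bound (a + b)² ≤ 2a² + 2b²:

dist²(p, q) ≤ (dist(p, x) + dist(x, q))² ≤ 2 dist²(p, x) + 2 dist²(x, q),

so (1) holds with ρ = 2.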
The approximated triangle inequality also holds for sets as follows.
Lemma 1.
(See 6.1 of [59]) Let (P, D) be a ρ-metric. Then, for every pair of points p, q ∈ P and a subset X ⊆ P we have
D(p, X) ≤ ρ · (D(p, q) + D(q, X)). (2)
Proof.
For every p, q ∈ P, and a center x ∈ X that is closest to q, i.e., D(q, x) = D(q, X), we have
D(p, X) ≤ D(p, x) ≤ ρ · (D(p, q) + D(q, x)) = ρ · (D(p, q) + D(q, X)).
□
Our generalization of the k-means++ algorithm to other distance functions needs only the above definition of a ρ-metric. However, to improve the approximation bounds for the case of k-means with m outliers in Section 5, we introduce the following variant of the triangle inequality.
Definition 2
( metric). Let be a ρ-metric. For , the pair is a -metric if for every we have
For example, for a metric space ) the inequality holds by the triangle inequality for and every . For squared distance, we have for every ; see [18].
The generalization for sets is a bit more involved as follows.
Lemma 2.
Let (P, D) be a metric as in Definition 2. Then, for every pair of points p, q ∈ P and every set X ⊆ P, we have
Proof.
Let such that and . Without loss of generality, we assume . We have
where the last inequality is by (3).
The last term is bounded by via Definition 2, but this bound is useless for the case . Instead, we use (1) to obtain
Plugging the last inequality in (4) proves the case as
□
Any ρ-metric also satisfies Definition 2 for suitably related constants, as follows.
Lemma 3.
Let be a ρ-metric. Then is a -metric, where and .
Proof.
Let . We need to prove that
Without loss of generality, , otherwise the claim is trivial. We then have
where the first inequality is by the approximated triangle inequality in (1), and the second inequality is by the assumption . □
How can we prove that a given function satisfies the condition of a ρ-metric or of Definition 2? If f is some function of a metric distance, say dist²(p, q), or one of most M-estimator functions such as min{dist(p, q), c}, this may be easy via the following lemma.
Lemma 4
(Log-Log Lipschitz condition). Let g : [0, ∞) → [0, ∞) be a monotonic nondecreasing function that satisfies the following (log-log Lipschitz) condition: there is r > 0 such that for every x > 0 and y ≥ 1 we have
g(yx) ≤ y^r · g(x). (5)
Let (P, dist) be a metric space, and let D(p, q) = g(dist(p, q)) for every p, q ∈ P. Then (P, D) is a ρ-metric (which also satisfies Definition 2) where
- (i)
- ,
- (ii)
- and, if , and
- (iii)
- and, if .
Proof.
We denote , and .
(i) We need to prove that . If then , so
and Claim (i) holds for any . Similarly, Claim (i) holds for by symmetry. So we assume . For every we have
where (7) is by the triangle inequality, and (8) is by substituting x and y respectively in (5). Substituting yields
We now compute the maximum of the factor , whose derivative is zero when
i.e., when . In this case . The other extremal values of h are and where . Hence, for and for . Plugging these bounds in (9) yields Claim (i)
(ii)–(iii) We prove that . If then .
If , we need to prove that . If then by (6), and thus for every and we have
as desired. We thus need to prove the last inequality for .
We assume
and (so and thus ), otherwise the claim trivially holds. Let . Hence,
where the first inequality is by (5) and the second is by the definition of q and the assumption . Rearranging gives , so
We now first prove that
Indeed, if then , , and (12) trivially holds with equality. Otherwise ), we let so that
where the inequality is by Young’s inequality for every and such that . We then obtain (12) by substituting so that
Next, we bound the rightmost expression of (12). We have for every , since the linear function over is tangent to the concave function at the point , which is their only intersection point. By substituting we obtain
By the triangle inequality,
4. k-Means++ for ρ-Metric
In this section we suggest a generalization of the k-means++ algorithm for every ρ-metric, and not only for distances raised to a constant power as in the original version. In particular, we consider nonconvex M-estimator functions. Our goal is then to compute a provable approximation to the k-median problem for such a ρ-metric.
Let w : P → [0, ∞) be called the weight function over P. For a nonempty set Q ⊆ P and a set Y ⊆ P we define their weighted cost
f(Q, Y) := Σ_{q ∈ Q} w(q) · D(q, Y).
If Q is empty then f(Q, Y) := 0. For brevity, we denote f(Y) := f(P, Y). For an integer n ≥ 1 we denote [n] = {1, …, n}.
Definition 3
(k-median for ρ-metric spaces). Let (P, D) be a ρ-metric, and let k ≥ 1 be an integer. A set of k points in P is a k-median of P if it minimizes f(P, ·) over every such set,
and this optimum is denoted by f*. For α ≥ 1, a set Y ⊆ P is an α-approximation for the k-median of P if f(P, Y) ≤ α · f*.
Note that Y might have fewer or more than k centers.
The following lemma implies that sampling an input point with probability proportional to its weight gives, in expectation, a 2-approximation to the 1-mean.
Lemma 5.
For every nonempty set ,
Proof.
Let be the weighted median of Q, i.e., . By (1), for every ,
Summing over every weighted yields
Summing again over the weighted points of Q yields
□
We also need the following approximated triangle inequality for weighted sets.
Corollary 1
(Approximated triangle inequality). Let , and be nonempty sets. Then
Proof.
Summing (2) over every weighted yields
□
The following lemma states that if we use this sampling from P and hit a point in an uncovered cluster Q, then it yields a 2-approximation for the cost of Q.
Lemma 6.
Let such that . Then
Proof.
By Corollary 1, for every ,
Multiplying this by yields
After summing over every weighted point we obtain
□
Finally, the following lemma is the generalization of the main lemma in the analysis of the original k-means++ versions.
Lemma 7.
Let be a pair of integers. If and then the following holds.
Let be a partition of P such that . Let denote the union of the first u sets. Let be a set that covers , i.e., for every integer , and . Let Y be the output of a call to KMeans++; see Algorithm 1. Then
where , and the randomness is over the t sampled points in .
Algorithm 1: KMeans++(); see Theorem 1. (The pseudocode figure is not reproduced here.)
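Since the pseudocode figure is not reproduced, the following is a hedged sketch of how such a generalized seeding procedure may look: it receives a finite set P, a weight function w, a (pseudo)distance D and a number t of centers, and samples each new center with probability proportional to w(p)·D(p, Y), where Y is the set of centers chosen so far. The interface and names are ours and are not the paper's exact pseudocode.

```python
import random

def kmeans_pp_rho(P, w, D, t, rng=None):
    """Sketch of weighted sampling seeding for a rho-metric D.
    P: list of hashable points, w: dict point -> weight, D: symmetric cost, t: #centers."""
    rng = rng or random.Random()
    Y = [rng.choices(P, weights=[w[p] for p in P], k=1)[0]]  # first center sampled by weight
    dist_to_Y = {p: D(p, Y[0]) for p in P}                   # current D(p, Y) for every p
    for _ in range(t - 1):
        # next center sampled with probability proportional to w(p) * D(p, Y)
        y = rng.choices(P, weights=[w[p] * dist_to_Y[p] for p in P], k=1)[0]
        Y.append(y)
        for p in P:                                          # update D(p, Y) after adding y
            dist_to_Y[p] = min(dist_to_Y[p], D(p, y))
    return Y
```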
Proof.
The proof is by the following induction on : (i) the base case (and any ), and (ii) the inductive step (and any ).
(i) Base Case: . We then have
where the first two equalities hold since is not random, and the inequality holds since , and . Hence, the lemma holds for and any .
(ii) Inductive step: . Let denote the first sampled point that was inserted to during the execution of Line 5 in Algorithm 1. Let . Let such that , and denote the remaining “uncovered” clusters, i.e, . The distribution of Y conditioned on the known sample y is the same as the output of a call to KMeans++ where . Hence, we need to bound
We will bound each of the last two terms by expressions that are independent of or .
Bound on . Here , (independent of j), and by the inductive assumption that the lemma holds after replacing t with , we obtain
Bound on . In this case, and . Hence,
Put and . We remove the above dependency of upon x and then m.
We have
where the first inequality holds by the inductive assumption if , or since if . The second inequality holds since , and since .
Only the term depends on x and not only on m. Summing it over every possible yields
where the inequalities follow by substituting in Lemma 6 and Lemma 5, respectively. Hence, the expected value of (27) over x is nonpositive as
By Jensen’s (or power-mean) inequality, for every convex function , and a real vector we have . Specifically, for and ,
Multiplying by u yields
Plugging this in (29) gives the desired bound on the term ,
where the last inequality holds since . Plugging the last inequality in (28) bounds by
Firstly, we have
where the last inequality holds as .
Secondly, since ,
Hence, we can bound (31) by
This proves the inductive step and bounds (31) by
□
Corollary 2.
Let and let be a point that is sampled at random from a nonempty set such that with probability . Then with probability at least ,
Proof.
By Markov’s inequality, for every non-negative random variable G and we have
Substituting yields
By Lemma 5 we have
The following theorem is a particular case of Lemma 7. It proves that the output of KMeans++ (see Algorithm 1) is a provable approximation of the optimum.
Theorem 1.
Let be a ρ-metric, and be an integer; See Definition 1. Let , and let Y be the output of a call to KMeans++; See Algorithm 1. Then , and with probability at least ,
Moreover, Y can be computed in time that is linear in n for every constant k.
Proof.
Let and let be an optimal partition of P, i.e., . Let be a point that is sampled at random from such that with probability . Applying Lemma 7 with and yields,
where (38) holds by the definition of f* and . By plugging in Corollary 2, with probability at least over the randomness of , we have
and since , with probability at least we also have
Plugging (40) in (38) yields that with probability at least over the randomness of ,
Relating to the randomness of Y, by Markov’s inequality we have
By (41) we have,
5. k-Means with m Outliers
In this section we consider the problem of k-means with m outliers of a given set P, i.e., where the cost function of a given set X of k centers is the sum of distances from its n − m closest points in P, instead of the sum over all the n points of P. Ties are broken arbitrarily. That is, the m farthest points can be considered as the set of m outliers that are ignored in the summation of errors.
Definition 4
(k-median with m outliers). Let be a ρ-metric, , and be a pair of integers. For every subset of points in P and a set of centers, denote by the closest points to P. A set of k centers (points in P) is a k-median with m outliers of P if
For , a set Y is an α-approximation to the k-median of P with m outliers if
Note that we allow Y to have more than k centers.
This is a harder and more general problem, since for m = 0 we obtain the k-median problem from the previous sections.
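As an illustration of the cost in Definition 4 (a sketch under our own naming, for an arbitrary pseudodistance D):

```python
def cost_with_outliers(Q, X, D, m):
    """Sum of D(q, X) over the |Q| - m points of Q closest to the centers X;
    the m farthest points are treated as outliers and ignored."""
    dists = sorted(min(D(q, x) for x in X) for q in Q)
    return sum(dists[: max(len(Q) - m, 0)])
```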
We prove in this section that our generalized k-means++ can serve as a “weaker” type of coreset, which admits an additive error that still yields a constant factor approximation for the k-means with m outliers when the relevant parameters (such as ρ) are constants.
In fact, we prove that any approximate solution to the (k+m)-median of P implies such a coreset, but Algorithm 1 is both simple and general for any ρ-metric. Moreover, by taking more than k + m centers, e.g., by running Algorithm 1 for additional iterations, we may reduce the approximation error of the coreset and obtain a “strong” regular coreset, i.e., one that yields a (1+ε)-approximation for any given query set X of k centers. This requires a finer approximation for the (k+m)-median; see Theorem 2. Upper and lower bounds on the number of centers needed to obtain such an approximation are the subject of [58]. The authors prove that a constant factor approximation to the median, using sufficiently many centers, yields such a (1+ε)-approximation to the k-median. This implies that running Algorithm 1 for sufficiently many iterations would yield a coreset that admits a (1+ε)-approximation error for any given set of k centers with their farthest m outliers, as traditional coresets do. Unfortunately, the lower bounds of [58] show that the exponential dependency on d is unavoidable. However, the counter example is extremely artificial. In real world data, we may simply run Algorithm 1 until the approximation error is sufficiently small, and hope this happens after a few iterations due to the structure of the data.
We state our result for the metric case and unweighted input. However, the proof essentially uses only the triangle inequality and its variants from Section 3. For simplicity of notation, we use the term multiset C instead of a weighted set (C, w), where the weight of each item in C denotes its number of copies. The size of a multiset denotes its number of points (including duplicated points), unless stated otherwise.
Unlike the case of k-means/median, there are few solutions to k-means with m outliers. Chen [50] provided a (multiplicative) constant factor approximation for the k-median with m outliers on any metric space (that satisfies the triangle inequality, i.e., with ρ = 1) that runs in time polynomial in n, k, and m. Applying this solution on our suggested coreset, as explained in Corollary 3, might reduce the dependency on n from cubic to linear, due to the fact that the coreset construction time is linear in n and the coreset size is independent of n. In order to obtain a more general result for any ρ-metric, as in the previous section, we use a simple exhaustive search that takes time exponential in k + m, but still linear in n for every constant k and m.
Our solution also implies streaming and parallel algorithms for the k-median with m outliers on distributed data. This is simply because many k-median algorithms exist for these computation models, and they can be used to construct the “coreset” in Theorem 2 after replacing k with k + m. An existing offline nonparallel solution for the k-median with m outliers, e.g., from the previous paragraph, may then be applied to the resulting coreset to extract the final approximation, as explained in Corollary 4.
For every point p ∈ P, let its projection be its closest point in Y. For a subset Q of P, define its projection onto Y to be the union (multiset) of the projections of its points. Ties are broken arbitrarily. Recall the definition of the trimmed cost and of α-approximation from Definition 4. A sketch of this projection appears below; we are then ready to prove the main technical result for the k-median with m outliers.
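To make the construction concrete, the following sketch (ours; the helper name and the Counter representation of a multiset are assumptions) projects every point of P onto its closest center in Y and records multiplicities, yielding a multiset with at most |Y| distinct points.

```python
from collections import Counter

def project_onto(P, Y, D):
    """Return the projection of P onto Y as a multiset (Counter):
    each p in P is replaced by its closest center in Y, ties broken arbitrarily."""
    C = Counter()
    for p in P:
        C[min(Y, key=lambda y: D(p, y))] += 1
    return C
```

An offline solver for the k-median with m outliers, e.g., the exhaustive search of Corollary 4 or the algorithm of [50], can then be applied to this multiset, as in Corollary 3.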
Theorem 2
(coreset for k-median with m outliers). Let (P, dist) be a metric space, and let k, m ≥ 1 be integers. Let Y be an α-approximation for the (k+m)-median of P (without outliers), and let C be the projection of P onto Y. Then, for every set X of k centers, we have
and
Proof.
Let X be a set of k centers. The multiset C contains points that are all contained in the centers of Y. However, we do not know how to approximate the number of copies of each center in C without using P. One option is to guess an approximation to the weight of each of these points, and then use exhaustive search. However, the running time would then be exponential, and the weights depend on the query X.
Instead, we observe that while we cannot compute via C, we have the upper bound . It follows from the fact that
and , so
We now bound the rightmost term.
For a single point we have
where (47) holds by substituting and (or vice versa) and in Lemma 2, and (48) holds by the definition of .
Summing (48) over every for some set yields
where (49) holds by the definition of f, (50) holds by the triangle inequality, and (51) is by (48). The term is bounded by
where the first inequality holds since , and the last inequality holds since Y is an -approximation for the -median of P. Plugging (53) in (51) yields
The rest of the proof is by the following case analysis: (i) , and (ii) .
Case (i): . The bound for this case is
where (55) holds by (46), and (56) holds by substituting in (54). This bounds (44) for the case that .
Case (ii):. In this case, denote the corresponding points to in P by
The motivation for (44) is to obtain a traditional coreset, in the sense of a (1+ε)-approximation to any given set of k centers. To this end, we need a sufficiently fine approximation to the (k+m)-means of P. This is doable for every input set P in the Euclidean space and in many others, as was proved in [58], but at the price of taking more centers.
This is why we also added the bound (45), which suffices to obtain a “weaker” coreset that yields only a constant factor approximation to the k-means with m outliers, but uses only k + m centers, as explained in the following corollary.
Corollary 3
(From (k+m)-median to k-median with m outliers). Let (P, D) be a metric as in Definition 2, and let k, m ≥ 1 be integers. Suppose that we can compute an α-approximation Y for the (k+m)-median of P (without outliers), and also a β-approximation of the k-median with m outliers for any given multiset of (possibly duplicated) points; let X denote such an approximation for the projection of P onto Y. Then, X is a
approximation for the k-median with m outliers of P, and can be computed in the combined running time of the two computations above.
Proof.
We compute Y and then project P onto Y to obtain a multiset of at most k + m distinct points, as defined in the proof of Theorem 2. We then compute a β-approximation for the k-median with m outliers of this multiset,
in time. For a given set of centers and a subset of points, we denote by the closest points in Q to X. Let denote the k-median with m outliers of P. Hence,
where (65) holds by (45), (66) holds by (64), and (67) holds by plugging in (44). □
To avoid the many parameters in the last theorems, in the next corollary we assume that they are all constants, and suggest a simple solution to the k-median with m outliers by running an exhaustive search on our weaker coreset. The running time is exponential in k, but this may be fixed by running the more involved polynomial-time approximation of Chen [50] for k-means with m outliers on our coreset.
Corollary 4.
Let k, m ≥ 1 and ρ > 0 be constants, and let (P, D) be a ρ-metric. Then, a set X can be computed in time linear in n such that, with probability at least 0.99, X is a constant factor approximation for the k-median with m outliers of P.
Proof.
Given a set Q of points, it is easy to compute its k-median with m outliers by exhaustive search, as follows. We go over every subset Y of size k in Q. For each such subset Y, we compute its farthest m points Z in Q. The set Y that minimizes the cost of Q excluding Z is an optimal solution, since one of these sets of k centers is a k-median with m outliers of Q. Since there are at most |Q|^k such subsets, and each check takes time polynomial in |Q|, the overall running time depends only on |Q|, k, and m.
Plugging k + m (instead of k) in Theorem 1 implies that we can compute, in time linear in n, a set of size k + m which is, with the probability stated in Theorem 1, an approximation to the (k+m)-median of P. By Lemma 3, any ρ-metric also satisfies Definition 2 for suitably related constants. Plugging these constants in Corollary 3 then proves Corollary 4. □
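A minimal sketch of the exhaustive search described in the proof above (our own code; it reuses the trimmed cost of Definition 4 and assumes a pseudodistance function D):

```python
from itertools import combinations

def exhaustive_k_median_outliers(Q, D, k, m):
    """Try every k-subset of Q as centers, ignore the m farthest points of Q,
    and return the subset with the smallest remaining cost."""
    best, best_cost = None, float("inf")
    for Y in combinations(Q, k):
        dists = sorted(min(D(q, y) for y in Y) for q in Q)
        cost = sum(dists[: max(len(Q) - m, 0)])
        if cost < best_cost:
            best, best_cost = list(Y), cost
    return best, best_cost
```

In Corollary 4 this search is applied to the small coreset, whose support has only k + m distinct points, so a weighted variant over those distinct points keeps the whole search independent of n (apart from constructing the coreset).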
6. Conclusions
We proved that the k-means++ algorithm can be generalized to handle outliers, by generalizing it to ρ-metrics that support M-estimators, and we also proved that this seeding can be used to construct a coreset for the k-means with m outliers. Open problems include generalizations of k-means++ to other shapes, such as k lines or k multidimensional subspaces, and using these approximations to develop coreset algorithms for these problems. Other directions include improving or generalizing the constant factor approximations of the original k-means++ and its variants in this paper, using recent improvements in the analysis of k-means++ and coresets for k-means clustering.
Author Contributions
Conceptualization, A.S. and D.F.; methodology, A.S. and D.F.; formal analysis, A.S. and D.F.; writing—original draft preparation, A.S. and D.F.; writing—review and editing, L.R. and D.F.; supervision, D.F.; All authors have read and agreed to the published version of the manuscript.
Funding
Part of the research of this paper was sponsored by the Phenomix consortium of the Israel Innovation Authority. We thank them for this support.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Kanungo, T.; Mount, D.M.; Netanyahu, N.S.; Piatko, C.D.; Silverman, R.; Wu, A.Y. A local search approximation algorithm for k-means clustering. In Proceedings of the Eighteenth Annual Symposium on Computational Geometry, Barcelona, Spain, 5–7 June 2002; pp. 10–18. [Google Scholar]
- Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. Knowledge Discovery and Data Mining: Towards a Unifying Framework; KDD: Portland, OR, USA, 1996; Volume 96, pp. 82–88. [Google Scholar]
- Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 159. [Google Scholar]
- Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification and Scene Analysis; Wiley New York: New York, NY, USA, 1973; Volume 3. [Google Scholar]
- Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 2009, 75, 245–248. [Google Scholar] [CrossRef]
- Drineas, P.; Frieze, A.; Kannan, R.; Vempala, S.; Vinay, V. Clustering large graphs via the singular value decomposition. Mach. Learn. 2004, 56, 9–33. [Google Scholar] [CrossRef]
- Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The planar k-means problem is NP-hard. In International Workshop on Algorithms and Computation; Springer: Berlin/Heidelberg, Germany, 2009; pp. 274–285. [Google Scholar]
- Lee, E.; Schmidt, M.; Wright, J. Improved and simplified inapproximability for k-means. Inf. Process. Lett. 2017, 120, 40–43. [Google Scholar] [CrossRef]
- Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
- Ostrovsky, R.; Rabani, Y.; Schulman, L.J.; Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM (JACM) 2013, 59, 1–22. [Google Scholar] [CrossRef]
- Lattanzi, S.; Sohler, C. A better k-means++ Algorithm via Local Search. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3662–3671. [Google Scholar]
- Bartal, Y.; Charikar, M.; Raz, D. Approximating min-sum k-clustering in metric spaces. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, Crete, Greece, 6–8 July 2001; pp. 11–20. [Google Scholar]
- Charikar, M.; Guha, S.; Tardos, É.; Shmoys, D.B. A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 2002, 65, 129–149. [Google Scholar] [CrossRef]
- Chuzhoy, J.; Rabani, Y. Approximating k-median with non-uniform capacities. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Vancouver, BC, Canada, 23–25 January 2005; pp. 952–958. [Google Scholar]
- Ailon, N.; Jaiswal, R.; Monteleoni, C. Streaming k-means approximation. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2009; pp. 10–18. [Google Scholar]
- Aggarwal, A.; Deshpande, A.; Kannan, R. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization, Algorithms and Techniques; Springer: Berlin/Heidelberg, Germany, 2009; pp. 15–28. [Google Scholar]
- Makarychev, K.; Makarychev, Y.; Sviridenko, M.; Ward, J. A bi-criteria approximation algorithm for k Means. arXiv 2015, arXiv:1507.04227. [Google Scholar]
- Feldman, D.; Schmidt, M.; Sohler, C. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, Orleans, LA, USA, 6–8 January 2013; pp. 1434–1453. [Google Scholar]
- Barger, A.; Feldman, D. k-Means for Streaming and Distributed Big Sparse Data. In Proceedings of the 2016 SIAM International Conference on Data Mining, SIAM, Miami, FL, USA, 5–7 May 2016; pp. 342–350. [Google Scholar]
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
- Cox, D.R. Note on grouping. J. Am. Stat. Assoc. 1957, 52, 543–547. [Google Scholar] [CrossRef]
- Bailey, D.E.; Tryon, R.C. Cluster Analysis and the BC TRY System; Tryon-Bailey Assoc. and the University of Colorado: Denver, CO, USA, 1970. [Google Scholar]
- Milligan, G.W.; Sokol, L.M. A two-stage clustering algorithm with robust recovery characteristics. Educ. Psychol. Meas. 1980, 40, 755–759. [Google Scholar] [CrossRef]
- Max, J. Quantizing for minimum distortion. IRE Trans. Inf. Theory 1960, 6, 7–12. [Google Scholar] [CrossRef]
- MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 1 January 1967; Volume 1, pp. 281–297. [Google Scholar]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22. [Google Scholar]
- Linde, Y.; Buzo, A.; Gray, R. An algorithm for vector quantizer design. IEEE Trans. Commun. 1980, 28, 84–95. [Google Scholar] [CrossRef]
- Gray, R.M.; Neuhoff, D.L. Quantization. IEEE Trans. Inf. Theory 1998, 44, 2325–2383. [Google Scholar] [CrossRef]
- Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. (CSUR) 1999, 31, 264–323. [Google Scholar] [CrossRef]
- Alsabti, K.; Ranka, S.; Singh, V. An Efficient K-Means Custering Algorithm. 1997. Available online: https://surface.syr.edu/cgi/viewcontent.cgi?referer=https://scholar.google.com.hk/&httpsredir=1&article=1042&context=eecs (accessed on 25 November 2020).
- Pelleg, D.; Moore, A. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA USA, 15–18 August 1999; pp. 277–281. [Google Scholar]
- Phillips, S.J. Acceleration of k-means and related clustering problems. In Proceedings of the 4th International Workshop on Algorithm Engineering and Experiments, Francicsco, CA, USA, 4–5 January 2002. [Google Scholar]
- Forgey, E. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics 1965, 21, 768–769. [Google Scholar]
- Rousseeuw, P.J.; Kaufman, L. Finding Groups in Data; Wiley Online Library: Hoboken, NJ, USA, 1990; Volume 1. [Google Scholar]
- Fisher, D. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res. 1996, 4, 147–178. [Google Scholar] [CrossRef]
- Higgs, R.E.; Bemis, K.G.; Watson, I.A.; Wikel, J.H. Experimental designs for selecting molecules from large chemical databases. J. Chem. Inf. Comput. Sci. 1997, 37, 861–870. [Google Scholar] [CrossRef]
- Snarey, M.; Terrett, N.K.; Willett, P.; Wilton, D.J. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graph. Model. 1997, 15, 372–385. [Google Scholar] [CrossRef]
- Bradley, P.S.; Fayyad, U.M. Refining Initial Points for K-Means Clustering; ICML, Citeseer: Princeton, NJ, USA, 1998; Volume 98, pp. 91–99. [Google Scholar]
- Meila, M.; Heckerman, D. An experimental comparison of several clustering and initialization methods. arXiv 2013, arXiv:1301.7401. [Google Scholar]
- Chawla, S.; Gionis, A. k-means–: A unified approach to clustering and outlier detection. In Proceedings of the 2013 SIAM International Conference on Data Mining, SIAM, Anaheim, CA, USA, 2–4 May 2013; pp. 189–197. [Google Scholar]
- Lin, J.H.; Vitter, J.S. e-Approximations with minimum packing constraint violation. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Theory of Computing, Victoria, BC, Canada, 4–6 May 1992; pp. 771–782. [Google Scholar]
- Hautamäki, V.; Cherednichenko, S.; Kärkkäinen, I.; Kinnunen, T.; Fränti, P. Improving k-means by outlier removal. In Scandinavian Conference on Image Analysis; Springer: Berlin/Heidelberg, Germany, 2005; pp. 978–987. [Google Scholar]
- Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [Google Scholar] [CrossRef]
- Hawkins, D.M. Identification of Outliers; Springer: Berlin/Heidelberg, Germany, 1980; Volume 11. [Google Scholar]
- Knox, E.M.; Ng, R.T. Algorithms for mining distancebased outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases, New York City, NY, USA, 24–27 August 1998; pp. 392–403. [Google Scholar]
- Bay, S.D.; Schwabacher, M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 29–38. [Google Scholar]
- Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; pp. 427–438. [Google Scholar]
- Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; pp. 93–104. [Google Scholar]
- Papadimitriou, S.; Kitagawa, H.; Gibbons, P.B.; Faloutsos, C. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering (Cat. No. 03CH37405), Bangalore, India, 8 March 2003; IEEE: Piscataway, NJ, USA, 2003; pp. 315–326. [Google Scholar]
- Chen, K. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, 20–22 January 2008; pp. 826–835. [Google Scholar]
- Charikar, M.; Khuller, S.; Mount, D.M.; Narasimhan, G. Algorithms for Facility Location Problems with Outliers; SODA: Philadelphia, PA, USA, 2001; Volume 1, pp. 642–651. [Google Scholar]
- Jain, K.; Vazirani, V.V. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM (JACM) 2001, 48, 274–296. [Google Scholar] [CrossRef]
- Jain, K.; Mahdian, M.; Markakis, E.; Saberi, A.; Vazirani, V.V. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. J. ACM (JACM) 2003, 50, 795–824. [Google Scholar] [CrossRef]
- Mahdian, M. Facility Location and the Analysis of Algorithms through Factor-Revealing Programs. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2004. [Google Scholar]
- Arya, V.; Garg, N.; Khandekar, R.; Meyerson, A.; Munagala, K.; Pandit, V. Local search heuristics for k-median and facility location problems. SIAM J. Comput. 2004, 33, 544–562. [Google Scholar] [CrossRef]
- Aboud, A.; Rabani, Y. Correlation Clustering with Penalties and Approximating the Reordering Buffer Management Problem. Ph.D. Thesis, Computer Science Department, Technion, Haifa, Israel, 2008. [Google Scholar]
- Gupta, S.; Kumar, R.; Lu, K.; Moseley, B.; Vassilvitskii, S. Local search methods for k-means with outliers. Proc. VLDB Endow. 2017, 10, 757–768. [Google Scholar] [CrossRef]
- Bhattacharya, A.; Jaiswal, R. On the k-Means/Median Cost Function. arXiv 2017, arXiv:1704.05232. [Google Scholar]
- Braverman, V.; Feldman, D.; Lang, H. New frameworks for offline and streaming coreset constructions. arXiv 2016, arXiv:1612.00889. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
