k-Means+++: Outliers-Resistant Clustering

The k-means problem is to compute a set of k centers (points) that minimizes the sum of squared distances to a given set of n points in a metric space. Arguably, the most common algorithm to solve it is k-means++, which is easy to implement and provides a provably small approximation error in time that is linear in n. We generalize k-means++ to support outliers in two senses (simultaneously): (i) nonmetric spaces, e.g., M-estimators, where the distance dist(p, x) between a point p and a center x is replaced by min{dist(p, x), c} for an appropriate constant c that may depend on the scale of the input; (ii) k-means clustering with m ≥ 1 outliers, i.e., where the m farthest points from any given k centers are excluded from the total sum of distances. This is done via a simple reduction to the (k + m)-means clustering (with no outliers).


Introduction
We first introduce the notion of clustering, the solution that is suggested by k-means++, and the generalization of the problem to support outliers in the input. We then describe our contribution in this context.

Clustering
For a given similarity measure, clustering is the problem of partitioning an input set of n objects into subsets, such that objects in the same group are more similar to each other than to objects in the other sets. As mentioned in [1], clustering problems arise in many different applications, including data mining and knowledge discovery [2], data compression and vector quantization [3], and pattern recognition and classification [4]. However, for most of its variants, it is an NP-Hard problem when the number k of clusters is part of the input, as elaborated and proved in [5,6]. In fact, the exponential dependency in k is unavoidable even for approximating the (regular) Euclidean k-means for clustering n points in the plane [7,8].
Hence, constant or near-constant (logarithmic in n or k) multiplicative factor approximations to the desired cost function were suggested over the years, whose running time is polynomial in both n and k. Arguably, the most common version in both academy and industry is the k-means cost function, where the input is a set of n points in a metric space, and the goal is to compute the set of k centers (points) that minimizes the sum of squared distances over each input point to its closest center.
The k-means++ method that was suggested independently by [9,10] is an extremely common and simple solution with provable approximation bounds. It provides an α ∈ O(log k) multiplicative approximation in time O(dnk), e.g., for the Euclidean d-dimensional space. It was also shown in [10] that the approximation factor is α = O(1) if the input data is well separated in some formal sense.
A softer, bicriteria variant of the problem allows the algorithm to output βk centers for some β > 1. The factors α and β may depend on k and n, and different methods give different dependencies and running times. For example, Ref. [15] showed that sampling O(k log k) randomized centers yields an O(1)-approximation and leveraged this to support streaming data. Similarly, it was proved in [16] that D²-sampling O(k) centers yields an O(1)-approximation.
The analysis of [17] explores the value of α as a function of β.
A coreset for the k-median/mean problem is a small weighted set (usually subset) of the input points that approximates the sum of distances or sum of squared distances from the original (big) set P to every given set of k centers, usually up to (1 + ε) multiplicative factor. In particular, we may compute an α-approximation on the coreset to obtain α(1 + ε)-approximation for the original data.
For the special case of k-means in the Euclidean space we can replace d by O(k/ε²), including for coreset constructions, as explained in [18]. Deterministic coresets for k-means of size independent of d were suggested in [19].

Seeding
As explained in [10], Lloyd [20][21][22] suggested a simple iterative heuristic that aims to minimize the clustering cost, assuming a solution to the case k = 1 is known. It is a special case of the Expectation-Maximization (EM) heuristic for computing a local minimum. The algorithm is initialized with k random points (seeds, centroids). At each iteration, each of the input points is classified to its closest centroid. A new set of k centroids is constructed by taking the mean (or solving the problem for k = 1, in the general case) of each of the current k clusters. This method is repeated until convergence or any given stopping condition.
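The iterative scheme above can be sketched in a few lines. The following is a minimal illustration of Lloyd's heuristic for the Euclidean case (points as NumPy arrays); the function name and stopping rule are ours, not the exact procedure of the cited works:

```python
import numpy as np

def lloyd(points, centroids, max_iters=100, tol=1e-9):
    """Lloyd's heuristic: repeatedly assign points to the closest
    centroid, then move each centroid to the mean of its cluster."""
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: solve the k = 1 sub-problem (the mean) per cluster.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids, atol=tol)
        centroids = new_centroids
        if converged:  # stopping condition: centroids stopped moving
            break
    return centroids, labels
```

Note that the quality of the result depends entirely on the initial `centroids`, which is exactly the seeding issue discussed next.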
The drawback of this approach is that it converges to a local minimum, the one closest to the initial centers that had been chosen, which may be arbitrarily far from the global minimum. There is also no upper bound on the convergence rate or the number of iterations. Therefore, a lot of research has been done on choosing good initial points, called "seeds" [33][34][35][36][37][38][39]. However, very few analytical guarantees of convergence were found.
A natural solution is to use provable approximation algorithms such as k-means++ above, and then apply Lloyd's algorithm as a heuristic that hopefully improves the approximation in practice. Since Lloyd's algorithm can only improve the initial solution, the provable upper bound on the approximation factor is still preserved.

Clustering with Outliers
In practice, data sets include some noise measurements which do not reflect a real part of the data. These are called outliers, and even a single outlier may completely change the optimal solution that is obtained without this outlier. One option to handle outliers is to change the distance function to a function that is more robust to outliers, such as M-estimators, e.g., where the distance dist(p, x) between a pair of points is replaced by min {dist(p, x), c} for some fixed c > 0 that may depend on the scaling or spread of the data. Another option is to compute the set of k centers that minimizes the objective function, excluding the farthest m points from the candidate k centers. Here, m ≥ 1 is a given parameter for the number of outliers. Of course, given the optimal k centers for this problem, the m outliers are simply the farthest m input points, and given these m outliers the optimal solution is the k-means for the rest of the points. However, the main challenge is to approximate the global optimum, i.e., compute the optimal centers and outliers simultaneously. Guessing m is less challenging since it is an integer between 1 and n, and the cost function is monotonic in m, which enables the usage of binary search.
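The two options above can be stated as concrete objective functions. The following is a minimal illustration (our function names), assuming Euclidean points as NumPy arrays and squared distances:

```python
import numpy as np

def truncated_cost(points, centers, c):
    """M-estimator objective: sum over points of min(d2, c), where d2 is
    the squared distance to the closest center; caps each point's influence."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return float(np.minimum(d2, c).sum())

def cost_with_outliers(points, centers, m):
    """k-means-with-m-outliers objective: sum of squared distances to the
    closest center, excluding the m farthest points."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return float(np.sort(d2)[: len(points) - m].sum())
```

In both cases a single far-away point can no longer dominate the sum, which is the sense in which these objectives are robust.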
As explained in [40], detecting the outliers themselves is also an NP-hard problem [41]. Intensive research has been done on this problem, as explained in [42], since outliers arise in numerous applications [43,44]. In the context of data mining, Ref. [45] proposed a definition of distance-based outliers, which is free of any distributional assumptions and can be generalized to multidimensional datasets. Following [45], further variations were proposed [46,47]. Consequently, Ref. [48] introduced the paradigm of the local outlier factor (LOF). This paradigm has been extended in [43,49] in different directions.
As explained in [50], and following the discussion in [51], Ref. [52] provided an algorithm based on a Lagrangian-relaxation technique. Several other algorithms [51,53,54] were also developed. The work of [55] gives a factor of O(1) and a running time of O(n^m) for the problem of k-means with m outliers. Another heuristic was developed by [56]. Finally, [50] provided an O(1)-approximation in O(k²(k + m)² n³ log n) time. In the context of k-means, Ref. [57] provided several algorithms with such constant factor approximations. However, the number of points that approximate the outliers is much greater than m and depends on the data, as does the algorithm's running time.

Our Contribution
A natural open question is: "can we generalize the k-means++ algorithm to handle outliers"? This paper answers this question affirmatively in three senses that may be combined together: (i) A small modification of k-means++ that supports M-estimators for handling outliers, with similar provable guarantees for both the running time and approximation factor. This family of functions includes most of the M-estimators, including nonconvex functions such as M(x) = min{x, c} for some constant c, where x = dist(p, c) is the distance between an input point p and its closest center c. In fact, our version in Algorithm 1 supports any pseudodistance function or ρ-metric that approximates the triangle inequality, as formalized in Definition 1 and Theorem 1.
(ii) A generalization of this solution to the k-means/median problem with m outliers that takes O(n) time for constants k and m. To our knowledge, this is the first nontrivial approximation algorithm for this problem that takes time linear or even near-linear in n. The algorithm supports all the pseudodistance functions and ρ-metric spaces above, including the M-estimators. See Corollary 4.
(iii) A weak coreset of size O(k + m) and a larger strong coreset for approximating the sum of distances to any k centers, ignoring their farthest m input points. For details and exact definitions see the next subsection and Theorem 2.
Novel reduction: from k-means to (k + m)-means. While the first result is a natural generalization of the original k-means++, the second result uses a simple but powerful general reduction from k-means with m outliers to (k + m)-means (without outliers), which we did not find in previous papers. More precisely, in Section 5 we prove that an approximation to the (k + m)-median, with an appropriate positive weight for each center (the size of its cluster), can be used to approximate the sum of distances from P to any k centers, excluding their farthest m points in P; see Corollary 3. This type of reduction may be considered a "coreset", and we suggest two types of them.
In the first coreset, the approximation error is additive, although the final result is a multiplicative constant factor. Nevertheless, the size of the suggested coreset is only O(k + m), i.e., independent of both n and d, for a constant-ρ metric such as the Euclidean space; see Theorem 2 for details. In particular, applying exhaustive search (in time that is exponential in k) on this coreset implies an O(log(k + m))-factor approximation for the k-median with m outliers, for any ρ-metric (P, f), in O(n) time when the parameters m and k are constants; see Corollary 4. As stated in the previous section, the exponential dependency on k is unavoidable even for the (regular) k-means in the plane [7,8]. Nevertheless, constant factor approximations that take time polynomial in both k and m may be doable by applying more involved approximation algorithms for the k-median with m outliers on our small "coreset" of only O(k + m) points, e.g., the polynomial-time algorithm of Ke Chen [50].
Theorem 2 also suggests a second "traditional coreset" that yields a (1 + ε)-approximation for the k-median with m outliers, i.e., that obtains a (1 + ε)-approximation for the sum of distances from any set of k centers, excluding their farthest m outliers in the original set P. The price is that we need an α < ε approximation to the (k + m)-median. As proved in [58], this is doable in the Euclidean d-dimensional space by running k-means++ for (1/ε)^O(d) (k + m) log n iterations instead of only O(k + m) iterations (the number of resulting centers). It was also proved in [58] that the exponential dependency on d is unavoidable in the worst case. See Section 5 for details.
For the special case of k-means in the Euclidean space we can replace d by O(k/ε²), including for the coreset constructions above, as explained in [18]. Deterministic versions of our coresets for the k-median with m outliers may be obtained via [19].

Triangular Calculus
The algorithms in this paper support a large family of distance and nondistance functions. To define this family exactly, together with its effect on both the approximation factors and running times of the algorithms, we introduce the notion of a ρ-metric, which generalizes the definition (P, dist) of a metric space. For a point p ∈ P and a set X ⊆ P, we define f(p, X) = min_{x∈X} f(p, x). The minimum or sum over an empty set is defined to be zero.

Definition 1 (ρ-metric). Let ρ ≥ 1, let P be a finite set, and let f : P² → [0, ∞) be a symmetric function such that f(p, p) = 0 for every p ∈ P. The pair (P, f) is a ρ-metric if for every x, y, z ∈ P the following "approximated" triangle inequality holds:

f(x, z) ≤ ρ( f(x, y) + f(y, z) ).     (1)

For example, a metric space (P, dist) is a ρ-metric for ρ = 1. If we define f(x, y) = (dist(x, y))², as in the k-means case, it is easy to prove (1) for ρ = 2.
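As a sanity check of the ρ = 2 claim for squared distances, the inequality f(x, z) ≤ ρ(f(x, y) + f(y, z)) can be verified numerically on random triples. This small script is ours, for illustration only; it is not part of the proof:

```python
import itertools
import random

def dist2(p, q):
    # Squared Euclidean distance, i.e., f(x, y) = (dist(x, y))^2.
    return sum((a - b) ** 2 for a, b in zip(p, q))

random.seed(0)
pts = [tuple(random.uniform(-5, 5) for _ in range(3)) for _ in range(30)]
# Check f(x, z) <= rho * (f(x, y) + f(y, z)) with rho = 2 on all triples.
rho = 2.0
ok = all(dist2(x, z) <= rho * (dist2(x, y) + dist2(y, z)) + 1e-9
         for x, y, z in itertools.product(pts, repeat=3))
```

The underlying reason is elementary: dist(x, z) ≤ dist(x, y) + dist(y, z), and squaring gives dist²(x, z) ≤ (a + b)² ≤ 2a² + 2b².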
The approximated triangle inequality also holds for sets as follows.
Then, for every pair of points q, x ∈ P and a subset X ⊆ P we have

f(x, X) ≤ ρ( f(x, q) + f(q, X) ).     (2)

Proof. Let x_q ∈ X be a center that is closest to q, i.e., f(q, X) = f(q, x_q). Then f(x, X) ≤ f(x, x_q) ≤ ρ( f(x, q) + f(q, x_q) ) = ρ( f(x, q) + f(q, X) ), where the second inequality is by (1).

Our generalization of the k-means++ algorithm for other distance functions needs only the above definition of a ρ-metric. However, to improve the approximation bounds for the case of k-means with m outliers in Section 5, we introduce the following variant of the triangle inequality: the pair (P, f) is a (ρ, φ, ε)-metric if it is a ρ-metric and, for every x, y, z ∈ P,

f(x, z) − f(y, z) ≤ ε f(y, z) + φ f(x, y).     (3)

For example, for a metric space (P, dist) the inequality holds by the triangle inequality for φ = 1 and every ε ≥ 0. For squared distances, we have φ = O(1/ε) for every ε > 0; see [18].
The generalization for sets is a bit more involved as follows.
where the last inequality is by (3).
Plugging the last inequality in (4) proves the remaining case. Any ρ-metric is also a (ρ, φ, ε)-metric for some other related constants, as follows.
Proof. Let x, y, z ∈ P. We need to prove that f(x, z) − f(y, z) ≤ ε f(y, z) + φ f(x, y). Without loss of generality, f(x, z) > f(y, z); otherwise the claim is trivial. We then have f(x, z) − f(y, z) ≤ ρ( f(x, y) + f(y, z) ) − f(y, z) = ρ f(x, y) + (ρ − 1) f(y, z), where the inequality is by the approximated triangle inequality in (1).
How can we prove that a given function f : P² → [0, ∞) satisfies the condition of a ρ-metric or a (ρ, φ, ε)-metric? If f is a function of a metric distance, say, f(x) = dist^r(x), or most M-estimator functions such as f(x) = min{dist(x), 1}, this may be easy via the following lemma.
Let (P, dist) be a metric space, and f : P 2 → [0, ∞) be a mapping from every p, c ∈ P to f (p, c) = g(dist(p, c)).
Proof. If y = 0 the inequality is immediate, and Claim (i) holds for any ρ ≥ 1. Similarly, Claim (i) holds for x = 0 by symmetry. So we assume x, y > 0. For every b ∈ (0, 1) we have the chain of inequalities (7) and (8), where (7) is by the triangle inequality, and (8) is by substituting x and y, respectively, in (5).
We now compute the maximum of the factor h(x) = (x + y)^r / (x^r + y^r), whose derivative is zero when x = y; in this case h(x) = 2^(r−1). The other extremal values of h are at x = 0 and x → ∞, where h(x) = 1. Hence, max_{x≥0} h(x) = 2^(r−1) for r ≥ 1 and max_{x≥0} h(x) = 1 for r ∈ (0, 1). Plugging these bounds in (9) yields Claim (i). For Claim (ii) with y = 0 and every φ ≥ 1 and ε > 0, the inequality holds as desired; we thus need to prove it only for y > 0.
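The extremal values of h above are easy to confirm numerically. The following small script (ours, for illustration only) grid-searches h for several values of r, with y = 1 without loss of generality:

```python
def h(x, y, r):
    # The factor whose maximum over x >= 0 determines rho for g(t) = t^r.
    return (x + y) ** r / (x ** r + y ** r)

def max_h(r, y=1.0):
    # Coarse grid search over x in (0, 20]; for r >= 1 the maximum 2^(r-1)
    # is attained at x = y, and the grid contains that point exactly.
    return max(h(i / 1000.0, y, r) for i in range(1, 20001))
```

For instance, r = 2 (the k-means case) gives a maximum of 2, matching ρ = 2^(r−1) = 2 above.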
We assume x > y (so p ≠ q and thus z > 0); otherwise the claim trivially holds. Let q = max{r, 1}. Hence, the first inequality below is by (5) and the second is by the definition of q and the assumption x > y ≥ 0. Rearranging gives g(y) ≥ g(x) · (y/x)^q, so (11) follows. We now first prove (12). Indeed, if q = 1 then r ≤ 1, ε = 0, φ = 1, and (12) trivially holds with equality. Otherwise (q > 1), we let p = q/(q − 1) so that 1/p + 1/q = 1, and the inequality is by Young's inequality ab ≤ a^p/p + b^q/q for every a, b ≥ 0 and p, q > 1 such that 1/p + 1/q = 1. We then obtain (12). Next, we bound the rightmost expression of (12). We have 1 − w^q ≤ q(1 − w) for every w ≥ 0, since the linear function q(1 − w) over w ≥ 0 is tangent to the concave function 1 − w^q at the point w = 1, which is their only intersection point. By substituting w = y/x we obtain (14). By the triangle inequality, (15) holds. Combining (14) and (15) yields (16). Since g(x) ≥ φ g(z) ≥ g(z) by (10) and the definition of φ, we have x ≥ z, i.e., x/z ≥ 1, by the monotonicity of g. Hence, (17) holds. By combining the last inequalities we obtain the desired result, where (18) holds by (11), (19) is by (12), (20) holds by (16), and (21) is by (17).

k-Means++ for ρ-metric
In this section we suggest a generalization of the k-means++ algorithm to every ρ-metric, and not only to distances raised to the power of z ≥ 1 as in the original version. In particular, we support nonconvex M-estimator functions. Our goal is then to compute an α-approximation to the k-median problem for α = O(log k).
Note that Y might have fewer or more than k centers.
The following lemma implies that sampling an input point, based on the distribution of w, gives a 2-approximation to the 1-mean.

Lemma 5.
For every nonempty set Q ⊆ P, the claimed bound holds.

Proof. Let p* be the weighted median of Q, i.e., f_w(Q, p*) = f*_w(Q, 1). By (1), for every q, x ∈ Q the pointwise bound holds. Summing over every weighted q ∈ Q, and then summing again over the weighted points of Q, yields the claim.

We also need the following approximated triangle inequality for weighted sets.
Corollary 1 (Approximated triangle inequality). Let x ∈ P, and let Q, X ⊆ P be nonempty sets.

Proof. Summing (2) over every weighted q ∈ Q yields the claim.

The following lemma states that if we sample from P and hit a point in an uncovered cluster Q, then it is a 2-approximation for f*(P, 1).
Proof. By Corollary 1, the bound holds for every x ∈ Q. Multiplying this by the sampling probability of x and summing over every weighted point x ∈ Q, we obtain the claim. Finally, the following lemma is the generalization of the main lemma in the original k-means++ analyses.
Let {P_1, · · · , P_k} be a partition of P such that ∑_{i=1}^k f*(P_i, 1) = f*(P, k). Let U = P_1 ∪ · · · ∪ P_u denote the union of the first u sets. Let X ⊆ P be a set that covers P \ U, i.e., X ∩ P_i ≠ ∅ for every integer i ∈ [k] \ [u], and X ∩ U = ∅. Let Y be the output of a call to KMEANS++(P, w, f, X, t); see Algorithm 1. Then

E[ f(P, Y) ] ≤ (1 + H_t) ( f(P \ U, X) + 8ρ² f*(U, u) ) + ((u − t)/u) · f(U, X),

where the randomness is over the t sampled points in Y \ X.

Input: A ρ-metric (P, f), a function w : P → [0, ∞), a subset X ⊆ P, and an integer t ∈ [0, |P| − |X|].
Output: A set Y ⊆ P.
1  Y := X
2  for i := 1 to t    // If t = 0 then skip this "for" loop
3  do
4      For every p ∈ P, set Pr_i(p) := w(p) f(p, Y) / ∑_{q∈P} w(q) f(q, Y)
5      Pick a random point y_i from P, where y_i = p with probability Pr_i(p) for every p ∈ P, and insert it into Y, i.e., Y := Y ∪ {y_i}
6  return Y
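A compact sketch of this sampling loop is given below. It assumes, as in the original k-means++, that each point p is picked with probability proportional to w(p) · f(p, Y); the helper names and the handling of an empty Y are ours:

```python
import random

def kmeans_pp(P, w, f, X, t, rng=random):
    """Sketch of KMEANS++(P, w, f, X, t): starting from the seed set X,
    sample t further centers; each point p is picked with probability
    proportional to w(p) * f(p, Y), where f(p, Y) = min over y in Y of f(p, y)."""
    Y = list(X)
    for _ in range(t):
        if Y:
            weights = [w[p] * min(f(p, y) for y in Y) for p in P]
        else:
            # Our convention for an empty Y: sample by the weights alone.
            weights = [w[p] for p in P]
        total = sum(weights)
        if total == 0:  # every point is already covered exactly
            Y.append(rng.choice(P))
            continue
        r = rng.uniform(0, total)  # inverse-CDF sampling over P
        acc = 0.0
        for p, wp in zip(P, weights):
            acc += wp
            if acc >= r:
                Y.append(p)
                break
    return Y
```

Note that f here may be any ρ-metric, e.g., a truncated squared distance min{dist²(p, y), c}, which is exactly the M-estimator case discussed in the introduction.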
(i) Base case: t = 0. We then have the claimed bound, where the first two equalities hold since Y = X is not random, and the inequality holds since t = 0, H_0 = 1 and f*(U, u) ≥ 0. Hence, the lemma holds for t = 0 and any u ≥ 0.
(ii) Inductive step: t ≥ 1. Let y ∈ P denote the first sampled point that was inserted to Y \ X during the execution of Line 5 in Algorithm 1, and let X′ = X ∪ {y}. Let j ∈ [k] such that y ∈ P_j, and let U′ = U \ P_j denote the union of the remaining "uncovered" clusters, whose number is u′ = |{P_1, · · · , P_u} \ {P_j}|, i.e., u′ ∈ {u, u − 1}.
The distribution of Y conditioned on the known sample y is the same as the output of a call to KMEANS++(P, w, f, X′, t′), where t′ = t − 1. Hence, we need to bound the conditional expectation in (24). We will bound each of its last two terms by expressions that are independent of X′ or U′.
Bound on E[ f(P, Y) | y ∈ P \ U ]. Here u′ = u ∈ [k] and U′ = U (independent of j), and by the inductive assumption that the lemma holds after replacing t with t′ = t − 1, we obtain (25). Bound on E[ f(P, Y) | y ∈ U ]. In this case, U′ = U \ P_j and u′ = u − 1 ∈ {0, · · · , k − 1}. Hence, (26) holds. Put m ∈ [u] and x ∈ P_m. We remove the above dependency of E[ f(P, Y) | y ∈ U ] upon x and then m.
We have (27), where the first inequality holds by the inductive assumption if u′ ≥ 1, or since U′ = U \ P_j = ∅ if u′ = 0. The second inequality holds since X ⊆ X′ = X ∪ {x}, and since f*(U \ P_m, u − 1) = f*(U, u) − f*(P_m, 1).
Only the term f(P_m, X′) = f(P_m, X ∪ {x}) depends on x and not only on m. Summing it over every possible x ∈ P_m, where the inequalities follow by substituting Q = P_m in Lemma 6 and Lemma 5, respectively, shows that the expected value of (27) over x is nonpositive. Combining this with (26) and then (27) yields a bound (28) on E[ f(P, Y) | y = x ] that is independent of x. It is left to remove the dependency on m, which occurs in the last term f(U \ P_m, X) of (28). We have (29). By Jensen's (or the power-mean) inequality, for every convex function g : R → R and a real vector v = (v_1, · · · , v_u), we have (1/u) ∑_{m=1}^u g(v_m) ≥ g( (1/u) ∑_{m=1}^u v_m ). Specifically, this holds for g(z) := z² and v = ( f(P_1, X), · · · , f(P_u, X) ). Multiplying by u and plugging the result in (29) gives the desired bound on the term f(U′, X), where the last inequality holds since f(U, X) ≤ f(P, X). Plugging the last inequality in (28) bounds E[ f(P, Y) | y ∈ U ], as in (30). Bound on E[ f(P, Y) ]. Plugging (30) and (25) in (24) yields (31). Firstly, we have (32), where the last inequality holds as u ≥ t. Secondly, since U ⊆ P, f(U, X) ≤ f(P, X). Hence, we can bound (31) as claimed; this proves the inductive step.

Corollary 2.
Let δ ∈ (0, 1], and let q_0 be a point that is sampled at random from a nonempty set Q ⊆ P such that q_0 = q with probability w(q)/∑_{q′∈Q} w(q′). Then, with probability at least 1 − δ, the claimed bound holds.

Proof. By Markov's inequality, for every nonnegative random variable G and δ > 0 we have Pr[ G ≥ E[G]/δ ] ≤ δ. Substituting G = f_w(Q, q_0) yields (34). By Lemma 5 we have (35). Plugging (35) in (34) yields the claim.

The following theorem is a particular case of Lemma 7. It proves that the output of KMEANS++ (see Algorithm 1) is an O(log k)-approximation of its optimum.

Theorem 1. Let (P, f) be a ρ-metric and let k ≥ 2 be an integer; see Definition 1. Let w : P → (0, ∞), δ ∈ (0, 1), and let Y be the output of a call to KMEANS++(P, w, f, ∅, k); see Algorithm 1. Then |Y| = k, and with probability at least 1 − δ,

f(P, Y) ≤ (8ρ²/δ²)(1 + ln k) · f*(P, k).

Moreover, Y can be computed in O(ndk) time.
Proof. Let δ′ = δ/2, and let {P_1, · · · , P_k} be an optimal partition of P, i.e., ∑_{i=1}^k f*(P_i, 1) = f*(P, k). Let p_0 be a point that is sampled at random from P_k such that p_0 = p with probability w(p)/∑_{p′∈P_k} w(p′).
Applying Lemma 7 with u = t = k − 1 and X = {p_0} yields (37) and (38), where (38) holds by the definition of f* and P_k. By plugging Q = P_k in Corollary 2, with probability at least 1 − δ′ over the randomness of p_0 we have (39), and since ρ ≥ 1, with probability at least 1 − δ′ we also have (40). Plugging (40) in (38) yields that, with probability at least 1 − δ′ over the randomness of p_0, (41) holds. Regarding the randomness of Y, by Markov's inequality we have (42). By (41) we have (43). Using the union bound on (42) and (43) we obtain Pr[ f(P, Y) ≥ (8ρ²/δ²)(1 + ln k) f*(P, k) ] ≤ δ, and thus Pr[ f(P, Y) < (8ρ²/δ²)(1 + ln k) f*(P, k) ] ≥ 1 − δ.

k-Means with m Outliers
In this section we consider the problem of k-means with m outliers of a given set P, i.e., where the cost function for a given set X of k centers is f (P X , X) instead of f (P, X), and P X is the subset of the closest n − m points to the centers. Ties are broken arbitrarily. That is, P \ P X can be considered as the set of m outliers that are ignored in the summation of errors.
Definition 4 (k-median with m outliers). Let (P, f) be a ρ-metric, n = |P|, and let k, m ∈ [n] be a pair of integers. For every subset Q ⊆ P and a set X ⊆ P of centers, denote by Q_X the n − m points of Q that are closest to X. A set X* of k centers (points in P) is a k-median with m outliers of P if f(P_{X*}, X*) ≤ min_{X⊆P, |X|=k} f(P_X, X).
For α ≥ 0, a set Y is an α-approximation to the k-median of P with m outliers if f(P_Y, Y) ≤ α · f(P_{X*}, X*). Note that we allow Y to have |Y| > k centers. This problem is harder and more general than those of the previous sections, since for m = 0 we recover the k-median problem.
We prove in this section that our generalized k-means++ can serve as a "weaker" type of coreset, which admits an additive error but still yields a constant factor approximation for the k-means with m outliers if ρ is constant.
In fact, we prove that any approximated solution to the (k + m)-median of P implies such a coreset, but Algorithm 1 is both simple and general for any ρ-distance. Moreover, by taking more than k + m centers, e.g., running Algorithm 1 for additional iterations, we may reduce the approximation error α of the coreset to obtain a "strong" regular coreset, i.e., one that introduces a (1 + ε)-approximation for any given query set X of k centers. This is done by having an α < ε approximation for the (k + m)-median; see Theorem 2. Upper and lower bounds for the number |X| of centers needed to obtain such α < ε are the subject of [58]. The authors prove that a constant approximation to the |X| = O((1/ε)^d k log n)-median of P yields such an α = O(ε)-approximation to the k-median. This implies that running Algorithm 1 for O((1/ε)^d (k + m) log n) iterations would yield a coreset that admits a (1 + ε)-approximation error for any given set of k centers with their farthest m outliers, as traditional coresets do. Unfortunately, the lower bounds of [58] show that the exponential dependency on d is unavoidable. However, the counterexample is extremely artificial. For real-world data, we may simply run Algorithm 1 until the approximation error f(P, X) is sufficiently small, and hope this happens after a few iterations due to the structure of the data.
We state our result for the metric case and unweighted input. However, the proof essentially uses only the triangle inequality and its variants of Section 3. For simplicity of notation, we use the term multiset C instead of a weighted set (C, w), where C ⊆ P, and w : P → [0, ∞) denote the number of copies of each item in C. The size |C| of a multiset denotes its number of points (including duplicated points), unless stated otherwise.
Unlike the case of k-means/median, there are few solutions to k-means with m outliers. Chen [50] provided a (multiplicative) constant factor approximation for the k-median with m outliers on any metric space (that satisfies the triangle inequality, i.e., with ρ = 1) that runs in O(n³ log(n) d k² (k + m)²) time, i.e., polynomial in both k and m. Applying this solution to our suggested coreset, as explained in Corollary 3, might reduce the dependency on n from cubic to linear, since the coreset construction time is linear in n and also polynomial in k + m. In order to obtain a more general result for any ρ-distance, as in the previous section, we use a simple exhaustive search that takes time exponential in k + m but still O(n) for every constant k, m and ρ.
Our solution also implies streaming and parallel algorithms for the k-median with m outliers on distributed data. This is simply because many k-median algorithms exist for these computation models, and they can be used to construct the "coreset" in Theorem 2 after replacing k with k + m. An existing offline nonparallel solution for the k-median with m outliers, e.g., from the previous paragraph, may then be applied to the resulting coreset to extract the final approximation, as explained in Corollary 4.
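The projection coreset used below is cheap to build once an approximate center set Y is known. The following is a minimal sketch (with our function names): each point of P is replaced by its closest point proj(p, Y) in Y, and multiplicities play the role of the cluster-size weights:

```python
from collections import Counter

def project_coreset(P, Y, f):
    """Build the multiset C = proj(P, Y): replace each point of P by its
    closest center in Y; multiplicities record the cluster sizes."""
    def proj(p):
        return min(Y, key=lambda y: f(p, y))  # ties broken arbitrarily
    return Counter(proj(p) for p in P)
```

The resulting multiset has at most |Y| distinct points, so any downstream solver only sees O(k + m) weighted points instead of n.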
For every point p ∈ P, let proj(p, Y) ∈ Y denote its closest point in Y. For a subset Q of P, define proj(Q, Y) = {proj(p, Y) | p ∈ Q} to be the union (multiset) of its |Q| projections on Y. Ties are broken arbitrarily. Recall the definitions of P_X, C_X and α-approximation from Definition 4. We are now ready to prove the main technical result for the k-median with m outliers.

Theorem 2 (coreset for k-median with m outliers). Let (P, f) be a (ρ, φ, ε)-metric space, and let n = |P|. Let k, m ∈ [n] such that k + m < n, and let Y be an α-approximation for the (k + m)-median of P (without outliers). Let C := proj(P, Y). Then, for every set X ⊆ P of |X| = k centers, the bounds (44) and (45) below hold.

Proof. Let X ⊆ P be a set of |X| = k centers. The multiset proj(P_X, Y) contains n − m points that are contained in the k + m centers of Y. However, we do not know how to approximate the number of copies of each center in proj(P_X, Y) without using P_X. One option is to guess a (1 + ε)-approximation to the weight of each of these k + m points, by observing that it is (1 + ε)^i for some i ∈ O(log(n)/ε), and then use exhaustive search. However, the running time would be exponential in k + m, and the weights depend on the query X.
Instead, we observe that while we cannot compute proj(P_X, Y) via C, we have the upper bound f(C_X, X) ≤ f(proj(P_X, Y), X). This follows from the fact that proj(P_X, Y) ⊆ proj(P, Y) = C and |proj(P_X, Y)| = n − m, which yields (46). We now bound the rightmost term. For a single point p ∈ P we have (47) and (48), where (47) holds by substituting x = f(p, X) and y = f(proj(p, Y), X) (or vice versa) and Z = X in Lemma 2, and (48) holds by the definition of f(p, Y). Summing (48) over every p ∈ Q for some set Q ⊆ P yields (49)-(51), where (49) holds by the definition of f, (50) holds by the triangle inequality, and (51) is by (48). The remaining term is bounded in (52) and (53), where the first inequality holds since Q ⊆ P, and the last inequality holds since Y is an α-approximation for the (k + m)-median of P. Plugging (53) in (51) yields (54). The rest of the proof is by the following case analysis: (i) f(C_X, X) ≥ f(P_X, X); (ii) f(C_X, X) < f(P_X, X).
Case (i). The bound for this case is given in (55) and (56), where (55) holds by (46), and (56) holds by substituting Q = P_X in (54). This bounds (44) for the case f(C_X, X) ≥ f(P_X, X).
The motivation for (44) is to obtain a traditional coreset, in the sense of a (1 + ε)-approximation to any given set of k centers. To this end, we need (φ + ερ)α < ε. For the natural case where ρ = O(1), this implies an α < ε approximation to the (k + m)-means cost f*(P, k + m) of P. This is doable for every input set P in the Euclidean space and many others, as proved in [58], but at the price of taking (1/ε)^d k log n centers. This is why we also added the bound (45), which suffices to obtain a "weaker" coreset that yields only a constant factor approximation to the k-means with m outliers, but uses only k + m centers, as explained in the following corollary.
To avoid the many parameters of the last theorems, in the next corollary we assume that they are all constant and suggest a simple solution to the k-median with m outliers by running an exhaustive search on our weaker coreset. The running time is exponential in k, but this may be fixed by running the more involved polynomial-time approximation of Chen [50] for k-means with m outliers on our coreset.

Corollary 4.
Let k, m ∈ [n] such that k + m < n, and let ρ ≥ 1 be constants. Let (P, f) be a ρ-metric. Then a set X ⊆ P can be computed in O(n) time such that, with probability at least 0.99, X is an O(ln(k + m))-approximation for the k-median with m outliers of P.
Proof. Given a set Q of |Q| = n′ points, it is easy to compute its k-median with m outliers in n′^O(k) time as follows. We run an exhaustive search over every subset Y of size |Y| = k in Q. For each such subset Y, we compute its farthest m points Z in Q. The set Y that minimizes f(Q \ Z, Y) is an optimal solution, since one of these sets of k centers is a k-median with m outliers of Q. Since there are (n′ choose k) = n′^O(k) such subsets, and each check requires O(n′k) = O(n′) time, the overall running time is t(n′) = n′^O(k).
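The exhaustive search described in this proof can be written directly. The following is a sketch under the stated assumptions (Q small, e.g., a coreset of O(k + m) points; f any nonnegative dissimilarity); the function name is ours:

```python
from itertools import combinations

def exhaustive_k_median_m_outliers(Q, k, m, f):
    """Exhaustive search: for every subset Y of k centers, drop the m
    farthest points of Q and keep the cheapest Y, as in the proof above."""
    best, best_cost = None, float("inf")
    for Y in combinations(Q, k):
        d = sorted(min(f(p, y) for y in Y) for p in Q)
        cost = sum(d[: len(Q) - m])  # exclude the m farthest points
        if cost < best_cost:
            best, best_cost = list(Y), cost
    return best, best_cost
```

Run on a coreset of constant size, each call takes constant time, so the O(n) total time is dominated by the coreset construction.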

Conclusions
We proved that the k-means++ algorithm can be generalized to handle outliers, by generalizing it to ρ-metrics that support M-estimators, and also proved that k-means++ seeding can be used as a coreset for the k-means with m outliers. Open problems include generalizations of k-means++ to other shapes, such as k lines or k multidimensional subspaces, and using these approximations to develop coreset algorithms for those problems. Other directions include improving or generalizing the constant factor approximations for the original k-means++ and its variants in this paper, using recent improvements in the analysis of k-means++ and coresets for k-means clustering.