Coresets for the Average Case Error for Finite Query Sets

A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik–Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset problem to the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors that approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size, for n ≫ 1/ε, where ε is the approximation error, improving, e.g., both [ICML'17] and applications for principal component analysis (PCA) [NIPS'16]. Experimental results show significant and consistent improvement also in practice. Open source code is provided.


Introduction
In this paper, we assume that the input is a set P of items, called points. Usually, P is simply a finite set of n points in R^d or another metric space. In the context of PAC (probably approximately correct) learning [1] or empirical risk minimization [2], it represents the training set. In supervised learning, every point in P may also include its label or class. We also assume a given function w : P → (0, ∞), called a weights function, that assigns a "weight" w(p) > 0 to every point p ∈ P. The weights function represents a distribution of importance over the input points, where the natural choice is the uniform distribution, i.e., w(p) = 1/|P| for every p ∈ P. We are also given a (possibly infinite) set X, called the set of queries [3], which represents candidate models or hypotheses, e.g., neural networks [4], SVMs [5], or a set of vectors in R^d with tuning parameters as in linear/ridge/lasso regression [6–8].
In machine learning, and PAC-learning in particular, we often seek to compute the query that best describes our input data P for prediction, classification, or clustering tasks. To this end, we define a loss function f : P × X → R that assigns a fitting cost f(p, x) to every point p ∈ P with respect to a query x ∈ X. For example, it may be a kernel function [9], a convex function [10], or an inner product. The tuple (P, w, X, f) is called a query space and represents the input to our problem. In this paper, we wish to approximate the weighted sum of losses f_w(P, x) = ∑_{p∈P} w(p) f(p, x).
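The weighted sum of losses above can be computed directly as follows; this is a minimal sketch, where the squared-distance loss is only a hypothetical example choice of f:

```python
import numpy as np

def f_w(P, w, x, f):
    """Weighted sum of losses: f_w(P, x) = sum over p of w(p) * f(p, x)."""
    return sum(w_p * f(p, x) for p, w_p in zip(P, w))

# Example with a hypothetical squared-distance loss f(p, x) = ||p - x||^2.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
w = np.full(len(P), 1.0 / len(P))              # uniform weights w(p) = 1/|P|
loss = lambda p, x: float(np.sum((p - x) ** 2))
print(f_w(P, w, np.zeros(2), loss))            # weighted sum of losses at x = 0
```

A coreset (C, u) replaces (P, w) in exactly this computation, so any solver that only evaluates f_w can run unchanged on the compressed pair.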

Methodology.
Our main tool for approximating the sum of losses above is called a coreset: a (small) weighted subset of the (potentially huge) input data, from which the desired sum of losses can be recovered very quickly, with guarantees on the induced error. A key ingredient is the sup-sampling distribution s over P, where s(p) = w(p)|f(p, x)|/T and T = ∑_{q∈P} w(q)|f(q, x)|. Since f_w(P, x) = T · g_s(P, x), approximating g_s(P, x) up to an ε error yields an error of εT for f_w(P, x). Therefore, the sample size is reduced from M(f)/ε² to T²/ε² when T² ≤ M(f). Here, we sample |S| = O(T²/ε²) points from P according to the distribution s, and re-weight each sampled point p by u(p) = T/(|S| · |f(p, x)|). Unlike traditional PAC-learning, the sample is now non-uniform: it is drawn proportionally to s(p) rather than w, as implied by the Hoeffding inequality for non-uniform distributions. For sets of queries, we generalize the definition to s(p) = sup_{x∈X} w(p)|f(p, x)| / ∑_{q∈P} w(q)|f(q, x)| for every p ∈ P, as in [19], which is especially useful for the coresets below.
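The sup-sampling scheme above can be sketched as follows for a finite query set, where the sup becomes a maximum over X; the function names and the toy loss are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sup_sample(P, w, f, X, size, rng=np.random.default_rng(0)):
    """Sketch of sup-sampling: draw points with probability proportional to
    s(p) = sup_x w(p)|f(p,x)| / sum_q w(q)|f(q,x)|, and re-weight the sample
    so that the weighted sum of losses stays an unbiased estimator."""
    n = len(P)
    s = np.zeros(n)
    for x in X:  # finite query set: the sup is a max over the queries in X
        vals = np.array([w[i] * abs(f(P[i], x)) for i in range(n)])
        s = np.maximum(s, vals / vals.sum())
    prob = s / s.sum()
    idx = rng.choice(n, size=size, p=prob)
    u = 1.0 / (size * prob[idx])  # new weights for the sampled points
    return idx, u

# Toy example with a hypothetical scalar loss f(p, x) = p[0] * x.
P = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
w = [1.0 / 3.0] * 3
f = lambda p, x: float(p[0] * x)
idx, u = sup_sample(P, w, f, [1.0, -2.0], 5000)
est = sum(u[k] * w[idx[k]] * f(P[idx[k]], 1.0) for k in range(len(idx)))
```

Note how the estimator ∑_{p∈S} u(p) w(p) f(p, x) concentrates: sampling proportionally to the loss magnitude makes the summed terms nearly constant, which is exactly why the sample size depends on T² rather than on M(f).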

Coresets: A Data Summarization Technique for Approximating the Sum of Losses
A coreset for a given query space (P, w, X, g), in this and many other papers, is a pair (C, u) that is similar to an ε-sample in the sense that C ⊆ P and u : C → [0, ∞) is a weights function. However, the additive error ε is now replaced with a multiplicative error 1 ± ε, i.e., for every x ∈ X we require that |g_w(P, x) − g_u(C, x)| ≤ ε · g_w(P, x). Dividing by g_w(P, x) and assuming g_w(P, x) > 0 yields ∀x ∈ X : |1 − g_u(C, x)/g_w(P, x)| ≤ ε.
Coresets are especially useful for learning big data, since an off-line and possibly inefficient coreset construction for "small data" implies constructions that maintain coresets for streaming, dynamic (including deletions), and distributed data in parallel. This is via a simple and easy-to-implement framework that is sometimes called merge-reduce trees; see [20,21]. The fact that a coreset approximates every query (and not just the optimal one for some criterion) implies that we may solve hard optimization problems with non-trivial and non-convex constraints by running a possibly inefficient algorithm, such as exhaustive search, on the coreset, or by running existing heuristics numerous times on the small coreset instead of once on the original data. Similarly, parameter tuning or cross-validation can be applied to a coreset that is computed once for the original data, as explained in [22].
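The streaming use of a coreset construction can be sketched as follows; this is a deliberate simplification of the merge-reduce framework (a single buffer rather than a full tree), and the exact mean compressor is only a hypothetical stand-in for a real coreset construction:

```python
import numpy as np

def stream_summary(stream, capacity, compress):
    """Buffer incoming points with unit weights; whenever the buffer
    exceeds `capacity`, replace it by a (weighted) compression of itself.
    This is the single-buffer flavor of the merge-reduce idea."""
    pts, wts = [], []
    for p in stream:
        pts.append(np.asarray(p, dtype=float))
        wts.append(1.0)
        if len(pts) > capacity:
            P2, w2 = compress(np.array(pts), np.array(wts))
            pts, wts = [q for q in P2], [float(v) for v in w2]
    return np.array(pts), np.array(wts)

# Hypothetical exact compressor for the weighted-sum query: replace the
# buffer by its weighted mean, carrying the total weight.
mean_compress = lambda P2, w2: (np.average(P2, axis=0, weights=w2)[None, :],
                                np.array([w2.sum()]))

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
C, u = stream_summary(data, 50, mean_compress)
```

Because each compression step preserves the quantity of interest (here, the weighted sum), the final small summary answers the query for the entire stream, which is exactly the property the merge-reduce tree exploits.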
An ε-coreset for a query space (P, w, X, g) is simply an ε-sample for the query space (P, w, X, f), after defining f(p, x) := g(p, x)/g_w(P, x), as explained, e.g., in [19]. By defining the error for a single x as err(x) = |1 − g_u(C, x)/g_w(P, x)| = |f_w(P, x) − f_u(C, x)| = err_f(x), we obtain an error vector for the coreset, err(X) = (err(x))_{x∈X}. We can then rewrite (3) as in (2). In the case of coresets, the value sup_{x∈X} w(p) f(p, x) = sup_{x∈X} w(p) g(p, x)/g_w(P, x) of a point p ∈ P is called its sensitivity [14], leverage score (in ℓ_2 approximations) [23], Lewis weight (in ℓ_p approximations), or simply importance [24].

Problem Statement: Average Case Analysis for Data Summarization.
Average-case analysis (e.g., [25]) was suggested about a decade ago as an alternative to the (sometimes infamous) worst-case analysis of algorithms in theoretical computer science. The idea is to replace the analysis for the worst-case input by the average input (in some sense). Inspired by this idea, a natural variant of (2) and its above implications is an ε-sample that approximates the average query well. We suggest to define an (ε, ‖·‖)-sample by ‖err(X)‖ = ‖f_w(P, X) − f_u(S, X)‖ ≤ ε, which generalizes (2) from ‖·‖_∞ to any norm, such as the ℓ_z norm ‖err(X)‖_z. For example, the ℓ_2, MSE, or Frobenius norm yields a bound on the average squared error over the queries. A generalization of the Hoeffding inequality from 1963 with tight bounds was suggested relatively recently for the ℓ_z norm for any z ≥ 2 and many other norms [26,27]. There, a single query (|X| = 1), a distribution weights function, and a bound on sup_{p∈P} |f(p, X)| that determines the size of the sample are assumed, as in the Hoeffding inequality.
A less obvious question, which is the subject of this paper, is how to compute deterministic ε-samples that satisfy (4) for norms other than the infinity norm. While the Caratheodory theorem suggests deterministic constructions of 0-samples (for any error norm), as explained above, our goal is to obtain a coreset whose size is smaller than, or independent of, |X|.
The next question is how to generalize the idea of sup-sampling, i.e., the case where the function f is unbounded, to norms other than ‖·‖_∞. Our main motivation for doing so is to obtain new and smaller coresets by combining the notion of ε-sample with sup-sampling or sensitivity, as explained above for the ‖·‖_∞ case. That is, we wish for a coreset for a given query space that bounds the non-∞ norm of the error. To summarize, our questions are: how can we smooth the error function and approximate the "average" query via (i) deterministic ε-samples (for DAC-learning), and (ii) coresets (via sensitivities/sup-sampling for non-infinity norms)?

Our Contribution
We answer these questions affirmatively by suggesting ε-samples and coresets for the average query. We focus on the case z = 2, i.e., the Frobenius norm, and a finite query set X, and hope that this will inspire research on and applications of other norms and general sets. For suggestions in this direction and future work, see Section 5.2. The main results of this paper are the following constructions of an (ε, ‖·‖_2)-sample (S, u) for any given finite query space (P, w, X, f) as defined in (5): (i) a deterministic construction that returns a coreset of size |S| ∈ O(1/ε²) in O(min{nd/ε², nd + d log²(n)/ε⁴}) time; see Theorem 2 and Corollary 4. (ii) A randomized construction that returns such a coreset (of size |S| ∈ O(1/ε²)) with high probability, in time that is independent of the number of input points; see Lemma 5 and Algorithm 4. This result is of independent interest for faster and sparser convex optimization. To our knowledge, this is also the first application of sensitivity outside the coreset regime.

Overview and Organization
The rest of the paper is organized as follows. First, in Section 2, we list the applications of our proposed methods, such as a faster coreset construction algorithm for least mean squares (LMS) solvers. We also compare our results to the state of the art to justify our practical contribution.
In Section 3, we first give our notation and relevant mathematical definitions, and then explain the relation between the problem of computing an (ε, ‖·‖_2)-sample (average-case coreset) and the problem of computing a vector summarization coreset, where the goal (of the vector summarization coreset problem) is to compute a weighted subset of the n input vectors that approximates their sum. Here, we suggest a coreset for this problem of size O(1/ε) in O(nd/ε) time; see Theorem 2 and Algorithm 2. Then, in Section 3.1, we show how to improve the running time of this result and compute a coreset of the same size in O(nd + d log²(n)/ε²) time; see Corollary 4 and Algorithm 3. In addition, we suggest a non-deterministic coreset of the same size, but in time that is independent of the number of points n; see Lemma 5 and Algorithm 4.
In Section 4, we explain how our vector summarization coreset results can be used to improve all the previously mentioned applications (from Section 2). In Section 5, we conduct various experiments on real-world datasets, where we apply the different coreset construction algorithms presented in this paper to a variety of applications, in order to boost their running time or reduce their memory storage. We also compare our results to many competing methods. Finally, we conclude the paper and discuss future work in Section 5.2. Due to space limitations and for ease of reading, the proofs of the claims are placed in Appendices A–I.

On the Applications of Our Method and the Current State of the Art
In what follows, we present some of the applications of our theoretical contributions and discuss the current state-of-the-art coreset/sketch methods in terms of running time for each application. Figure 1 summarizes the main applications of our results.
(i) Vector summarization: the goal is to maintain the sum of a (possibly infinite) stream of vectors in R^d, up to an additive error of ε multiplied by their variance. This is a generalization of frequent items/directions [28]. As explained in [29], the main real-world application is extracting and compactly representing groups and activity summaries of users from underlying data exchanges. For example, GPS traces in mobile networks can be exploited to identify meetings, and exchanges of information in social networks shed light on the formation of groups of friends. Our algorithm tackles these applications by providing a provable solution to the heavy hitters problem in proximity matrices, which can be used to extract and compactly represent friend groups and activity summaries of users from underlying data exchanges. We propose a deterministic algorithm which reduces each subset of n vectors to O(1/ε) weighted vectors in O(nd + d log²(n)/ε²) time, improving upon the O(nd/ε) time of [29] (the current state of the art in terms of running time) for sufficiently large n; see Corollary 4, and Figures 2 and 3. We also provide a non-deterministic coreset construction in Lemma 5. The merge-and-reduce tree can then be used to support streaming, distributed, or dynamic data. (ii) Kernel density estimates (KDE): by replacing ε with ε² in the vector summarization, we obtain a fast construction of an ε-coreset for the KDE of Euclidean kernels [17]; see more details in Section 4. Kernel density estimation is a technique for estimating a probability density function (continuous distribution) from a finite set of points, which often describes the studied probability distribution better than a traditional histogram [30,31]. (iii) 1-mean problem: a coreset for 1-mean, which approximates the sum of squared distances over a set of n points to any given center (point) in R^d.
This problem arises in facility location problems (e.g., computing the optimal location for placing an antenna such that all the customers are satisfied). Our deterministic construction computes such a weighted subset of size O(1/ε²) in O(min{nd/ε², nd + d log²(n)/ε⁴}) time.
Previous results [19,32–34] suggested coresets for this problem. Unlike our results, these works are either non-deterministic, the coreset is not a subset of the input, or the size of the coreset is linear in d. (iv) Coresets for LMS solvers and dimensionality reduction: for example, a deterministic construction for singular value decomposition (SVD) that gets a matrix A ∈ R^{n×d} and returns a weighted subset of O(k²/ε²) rows, such that their weighted distance to any k-dimensional non-affine (or affine, in the case of PCA) subspace approximates the distance of the original points to this subspace. SVD and PCA are very common algorithms (see [35]) and can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses. Thus, improving them may be helpful for a wide range of real-world applications. In this paper, we propose a deterministic coreset construction that takes O(nd² + d²k⁴ log²(n)/ε⁴) time, improving upon the state-of-the-art result of [35], which requires O(nd²k²/ε²) time; see Table 1. Many non-deterministic coreset constructions were suggested for these problems; the construction techniques apply non-uniform sampling [36–38], Monte-Carlo sampling [39], and leverage score sampling [23,40–45].

Vector Summarization Coreset
Notation 1. We denote [n] = {1, · · · , n}. For a vector v ∈ R^d, the 0-norm, denoted ‖v‖_0, is the number of non-zero entries in v. We denote by e^(i) the ith standard basis vector in R^n and by 0 the vector (0, · · · , 0)^T ∈ R^n. A vector w ∈ [0, 1]^n is called a distribution vector if all its entries are non-negative and sum to one. For a matrix A ∈ R^{m×n} and i ∈ [m], j ∈ [n], we denote by A_{i,j} the jth entry of the ith row of A. A weighted set is a pair (Q, m), where Q = {q_1, · · · , q_n} ⊆ R^d is a set of n points and m = (m_1, · · · , m_n)^T ∈ R^n is a weights vector that assigns every q_i ∈ Q a weight m_i ∈ R. A matrix A ∈ R^{d'×d} is orthogonal if A^T A = I, i.e., its columns are orthonormal. Adaptations. To adapt the query space (P, w, X, f) and the notation of the previous sections to the techniques that we use, we restate (4) as follows. Previously, we denoted the queries by X = {x_1, · · · , x_d} and the input set by P = {p_1, · · · , p_n}. Now, each input point p_i in the input set P corresponds to a point q_i = (f(p_i, x_1), · · · , f(p_i, x_d)) ∈ R^d, i.e., each entry of q_i equals f(p_i, x) for a different query x. Throughout the rest of the paper, for technical reasons and simplicity, we may alternate between the weights function notation and the weights vector notation. In such cases, the weights function w : P → [0, ∞) and the weight w(q_i) of q_i, i ∈ [n], are replaced by a vector of weights m = (m_1, · · · , m_n) ∈ [0, ∞)^n and m_i, respectively, and vice versa. In such cases, the ε-sample is represented by a sparse vector. From (ε, ‖·‖_2)-samples to ε-coresets. We now define an ε-coreset for vector summarization, which is a re-weighting of the input weighted set (Q, m) by a new weights vector u, such that the squared norm of the difference between the weighted means of (Q, u) and (Q, m) is small. This relates to Section 1.3, where a (√ε, ‖·‖_2)-sample there (in Section 1.3) is an ε-coreset for the vector summarization here.
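The adaptation above (one coordinate per query) can be sketched as follows; the inner-product loss in the example is a hypothetical choice for illustration:

```python
import numpy as np

def to_vectors(P, X, f):
    """Map each input point p_i to q_i = (f(p_i, x_1), ..., f(p_i, x_d)):
    one coordinate per query in the finite query set X, reducing the
    (eps, l2)-sample problem to vector summarization in R^|X|."""
    return np.array([[f(p, x) for x in X] for p in P])

# Toy example: three points, two queries, inner-product loss.
P = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
X = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
Q = to_vectors(P, X, lambda p, x: float(p @ x))
```

After this mapping, approximating the weighted sum of the q_i approximates f_w(P, x) for every query x simultaneously, with the per-query errors collected in one vector.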

Analysis flow.
In what follows, we (first) assume that the points of our input set P lie inside the unit ball (∀p∈P : ‖p‖ ≤ 1). For such an input set, we present a construction of a variant of a vector summarization coreset, where the error is ε and does not depend on the variance of the input. This construction is based on the Frank–Wolfe algorithm [48]; see Theorem 1 and Algorithm 1. It works by reducing the problem to maximizing a concave function f(x) over the unit simplex. Such problems can be solved approximately by a simple greedy algorithm known as the Frank–Wolfe algorithm.
Algorithm 1: FRANK-WOLFE(f, K); Algorithm 1.1 of [48]
1: Input: A concave function f : R^n → R, and the number of iterations K.
2: Output: A vector x ∈ R^n that satisfies Theorem 1.
3: x^(0) := the unit n-simplex vertex with the largest f value.
4: for k ∈ {0, · · · , K} do
5: {greedy Frank–Wolfe update over the simplex vertices; see Algorithm 1.1 of [48]}
6: end for
We then present a proper coreset construction in Algorithm 2 and Theorem 2 for a general input set Q in R^d. This algorithm is based on a reduction to the simpler case of points inside the unit ball; see Figure 1 for an illustration. This reduction is inspired by the sup-sampling (see Section 1): there (in Section 1), the functions are normalized (to obtain values in [−1, 1]) and re-weighted (to obtain an unbiased estimator), and the bounds are then easily obtained using the Hoeffding inequality. Here, we apply different normalizations and re-weightings, and instead of the non-deterministic Hoeffding inequality, we suggest a deterministic version using the Frank–Wolfe algorithm. Our new suggested normalizations (and re-weightings) allow us to generalize the result to many more applications, as in Section 4. For brevity, all proofs of the technical results can be found in Appendices A–I.
Theorem 1 (Coreset for points in the unit ball). Let P = {p_1, · · · , p_n} be a set of n points in R^d such that ‖p_i‖ ≤ 1 for every i ∈ [n]. Let ε ∈ (0, 1), and let w = (w_1, · · · , w_n)^T ∈ [0, 1]^n be a distribution vector. Let ũ be the output of a call to FRANK-WOLFE(f, 8/ε); see Algorithm 1. Then ũ is a sparse distribution vector whose weighted mean approximates that of (P, w) up to the stated error.
We now show how to obtain a vector summarization ε-coreset.
Theorem 2 (Vector summarization coreset). Let (Q, m) be a weighted set of n points in R^d, let ε ∈ (0, 1), and let u be the output of a call to CORESET(Q, m, ε/16); see Algorithm 2. Then u ∈ R^n is a vector with ‖u‖_0 ≤ 128/ε non-zero entries that is computed in O(nd/ε) time, and (Q, u) is a vector summarization ε-coreset for (Q, m).
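The Frank–Wolfe reduction described above can be sketched as follows; this is a minimal illustration of the technique (greedy steps over simplex vertices with the standard 2/(k+2) step size), not the paper's exact Algorithm 1:

```python
import numpy as np

def frank_wolfe_coreset(P, w, K):
    """Approximate the weighted mean mu = sum_i w_i p_i by a convex
    combination x with at most K + 1 non-zeros, by greedily minimizing
    g(x) = ||P.T @ x - mu||^2 over the unit simplex (equivalently,
    maximizing the concave f = -g, as in Algorithm 1)."""
    n, d = P.shape
    mu = w @ P
    # start at the simplex vertex with the smallest g value (largest f)
    x = np.zeros(n)
    x[np.argmin(np.linalg.norm(P - mu, axis=1))] = 1.0
    for k in range(K):
        grad = 2.0 * P @ (P.T @ x - mu)  # gradient of g at the current x
        i = int(np.argmin(grad))         # best simplex vertex (greedy step)
        gamma = 2.0 / (k + 2)            # standard Frank-Wolfe step size
        x = (1 - gamma) * x
        x[i] += gamma
    return x  # for points in the unit ball: ||P.T @ x - mu|| = O(1/sqrt(K))

rng = np.random.default_rng(1)
P = rng.normal(size=(500, 5))
P /= np.maximum(1.0, np.linalg.norm(P, axis=1))[:, None]  # unit-ball input
w = np.full(500, 1.0 / 500)
x = frank_wolfe_coreset(P, w, 200)
```

Since the mean lies in the convex hull of the points, the optimum of g is zero, and the standard O(1/K) convergence of Frank–Wolfe gives the ε error after O(1/ε) iterations with an equally sparse weight vector.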

Boosting the Coreset's Construction Running Time
In this section, we present Algorithm 3, which aims to boost the running time of Algorithm 1 from the previous section; see Theorem 3. The main idea behind this new boosted algorithm is as follows: instead of running the Frank–Wolfe algorithm on the (full) set of input data, it can be more efficient to partition the input into a constant number k of equal-sized chunks, pick a representative for each chunk (its mean), run the Frank–Wolfe algorithm only on the set of representatives (the set of means) to obtain a subset of those representatives, and then continue recursively only with the chunks whose representative was chosen by the algorithm. Although the Frank–Wolfe algorithm is now applied multiple times (rather than once), each of those runs is much more efficient since only the small set of representatives is considered.
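The recursive chunking idea can be sketched structurally as follows; note that `base_coreset` here is a deliberately naive placeholder (keep the heaviest points, rescaled to preserve the total weight), standing in for the Frank–Wolfe construction and carrying none of its guarantees:

```python
import numpy as np

def base_coreset(Q, m, eps):
    # Placeholder base construction (NOT the paper's Frank-Wolfe step):
    # keep the ceil(1/eps) heaviest points, rescaled to preserve total weight.
    k = max(1, int(np.ceil(1.0 / eps)))
    idx = np.argsort(m)[-k:]
    return idx, m[idx] * (m.sum() / m[idx].sum())

def fast_coreset(P, w, eps, k=8):
    """Recursive speed-up sketch: partition into k chunks, summarize each
    chunk by its weighted mean, select chunks by running the base
    construction on the k means only, then recurse into selected chunks."""
    n = len(P)
    if n <= k:  # small enough: run the base construction directly
        idx, u = base_coreset(P, w, eps)
        return P[idx], u
    chunks = np.array_split(np.arange(n), k)
    means = np.array([np.average(P[c], axis=0, weights=w[c]) for c in chunks])
    mw = np.array([w[c].sum() for c in chunks])
    sel, su = base_coreset(means, mw, eps)
    parts_P, parts_u = [], []
    for j, uw in zip(sel, su):
        c = chunks[j]
        sub_P, sub_u = fast_coreset(P[c], w[c] / w[c].sum(), eps, k)
        parts_P.append(sub_P)
        parts_u.append(sub_u * uw)  # rescale by the chunk's selected weight
    return np.vstack(parts_P), np.concatenate(parts_u)
```

Each level of the recursion touches only k representatives, so the expensive construction runs on small inputs throughout, which is the source of the improved O(nd + d log²(n)/ε²) running time.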
This immediately implies a faster construction time of vector summarization ε-coresets for general input sets; see Corollary 4 and Figure 1 for an illustration. Theorem 3 (Faster coreset for points in the unit ball). Let P be a set of n points in R^d such that ‖p‖ ≤ 1 for every p ∈ P. Let w : P → (0, 1) be a weights function such that ∑_{p∈P} w(p) = 1, let ε ∈ (0, 1), and let (C, u) be the output of a call to FAST-FW-CORESET(P, w, ε); see Algorithm 3. Then (i) |C| ≤ 8/ε and ∑_{p∈C} u(p) = 1. Corollary 4 (Faster vector summarization coreset). Let (Q, m) be a weighted set of n points in R^d, and let ε ∈ (0, 1).
Then, in O(nd + d log²(n)/ε²) time, we can compute a vector u = (u_1, · · · , u_n)^T ∈ R^n, such that u has ‖u‖_0 ≤ 128/ε non-zero entries and (Q, u) is a vector summarization (2ε)-coreset for (Q, m).
Algorithm 3 (excerpt):
5: Return a vector summarization ε-coreset for (P, w) using Theorem 1.
6: end if
7: {P_1, · · · , P_k} := a partition of P into k disjoint subsets, each containing at most n/k points.
8: for every i ∈ {1, · · · , k} do
9: µ_i := the weighted mean of P_i; w'(µ_i) := its total weight.
11: end for
12: (μ̃, ũ) := a vector summarization (ε/log(n))-coreset for the weighted set ({µ_1, · · · , µ_k}, w') using Theorem 1.
13: C := ∪_{µ_i∈μ̃} P_i {C is the union of all subsets P_i whose mean µ_i was chosen in μ̃.}
14: for every µ_i ∈ μ̃ and p ∈ P_i do
15: {assign p its new weight via ũ(µ_i)}
In what follows, we show how to compute a vector summarization coreset, with high probability, in time that is sublinear in the input size |Q| = n. This is based on the geometric median trick, which suggests the following procedure: (i) sample k > 1 sets {S_1, · · · , S_k} of the same (small) size from the original input set Q; (ii) for each such sampled set S_i (i ∈ [k]), compute its mean s_i; and finally, (iii) compute and return the geometric median of those means s = {s_1, · · · , s_k}. This geometric median is guaranteed to approximate the mean of the original input set Q.
We show that there is no need to compute this geometric median, which is a difficult computational task. We prove that there exists a set S_{i*} among the sampled subsets such that its mean s_{i*} is very close to this geometric median, with high probability. Thus, s_{i*} is a good approximation to the desired mean of the original input set. Furthermore, we show that s_{i*} is simply the point in s that minimizes the sum of (non-squared) distances to the set s, i.e., s_{i*} ∈ argmin_{s'∈s} ∑_{s''∈s} ‖s' − s''‖. An exhaustive search over the points of s can thus recover s_{i*}. The corresponding set S_{i*} is the resulting vector summarization coreset; see Lemma 5 and Algorithm 4.
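The procedure above can be sketched as follows; the parameters (number of blocks, block size) are illustrative, not the ones analyzed in Lemma 5:

```python
import numpy as np

def median_of_means(Q, k, block_size, rng=np.random.default_rng(0)):
    """Sketch of the geometric-median trick: draw k small samples, compute
    their means, and return the mean that minimizes the sum of
    (non-squared) distances to the other means -- avoiding an exact
    geometric-median computation via exhaustive search over the k means."""
    n = len(Q)
    means = np.array([Q[rng.choice(n, block_size)].mean(axis=0)
                      for _ in range(k)])
    costs = [np.linalg.norm(means - m, axis=1).sum() for m in means]
    return means[int(np.argmin(costs))]

rng = np.random.default_rng(2)
Q = rng.normal(size=(10000, 3))
est = median_of_means(Q, k=9, block_size=500)
```

The exhaustive search in step (iii) is over only k candidate means, so the whole procedure runs in time independent of the full input size n once the samples are drawn.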

Applications
Coreset for 1-mean. A 1-mean ε-coreset for (Q, m) is a weighted set (Q, u) such that for every x ∈ R^d, the sum of squared distances from x to (Q, m) and to (Q, u) is approximately the same. To maintain the above property, we prove that it suffices for (Q, u) to satisfy the following: the mean, the variance, and the sum of weights of (Q, u) should approximate the mean, the variance, and the sum of weights of (Q, m), respectively, up to an additive error that depends linearly on ε. Note that when plugging in ε² (rather than ε) as input to Algorithm 2, the output is guaranteed to satisfy the above three properties, by the construction of u.
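The reason these three quantities suffice is the standard expansion of the 1-mean cost, which the following sketch verifies numerically:

```python
import numpy as np

def one_mean_cost(Q, m, x):
    """Weighted sum of squared distances to a center x."""
    return float(np.sum(m * np.sum((Q - x) ** 2, axis=1)))

def cost_from_moments(Q, m, x):
    """Same cost via the identity
    sum_i m_i ||q_i - x||^2 = sum_i m_i ||q_i||^2
                              - 2 <x, sum_i m_i q_i> + ||x||^2 sum_i m_i.
    The cost depends on (Q, m) only through the sum of weighted squared
    norms, the weighted sum, and the total weight -- so matching these
    three quantities preserves the 1-mean cost for every center x."""
    return float(np.sum(m * np.sum(Q ** 2, axis=1))
                 - 2.0 * x @ (m @ Q) + (x @ x) * m.sum())

rng = np.random.default_rng(3)
Q = rng.normal(size=(30, 4))
m = rng.random(30)
x = rng.normal(size=4)
```

Any re-weighting u that approximately matches these three moments of (Q, m) therefore approximates the cost simultaneously for all centers, which is exactly the coreset guarantee.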
Theorem 6. Let (Q, m) be a weighted set of n points in R^d and ε ∈ (0, 1). Then, in O(min{nd/ε², nd + d log²(n)/ε⁴}) time, we can compute a vector u = (u_1, · · · , u_n)^T ∈ R^n, with ‖u‖_0 ≤ 128/ε², such that the three properties above (approximation of the mean, the variance, and the sum of weights) hold. Coreset for KDE. Given two sets of points Q and Q', and a kernel K, the difference between the kernel costs of Q and Q' is upper bounded by ‖µ_Q̃ − µ_Q̃'‖², where µ_Q̃ and µ_Q̃' are the means of Q̃ = {φ(q) | q ∈ Q} and Q̃' = {φ(q) | q ∈ Q'}, respectively [49]. Given Q̃, we can compute a vector summarization ε²-coreset Q̃', which satisfies ‖µ_Q̃ − µ_Q̃'‖₂² ≤ ε². By the above argument, this is also an ε-KDE coreset. Coreset for dimensionality reduction and LMS solvers. An ε-coreset for the k-SVD (k-PCA) problem of Q is a small weighted subset of Q that approximates the sum of squared distances from the points in Q to every non-affine (affine) k-dimensional subspace of R^d, up to a multiplicative factor of 1 ± ε; see Corollary 7. A coreset for LMS solvers is the special case k = d − 1.
In [35], it is shown how to leverage an (ε/k)²-coreset for the vector summarization problem to compute an ε-coreset for k-SVD. In [45], it is shown how to compute a coreset for k-PCA via a coreset for k-SVD, by simply adding another entry with some value r ∈ R to each vector of the input. Algorithm 5 combines both of these reductions, along with a computation of a vector summarization (ε/k)²-coreset, to compute the desired coreset for dimensionality reduction (both k-SVD and k-PCA). To compute the vector summarization coreset, we utilize our new algorithms from the previous sections, which are faster than the state-of-the-art algorithms.
Algorithm 5: DIM-CORESET (excerpt):
1: Input: A matrix A ∈ R^{n×d}, an integer k ∈ [d], and an error parameter ε ∈ (0, 1).
2: Output: A diagonal matrix W ∈ R^{n×n} that satisfies Corollary 7.
6: ṽ_i := the row stacking of v_i v_i^T ∈ R^{d×d} for every i ∈ [n]
7: ({ṽ_1, · · · , ṽ_n}, u) := a vector summarization (ε/(5k))²-coreset for ({ṽ_1, · · · , ṽ_n}, (1, · · · , 1)).
9: Return W
Corollary 7 (Coreset for dimensionality reduction). Let Q be a set of n points in R^d, and let A ∈ R^{n×d} be a corresponding matrix containing the points of Q in its rows. Let ε ∈ (0, 1/2) be an error parameter, let k ∈ [d] be an integer, and let W be the output of a call to DIM-CORESET(A, k, ε). Then W is computed in the running time stated above, and (iii) there is a constant c such that for every ℓ ∈ R^d and every orthogonal X ∈ R^{d×(d−k)}, the weighted sum of squared distances induced by W approximates that of A up to a multiplicative factor of 1 ± ε. Here, A − ℓ is the matrix obtained by subtracting ℓ from every row of A.
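The identity behind the reduction in step 6 (row-stacking the outer products v_i v_i^T) can be verified numerically as follows; this sketch only demonstrates the identity, not Algorithm 5 itself:

```python
import numpy as np

def sq_dists_to_subspace(A, X):
    """Sum of squared distances from the rows of A to the non-affine
    subspace whose orthogonal complement has orthonormal basis X
    (columns): ||A @ X||_F^2."""
    return float(np.linalg.norm(A @ X) ** 2)

def via_vector_sum(A, X):
    """The same quantity written as an inner product with the SUM of the
    row-stacked outer products vec(v_i v_i^T): this identity is what
    reduces the k-SVD coreset problem to vector summarization."""
    S = sum(np.outer(v, v).ravel() for v in A)  # sum of vec(v_i v_i^T)
    return float(S @ (X @ X.T).ravel())         # <sum_i v_i v_i^T, X X^T>

rng = np.random.default_rng(4)
A = rng.normal(size=(20, 5))
X, _ = np.linalg.qr(rng.normal(size=(5, 2)))    # orthonormal complement basis
```

Since the distance to every subspace is a fixed linear functional of the sum ∑_i vec(v_i v_i^T), a weighted subset approximating that sum approximates the distances to all subspaces simultaneously.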
Where do our methods fit in? Theoretically speaking, the 1-mean problem (also known as the arithmetic mean problem) is a widely used tool for reporting central tendencies in statistics, and it is also used in machine learning. As for the practical aspect of this problem, it can be used to estimate the expected signal strength in an area [50], or as an imputation technique for filling in missing values, e.g., in the context of heart monitor sensor data [51]. Note that a variant of this problem, namely moving averages, is widely used in the context of deep learning. Algorithms 3 and 4 can boost such methods when given large-scale datasets. In addition, our algorithms extend to SVD, PCA, and LMS, methods known for their usefulness and efficiency in discovering a low-dimensional representation of high-dimensional data. From a practical point of view, SVD has shown promising results in the on-orbit calibration of a star sensor [52], denoising 4-dimensional computed tomography of the brain in stroke patients [53], and the removal of cardiac interference from trunk electromyograms [54], among many other applications.
We propose a summarization technique (see Algorithm 5) that computes an approximation of the SVD factorization of large-scale datasets for which applying the SVD factorization directly is not possible due to insufficient memory or long computational time.

Experimental Results
We now apply the different coreset construction algorithms presented in this paper to a variety of applications, in order to boost their running time or reduce their memory storage. We note that complete open source code is provided [55].
We compare the following algorithms (to easily distinguish between our algorithms and the competing ones in the graphs, note that the labels of our algorithms start with the prefix "Our-", while the competing methods do not): (i) Uniform: a uniform random sample of the input Q, which requires sublinear time to compute. (ii) Sensitivity-sum: random sampling based on the "sensitivity" for the vector summarization problem [58]. Sensitivity sampling is a widely known technique [19], which guarantees that a subsample of sufficient size approximates the input well. The sensitivity of a point q ∈ Q is 1/n + ‖q‖²/∑_{q'∈Q} ‖q'‖². This algorithm takes O(nd) time. (iii) ICML17: the vector summarization coreset construction algorithm from [29] (see Algorithm 2 there), which runs in O(nd/ε) time. (iv) Our-rand-sum: our coreset construction from Lemma 5, which runs in time sublinear in n. (v) Our-slow-sum: our coreset construction from Theorem 2, which requires O(nd/ε) time. (vi) Our-fast-sum: our coreset construction from Corollary 4, which requires O(nd + d log²(n)/ε²) time. (vii) Sensitivity-svd: similar to Sensitivity-sum above; however, the sensitivity is now computed by projecting the rows of the input matrix A onto the optimal k-subspace (or an approximation of it) that minimizes the sum of squared distances to the rows of A, and then computing the sensitivity of each row i of the projected matrix A as ‖u_i‖², where u_i is the ith row of the matrix U from the SVD A = UDV^T; see [37].
Datasets. We used the following datasets from the UCI ML library [59]: (i) New York City Taxi Data [60]. The data covers taxi operations in New York City. We used the data describing n = 14.7M trip fares from the year 2013, with the d = 6 numerical features (real numbers).
(ii) US Census Data (1990) [61]. The dataset contains n = 2.4M entries. We used all d = 68 real-valued attributes of the dataset. (iii) Buzz in Social Media Data Set [62]. It contains n = 0.5M examples of buzz events from two different social networks: Twitter and Tom's Hardware. We used all d = 77 real-valued attributes. (iv) Gas Sensors for Home Activity Monitoring Data Set [63]. This dataset has n = 919,438 recordings of a gas sensor array composed of 8 MOX gas sensors and a temperature and humidity sensor. We used the last d = 10 real-valued attributes of the dataset.
Discussion regarding the chosen datasets. The Buzz in Social Media dataset is widely used in the context of principal component regression (PCR, for short), which estimates the unknown regression coefficients in a standard linear regression model. The goal of PCR in the context of this dataset is to predict the popularity of a certain topic on Twitter over a period. It is known that the solution of the PCR problem can be approximated using the well-known SVD decomposition. Our techniques enable us to benefit from the coreset advantages, e.g., boosting the approximated PCR solution (PCA) while using low memory and supporting the streaming model, by maintaining a coreset for the data (tweets) seen so far: each time a new point (tweet) is received, it is added to the coreset currently stored in memory. Once the stored coreset is large enough, our compression (coreset construction algorithm) is applied. This procedure is repeated until the stream of points is exhausted.
The New York City taxi data contains information about the locations of passengers as well as the locations of their destinations. Thus, the goal is to find a location that is close to the most popular destinations. This problem can be formulated as a facility location problem, which can be reduced to an instance of the 1-mean problem. Hence, since our methods admit a faster solution as well as a provable approximation for the facility location problem, we can leverage our coreset to speed up computations on this dataset.
Finally, regarding the remaining datasets, PCA has been widely used either for low-dimensional embedding or, e.g., to compute the arithmetic mean. By using our methods, we can boost the PCA while admitting an approximated solution.
The experiments.
(i) Vector summarization: the goal is to approximate the mean of a huge input set using only a small weighted subset of the input. The empirical approximation error is defined as ‖µ − µ_s‖₂, where µ is the mean of the full data and µ_s is the mean of the weighted subset computed by each compared algorithm; see Figures 2 and 3.
In Figure 2, we report the empirical approximation error ‖µ − µ_s‖₂ as a function of the subset (coreset) size for each of the datasets (i)–(ii), while in Figure 3 we report the overall computational time for computing the subset (coreset) and for solving the 1-mean problem on the coreset, as a function of the subset size. (ii) k-SVD: the goal is to compute the optimal k-dimensional non-affine subspace of a given input set. We can compute the optimal subspace either using the original (full) input set or using a weighted subset (coreset) of the input. We denote by S* and S' the optimal subspace computed using the full data and using the subset at hand, respectively. The empirical approximation error is defined as the ratio |(c* − c')/c*|, where c* and c' are the sums of squared distances between the points of the original input set and S* and S', respectively; see Figures 4–6. Intuitively, this ratio represents the relative SSD error of recovering an optimal k-dimensional non-affine subspace from the compression rather than from the full data.
In Figure 4 we report the empirical error |(c* − c′)/c*| as a function of the coreset size. In Figure 5 we report the overall computational time it took to compute the coreset and to recover the optimal subspace using the coreset, as a function of the coreset size. Both figures contain three subfigures, one for each chosen value of k (the dimension of the subspace). Finally, in Figure 6 the x-axis is the size of the dataset (which we compress to a subset of size 150), while the y-axis is the approximation error in the left-hand graph and, in the right-hand graph, the overall computational time it took to compute the coreset and to recover the optimal subspace using the coreset.

Discussion
Vector summarization experiment: As predicted by the theory and as demonstrated in Figures 2 and 3, our fast and deterministic algorithm Our-fast-sum (the red line in the figures) achieves the same or smaller approximation errors in most cases compared to the deterministic alternatives Our-slow-sum (orange line) and ICML17 (green line), while being up to 10× faster. Hence, when a fast deterministic coreset for the vector summarization problem is sought, our algorithm Our-fast-sum is the favorable choice.
Compared to the randomized alternatives, Our-fast-sum is naturally slower, but achieves an error more than three orders of magnitude smaller. Moreover, our fast and randomized algorithm Our-rand-sum (brown line) consistently achieves better results than the other randomized alternatives: it yields an approximation error up to 50× smaller while maintaining the same computational time. This is demonstrated on both datasets. Hence, our compression can be used to speed up tasks such as computing the PCA or PCR, as described above.
k-SVD experiment: In Figures 4-6 we witness a similar phenomenon, where our fast and deterministic algorithm Our-fast-svd achieves the same or smaller approximation errors compared to the deterministic alternatives Our-slow-svd and NIPS16, respectively, while being up to 4× faster. Compared to the randomized alternatives, Our-fast-svd is slower, as predicted, but achieves an error up to two orders of magnitude smaller. This is demonstrated for increasing sample sizes (Figures 4 and 5), for increasing dataset size (Figure 6), and for various values of k (Figures 4-6).

Conclusions and Future Work
This paper generalizes the definition of ε-sample and coreset from the worst-case error over every query to the average error over the given queries. We then showed a reduction from the problem of computing such coresets to the vector summarization coreset construction problem, and suggested deterministic and randomized algorithms for computing such coresets; the deterministic version takes O(min{nd/ε, nd + d log²(n)/ε²}) time, and the randomized version is faster still. Finally, we showed how to leverage an (ε²)-coreset for the vector summarization problem in order to compute an ε-coreset for the 1-mean problem, and similarly for the k-SVD and k-PCA problems via computing an (ε/k)² vector summarization coreset after some preprocessing of the data.
Open problems include generalizing these results to other types of norms, or to other functions such as M-estimators that are robust to outliers. We hope that the source code and the promising experimental results will encourage practitioners as well to use these new types of approximations. Normalization via this new sensitivity type reduced the bounds on the number of iterations of the Frank-Wolfe algorithm by orders of magnitude. We believe that it can be used more generally for provably faster convex optimization, independently of coresets or ε-samples. We leave this for future research.

Conflicts of Interest:
The authors declare the following conflict of interest: Prof. Dan Feldman, one of the authors, is a guest editor of the Special Issue "Sensor Data Summarization: Theory, Applications, and Systems".

Appendix A. Problem Reduction for Vector Summarization ε-Coresets
First, we define a normalized weighted set, which is simply a weighted set that satisfies three properties: its weights sum to one, its weighted mean is the origin, and it has unit variance.
Definition A1 (Normalized weighted set). A normalized weighted set is a weighted set (P, w), where P = {p_1, · · · , p_n} ⊆ R^d and w = (w_1, · · · , w_n)^T ∈ R^n, that satisfies the following properties: (a) Weights sum to one: ∑_{i=1}^n w_i = 1; (b) The weighted sum is the origin: ∑_{i=1}^n w_i p_i = 0; and (c) Unit variance: ∑_{i=1}^n w_i ‖p_i‖² = 1.

Appendix A.1. Reduction to Normalized Weighted Set
In this section, we argue that in order to compute a vector summarization ε-coreset for an input weighted set (Q, m), it suffices to compute a vector summarization ε-coreset for its corresponding normalized (and much simpler) weighted set (P, w) as in Definition A1; see Corollary A1. However, first, in Observation A1, we show how to compute a corresponding normalized weighted set (P, w) for any input weighted set (Q, m).
Observation A1. Let Q = {q_1, · · · , q_n} be a set of n ≥ 2 points in R^d, let m ∈ (0, ∞)^n, and let w ∈ (0, 1]^n be the distribution vector w = m/‖m‖₁. Let µ = ∑_{i=1}^n w_i q_i and σ = (∑_{i=1}^n w_i ‖q_i − µ‖²)^{1/2}. Let P = {p_1, · · · , p_n} be the set of n points in R^d such that p_j = (q_j − µ)/σ for every j ∈ [n]. Then (P, w) is the corresponding normalized weighted set of (Q, m), i.e., properties (a)-(c) of Definition A1 hold: property (a) holds since w is a distribution vector; property (b) holds since ∑_{i=1}^n w_i p_i = (1/σ)(∑_{i=1}^n w_i q_i − µ ∑_{i=1}^n w_i) = (1/σ)(µ − µ) = 0, where the first equality holds by the definition of p_i, and the second holds by the definition of µ and since w is a distribution vector; and property (c) holds since ∑_{i=1}^n w_i ‖p_i‖² = (1/σ²) ∑_{i=1}^n w_i ‖q_i − µ‖² = 1, where the first and second equalities hold by the definitions of p_i and σ, respectively.
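The normalization of Observation A1 can be sketched as follows (Python/NumPy; the input weighted set is an arbitrary illustrative example):

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary weighted set (Q, m) with positive weights.
Q = rng.normal(loc=3.0, scale=2.0, size=(500, 4))
m = rng.uniform(0.5, 2.0, size=500)

# Normalization as in Observation A1.
w = m / np.linalg.norm(m, 1)                        # w = m / ||m||_1
mu = w @ Q                                          # weighted mean
sigma = np.sqrt(w @ np.sum((Q - mu) ** 2, axis=1))  # weighted deviation
P = (Q - mu) / sigma                                # p_j = (q_j - mu) / sigma
```

By construction, (P, w) satisfies the three properties of Definition A1: the weights form a distribution, the weighted sum of P is the origin, and the weighted variance is one.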
Corollary A1. Let (Q, m) be a weighted set, and let (P, w) be its corresponding normalized weighted set as computed in Observation A1. Let (P, u) be a vector summarization ε-coreset for (P, w), and let u′ = ‖m‖₁ · u. Then (Q, u′) is a vector summarization ε-coreset for (Q, m).
Proof. Put x ∈ R^d and let y = (x − µ)/σ. Now, for every j ∈ [n], we have that q_j − x = σ(p_j − y), (A1) where the equality is by the definitions of y and p_j. Let (P, u) be a vector summarization ε-coreset for (P, w). We prove that (Q, u′) is a vector summarization ε-coreset for (Q, m). The desired bound for (Q, u′) follows from a chain of equalities, where the first equality holds since w = m/‖m‖₁, the second holds by (A1), and the last inequality holds since (P, u) is a vector summarization ε-coreset for (P, w).
Define A to be the d × n matrix whose i-th column is the i-th point in P, and let µ = ∑_{i=1}^n w_i p_i. We can then rewrite the objective in terms of A, where the second equality holds by the definition of µ, and the fourth equality holds since ∑_{i=1}^n x_i p_i = Ax for every x ∈ R^n. In Section 2.2 of [48], it was shown that for any quadratic function f : R^n → R defined as f(x) = x^T M x + b^T x + a, where M is a negative semidefinite n × n matrix, b ∈ R^n is a vector, and a ∈ R, we have C_f ≤ diam(A′S)², where A′ ∈ R^{d×n} is a matrix that satisfies M = −A′^T A′; see Equality (12) in [48]. Hence, plugging a = −‖µ‖², b = 2A^T µ, and M = −A^T A into (A5) yields that for the function f we have C_f ≤ diam(AS)². Observe that x and y are distribution vectors. Since ‖p_i‖ ≤ 1 for each i ∈ [n], the diameter diam(AS) is bounded. By substituting C_f ≤ 2 in (2), we get the desired bound. Multiplying both sides of the inequality by 8 and rearranging proves Theorem (ii). The dominating term is the multiplication of a matrix in R^{n×d} by a vector in R^d, which takes O(nd) time. Hence, the running time of the algorithm is O(nd/ε).
We also have, by Theorem 1, the corresponding bound (A13), where the first derivation is by the definition of u′ at Line 11 of Algorithm 2, the second holds by the definitions of p′, w′ and u at Lines 8, 9, and 12 of the algorithm, the third holds since ũ = u/‖m‖₁, and the last inequality holds since ‖(x | y)‖² ≥ ‖x‖² for every x ∈ R^d and y ∈ R. Combining the fact that ∑_{i=1}^n w_i p_i = 0 with (A13) yields (A14). By (A12), and since w is a distribution vector, we also have (A15). Combining (A15) and (A14) yields (A16), where the second inequality holds since ε′ = ε/16 ≤ 1/16. By Lemma A2, Corollary A1, and (A16), Theorem 2 holds.

Appendix E. Proof of Theorem 3
Proof of Theorem 3. We use the notation and variable names as defined in Algorithm 3. First, we assume that w(p) > 0 for every p ∈ P; otherwise, we remove all points of P that have zero weight, since they do not contribute to the weighted sum. Identify the input set P = {p_1, · · · , p_n}, and write the set C computed at Line 13 of Algorithm 3 as C = {c_1, · · · , c_|C|}. We first prove that the weighted set (C, u) computed in Lines 13-15 at an arbitrary iteration satisfies properties (a)-(d) below; in particular, |C| ≤ |P|/2. Let (μ̃, ũ) be the vector summarization (ε/log(n))-coreset of the weighted set ({µ_1, · · · , µ_k}, w′) that is computed during the execution of the current iteration at Line 12. Hence, (A17) holds by Theorem 1. Proof of (a). Property (i) is satisfied by Line 13, as we have C ⊆ P. Proof of (b). Property (ii) is also satisfied, where the first equality holds by the definition of C at Line 13 and of the weight assigned to every p ∈ C at Line 15, and the third equality holds by the definition of ũ(µ_i) for every µ_i ∈ μ̃ as in Line 10.
Proof of (c). By the definition of w′ and µ_i, (A19) holds for every i ∈ {1, · · · , k}. The weighted sum of (C, u) satisfies (A20), where the first equality holds by the definitions of C and u, and the third equality holds by the definition of µ_i at Line 9. Plugging (A19) and (A20) into (A17) satisfies (iii). Proof of (d). By (A17), C contains at most log(n)/ε clusters from P, and hence at most |C| ≤ (log(n)/ε) · (n/k) points; plugging k = 2 log(n)/ε yields |C| ≤ |P|/2, as required.
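The bookkeeping of one such halving step can be sketched as follows (Python/NumPy; `summarize_means` is a hypothetical stand-in for the inner vector-summarization coreset of Algorithm 1, implemented here as a plain uniform choice to show the weight arithmetic only, with no approximation guarantee):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 4096, 3, 0.5
P = rng.normal(size=(n, d))
w = np.full(n, 1.0 / n)                    # distribution over P

def summarize_means(means, size):
    # Hypothetical stand-in for Algorithm 1: a uniform random choice of
    # `size` means with uniform weights (no approximation guarantee).
    idx = rng.choice(len(means), size=size, replace=False)
    return idx, np.full(size, 1.0 / size)

# Partition P into k = 2 log(n)/eps clusters and compute weighted means.
k = int(2 * np.log(n) / eps)
clusters = np.array_split(rng.permutation(n), k)
cluster_w = np.array([w[c].sum() for c in clusters])               # w'_i
means = np.array([w[c] @ P[c] / w[c].sum() for c in clusters])     # mu_i

# Summarize the k means by at most log(n)/eps of them.
keep, u_means = summarize_means(means, int(np.log(n) / eps))

# C is the union of the kept clusters; each point's weight is its old
# weight scaled by (coreset weight of its cluster) / (cluster weight).
C_idx = np.concatenate([clusters[i] for i in keep])
u = np.concatenate([u_means[j] * w[clusters[i]] / cluster_w[i]
                    for j, i in enumerate(keep)])
```

This reweighting makes the weighted sum of (C, u) equal to the weighted sum of the kept means, and keeps |C| at most (log(n)/ε)·(n/k) = n/2, which is the halving used in the proof of (d).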
We now prove (i)-(iii) from Theorem 3. Proof of Theorem 3 (i). The first condition in (i), |C| ≤ 8/ε, is satisfied since at each iteration we halve the data size, and we keep halving until we reach the stopping condition, i.e., size O(log(n)/ε) by Theorem 1 (since we require an ε/log(n) error when applying Theorem 1, we need a coreset of size O(log(n)/ε)). Then, at Line 5, when the "if" condition is satisfied (as it eventually must be, as explained), we apply Theorem 1 once more to obtain a coreset of size 8/ε with ε error on the small data (of size O(log(n)/ε)). The second condition in (i) is satisfied since, whenever we return such a pair (C, u) at Line 18, we get by (b) that the sum of weights is always equal to 1.
Proof of Theorem 3 (ii). By (d) we also get that there are at most log(n) recursive calls. Hence, by induction on (2), we conclude that the last computed set (C, u) at Line 18 satisfies (ii). At Line 18 we return an ε-coreset for the input weighted set (P, w), which has reached size O(log(n)/ε). Hence, the output of the call satisfies ‖∑_{p∈P} w(p) · p − ∑_{p∈C} u(p) · p‖² ≤ 2ε.
Proof of Theorem 3 (iii). As explained above, there are at most log(n) recursive calls before the stopping condition at Line 4 is met. At each iteration we compute the set of means μ̃ and compute a vector summarization (ε/log n)-coreset for them. Hence, the time complexity of each iteration is O(n′d) + T(k, d, ε/log(n)), where n′ is the number of points in the current iteration, and T(k, d, ε/log(n)) is the running time of Algorithm 1 on k points in R^d for obtaining an (ε/log(n))-coreset. Thus, the total running time of the algorithm until the "if" condition at Line 4 is satisfied is bounded accordingly. Proof. The corollary immediately holds by using Algorithm 2 with one small change: we change Line 11 in Algorithm 2 to use Algorithm 3 and Theorem 3 instead of Algorithm 1 and Theorem 1.

Appendix G. Proof of Lemma 5
We first prove the following lemma: Lemma A4. Let P be a set of n points in R^d, µ = (1/n) ∑_{p∈P} p, and σ² = (1/n) ∑_{p∈P} ‖p − µ‖². Let ε, δ ∈ (0, 1), and let S be a sample of m = 1/(εδ) points chosen i.i.d. uniformly at random from P. Then, with probability at least 1 − δ, we have that ‖(1/m) ∑_{p∈S} p − µ‖² ≤ εσ². Proof. For any random variable X, we denote by E(X) and var(X) the expectation and variance of X, respectively. Let x_i denote the random variable that is the i-th sample, for every i ∈ [m]. Since the samples are drawn i.i.d., we have var((1/m) ∑_{i=1}^m x_i) = σ²/m. (A22) For any random variable X and error parameter ε′ > 0, the generalized Chebyshev inequality [64] reads Pr(‖X − E(X)‖ ≥ ε′) ≤ var(X)/ε′². (A23) Substituting X = (1/m) ∑_{p∈S} p, E(X) = µ, and ε′ = √ε · σ in (A23) yields (A24). Combining (A22) with (A24) proves the lemma, as the failure probability is at most σ²/(mεσ²) = 1/(mε) = δ. Now we prove Lemma 5. Proof. Let {S_1, · · · , S_k} be a set of k i.i.d. sampled subsets, each of size 4/ε, as defined at Line 5 of Algorithm 4, and let s_i be the mean of the i-th subset S_i, as defined at Line 6. Let ŝ := argmin_{x∈R^d} ∑_{i=1}^k ‖s_i − x‖ be the geometric median of the set of means {s_1, · · · , s_k}.
Using Corollary 4.1 from [65], we obtain (A26), and from the above we have (A27). Note that (A28) holds by substituting k = 3.5 log(1/δ) + …. For every i ∈ [k], by substituting S = S_i, which is of size 4/ε, in Lemma A4, we obtain that Pr(‖s_i − µ‖² ≥ εσ²) ≤ 1/4. Hence, with probability at least 1 − (1/4)^k, there is at least one set S_j such that ‖s_j − µ‖² ≤ εσ². By the inequalities above, we get that with probability at least 1 − δ there is a set S_j such that (A31) holds. Combining (A31) with (A30) yields that, with probability at least (1 − δ)², the set S_j satisfies (A32) for every x ∈ R^d. Therefore, the definitions of f and ŝ yield the corresponding bound. Observe that f is a convex function, since it is a sum of convex functions. By the convexity of f, we get that (A33) holds for every pair of points p, q ∈ P. Therefore, by the definition of i* in Algorithm 4, we get (A34). Combining (A32) with (A34) yields (A35). Combining (A35) with (A30), together with the remaining inequality, satisfies Lemma 5.
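The sampling-and-boosting scheme behind Lemma A4 and Lemma 5 can be sketched as follows (Python/NumPy; the dataset and parameter values are illustrative, and the medoid of the means is used as a cheap proxy for the geometric median ŝ):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20_000, 3
P = rng.exponential(size=(n, d))           # an illustrative skewed input
mu = P.mean(axis=0)
sigma2 = float(np.mean(np.sum((P - mu) ** 2, axis=1)))

eps, delta = 0.1, 0.05
k = int(np.ceil(3.5 * np.log(1.0 / delta)))   # number of subsets
m = int(np.ceil(4.0 / eps))                   # each subset has 4/eps points

# k i.i.d. uniform samples; s[i] is the mean of the i-th subset (by Lemma
# A4, each s[i] is an (eps * sigma^2)-accurate estimate w.p. >= 3/4).
s = P[rng.integers(0, n, size=(k, m))].mean(axis=1)

# Pick the mean closest to the others (medoid), boosting the success
# probability from 3/4 towards 1 - delta, in the spirit of Algorithm 4.
D = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
i_star = int(D.sum(axis=1).argmin())
estimate = s[i_star]

err = float(np.linalg.norm(estimate - mu) ** 2)
```

The medoid choice replaces the geometric-median optimization with a pairwise-distance argmin, which is the same convexity trick the proof uses to relate i* to ŝ.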

Appendix H. Proof of Theorem 6
We first show a reduction to a normalized weighted set, as follows: Corollary A5. Let (Q, m) be a weighted set, and let (P, w) be its corresponding normalized weighted set as computed in Observation A1. Let (P, u) be a 1-mean ε-coreset for (P, w), and let u′ = ‖m‖₁ · u. Then (Q, u′) is a 1-mean ε-coreset for (Q, m).
Proof. Let (P, u) be a 1-mean ε-coreset for (P, w). We prove that (Q, u′) is a 1-mean ε-coreset for (Q, m). Observe that (A36) holds, where the first equality holds by (A1), and the second holds by the definitions of w and u′. Since (P, u) is a 1-mean ε-coreset for (P, w), (A38) holds, where the equality holds by (A1) and since w = m/‖m‖₁. The proof concludes by combining (A36) and (A38).

1-Mean Problem Reduction
Given a normalized weighted set (P, w) as in Definition A1, in the following lemma we prove that a weighted set (P, u) is a 1-mean ε-coreset for (P, w) whenever three properties related to the mean, variance, and weights of (P, u) hold.
Lemma A6. Let (P, w) be a normalized weighted set of n points in R^d, let ε ∈ (0, 1), and let u ∈ R^n be such that properties (1)-(3) hold. Then (P, u) is a 1-mean ε-coreset for (P, w), i.e., the coreset inequality holds for every x ∈ R^d. Proof. First, (A41) holds, where the last equality follows by properties (a)-(c) of the normalized weighted set (P, w). By rearranging the left-hand side of (A39), we get the chain (A42)-(A45), where (A43) holds by the triangle inequality, (A44) holds by properties (a)-(c), and (A45) holds by combining assumptions (2) and (3) with the Cauchy-Schwarz inequality. We also have 2ab ≤ a² + b² for every a, b ≥ 0; hence (A46) holds. By (A46) and assumption (1), we get (A47). Lemma A6 now follows by plugging (A47) into (A45), as in (A48), where the last equality holds by (A41). Observe that if assumptions (1), (2) and (3) hold, then (A48) holds, and we therefore obtain an ε-coreset.
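The identity behind Lemma A6 is that the 1-mean cost ∑_i u_i ‖p_i − x‖² depends on (P, u) only through the three quantities ∑_i u_i, ∑_i u_i p_i, and ∑_i u_i ‖p_i‖². The following sketch (Python/NumPy; the moment-matched weights are found by least squares and may be negative, unlike the sparse weights the paper's algorithms produce) demonstrates that matching these moments exactly reproduces the cost at every query x:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, s = 1000, 4, 20
P = rng.normal(size=(n, d))
w = np.full(n, 1.0 / n)

# Pick s points and solve for weights u matching the three moments:
# sum(u) = 1, sum(u_i p_i) = weighted mean, sum(u_i ||p_i||^2) = 2nd moment.
idx = rng.choice(n, size=s, replace=False)
Ps = P[idx]
A = np.vstack([np.ones(s), Ps.T, np.sum(Ps ** 2, axis=1)])   # (d+2) x s
b = np.concatenate([[1.0], w @ P, [w @ np.sum(P ** 2, axis=1)]])
u, *_ = np.linalg.lstsq(A, b, rcond=None)   # min-norm exact solution

def one_mean_cost(points, weights, x):
    # Weighted sum of squared distances to the query x.
    return float(weights @ np.sum((points - x) ** 2, axis=1))

# The costs agree (up to floating-point error) for any query x, since
# cost(x) = sum(u_i ||p_i||^2) - 2 <sum(u_i p_i), x> + sum(u_i) ||x||^2.
x = rng.normal(size=d)
cost_full = one_mean_cost(P, w, x)
cost_sub = one_mean_cost(Ps, u, x)
```

Lemma A6 is the approximate version of this exact statement: when the three moments match only up to ε-terms, the cost is preserved up to an ε-term for every x.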
To prove Theorem 6, we split it into two claims: Claim A7. Let (Q, m) be a weighted set of n points in R^d, let ε ∈ (0, 1), and let u be the output of a call to CORESET(Q, m, (ε/4)²); see Algorithm 2. Then u = (u_1, · · · , u_n) ∈ R^n is a vector with ‖u‖₀ ≤ 128/ε² non-zero entries that is computed in O(nd/ε²) time, and (Q, u) is a 1-mean ε-coreset for (Q, m).
Proof. Let (P, w) be the normalized weighted set that is computed at Lines 3-5 of Algorithm 2, where P = {p_1, · · · , p_n}, and let ũ = u/‖m‖₁. We show that (P, ũ) is a 1-mean ε-coreset for (P, w); then, by Corollary A5, we get that (Q, u) is a 1-mean coreset for (Q, m). By the definition of the weights u′ at Line 11 in Algorithm 2, and since the algorithm receives ε² as input, (A49) and (A50) hold. For every i ∈ [n], let u_i = ‖m‖₁ · 2u′_i/‖(p_i^T | 1)^T‖² be defined as at Line 12 of the algorithm. It immediately follows by the definition of u = (u_1, · · · , u_n) and (A49) that (A51) holds. We now prove that Properties (1)-(3) of Lemma A6 hold for (P, ũ). We have (A54), where the first derivation follows from (A50), the second holds by the definitions of w_i, u_i, u′_i and p_i for every i ∈ [n], the third holds since ũ = u/‖m‖₁, and the last holds since ‖(x | y)‖ ≥ ‖x‖ for every x ∈ R^d and y ∈ R. By (A54), and since w is a distribution vector, we also have (A55). By Theorem 1, u′ is a distribution vector, which yields (A56). By the above, we get that 2 − ∑_{i=1}^n ũ_i = ∑_{i=1}^n ũ_i ‖p_i‖². Hence, (A57) holds, where the first equality holds since ∑_{i=1}^n ũ_i ‖p_i‖² = 2 − ∑_{i=1}^n ũ_i, the second holds since w is a distribution vector, and the last is by (A56). Now, by (A57), (A56) and (A55), we obtain that (P, ũ) satisfies Properties (1)-(3) of Lemma A6. Hence, by Lemma A6 and Corollary A5, the claim holds. The running time is the running time of Algorithm 1 with ε² instead of ε, i.e., O(nd/ε²). We now prove the following claim: Claim A8. Let (Q, m) be a weighted set of n points in R^d and let ε ∈ (0, 1). Then, in O(nd + d · log(n)²/ε⁴) time, we can compute a vector u = (u_1, · · · , u_n)^T ∈ R^n such that u has ‖u‖₀ ≤ 128/ε² non-zero entries and (Q, u) is a 1-mean (2ε)-coreset for (Q, m).