Article

Coresets for the Average Case Error for Finite Query Sets

Robotics & Big Data Labs, University of Haifa, Haifa 3498838, Israel
*
Author to whom correspondence should be addressed.
Sensors 2021, 21(19), 6689; https://doi.org/10.3390/s21196689
Submission received: 7 September 2021 / Revised: 29 September 2021 / Accepted: 30 September 2021 / Published: 8 October 2021
(This article belongs to the Special Issue Sensor Data Summarization: Theory, Applications, and Systems)

Abstract

A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik–Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset into the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors which approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size for n ≫ 1/ε, where ε is the approximation error, improving, e.g., both [ICML'17] and applications for principal component analysis (PCA) [NIPS'16]. Experimental results show significant and consistent improvement also in practice. Open source code is provided.

1. Introduction

In this paper, we assume that the input is a set P of items, called points. Usually, P is simply a finite set of n points in R^d or in another metric space. In the context of PAC (probably approximately correct) learning [1] or empirical risk minimization [2], it represents the training set. In supervised learning, every point in P may also include its label or class. We also assume a given function w : P → (0, ∞), called the weights function, that assigns a "weight" w(p) > 0 to every point p ∈ P. The weights function represents a distribution of importance over the input points, where the natural choice is the uniform distribution, i.e., w(p) = 1/|P| for every p ∈ P. We are also given a (possibly infinite) set X, the set of queries [3], which represents candidate models or hypotheses, e.g., neural networks [4], SVMs [5], or a set of vectors in R^d with tuning parameters as in linear/ridge/lasso regression [6,7,8].
In machine learning, and PAC-learning in particular, we often seek to compute the query that best describes our input data P for either prediction, classification, or clustering tasks. To this end, we define a loss function f : P × X → R that assigns a fitting cost f(p, x) to every point p ∈ P with respect to a query x ∈ X. For example, it may be a kernel function [9], a convex function [10], or an inner product. The tuple (P, w, X, f) is called a query space and represents the input to our problem. In this paper, we wish to approximate the weighted sum of losses f_w(P, x) = ∑_{p∈P} w(p) f(p, x).
Methodology.
Our main tool for approximating the sum of losses above is called a coreset, which is a (small) weighted subset of the (potentially huge) input data, from which the desired sum of losses can be recovered in a very fast time, with guarantees on the small induced error; see Section 1.2. To compute such a coreset, we utilize the famous Frank–Wolfe algorithm. However, to compute a coreset in fast time, we first provide a scheme for provably boosting the running time of the Frank–Wolfe algorithm for our special case, without compromising its output accuracy; see Section 3.1. We then utilize the boosted version in order to compute a deterministic coreset in time faster than the state of the art.

1.1. Approximation Techniques for the Sum of Losses

Approximating the loss of a single query via uniform sampling. Suppose that we wish to approximate the mean f(P, x) = (1/n)∑_{p∈P} f(p, x) for a specific x ∈ X in sub-linear time. Picking a point p uniformly at random from P would give this result in expectation, as E[f(p, x)] = ∑_{p∈P} f(p, x)/n = f(P, x). By the Hoeffding inequality, the mean f(S, x) = (1/|S|)∑_{p∈S} f(p, x) of a uniform sample S ⊆ P would approximate f(P, x) with high probability. More precisely, for a given ε ∈ (0, 1), if the size of the sample is |S| ∈ O(M/ε²), where M = max_{p∈P} |f(p, x)| is the maximum absolute value of f, then with constant probability our approximation error is
err_f(x) = |f(P, x) − f(S, x)| ≤ ε.
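A minimal NumPy sketch of this uniform-sampling estimator (the toy loss, the clipping constant and all names are our own illustrative assumptions):

```python
import numpy as np

def uniform_mean_estimate(P, f, x, eps, M, seed=0):
    """Estimate f(P, x) = (1/n) sum_p f(p, x) from a uniform sample.

    f(P, x) must return one value per row of P, each in [-M, M];
    the sample size |S| is O(M / eps^2) as in the text above.
    """
    rng = np.random.default_rng(seed)
    n = len(P)
    m = min(n, int(np.ceil(M / eps ** 2)))     # |S|
    S = P[rng.integers(0, n, size=m)]          # uniform sample (with replacement)
    return f(S, x).mean()                      # f(S, x) approximates f(P, x)

# Toy usage: f(p, x) = <p, x> clipped to [-1, 1], so M = 1.
rng = np.random.default_rng(0)
P = rng.normal(size=(100_000, 5))
x = rng.normal(size=5)
f = lambda Q, x: np.clip(Q @ x, -1.0, 1.0)
print(uniform_mean_estimate(P, f, x, eps=0.05, M=1.0), f(P, x).mean())
```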
ε -Sample. Generally, we are interested in a data summarization S of P that approximates every query x ∈ X. An ε-sample is a pair (S, u), where S is a subset of P (unlike, e.g., sketches [11]) and u : S → [0, ∞) is its weights function, such that the weighted loss f_u(S, x) = ∑_{p∈S} u(p) f(p, x) of the (hopefully small) weighted subset S approximates the original weighted loss f_w(P, x) [12], i.e.,
∀x ∈ X: |f_w(P, x) − f_u(S, x)| ≤ ε.
We usually assume that the input is normalized in the sense that w is a distribution and f : P × X → [−1, 1]. By defining the pair of vectors f_w(P, X) = (∑_{p∈P} w(p) f(p, x))_{x∈X} and f_u(S, X) = (∑_{p∈S} u(p) f(p, x))_{x∈X}, we can define the error for a single x by err(x) = |f_w(P, x) − f_u(S, x)|, and then the error vector for the coreset err(X) = (err(x))_{x∈X}. We can rewrite (1) as
‖err(X)‖_∞ = ‖f_w(P, X) − f_u(S, X)‖_∞ ≤ ε.
PAC/DAC learning for approximating the sum of losses for multiple queries. Probably approximately correct (PAC) randomized constructions generalize the Hoeffding inequality above from a single query to multiple (usually infinitely many) queries, and return an ε-sample for a given query space (P, w, X, f) and δ ∈ (0, 1), with probability at least 1 − δ. Here, δ corresponds to the "probably" part, while "approximately correct" corresponds to ε in (2); see [13,14]. Deterministic approximately correct (DAC) versions of PAC-learning suggest deterministic constructions of ε-samples, i.e., the probability of failure of the construction is δ = 0.
As is common in machine learning and computer science in general, the main advantage of deterministic constructions is smaller bounds (in this case, on the size of the resulting ε-sample), and their disadvantage is usually a slower construction time that may be unavoidable. When the query set X is finite, the Caratheodory theorem [15,16] suggests a deterministic algorithm that returns a 0-sample (S, u) (i.e., f_w(P, X) = f_u(S, X)) of size |S| ≤ |X| + 1. Deterministic constructions of ε-samples are known for infinite sets of queries, even when the VC-dimension is unbounded [17,18].
Sup-sampling: reducing the sample size via non-uniform sampling. As explained above, the Hoeffding inequality implies an approximation of f_w(P, x) by f_u(S, x), where u(p) = 1/|S| and S is a random sample drawn according to w whose size depends on M(f) = max_{p∈P} |f(p, x)|. To reduce the sample size we may thus define g(p, x) = f(p, x)/|f(p, x)| ∈ {−1, 1} and s(p) = w(p)|f(p, x)| / ∑_{q∈P} w(q)|f(q, x)|. Now, M(g) = max_{p∈P} |g(p, x)| = 1, and by Hoeffding's inequality, the error of approximating g_s(P, x) via a non-uniform random sample of size 1/ε² drawn from s is ε. Define T = ∑_{q∈P} w(q)|f(q, x)|. Since f_w(P, x) = T · g_s(P, x), approximating g_s(P, x) up to an ε error yields an error of εT for f_w(P, x). Therefore, the size is reduced from M(f)/ε² to T²/ε² when T² ≤ M(f). Here, we sample |S| = O(T²/ε²) points from P according to the distribution s, and re-weight the sampled points by u(p) = T/(|S| · |f(p, x)|).
Unlike traditional PAC-learning, the sample now is non-uniform and is drawn proportionally to s(p), rather than w, as implied by the Hoeffding inequality for non-uniform distributions. For sets of queries we generalize the definition, for every p ∈ P, to s(p) = sup_{x∈X} w(p)|f(p, x)| / ∑_{q∈P} w(q)|f(q, x)|, as in [19], which is especially useful for the coresets below.
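A hedged NumPy sketch of this sup-sampling scheme for a single query x (the toy loss and all names are assumptions; points with f(p, x) = 0 get zero sampling probability, so the division below is safe):

```python
import numpy as np

def sup_sample_estimate(P, w, f, x, eps, seed=0):
    """Estimate f_w(P, x) = sum_p w(p) f(p, x) via non-uniform (sup-)sampling.

    w is a distribution over the rows of P; f(P, x) returns one value per row.
    """
    rng = np.random.default_rng(seed)
    vals = f(P, x)                            # f(p, x) for every p
    T = np.sum(w * np.abs(vals))              # T = sum_q w(q) |f(q, x)|
    s = w * np.abs(vals) / T                  # sampling distribution s(p)
    m = int(np.ceil(T ** 2 / eps ** 2))       # |S| in O(T^2 / eps^2)
    idx = rng.choice(len(P), size=m, p=s)     # sample proportionally to s
    u = T / (m * np.abs(vals[idx]))           # new weights u(p) = T / (|S| |f(p, x)|)
    return np.sum(u * vals[idx])              # unbiased estimate of f_w(P, x)

# Toy usage with f(p, x) = <p, x> and uniform input weights.
rng = np.random.default_rng(1)
P = rng.normal(size=(50_000, 4)); x = rng.normal(size=4)
w = np.full(len(P), 1.0 / len(P))
f = lambda Q, x: Q @ x
print(sup_sample_estimate(P, w, f, x, eps=0.05), np.sum(w * f(P, x)))
```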

1.2. Coresets: A Data Summarization Technique for Approximating the Sum of Losses

A coreset for a given query space (P, w, X, g), in this and many other papers, is a pair (C, u) that is similar to an ε-sample in the sense that C ⊆ P and u : C → [0, ∞) is a weights function. However, the additive error ε is now replaced with a multiplicative error 1 ± ε, i.e., for every x ∈ X we require that |g_w(P, x) − g_u(C, x)| ≤ ε · g_w(P, x). Dividing by g_w(P, x), and assuming g_w(P, x) > 0, yields
∀x ∈ X: |1 − g_u(C, x)/g_w(P, x)| ≤ ε.
Coresets are especially useful for learning big data, since an off-line and possibly inefficient coreset construction for "small data" implies constructions that maintain a coreset for streaming, dynamic (including deletions) and distributed data in parallel. This is via a simple and easy-to-implement framework that is sometimes called a merge–reduce tree; see [20,21]. The fact that a coreset approximates every query (and not just the optimal one for some criterion) implies that we may solve hard optimization problems with non-trivial and non-convex constraints by running a possibly inefficient algorithm, such as exhaustive search, on the coreset, or by running existing heuristics numerous times on the small coreset instead of once on the original data. Similarly, parameter tuning or cross validation can be applied to a coreset that is computed once for the original data, as explained in [22].
An ε-coreset for a query space (P, w, X, g) is simply an ε-sample for the query space (P, w, X, f), after defining f(p, x) := g(p, x)/g_w(P, x), as explained, e.g., in [19]. By defining the error for a single x by err(x) = |1 − g_u(C, x)/g_w(P, x)| = |f_w(P, x) − f_u(C, x)| = err_f(x), we obtain an error vector for the coreset err(X) = (err(x))_{x∈X}. We can then rewrite (3) as in (2):
‖err(X)‖_∞ ≤ ε.
In the case of coresets, the term sup_{x∈X} w(p)f(p, x) = sup_{x∈X} w(p)g(p, x)/g_w(P, x) of a point p ∈ P is called its sensitivity [14], leverage score (in ℓ₂ approximations) [23], Lewis weight (in ℓ_p approximations), or simply its importance [24].

1.3. Problem Statement: Average Case Analysis for Data Summarization

Average case analysis (e.g., [25]) was suggested about a decade ago as an alternative to the (sometimes infamous) worst-case analysis of algorithms in theoretical computer science. The idea is to replace the analysis for the worst-case input by the average input (in some sense). Inspired by this idea, a natural variant of (2) and its above implications is an ε-sample that approximates well the average query. We suggest to define an (ε, ‖·‖)-sample as
‖err(X)‖ = ‖f_w(P, X) − f_u(S, X)‖ ≤ ε,
which generalizes (2) from ‖·‖_∞ to any norm, such as the ℓ_z norm ‖err(X)‖_z. For example, for the ℓ₂, MSE or Frobenius norm, we obtain
(∑_{x∈X} (f_w(P, x) − f_u(S, x))²)^{1/2} ≤ ε.
A generalization of the Hoeffding inequality from 1963 with tight bounds was suggested relatively recently for the ℓ_z norm, for every z ≥ 2, and for many other norms [26,27]. Here, a single query (|X| = 1), a distribution weights function, and a bound on sup_{p∈P} |f(p, X)| that determines the size of the sample are assumed, as in the Hoeffding inequality.
A less obvious question, which is the subject of this paper, is how to compute deterministic ε-samples that satisfy (4) for norms other than the infinity norm. While the Caratheodory theorem suggests deterministic constructions of 0-samples (for any error norm), as explained above, our goal is to obtain coresets whose size is smaller than, or independent of, |X|.
The next question is how to generalize the idea of sup-sampling, i.e., where the function f is unbounded, to norms other than ‖·‖_∞. Our main motivation for doing so is to obtain new and smaller coresets by combining the notions of ε-sample and sup-sampling or sensitivity, as explained above for the ‖·‖_∞ case. That is, we wish for a coreset for a given query space that bounds the non-ℓ_∞ norm error
‖(1 − g_u(C, x)/g_w(P, x))_{x∈X}‖ = ‖err(X)‖ ≤ ε.
To summarize, our questions are: how can we smooth the error function and approximate the "average" query via (i) deterministic ε-samples (for DAC-learning), and (ii) coresets (via sensitivities/sup-sampling for non-infinity norms)?
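To make the two error notions concrete, the following small NumPy comparison evaluates the worst-case (ℓ_∞) and average-case (ℓ₂) errors of an arbitrary candidate summary over a finite query set; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10_000, 6                      # n input points, d = |X| finite queries
F = rng.normal(size=(n, d))           # F[i, j] = f(p_i, x_j)
w = np.full(n, 1.0 / n)               # input weights (a distribution)

idx = rng.choice(n, size=200)         # some candidate subset S
u = np.full(200, 1.0 / 200)           # its weights

err = w @ F - u @ F[idx]              # err(X) = f_w(P, X) - f_u(S, X)
print("worst-case error :", np.linalg.norm(err, np.inf))  # bound of an eps-sample
print("average-case err :", np.linalg.norm(err, 2))       # bound of an (eps, l2)-sample
```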

1.4. Our Contribution

We answer these questions affirmatively by suggesting ε-samples and coresets for the average query. We focus on the case z = 2, i.e., the Frobenius norm, and a finite query set X, and hope that this will inspire research and applications for other norms and general sets. For suggestions in this direction and future work, see Section 5.2. The main results of this paper are the following constructions of an (ε, ‖·‖₂)-sample (S, u) for any given finite query space (P, w, X, f), as defined in (5):
(i)
Deterministic construction that returns a coreset of size |S| ∈ O(1/ε²) in O(min{nd/ε², nd + d log²(n)/ε⁴}) time; see Theorem 2 and Corollary 4.
(ii)
Randomized construction that returns such a coreset (of size |S| ∈ O(1/ε²)) with probability at least 1 − δ, in sub-linear O(d log²(1/δ) + d log(1/δ)/ε²) time; see Lemma 5.
(iii)
A scheme that provably boosts the running time of the Frank–Wolfe algorithm for our setting without compromising its accuracy; see Section 3.1. This result is of independent interest for faster and sparser convex optimization. To our knowledge, this is also the first application of sensitivity outside the coreset regime.

1.5. Overview and Organization

The rest of the paper is organized as follows. First, in Section 2, we list the applications of our proposed methods, such as a faster coreset construction algorithm for least mean squares (LMS) solvers. We also compare our results to the state of the art to justify our practical contribution.
In Section 3, we first give our notation and the relevant mathematical definitions, and explain the relation between the problem of computing an (ε, ‖·‖₂)-sample (average-case coreset) and the problem of computing a vector summarization coreset, where the goal (of the vector summarization coreset problem) is to compute a weighted subset of the n input vectors which approximates their sum. Here, we suggest a coreset for this problem of size O(1/ε) computed in O(nd/ε) time; see Theorem 2 and Algorithm 2. Then, in Section 3.1, we show how to improve the running time of this result and compute a coreset of the same size in O(nd + d log²(n)/ε²) time; see Corollary 4 and Algorithm 3. In addition, we suggest a non-deterministic coreset of the same size, but computed in time that is independent of the number of points n; see Lemma 5 and Algorithm 4.
In Section 4, we explain how our vector summarization coreset results can be used to improve all the previously mentioned applications (from Section 2). In Section 5, we conduct various experiments on real-world datasets, where we apply the different coreset construction algorithms presented in this paper to a variety of applications, in order to boost their running time or reduce their memory usage. We also compare our results to many competing methods. Finally, we conclude the paper and discuss future work in Section 5.2. For brevity and readability, the proofs of all claims are placed in Appendix A, Appendix B, Appendix C, Appendix D, Appendix E, Appendix F, Appendix G, Appendix H and Appendix I.

2. On the Applications of Our Method and the Current State of the Art

In what follows, we present some of the applications of our theoretical contributions and discuss the current state-of-the-art coreset/sketch methods, in terms of running time, for each application. Figure 1 summarizes the main applications of our result.
(i)
Vector summarization: the goal is to maintain the sum of a (possibly infinite) stream of vectors in R^d, up to an additive error of ε multiplied by their variance. This is a generalization of frequent items/directions [28].
As explained in [29], the main real-world application is extracting and compactly representing groups and activity summaries of users from underlying data exchanges. For example, GPS traces in mobile networks can be exploited to identify meetings, and exchanges of information in social networks shed light on the formation of groups of friends. Our algorithm tackles these applications by providing a provable solution to the heavy hitters problem in proximity matrices, which in turn can be used to extract and compactly represent friend groups and activity summaries of users from the underlying data exchanges.
We propose a deterministic algorithm which reduces each subset of n vectors into O(1/ε) weighted vectors in O(nd + d log²(n)/ε²) time, improving upon the O(nd/ε) time of [29] (the current state of the art in terms of running time) for sufficiently large n; see Corollary 4, and Figures 2 and 3. We also provide a non-deterministic coreset construction in Lemma 5. The merge-and-reduce tree can then be used to support streaming, distributed or dynamic data; a small sketch of such maintenance is given after this list.
(ii)
Kernel density estimates (KDE): by replacing ε with ε² in the vector summarization, we obtain a fast construction of an ε-coreset for KDE of Euclidean kernels [17]; see more details in Section 4. Kernel density estimation is a technique for estimating a probability density function (continuous distribution) from a finite set of points, in order to better analyse the studied probability distribution [30,31].
(iii)
1-mean problem: a coreset for 1-mean approximates the sum of squared distances from a set of n points to any given center (point) in R^d. This problem arises in facility location (e.g., computing the optimal location for placing an antenna so that all customers are satisfied). Our deterministic construction computes such a weighted subset of size O(1/ε²) in O(min{nd/ε², nd + d log²(n)/ε⁴}) time. Previous results [19,32,33,34] suggested coresets for this problem; unlike ours, these works are either non-deterministic, their coreset is not a subset of the input, or the size of the coreset is linear in d.
(iv)
Coreset for LMS solvers and dimensionality reduction: for example, a deterministic construction for the singular value decomposition (SVD) that receives a matrix A ∈ R^{n×d} and returns a weighted subset of k²/ε² rows, such that their weighted distance to any k-dimensional non-affine (or affine, in the case of PCA) subspace approximates the distance of the original points to this subspace. SVD and PCA are very common algorithms (see [35]) that can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses; thus, improving them might be helpful for a wide range of real-world applications. In this paper, we propose a deterministic coreset construction that takes O(nd² + d²k⁴ log²(n)/ε⁴) time, improving upon the state-of-the-art result of [35], which requires O(nd²k²/ε²) time; see Table 1. Many non-deterministic coreset constructions were suggested for these problems; the construction techniques apply non-uniform sampling [36,37,38], Monte-Carlo sampling [39], and leverage score sampling [23,40,41,42,43,44,45].
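As mentioned in application (i), a coreset construction for "small data" can be turned into a streaming one via merge-and-reduce. The following class is a hedged sketch of such maintenance; `coreset_fn` is an assumed callback (e.g., Algorithm 2 with a fixed ε), not part of the paper's API:

```python
import numpy as np

class StreamingSummarizer:
    """Maintain a weighted summary of a stream via merge-and-reduce buffering."""

    def __init__(self, coreset_fn, bucket_size):
        self.coreset_fn = coreset_fn   # (Q, m) -> sparse nonnegative weights over Q's rows
        self.bucket_size = bucket_size
        self.Q, self.m = [], []        # buffered points and their weights

    def add(self, q, weight=1.0):
        self.Q.append(q); self.m.append(weight)
        if len(self.Q) >= self.bucket_size:
            self._reduce()

    def _reduce(self):
        Q, m = np.asarray(self.Q), np.asarray(self.m)
        u = self.coreset_fn(Q, m)      # compress the buffer into a weighted subset
        keep = u > 0
        self.Q, self.m = list(Q[keep]), list(u[keep])

    def weighted_sum(self):
        return np.asarray(self.m) @ np.asarray(self.Q)
```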

3. Vector Summarization Coreset

Notation 1.
We denote [n] = {1, …, n}. For a vector v ∈ R^d, its 0-norm, denoted by ‖v‖₀, is the number of non-zero entries of v. We denote by e^(i) the ith standard basis vector of R^n and by 0 the vector (0, …, 0)^T ∈ R^n. A vector w ∈ [0,1]^n is called a distribution vector if all its entries are non-negative and sum to one. For a matrix A ∈ R^{m×n} and i ∈ [m], j ∈ [n], we denote by A_{i,j} the jth entry of the ith row of A. A weighted set is a pair (Q, m), where Q = {q_1, …, q_n} ⊆ R^d is a set of n points and m = (m_1, …, m_n)^T ∈ R^n is a weights vector that assigns every q_i ∈ Q a weight m_i ∈ R. A matrix A ∈ R^{d×d} is orthogonal if A^T A = I ∈ R^{d×d}.
Adaptations. To adapt the query space (P, w, X, f) to the notation and the techniques of the following sections, we restate (4) as follows. Previously, we denoted the queries by X = {x_1, …, x_d} and the input set by P = {p_1, …, p_n}. Now, each input point p_i in the input set P corresponds to a point q_i = (f(p_i, x_1), …, f(p_i, x_d)) ∈ R^d, i.e., each entry of q_i equals f(p_i, x) for a different query x. Throughout the rest of the paper, for technical reasons and simplicity, we may alternate between the weights function notation and the weights vector notation. In such cases, the weights function w : P → [0, ∞) and the weight w(q_i) of q_i, i ∈ [n], are replaced by a weights vector m = (m_1, …, m_n) ∈ [0, ∞)^n and its entry m_i, respectively, and vice versa. Similarly, the ε-sample is represented by a sparse vector u ∈ [0, ∞)^n, where S = {p_i ∈ P : u_i > 0, i ∈ [n]} is the chosen subset of P.
Hence, f_w(P, X) = ∑_{p∈P} w(p)·(f(p, x_1), …, f(p, x_d)) = ∑_{i=1}^n m_i q_i, and f_u(S, X) = ∑_{i=1}^n u_i q_i.
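As a concrete illustration of this reduction, the following NumPy snippet builds the matrix whose rows are the points q_i from a loss f and a finite query set; the particular loss and sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.normal(size=(1_000, 3))                  # n = 1000 input points in R^3
X = rng.normal(size=(4, 3))                      # d = |X| = 4 queries
f = lambda p, x: np.abs(p @ x)                   # an arbitrary example loss

Q = np.array([[f(p, x) for x in X] for p in P])  # row q_i corresponds to p_i
m = np.full(len(P), 1.0 / len(P))                # weights vector
print(m @ Q)                                     # f_w(P, X) = sum_i m_i q_i
```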
From (ε, ‖·‖₂)-samples to ε-coresets. We now define an ε-coreset for vector summarization, which is a re-weighting of the input weighted set (Q, m) by a new weights vector u, such that the squared norm of the difference between the weighted means of (Q, u) and (Q, m) is small. This relates to Section 1.3: an (ε, ‖·‖₂)-sample there is an ε-coreset for vector summarization here.
Definition 1
(vector summarization ε-coreset). Let (Q, m) and (Q, u) be two weighted sets of n points in R^d, and let ε ∈ [0, 1). Let μ = ∑_{i=1}^n (m_i/‖m‖₁) q_i, σ² = ∑_{i=1}^n (m_i/‖m‖₁) ‖q_i − μ‖², and μ̃ = ∑_{i=1}^n (u_i/‖u‖₁) q_i. Then (Q, u) is a vector summarization ε-coreset for (Q, m) if ‖μ̃ − μ‖₂² ≤ ε σ².
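A direct numerical check of Definition 1 (the helper below is ours and is only an illustration, not part of the paper's algorithms):

```python
import numpy as np

def is_vector_summarization_coreset(Q, m, u, eps):
    """Return True iff (Q, u) is a vector summarization eps-coreset for (Q, m)."""
    w = m / np.abs(m).sum()                          # m / ||m||_1
    mu = w @ Q                                       # weighted mean of (Q, m)
    sigma2 = w @ np.sum((Q - mu) ** 2, axis=1)       # its weighted variance
    mu_tilde = (u / np.abs(u).sum()) @ Q             # weighted mean of (Q, u)
    return np.sum((mu_tilde - mu) ** 2) <= eps * sigma2
```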
Analysis flow. In what follows, we first assume that the points of our input set P lie inside the unit ball (‖p‖ ≤ 1 for every p ∈ P). For such an input set, we present a construction of a variant of a vector summarization coreset, where the error is ε and does not depend on the variance of the input. This construction is based on the Frank–Wolfe algorithm [48]; see Theorem 1 and Algorithm 1. The idea is to reduce the problem to that of maximizing a concave function f(x) over the unit simplex. Such problems can be solved approximately by a simple greedy algorithm known as the Frank–Wolfe algorithm.
Algorithm 1:FRANK–WOLFE ( f , K ) ; Algorithm 1.1 of [48]
1:
Input: A concave function f : R^n → R, and the number of iterations K.
2:
Output: A vector x R n that satisfies Theorem 1.
3:
x^(0) := the unit n-simplex vertex with the largest f value.
4:
for k ∈ {0, …, K} do
5:
i := arg max_i (∇f(x^(k)))_i
6:
α := arg max_{α∈[0,1]} f(x^(k) + α(e^(i) − x^(k)))
7:
x^(k+1) := x^(k) + α(e^(i) − x^(k))
8:
end for
9:
Return x ( k + 1 )
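Below is a minimal NumPy sketch of Algorithm 1 specialized to the concave objective used later in Theorem 1, f(x) = −‖∑_i (w_i − x_i)p_i‖². The closed-form line search replacing the generic arg max in step 6, and all names, are our own simplifications and should be treated as assumptions:

```python
import numpy as np

def frank_wolfe_unit_ball(P, w, eps):
    """Sparse distribution x with ||sum_i (w_i - x_i) p_i||^2 <= eps (cf. Theorem 1).

    P : (n, d) array whose rows satisfy ||p_i|| <= 1;  w : distribution over the rows.
    """
    n, _ = P.shape
    mu = w @ P                                # mu = sum_i w_i p_i
    K = int(np.ceil(8.0 / eps))               # number of Frank-Wolfe iterations
    x = np.zeros(n)                           # start at the simplex vertex with largest f,
    x[np.argmin(np.sum((P - mu) ** 2, axis=1))] = 1.0   # i.e. the p_i closest to mu
    for _ in range(K):
        r = mu - x @ P                        # residual; grad f(x) = 2 P r
        i = int(np.argmax(P @ r))             # step 5: simplex vertex maximizing the gradient
        c = P[i] - x @ P                      # move direction A(e_i - x) in R^d
        denom = float(c @ c)                  # step 6: exact maximizer of the concave quadratic
        alpha = 1.0 if denom == 0.0 else float(np.clip((r @ c) / denom, 0.0, 1.0))
        x *= (1.0 - alpha)                    # step 7: x <- x + alpha (e_i - x)
        x[i] += alpha
    return x
```

Each iteration adds at most one new simplex vertex, so the returned vector has O(1/ε) non-zero entries.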
We then present a proper coreset construction in Algorithm 2 and Theorem 2 for a general input set Q in R^d. This algorithm is based on a reduction to the simpler case of points inside the unit ball; see Figure 1 for an illustration. This reduction is inspired by sup-sampling (see Section 1): there, the functions are normalized (to obtain values in [−1, 1]) and reweighted (to obtain an unbiased estimator), and the bounds then follow easily from the Hoeffding inequality. Here, we apply different normalizations and reweightings, and instead of the non-deterministic Hoeffding inequality, we suggest a deterministic alternative using the Frank–Wolfe algorithm. Our new suggested normalizations (and reweightings) allow us to generalize the result to many more applications, as in Section 4.
For brevity, all proofs of the technical results can be found in Appendix A, Appendix B, Appendix C, Appendix D, Appendix E, Appendix F, Appendix G, Appendix H and Appendix I.
Theorem 1
(Coreset for points in the unit ball). Let P = {p_1, …, p_n} be a set of n points in R^d such that ‖p_i‖ ≤ 1 for every i ∈ [n]. Let ε ∈ (0, 1) and let w = (w_1, …, w_n)^T ∈ [0,1]^n be a distribution vector. For every x = (x_1, …, x_n)^T ∈ R^n, define f(x) = −‖∑_{i=1}^n (w_i − x_i)p_i‖². Let ũ be the output of a call to FRANK–WOLFE(f, 8/ε); see Algorithm 1. Then:
(i) 
ũ is a distribution vector with ‖ũ‖₀ ≤ 8/ε,
(ii) 
‖∑_{i=1}^n (w_i − ũ_i)p_i‖² ≤ ε, and
(iii) 
ũ is computed in O(nd/ε) time.
We now show how to obtain a vector summarization ε-coreset of size O(1/ε) in O(nd/ε) time for any set P ⊆ R^d.
Theorem 2
(Vector summarization coreset). Let (Q, m) be a weighted set of n points in R^d, ε ∈ (0, 1), and let u be the output of a call to CORESET(Q, m, ε/16); see Algorithm 2. Then, u ∈ R^n is a vector with ‖u‖₀ ≤ 128/ε non-zero entries that is computed in O(nd/ε) time, and (Q, u) is a vector summarization ε-coreset for (Q, m).
Algorithm 2:CORESET ( Q , m , ε )
1:
Input: A weighted set (Q, m) of n ≥ 2 points in R^d and an error parameter ε ∈ (0, 1).
2:
Output: A weight vector u ∈ [0, ∞)^n with O(1/ε) non-zero entries that satisfies Theorem 2.
3:
w := m/‖m‖₁
4:
μ_w := ∑_{i=1}^n w_i q_i
5:
σ_w := (∑_{i=1}^n w_i ‖q_i − μ_w‖²)^{1/2}
6:
for every i ∈ {1, …, n} do
7:
p_i := (q_i − μ_w)/σ_w
8:
p′_i := (p_i^T ∣ 1)^T / ‖(p_i^T ∣ 1)^T‖₂²       {Notice: ‖p′_i‖ ≤ 1.}
9:
w′_i := w_i ‖(p_i^T ∣ 1)^T‖₂² / 2
10:
end for
11:
Compute a sparse vector u′ ∈ [0, ∞)^n with O(1/ε) non-zero entries, such that ‖∑_{i=1}^n (w′_i − u′_i)p′_i‖² ≤ ε {E.g., using Algorithm 1 (see Theorem 1).}
12:
u_i := ‖m‖₁ · 2u′_i / ‖(p_i^T ∣ 1)^T‖₂²    for every i ∈ {1, …, n}
13:
Return u
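A sketch of Algorithm 2 built on top of the Frank–Wolfe routine sketched after Algorithm 1; the primed variables and the lifting follow our reading of the listing above, so the exact constants should be treated as assumptions:

```python
import numpy as np

def vector_summarization_coreset(Q, m, eps, fw):
    """Sparse weights u such that (Q, u) summarizes the weighted set (Q, m).

    fw(P, w, eps) must return a sparse distribution x with
    ||sum_i (w_i - x_i) p_i||^2 <= eps for points in the unit ball,
    e.g., the frank_wolfe_unit_ball sketch given after Algorithm 1.
    """
    Q = np.asarray(Q, dtype=float); m = np.asarray(m, dtype=float)
    w = m / np.sum(np.abs(m))                           # Line 3: w := m / ||m||_1
    mu = w @ Q                                          # Line 4: weighted mean
    sigma = np.sqrt(w @ np.sum((Q - mu) ** 2, axis=1))  # Line 5: weighted std
    P = (Q - mu) / sigma                                # Line 7: normalize
    L = np.hstack([P, np.ones((len(Q), 1))])            # lift each point to (p_i^T | 1)^T
    norms2 = np.sum(L ** 2, axis=1)                     # ||(p_i^T | 1)^T||^2
    P_lift = L / norms2[:, None]                        # Line 8: now ||p'_i|| <= 1
    w_lift = w * norms2 / 2.0                           # Line 9: a distribution again
    u_lift = fw(P_lift, w_lift, eps)                    # Line 11 (Theorem 2 passes eps/16)
    return np.sum(np.abs(m)) * 2.0 * u_lift / norms2    # Line 12: back to the original scale
```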

3.1. Boosting the Coreset’s Construction Running Time

In this section, we present Algorithm 3, which aims to boost the running time of Algorithm 1 from the previous section; see Theorem 3. The main idea behind this new boosted algorithm is as follows: instead of running the Frank–Wolfe algorithm on the (full) input set, it can be more efficient to partition the input into k equal-sized chunks, pick a representative for each chunk (its mean), run the Frank–Wolfe algorithm only on the set of representatives (the set of means) to obtain a subset of those representatives, and then continue recursively only with the chunks whose representatives were chosen by the algorithm. Although the Frank–Wolfe algorithm is now applied multiple times (rather than once), each of those runs is much more efficient, since only the small set of representatives is considered.
This immediately implies a faster construction time of vector summarization ε -coresets for general input sets; see Corollary 4 and Figure 1 for illustration.
Theorem 3
(Faster coreset for points in the unit ball). Let P be a set of n points in R^d such that ‖p‖ ≤ 1 for every p ∈ P. Let w : P → (0, 1) be a weights function such that ∑_{p∈P} w(p) = 1, let ε ∈ (0, 1), and let (C, u) be the output of a call to FAST-FW-CORESET(P, w, ε); see Algorithm 3. Then
(i) 
|C| ≤ 8/ε and ∑_{p∈C} u(p) = 1,
(ii) 
‖∑_{p∈P} w(p)·p − ∑_{p∈C} u(p)·p‖₂² ≤ 2ε, and
(iii) 
(C, u) is computed in O(nd + d·log²(n)/ε²) time.
Corollary 4
(Faster vector summarization coreset). Let (Q, m) be a weighted set of n points in R^d, and let ε ∈ (0, 1). Then in O(nd + d·log²(n)/ε²) time we can compute a vector u = (u_1, …, u_n)^T ∈ R^n, such that u has ‖u‖₀ ≤ 128/ε non-zero entries and (Q, u) is a vector summarization (2ε)-coreset for (Q, m).
Algorithm 3:FAST-FW-CORESET ( P , w , ε )
1:
Input: A weighted set (P, w) of n ≥ 2 points in R^d and an error parameter ε ∈ (0, 1).
2:
Output: A pair (C, u) that satisfies Theorem 3
3:
k := ⌈2 log(n)/ε⌉
4:
if |P| ≤ k then
5:
Return: A vector summarization ε-coreset for (P, w) using Theorem 1.
6:
end if
7:
P_1, …, P_k := a partition of P into k disjoint subsets, each containing at most ⌈n/k⌉ points.
8:
for every i ∈ {1, …, k} do
9:
μ_i := (1/∑_{q∈P_i} w(q)) · ∑_{p∈P_i} w(p)·p {The weighted mean of P_i}
10:
w(μ_i) := ∑_{p∈P_i} w(p)
11:
end for
12:
(μ̃, ũ) := a vector summarization (ε/log(n))-coreset for the weighted set ({μ_1, …, μ_k}, w) using Theorem 1.
13:
C := ∪_{μ_i ∈ μ̃} P_i {C is the union over all subsets P_i whose mean μ_i was chosen in μ̃.}
14:
for every μ_i ∈ μ̃ and p ∈ P_i do
15:
u(p) := ũ(μ_i) · w(p) / ∑_{q∈P_i} w(q)
16:
end for
17:
(C, u) := FAST-FW-CORESET(C, u, ε)
18:
Return: (C, u)
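A recursive NumPy sketch of Algorithm 3; `fw_coreset` stands for any routine with the guarantee of Theorem 1 (e.g., the Frank–Wolfe sketch above), and the final safety guard is our own addition to keep this simplified sketch terminating in all cases:

```python
import numpy as np

def fast_fw_coreset(P, w, eps, fw_coreset):
    """Recursively compress (P, w) with ||p|| <= 1 and sum(w) = 1, as in Algorithm 3.

    fw_coreset(P, w, eps) -> sparse weight vector over the rows of P (Theorem 1).
    Returns a pair (points, weights).
    """
    P, w = np.asarray(P, dtype=float), np.asarray(w, dtype=float)
    n = len(P)
    k = int(np.ceil(2 * np.log(n) / eps))              # Line 3: number of chunks
    if n <= k:                                         # Lines 4-5: base case
        return P, fw_coreset(P, w, eps)
    chunks = np.array_split(np.arange(n), k)           # Line 7: partition of P
    chunk_w = np.array([w[c].sum() for c in chunks])   # Line 10: chunk weights
    means = np.array([w[c] @ P[c] / w[c].sum() for c in chunks])   # Line 9
    u_means = fw_coreset(means, chunk_w, eps / np.log(n))          # Line 12
    C, u = [], []
    for j in np.flatnonzero(u_means > 0):              # Lines 13-15: keep chosen chunks
        c = chunks[j]
        C.append(P[c]); u.append(u_means[j] * w[c] / w[c].sum())
    C, u = np.vstack(C), np.concatenate(u)
    if len(C) >= n:            # sketch-only guard: no shrinkage, compress directly
        return C, fw_coreset(C, u, eps)
    return fast_fw_coreset(C, u, eps, fw_coreset)      # Line 17: recurse
```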
In what follows, we show how to compute a vector summarization coreset, with high probability, in time that is sublinear in the input size |Q| = n. This is based on the geometric median trick, which suggests the following procedure: (i) sample k > 1 sets S_1, …, S_k of the same (small) size from the original input set Q, (ii) for each sampled set S_i (i ∈ [k]), compute its mean s̄_i, and finally, (iii) compute and return the geometric median of the set of those means s̄ = {s̄_1, …, s̄_k}. This geometric median is guaranteed to approximate the mean of the original input set Q.
We show that there is no need to compute this geometric median, which is a computationally difficult task. We prove that, with high probability, there exists a set S_{i*} among the sampled subsets such that its mean s̄_{i*} is very close to this geometric median. Thus, s̄_{i*} is a good approximation to the desired mean of the original input set. Furthermore, we show that s̄_{i*} is simply the point in s̄ that minimizes the sum of its (non-squared) distances to the set s̄, i.e., i* ∈ arg min_{j∈[k]} ∑_{i=1}^k ‖s̄_i − s̄_j‖₂. An exhaustive search over the points of s̄ can thus recover s̄_{i*}. The corresponding set S_{i*} is the resulting vector summarization coreset; see Lemma 5 and Algorithm 4.
Lemma 5
(Fast probabilistic vector summarization coreset). Let Q be a set of n points in R^d, μ = (1/n)∑_{q∈Q} q, and σ² = (1/n)∑_{q∈Q} ‖q − μ‖². Let ε ∈ (0, 1), δ ∈ (0, 0.9], and let S ⊆ R^d be the output of a call to PROB-WEAK-CORESET(Q, ε, δ); see Algorithm 4. Then:
(i) 
S ⊆ Q and |S| = 4/ε,
(ii) 
with probability at least 1 − 3δ, we have ‖(1/|S|)∑_{p∈S} p − μ‖² ≤ 33·ε·σ², and
(iii) 
S is computed in O(d log²(1/δ) + d log(1/δ)/ε) time.
Algorithm 4:PROB-WEAK-CORESET ( Q , ε , δ )
1:
Input: A set Q of n ≥ 2 points in R^d, ε ∈ (0, 1), and δ ∈ (0, 1).
2:
Output: A subset S ⊆ Q that satisfies Lemma 5.
3:
k := ⌈3.5 log(1/δ) + 1⌉.
4:
S := an i.i.d. sample from Q of size 4k/ε.
5:
S_1, …, S_k := a partition of S into k disjoint subsets, each containing 4/ε points.
6:
s̄_i := the mean of the ith subset S_i, for every i ∈ [k].
7:
i* := arg min_{j∈[k]} ∑_{i=1}^k ‖s̄_i − s̄_j‖₂.
8:
Return S_{i*}
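A short NumPy sketch of Algorithm 4; the reshaping and the seed handling are our own choices:

```python
import numpy as np

def prob_weak_coreset(Q, eps, delta, seed=0):
    """Sublinear-time probabilistic vector summarization coreset (cf. Algorithm 4)."""
    Q = np.asarray(Q, dtype=float)
    rng = np.random.default_rng(seed)
    k = int(np.ceil(3.5 * np.log(1.0 / delta) + 1))      # Line 3
    size = int(np.ceil(4.0 / eps))
    S = Q[rng.integers(0, len(Q), size=k * size)]        # Line 4: i.i.d. sample
    parts = S.reshape(k, size, -1)                       # Line 5: k disjoint subsets
    means = parts.mean(axis=1)                           # Line 6: their means
    dists = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
    i_star = int(np.argmin(dists.sum(axis=0)))           # Line 7: closest to the rest
    return parts[i_star]                                 # the chosen subset S_{i*}
```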

4. Applications

Coreset for 1-mean. A 1-mean ε -coreset for ( Q , m ) is a weighted set ( Q , u ) such that for every x R d , the sum of squared distances from x to either ( Q , m ) or ( Q , u ) , is approximately the same. To maintain the above property, we prove that it suffices for ( Q , u ) to satisfy the following: the mean, the variance, and the sum of weights of ( Q , u ) should approximate the mean, the variance, and the sum of weights of ( Q , m ) , respectively, up to an additive error that depends linearly on ε . Then note that when plugging ε 2 (rather than ε ) as input to Algorithm 2, the output is guaranteed to satisfy the above 3 properties, by construction of u.
The following theorem computes a 1-mean ε -coreset.
Theorem 6.
Let (Q, m) be a weighted set of n points in R^d and ε ∈ (0, 1). Then in
O(min{nd + d·log²(n)/ε⁴, nd/ε²})
time we can compute a vector u = (u_1, …, u_n)^T ∈ R^n, where ‖u‖₀ ≤ 128/ε², such that:
∀x ∈ R^d: |∑_{i=1}^n (m_i − u_i)‖q_i − x‖²| ≤ ε ∑_{i=1}^n m_i ‖q_i − x‖².
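The guarantee above is easy to sanity-check numerically for a candidate weight vector u; the helper below is ours and only illustrates the statement:

```python
import numpy as np

def one_mean_cost(Q, weights, x):
    """Weighted sum of squared distances from the rows of Q to the center x."""
    return weights @ np.sum((Q - x) ** 2, axis=1)

def max_relative_error(Q, m, u, trials=1000, seed=0):
    """Empirical worst-case relative error of (Q, u) over random centers x."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        x = rng.normal(size=Q.shape[1])
        full, core = one_mean_cost(Q, m, x), one_mean_cost(Q, u, x)
        errs.append(abs(full - core) / full)
    return max(errs)
```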
Coreset for KDE. Given two sets of points Q and Q′, and a kernel K : R^d × R^d → R that is defined by the kernel map ϕ, the maximal difference
sup_{x∈R^d} |∑_{q∈Q} K(x, q)/|Q| − ∑_{q∈Q′} K(x, q)/|Q′||
between the kernel costs of Q and Q′ is upper bounded by ‖μ_{Q̂} − μ_{Q̂′}‖₂, where μ_{Q̂} and μ_{Q̂′} are the means of Q̂ = {ϕ(q) : q ∈ Q} and Q̂′ = {ϕ(q) : q ∈ Q′}, respectively [49]. Given Q̂, we can compute a vector summarization ε²-coreset Q̂′, which satisfies ‖μ_{Q̂} − μ_{Q̂′}‖₂² ≤ ε². By the above argument, this is also an ε-KDE coreset.
Coreset for dimensionality reduction and LMS solvers. An ε-coreset for the k-SVD (k-PCA) problem of Q is a small weighted subset of Q that approximates the sum of squared distances from the points of Q to every non-affine (affine) k-dimensional subspace of R^d, up to a multiplicative factor of 1 ± ε; see Corollary 7. A coreset for LMS solvers is the special case of k = d − 1.
In [35], it is shown how to leverage an (ε/k)²-coreset for the vector summarization problem in order to compute an ε-coreset for k-SVD. In [45], it is shown how to compute a coreset for k-PCA via a coreset for k-SVD, by simply adding another entry with some value r ∈ R to each vector of the input. Algorithm 5 combines both of the above reductions, along with a computation of a vector summarization (ε/k)²-coreset, to compute the desired coreset for dimensionality reduction (both k-SVD and k-PCA). To compute the vector summarization coreset we utilize our new algorithms from the previous sections, which are faster than the state-of-the-art algorithms.
Algorithm 5:DIM-CORESET ( A , k , ε )

1:
Input: A matrix A ∈ R^{n×d}, an integer k ∈ [d], and an error parameter ε ∈ (0, 1).
2:
Output: A diagonal matrix W ∈ R^{n×n} that satisfies Corollary 7.
3:
r := 1 + max_{i∈[n]} 4‖a_i‖²/ε⁴ {where a_i is the ith row of A}
4:
U, Σ, V := the full SVD of [A ∣ (r, …, r)^T] ∈ R^{n×(d+1)}
5:
v_i := (U_{i,1}, …, U_{i,k}, ‖U_{i,k+1:d} Σ_{k+1:d,k+1:d}‖ / ‖Σ_{k+1:d,k+1:d}‖_F) for every i ∈ [n]
6:
ṽ_i := the row stacking of v_i v_i^T ∈ R^{d×d} for every i ∈ [n]
7:
({ṽ_1, …, ṽ_n}, u) := a vector summarization (ε/(5k))²-coreset for ({ṽ_1, …, ṽ_n}, (1, …, 1)).
8:
W := a diagonal matrix in R^{n×n}, where W_{i,i} = u_i for every i ∈ [n].
9:
Return W
Corollary 7
(Coreset for dimensionality reduction). Let Q be a set of n points in R^d, and let A ∈ R^{n×d} be the corresponding matrix containing the points of Q in its rows. Let ε ∈ (0, 1/2) be an error parameter, k ∈ [d] an integer, and let W be the output of a call to DIM-CORESET(A, k, ε). Then:
(i) 
W is a diagonal matrix with O(k²/ε²) non-zero entries,
(ii) 
W is computed in O(min{nd² + d² log²(n) k⁴/ε⁴, nd² k²/ε²}) time, and
(iii) 
there is a constant c, such that for every ℓ ∈ R^d and every orthogonal matrix X ∈ R^{d×(d−k)} we have
|1 − ‖W(A − ℓ)X‖_F² / ‖(A − ℓ)X‖_F²| ≤ cε.
Here, A − ℓ denotes the matrix obtained by subtracting ℓ from every row of A.
Where do our methods fit in? Theoretically speaking, the 1-mean problem (also known as the arithmetic mean problem) is a widely used tool for reporting central tendencies in statistics, and it is also used in machine learning. On the practical side, it can be used either to estimate the mathematical expectation of signal strength in an area [50], or as an imputation technique for filling in missing values, e.g., in the context of missing heart monitor sensor data [51]. Note that a variant of this problem, namely moving averages, is widely used in the context of deep learning. Algorithms 3 and 4 can boost such methods when given large-scale datasets. In addition, our algorithms extend to SVD, PCA, and LMS, methods known for their use and efficiency in discovering low-dimensional representations of high-dimensional data. From a practical point of view, SVD has shown promising results for on-orbit calibration of a star sensor [52], for denoising 4-dimensional computed tomography of the brain in stroke patients [53], and for the removal of cardiac interference from trunk electromyograms [54], among many other applications.
We propose a summarization technique (see Algorithm 5) that aims to compute an approximation of the SVD factorization of large-scale datasets, for cases where applying the SVD factorization to the full dataset is not possible due to insufficient memory or long computational time.

5. Experimental Results

We now apply different coreset construction algorithms presented in this paper to a variety of applications, in order to boost their running time, or reduce their memory storage. We note that a complete open source code is provided [55].
Software/Hardware. The algorithms were implemented in Python 3.6 [56] using "Numpy" [57]. Tests were conducted on a PC with an Intel i9-7960X CPU @ 2.80 GHz × 32 and 128 GB RAM.
We compare the following algorithms. (To easily distinguish between our algorithms and the competing ones in the graphs, note that the labels of our algorithms start with the prefix "Our-", while those of the competing methods do not.)
(i)
Uniform: Uniform random sample of the input Q, which requires sublinear time to compute.
(ii)
Sensitivity-sum: random sampling based on the "sensitivity" for the vector summarization problem [58]. Sensitivity sampling is a widely known technique [19], which guarantees that a subsample of sufficient size approximates the input well. The sensitivity of a point q ∈ Q is 1/n + ‖q‖²/∑_{q′∈Q} ‖q′‖². This algorithm takes O(nd) time.
(iii)
ICML17: The vector summarization coreset construction algorithm from [29] (see Algorithm 2 there), which runs in O ( n d / ε ) time.
(iv)
Our-rand-sum: our coreset construction from Lemma 5, which requires O(d log²(1/δ) + d log(1/δ)/ε) time.
(v)
Our-slow-sum: our coreset construction from Theorem 2, which requires O(nd/ε) time.
(vi)
Our-fast-sum: our coreset construction from Corollary 4, which requires O(nd + d log²(n)/ε²) time.
(vii)
Sensitivity-svd: similar to Sensitivity-sum above; however, the sensitivity is now computed by projecting the rows of the input matrix A onto the optimal k-subspace (or an approximation of it) that minimizes the sum of squared distances to the rows of A, and then computing the sensitivity of each row i of the projected matrix A as ‖u_i‖², where u_i is the ith row of the matrix U from the SVD A = UDV^T; see [37]. This takes O(ndk) time.
(viii)
NIPS16: The coreset construction algorithm from [35] (see Algorithm 2 there) which requires O ( n d 2 k 2 / ε 2 ) time.
(ix)
Our-slow-svd: Corollary 7 offers a coreset construction for SVD using Algorithm 5, which utilizes Algorithm 2. Algorithm 2, in turn, utilizes either Algorithm 1 (see Theorem 2) or Algorithm 3 (see Corollary 4). Our-slow-svd applies the former option, which requires O(nd²k²/ε²) time.
(x)
Our-fast-svd: Corollary 7 offers a coreset construction for SVD using Algorithm 5, which utilizes Algorithm 2. Algorithm 2, in turn, utilizes either Algorithm 1 (see Theorem 2) or Algorithm 3 (see Corollary 4). Our-fast-svd uses the latter option, which requires O(nd² + d² log²(n) k⁴/ε⁴) time.
Datasets. We used the following datasets from the UCI ML library [59]:
(i)
New York City Taxi Data [60]. The data covers taxi operations in New York City. We used the records describing n = 14.7M trip fares from the year 2013, and the d = 6 numerical (real-valued) features.
(ii)
US Census Data (1990) [61]. The dataset contains n = 2.4M entries. We used all d = 68 real-valued attributes of the dataset.
(iii)
Buzz in social media Data Set [62]. It contains n = 0.5M examples of buzz events from two different social networks: Twitter and Tom's Hardware. We used all d = 77 real-valued attributes.
(iv)
Gas Sensors for Home Activity Monitoring Data Set [63]. This dataset has n = 919,438 recordings of a gas sensor array composed of 8 MOX gas sensors, and a temperature and humidity sensor. We used the last d = 10 real-valued attributes of the dataset.
Discussion regarding the chosen datasets. The Buzz in social media data set is widely used in the context of Principal Component Regression (PCR, in short), which estimates the unknown regression coefficients in a standard linear regression model. The goal of PCR in the context of this dataset is to predict the popularity of a certain topic on Twitter over a period of time. It is known that the solution of the PCR problem can be approximated using the SVD. Our techniques enable us to benefit from the coreset advantages, e.g., to boost the approximate PCR solution (via PCA) while using low memory and supporting the streaming model, by maintaining a coreset for the data (tweets) seen so far: each time a new point (tweet) is received, it is added to the currently stored coreset in memory, and once the stored coreset is large enough, our compression (coreset construction algorithm) is applied. This procedure is repeated until the stream of points is empty.
The New York City taxi data contains information about the locations of passengers as well as the locations of their destinations. Thus, the goal is to find a location which is close to the most frequent destinations. This problem can be formulated as a facility location problem, which can be reduced to an instance of the 1-mean problem. Hence, since our methods admit a faster solution as well as a provable approximation for this facility location problem, we can leverage our coreset to speed up the computations on this dataset.
Finally, regarding the remaining datasets, PCA has been widely used either for low-dimensional embedding or, e.g., to compute the arithmetic mean. By using our methods, we can speed up the PCA while still providing a provably approximate solution.
The experiments.
(i)
Vector summarization: The goal is to approximate the mean of a huge input set, using only a small weighted subset of the input. The empirical approximation error is defined as ‖μ − μ_s‖₂, where μ is the mean of the full data and μ_s is the mean of the weighted subset computed via each compared algorithm; see Figure 2 and Figure 3.
In Figure 2, we report the empirical approximation error ‖μ − μ_s‖₂ as a function of the subset (coreset) size for each of the datasets (i)–(ii), while in Figure 3 we report the overall computational time for computing the subset (coreset) and for solving the 1-mean problem on the coreset, as a function of the subset size.
(ii)
k-SVD: The goal is to compute the optimal k-dimensional non-affine subspace of a given input set. We can either compute the optimal subspace using the original (full) input set, or using a weighted subset (coreset) of the input. We denote by S* and S′ the optimal subspaces computed using the full data and using the subset at hand, respectively. The empirical approximation error is defined as the ratio |(c − c′)/c|, where c and c′ are the sums of squared distances from the points of the original input set to S* and S′, respectively; see Figure 4, Figure 5 and Figure 6. Intuitively, this ratio represents the relative SSD error of recovering an optimal k-dimensional non-affine subspace from the compression rather than from the full data.
In Figure 4 we report the empirical error |(c − c′)/c| as a function of the coreset size. In Figure 5 we report the overall computational time it took to compute the coreset and to recover the optimal subspace using the coreset, as a function of the coreset size. Both figures contain three subfigures, one for each chosen value of k (the dimension of the subspace). Finally, in Figure 6 the x-axis is the size of the dataset (which we compress to a subset of size 150), while the y-axis is the approximation error in the left-hand graph and the overall computational time it took to compute the coreset and to recover the optimal subspace using the coreset in the right-hand graph.
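For completeness, the two empirical error measures used in these experiments can be computed as follows; the helper names are ours, and the optimal subspaces are obtained from NumPy's SVD:

```python
import numpy as np

def vector_summarization_error(Q, idx, u):
    """|| mu - mu_s ||_2 between the full mean and the weighted-subset mean."""
    mu = Q.mean(axis=0)
    mu_s = (u / u.sum()) @ Q[idx]
    return np.linalg.norm(mu - mu_s)

def ksvd_relative_error(A, idx, u, k):
    """|(c - c') / c| for the optimal k-dim non-affine subspaces of A and of the coreset."""
    def opt_subspace(M, weights, k):
        # Optimal subspace = span of the top-k right singular vectors of diag(sqrt(w)) M.
        _, _, Vt = np.linalg.svd(np.sqrt(weights)[:, None] * M, full_matrices=False)
        return Vt[:k]
    ssd = lambda M, V: np.sum(M ** 2) - np.sum((M @ V.T) ** 2)  # sum of dist^2 to span(V)
    V_full = opt_subspace(A, np.ones(len(A)), k)
    V_core = opt_subspace(A[idx], u, k)
    c, c_prime = ssd(A, V_full), ssd(A, V_core)
    return abs(c - c_prime) / c
```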
Figure 2. Experimental results for vector summarization. The x-axis is the size of the subset (coreset), while the y-axis is the approximation error ‖μ − μ_s‖₂. The two graphs differ in the chosen dataset.
Figure 3. Experimental results for vector summarization. The x-axis is the size of the subset (coreset), while the y-axis is the overall time it took to compute the coreset and to solve the problem on it. The two graphs differ in the chosen dataset.
Figure 4. Experimental results for k-SVD on Dataset (iii). The x-axis is the size of the subset (coreset), while the y-axis is the approximation error ε. The three graphs differ in the chosen low dimension k.
Figure 5. Experimental results for k-SVD on Dataset (iii). The x-axis is the size of the subset (coreset), while the y-axis is the overall time it took to compute the coreset and to solve the problem on it. The three graphs differ in the chosen low dimension k.
Figure 6. Experimental results for k-SVD on Dataset (iii). The x-axis is the size of the dataset, which we compress to a subsample of size 150; the y-axis is the approximation error in the left-hand graph, and the overall time it took to compute the coreset and to solve the problem on it in the right-hand graph.

5.1. Discussion

Vector summarization experiment: As predicted by the theory and as demonstrated in Figure 2 and Figure 3, our fast and deterministic algorithm Our-fast-sum (the red line in the figures) achieves the same or smaller approximation errors in most cases compared to the deterministic alternatives Our-slow-sum (orange line) and ICML17 (green line), while being up to 10× faster. Hence, when we seek a fast deterministic coreset for the vector summarization problem, our algorithm Our-fast-sum is the favorable choice.
Compared to the randomized alternatives, Our-fast-sum is obviously slower, but achieves an error more than 3 orders of magnitude smaller. Furthermore, our fast and randomized algorithm Our-rand-sum (brown line) consistently achieves better results than the other randomized alternatives: it yields an approximation error up to 50× smaller, while maintaining the same computational time. This is demonstrated on both datasets. Hence, our compression can be used to speed up tasks such as computing the PCA or PCR, as described above.
k-SVD experiment: Here, in Figure 4, Figure 5 and Figure 6, we witness a similar phenomenon, where our fast and deterministic algorithm Our-fast-svd achieves the same or smaller approximation errors compared to the deterministic alternatives Our-slow-svd and NIPS16, respectively, while being up to 4× faster. Compared to the randomized alternatives, Our-fast-svd is slower, as predicted, but achieves an error up to 2 orders of magnitude smaller. This is demonstrated for increasing sample sizes (Figure 4 and Figure 5), for increasing dataset size (Figure 6), and for various values of k (Figure 4, Figure 5 and Figure 6).

5.2. Conclusions and Future Work

This paper generalizes the definitions of ε-sample and coreset from the worst-case error over every query to the average ℓ₂ error. We then showed a reduction from the problem of computing such coresets to the vector summarization coreset construction problem. Here, we suggested deterministic and randomized algorithms for computing such coresets: the deterministic version takes O(min{nd/ε, nd + d log²(n)/ε²}) time, and the randomized one O(d log²(1/δ) + d log(1/δ)/ε) time. Finally, we showed how to leverage an ε²-coreset for the vector summarization problem in order to compute an ε-coreset for the 1-mean problem, and similarly for the k-SVD and k-PCA problems, via computing an (ε/k)² vector summarization coreset after some preprocessing of the data.
Open problems include generalizing these results to other norms, or to other functions such as M-estimators that are robust to outliers. We hope that the source code and the promising experimental results will also encourage practitioners to use these new types of approximations. Normalization via this new sensitivity type reduced the bounds on the number of iterations of the Frank–Wolfe algorithm by orders of magnitude. We believe that it can be used more generally for provably faster convex optimization, independently of coresets or ε-samples. We leave this for future research.

Author Contributions

Conceptualization, A.M., I.J., M.T. and D.F.; methodology, A.M. and I.J.; software, A.M. and M.T.; validation, A.M. and M.T.; formal analysis, A.M., I.J., M.T. and D.F.; investigation, A.M., I.J., M.T. and D.F.; resources, D.F.; data curation, A.M., I.J. and M.T.; writing—original draft preparation, A.M., I.J. and M.T.; writing—review and editing, A.M., I.J. and M.T.; visualization, I.J.; supervision, D.F.; project administration, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare that one of the authors is Prof. Dan Feldman who is a guest editor of special issue “Sensor Data Summarization: Theory, Applications, and Systems”.

Appendix A. Problem Reduction for Vector Summarization ε-Coresets

First, we define a normalized weighted set, which is simply a weighted set that satisfies three properties: its weights sum to one, it has zero mean, and it has unit variance.
Definition A1
(Normalized weighted set). A normalized weighted set is a weighted set (P, w), where P = {p_1, …, p_n} ⊆ R^d and w = (w_1, …, w_n)^T ∈ R^n satisfy the following properties:
(a)
Weights sum to one: ∑_{i=1}^n w_i = 1,
(b)
The weighted sum is the origin: ∑_{i=1}^n w_i p_i = 0, and
(c)
Unit variance: ∑_{i=1}^n w_i ‖p_i‖² = 1.

Appendix A.1. Reduction to Normalized Weighted Set

In this section, we argue that in order to compute a vector summarization ε -coreset for an input weighted set ( Q , m ) , it suffices to compute a vector summarization ε -coreset for its corresponding normalized (and much simpler) weighted set ( P , w ) as in Definition A1; see Corollary A1. However, first, in Observation A1, we show how to compute a corresponding normalized weighted set ( P , w ) for any input weighted set ( Q , m ) .
Observation A1.
Let Q = {q_1, …, q_n} be a set of n ≥ 2 points in R^d, let m ∈ (0, ∞)^n, and let w ∈ (0, 1]^n be the distribution vector w = m/‖m‖₁. Let μ = ∑_{i=1}^n w_i q_i and σ = (∑_{i=1}^n w_i ‖q_i − μ‖²)^{1/2}. Let P = {p_1, …, p_n} be the set of n points in R^d such that p_j = (q_j − μ)/σ for every j ∈ [n]. Then, (P, w) is the corresponding normalized weighted set of (Q, m), i.e., (i)–(iii) hold as follows:
(i)
∑_{i=1}^n w_i = 1,
(ii)
∑_{i=1}^n w_i p_i = 0, and
(iii)
∑_{i=1}^n w_i ‖p_i‖² = 1.
Proof. 
∑_{i=1}^n w_i = 1 holds immediately by the definition of w.
∑_{i=1}^n w_i p_i = ∑_{i=1}^n w_i·(q_i − μ)/σ = (1/σ)(∑_{i=1}^n w_i q_i − ∑_{i=1}^n w_i μ) = (1/σ)(μ − ∑_{i=1}^n w_i μ) = (1/σ)·μ·(1 − ∑_{i=1}^n w_i) = 0,
where the first equality holds by the definition of p_i, the third holds by the definition of μ, and the last holds since w is a distribution vector.
∑_{i=1}^n w_i ‖p_i‖² = ∑_{i=1}^n w_i ‖(q_i − μ)/σ‖² = (1/σ²)∑_{i=1}^n w_i ‖q_i − μ‖² = (∑_{i=1}^n w_i ‖q_i − μ‖²)/(∑_{i=1}^n w_i ‖q_i − μ‖²) = 1,
where the first and third equalities hold by the definition of p_i and σ, respectively. □
Corollary A1.
Let (Q, m) be a weighted set, and let (P, w) be its corresponding normalized weighted set as computed in Observation A1. Let (P, u′) be a vector summarization ε-coreset for (P, w) and let u = ‖m‖₁·u′. Then (Q, u) is a vector summarization ε-coreset for (Q, m).
Proof. 
Put x ∈ R^d and let y = (x − μ)/σ. Now, for every j ∈ [n], we have that
‖q_j − x‖² = ‖σp_j + μ − (σy + μ)‖² = ‖σp_j − σy‖² = σ²‖p_j − y‖²,   (A1)
where the first equality is by the definition of y and p_j.
Let (P, u′) be a vector summarization ε-coreset for (P, w). We prove that (Q, u) is a vector summarization ε-coreset for (Q, m). We observe the following:
‖∑_{i=1}^n (m_i/‖m‖₁)q_i − ∑_{i=1}^n (u_i/‖u‖₁)q_i‖² = ‖∑_{i=1}^n w_i q_i − ∑_{i=1}^n (u_i/‖u‖₁)q_i‖² = ‖∑_{i=1}^n (w_i − u_i/‖u‖₁)(σp_i + μ)‖² = ‖σ∑_{i=1}^n (w_i − u_i/‖u‖₁)p_i + μ∑_{i=1}^n (w_i − u_i/‖u‖₁)‖² = ‖∑_{i=1}^n (w_i − u_i/‖u‖₁)(σp_i)‖² ≤ εσ²,   (A2)
where the first equality holds since w = m/‖m‖₁ and u_i/‖u‖₁ = ‖m‖₁u′_i/‖‖m‖₁u′‖₁ = u′_i/‖u′‖₁, the second holds by (A1), the fourth holds since ∑_{i=1}^n (w_i − u_i/‖u‖₁) = 0, and the last inequality holds since (P, u′) is a vector summarization ε-coreset for (P, w). □

Appendix A.2. Vector Summarization Problem Reduction

Given a normalized weighted set (P, w) as in Definition A1, in the following lemma we prove that a weighted set (P, u) is a vector summarization ε-coreset for the normalized weighted set (P, w) if and only if the squared ℓ₂ norm of the weighted mean of (P, u) is at most ε.
Lemma A2.
Let (P, w) be a normalized weighted set of n points in R^d, ε ∈ (0, 1), and let u ∈ R^n be a weights vector. Let p̄ = ∑_{i=1}^n (w_i/‖w‖₁)p_i, s̄ = ∑_{i=1}^n (u_i/‖u‖₁)p_i, and σ² = ∑_{i=1}^n (w_i/‖w‖₁)‖p_i − p̄‖². Then, (P, u) is a vector summarization ε-coreset for (P, w), i.e., ‖p̄ − s̄‖² ≤ ε·σ², if and only if ‖s̄‖² ≤ ε.
Proof. 
The proof holds since (P, w) is a normalized weighted set, i.e., p̄ = 0 and σ² = 1. □

Appendix B. Frank–Wolfe Theorem

Here, for completeness we state the Frank–Wolfe Theorem [48]. This theorem will be used in the proof of Theorem 1 that shows how to compute a variant of vector summarization coreset for points inside the unit ball.
To do so, we consider the measure C_f defined in [48]; see equality (9) in Section 2.2 there. For a simplex S and a concave function f, the quantity C_f is defined as
C_f := sup (1/α²)·(f(x) + (y − x)^T ∇f(x) − f(y)),   (A3)
where the supremum is over every x and z in S, and over every α such that y = x + α(z − x) is also in S. The set of such α includes [0, 1], but α can also be negative.
Theorem A3 
(Theorem 2.2 from [48]). For a simplex S and a concave function f, Algorithm 1 (Algorithm 1.1 from [48]) finds a point x^(k) on a k-dimensional face of S such that
f(x*) − f(x^(k)) ≤ 4C_f·(1/(k + 3)),   (A4)
for k > 0, where f(x*) is the optimal value of f.

Appendix C. Proof of Theorem 1

Proof of Theorem 1. 
Let C_f be defined for f and S as in (A3), and let f(x*) be the maximum value of f in S. Based on Theorem A3 we have:
  • ũ is a point on an (8/ε)-dimensional face of S, i.e., ‖ũ‖₀ ≤ 8/ε, ũ ∈ S ⊆ [0,1]^n and ∑_{i=1}^n ũ_i = 1. Hence, claim (i) of this theorem is satisfied.
  • f(x*) − f(x^(k)) ≤ 4C_f·(1/(k + 3)), for every k ∈ {0, …, 8/ε}.
Since f(x) ≤ 0 for every x ∈ S, we have that
f(x*) = f(w) = −‖∑_{i=1}^n (w_i − w_i)p_i‖² = 0.
Define A to be the d×n matrix whose ith column is the ith point of P, and let μ = ∑_{i=1}^n w_i p_i. We get that
f(x) = −‖∑_{i=1}^n (w_i − x_i)p_i‖² = −‖μ − ∑_{i=1}^n x_i p_i‖² = −‖μ‖² + 2μ^T(∑_{i=1}^n x_i p_i) − ‖∑_{i=1}^n x_i p_i‖² = −‖μ‖² + 2μ^T Ax − ‖Ax‖² = −‖μ‖² + 2x^T A^T μ − x^T A^T A x,
where the second equality holds by the definition of μ, and the fourth equality holds since ∑_{i=1}^n x_i p_i = Ax for every x ∈ R^n.
In Section 2.2 of [48], it was shown that for any quadratic function $f: \mathbb{R}^n \to \mathbb{R}$ that is defined as
\[
f(x) = a + x^T b + x^T M x,
\]
where $M$ is a negative semidefinite $n \times n$ matrix, $b \in \mathbb{R}^n$ is a vector, and $a \in \mathbb{R}$, we have that $C_f \le \mathrm{diam}(AS)^2$, where $A \in \mathbb{R}^{d\times n}$ is a matrix that satisfies $M = -A^T A$; see equality (12) at [48].
Hence, plugging $a = -\|\mu\|^2$, $b = 2A^T\mu$, and $M = -A^T A$ in (A5) yields that for the function $f$ we have $C_f \le \mathrm{diam}(AS)^2$, and
\[
\mathrm{diam}(AS)^2 = \sup_{a,b \in AS} \|a - b\|_2^2 = \sup_{x,y \in S} \|Ax - Ay\|_2^2.
\]
Observe that x and y are distribution vectors, thus
\[
\sup_{x,y\in S} \|Ax - Ay\|_2^2 = \sup_{i,j} \|p_i - p_j\|_2^2.
\]
Since $\|p_i\| \le 1$ for each $i \in [n]$, we have that
\[
\sup_{i,j} \|p_i - p_j\|_2^2 \le 2.
\]
By substituting $C_f \le 2$, $k = 8/\varepsilon$, $f(x^{(k)}) = f(\tilde{u}) = -\left\|\sum_{i=1}^n (w_i - \tilde{u}_i)\,p_i\right\|^2$, and $f(x^*) = 0$ in (2) we get that
\[
\left\|\sum_{i=1}^n (w_i - \tilde{u}_i)\,p_i\right\|^2 \le 8\cdot\frac{1}{8/\varepsilon + 3}.
\]
Since $8/\varepsilon + 3 \ge 8/\varepsilon$, rearranging proves claim (ii) of the theorem as
\[
\left\|\sum_{i=1}^n (w_i - \tilde{u}_i)\,p_i\right\|^2 \le \frac{8}{8/\varepsilon + 3} \le \frac{8}{8/\varepsilon} = \varepsilon.
\]
Running time: We have $K = 8/\varepsilon$ iterations in Algorithm 1, where each iteration takes $O(nd)$ time, since the gradient of $f$ at a vector $x = (x_1, \ldots, x_n)^T \in S$ is $2A^T\sum_{i=1}^n (w_i - x_i)p_i$. Computing this term amounts to multiplying a matrix in $\mathbb{R}^{n\times d}$ by a vector in $\mathbb{R}^d$, which takes $O(nd)$ time. Hence, the running time of the algorithm is $O(nd/\varepsilon)$. □
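To make the running-time argument concrete, the following is a minimal Python/NumPy sketch of a Frank–Wolfe iteration for the objective analyzed above. The fixed step size $\alpha_t = 2/(t+2)$ and the starting vertex are simplifying assumptions of this sketch; Algorithm 1.1 of [48], used by the paper's Algorithm 1, performs an exact line search instead.

```python
import numpy as np

def vector_summarization_coreset(P, w, eps):
    """Sketch of a Frank-Wolfe scheme for maximizing the concave objective
    f(x) = -||sum_i (w_i - x_i) p_i||^2 over the simplex, run for ~8/eps
    iterations. P: (n, d) array of points (assumed inside the unit ball),
    w: nonnegative weights summing to 1. Returns a sparse distribution x."""
    n, d = P.shape
    A = P.T                               # d x n; the i-th column is p_i
    mu = A @ w                            # weighted mean, sum_i w_i p_i
    x = np.zeros(n)
    x[0] = 1.0                            # start at a vertex of the simplex
    for t in range(int(np.ceil(8.0 / eps))):
        grad = 2.0 * A.T @ (mu - A @ x)   # gradient of f at x, computed in O(nd)
        j = int(np.argmax(grad))          # best simplex vertex e_j for the linear subproblem
        alpha = 2.0 / (t + 2.0)           # simplified step size (assumption; see lead-in)
        x *= (1.0 - alpha)                # convex combination of x and e_j
        x[j] += alpha
    return x                              # at most (8/eps + 1) nonzero entries
```

Each iteration costs $O(nd)$ for the gradient, so roughly $8/\varepsilon$ iterations give the $O(nd/\varepsilon)$ total stated above.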

Appendix D. Proof of Theorem 2

Proof of Theorem 2. 
Let $(P,w)$ be the normalized weighted set that is computed at Lines 3–5 of Algorithm 2, where $P = \{p_1, \ldots, p_n\}$, and let $\tilde{u} = \frac{u'}{\|m\|_1}$. We show that $(P,\tilde{u})$ is a vector summarization $\varepsilon$-coreset for $(P,w)$; then, by Corollary A1, we get that $(Q,u')$ is a vector summarization $\varepsilon$-coreset for $(Q,m)$. For every $i \in [n]$ let $w'_i$, $u_i$, $u'_i$ and $p'_i$ be defined as in Algorithm 2, and let $\varepsilon' = \frac{\varepsilon}{16}$. First, by the definition of $u$ we have that
\[
\|u\|_0 \le \frac{8}{\varepsilon'} = \frac{128}{\varepsilon},
\]
and since $u'_i = \frac{2\|m\|_1\, u_i}{\|(p_i^T \mid 1)^T\|_2^2}$ for every $i \in [n]$, we get that
\[
\|u'\|_0 \le \frac{128}{\varepsilon}.
\]
We also have by Theorem 1 that
\begin{align}
4\varepsilon' &\ge 4\left\|\sum_{i=1}^n (w'_i - u_i)\,p'_i\right\|^2 \tag{A10}\\
&= 4\left\|\sum_{i=1}^n \left(\frac{w_i\|(p_i^T\mid 1)^T\|_2^2}{2} - \frac{u'_i\,\|(p_i^T\mid 1)^T\|_2^2}{2\|m\|_1}\right)\cdot\frac{(p_i^T\mid 1)^T}{\|(p_i^T\mid 1)^T\|_2^2}\right\|^2 = \left\|\sum_{i=1}^n (w_i - \tilde{u}_i)\cdot(p_i^T\mid 1)^T\right\|^2 \tag{A11}\\
&= \left\|\Big(\sum_{i=1}^n (w_i - \tilde{u}_i)\,p_i^T \;\Big|\; \sum_{i=1}^n (w_i - \tilde{u}_i)\Big)^T\right\|^2 \tag{A12}\\
&\ge \left\|\sum_{i=1}^n (w_i - \tilde{u}_i)\,p_i\right\|^2, \tag{A13}
\end{align}
where the first inequality holds by the definition of $u$ at Line 11 of Algorithm 2 (Theorem 1), the second step holds by the definitions of $p'_i$, $w'_i$ and $u'_i$ at Lines 8, 9, and 12 of the algorithm, the third holds since $\tilde{u} = \frac{u'}{\|m\|_1}$, and the last inequality holds since $\|(x^T \mid y)^T\|^2 \ge \|x\|^2$ for every $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$. Combining the fact that $\sum_{i=1}^n w_i p_i = 0$ with (A13) yields that
\[
4\varepsilon' \ge \left\|\sum_{i=1}^n \tilde{u}_i\,p_i\right\|^2. \tag{A14}
\]
By (A12) and since w is a distribution vector we also have that
\[
4\varepsilon' \ge \left|\sum_{i=1}^n (w_i - \tilde{u}_i)\right|^2 = \left|1 - \sum_{i=1}^n \tilde{u}_i\right|^2,
\]
which implies
\[
2\sqrt{\varepsilon'} \ge \left|1 - \sum_{i=1}^n \tilde{u}_i\right|. \tag{A15}
\]
Combining (A15) and (A14) yields that:
\[
\left\|\frac{\sum_{i=1}^n \tilde{u}_i\,p_i}{\sum_{i=1}^n \tilde{u}_i}\right\|^2 \le \frac{4\varepsilon'}{(1 - 2\sqrt{\varepsilon'})^2} \le 16\varepsilon' = \varepsilon, \tag{A16}
\]
where the second inequality holds since $\varepsilon' = \varepsilon/16 \le 1/16$, so that $(1 - 2\sqrt{\varepsilon'})^2 \ge 1/4$.
By Lemma A2, Corollary A1, and (A16), Theorem 2 holds as
\[
\left\|\sum_{i=1}^n \frac{u'_i}{\|u'\|_1} q_i - \sum_{i=1}^n \frac{m_i}{\|m\|_1} q_i\right\|_2^2 = \|\mu_{u'} - \mu_m\|_2^2 \le \varepsilon\,\sigma_m^2. \qquad \square
\]
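The proof suggests the following end-to-end sketch of the reduction (our reading of Algorithm 2; the exact lifting $p'_i = (p_i^T \mid 1)^T / \|(p_i^T \mid 1)^T\|_2^2$, the reweighting, and the constant $\varepsilon/16$ are taken from the reconstruction above and should be checked against the paper's pseudocode). It reuses the hypothetical helpers normalize_weighted_set and vector_summarization_coreset from the earlier sketches.

```python
import numpy as np

def coreset_for_weighted_set(Q, m, eps):
    """Sketch of the reduction in the proof of Theorem 2: normalize (Q, m),
    lift each point into the unit ball, make the lifted weights a distribution,
    run the Frank-Wolfe routine with eps' = eps/16, and map the weights back."""
    P, w, mu, sigma = normalize_weighted_set(Q, m)       # sketch after Lemma A2
    norms_sq = (P ** 2).sum(axis=1) + 1.0                # ||(p_i | 1)||_2^2
    P_lift = np.hstack([P, np.ones((len(P), 1))]) / norms_sq[:, None]
    w_lift = w * norms_sq / 2.0                          # sums to 1 for a normalized set
    u = vector_summarization_coreset(P_lift, w_lift, eps / 16.0)
    return m.sum() * 2.0 * u / norms_sq                  # sparse weights u' for (Q, m)
```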

Appendix E. Proof of Theorem 3

Proof of Theorem 3. 
We use the notation and variable names as defined in Algorithm 3.
First, we assume that $w(p) > 0$ for every $p \in P$; otherwise we remove all the points in $P$ which have zero weight, since they do not contribute to the weighted sum. Denote the input set by $P = \{p_1, \ldots, p_n\}$ and the set that is computed at Line 13 of Algorithm 3 by $C = \{c_1, \ldots, c_{|C|}\}$. We first prove that the weighted set $(C,u)$ that is computed in Lines 13–15 at an arbitrary iteration satisfies:
(a)
$C \subseteq P$,
(b)
$\sum_{p \in C} u(p) = 1$,
(c)
$\left\|\sum_{p \in P} w(p)\cdot p - \sum_{p \in C} u(p)\cdot p\right\|^2 \le \frac{\varepsilon}{\log(n)}$, and
(d)
$|C| \le \frac{|P|}{2}$.
Let $(\tilde{\mu}, \tilde{u})$ be the vector summarization $\frac{\varepsilon}{\log(n)}$-coreset of the weighted set $(\{\mu_1, \ldots, \mu_k\}, w)$ that is computed during the execution of the current iteration at Line 12. Hence, by Theorem 1,
\[
\left\|\sum_{\mu_i \in \tilde{\mu}} \tilde{u}(\mu_i)\,\mu_i - \sum_{i=1}^k w(\mu_i)\cdot\mu_i\right\|^2 \le \frac{\varepsilon}{\log(n)}, \qquad \tilde{\mu} \subseteq \{\mu_1, \ldots, \mu_k\}, \qquad |\tilde{\mu}| \le \frac{8\log(n)}{\varepsilon}. \tag{A17}
\]
Proof of (a). Property (a) is satisfied by Line 13, as we have that $C \subseteq P$.
Proof of (b). Property (b) is also satisfied since
\[
\sum_{p\in C} u(p) = \sum_{\mu_i \in \tilde{\mu}} \sum_{p \in P_i} \frac{\tilde{u}(\mu_i)\, w(p)}{w(\mu_i)} = \sum_{\mu_i \in \tilde{\mu}} \frac{\tilde{u}(\mu_i)}{w(\mu_i)} \sum_{p\in P_i} w(p) = \sum_{\mu_i \in \tilde{\mu}} \tilde{u}(\mu_i)\, \frac{\sum_{p\in P_i} w(p)}{\sum_{p\in P_i} w(p)} = \sum_{\mu_i\in\tilde{\mu}} \tilde{u}(\mu_i) = 1,
\]
where the first equality holds by the definition of $C$ at Line 13 and of $u(p)$ for every $p \in C$ at Line 15, and the third equality holds by the definition of $w(\mu_i)$ for every $\mu_i \in \tilde{\mu}$ as in Line 10.
Proof of (c). By the definition of $w$ and $\mu_i$, for every $i \in \{1, \ldots, k\}$,
\[
\sum_{i=1}^k w(\mu_i)\cdot\mu_i = \sum_{i=1}^k w(\mu_i)\cdot\frac{1}{w(\mu_i)}\sum_{p\in P_i} w(p)\cdot p = \sum_{i=1}^k \sum_{p\in P_i} w(p)\, p = \sum_{p\in P} w(p)\, p. \tag{A19}
\]
The weighted sum of $(C,u)$ is
\[
\sum_{p\in C} u(p)\, p = \sum_{\mu_i\in\tilde{\mu}} \sum_{p\in P_i} \frac{\tilde{u}(\mu_i)\, w(p)}{w(\mu_i)}\cdot p = \sum_{\mu_i\in\tilde{\mu}} \frac{\tilde{u}(\mu_i)}{w(\mu_i)}\sum_{p\in P_i} w(p)\, p = \sum_{\mu_i\in\tilde{\mu}} \tilde{u}(\mu_i)\,\mu_i, \tag{A20}
\]
where the first equality holds by the definitions of $C$ and $u$, and the third equality holds by the definition of $\mu_i$ at Line 9. Plugging (A19) and (A20) into (A17) satisfies (c) as
\[
\left\|\sum_{p\in P} w(p)\cdot p - \sum_{p\in C} u(p)\cdot p\right\|^2 \le \frac{\varepsilon}{\log(n)}.
\]
Proof of (d). By (A17) we have that $C$ contains points from at most $\frac{\log(n)}{\varepsilon}$ clusters of $P$, and hence at most $|C| \le \frac{\log(n)}{\varepsilon}\cdot\frac{n}{k}$ points; by plugging $k = \frac{2\log(n)}{\varepsilon}$ we obtain that $|C| \le \frac{|P|}{2}$ as required.
We now prove (i)–(iii) from Theorem 3.
Proof of Theorem 3 (i). The first condition, $|C| \le 8/\varepsilon$, in (i) is satisfied since at each iteration we reduce the data size by a factor of 2, and we keep reducing until the stopping condition is reached, i.e., until the data is of size $O\!\left(\frac{\log(n)}{\varepsilon}\right)$ (since we require an $\frac{\varepsilon}{\log(n)}$ error when we apply Theorem 1, we obtain a coreset of size $O\!\left(\frac{\log(n)}{\varepsilon}\right)$ at each iteration). Then, at Line 5, once the stopping condition is satisfied, we apply Theorem 1 once more with error $\varepsilon$ on the small data (of size $O\!\left(\frac{\log(n)}{\varepsilon}\right)$) to obtain a coreset of size $8/\varepsilon$.
The second condition in (i) is satisfied since, by (b), the sum of weights of the pair $(C,u)$ returned at Line 18 is always equal to 1.
Proof of Theorem 3 (ii). By (d) we also get that there are at most $\log(n)$ recursive calls. Hence, by induction on (c) we conclude that the last computed set $(C,u)$ at Line 18 satisfies (ii) as
\[
\left\|\sum_{p\in P} w(p)\cdot p - \sum_{p\in C} u(p)\cdot p\right\|^2 \le \log(n)\cdot\frac{\varepsilon}{\log(n)} = \varepsilon.
\]
At Line 5 we return an $\varepsilon$-coreset for the weighted set that has reached a size of $O\!\left(\frac{\log(n)}{\varepsilon}\right)$. Hence, the output of the call satisfies $\left\|\sum_{p\in P} w(p)\cdot p - \sum_{p\in C} u(p)\cdot p\right\|^2 \le 2\varepsilon$.
Proof of Theorem 3 (iii). As explained before, there are at most $\log(n)$ recursive calls before the stopping condition at Line 4 is met. At each iteration we compute the set of means and a vector summarization $\frac{\varepsilon}{\log(n)}$-coreset for them. Hence, the time complexity of each iteration is $O(n'd) + T\!\left(k, d, \frac{\varepsilon}{\log(n)}\right)$, where $n'$ is the number of points in the current iteration and $T\!\left(k, d, \frac{\varepsilon}{\log(n)}\right)$ is the running time of Algorithm 1 on $k$ points in $\mathbb{R}^d$ for obtaining an $\frac{\varepsilon}{\log(n)}$-coreset. Thus, the total running time of the algorithm until the "If" condition at Line 4 is satisfied is
\[
\sum_{i=1}^{\log(n)} \left(\frac{nd}{2^{i-1}} + T\!\left(k, d, \frac{\varepsilon}{\log(n)}\right)\right) \le 2nd + \log(n)\cdot T\!\left(k, d, \frac{\varepsilon}{\log(n)}\right) \in O\!\left(nd + \frac{kd\log(n)}{\varepsilon}\right).
\]
Plugging $k = \frac{2\log(n)}{\varepsilon}$ and observing that the last compression at Line 5 is done on data of size $O\!\left(\frac{\log(n)}{\varepsilon}\right)$ proves (iii), as the running time of Algorithm 3 is $O\!\left(nd + \frac{\log(n)^2 d}{\varepsilon^2}\right)$. □
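The recursion analyzed above can be sketched as follows (a simplification, not the paper's exact Algorithm 3; the constants differ slightly from $k = 2\log(n)/\varepsilon$ and are chosen so that this sketch provably halves the data each round). It reuses the hypothetical vector_summarization_coreset helper from the earlier sketch.

```python
import numpy as np

def recursive_vector_summarization(P, w, eps):
    """Sketch of recursive halving: partition the data into k clusters, summarize
    the cluster means with error eps/log(n), keep only the clusters whose means
    receive a nonzero weight, and recurse; finish with one Frank-Wolfe call.
    Assumes w is a distribution with strictly positive entries.
    Returns (indices into P, new weights)."""
    n = len(P)
    log_n = max(np.log2(n), 1.0)
    s = int(np.ceil(8.0 * log_n / eps)) + 1        # sparsity of an eps/log(n)-coreset
    k = 4 * s                                      # number of clusters per round
    if n <= max(k, int(np.ceil(8.0 / eps))):       # small enough: summarize directly
        return np.arange(n), vector_summarization_coreset(P, w, eps)
    groups = np.array_split(np.arange(n), k)       # clusters of roughly n/k points
    mean_w = np.array([w[g].sum() for g in groups])
    means = np.array([(w[g, None] * P[g]).sum(axis=0) / mw
                      for g, mw in zip(groups, mean_w)])
    u_means = vector_summarization_coreset(means, mean_w, eps / log_n)
    kept = [i for i in range(k) if u_means[i] > 0]
    keep_idx = np.concatenate([groups[i] for i in kept])
    new_w = np.concatenate([u_means[i] * w[groups[i]] / mean_w[i] for i in kept])
    sub_idx, sub_w = recursive_vector_summarization(P[keep_idx], new_w, eps)
    return keep_idx[sub_idx], sub_w
```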

Appendix F. Proof of Corollary 4

Proof. 
The corollary immediately holds by using Algorithm 2 with a small change. We change Line 11 in Algorithm 2 to use Algorithm 3 and Theorem 3, instead of Algorithm 1 and Theorem 1. □

Appendix G. Proof of Lemma 5

We first prove the following lemma:
Lemma A4.
Let $P$ be a set of $n$ points in $\mathbb{R}^d$, $\mu = \frac{1}{n}\sum_{p\in P} p$, and $\sigma^2 = \frac{1}{n}\sum_{p\in P}\|p-\mu\|^2$. Let $\varepsilon, \delta \in (0,1)$, and let $S$ be a sample of $m = \frac{1}{\varepsilon\delta}$ points chosen i.i.d. uniformly at random from $P$. Then, with probability at least $1-\delta$ we have that $\left\|\frac{1}{m}\sum_{p\in S} p - \mu\right\|^2 \le \varepsilon\sigma^2$.
Proof. 
For any random variable $X$, we denote by $E(X)$ and $\mathrm{var}(X)$ the expectation and variance of $X$, respectively. Let $x_i$ denote the random variable that is the $i$th sample, for every $i \in [m]$. Since the samples are drawn i.i.d., we have
\[
\mathrm{var}\!\left(\frac{1}{m}\sum_{p\in S} p\right) = \sum_{i=1}^m \mathrm{var}\!\left(\frac{x_i}{m}\right) = m\cdot\frac{\mathrm{var}(x_1)}{m^2} = \frac{\sigma^2}{m} = \varepsilon\delta\sigma^2. \tag{A22}
\]
For any random variable $X$ and error parameter $\varepsilon' > 0$, the generalized Chebyshev inequality [64] states that
\[
\Pr\!\left(\|X - E(X)\| \ge \varepsilon'\right) \le \frac{\mathrm{var}(X)}{(\varepsilon')^2}. \tag{A23}
\]
Substituting $X = \frac{1}{m}\sum_{p\in S} p$, $E(X) = \mu$ and $\varepsilon' = \sqrt{\varepsilon}\,\sigma$ in (A23) yields that
\[
\Pr\!\left(\left\|\frac{1}{m}\sum_{p\in S} p - \mu\right\| \ge \sqrt{\varepsilon}\,\sigma\right) \le \frac{\mathrm{var}\!\left(\frac{1}{m}\sum_{p\in S} p\right)}{\sigma^2\varepsilon}. \tag{A24}
\]
Combining (A22) with (A24) proves the lemma as:
\[
\Pr\!\left(\left\|\frac{1}{m}\sum_{p\in S} p - \mu\right\|^2 \ge \varepsilon\sigma^2\right) \le \frac{\varepsilon\delta\sigma^2}{\sigma^2\varepsilon} = \delta. \qquad \square
\]
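The bound of Lemma A4 is easy to sanity-check empirically; the following hedged sketch (an illustration only, with arbitrary default parameters) estimates how often the sample mean falls outside the stated radius over repeated random draws.

```python
import numpy as np

def check_sampling_bound(P, eps=0.25, delta=0.2, trials=1000, seed=0):
    """Monte-Carlo sanity check of Lemma A4: sample m = ceil(1/(eps*delta)) points
    uniformly at random and measure how often the sample mean lands farther than
    eps * sigma^2 (in squared distance) from mu. The failure rate should be ~delta."""
    rng = np.random.default_rng(seed)
    n = len(P)
    mu = P.mean(axis=0)
    sigma2 = ((P - mu) ** 2).sum(axis=1).mean()
    m = int(np.ceil(1.0 / (eps * delta)))
    fails = 0
    for _ in range(trials):
        S = P[rng.integers(0, n, size=m)]                    # i.i.d. uniform sample
        fails += ((S.mean(axis=0) - mu) ** 2).sum() > eps * sigma2
    return fails / trials
```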
Now we prove Lemma 5.
Proof. 
Let $S_1, \ldots, S_k$ be a set of $k$ i.i.d. sampled subsets, each of size $\frac{4}{\varepsilon}$, as defined at Line 5 of Algorithm 4, and let $\bar{s}_i$ be the mean of the $i$th subset $S_i$ as defined at Line 6. Let $\hat{s} := \arg\min_{x\in\mathbb{R}^d} \sum_{i=1}^k \|\bar{s}_i - x\|_2$ be the geometric median of the set of means $\bar{s}_1, \ldots, \bar{s}_k$.
Using Corollary 4.1 from [65] we obtain that
\[
\Pr\!\left(\|\hat{s} - \mu\| \ge 11\sqrt{\frac{\sigma^2\log(1.4/\delta)}{4k/\varepsilon}}\right) \le \delta,
\]
from the above we have that
\[
\Pr\!\left(\|\hat{s} - \mu\|^2 \ge \frac{121\,\varepsilon\sigma^2\log(1.4/\delta)}{4k}\right) \le \delta. \tag{A26}
\]
Note that
\begin{align}
&\Pr\!\left(\|\hat{s} - \mu\|^2 \ge \frac{121\,\varepsilon\sigma^2\log(1.4/\delta)}{4k}\right) \tag{A27}\\
&= \Pr\!\left(\|\hat{s} - \mu\|^2 \ge \frac{30.25\,\varepsilon\sigma^2\log(1.4/\delta)}{3.5\log\frac{1}{\delta} + 1}\right) \tag{A28}\\
&\ge \Pr\!\left(\|\hat{s} - \mu\|^2 \ge 31\,\varepsilon\sigma^2\right), \tag{A29}
\end{align}
where (A28) holds by substituting $k = 3.5\log\frac{1}{\delta} + 1$ as in Line 3 of Algorithm 4, and (A29) holds since $\frac{\log(1.4/\delta)}{3.5\log\frac{1}{\delta} + 1} < 1$ for every $\delta \le 0.9$ as we assumed. Combining (A29) with (A26) yields
\[
\Pr\!\left(\|\hat{s} - \mu\|^2 \ge 31\,\varepsilon\sigma^2\right) \le \delta. \tag{A30}
\]
For every $i \in [k]$, by substituting $S = S_i$, which is of size $\frac{4}{\varepsilon}$, in Lemma A4, we obtain that
\[
\Pr\!\left(\|\bar{s}_i - \mu\|^2 \ge \varepsilon\sigma^2\right) \le 1/4.
\]
Hence, with probability at least $1 - (1/4)^k$ there is at least one set $S_j$ such that
\[
\|\bar{s}_j - \mu\|^2 \le \varepsilon\sigma^2.
\]
By the following inequalities:
\[
(1/4)^k = (1/4)^{3.5\log\frac{1}{\delta} + 1} \le (1/4)^{\log(1/\delta)} = 4^{\log\delta} \le 2^{\log\delta} = \delta,
\]
we get that with probability at least $1-\delta$ there is a set $S_j$ such that
\[
\|\bar{s}_j - \mu\|^2 \le \varepsilon\sigma^2. \tag{A31}
\]
Combining (A31) with (A30) yields that, with probability at least $(1-\delta)^2$, the set $S_j$ satisfies
\[
\|\bar{s}_j - \hat{s}\|^2 \le 32\,\varepsilon\sigma^2. \tag{A32}
\]
Let $f: \mathbb{R}^d \to [0,\infty)$ be the function $f(x) = \sum_{i=1}^k \|\bar{s}_i - x\|_2$ for every $x \in \mathbb{R}^d$. Therefore, by the definitions of $f$ and $\hat{s}$,
\[
\hat{s} := \arg\min_{x\in\mathbb{R}^d} \sum_{i=1}^k \|\bar{s}_i - x\|_2 = \arg\min_{x\in\mathbb{R}^d} f(x).
\]
Observe that $f$ is a convex function since it is a sum of convex functions. By the convexity of $f$, we get that for every pair of points $p, q \in P$ it holds that:
if $f(q) \le f(p)$ then $\|q - \hat{s}\| \le \|p - \hat{s}\|$.
Therefore, by the definition of $i^*$ at Line 7 in Algorithm 4 we get that
\[
i^* \in \arg\min_{i\in[k]} \|\bar{s}_i - \hat{s}\|. \tag{A34}
\]
Now by combining (A32) with (A34) we have that:
\[
\Pr\!\left(\|\bar{s}_{i^*} - \hat{s}\|^2 \le 32\,\varepsilon\sigma^2\right) \ge (1-\delta)^2. \tag{A35}
\]
Combining (A35) with (A30) and noticing the following inequality
\[
(1-\delta)^3 = (1 - 2\delta + \delta^2)(1-\delta) \ge (1-2\delta)(1-\delta) = 1 - \delta - 2\delta + 2\delta^2 \ge 1 - 3\delta,
\]
satisfies Lemma 5 as,
\[
\Pr\!\left(\|\bar{s}_{i^*} - \mu\|^2 \le 33\,\varepsilon\sigma^2\right) \ge 1 - 3\delta.
\]
Running time. It takes $O\!\left(\frac{d\log(1/\delta)}{\varepsilon}\right)$ time to compute the set of means at Line 6, and $O\!\left(d\log(1/\delta)^2\right)$ time to compute Line 7 by a simple exhaustive search over all the means. Hence, the total running time is $O\!\left(d\log(1/\delta)^2 + \frac{d\log(1/\delta)}{\varepsilon}\right)$. □
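The following sketch mirrors the construction analyzed above (our reading of Algorithm 4). The selection rule of Line 7 is implemented here as an exhaustive search for the mean minimizing the sum of distances to the other means, which is one natural proxy for being closest to the geometric median; this is an assumption of the sketch, not the paper's exact pseudocode.

```python
import numpy as np

def robust_mean_sketch(P, eps, delta, seed=0):
    """Draw k = 3.5*log(1/delta) + 1 i.i.d. subsets of size 4/eps, take their
    means, and return the mean that minimizes the sum of distances to the other
    means (an exhaustive O(k^2 d) selection step)."""
    rng = np.random.default_rng(seed)
    n = len(P)
    k = int(np.ceil(3.5 * np.log(1.0 / delta))) + 1
    m = int(np.ceil(4.0 / eps))
    means = np.array([P[rng.integers(0, n, size=m)].mean(axis=0) for _ in range(k)])
    dists = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)  # k x k
    i_star = int(np.argmin(dists.sum(axis=1)))   # most "central" of the k means
    return means[i_star]
```

Its cost is $O(dk/\varepsilon)$ for the means plus $O(dk^2)$ for the selection, matching the running time stated above with $k = O(\log(1/\delta))$.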

Appendix H. Proof of Theorem 6

We first show a reduction to a normalized weighted set as follows:
Corollary A5.
Let $(Q,m)$ be a weighted set, and let $(P,w)$ be its corresponding normalized weighted set as computed in Observation A1. Let $(P,u)$ be a 1-mean $\varepsilon$-coreset for $(P,w)$ and let $u' = \|m\|_1\cdot u$. Then $(Q,u')$ is a 1-mean $\varepsilon$-coreset for $(Q,m)$.
Proof. 
Let $(P,u)$ be a 1-mean $\varepsilon$-coreset for $(P,w)$. We prove that $(Q,u')$ is a 1-mean $\varepsilon$-coreset for $(Q,m)$. Observe that
\begin{align}
\left|\sum_{i=1}^n (m_i - u'_i)\,\|q_i - x\|^2\right| &= \left|\sum_{i=1}^n (m_i - u'_i)\,\sigma^2\|p_i - y\|^2\right| \tag{A36}\\
&= \left|\sum_{i=1}^n \|m\|_1\sigma^2 (w_i - u_i)\,\|p_i - y\|^2\right|, \tag{A37}
\end{align}
where the first equality holds by (A1), and the second holds by the definition of $w$ and $u'$. Since $(P,u)$ is a 1-mean $\varepsilon$-coreset for $(P,w)$,
\[
\left|\sum_{i=1}^n \|m\|_1\sigma^2 (w_i - u_i)\,\|p_i - y\|^2\right| \le \varepsilon\sum_{i=1}^n \|m\|_1\sigma^2 w_i\,\|p_i - y\|^2 = \varepsilon\sum_{i=1}^n m_i\,\|q_i - x\|^2, \tag{A38}
\]
where the equality holds by (A1) and since $w = \frac{m}{\|m\|_1}$. The proof concludes by combining (A36) and (A38) as $\left|\sum_{i=1}^n (m_i - u'_i)\,\|q_i - x\|^2\right| \le \varepsilon\sum_{i=1}^n m_i\,\|q_i - x\|^2$. □

1-Mean Problem Reduction

Given a normalized weighted set $(P,w)$ as in Definition A1, in the following lemma we prove that a weighted set $(P,u)$ is a 1-mean $\varepsilon$-coreset for $(P,w)$ if three properties related to the mean, variance, and weights of $(P,u)$ hold.
Lemma A6.
Let $(P,w)$ be a normalized weighted set of $n$ points in $\mathbb{R}^d$, $\varepsilon \in (0,1)$, and $u \in \mathbb{R}^n$ such that
  • $\left\|\sum_{i=1}^n u_i p_i\right\| \le \varepsilon$,
  • $\left|1 - \sum_{i=1}^n u_i\right| \le \varepsilon$, and
  • $\left|1 - \sum_{i=1}^n u_i\cdot\|p_i\|^2\right| \le \varepsilon$.
Then, $(P,u)$ is a 1-mean $\varepsilon$-coreset for $(P,w)$, i.e., for every $x \in \mathbb{R}^d$ we have that
\[
\left|\sum_{i=1}^n (w_i - u_i)\,\|p_i - x\|^2\right| \le 2\varepsilon\sum_{i=1}^n w_i\,\|p_i - x\|^2. \tag{A39}
\]
Proof. 
First we have that
\begin{align}
\sum_{i=1}^n w_i\|p_i - x\|^2 &= \sum_{i=1}^n w_i\|p_i\|^2 - 2x^T\sum_{i=1}^n w_i p_i + \|x\|^2\sum_{i=1}^n w_i \tag{A40}\\
&= 1 + \|x\|^2, \tag{A41}
\end{align}
where the last equality holds by the attributes (a)–(c) of the normalized weighted set $(P,w)$. By rearranging the left hand side of (A39) we get
\begin{align}
\left|\sum_{i=1}^n (w_i - u_i)\,\|p_i - x\|^2\right| &= \left|\sum_{i=1}^n (w_i - u_i)\big(\|p_i\|^2 - 2p_i^T x + \|x\|^2\big)\right| \tag{A42}\\
&\le \left|\sum_{i=1}^n (w_i - u_i)\|p_i\|^2\right| + \|x\|^2\left|\sum_{i=1}^n (w_i - u_i)\right| + 2\left|x^T\sum_{i=1}^n (w_i - u_i)\,p_i\right| \tag{A43}\\
&= \left|1 - \sum_{i=1}^n u_i\|p_i\|^2\right| + \|x\|^2\left|1 - \sum_{i=1}^n u_i\right| + 2\left|x^T\sum_{i=1}^n u_i\,p_i\right| \tag{A44}\\
&\le \varepsilon + \varepsilon\|x\|^2 + 2\|x\|\left\|\sum_{i=1}^n u_i\,p_i\right\|, \tag{A45}
\end{align}
where (A43) holds by the triangle inequality, (A44) holds by attributes (a)–(c), and (A45) holds by combining assumptions (2), (3), and the Cauchy–Schwarz inequality, respectively. We also have, for every $a, b \ge 0$, that $2ab \le a^2 + b^2$; hence,
\[
2ab = 2\left(\sqrt{\varepsilon}\,a\right)\cdot\frac{b}{\sqrt{\varepsilon}} \le \varepsilon a^2 + \frac{b^2}{\varepsilon}. \tag{A46}
\]
By (A46) and assumption (1) we get that,
\[
2\|x\|\left\|\sum_{i=1}^n u_i\,p_i\right\| \le \varepsilon\|x\|^2 + \frac{\left\|\sum_{i=1}^n u_i\,p_i\right\|^2}{\varepsilon} \le \varepsilon\|x\|^2 + \frac{\varepsilon^2}{\varepsilon} = \varepsilon\|x\|^2 + \varepsilon. \tag{A47}
\]
Lemma A6 now holds by plugging (A47) in (A45) as,
\[
\left|\sum_{i=1}^n (w_i - u_i)\,\|p_i - x\|^2\right| \le \varepsilon + \varepsilon\|x\|^2 + \varepsilon\|x\|^2 + \varepsilon = 2\varepsilon + 2\varepsilon\|x\|^2 = 2\varepsilon(1 + \|x\|^2) = 2\varepsilon\sum_{i=1}^n w_i\,\|p_i - x\|^2, \tag{A48}
\]
where the last equality holds by (A41).
Observe that if assumptions (1), (2) and (3) hold, then (A48) holds, and we therefore obtain an $\varepsilon$-coreset. □
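As a quick numerical illustration of Lemma A6 (not part of the proof), the following sketch checks the three assumptions for a candidate weight vector $u$ and measures the worst relative 1-mean error over random queries; under the lemma it should not exceed $2\varepsilon$. The function name and the random-query test are illustrative assumptions.

```python
import numpy as np

def check_one_mean_bound(P, w, u, eps, trials=100, seed=0):
    """For a normalized (P, w) and a candidate u, verify assumptions (1)-(3) of
    Lemma A6 and report the worst relative 1-mean error over random queries x."""
    rng = np.random.default_rng(seed)
    a1 = np.linalg.norm((u[:, None] * P).sum(axis=0)) <= eps        # assumption (1)
    a2 = abs(1.0 - u.sum()) <= eps                                   # assumption (2)
    a3 = abs(1.0 - (u * (P ** 2).sum(axis=1)).sum()) <= eps          # assumption (3)
    worst = 0.0
    for _ in range(trials):
        x = rng.normal(size=P.shape[1])
        costs = ((P - x) ** 2).sum(axis=1)
        err = abs(((w - u) * costs).sum()) / (w * costs).sum()
        worst = max(worst, err)
    return a1 and a2 and a3, worst    # if the assumptions hold, worst <= 2*eps is expected
```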
To prove Theorem 6, we split it into two claims:
Claim A7.
Let $(Q,m)$ be a weighted set of $n$ points in $\mathbb{R}^d$, $\varepsilon \in (0,1)$, and let $u'$ be the output of a call to $\mathrm{CoreSet}(Q, m, (\varepsilon/4)^2)$; see Algorithm 2. Then $u' = (u'_1, \ldots, u'_n) \in \mathbb{R}^n$ is a vector with $\|u'\|_0 \le \frac{128}{\varepsilon^2}$ non-zero entries that is computed in $O(nd/\varepsilon^2)$ time, and $(Q,u')$ is a 1-mean $\varepsilon$-coreset for $(Q,m)$.
Proof. 
Let $(P,w)$ be the normalized weighted set that is computed at Lines 3–5 of Algorithm 2, where $P = \{p_1, \ldots, p_n\}$, and let $\tilde{u} = \frac{u'}{\|m\|_1}$. We show that $(P,\tilde{u})$ is a 1-mean $\varepsilon$-coreset for $(P,w)$; then, by Corollary A5, we get that $(Q,u')$ is a 1-mean $\varepsilon$-coreset for $(Q,m)$.
Let $\varepsilon' = \frac{\varepsilon}{4}$, and let $p'_i := \frac{(p_i^T \mid 1)^T}{\|(p_i^T \mid 1)^T\|_2^2}$ and $w'_i := \frac{w_i\,\|(p_i^T \mid 1)^T\|_2^2}{2}$ for every $i \in [n]$. By the definition of $u$ at Line 11 in Algorithm 2, and since the algorithm gets $(\varepsilon')^2$ as input, we have that
\[
\|u\|_0 \le \frac{8}{(\varepsilon')^2} = \frac{128}{\varepsilon^2}, \tag{A49}
\]
and
\[
\left\|\sum_{i=1}^n (w'_i - u_i)\,p'_i\right\|^2 \le (\varepsilon')^2. \tag{A50}
\]
For every $i \in [n]$ let $u'_i = \frac{2\|m\|_1\, u_i}{\|(p_i^T \mid 1)^T\|_2^2}$ be defined as at Line 12 of the algorithm. It immediately follows by the definition of $u' = (u'_1, \ldots, u'_n)$ and (A49) that
\[
\|u'\|_0 \le \frac{128}{\varepsilon^2}.
\]
We now prove that Properties (1)–(3) in Lemma A6 hold for ( P , u ˜ ) . We have that
\begin{align}
2\varepsilon' &\ge 2\left\|\sum_{i=1}^n (w'_i - u_i)\,p'_i\right\| = 2\left\|\sum_{i=1}^n \left(\frac{w_i\|(p_i^T\mid 1)^T\|_2^2}{2} - \frac{u'_i\,\|(p_i^T\mid 1)^T\|_2^2}{2\|m\|_1}\right)\cdot\frac{(p_i^T\mid 1)^T}{\|(p_i^T\mid 1)^T\|_2^2}\right\| \tag{A52}\\
&= \left\|\sum_{i=1}^n (w_i - \tilde{u}_i)\cdot(p_i^T\mid 1)^T\right\| \tag{A53}\\
&= \left\|\Big(\sum_{i=1}^n (w_i - \tilde{u}_i)\,p_i^T \;\Big|\; \sum_{i=1}^n (w_i - \tilde{u}_i)\Big)^T\right\| \tag{A54}\\
&\ge \left\|\sum_{i=1}^n (w_i - \tilde{u}_i)\,p_i\right\|, \tag{A55}
\end{align}
where the first derivation follows from (A50), the second holds by the definition of $w'_i$, $u_i$, $u'_i$ and $p'_i$ for every $i \in [n]$, the third holds since $\tilde{u} = \frac{u'}{\|m\|_1}$, and the last holds since $\|(x^T \mid y)^T\| \ge \|x\|$ for every $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$.
By (A54) and since w is a distribution vector we also have that
\[
2\varepsilon' \ge \left|\sum_{i=1}^n (w_i - \tilde{u}_i)\right| = \left|1 - \sum_{i=1}^n \tilde{u}_i\right|. \tag{A56}
\]
By Theorem 1, we have that u is a distribution vector, which yields,
\[
2 = 2\sum_{i=1}^n u_i = \sum_{i=1}^n \tilde{u}_i\,\|(p_i^T \mid 1)^T\|^2 = \sum_{i=1}^n \tilde{u}_i\|p_i\|^2 + \sum_{i=1}^n \tilde{u}_i.
\]
By the above we get that $2 - \sum_{i=1}^n \tilde{u}_i = \sum_{i=1}^n \tilde{u}_i\|p_i\|^2$. Hence,
\[
\left|\sum_{i=1}^n (w_i - \tilde{u}_i)\,\|p_i\|^2\right| = \left|\sum_{i=1}^n w_i\|p_i\|^2 - \Big(2 - \sum_{i=1}^n \tilde{u}_i\Big)\right| = \left|1 - \Big(2 - \sum_{i=1}^n \tilde{u}_i\Big)\right| = \left|\sum_{i=1}^n \tilde{u}_i - 1\right| \le 2\varepsilon', \tag{A57}
\]
where the first equality holds since $\sum_{i=1}^n \tilde{u}_i\|p_i\|^2 = 2 - \sum_{i=1}^n \tilde{u}_i$, the second holds since $\sum_{i=1}^n w_i\|p_i\|^2 = 1$ for the normalized weighted set $(P,w)$, and the last is by (A56). Now, by (A57), (A56) and (A55) we obtain that $(P,\tilde{u})$ satisfies Properties (1)–(3) in Lemma A6 (with $2\varepsilon'$ in place of $\varepsilon$). Hence, by Lemma A6 and Corollary A5 we get that
\[
\left|\sum_{i=1}^n (w_i - \tilde{u}_i)\,\|p_i - x\|^2\right| \le 4\varepsilon'\sum_{i=1}^n w_i\,\|p_i - x\|^2 = \varepsilon\sum_{i=1}^n w_i\,\|p_i - x\|^2.
\]
The running time is the running time of Algorithm 1 with $(\varepsilon')^2$ instead of $\varepsilon$, i.e., $O(nd/\varepsilon^2)$. □
Now we prove the following claim:
Claim A8.
Let $(Q,m)$ be a weighted set of $n$ points in $\mathbb{R}^d$ and $\varepsilon \in (0,1)$. Then, in $O\!\left(nd + \frac{d\log(n)^2}{\varepsilon^4}\right)$ time, we can compute a vector $u' = (u'_1, \ldots, u'_n)^T \in \mathbb{R}^n$ such that $u'$ has $\|u'\|_0 \le \frac{128}{\varepsilon^2}$ non-zero entries and $(Q,u')$ is a 1-mean $(2\varepsilon)$-coreset for $(Q,m)$.
Proof. 
The Claim immediately holds by using Algorithm 2 with a small change. We change Line 11 in Algorithm 2 to use Algorithm 3 and Theorem 3, instead of Algorithm 1 and Theorem 1. □
Combining Claim A7 with Claim A8 proves Theorem 6.

Appendix I. Proof of Corollary 7

Proof. 
We consider the variables defined in Algorithm 5. Let $X \in \mathbb{R}^{d\times(d-k)}$ be such that $X^T X = I$, and let $A' = [A \mid (r, \ldots, r)^T]$. Plugging $A'$ (in the role of $A$) into Theorem 3 of [35] yields
\[
\left|1 - \frac{\|WA'X\|^2}{\|A'X\|^2}\right| \le 5\left\|\sum_{i=1}^n \tilde{v}_i - W_{i,i}^2\,\tilde{v}_i\right\|. \tag{A58}
\]
We also have by the definition of W and Theorem 2
\[
\left\|\sum_{i=1}^n \tilde{v}_i - W_{i,i}^2\,\tilde{v}_i\right\| \le (\varepsilon/k)\sqrt{\sum_{i=1}^n \|\tilde{v}_i\|^2} \le (\varepsilon/k)\sum_{i=1}^n \|\tilde{v}_i\|, \tag{A59}
\]
where the first inequality holds since $W_{i,i}^2 = u_i$ for every $i \in [n]$, and the vector $u \in \mathbb{R}^n$ is a vector summarization $(\varepsilon/5k)^2$-coreset for $\left((\tilde{v}_1, \ldots, \tilde{v}_n), (1, \ldots, 1)\right)$.
Finally, in [35] it is shown that $(\varepsilon/5k)\sum_{i=1}^n \|\tilde{v}_i\| \le \varepsilon$. Hence, combining this fact with (A58) and (A59) yields
\[
\left|1 - \frac{\|WA'X\|^2}{\|A'X\|^2}\right| \le \varepsilon. \tag{A60}
\]
Finally, the corollary holds by combining Lemma 4.1 of [45] with (A60). □
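For illustration, the following sketch evaluates the quantity bounded in (A58)–(A60) for a given coreset weight vector. The choice $W_{ii} = \sqrt{u_i}$ and the random orthonormal $X$ are assumptions made for the sake of the example, not the exact construction of Algorithm 5.

```python
import numpy as np

def weighted_projection_error(A, u, k, seed=0):
    """Report |1 - ||W A X||_F^2 / ||A X||_F^2| for a diagonal weighting W built
    from a (coreset) weight vector u and a random X with d - k orthonormal columns.
    A: (n, d) matrix, u: nonnegative weights of length n, k < d."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    X, _ = np.linalg.qr(rng.normal(size=(d, d - k)))   # orthonormal columns
    W = np.diag(np.sqrt(u))
    num = np.linalg.norm(W @ A @ X) ** 2
    den = np.linalg.norm(A @ X) ** 2
    return abs(1.0 - num / den)
```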

References

  1. Valiant, L.G. A theory of the learnable. Commun. ACM 1984, 27, 1134–1142.
  2. Vapnik, V. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems; Morgan-Kaufmann: Denver, CO, USA, 1992; pp. 831–838.
  3. Feldman, D.; Langberg, M. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 569–578.
  4. Nielsen, M.A. Neural Networks and Deep Learning; Determination Press: San Francisco, CA, USA, 2015; Volume 2018.
  5. Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
  6. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
  7. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009.
  8. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67.
  9. Bergman, S. The Kernel Function and Conformal Mapping; American Mathematical Soc.: Providence, RI, USA, 1970; Volume 5.
  10. Eggleston, H.G. Convexity. J. Lond. Math. Soc. 1966, 1, 183–186.
  11. Phillips, J.M. Coresets and sketches. arXiv 2016, arXiv:1601.00617.
  12. Har-Peled, S. Geometric Approximation Algorithms; Number 173; American Mathematical Soc.: Providence, RI, USA, 2011.
  13. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
  14. Langberg, M.; Schulman, L.J. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, Austin, TX, USA, 17 January 2010; pp. 598–607.
  15. Carathéodory, C. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Math. Ann. 1907, 64, 95–115.
  16. Cook, W.; Webster, R. Caratheodory's theorem. Can. Math. Bull. 1972, 15, 293.
  17. Phillips, J.M.; Tai, W.M. Near-optimal coresets of kernel density estimates. Discret. Comput. Geom. 2020, 63, 867–887.
  18. Matousek, J. Approximations and optimal geometric divide-and-conquer. J. Comput. Syst. Sci. 1995, 50, 203–208.
  19. Braverman, V.; Feldman, D.; Lang, H. New frameworks for offline and streaming coreset constructions. arXiv 2016, arXiv:1612.00889.
  20. Bentley, J.L.; Saxe, J.B. Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithms 1980, 1, 301–358.
  21. Har-Peled, S.; Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13 June 2004; pp. 291–300.
  22. Maalouf, A.; Jubran, I.; Feldman, D. Fast and accurate least-mean-squares solvers. arXiv 2019, arXiv:1906.04705.
  23. Drineas, P.; Magdon-Ismail, M.; Mahoney, M.W.; Woodruff, D.P. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 2012, 13, 3475–3506.
  24. Cohen, M.B.; Peng, R. Lp row sampling by Lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 4 June 2015; pp. 183–192.
  25. Ritter, K. Average-Case Analysis of Numerical Problems; Springer: Berlin/Heidelberg, Germany, 2007.
  26. Juditsky, A.; Nemirovski, A.S. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv 2008, arXiv:0809.0813.
  27. Tropp, J.A. An introduction to matrix concentration inequalities. arXiv 2015, arXiv:1501.01571.
  28. Charikar, M.; Chen, K.; Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming; Springer: Berlin/Heidelberg, Germany, 2002; pp. 693–703.
  29. Feldman, D.; Ozer, S.; Rus, D. Coresets for vector summarization with applications to network graphs. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 17 July 2017; Volume 70, pp. 1117–1125.
  30. Węglarczyk, S. Kernel density estimation and its application. In ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2018; Volume 23.
  31. Zheng, Y.; Jestes, J.; Phillips, J.M.; Li, F. Quality and efficiency for kernel density estimates in large data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22 June 2013; pp. 433–444.
  32. Bachem, O.; Lucic, M.; Krause, A. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19 July 2018; pp. 1119–1127.
  33. Barger, A.; Feldman, D. k-Means for Streaming and Distributed Big Sparse Data. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 30 June 2016; pp. 342–350.
  34. Feldman, D.; Schmidt, M.; Sohler, C. Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. arXiv 2018, arXiv:1807.04518.
  35. Feldman, D.; Volkov, M.; Rus, D. Dimensionality reduction of massive sparse datasets using coresets. Adv. Neural Inf. Process. Syst. 2016, 29, 2766–2774.
  36. Cohen, M.B.; Elder, S.; Musco, C.; Musco, C.; Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, Portland, OR, USA, 14 June 2015; pp. 163–172.
  37. Varadarajan, K.; Xiao, X. On the sensitivity of shape fitting problems. arXiv 2012, arXiv:1209.4893.
  38. Feldman, D.; Tassa, T. More constraints, smaller coresets: Constrained matrix approximation of sparse big data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10 August 2015; pp. 249–258.
  39. Frieze, A.; Kannan, R.; Vempala, S. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM (JACM) 2004, 51, 1025–1041.
  40. Yang, J.; Chow, Y.L.; Ré, C.; Mahoney, M.W. Weighted SGD for ℓp regression with randomized preconditioning. J. Mach. Learn. Res. 2017, 18, 7811–7853.
  41. Cohen, M.B.; Lee, Y.T.; Musco, C.; Musco, C.; Peng, R.; Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, Rehovot, Israel, 11 January 2015; pp. 181–190.
  42. Papailiopoulos, D.; Kyrillidis, A.; Boutsidis, C. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24 August 2014; pp. 997–1006.
  43. Drineas, P.; Mahoney, M.W.; Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 2008, 30, 844–881.
  44. Cohen, M.B.; Musco, C.; Musco, C. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, Barcelona, Spain, 16 January 2017; pp. 1758–1777.
  45. Maalouf, A.; Statman, A.; Feldman, D. Tight sensitivity bounds for smaller coresets. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 23 August 2020; pp. 2051–2061.
  46. Batson, J.; Spielman, D.A.; Srivastava, N. Twice-Ramanujan sparsifiers. SIAM J. Comput. 2012, 41, 1704–1721.
  47. Cohen, M.B.; Nelson, J.; Woodruff, D.P. Optimal approximate matrix product in terms of stable rank. arXiv 2015, arXiv:1507.02268.
  48. Clarkson, K.L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms (TALG) 2010, 6, 63.
  49. Desai, A.; Ghashami, M.; Phillips, J.M. Improved practical matrix sketching with guarantees. IEEE Trans. Knowl. Data Eng. 2016, 28, 1678–1690.
  50. Madariaga, D.; Madariaga, J.; Bustos-Jiménez, J.; Bustos, B. Improving Signal-Strength Aggregation for Mobile Crowdsourcing Scenarios. Sensors 2021, 21, 1084.
  51. Mahendran, N.; Vincent, D.R.; Srinivasan, K.; Chang, C.Y.; Garg, A.; Gao, L.; Reina, D.G. Sensor-assisted weighted average ensemble model for detecting major depressive disorder. Sensors 2019, 19, 4822.
  52. Wu, L.; Xu, Q.; Heikkilä, J.; Zhao, Z.; Liu, L.; Niu, Y. A star sensor on-orbit calibration method based on singular value decomposition. Sensors 2019, 19, 3301.
  53. Yang, W.; Hong, J.Y.; Kim, J.Y.; Paik, S.h.; Lee, S.H.; Park, J.S.; Lee, G.; Kim, B.M.; Jung, Y.J. A novel singular value decomposition-based denoising method in 4-dimensional computed tomography of the brain in stroke patients with statistical evaluation. Sensors 2020, 20, 3063.
  54. Peri, E.; Xu, L.; Ciccarelli, C.; Vandenbussche, N.L.; Xu, H.; Long, X.; Overeem, S.; van Dijk, J.P.; Mischi, M. Singular value decomposition for removal of cardiac interference from trunk electromyogram. Sensors 2021, 21, 573.
  55. Code. Open Source Code for All the Algorithms Presented in This Paper. 2021. Available online: https://github.com/alaamaalouf/vector-summarization-coreset (accessed on 29 September 2021).
  56. Van Rossum, G.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009.
  57. Oliphant, T.E. A Guide to NumPy; Trelgol Publishing: USA, 2006; Volume 1. Available online: https://ecs.wgtn.ac.nz/foswiki/pub/Support/ManualPagesAndDocumentation/numpybook.pdf (accessed on 29 September 2021).
  58. Tremblay, N.; Barthelmé, S.; Amblard, P.O. Determinantal Point Processes for Coresets. J. Mach. Learn. Res. 2019, 20, 1–70.
  59. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 29 September 2021).
  60. Donovan, B.; Work, D. Using Coarse GPS Data to Quantify City-Scale Transportation System Resilience to Extreme Events. 2015. Available online: http://vis.cs.kent.edu/DL/Data/ (accessed on 29 September 2021).
  61. US Census Data (1990) Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990) (accessed on 10 June 2021).
  62. Kawala, F.; Douzal-Chouakria, A.; Gaussier, E.; Dimert, E. Prédictions D'activité dans les Réseaux Sociaux en Ligne. 2013. Available online: https://archive.ics.uci.edu/ml/datasets/Buzz+in+social+media+ (accessed on 29 September 2021).
  63. Huerta, R.; Mosqueiro, T.; Fonollosa, J.; Rulkov, N.F.; Rodriguez-Lujan, I. Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring. Chemom. Intell. Lab. Syst. 2016, 157, 169–176.
  64. Chen, X. A new generalization of Chebyshev inequality for random vectors. arXiv 2007, arXiv:0707.0805.
  65. Minsker, S. Geometric median and robust estimation in Banach spaces. Bernoulli 2015, 21, 2308–2335.
Figure 1. Illustration of Algorithm 2, its normalization of the input, its main applications (red boxes) and their plugged parameters. Algorithm 2 utilizes and boosts the run-time of the Frank–Wolfe algorithm for those applications; see Section 1.4.
Table 1. Known deterministic subset coresets for LMS solvers. Our result has the fastest running time for sufficiently large n and d.

Error | Size | Time | Citation | Notes
ε | O(k²/ε²) | O(nd²k²/ε²) | [35] | N/A
ε | O(d/ε²) | poly(n, d, ε) | [46] | inefficient for large n
0 | O(d²) | O(nd² + log(n)·poly(d)) | [22] | inefficient for large d
ε | O(k/ε²) | poly(n, d, k, ε) | [47] | inefficient for large n
ε | O(k²/ε²) | O(nd² + log(n)²d²k⁴/ε⁴) | 🟉 | N/A