Abstract
A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik–Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset problem to the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors that approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size for every fixed approximation error ε, improving, e.g., both [ICML’17] and applications for principal component analysis (PCA) [NIPS’16]. Experimental results show significant and consistent improvement also in practice. Open source code is provided.
1. Introduction
In this paper, we assume that the input is a set P of items, called points. Usually, P is simply a finite set of n points in ℝ^d or another metric space. In the context of PAC (probably approximately correct) learning [1] or empirical risk minimization [2], it represents the training set. In supervised learning, every point in P may also include its label or class. We also assume a given function, called the weights function, that assigns a "weight" w(p) to every point p ∈ P. The weights function represents a distribution of importance over the input points, where the natural choice is the uniform distribution, i.e., w(p) = 1/n for every p ∈ P. We are also given a (possibly infinite) set X of queries [3], which represents candidate models or hypotheses, e.g., neural networks [4], SVMs [5], or sets of vectors in ℝ^d with tuning parameters as in linear/ridge/lasso regression [6,7,8].
In machine learning, and PAC-learning in particular, we often seek to compute the query that best describes our input data P for either prediction, classification, or clustering tasks. To this end, we define a loss function f that assigns a fitting cost f(p, x) to every point p ∈ P with respect to a query x ∈ X. For example, it may be a kernel function [9], a convex function [10], or an inner product. The tuple (P, w, X, f) is called a query space and represents the input to our problem. In this paper, we wish to approximate the weighted sum of losses ∑_{p∈P} w(p)·f(p, x).
Methodology.
Our main tool for approximating the sum of losses above is called a coreset, which is a (small) weighted subset of the (potentially huge) input data, from which the desired sum of losses can be recovered very quickly, with guarantees on the induced error; see Section 1.2. To compute such a coreset, we utilize the famous Frank–Wolfe algorithm. However, to compute the coreset quickly, we first provide a scheme for provably boosting the running time of the Frank–Wolfe algorithm for our special case, without compromising its output accuracy; see Section 3.1. We then utilize the boosted version to compute a deterministic coreset faster than the state of the art.
1.1. Approximation Techniques for the Sum of Losses
Approximating the loss of a single query via uniform sampling. Suppose that we wish to approximate the mean (1/n)·∑_{p∈P} f(p, x) for a specific query x in sub-linear time. Picking a point p uniformly at random from P gives this value in expectation. By Hoeffding's inequality, the mean of a uniform sample approximates it with high probability. More precisely, for a given ε ∈ (0,1), if the size of the sample is O(M²/ε²), where M is the maximum absolute value of f, then with constant probability our approximation error is at most ε.
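To make the uniform-sampling estimate above concrete, the following minimal NumPy sketch (with hypothetical helper names) draws a sample of size proportional to M²/ε² and averages it; it is an illustration of the Hoeffding-based argument, not code from the paper.

```python
import numpy as np

def uniform_mean_estimate(losses, eps, M, rng=np.random.default_rng(0)):
    """Estimate the mean loss (1/n) * sum_p f(p, x) from a uniform sample.

    `losses` is the vector (f(p, x))_{p in P} for a fixed query x; in practice
    only the sampled entries would be evaluated, which is what makes the
    estimate sub-linear in n.
    """
    m = int(np.ceil((M / eps) ** 2))            # sample size ~ M^2 / eps^2
    idx = rng.integers(0, len(losses), size=m)  # uniform sample with replacement
    return losses[idx].mean()

# Toy check: losses bounded by M = 1.
losses = np.random.default_rng(1).uniform(-1.0, 1.0, size=100_000)
est = uniform_mean_estimate(losses, eps=0.05, M=1.0)
print(abs(est - losses.mean()))  # typically well below eps = 0.05
```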
ε-Sample. Generally, we are interested in a data summarization S of P that approximates every query x ∈ X. An ε-sample is a pair (S, u), where S is a subset of P (unlike, e.g., sketches [11]) and u is its weights function, such that the weighted loss of the (hopefully small) weighted subset S approximates the original weighted loss [12], i.e.,
We usually assume that the input is normalized in the sense that w is a distribution, and . By defining the pair of vectors and , we can define the error for a single x by , and then the error vector for the coreset . We can rewrite (1) by
PAC/DAC learning for approximating the sum of losses for multiple queries. Probably approximately correct (PAC) randomized constructions generalize the Hoeffding inequality above from a single query to multiple (usually infinitely many) queries and return an ε-sample for a given query space and δ ∈ (0,1), with probability at least 1 − δ. Here, δ corresponds to the "probably" part, while "approximately correct" corresponds to ε in (2); see [13,14]. Deterministic approximately correct (DAC) versions of PAC-learning suggest deterministic constructions of ε-samples, i.e., the probability of failure of the construction is zero.
As is common in machine learning and computer science in general, the main advantage of deterministic constructions is smaller bounds (in this case, on the size of the resulting ε-sample), and their disadvantage is usually a slower construction time that may be unavoidable. When the query set X is finite, the Caratheodory theorem [15,16] suggests a deterministic algorithm that returns a 0-sample (i.e., ε = 0) of size |X| + 1. Deterministic constructions of ε-samples are known for infinite sets of queries, even when the VC-dimension is unbounded [17,18].
Sup-sampling: reducing the sample size via non-uniform sampling. As explained above, Hoeffding's inequality implies an approximation of by , where and S is a random sample drawn according to w whose size depends on . To reduce the sample size, we may thus define , and . Now, , and by Hoeffding’s inequality, the error of approximating via a non-uniform random sample of size drawn from s is . Define . Since , approximating up to error yields an error of for . Therefore, the size is reduced from to when . Here, we sample points from P according to the distribution s, and re-weight the sampled points by .
Unlike traditional PAC-learning, the sample is now non-uniform and is drawn proportionally to the sup values rather than to w, as implied by Hoeffding's inequality for non-uniform distributions. For sets of queries, we generalize the definition for every point by taking the supremum over all queries x ∈ X, as in [19], which is especially useful for the coresets below.
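The following NumPy sketch illustrates the sup-sampling idea under the assumptions stated here: points are drawn proportionally to importance scores s (e.g., w(p) times a bound on |f(p,·)|) and re-weighted by w/s so that the estimator stays unbiased. The function and variable names are hypothetical, and the scores are supplied by the caller.

```python
import numpy as np

def sup_sample(points, w, s, m, rng=np.random.default_rng(0)):
    """Draw m points with probability proportional to the importance scores s
    and re-weight them so that the weighted sample is an unbiased estimator
    of the w-weighted input."""
    prob = s / s.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=prob)
    new_w = w[idx] / (prob[idx] * m)   # re-weight: unbiased for sum_p w(p) f(p, x)
    return points[idx], new_w

# Unbiasedness check for a linear loss f(p, x) = <p, x>:
rng = np.random.default_rng(1)
P, w = rng.normal(size=(10_000, 5)), np.full(10_000, 1e-4)
s = w * np.linalg.norm(P, axis=1)          # one possible importance score
S, u = sup_sample(P, w, s, m=500, rng=rng)
x = rng.normal(size=5)
print(w @ (P @ x), u @ (S @ x))            # the two weighted sums should be close
```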
1.2. Coresets: A Data Summarization Technique for Approximating the Sum of Losses
A coreset for a given query space, in this and many other papers, is a pair (S, u) that is similar to an ε-sample in the sense that S ⊆ P and u is a weights function. However, the additive error ε is now replaced with a multiplicative error 1 ± ε, i.e., for every x ∈ X we require that the weighted loss of S approximates the weighted loss of P up to a multiplicative factor of 1 ± ε. Dividing by the weighted loss of P, and assuming it is positive, yields
Coresets are especially useful for learning from big data, since an off-line and possibly inefficient coreset construction for "small data" implies constructions that maintain a coreset for streaming, dynamic (including deletions), and distributed data in parallel. This is via a simple and easy-to-implement framework that is sometimes called merge–reduce trees; see [20,21]. The fact that a coreset approximates every query (and not just the optimal one for some criterion) implies that we may solve hard optimization problems with non-trivial and non-convex constraints by running a possibly inefficient algorithm, such as exhaustive search, on the coreset, or by running existing heuristics numerous times on the small coreset instead of once on the original data. Similarly, parameter tuning or cross validation can be applied to a coreset that is computed once for the original data, as explained in [22].
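As a rough illustration of the merge–reduce framework (not the paper's implementation), the sketch below maintains a coreset over a stream by repeatedly merging buffered chunks and re-compressing them with a user-supplied, off-line coreset construction `reduce`; the function names and the buffer policy are assumptions made for this sketch only.

```python
import numpy as np

def merge_reduce_stream(stream_chunks, reduce, leaf_size):
    """Maintain a coreset of a stream using a merge-reduce tree.

    `stream_chunks` yields (points, weights) pairs of size `leaf_size`;
    `reduce(points, weights)` is any off-line coreset construction that
    returns a smaller weighted subset of its input.
    Level i of `buckets` holds at most one coreset covering ~2^i leaves.
    """
    buckets = []  # buckets[i] is None or a (points, weights) coreset
    for points, weights in stream_chunks:
        item = reduce(points, weights)
        i = 0
        while i < len(buckets) and buckets[i] is not None:
            # merge two coresets of the same level, then reduce again
            merged_p = np.vstack([buckets[i][0], item[0]])
            merged_w = np.concatenate([buckets[i][1], item[1]])
            item = reduce(merged_p, merged_w)
            buckets[i] = None
            i += 1
        if i == len(buckets):
            buckets.append(None)
        buckets[i] = item
    # final coreset: union of all remaining buckets (optionally reduced once more)
    parts = [b for b in buckets if b is not None]
    return np.vstack([p for p, _ in parts]), np.concatenate([w for _, w in parts])
```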
An ε-coreset for a query space is simply an ε-sample for a corresponding query space, obtained after normalizing each loss by the total weighted loss, i.e., after defining f(p, x)/∑_{q∈P} w(q)·f(q, x), as explained, e.g., in [19]. By defining the error for a single x accordingly, we obtain an error vector for the coreset. We can then rewrite (3) as in (2):
In the case of coresets, this supremum of a point over all queries is called its sensitivity [14], leverage score (in ℓ2 approximations) [23], Lewis weight (in ℓp approximations), or simply importance [24].
1.3. Problem Statement: Average Case Analysis for Data Summarization
Average case analysis (e.g., [25]) was suggested about a decade ago as an alternative to the (sometimes infamous) worst-case analysis of algorithms in theoretical computer science. The idea is to replace the analysis for the worst-case input with an analysis for the average input (in some sense). Inspired by this idea, a natural variant of (2) and its above implications is an ε-sample that approximates the average query well. We suggest defining such an ε-sample as
which generalizes (2) from the ℓ∞ norm to any norm, such as the ℓp norm. For example, for the ℓ2, MSE, or Frobenius norm, we obtain
A generalization of the Hoeffding inequality from 1963 with tight bounds was suggested relatively recently for the ℓp norm for any p ≥ 1 and for many other norms [26,27]. Here, we assume a single query (|X| = 1), a distribution weights function, and a bound on f that determines the size of the sample, as in Hoeffding's inequality.
A less obvious question, which is the subject of this paper, is how to compute deterministic ε-samples that satisfy (4) for norms other than the infinity norm. While the Caratheodory theorem suggests deterministic constructions of 0-samples (for any error norm) as explained above, our goal is to obtain a coreset whose size is smaller than, or independent of, |X|.
The next question is how to generalize the idea of sup-sampling, i.e., to handle an unbounded function f, for norms other than ℓ∞. Our main motivation for doing so is to obtain new and smaller coresets by combining the notion of ε-sample with sup-sampling or sensitivity, as explained above for the ℓ∞ case. That is, we wish to compute a coreset for a given query space that bounds the non-ℓ∞ norm of the error
To summarize, our questions are: how can we smooth the error function and approximate the "average" query via (i) deterministic ε-samples (for DAC-learning)? and (ii) coresets (via sensitivities/sup-sampling for non-infinity norms)?
1.4. Our Contribution
We answer these questions affirmatively by suggesting ε-samples and coresets for the average query. We focus on the case of the ℓ2 error norm, i.e., the Frobenius norm, and a finite query set X, and hope that this will inspire research on and applications of other norms and general query sets. For suggestions in this direction and future work, see Section 5.2. The main results of this paper are the following constructions of an ε-sample for any given finite query space, as defined in (5):
- (i)
- Deterministic construction that returns a coreset of size in time ; see Theorem 2 and Corollary 4.
- (ii)
- Randomized construction that returns such a coreset (of size ) with probability at least in sub-linear time ; see Lemma 5.
In addition, we provide a scheme for provably boosting the running time of the Frank–Wolfe algorithm; see Section 3.1. This result is of independent interest for faster and sparser convex optimization. To our knowledge, this is also the first application of sensitivity outside the coreset regime.
1.5. Overview and Organization
The rest of the paper is organized as follows. First, in Section 2, we list the applications of our proposed methods, such as a faster coreset construction algorithm for least mean squares (LMS) solvers. We also compare our results to the state of the art to justify our practical contribution.
In Section 3, we first give our notation and the relevant mathematical definitions, and then explain the relation between the problem of computing an ε-sample (average-case coreset) and the problem of computing a vector summarization coreset, where the goal (of the vector summarization coreset problem) is to compute a weighted subset of the n input vectors that approximates their sum. Here, we suggest a coreset for this problem; see Theorem 2 and Algorithm 2. Then, in Section 3.1, we show how to improve the running time of this construction and compute a coreset of the same size; see Corollary 4 and Algorithm 3. In addition, we suggest a non-deterministic coreset of the same size, computed in time that is independent of the number of points n; see Lemma 5 and Algorithm 4.
In Section 4, we explain how our vector summarization coreset results can be used to improve all the previously mentioned applications (from Section 2). In Section 5, we report various experiments on real-world datasets, where we apply the different coreset construction algorithms presented in this paper to a variety of applications, in order to boost their running time or reduce their memory storage. We also compare our results to many competing methods. Finally, we conclude the paper and discuss future work in Section 5.2. Due to space limitations and for ease of reading, the proofs of the claims are placed in Appendices A–I.
2. On the Applications of Our Method and the Current State of the Art
In what follows, we present some of the applications of our theoretical contributions and discuss the current state-of-the-art coreset/sketch methods in terms of running time for each application. Figure 1 summarizes the main applications of our result.
- (i)
- Vector summarization: the goal is to maintain the sum of a (possibly infinite) stream of vectors, up to an additive error of ε multiplied by their variance. This is a generalization of frequent items/directions [28]. As explained in [29], the main real-world application is extracting and compactly representing groups and activity summaries of users from underlying data exchanges. For example, GPS traces in mobile networks can be exploited to identify meetings, and exchanges of information in social networks shed light on the formation of groups of friends. Our algorithm tackles these applications by providing a provable solution to the heavy hitters problem in proximity matrices. We propose a deterministic algorithm that reduces each subset of n vectors into a small weighted subset, improving upon the running time of [29] (the current state of the art in terms of running time) for a sufficiently large n; see Corollary 4 and Figures 2 and 3. We also provide a non-deterministic coreset construction in Lemma 5. The merge-and-reduce tree can then be used to support streaming, distributed, or dynamic data.
- (ii)
- Kernel density estimates (KDE): by replacing each input point with its image under the kernel map in the vector summarization, we obtain a fast construction of an ε-coreset for KDE of Euclidean kernels [17]; see more details in Section 4. Kernel density estimation is a technique for estimating a probability density function (continuous distribution) from a finite set of points, which often analyses the studied probability distribution better than a traditional histogram [30,31].
- (iii)
- 1-mean problem: a coreset for 1-mean approximates the sum of squared distances from a set of n points to any given center (point) in ℝ^d. This problem arises in facility location problems (e.g., computing the optimal location for placing an antenna such that all the customers are satisfied). Our deterministic construction computes such a weighted subset efficiently; see Theorem 6. Previous results [19,32,33,34] suggested coresets for this problem. Unlike our result, these works are either non-deterministic, the coreset is not a subset of the input, or the size of the coreset is linear in d.
- (iv)
- Coreset for LMS solvers and dimensionality reduction: for example, a deterministic construction for the singular value decomposition (SVD) that gets a matrix and returns a weighted subset of its rows, such that their weighted distance to any k-dimensional non-affine (or affine, in the case of PCA) subspace approximates the distance of the original points to this subspace. The SVD and PCA are very common algorithms (see [35]) that can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses. Thus, improving them may be helpful for a wide range of real-world applications. In this paper, we propose a deterministic coreset construction that improves upon the running time of the state-of-the-art result of [35]; see Table 1. Many non-deterministic coreset constructions were suggested for these problems; the construction techniques apply non-uniform sampling [36,37,38], Monte-Carlo sampling [39], and leverage-score sampling [23,40,41,42,43,44,45].
Table 1. Known deterministic subset coresets for LMS solvers. Our result has the fastest running time for sufficiently large n and d.
3. Vector Summarization Coreset
Notation 1.
We denote [n] = {1, …, n} for every integer n ≥ 1. For a vector v, the 0-norm ‖v‖₀ is the number of non-zero entries in v. We denote by e_i the ith standard basis vector and by 1 the all-ones vector. A vector is called a distribution vector if all its entries are non-negative and sum up to one. For a matrix A, we denote by A_{i,j} the jth entry of the ith row of A. A weighted set is a pair (P, w), where P is a set of n points and w is a weights vector that assigns every point in P a weight. A matrix A is orthogonal if A^T A = I.
Adaptations. To adapt the notation and the query space of the previous sections to the techniques that we use, we restate (4) as follows. Previously, we denoted the queries by x ∈ X and the input set by P. Now, each input point in the input set P corresponds to a vector whose entries are its losses f(p, x), one entry for each query x. Throughout the rest of the paper, for technical reasons and simplicity, we may alternate between the weights-function notation and a weights-vector notation. In such cases, the weights function and the weight of a point are replaced by a vector of weights and its corresponding entry, respectively, and vice versa. The ε-sample is then represented by a sparse weights vector whose non-zero entries correspond to the chosen subset of P.
Hence, , and .
From ε-samples to ε-coresets. We now define an ε-coreset for vector summarization, which is a re-weighting of the input weighted set by a new weights vector u, such that the squared norm of the difference between the weighted means of the original and of the re-weighted set is small. This relates to Section 1.3, where an ε-sample there (in Section 1.3) is an ε-coreset for the vector summarization here.
Definition 1
(vector summarization ε-coreset). Let (P, w) and (P, u) be two weighted sets of n points in ℝ^d, and let ε > 0. Let μ_w = ∑_i w_i·p_i, μ_u = ∑_i u_i·p_i, and σ² = ∑_i w_i·‖p_i − μ_w‖². Then (P, u) is a vector summarization ε-coreset for (P, w) if ‖μ_w − μ_u‖² ≤ ε·σ².
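A minimal NumPy check of this condition as reconstructed above (the exact normalization used in the paper may differ, so treat the constant as an assumption):

```python
import numpy as np

def is_vector_summarization_coreset(P, w, u, eps):
    """Check the (assumed) coreset condition ||mu_w - mu_u||^2 <= eps * sigma^2."""
    mu_w = w @ P                                   # weighted mean under the original weights
    mu_u = u @ P                                   # weighted mean under the (sparse) coreset weights
    sigma2 = w @ np.sum((P - mu_w) ** 2, axis=1)   # w-weighted variance
    return np.sum((mu_w - mu_u) ** 2) <= eps * sigma2
```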
Analysis flow. In what follows, we first assume that the points of our input set P lie inside the unit ball. For such an input set, we present a construction of a variant of a vector summarization coreset, where the error is additive and does not depend on the variance of the input. This construction is based on the Frank–Wolfe algorithm [48]; see Theorem 1 and Algorithm 1. It reduces the problem to maximizing a concave function over the unit simplex. Such problems can be solved approximately by a simple greedy algorithm known as the Frank–Wolfe algorithm.
Algorithm 1: FRANK–WOLFE (Algorithm 1.1 of [48]).
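Since the pseudocode of Algorithm 1 is not reproduced here, the following is a minimal Frank–Wolfe sketch in the spirit of this setting, not a transcription of Algorithm 1.1 of [48]: it greedily minimizes the convex function g(u) = ‖∑_i u_i·p_i‖² (equivalently, maximizes the concave −g) over the unit simplex with the standard 2/(t+2) step size. After T iterations the iterate has at most T + 1 non-zero entries and the standard O(1/T) Frank–Wolfe convergence guarantee.

```python
import numpy as np

def frank_wolfe_simplex(P, T):
    """Greedy Frank-Wolfe minimization of g(u) = ||P^T u||^2 over the unit simplex.

    P is an (n, d) array of points; after T iterations the returned u is a
    distribution vector with at most T + 1 non-zero entries.
    """
    n, _ = P.shape
    u = np.zeros(n)
    u[0] = 1.0                        # start at a vertex of the simplex
    for t in range(T):
        grad = 2.0 * (P @ (P.T @ u))  # gradient of g w.r.t. u is 2 * P (P^T u)
        i = int(np.argmin(grad))      # best simplex vertex (linear minimization oracle)
        gamma = 2.0 / (t + 2.0)       # standard step size
        u = (1.0 - gamma) * u
        u[i] += gamma
    return u
```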
We then present a proper coreset construction in Algorithm 2 and Theorem 2 for a general input set Q in ℝ^d. This algorithm is based on a reduction to the simpler case of points inside the unit ball; see Figure 1 for an illustration. This reduction is inspired by the sup-sampling (see Section 1): there (in Section 1), the functions were normalized (to obtain bounded values) and reweighted (to obtain a non-biased estimator), and the bounds were then easily obtained using the Hoeffding inequality. Here, we apply different normalizations and reweightings, and instead of the non-deterministic Hoeffding inequality, we suggest a deterministic alternative using the Frank–Wolfe algorithm. Our new suggested normalizations (and reweightings) allow us to generalize the result to many more applications, as in Section 4.
Figure 1.
Illustration of Algorithm 2, its normalization of the input, its main applications (red boxes) and their plugged parameters. Algorithm 2 utilizes and boosts the run-time of the Frank–Wolfe algorithm for those applications; see Section 1.4.
For brevity, all proofs of the technical results can be found in Appendices A–I.
Theorem 1
(Coreset for points in the unit ball). Let be a set of n points in such that for every . Let and be a distribution vector. For every , define . Let be the output of a call to ; see Algorithm 1. Then:
- (i)
- is a distribution vector with ,
- (ii)
- , and
- (iii)
- is computed in time.
We now show how to obtain a vector summarization ε-coreset for any input set Q in ℝ^d.
Theorem 2
(Vector summarization coreset). Let be a weighted set of n points in , , and let u be the output of a call to ; see Algorithm 2. Then, is a vector with non-zero entries that is computed in time, and is a vector summarization ε-coreset for .
Algorithm 2: CORESET.
3.1. Boosting the Coreset’s Construction Running Time
In this section, we present Algorithm 3, which aims to boost the running time of Algorithm 1 from the previous section; see Theorem 3. The main idea behind this new boosted algorithm is as follows: instead of running the Frank–Wolfe algorithm on the (full) input set, it can be more efficient to partition the input into a constant number k of equal-sized chunks, pick a representative for each chunk (its mean), run the Frank–Wolfe algorithm only on the set of representatives (the set of means) to obtain a subset of those representatives, and then continue recursively only with the chunks whose representatives were chosen by the algorithm; a rough sketch of this idea is given after the next paragraph. Although the Frank–Wolfe algorithm is now applied multiple times (rather than once), each of those runs is much more efficient, since only the small set of representatives is considered.
This immediately implies a faster construction time of vector summarization ε-coresets for general input sets; see Corollary 4 and Figure 1 for an illustration.
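The following is a rough sketch of the recursive boosting idea described above, not of Algorithm 3 itself: `fw_support(points, T)` stands for any routine that returns the indices it assigns non-zero weight (for example, `lambda X, T: np.flatnonzero(frank_wolfe_simplex(X, T) > 0)` with the sketch above). Chunk sizes, the recursion threshold, and the stopping rule are simplified assumptions, and the selected points would still need to be re-weighted to form a valid coreset, which this sketch omits.

```python
import numpy as np

def boosted_selection(P, fw_support, k=64, T=16, min_size=256):
    """Recursively shrink P by running Frank-Wolfe only on chunk means.

    Partition the surviving points into k equal chunks, run `fw_support`
    (which must select fewer than k indices, e.g. T + 1 < k for the sketch
    above) on the k chunk means, keep only the chunks whose mean was
    selected, and repeat.  Once few enough points remain, run `fw_support`
    on them directly and return their indices w.r.t. the original P.
    """
    idx = np.arange(len(P))
    while len(idx) > max(min_size, k):
        chunks = np.array_split(idx, k)
        means = np.stack([P[c].mean(axis=0) for c in chunks])
        kept = fw_support(means, T)                  # indices of selected chunk means
        if len(kept) >= len(chunks):                 # no shrinkage -> stop recursing
            break
        idx = np.concatenate([chunks[j] for j in kept])
    return idx[fw_support(P[idx], T)]                # final selection on the small set
```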
Theorem 3
(Faster coreset for points in the unit ball). Let P be a set of n points in such that for every . Let be a weights function such that , , and let be the output of a call to ; see Algorithm 3. Then
- (i)
- and ,
- (ii)
- , and
- (iii)
- is computed in time.
Corollary 4
(Faster vector summarization coreset). Let be a weighted set of n points in , and let . Then in time, we can compute a vector , such that u has non-zero entries and is a vector summarization -coreset for .
Algorithm 3: FAST-FW-CORESET.
In what follows, we show how to compute a vector summarization coreset, with high probability, in time that is sublinear in the input size. This is based on the geometric median trick, which suggests the following procedure: (i) sample k sets of the same (small) size from the original input set Q, (ii) for each such sampled set, compute its mean, and, finally, (iii) compute and return the geometric median of those k means. This geometric median is guaranteed to approximate the mean of the original input set Q.
We show that there is no need to compute this geometric median, which is a computationally difficult task. We prove that, with high probability, there exists a sampled subset whose mean is very close to this geometric median. Thus, that mean is a good approximation to the desired mean of the original input set. Furthermore, we show that it is simply the point among the k means that minimizes its sum of (non-squared) distances to the other means. An exhaustive search over the k means can thus recover it. The corresponding sampled set is the resulting vector summarization coreset; see Lemma 5, Algorithm 4, and the sketch below.
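A minimal sketch of this probabilistic construction, under the assumptions just described (the sample sizes, the number of repetitions, and the uniform weighting are placeholders, not the constants of Algorithm 4):

```python
import numpy as np

def prob_weak_coreset(Q, k=11, m=200, rng=np.random.default_rng(0)):
    """Sample k subsets of size m, and return the subset whose mean minimizes
    the sum of (non-squared) distances to all k means, a proxy for the
    geometric median of the means."""
    samples = [Q[rng.integers(0, len(Q), size=m)] for _ in range(k)]
    means = np.stack([S.mean(axis=0) for S in samples])
    dist_sums = np.array([np.linalg.norm(means - mu, axis=1).sum() for mu in means])
    best = int(np.argmin(dist_sums))
    S = samples[best]
    u = np.full(m, 1.0 / m)          # uniform weights over the chosen subset
    return S, u
```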
Lemma 5
(Fast probabilistic vector summarization coreset). Let Q be a set of n points in , , and . Let , , and let be the output of a call to ; see Algorithm 4. Then:
- (i)
- and ,
- (ii)
- with probability at least we have , and
- (iii)
- S is computed in time.
Algorithm 4: PROB-WEAK-CORESET.
4. Applications
Coreset for 1-mean. A 1-mean ε-coreset for a weighted set is a weighted subset such that, for every center x ∈ ℝ^d, the sum of squared distances from x to the original set and to the subset is approximately the same. To guarantee this property, we prove that it suffices for the subset to satisfy the following: its mean, its variance, and its sum of weights should approximate the mean, the variance, and the sum of weights of the original set, respectively, up to an additive error that depends linearly on ε. Note that when plugging a suitably augmented input (rather than the original points) into Algorithm 2, the output u is guaranteed to satisfy these three properties by construction.
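The reason these three quantities suffice can be seen from the following standard expansion (a sketch in our notation, with u the coreset weights and x an arbitrary center):

```latex
\sum_{i} u_i \,\lVert p_i - x\rVert^2
  = \underbrace{\sum_{i} u_i \lVert p_i\rVert^2}_{\text{variance-type term}}
  \;-\; 2\, x^\top \underbrace{\Big(\sum_{i} u_i\, p_i\Big)}_{\text{weighted mean term}}
  \;+\; \lVert x\rVert^2 \underbrace{\sum_{i} u_i}_{\text{sum of weights}} .
```

If the three braced quantities match those of the original weighted set up to small additive errors, then the 1-mean cost is approximated for every center x, which is what Lemma A6 in the appendix formalizes for the normalized set.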
The following theorem computes a 1-mean -coreset.
Theorem 6.
Let be a weighted set of n points in , . Then in
time we can compute a vector , where , such that:
Coreset for KDE. Given two sets of points Q and S (the candidate coreset), and a kernel that is defined by a kernel (feature) map, the maximal difference between the kernel costs of Q and S is upper bounded in terms of the distance between the means of the two sets under the kernel map [49]. Given Q, we can compute a vector summarization ε-coreset of the mapped points, which guarantees that this distance is small. By the above argument, this is also an ε-KDE coreset.
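A toy illustration of this bound, assuming the kernel's feature map is explicit (here the linear kernel K(q, x) = ⟨q, x⟩, whose feature map is the identity, chosen only for simplicity); the subset and weights are stand-ins for a real coreset:

```python
import numpy as np

def kde_gap_bound(Q, S, u, x):
    """Compare the KDE difference at a query x with its Cauchy-Schwarz bound,
    for the linear kernel K(q, x) = <q, x> (so the feature map is the identity)."""
    kde_full = (Q @ x).mean()                 # (1/|Q|) sum_q K(q, x)
    kde_core = u @ (S @ x)                    # weighted KDE over the coreset
    gap = abs(kde_full - kde_core)
    bound = np.linalg.norm(Q.mean(axis=0) - u @ S) * np.linalg.norm(x)
    return gap, bound                         # gap <= bound always holds

rng = np.random.default_rng(0)
Q = rng.normal(size=(5000, 3))
idx = rng.integers(0, len(Q), size=50)        # stand-in for a real coreset
S, u = Q[idx], np.full(50, 1.0 / 50)
print(kde_gap_bound(Q, S, u, rng.normal(size=3)))
```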
Coreset for dimensionality reduction and LMS solvers. An ε-coreset for the k-SVD (k-PCA) problem of Q is a small weighted subset of Q that approximates the sum of squared distances from the points of Q to every non-affine (affine) k-dimensional subspace of ℝ^d, up to a multiplicative factor of 1 ± ε; see Corollary 7. A coreset for LMS solvers is a special case of this problem.
In [35], it is shown how to leverage an ε-coreset for the vector summarization problem in order to compute an ε-coreset for k-SVD. In [45], it is shown how to compute a coreset for k-PCA via a coreset for k-SVD, by simply adding another entry with some value to each vector of the input. Algorithm 5 combines both of the above reductions with a computation of a vector summarization ε-coreset, to obtain the desired coreset for dimensionality reduction (both k-SVD and k-PCA). To compute the vector summarization coreset, we utilize our new algorithms from the previous sections, which are faster than the state-of-the-art algorithms.
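A rough sketch of how such a pipeline could look, under the simplifying assumption (stated here for illustration only, and cruder than the actual reduction of [35]) that it suffices to approximate the sum of outer products of the rows; `coreset_weights` stands for any vector summarization coreset routine from the previous sections:

```python
import numpy as np

def dim_coreset_sketch(A, coreset_weights):
    """Sketch: reduce a k-SVD coreset computation to vector summarization.

    Each row a_i of A is mapped to its flattened outer product a_i a_i^T
    (a vector in R^{d*d}).  A vector summarization coreset over these vectors
    gives sparse weights u whose weighted sum approximates A^T A, and the
    diagonal matrix W = diag(sqrt(u)) satisfies (W A)^T (W A) = sum_i u_i a_i a_i^T.
    """
    n, d = A.shape
    V = np.einsum('ni,nj->nij', A, A).reshape(n, d * d)  # flattened outer products
    u = coreset_weights(V)               # sparse, non-negative weights over the rows
    return np.diag(np.sqrt(u))           # rows with non-zero weight form the coreset

# Trivial stand-in for `coreset_weights` (keeps every row), useful only for testing.
identity_weights = lambda V: np.full(len(V), 1.0)
```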
Algorithm 5: DIM-CORESET.
Corollary 7
(Coreset for dimensionality reduction). Let Q be a set of n points in , and let be a corresponding matrix containing the points of Q in its rows. Let be an error parameter, be an integer, and W be the output of a call to . Then:
- (i)
- W is a diagonal matrix with non-zero entries,
- (ii)
- W is computed in time, and
- (iii)
- there is a constant c, such that for every and an orthogonal we have the following. Here, is the subtraction of ℓ from every row of A.
Where do our methods fit in? Theoretically speaking, the 1-mean problem (also known as the arithmetic mean problem) is a widely used tool for reporting central tendencies in statistics, and it is also used in machine learning. On the practical side, it can be used either to estimate the expected signal strength in an area [50] or as an imputation technique for filling in missing values, e.g., missing values of heart-monitor sensor data [51]. Note that a variant of this problem, the moving average, is widely used in the context of deep learning. Algorithms 3 and 4 can boost such methods on large-scale datasets. In addition, our algorithms extend to SVD, PCA, and LMS, which are known for their use and efficiency in discovering low-dimensional representations of high-dimensional data. From a practical point of view, SVD has shown promising results in the on-orbit calibration of star sensors [52], in denoising 4-dimensional computed tomography of the brain in stroke patients [53], and in the removal of cardiac interference from trunk electromyograms [54], among many other applications.
We propose a summarization technique (see Algorithm 5) that aims to approximate the SVD factorization of large-scale datasets for which applying the SVD factorization directly is not possible due to insufficient memory or prohibitive computation time.
5. Experimental Results
We now apply the different coreset construction algorithms presented in this paper to a variety of applications, in order to boost their running time or reduce their memory storage. Complete open-source code is provided [55].
Software/Hardware. The algorithms were implemented in Python [56] using NumPy [57]. Tests were conducted on a PC with an Intel i9-7960X CPU @ 2.80 GHz × 32 and 128 GB RAM.
We compare the following algorithms. (To easily distinguish our algorithms from the competing ones in the graphs, note that the labels of our algorithms start with the prefix "Our-", while those of the competing methods do not.)
- (i)
- Uniform: Uniform random sample of the input Q, which requires sublinear time to compute.
- (ii)
- Sensitivity-sum: Random sampling based on the “sensitivity” for the vector summarization problem [58]. Sensitivity sampling is a widely known technique [19], which guarantees that a subsample of sufficient size approximates the input well. The sensitivity of a point is . This algorithm takes time.
- (iii)
- ICML17: The vector summarization coreset construction algorithm from [29] (see Algorithm 2 there), which runs in time.
- (iv)
- Our-rand-sum: Our coreset construction from Lemma 5, which requires time.
- (v)
- Our-slow-sum: Our coreset construction from Theorem 2, which requires time.
- (vi)
- Our-fast-sum: Our coreset construction from Corollary 4, which requires time.
- (vii)
- Sensitivity-svd: Similar to Sensitivity-sum above; however, now the sensitivity is computed by projecting the rows of the input matrix A onto the optimal k-subspace (or an approximation of it) that minimizes the sum of squared distances to the rows of A, and then computing the sensitivity of each row i of the projected matrix as the squared norm of the ith row of the matrix U from its SVD; see [37]. This takes time.
- (viii)
- NIPS16: The coreset construction algorithm from [35] (see Algorithm 2 there) which requires time.
- (ix)
- Our-slow-svd: Corollary 7 offers a coreset construction for SVD using Algorithm 5, which utilizes Algorithm 2. Algorithm 2, in turn, utilizes either Algorithm 1 (see Theorem 2) or Algorithm 3 (see Corollary 4). Our-slow-svd applies the former option, which requires time.
- (x)
- Our-fast-svd: As above, Corollary 7 offers a coreset construction for SVD using Algorithm 5, which utilizes Algorithm 2; Our-fast-svd uses the latter option (Algorithm 3; see Corollary 4), which requires time.
Datasets. We used the following datasets from the UCI ML library [59]:
- (i)
- New York City Taxi Data [60]. The data covers the taxi operations in New York City. We used the data describing trip fares in the year 2013, and we used the numerical (real-valued) features.
- (ii)
- US Census Data (1990) [61]. The dataset contains entries. We used all the real-valued attributes of the dataset.
- (iii)
- Buzz in social media Data Set [62]. It contains examples of buzz events from two different social networks: Twitter and Tom's Hardware. We used all the real-valued attributes.
- (iv)
- Gas Sensors for Home Activity Monitoring Data Set [63]. This dataset has recordings of a gas sensor array composed of 8 MOX gas sensors, and a temperature and humidity sensor. We used the last real-valued attributes of the dataset.
Discussion of the chosen datasets. The Buzz in social media dataset is widely used in the context of principal component regression (PCR, for short), which estimates the unknown regression coefficients of a standard linear regression model. The goal of PCR in the context of this dataset is to predict the popularity of a certain topic on Twitter over a period of time. The solution of the PCR problem can be approximated using the SVD. Our techniques enable us to benefit from the coreset advantages, e.g., to boost the approximated PCR solution (via PCA) while using little memory and supporting the streaming model, by maintaining a coreset for the data (tweets) seen so far: each time a new point (tweet) is received, it is added to the coreset currently stored in memory; once the stored coreset is large enough, our compression (coreset construction algorithm) is applied. This procedure is repeated until the stream of points is exhausted.
The New York City taxi data contains information about the locations of passengers as well as the locations of their destinations. Thus, the goal is to find a location that is close to the most popular destinations. This problem can be formulated as a facility location problem, which can be reduced to an instance of the 1-mean problem. Hence, since our methods admit a faster solution as well as a provable approximation for this facility location problem, we can leverage our coreset to speed up the computations on this dataset.
Finally, regarding the remaining datasets, PCA has been widely used either for low-dimensional embedding or, e.g., to compute the arithmetic mean. By using our methods, we can boost the PCA computation while providing a provably approximated solution.
The experiments.
- (i)
- Vector summarization: The goal is to approximate the mean of a huge input set using only a small weighted subset of the input. The empirical approximation error is defined via the difference between the mean of the full data and the mean of the weighted subset computed by each compared algorithm; see Figure 2 and Figure 3. In Figure 2, we report the empirical approximation error as a function of the subset (coreset) size for each of the Datasets (i)–(ii), while in Figure 3 we report the overall computational time for computing the subset (coreset) and for solving the 1-mean problem on the coreset, as a function of the subset size.
- (ii)
- k-SVD: The goal is to compute the optimal k-dimensional non-affine subspace of a given input set. We can compute the optimal subspace either using the original (full) input set or using a weighted subset (coreset) of the input. We denote the optimal subspaces computed using the full data and using the subset at hand accordingly. The empirical approximation error is defined as the ratio between the sum of squared distances from the original input set to the subspace recovered from the coreset and the sum of squared distances from the original input set to the subspace recovered from the full data; see Figure 4, Figure 5 and Figure 6. Intuitively, this ratio represents the relative SSD error of recovering an optimal k-dimensional non-affine subspace from the compression rather than from the full data; a sketch of how both empirical errors can be computed is given after this list. In Figure 4, we report the empirical error as a function of the coreset size. In Figure 5, we report the overall computational time it took to compute the coreset and to recover the optimal subspace using the coreset, as a function of the coreset size. In both figures we have three subfigures, each one for a different chosen value of k (the dimension of the subspace). Finally, in Figure 6, the x axis is the size of the dataset (which we compress to a subset of size 150), while the y-axis is the approximation error on the left hand side graph, and on the right hand side it is the overall computational time it took to compute the coreset and to recover the optimal subspace using the coreset.
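The following NumPy sketch shows one plausible way to compute both empirical errors (an illustration only; the exact normalization used in the figures is not specified here):

```python
import numpy as np

def vector_sum_error(P, S, u):
    """Distance between the full mean and the weighted coreset mean."""
    return np.linalg.norm(P.mean(axis=0) - u @ S)

def ksvd_error_ratio(A, S, u, k):
    """Ratio of the sums of squared distances of A to the k-subspace recovered
    from the weighted coreset vs. the one recovered from the full data."""
    def top_k_subspace(M):
        # right singular vectors spanning the optimal non-affine k-subspace
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        return Vt[:k].T                                  # d x k orthonormal basis

    def ssd(M, V):
        proj = M @ V @ V.T                               # projection onto the subspace
        return np.sum((M - proj) ** 2)

    V_full = top_k_subspace(A)
    V_core = top_k_subspace(np.sqrt(u)[:, None] * S)     # weighted coreset rows
    return ssd(A, V_core) / ssd(A, V_full)
```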
Figure 2.
Experimental results for vector summarization. The x axis is the size of the subset (coreset), while the y axis is the approximation error . The difference between the two graphs is the chosen dataset.
Figure 3.
Experimental results for vector summarization. The x axis is the size of the subset (coreset), while the y axis is the overall time it took to compute the coreset and to solve the problem on it. The difference between the two graphs is the chosen dataset.
Figure 4.
Experimental results for k-SVD, we used Dataset (iii). The x axis is the size of the subset (coreset), while the y axis is the approximation error . The difference between the 3 graphs is the chosen low dimension k.
Figure 5.
Experimental results for k-SVD, we used Dataset (iii). The x axis is the size of the subset (coreset), while the y axis is the overall time it took to compute the coreset and to solve the problem on it. The difference between the 3 graphs is the chosen low dimension k.
Figure 6.
Experimental results for k-SVD, we used Dataset (iii). The x axis is the size of the dataset, which we compress to a subsample of size 150, while the y-axis is the approximation error in the left hand side graph, and in the right hand side it is the overall time it took to compute the coreset and to solve the problem on it.
5.1. Discussion
Vector summarization experiment: As predicted by the theory and as demonstrated in Figure 2 and Figure 3, our fast and deterministic algorithm Our-fast-sum (the red line in the figures) achieves the same or smaller approximation errors in most cases compared to the deterministic alternatives Our-slow-sum (orange line) and ICML17 (green line), while being significantly faster. Hence, when we seek a fast deterministic solution for computing a coreset for the vector summarization problem, our algorithm Our-fast-sum is the favorable choice.
Compared to the randomized alternatives, Our-fast-sum is obviously slower, but achieves an error more than 3 orders of magnitude smaller. Furthermore, our fast and randomized algorithm Our-rand-sum (brown line) consistently achieves better results than the other randomized alternatives; it yields a significantly smaller approximation error while maintaining the same computational time. This is demonstrated on both datasets. Hence, our compression can be used to speed up tasks such as computing the PCA or PCR, as described above.
k-SVD experiment: Here, in Figure 4, Figure 5 and Figure 6, we witness a similar phenomenon, where our fast and deterministic algorithm Our-fast-svd achieves the same or smaller approximation errors compared to the deterministic alternatives Our-slow-svd and NIPS16, respectively, while being significantly faster. Compared to the randomized alternatives, Our-fast-svd is slower, as predicted, but achieves an error up to 2 orders of magnitude smaller. This is demonstrated for increasing sample sizes (as in Figure 4 and Figure 5), for increasing dataset sizes (as in Figure 6), and for various values of k (see Figure 4, Figure 5 and Figure 6).
5.2. Conclusions and Future Work
This paper generalizes the definitions of ε-sample and coreset from the worst-case error over every query to the average error. We then showed a reduction from the problem of computing such coresets to the vector summarization coreset construction problem, and suggested both deterministic and randomized algorithms for computing such coresets; the randomized version runs in time sublinear in the input size. Finally, we showed how to leverage an ε-coreset for the vector summarization problem in order to compute an ε-coreset for the 1-mean problem, and similarly for the k-SVD and k-PCA problems, via computing a vector summarization coreset after some preprocessing of the data.
Open problems include generalizing these results to other types of norms, or to other functions, such as M-estimators that are robust to outliers. We hope that the source code and the promising experimental results will also encourage practitioners to use these new types of approximations. Normalization via this new sensitivity type reduced the bounds on the number of iterations of the Frank–Wolfe algorithm by orders of magnitude. We believe that it can be used more generally for provably faster convex optimization, independently of coresets or ε-samples. We leave this for future research.
Author Contributions
Conceptualization, A.M., I.J., M.T. and D.F.; methodology, A.M. and I.J.; software, A.M. and M.T.; validation, A.M. and M.T.; formal analysis, A.M., I.J., M.T. and D.F.; investigation, A.M., I.J., M.T. and D.F.; resources, D.F.; data curation, A.M., I.J. and M.T.; writing—original draft preparation, A.M., I.J. and M.T.; writing—review and editing, A.M., I.J. and M.T.; visualization, I.J.; supervision, D.F.; project administration, A.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Conflicts of Interest
The authors declare that one of the authors, Prof. Dan Feldman, is a guest editor of the Special Issue "Sensor Data Summarization: Theory, Applications, and Systems".
Appendix A. Problem Reduction for Vector Summarization ε-Coresets
First, we define a normalized weighted set, which is simply a weighted set that satisfies three properties: its weights sum to one, its weighted mean is the origin, and it has unit variance.
Definition A1
(Normalized weighted set). A normalized weighted set is a weighted set where and satisfy the following properties:
- (a)
- Weights sum to one: ,
- (b)
- The weighted sum is the origin: , and
- (c)
- Unit variance: .
Appendix A.1. Reduction to Normalized Weighted Set
In this section, we argue that in order to compute a vector summarization ε-coreset for an input weighted set, it suffices to compute a vector summarization ε-coreset for its corresponding normalized (and much simpler) weighted set, as in Definition A1; see Corollary A1. First, in Observation A1, we show how to compute the corresponding normalized weighted set of any input weighted set.
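A minimal sketch of this normalization, under the assumption (consistent with Definition A1, though the exact formulas of Observation A1 are elided in the text) that each point is shifted by the weighted mean and scaled by the weighted standard deviation:

```python
import numpy as np

def normalize_weighted_set(P, w):
    """Return a weighted set with weights summing to one, weighted mean at the
    origin, and unit weighted variance (Definition A1), together with the
    shift and scale needed to undo the transformation."""
    w = w / w.sum()                                      # (a) weights sum to one
    mu = w @ P                                           # weighted mean
    sigma = np.sqrt(w @ np.sum((P - mu) ** 2, axis=1))   # weighted standard deviation
    Q = (P - mu) / sigma                                 # (b) zero mean, (c) unit variance
    return Q, w, mu, sigma
```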
Observation A1.
Let be a set of points in , , be a distribution vector such that , and . Let be a set of n points in , such that for every we have . Then, is the corresponding normalized weighted set of , i.e., (i)–(iii) hold as follows:
- (i)
- ,
- (ii)
- , and
- (iii)
- .
Proof.
where the first equality holds by the definition of , the third holds by the definition of , and the last is since w is a distribution vector.
where the first and third equality hold by the definition of and , respectively. □
Corollary A1.
Let be a weighted set, and let be its corresponding normalized weighted set as computed in Observation A1. Let be a vector summarization ε-coreset for and let . Then is a vector summarization ε-coreset for .
Proof.
Put and let . Now, for every , we have that
where the first equality is by the definition of y and .
Let be a vector summarization -coreset for . We prove that is a vector summarization -coreset for . We observe the following
where the first equality holds since and , the second holds by (A1), and the last inequality holds since is a vector summarization -coreset for . □
Appendix A.2. Vector Summarization Problem Reduction
Given a normalized weighted set as in Definition A1, in the following lemma we prove that a weighted set is a vector summarization -coreset for the normalized weighted set if and only if the squared norm of the weighted mean of is smaller than .
Lemma A2.
Let be a normalized weighted set of n points in , , and be a weight vector. Let , , and . Then, is a vector summarization ε-coreset for , i.e., if and only if .
Proof.
The proof holds since is a normalized weighted set, i.e., , and . □
Appendix B. Frank–Wolfe Theorem
Here, for completeness we state the Frank–Wolfe Theorem [48]. This theorem will be used in the proof of Theorem 1 that shows how to compute a variant of vector summarization coreset for points inside the unit ball.
To do so, we consider the measure defined in [48]; see equality in Section 2.2. For a simplex S and concave function f, the quantity is defined as
where the supremum is over every x and z in S, and over every so that is also in S. The set of such includes , but can also be negative.
Theorem A3
(Theorem 2.2 from [48]).For simplex S and concave function f, Algorithm 1 (Algorithm 1.1 from [48]) finds a point on a k-dimensional face of S such that
for k > 0, where is the optimal value of f.
Appendix C. Proof of Theorem 1
Proof of Theorem 1.
Let be defined for f and S as in (A3), and let be the maximum value of f in S. Based on Theorem A3 we have:
- is a point on a -dimensional face of S, i.e., , and . Hence, claim (i) of this theorem is satisfied.
- for every
Since for every , we have that,
Define A to be the matrix such that the i-th column of A is the i-th point in P, and let . We get that
where the second equality holds by the definition of , and the fourth equality holds since for every .
In Section 2.2 of [48], it was shown that for any quadratic function that is defined as
where M is a negative semidefinite matrix, is a vector, and , we have that , where is a matrix that satisfies ; see the corresponding equality in [48].
Observe that x and y are distribution vectors, thus
Since for each , we have that
By substituting , , , and in (2) we get that,
Multiplying both sides of the inequality by 8 and rearranging proves claim (ii) of the theorem, as
Running time: We have iterations in Algorithm 1, where each iteration takes time, since the gradient of f based on the vector is . This term is the multiplication of a matrix in by a vector in , which takes time. Hence, the running time of the algorithm is . □
Appendix D. Proof of Theorem 2
Proof of Theorem 2.
Let be the normalized weighted set that is computed at Lines 3–5 of Algorithm 2, where , and let . We show that is a vector summarization -coreset for ; then, by Corollary A1, we get that is a vector summarization -coreset for . For every , let , , and be defined as in Algorithm 2, and let . First, by the definition of , we have that
and since for every , we get that
We also have by Theorem 1 that
where the first derivation is by the definition of in Algorithm 2 at Line 11, the second holds by the definition of and u at Lines 8, 9, and 12 of the algorithm, the third holds since , and the last inequality holds since for every and . Combining the fact that with (A13) yields that
By (A12) and since w is a distribution vector we also have that
which implies
By Lemma A2, Corollary A1, and (A16), Theorem 2 holds as
□
Appendix E. Proof of Theorem 3
Proof of Theorem 3.
We use the notation and variable names as defined in Algorithm 3.
First, we assume that for every , otherwise we remove all the points in P which have zero weight, since they do not contribute to the weighted sum. Identify the input set and the set C that is computed at Line 13 of Algorithm 3 as . We will first prove that the weighted set that is computed in Lines 13–15 at an arbitrary iteration satisfies:
- (a)
- ,
- (b)
- u(p) =1,
- (c)
- , and
- (d)
Let be the vector summarization -coreset of the weighted set that is computed during the execution of the current iteration at Line 12. Hence, by Theorem 1
Proof of (a). Property (a) is satisfied by Line 13, as we have that .
Proof of (b). Property (b) is also satisfied since
where the first equality holds by the definition of C at Line 13 and for every at Line 15, and the third equality holds by the definition of for every as in Line 10.
Proof of (c). By the definition of and , for every
The weighted sum of is
where the first equality holds by the definitions of C and w, and the third equality holds by the definition of at Line 9. Plugging (A19) and (A20) in (A17) satisfies (c) as
Proof of (d). By (A17) we have that C contains at most clusters from P and at most points, and by plugging we obtain that as required.
We now prove (i)–(iii) from Theorem 3.
Proof of Theorem 3 (i). The first condition in (i) is satisfied since at each iteration we reduce the data size by a factor of 2, and we keep reducing until we reach the stopping condition, which is by Theorem 1 (since we require a error when we use Theorem 1, i.e., we need a coreset of size ). Then, at Line 5, when the if condition is satisfied (it should be, as explained), we finally use Theorem 1 again to obtain a coreset of size with -error on the small data (that was of size ).
The second condition in (i) is satisfied since, whenever we return such a pair at Line 18, we get by (b) that the sum of weights is always equal to 1.
Proof of Theorem 3 (ii). By (d), we also get that there are at most recursive calls. Hence, by induction on (2), we conclude that the last computed set at Line 18 satisfies (ii)
At Line , we return a coreset for the input weighted set, which has reached a size of . Hence, the output of the call satisfies
Proof of Theorem 3 (iii). As explained before, there are at most recursive calls before the stopping condition at Line 4 is met. At each iteration, we compute the set of means , and compute a vector summarization -coreset for them. Hence, the time complexity of each iteration is , where is the number of points in the current iteration, and is the running time of Algorithm 1 on k points in to obtain a -coreset. Thus, the total running time of the algorithm until the "If" condition at Line 4 is satisfied is
Plugging and observing that the last compression at Line 5 is done on data of size proves (iii), as the running time of Algorithm 3 is . □
Appendix F. Proof of Corollary 4
Proof.
The corollary immediately holds by using Algorithm 2 with a small change. We change Line 11 in Algorithm 2 to use Algorithm 3 and Theorem 3, instead of Algorithm 1 and Theorem 1. □
Appendix G. Proof of Lemma 5
We first prove the following lemma:
Lemma A4.
Let P be a set of n points in , , and . Let , and let S be a sample of points chosen i.i.d uniformly at random from P. Then, with probability at least we have that
Proof.
For any random variable X, we denote by and the expectation and variance of the random variable X, respectively. Let denote the random variable that is the ith sample for every . Since the samples are drawn i.i.d, we have
For any random variable X and error parameter , the generalized Chebyshev inequality [64] states that
Substituting , and in (A23) yields that
□
Now we prove Lemma 5
Proof.
Let be a set of k i.i.d. sampled subsets, each of size , as defined at Line 5 of Algorithm 4, and let be the mean of the ith subset, as defined at Line 6. Let be the geometric median of the set of means .
Using Corollary 4.1 from [65] we obtain that
from the above we have that
Note that
where (A28) holds by substituting as in Line 3 of Algorithm 4, and (A29) holds since for every as we assumed. Combining (A29) with (A26) yields,
For every , by substituting , which is of size , in Lemma A4, we obtain that
Hence, with probability at least there is at least one set such that
By the following inequalities:
we get that with probability at least there is a set such that
Let be a function such that for every . Therefore, by the definitions of f and ,
Observe that f is a convex function since it is a sum over convex functions. By the convexity of f, we get that for every pair of points it holds that:
Therefore, by the definition of at in Algorithm 4 we get that
Running time. It takes to compute the set of means at Line 6, and time to compute Line 7 by simple exhaustive search over all the means. Hence, the total running time is . □
Appendix H. Proof of Theorem 6
We first show a reduction to a normalized weighted set as follows:
Corollary A5.
Let be a weighted set, and let be its corresponding normalized weighted set as computed in Observation A1. Let be a 1-mean ε-coreset for and let . Then is a 1-mean ε-coreset for .
Proof.
Let be a 1-mean -coreset for . We prove that is a 1-mean -coreset for . Observe that
where the first equality holds by (A1), and the second holds by the definition of w and . Since is a 1-mean -coreset for
where the equality holds by (A1) and since . The proof concludes by combining (A36) and (A38) as □
1-Mean Problem Reduction
Given a normalized weighted set as in Definition A1, in the following lemma we prove that a weighted set is a 1-mean -coreset for it if three properties related to the mean, variance, and weights of hold.
Lemma A6.
Let be a normalized weighted set of n points in , , and such that,
- ,
- , and
- .
Then, is a 1-mean ε-coreset for , i.e., for every we have that
Proof.
First we have that,
where the last equality holds by the attributes (a)–(c) of the normalized weighted set . By rearranging the left hand side of (A39) we get,
where (A43) holds by the triangle inequality, (A44) holds by attributes (a)–(c), and (A45) holds by combining assumptions (2), (3), and the Cauchy-Schwarz inequality, respectively. We also have for every that , hence,
By (A46) and assumption (1) we get that,
Observe that if assumptions (1), (2) and (3) hold, then (A48) holds. We therefore obtain an -coreset. □
To prove Theorem 6, we split it into two claims:
Claim A7.
Let be a weighted set of n points in , , and let u be the output of a call to ; see Algorithm 2. Then is a vector with non-zero entries that is computed in time, and is a 1-mean ε-coreset for .
Proof.
Let be the normalized weighted set that is computed at Lines 3–5 of Algorithm 2 where , and let . We show that is a 1-mean -coreset for , then by Corollary A5 we get that is a 1-mean coreset for .
Let , let and for every . By the definition of at line 11 in Algorithm 2, and since the algorithm gets as input, we have that
and
For every let be defined as at Line 12 of the algorithm. It immediately follows by the definition of and (A49) that
We now prove that Properties (1)–(3) in Lemma A6 hold for . We have that
where the first derivation follows from (A50), the second holds by the definition of ,, and for every , the third holds since , and the last holds since for every such that and .
By (A54) and since w is a distribution vector we also have that
By Theorem 1, we have that is a distribution vector, which yields,
By the above we get that . Hence,
where the first equality holds since , the second holds since w is a distribution and the last is by (A56). Now by (A57), (A56) and (A55) we obtain that satisfies Properties (1)–(3) in Lemma A6. Hence, by Lemma A6 and Corollary A5 we get that
The running time is the running time of Algorithm 1 with instead of , i.e., .
□
Now we prove the following claim:
Claim A8.
Let be a weighted set of n points in , . Then in we can compute a vector , such that u has non-zero entries, and is a 1-mean -coreset for .
Proof.
The Claim immediately holds by using Algorithm 2 with a small change. We change Line 11 in Algorithm 2 to use Algorithm 3 and Theorem 3, instead of Algorithm 1 and Theorem 1. □
Combining Claim A7 with Claim A8 proves Theorem 6.
Appendix I. Proof of Corollary 7
Proof.
We consider the variables defined in Algorithm 5. Let such that , and let . Plugging into Theorem 3 of [35]
We also have by the definition of W and Theorem 2
where the first inequality holds since for every , and the vector is a vector summarization -coreset for .
Finally, the corollary holds by combining Lemma 4.1 of [45] with (A60). □
References
- Valiant, L.G. A theory of the learnable. Commun. ACM 1984, 27, 1134–1142.
- Vapnik, V. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems; Morgan-Kaufmann: Denver, CO, USA, 1992; pp. 831–838.
- Feldman, D.; Langberg, M. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 569–578.
- Nielsen, M.A. Neural Networks and Deep Learning; Determination Press: San Francisco, CA, USA, 2015; Volume 2018.
- Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009.
- Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67.
- Bergman, S. The Kernel Function and Conformal Mapping; American Mathematical Soc.: Providence, RI, USA, 1970; Volume 5.
- Eggleston, H.G. Convexity. J. Lond. Math. Soc. 1966, 1, 183–186.
- Phillips, J.M. Coresets and sketches. arXiv 2016, arXiv:1601.00617.
- Har-Peled, S. Geometric Approximation Algorithms; Number 173; American Mathematical Soc.: Providence, RI, USA, 2011.
- Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
- Langberg, M.; Schulman, L.J. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, Austin, TX, USA, 17 January 2010; pp. 598–607.
- Carathéodory, C. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Math. Ann. 1907, 64, 95–115.
- Cook, W.; Webster, R. Caratheodory’s theorem. Can. Math. Bull. 1972, 15, 293.
- Phillips, J.M.; Tai, W.M. Near-optimal coresets of kernel density estimates. Discret. Comput. Geom. 2020, 63, 867–887.
- Matousek, J. Approximations and optimal geometric divide-and-conquer. J. Comput. Syst. Sci. 1995, 50, 203–208.
- Braverman, V.; Feldman, D.; Lang, H. New frameworks for offline and streaming coreset constructions. arXiv 2016, arXiv:1612.00889.
- Bentley, J.L.; Saxe, J.B. Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithms 1980, 1, 301–358.
- Har-Peled, S.; Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13 June 2004; pp. 291–300.
- Maalouf, A.; Jubran, I.; Feldman, D. Fast and accurate least-mean-squares solvers. arXiv 2019, arXiv:1906.04705.
- Drineas, P.; Magdon-Ismail, M.; Mahoney, M.W.; Woodruff, D.P. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 2012, 13, 3475–3506.
- Cohen, M.B.; Peng, R. Lp row sampling by Lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 4 June 2015; pp. 183–192.
- Ritter, K. Average-Case Analysis of Numerical Problems; Springer: Berlin/Heidelberg, Germany, 2007.
- Juditsky, A.; Nemirovski, A.S. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv 2008, arXiv:0809.0813.
- Tropp, J.A. An introduction to matrix concentration inequalities. arXiv 2015, arXiv:1501.01571.
- Charikar, M.; Chen, K.; Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming; Springer: Berlin/Heidelberg, Germany, 2002; pp. 693–703.
- Feldman, D.; Ozer, S.; Rus, D. Coresets for vector summarization with applications to network graphs. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 17 July 2017; Volume 70, pp. 1117–1125.
- Węglarczyk, S. Kernel density estimation and its application. In ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2018; Volume 23.
- Zheng, Y.; Jestes, J.; Phillips, J.M.; Li, F. Quality and efficiency for kernel density estimates in large data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22 June 2013; pp. 433–444.
- Bachem, O.; Lucic, M.; Krause, A. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19 July 2018; pp. 1119–1127.
- Barger, A.; Feldman, D. k-Means for Streaming and Distributed Big Sparse Data. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 30 June 2016; pp. 342–350.
- Feldman, D.; Schmidt, M.; Sohler, C. Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. arXiv 2018, arXiv:1807.04518.
- Feldman, D.; Volkov, M.; Rus, D. Dimensionality reduction of massive sparse datasets using coresets. Adv. Neural Inf. Process. Syst. 2016, 29, 2766–2774.
- Cohen, M.B.; Elder, S.; Musco, C.; Musco, C.; Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 14 June 2015; pp. 163–172.
- Varadarajan, K.; Xiao, X. On the sensitivity of shape fitting problems. arXiv 2012, arXiv:1209.4893.
- Feldman, D.; Tassa, T. More constraints, smaller coresets: Constrained matrix approximation of sparse big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10 August 2015; pp. 249–258.
- Frieze, A.; Kannan, R.; Vempala, S. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM (JACM) 2004, 51, 1025–1041.
- Yang, J.; Chow, Y.L.; Ré, C.; Mahoney, M.W. Weighted SGD for ℓp regression with randomized preconditioning. J. Mach. Learn. Res. 2017, 18, 7811–7853.
- Cohen, M.B.; Lee, Y.T.; Musco, C.; Musco, C.; Peng, R.; Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, Rehovot, Israel, 11 January 2015; pp. 181–190.
- Papailiopoulos, D.; Kyrillidis, A.; Boutsidis, C. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24 August 2014; pp. 997–1006.
- Drineas, P.; Mahoney, M.W.; Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 2008, 30, 844–881.
- Cohen, M.B.; Musco, C.; Musco, C. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, Barcelona, Spain, 16 January 2017; pp. 1758–1777.
- Maalouf, A.; Statman, A.; Feldman, D. Tight sensitivity bounds for smaller coresets. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 23 August 2020; pp. 2051–2061.
- Batson, J.; Spielman, D.A.; Srivastava, N. Twice-Ramanujan sparsifiers. SIAM J. Comput. 2012, 41, 1704–1721.
- Cohen, M.B.; Nelson, J.; Woodruff, D.P. Optimal approximate matrix product in terms of stable rank. arXiv 2015, arXiv:1507.02268.
- Clarkson, K.L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms (TALG) 2010, 6, 63.
- Desai, A.; Ghashami, M.; Phillips, J.M. Improved practical matrix sketching with guarantees. IEEE Trans. Knowl. Data Eng. 2016, 28, 1678–1690.
- Madariaga, D.; Madariaga, J.; Bustos-Jiménez, J.; Bustos, B. Improving Signal-Strength Aggregation for Mobile Crowdsourcing Scenarios. Sensors 2021, 21, 1084.
- Mahendran, N.; Vincent, D.R.; Srinivasan, K.; Chang, C.Y.; Garg, A.; Gao, L.; Reina, D.G. Sensor-assisted weighted average ensemble model for detecting major depressive disorder. Sensors 2019, 19, 4822.
- Wu, L.; Xu, Q.; Heikkilä, J.; Zhao, Z.; Liu, L.; Niu, Y. A star sensor on-orbit calibration method based on singular value decomposition. Sensors 2019, 19, 3301.
- Yang, W.; Hong, J.Y.; Kim, J.Y.; Paik, S.H.; Lee, S.H.; Park, J.S.; Lee, G.; Kim, B.M.; Jung, Y.J. A novel singular value decomposition-based denoising method in 4-dimensional computed tomography of the brain in stroke patients with statistical evaluation. Sensors 2020, 20, 3063.
- Peri, E.; Xu, L.; Ciccarelli, C.; Vandenbussche, N.L.; Xu, H.; Long, X.; Overeem, S.; van Dijk, J.P.; Mischi, M. Singular value decomposition for removal of cardiac interference from trunk electromyogram. Sensors 2021, 21, 573.
- Code. Open Source Code for All the Algorithms Presented in This Paper. 2021. Available online: https://github.com/alaamaalouf/vector-summarization-coreset (accessed on 29 September 2021).
- Van Rossum, G.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009.
- Oliphant, T.E. A Guide to NumPy; Trelgol Publishing: USA, 2006; Volume 1. Available online: https://ecs.wgtn.ac.nz/foswiki/pub/Support/ManualPagesAndDocumentation/numpybook.pdf (accessed on 29 September 2021).
- Tremblay, N.; Barthelmé, S.; Amblard, P.O. Determinantal Point Processes for Coresets. J. Mach. Learn. Res. 2019, 20, 1–70.
- Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 29 September 2021).
- Donovan, B.; Work, D. Using Coarse GPS Data to Quantify City-Scale Transportation System Resilience to Extreme Events. 2015. Available online: http://vis.cs.kent.edu/DL/Data/ (accessed on 29 September 2021).
- US Census Data (1990) Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990) (accessed on 10 June 2021).
- Kawala, F.; Douzal-Chouakria, A.; Gaussier, E.; Dimert, E. Prédictions d’activité dans les réseaux sociaux en ligne (Activity prediction in online social networks). 2013. Available online: https://archive.ics.uci.edu/ml/datasets/Buzz+in+social+media+ (accessed on 29 September 2021).
- Huerta, R.; Mosqueiro, T.; Fonollosa, J.; Rulkov, N.F.; Rodriguez-Lujan, I. Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring. Chemom. Intell. Lab. Syst. 2016, 157, 169–176.
- Chen, X. A new generalization of Chebyshev inequality for random vectors. arXiv 2007, arXiv:0707.0805.
- Minsker, S. Geometric median and robust estimation in Banach spaces. Bernoulli 2015, 21, 2308–2335.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).