Article

Understanding Collections of Related Datasets Using Dependent MMD Coresets

by Sinead A. Williamson 1,* and Jette Henderson 2
1 Department of Statistics and Data Science, University of Texas at Austin, Austin, TX 78712, USA
2 CognitiveScale, Austin, TX 78759, USA
* Author to whom correspondence should be addressed.
Information 2021, 12(10), 392; https://doi.org/10.3390/info12100392
Submission received: 2 August 2021 / Revised: 2 September 2021 / Accepted: 3 September 2021 / Published: 23 September 2021
(This article belongs to the Special Issue Foundations and Challenges of Interpretable ML)

Abstract

Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.

1. Introduction

When working with large datasets, it is important to understand your data. If a dataset is not representative of your population of interest, and no appropriate correction is made, then models trained on this data may perform poorly in the wild. Sub-populations that are under-represented in the training data are likely to be poorly served by the resulting algorithm, leading to unanticipated or unfair outcomes—something that has been observed in numerous scenarios including medical diagnoses [1,2] and image classification [3,4].
In low-dimensional settings, it is common to summarize data using summary statistics such as marginal moments or label frequencies, or to visualize univariate or bivariate marginal distributions using histograms or scatter plots. As the dimensionality of our data increases, such summaries and visualizations become unwieldy, and ignore higher-order correlation structure. In structured data such as images, such summary statistics can be hard to interpret, and can exclude important information about the distribution [5,6]—the per-pixel mean and standard deviation of a collection of images tells us little about the overall distribution. Further, if our data are not labeled, or are only partially labeled, we cannot make use of label frequencies to assess class balance.
In such settings, we can instead choose to present a set of exemplars that capture the diversity of the data. This is particularly helpful for structured, high-dimensional data such as images or text that can easily be qualitatively assessed by a person. A number of algorithms have been proposed to find such a set of exemplars [7,8,9,10,11,12,13,14,15,16,17]. Many of these algorithms can be seen as constructing a coreset for the dataset—a (potentially weighted) set of exemplars that behave similarly to the full dataset under a certain class of functions. In particular, coresets that minimize the maximum mean discrepancy [18] (MMD) between coreset and data have recently been used for understanding data distributions [11,13]. Further, evaluating models on such MMD-coresets has been shown to aid in understanding model performance [11].
In addition to summarizing a single dataset, we may also wish to compare and contrast multiple related datasets. For example, a company may be interested in characterizing differences and similarities between different markets. A machine learning practitioner may wish to know whether their dataset is similar to that used to train a given model. A researcher may be interested in understanding trends in posts or images on social media. Here, summary statistics offer interpretable comparisons: we can plot the mean and standard deviation of a given marginal quantity over time, and easily see how it changes [19,20]. By contrast, coresets are harder to compare, since the exemplars selected for two datasets $X_1$ and $X_2$ will not in general overlap.
In this paper, we introduce dependent MMD coresets, a new tool for characterizing related datasets and understanding model behavior across such datasets. These dependent MMD coresets provide a low-dimensional summary of a collection of datasets that allows easy comparison across datasets. A dependent MMD coreset for a collection of datasets constructs a collection of exemplars that is shared across all datasets. Each dataset assigns a different weight vector to these exemplars, so that the weighted exemplars approximate the dataset. These weights allow us to easily see which exemplars are relevant to which datasets, and comparing two sets of weights provides a simple way of showing how the corresponding datasets differ.
The use of shared exemplars makes it easy to compare two or more datasets, by providing a common language. Consider comparing two datasets of faces. If we independently constructed representations of each dataset—for example, using two independent MMD coresets—we would obtain two disjoint sets of weighted exemplars. Visually assessing the similarity between two sets would involve considering both the similarities of the images and the similarities in the weights. Conversely, with a dependent MMD coreset, the exemplars would be shared between the two datasets. Similarity can be assessed by considering the relative weights assigned in the two marginal coresets. This in turn leads to easy summarization of the difference between the two datasets, by identifying exemplars that are highly representative of one dataset, but less representative of the other.
In addition to understanding the difference between multiple datasets, dependent MMD coresets allow us to qualitatively explore the behavior of algorithms on these datasets. The shared set of exemplars provides representative points at which to evaluate the algorithm. Looking at the relative weights of these exemplars in the different datasets paints a picture of the relative performances we would expect between those datasets. This is particularly useful when a model has been trained on one dataset, but we wish to apply it to a second dataset: looking at exemplars that are highly representative of the second dataset, but not the first, allows us to identify potential failure modes.
We begin by considering existing coreset methods for data and model understanding in Section 2, before discussing their limitations and proposing our dependent MMD coreset in Section 3. A greedy algorithm to select dependent MMD coresets is provided in Section 3.4. In Section 4, we evaluate the efficacy of this algorithm, and show how the resulting dependent coresets can be used for understanding collections of image datasets and probing the generalization behavior of machine learning algorithms. We summarize notation used in this paper in Table 1.

2. Background and Related Work

2.1. Coresets and Measure-Coresets

A coreset is a “small” summary of a dataset $X$, which can act as a proxy for the dataset under a certain class of functions $\mathcal{F}$. Concretely, a weighted set of points $\{(w_i, u_i)\}_{i \in S}$ is an $\epsilon$-strong coreset for a size-$n$ dataset $X$ with respect to $\mathcal{F}$ if
$$\left| \frac{1}{n} \sum_{i=1}^n f(x_i) - \frac{1}{|S|} \sum_{j \in S} w_j f(u_j) \right| \le \epsilon$$
for all $f \in \mathcal{F}$ [21].
A measure coreset [22] generalizes this idea to assume that $X$ are independently and identically distributed samples from some distribution $P$. A measure $Q$ is an $\epsilon$-measure coreset for $P$ with respect to some class $\mathcal{F}$ of functions if
$$\sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \right| \le \epsilon .$$
The left hand side of Equation (1) describes an integral probability metric [23], a class of distances between probability measures parametrized by some class $\mathcal{F}$ of functions. Different choices of $\mathcal{F}$ yield different distances (Table 2).

2.2. MMD-Coresets

In this paper, we consider the case where $\mathcal{F}$ is the class of all functions that can be represented in the unit ball of some reproducing kernel Hilbert space (RKHS) $\mathcal{H}$—a very rich class of continuous functions on $\mathcal{X}$. This corresponds to a metric known as the maximum mean discrepancy [18] (MMD),
$$\mathrm{MMD}(P, Q) = \sup_{f \in \mathcal{H}} \left| \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \right| .$$
An RKHS can be defined in terms of a mapping $\Phi : \mathcal{X} \to \mathcal{H}$, which in turn specifies a kernel function $k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}$. A distribution $P$ can be represented in this space in terms of its mean embedding, $\mu_P = \mathbb{E}_P[\Phi(x)]$. The MMD between two distributions can equivalently be expressed in terms of their mean embeddings, $\mathrm{MMD}(P, Q)^2 = \| \mu_P - \mu_Q \|_{\mathcal{H}}^2$.
An $\epsilon$-MMD coreset for a distribution $P$ is a finite, atomic distribution $Q = \sum_{i \in S} w_i \delta_{u_i}$ such that $\mathrm{MMD}(P, Q)^2 \le \epsilon^2$. We will refer to the set $\{u_i\}_{i \in S}$ as the support of $Q$, and refer to individual locations in the support of $Q$ as exemplars.
In practice, we are unlikely to have access to $P$ directly, but instead have samples $X := (x_1, \ldots, x_n) \sim P$. If $Q = \sum_{i \in S} w_i \delta_{u_i}$, we can estimate $\mathrm{MMD}(P, Q)^2$ as
$$\widehat{\mathrm{MMD}^2}(X, Q) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k(x_i, x_j) + \sum_{i \in S} \sum_{j \in S} w_i w_j k(u_i, u_j) - \frac{2}{n} \sum_{i=1}^n \sum_{j \in S} w_j k(x_i, u_j) .$$
We, therefore, define an $\epsilon$-MMD coreset for a dataset $X$ as a finite, atomic distribution $Q$ such that $\widehat{\mathrm{MMD}^2}(X, Q) \le \epsilon^2$—or equivalently, whose mean embedding $\mu_Q$ in $\mathcal{H}$ is close to the empirical mean embedding $\hat{\mu}_X$, so that $\| \mu_Q - \hat{\mu}_X \|_{\mathcal{H}}^2 \le \epsilon^2$.
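To make this estimator concrete, the following minimal NumPy sketch evaluates $\widehat{\mathrm{MMD}^2}(X, Q)$ for a weighted atomic coreset. The squared exponential kernel, the function names, and the toy data are illustrative choices of ours, not part of any released implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared exponential kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * lengthscale ** 2))

def mmd2_hat(X, U, w, lengthscale=1.0):
    """Plug-in estimate of MMD^2 between a dataset X and the coreset Q = sum_i w[i] * delta_{U[i]}."""
    n = X.shape[0]
    Kxx = rbf_kernel(X, X, lengthscale)
    Kuu = rbf_kernel(U, U, lengthscale)
    Kxu = rbf_kernel(X, U, lengthscale)
    return (Kxx.sum() / n ** 2            # (1/n^2) sum_ij k(x_i, x_j)
            + w @ Kuu @ w                 # sum_ij w_i w_j k(u_i, u_j)
            - 2.0 / n * (Kxu @ w).sum())  # (2/n) sum_i sum_j w_j k(x_i, u_j)

# Toy usage: 200 two-dimensional points summarized by five uniformly weighted exemplars.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
U, w = X[:5], np.full(5, 0.2)
print(mmd2_hat(X, U, w, lengthscale=1.0))
```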
A number of algorithms have been proposed that correspond to finding $\epsilon$-MMD coresets, under certain restrictions on $Q$ (while most of these algorithms do not explicitly use coreset terminology, the resulting set of samples, exemplars or prototypes meets the definition of an MMD coreset for some value of $\epsilon$). Many of these algorithms greedily construct an MMD coreset, adding exemplars one-by-one based on some criterion. For example, kernel herding [14,24,25] can be seen as finding an MMD coreset $Q$ for a known distribution $P$, with no restriction on the support of $Q$. The greedy prototype-finding algorithm used by [11] can be seen as a version of kernel herding, where $P$ is only observed via a set of samples $X$, and where the support of $Q$ is restricted to be some subset of a collection of candidates $U$ (often chosen to be the dataset $X$). Versions of this algorithm that assign weights to the atoms in $Q$ are proposed in [13].
Other methods start from the full dataset, and repeatedly discard points to construct a coreset [15,16,17,26]. Loosely, these methods repeatedly partition the dataset based on a discrepancy criterion, and then discard one half of the partition. Compared with the greedy methods, these approaches typically obtain smaller coresets for a given ϵ [15,17]. As shown by [27], random sampling also provides a way to construct an MMD-coreset.
As in [11,13], in this paper we require the support of our coreset to be a subset of some finite set of candidates $U$, indexed by $1, \ldots, n_U$. In other words, our measure coresets will take the form $Q = \sum_{i \in S} w_i \delta_{u_i}$, where $S \subseteq [n_U]$.

2.3. Coresets for Understanding Datasets and Models

The primary application of coresets is to create a compact representation of a large dataset, to allow for fast inference on downstream tasks (see [28] for a recent survey). However, such compact representations have also proved beneficial in interpretation of both models and datasets.
While humans are good at interpreting visual data [29], visualizing large quantities of data can become overwhelming due to the sheer quantity of information. Coresets can be used to filter such large datasets, while still retaining much of the relevant information.
The MMD-critic algorithm [11] uses a fixed-dimension, uniformly weighted MMD coreset, which they refer to as “prototypes”, to summarize collections of images. Gurumoorthy et al. [13] extends this to use a weighted MMD coreset, showing that weighted prototypes allow us to better model the data distribution, leading to more interpretable summaries. Zheng et al. [30] show how unweighted MMD coresets can be used to represent spatial point processes such as spatial location of crimes.
Techniques such as coresets that produce representative points for a dataset can also be used to provide interpretations and explanations of the behavior of models on that dataset. Case-based reasoning approaches use representative points to describe “typical” behavior of a model [11,31,32,33]. Considering the model’s output on such representative points can allow us to understand the model’s behavior.
Viewing the model’s behavior on a collection of “typical” points in our dataset also allows us to get an idea of the overall model performance on our data. Evaluating a model on a coreset can give an idea of how we expect it to perform on the entire dataset, and can help identify failure modes or subsets of the data where the model performs poorly.

2.4. Criticising MMD Coresets

While MMD coresets are good at summarizing a distribution, since the coreset is much smaller than the original dataset, there are likely to be outliers in the data distribution that are not well explained by the coreset. The MMD-critic algorithm supplements the “prototypes” associated with the MMD coreset with a set of “criticisms”—points that are poorly modeled by the coreset [11].
Recall from Equation (5) that the MMD between two distributions $P$ and $Q$ corresponds to the maximum difference in the expected value, under the two distributions, of a function that can be represented in the unit ball of a Hilbert space $\mathcal{H}$. The function $f$ that achieves this maximum is known as the witness function, and is given by
$$f(x) = \mathbb{E}_{X \sim P}[k(x, X)] - \mathbb{E}_{Y \sim Q}[k(x, Y)] .$$
When we only have access to $P$ via a size-$n$ sample $X$, and where $Q = \sum_{i \in S} w_i \delta_{u_i}$, we can approximate this as
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n k(x, X_i) - \sum_{j \in S} w_j k(x, u_j) .$$
Criticisms of an MMD coreset for a data set X are selected as the points in X with the largest values of the witness function. Kim et al. [11] show that the combination of prototypes and criticisms allow us to visually understand large collections of images: the prototypes summarize the main structure of the dataset, while the criticisms allow us to represent the extrema of a distribution. Criticisms can also augment an MMD coreset in a case-based reasoning approach to model understanding, by allowing us to consider model behavior on both “typical” and “atypical” exemplars.
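As a sketch of how such criticisms might be selected in practice, the helper below evaluates the empirical witness function and ranks points by it, reusing the rbf_kernel helper from the earlier sketch. It simply takes the points with the largest witness values; any diversity-promoting term used in practice is omitted, and the names are our own.

```python
import numpy as np  # rbf_kernel as defined in the earlier sketch

def witness_hat(x_query, X, U, w, lengthscale=1.0):
    """Empirical witness function at each row of x_query: mean kernel similarity
    to the data minus weighted kernel similarity to the coreset exemplars."""
    Kqx = rbf_kernel(x_query, X, lengthscale)
    Kqu = rbf_kernel(x_query, U, lengthscale)
    return Kqx.mean(axis=1) - Kqu @ w

def select_criticisms(X, U, w, n_crit=10, lengthscale=1.0):
    """Indices of the n_crit points in X with the largest witness values,
    i.e., the points least well captured by the coreset."""
    return np.argsort(-witness_hat(X, X, U, w, lengthscale))[:n_crit]
```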

2.5. Dependent and Correlated Random Measures

Dependent random measures [34,35] are distributions over collections of countable measures $P_t = \sum_{i=1}^{\infty} w_{t,i} \delta_{u_{t,i}}$, indexed by some set $\mathcal{T}$, such that the marginal distribution at each $t \in \mathcal{T}$ is described by a specific distribution. In most cases, this marginal distribution is a Dirichlet process, meaning that the $P_t$ are probability distributions. Most dependent random measures keep either the weights $w_{t,i}$ or the atom locations $u_{t,i}$ constant across $t$, to assist identifiability and interpretability.
In a Bayesian framework, dependent Dirichlet processes are often used as a prior for time-dependent mixture models. In settings where the atom locations (i.e., mixture components) are fixed but the weights vary, the posterior mixture components can be used to visualize and understand data drift [36,37]. The dependent coresets presented in this paper can be seen as deterministic, finite-dimensional analogues of these posterior dependent random measures.

3. Understanding Multiple Datasets Using Coresets

As we have seen, coresets provide a way of summarizing a single distribution. In this section, we discuss interpretational limitations that arise when we attempt to use coresets to summarize multiple related datasets (Section 3.1), before proposing dependent MMD coresets in Section 3.2 and discussing their uses in Section 3.3.

3.1. Understanding Multiple Datasets Using MMD-Coresets

If we have a collection { X t } t T of datasets, we might wish to find ϵ -MMD coresets for each of the X t , in the hope of not just summarizing the individual datasets, but also of easily comparing them. However, if we want to understand the relationships between the datasets, in addition to their marginal distributions, comparing such coresets in an interpretable manner is challenging.
An MMD coreset selects a set $\{u_i\}_{i \in S}$ of points from some set of candidates $U$. Even if two datasets $X$ and $Y$ are sampled from the same underlying distribution (i.e., $X, Y \overset{\mathrm{iid}}{\sim} P$), and the set $U$ of available candidates is shared, the optimal MMD-coreset for the two datasets will differ in general. Sampling error between the two distributions means that $\widehat{\mathrm{MMD}^2}(X, Q) \neq \widehat{\mathrm{MMD}^2}(Y, Q)$ for any candidate coreset $Q$ unless $X = Y$, and so the optimal coreset will typically differ between the two datasets.
Figure 1 shows that, even if two datasets $X$ and $Y$ are sampled from the same underlying distribution, and their coreset locations are selected from the same collection $U$, the two coresets will not be identical. Here, we see two datasets (Figure 1b,c) generated from the same mixture of three equally weighted Gaussians (Figure 1a). Below (Figure 1d,e), we have selected a coreset for each dataset (using the algorithm that will be introduced in Section 3.4), with locations selected from a shared set $U$. While the associated coresets are visually similar, they are not the same.
This is magnified if we look at a high-dimensional dataset. Here, the relative sparsity of data points (and candidate points) in the space means that individual locations in $Q_X$ might not have close neighbors in $Q_Y$, even if $X$ and $Y$ are sampled from the same distribution. Further, in high-dimensional spaces, it is harder to visually assess the distance between two exemplars. These observations make it hard to compare two coresets, and gain insights about similarities and differences between the associated datasets.
To demonstrate this, we constructed two datasets, each containing 250 randomly selected, female-identified US high school yearbook photos from the 1990s. Figure 2 shows MMD-coresets obtained for the two datasets (see Section 4.2.1 for full details of dataset and coreset generation). While both datasets were selected from the same distribution, there is no overlap in the support of the two coresets. Visually, it is hard to tell that these two coresets are representing samples from the same distribution.

3.2. Dependent MMD Coresets

The coresets in Figure 2 are hard to compare due to their disjoint supports (i.e., the fact that there are no shared exemplars). Comparing the two coresets involves comparing multiple individual photos and assessing their similarities, in addition to incorporating the information encoded in the associated weights. To avoid the lack of interpretability resulting from dissimilar supports, we introduce the notion of a dependent MMD coreset.
Given a collection of datasets $\{X_t\}_{t \in \mathcal{T}}$, the collection of finite, atomic measures $\{Q_t\}_{t \in \mathcal{T}}$ is an $\epsilon$-dependent MMD coreset if
$$\widehat{\mathrm{MMD}^2}(X_t, Q_t) \le \epsilon^2$$
for all $t \in \mathcal{T}$, and if the $Q_t$ have common support, i.e.,
$$Q_t = \sum_{i \in S} w_{t,i} \delta_{u_i} ,$$
where $\{u_i\}_{i \in S}$ is a subset of some candidate set $U$.
In Equation (4), the exemplars $u_i$ are shared between all $t \in \mathcal{T}$, but the weights $w_{t,i}$ associated with these exemplars can vary with $t$. Taking the view from Hilbert space, we are restricting the mean embeddings $\mu_{Q_t}$ of the marginal coresets to all lie within a convex hull defined by the exemplars $\{u_i\}_{i \in S}$.
By restricting the support of our coresets in this manner, we obtain data summaries that are easily comparable. Within a single dataset, we can look at the weighted exemplars that make up the coreset and use these to understand the spread of the data, as is the case with an independent MMD coreset. Indeed, since Q t still meets the definition of an MMD coreset for X t (see Equation (3)), we can use it analogously to an independently generated coreset. However, since the exemplars are shared across datasets, we can directly compare the exemplars for two datasets. We no longer need to intuit similarities between disjoint sets of exemplars and their corresponding weights; instead we can directly compare the weights for each exemplar to determine their relative relevance to each dataset. We will show in Section 4.2.1 that this facilitates qualitative comparison between the marginal coresets, when compared to independently generated coresets.
We note that the dependent MMD coresets introduced in this paper are directly extensible to other integral probability metrics; we could, for example, construct a dependent version of the Wasserstein coresets introduced by [22].

3.3. Model Understanding and Extrapolation

As we discussed in Section 2.3, MMD coresets can be used as tools to understand the performance of an algorithm on “typical” data points. Considering how an algorithm performs on such exemplars allows the practitioner to understand failure modes of the algorithm, when applied to the data. In classification tasks where labeling is expensive, or on qualitative tasks such as image modification, looking at an appropriate coreset can provide an estimate of how the algorithm will perform across the dataset.
In a similar manner, dependent coresets can be used to understand the generalization behavior of an algorithm. Assume a machine learning algorithm has been trained on a given dataset $X_a$, but we wish to apply it (without modification) to a dataset $X_b$. This is frequently done in practice, since many machine learning algorithms require large training sets and are computationally expensive to train; however, if the training distribution differs from the deployment distribution, the algorithm may not perform as intended. In general, we would expect the algorithm to perform well on data points in $X_b$ that have many close neighbors in $X_a$, but perform poorly on data points in $X_b$ that are not well represented in $X_a$.
Creating a dependent MMD coreset $Q_a = \sum_i w_{a,i} \delta_{u_i}$, $Q_b = \sum_i w_{b,i} \delta_{u_i}$ for the pair $(X_a, X_b)$ allows us to identify exemplars that are highly representative of $X_a$ or $X_b$ (i.e., have high weight in the corresponding weighted coreset). Further, by comparing the weights in the two coreset measures—e.g., by calculating $f_i = w_{b,i} / w_{a,i}$—we can identify exemplars that are much more representative of one dataset than another. Rather than look at all points in the coreset, if we are satisfied with the performance of our model on $X_a$, we can choose to only look at points with high values of $f_i$—points that are representative of the new dataset $X_b$, but not the original dataset $X_a$. Further, if we wish to consider generalization to multiple new datasets, a shared set of exemplars reduces the amount of labeling or evaluation required.
An MMD coreset, dependent or otherwise, will only contain exemplars that are representative of the dataset(s). There are likely to be outliers that are less well represented by the coreset. Such outliers are likely to be underserved by a given algorithm—for example, yielding low accuracy or poor reconstructions.
As we saw in Section 2.4, MMD coresets can be augmented by criticisms—points in X that are poorly approximated by Q . We can equivalently construct criticisms for each dataset represented by a dependent MMD coreset. In the example above, we would select criticisms for the dataset X b by selecting points in X b that maximize
$$\hat{f}_t(x) = \frac{1}{n_t} \sum_{i=1}^{n_t} k(x, X_{t,i}) - \sum_{j \in S} w_{t,j} k(x, u_j) .$$
In addition to evaluating our algorithm on the marginal dependent coreset for dataset X b , or the subset of the coreset with high values of f i , we can evaluate on the criticisms C b . In conjunction, the dependent MMD coreset and its criticisms allow us to better understand how the algorithm is likely to perform on both typical, and atypical, exemplars of X b .
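As a small illustration of this workflow, the sketch below computes the ratios $f_i$ from the two weight vectors and flags exemplars that are disproportionately representative of the new dataset; the threshold of 2 and the function names are our own illustrative choices.

```python
import numpy as np

def representativeness_ratios(w_a, w_b, eps=1e-12):
    """Per-exemplar ratio f_i = w_b[i] / w_a[i] for a shared-support coreset pair."""
    return w_b / (w_a + eps)

def probe_exemplars(w_a, w_b, threshold=2.0):
    """Indices of exemplars far more representative of X_b than of X_a."""
    return np.where(representativeness_ratios(w_a, w_b) > threshold)[0]
```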

3.4. A Greedy Algorithm for Finding Dependent Coresets

Given a collection $\{X_t\}_{t \in \mathcal{T}}$ of datasets, where we assume $X_t := \{x_{t,1}, \ldots, x_{t,n_t}\} \sim P_t$, and a set of $n_U$ candidates $U$, our goal is to find a collection $\{Q_t\}_{t \in \mathcal{T}}$ with shared support $\{u_i : i \in S \subseteq [n_U]\}$ such that $\widehat{\mathrm{MMD}^2}(X_t, Q_t) \le \epsilon^2$ for all $t \in \mathcal{T}$. We begin by constructing an algorithm for a related task: to minimize $\sum_{t \in \mathcal{T}} \mathrm{MMD}(Q_t, P_t)$, where
$$\widehat{\mathrm{MMD}^2}(X_t, Q_t) = \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k(x_{t,i}, x_{t,j}) + \sum_{i \in S} \sum_{j \in S} w_{t,i} w_{t,j} k(u_i, u_j) - \frac{2}{n_t} \sum_{i=1}^{n_t} \sum_{j \in S} w_{t,j} k(x_{t,i}, u_j)$$
where $Q_t = \sum_{i \in S} w_{t,i} \delta_{u_i}$. If we ignore terms in Equation (5) that do not depend on the $Q_t$, we obtain the following loss:
$$\mathcal{L}(\{Q_t\}_{t \in \mathcal{T}}) = \sum_{t \in \mathcal{T}} \ell_t(Q_t), \qquad \ell_t(Q_t) = \frac{1}{2} \sum_{i \in S} \sum_{j \in S} w_{t,i} w_{t,j} k(u_i, u_j) - \frac{1}{n_t} \sum_{i=1}^{n_t} \sum_{j \in S} w_{t,j} k(x_{t,i}, u_j) .$$
We can use a greedy algorithm to minimize this loss. Let $Q_t^{(m)} = \sum_{i \in S^{(m)}} w_{t,i}^{(m)} \delta_{u_i}$, where $S^{(m)}$ indexes the first $m$ exemplars to be added. We wish to select the exemplar $u_*$, and a set of weights $w_{t,*}^{(m+1)}, \{w_{t,i}^{(m+1)}\}_{i \in S^{(m)}}$ for each dataset $X_t$, that minimize the loss. However, searching over all possible combinations of exemplars and weights is prohibitively expensive, as it involves a non-linear optimization to learn the weights associated with each candidate. Instead, we assume that, for each $t \in \mathcal{T}$, there is some $\alpha_t > 0$ such that $w_{t,*}^{(m+1)} = \frac{\alpha_t}{\alpha_t + 1}$ and $w_{t,i}^{(m+1)} = \frac{w_{t,i}^{(m)}}{\alpha_t + 1}$ for all $i \in S^{(m)}$. In other words, we assume that the relative weights in each $Q_t$ of the previously added exemplars do not change as we add more exemplars.
Fortunately, the value of $\alpha_t$ that minimizes $\ell_t\big(\frac{1}{1+\alpha_t} Q_t^{(m)} + \frac{\alpha_t}{\alpha_t + 1} \delta_{u_*}\big)$ can be found analytically for each candidate $u_*$ by differentiating the loss in step 1, yielding
$$\alpha_t = \frac{\sum_{i,j \in S^{(m)}} w_{t,i} w_{t,j} k(u_i, u_j) - \sum_{i \in S^{(m)}} w_{t,i} k(u_i, u_*) + \frac{1}{n_t} \sum_{i=1}^{n_t} \Big[ k(x_{t,i}, u_*) - \sum_{j \in S^{(m)}} w_{t,j} k(x_{t,i}, u_j) \Big]}{k(u_*, u_*) - \sum_{i \in S^{(m)}} w_{t,i} k(u_i, u_*) - \frac{1}{n_t} \sum_{i=1}^{n_t} \Big[ k(x_{t,i}, u_*) - \sum_{j \in S^{(m)}} w_{t,j} k(x_{t,i}, u_j) \Big]} .$$
We can, therefore, set
$$i^*, \{\alpha_t^*\} \leftarrow \operatorname*{arg\,min}_{\alpha_t \in \mathbb{R}^+,\; i \in [n_U] \setminus S^{(m)}} \; \sum_{t \in \mathcal{T}} \ell_t\Big( \frac{1}{1+\alpha_t} Q_t^{(m)} + \frac{\alpha_t}{\alpha_t + 1} \delta_{u_i} \Big)$$
and let $S^{(m+1)} = S^{(m)} \cup \{i^*\}$, $w_{t,i^*}^{(m+1)} = \frac{\alpha_t^*}{\alpha_t^* + 1}$ for all $t \in \mathcal{T}$, and $w_{t,i}^{(m+1)} = \frac{w_{t,i}^{(m)}}{\alpha_t^* + 1}$ for all $t \in \mathcal{T}$ and $i \in S^{(m)}$.
As written, the procedure will greedily minimize the sum of the per-dataset losses. However, the definition of an MMD coreset involves satisfying, not minimizing: we want $\widehat{\mathrm{MMD}^2}(X_t, Q_t) \le \epsilon^2$ for all $t \in \mathcal{T}$. To achieve this, we modify the sum in Equation (8) so that it only includes terms for which $\widehat{\mathrm{MMD}^2}\big(X_t, Q_t^{(m)}\big) > \epsilon^2$. The resulting procedure is summarized in Algorithm 1.
Algorithm 1. dmmd: Selecting dependent MMD coresets
Require: Datasets $\{X_t\}_{t \in \mathcal{T}}$; candidate set $U$; kernel $k(\cdot,\cdot)$; threshold $\epsilon^2 > 0$
$S^{(0)} \leftarrow \emptyset$; $w_t^{(0)} \leftarrow [\,]$ for all $t \in \mathcal{T}$; $D \leftarrow \mathcal{T}$; $m \leftarrow 0$
while $D \neq \emptyset$ do
  for all $i \in [n_U] \setminus S^{(m)}$ do
    for all $t \in \mathcal{T}$ do
      Calculate $\alpha_{t,i}$ using Equation (7)
    end for
    $L_i \leftarrow 0$
    for all $t \in D$ do
      $L_i \leftarrow L_i + \ell_t\big(\tfrac{1}{1+\alpha_{t,i}} Q_t^{(m)} + \tfrac{\alpha_{t,i}}{\alpha_{t,i}+1}\,\delta_{u_i}\big)$
    end for
  end for
  $i^* \leftarrow \arg\min_{i \in [n_U] \setminus S^{(m)}} L_i$
  $S^{(m+1)} \leftarrow S^{(m)} \cup \{i^*\}$
  for all $t \in \mathcal{T}$ do
    $w_{t,i^*}^{(m+1)} \leftarrow \frac{\alpha_{t,i^*}}{\alpha_{t,i^*}+1}$
    for all $i \in S^{(m)}$ do
      $w_{t,i}^{(m+1)} \leftarrow \frac{w_{t,i}^{(m)}}{\alpha_{t,i^*}+1}$
    end for
    $Q_t^{(m+1)} \leftarrow \sum_{i \in S^{(m+1)}} w_{t,i}^{(m+1)} \delta_{u_i}$
  end for
  $D \leftarrow \{X_t : \widehat{\mathrm{MMD}^2}(X_t, Q_t^{(m+1)}) > \epsilon^2\}$
  $m \leftarrow m + 1$
end while
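To make the procedure concrete, the following NumPy sketch implements the greedy loop under some simplifying assumptions of ours: the kernel quantities are precomputed, the analytic $\alpha_t$ of Equation (7) is handled through $\beta_t = \alpha_t/(1+\alpha_t)$ and clipped to $(0,1)$, and all function and argument names are illustrative rather than taken from the released code.

```python
import numpy as np

def dmmd_greedy(Kuu, kbars, consts, eps2, max_size=200):
    """Greedy construction of a dependent MMD coreset (a sketch of Algorithm 1).

    Kuu    : (n_U, n_U) kernel matrix between candidate points.
    kbars  : list over datasets t of length-n_U vectors,
             kbars[t][i] = (1/n_t) * sum_j k(x_{t,j}, u_i).
    consts : list over datasets t of the scalars (1/n_t^2) * sum_{j,l} k(x_{t,j}, x_{t,l}).
    eps2   : target squared-MMD threshold.
    Returns the selected indices S and an (n_datasets, |S|) array of weights.
    """
    n_T, n_U = len(kbars), Kuu.shape[0]
    S = []
    weights = [np.zeros(0) for _ in range(n_T)]

    def mmd2(t, w):
        KSS = Kuu[np.ix_(S, S)]
        return consts[t] + w @ KSS @ w - 2.0 * w @ kbars[t][S]

    def loss_and_beta(t, w, c):
        # Pieces of the per-dataset loss after mixing the current coreset with candidate c.
        A = w @ Kuu[np.ix_(S, S)] @ w if S else 0.0
        B = w @ Kuu[S, c] if S else 0.0
        D = w @ kbars[t][S] if S else 0.0
        C, E = Kuu[c, c], kbars[t][c]
        denom = A - 2.0 * B + C
        # beta = alpha/(1+alpha): analytic minimiser of the loss, clipped so the
        # new weight remains a valid proportion (a simplification on our part).
        beta = (A - B - D + E) / denom if denom > 1e-12 else 0.5
        beta = float(np.clip(beta, 1e-6, 1 - 1e-6))
        loss = (0.5 * ((1 - beta) ** 2 * A + 2 * beta * (1 - beta) * B + beta ** 2 * C)
                - (1 - beta) * D - beta * E)
        return loss, beta

    active = set(range(n_T))
    while active and len(S) < max_size:
        best = None
        for c in range(n_U):
            if c in S:
                continue
            total, betas = 0.0, []
            for t in range(n_T):
                loss, beta = loss_and_beta(t, weights[t], c)
                betas.append(beta)
                if t in active:              # only unsatisfied datasets drive selection
                    total += loss
            if best is None or total < best[0]:
                best = (total, c, betas)
        if best is None:                     # no candidates left
            break
        _, c_star, betas = best
        S.append(c_star)
        for t in range(n_T):                 # shrink old weights, append the new one
            weights[t] = np.append(weights[t] * (1 - betas[t]), betas[t])
        active = {t for t in range(n_T) if mmd2(t, weights[t]) > eps2}
    return S, np.vstack(weights)
```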

3.5. Limitations

As discussed in Section 2, if we can bound the MMD between two distributions by $\epsilon$, then for any function $f$ in the unit ball of the Hilbert space $\mathcal{H}$ associated with our kernel, the expectations of $f$ with respect to the two distributions will differ by at most $\epsilon$. However, we have no guarantee for functions that cannot be represented in that Hilbert space. If we use an MMD coreset (dependent or otherwise) to understand the output of a model, and that output cannot be well approximated by the expectation with respect to a function in $\mathcal{H}$, we cannot use performance on the coreset to bound performance on the full dataset. For this reason, we focus on the use of coresets as a qualitative, diagnostic tool for exploring model performance.
Beyond the question of whether functions of interest lie in a Hilbert space, we must also question which Hilbert space. Our choice of kernel will impact the nature of the resulting coresets. If we assume the popular squared exponential kernel, then different lengthscales will cause the algorithm to prioritize capturing variation at different scales. In this work, we have used median heuristics to set the lengthscale [38]; however if we were interested in capturing differences on a specific task, a better approach might be to learn the kernel. An alternative approach would be to use a different integral probability metric in place of the MMD, such as the Wasserstein distance, which has been used to construct (non-dependent) measure coresets [22].
Conversely, a limitation of MMD is that calculating $\widehat{\mathrm{MMD}^2}(X, Q)$ scales cubically with the size of the data. Similarly, calculating the Wasserstein distance is typically computationally expensive, as in general it requires solving a linear programming problem. This limits the scalability of our algorithm; however, since subsampling a dataset yields a valid MMD coreset with high probability [27], our algorithm could be used on samples from larger datasets. Similarly, we could replace our initial datasets with (non-dependent) MMD coresets obtained using an existing algorithm [14,15,16,17], although this would be more expensive than random sampling. In either setting, we would need to incorporate the approximation error of the random sample or coreset into our overall approximation error $\epsilon$.
When working with complex datasets such as images, we often work with lower-dimensional representations or embeddings [39,40,41]—for example, in Section 4, we will use ResNet [42] to generate embeddings for yearbook photos. However, this can make notions of “similarity” opaque, since the representations can capture properties of the image that are not immediately obvious to the viewer, or do not register as important [43]. Concerningly, recent research has suggested that image representations can encode harmful human-like biases [44].
Our algorithm greedily constructs dependent coresets. Recent work on MMD coresets has found that discrepancy-based algorithms, where the full dataset is successively divided based on some discrepancy measure, can obtain smaller ϵ -coresets than greedy methods or random sampling [15,16,17]. Unfortunately, it is not clear how to extend such a partitioning algorithm to the dependent setting; however, these results suggest that it is worth exploring alternative constructions for dependent coresets.

4. Experimental Evaluation

In Section 3.2, we introduced dependent MMD coresets, a summarization technique designed to allow easy comparison between related datasets, and proposed a greedy algorithm to construct dependent MMD coresets in Section 3.4. We also described, in Section 3.3, how dependent MMD coresets can be used to understand performance of models and algorithms, particularly in the context of generalization to new datasets.
In this section, we will empirically evaluate the performance of our algorithm in Section 4.1. Previous greedy algorithms for weighted MMD coresets (without dependence) proceed by first selecting a new exemplar, and then updating weights once the exemplar has been added to the coreset. While such an approach could be adapted to the dependent setting, we show that our algorithm (Algorithm 1), which pre-selects weights based on a single calculation, achieves comparable coresets with lower computational cost.
After evaluating the algorithm used to select the coresets, we will go on to explore the coresets themselves, in Section 4.2. We begin by showing how, when comparing two datasets, the shared support offered by dependent MMD coresets allows for easier comparison than two standard MMD coresets. We then go on to show, in an example comparing 12 related datasets, that dependent MMD coresets can allow us to capture trends and similarities in an interpretable manner.
In Section 4.3, we turn our attention to coresets for model understanding. Here, we simulate a scenario where we wish to deploy algorithms trained on one dataset to a slightly different dataset. By looking at performance on exemplars that are highly weighted in the second dataset, but not the first, we can obtain qualitative insights on the generalization properties of the algorithms. Adding evaluation on criticisms of the dependent MMD coreset leads to a deeper understanding of the model behavior.

4.1. Evaluation of Dependent MMD Coreset Algorithm

In Section 3.4, we proposed a greedy algorithm for selecting dependent MMD coresets (Algorithm 1, which we will denote dmmd). This algorithm selects weights (one for each dataset) for each candidate data point, and then greedily selects a data point and its associated weights. Since dependent coresets are introduced in this work, there is no direct comparison algorithm; however, a natural alternative would have been to adapt protodash, an existing greedy algorithm for weighted MMD coresets, to the dependent setting. Such an approach differs from Algorithm 1 in that weights are optimized after a candidate has been selected.
Below, we review the protodash algorithm, and introduce two alternative greedy algorithms for dependent MMD coresets: a dependent version of protodash, that selects unweighted candidates then optimizes weights; and a hybrid algorithm that pre-selects weights for candidate points, but further optimizes them after an exemplar has been added to the coreset. We quantitatively compare these variants with Algorithm 1, showing that pre-selecting weights provides comparable coresets to methods that optimize weights, at a much lower computational cost.
The protodash algorithm [13] for weighted MMD coresets greedily selects exemplars that minimize the gradient of the loss in Equation (6) (for a single dataset). Having selected an exemplar to add to the coreset, protodash then uses an optimization procedure to find the weights that minimize $\widehat{\mathrm{MMD}^2}(X, Q^{(m+1)})$. We modify this algorithm for the dependent MMD setting by summing the gradients across all datasets for which the $\epsilon^2$ threshold is not yet satisfied, leading to the dependent protodash algorithm shown in Algorithm 2.
Unlike the dependent version of protodash in Algorithm 2, our algorithm assigns weights before selection, which should encourage adding points that would help some of the marginal coresets, but not others. However, there is no post-exemplar-addition optimization of the weights. Inspired by the post-addition optimization in protodash, we also compare our algorithm with a variant of Algorithm 1 that optimizes the weights after each step—allowing the relative weights of the exemplars to change between each iteration. We will refer to this variant of dmmd with post-exemplar-addition optimization as dmmd-opt.
Algorithm 2. A dependent protodash algorithm
Require: Datasets $\{X_t\}_{t \in \mathcal{T}}$, candidate set $U$, kernel $k(\cdot,\cdot)$, threshold $\epsilon^2 > 0$
$S^{(0)} \leftarrow \emptyset$, $w_t^{(0)} \leftarrow [\,]$ for all $t \in \mathcal{T}$, $D \leftarrow \mathcal{T}$, $m \leftarrow 0$
for all $i \in [n_U]$ do
  $g_i \leftarrow -\sum_{t \in \mathcal{T}} \frac{1}{n_t} \sum_{j=1}^{n_t} k(x_{t,j}, u_i)$
end for
while $D \neq \emptyset$ do
  $i^* \leftarrow \arg\min_{i \in [n_U] \setminus S^{(m)}} g_i$
  $S^{(m+1)} \leftarrow S^{(m)} \cup \{i^*\}$
  for all $t \in \mathcal{T}$ do
    $\{w_{t,i}^{(m+1)}\}_{i \in S^{(m+1)}} \leftarrow \arg\min_{\{w_{t,i}\}_{i \in S^{(m+1)}}} \ell_t\big(\sum_{i \in S^{(m+1)}} w_{t,i} \delta_{u_i}\big)$
    $Q_t^{(m+1)} \leftarrow \sum_{i \in S^{(m+1)}} w_{t,i}^{(m+1)} \delta_{u_i}$
  end for
  for all $i \in [n_U] \setminus S^{(m+1)}$ do
    $g_i \leftarrow \sum_{t \in D} \nabla_{w_{t,i}}\, \ell_t\big(Q_t^{(m+1)}\big)$
  end for
  $D \leftarrow \{X_t : \widehat{\mathrm{MMD}^2}(X_t, Q_t^{(m+1)}) > \epsilon^2\}$
  $m \leftarrow m + 1$
end while
We evaluate all three methods using a dataset of photographs of 15,367 female-identified students, taken from yearbooks between 1905 and 2013 [45]. We show a random subset of these images in Figure 3. We generated 512-dimensional embeddings of the photos using the torchvision pre-trained implementation of ResNet [42,46]. We then partitioned the collection into 12 datasets, each containing photos from a single decade.
In order to capture lengthscales appropriate for the variation in each decade, we use an additive kernel, setting
$$K = \frac{K_{\mathrm{all}} + \sum_{t \in \mathcal{T}} K_t}{|\mathcal{T}| + 1} ,$$
where $K_{\mathrm{all}}$ is a squared exponential kernel with bandwidth given by the overall median pairwise distance; $\mathcal{T}$ is the set of decades that index the datasets; and $K_t$ is a squared exponential kernel with bandwidth given by the median pairwise distance between images in dataset $t$.
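A sketch of this additive kernel, using SciPy's distance utilities and the median heuristic for each bandwidth, is shown below; the exact scaling of the squared exponential (squared distance divided by $2\ell^2$) and the helper names are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def sq_exp(A, B, lengthscale):
    """Squared exponential kernel between the rows of A and B with the given lengthscale."""
    return np.exp(-cdist(A, B, 'sqeuclidean') / (2 * lengthscale ** 2))

def additive_decade_kernel(A, B, datasets):
    """K = (K_all + sum_t K_t) / (|T| + 1), with one median-heuristic bandwidth
    from the pooled embeddings and one per dataset."""
    K = sq_exp(A, B, np.median(pdist(np.vstack(datasets))))
    for X_t in datasets:
        K += sq_exp(A, B, np.median(pdist(X_t)))
    return K / (len(datasets) + 1)
```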
We begin by considering how good a dependent MMD coreset each algorithm is able to construct, for a given number of exemplars $m = |S|$. To do so, we ran all algorithms without specifying a threshold $\epsilon^2$, recording $\widehat{\mathrm{MMD}^2}(X_t, Q_t^{(m)})$ for each value of $m$. All algorithms were run for one hour on a 2019 MacBook Pro (2.6 GHz 6-core Intel Core i7, 32 GB 2667 MHz DDR4), excluding the time taken to generate and store the kernel entries, which is done only once. As much code as possible was re-used between the three algorithms. Where required, optimization of weights was carried out using a BFGS optimizer. Code is available at https://github.com/sinead/dmmd (accessed on 22 September 2021).
Figure 4a shows the per-dataset estimates MMD 2 ^ ( X t , Q t ( m ) ) , and Figure 4b shows the average performance across all 12 datasets. We see that the three algorithms perform comparably in terms of coreset quality. dmmd-opt seems to perform slightly better than dmmd, as might be expected due to the additional optimization step. protodash, by comparison, seems to perform slightly worse, which we hypothesise is because it has no mechanism by which weights can be incorporated at selection time. However, in both cases, the difference is slight.
dmmd is, however, much faster at generating coresets, since it does not optimize the full set of weights at each iteration. This can be seen in Figure 5, which shows the time taken to generate coresets of a given size. The cost of the optimization-based algorithms grows rapidly with coreset size $m$; the rate of growth for dmmd is much smaller.
In practice, rather than endlessly minimizing $\sum_{t \in \mathcal{T}} \widehat{\mathrm{MMD}^2}(X_t, Q_t)$, we will aim to find $Q_t$ such that $\widehat{\mathrm{MMD}^2}(X_t, Q_t) < \epsilon^2$ for all $t \in \mathcal{T}$. In Figure 6, we show the coreset sizes required to obtain an $\epsilon$-MMD dependent coreset on the twelve decade-specific yearbook datasets, for each algorithm. Again, a maximum runtime of one hour was specified. When all three algorithms were able to finish, the coreset sizes are comparable (with dmmd-opt finding slightly smaller coresets than dmmd, and protodash finding slightly larger coresets). However, the optimization-based methods are hampered by their slow runtime.
Based on these analyses, it appears there is some advantage to additional optimization of the weights. However, in most cases, we do not feel the improved performance merits the additional computational cost.

4.2. Interpretable Data Summarizations

Summarizations of datasets can allow us to quickly understand properties of their distributions, and allow us to convey such properties to others, for example in a document explaining the data and its provenance [47,48]. In high-dimensional, highly structured datasets such as collections of images, traditional summary statistics such as the mean of a dataset are particularly uninterpretable, as they convey little of the shape of the underlying distribution. A better approach is to show the viewer a collection of images that are representative of the dataset. MMD coresets allow us to obtain such a representative set, making them a better choice than displaying a random subset.
As we discussed in Section 3.1, if we wish to summarize a collection of related datasets, independently generated MMD coresets can help us understand each dataset individually, but it may prove challenging to compare datasets. This challenge becomes greater in high dimensional settings such as image data, where we cannot easily intuit a distance between exemplars. To showcase this phenomenon, and demonstrate how dependent MMD coresets can help, we return to the yearbook photos introduced in Section 4.1. For all experiments in this section, we use the additive kernel described in Section 4.1.

4.2.1. A Shared Support Allows for Easier Comparison of Datasets

In Section 3.2, we argued that the shared support provided by dependent MMD coresets facilitates comparison of datasets, since we only need to consider differences in weights. To demonstrate this, we constructed four datasets, each a subset of the entire yearbook dataset containing 250 photos. The first two datasets contained only faces from the 1990s; the second two, only faces from the 2000s. The datasets were generated by sampling without replacement from the associated decades, to ensure no photo appeared more than once across the four datasets. Our goal is to provide a visual way to compare these four datasets.
We begin by independently generating (non-dependent) MMD coresets for the four datasets, using Algorithm 1 independently on each dataset, with a threshold of ϵ 2 = 0.01 . The set of candidate images, U, was the entire dataset of 15,367 images. The resulting coresets are shown in Figure 7; the areas of the bubbles correspond to the weights associated with each exemplar (The top row of Figure 2 duplicates Figure 7).
We can see that, considered individually, each coreset appears to be doing a good job of capturing the variation in students for each dataset. However, if we compare the four coresets, it is not easy to tell that Figure 7a,b represent the same underlying distribution, and Figure 7c,d represent a second underlying distribution—or to interpret the difference between the two distributions. We see that the highest weighted exemplar for the two 2000s datasets is the same (top left of Figure 7c,d), but only one other image is shared between the two coresets. Meanwhile, the first coreset for the 1990s shares the same highest-weighted image with the two 2000s datasets—but this image does not appear in the second 1990s coreset, and the two 1990s coresets have no overlap. Overall, it is hard to compare between the marginal coresets.
By contrast, the shared support offered by dependent coresets means we can directly compare the distributions using their coresets. In Figure 8, we show a dependent MMD coreset ($\epsilon^2 = 0.01$) for the same collection of datasets. The shared support allows us to see that, while the two decades are fairly similar, there is clearly a stronger similarity between the pairs of datasets from the same decade (i.e., their exemplars have similar sizes in the bubble plots) than between pairs from different decades. We can also identify images that exemplify the difference between the two decades, by looking at the difference in weights. We see that many of the faces towards the top of the bubble plot have high weights in the 2000s, but low weights in the 1990s. Examining these exemplars suggests that straight hair became more prevalent in the 2000s. Conversely, many of the faces towards the bottom of the bubble plot have high weights in the 1990s, but low weights in the 2000s. These photos tend to have wavy/fluffy hair and bangs. In conjunction, these plots suggest a tendency in the 2000s away from bangs and towards straight hair, something the authors remember from their formative years. However, there is still a significant overlap between the two decades: many of the exemplars have similar weights in the 1990s and the 2000s.
We can also see this in Figure 9, a bar chart showing the average weights associated with each exemplar in each decade (i.e., the blue bar above a given image is the average weight for that exemplar across the two 1990s datasets, and the red bar is the average weight across the two 2000s datasets). We see that most of the exemplars have similar weights in both decades, but that we have a number of straight-haired exemplars disproportionately representing the 2000s, and a number of exemplars with bangs and/or wavy hair disproportionately representing the 1990s. These insights would have been hard to intuit from the standard MMD coresets, where it is hard to identify what variation is due to true underlying differences in the dataset, and what is due to sampling error.

4.2.2. Dependent Coresets Allow Us to Visualize Data Drift in Collections of Images

Next, we show how dependent MMD coresets can be used to understand and visualize variation between collections of multiple datasets. As in Section 4.1, we partition the 15,367 yearbook images into twelve datasets based on their decade, with the goal of understanding how the distribution over yearbook photos changes over time. Table 3 shows the number of photos in each resulting dataset.
Figure 10 shows the exemplars in the resulting dependent MMD coreset, with a threshold of $\epsilon^2 = 0.01$. The corresponding plots show how the weights vary with time. The exemplars are ordered based on their average weight across the 12 datasets. In each case, a red, vertical line indicates the year of the yearbook from which the exemplar was taken. We are able to see how styles change over time, moving away from the formal styles of the early 20th century, through waved hairstyles popular in the midcentury, towards longer, straighter hairstyles in later decades. In general, the relevance of an exemplar peaks around the time it was taken (although this information is not used to select exemplars). However, some styles remain relevant over longer time periods (see many exemplars in the first column). Most of the early exemplars are highly peaked on the 1900s or 1910s; this is not surprising, since these pre-WWI photos tend to have very distinctive photography characteristics and hair styles. Note that we do not include a comparison to standard, independent MMD coresets, as it would not be possible to produce an analogous set of plots—the exemplars in each decade’s coreset would, in general, not overlap.
Figure 10 appears to show that the marginal coresets have high weights on exemplars from the corresponding decade. To look at this in more detail, we consider the distributions over the dates of the exemplars associated with each decade. Figure 11 shows the weighted mean and standard deviation of the years associated with the exemplars, with weights given by the coreset weights. We see that the mean weighted year of the exemplars increases with the decade. However, we notice that it is pulled towards the 1940s and 1950s in each case: this is because we must represent all datasets using a weighted combination of points taken from the convex hull of all datapoints.

4.3. Dependent Coresets Allow Us to Understand Model Generalization

To see how dependent coresets can be used to understand how a model trained on one dataset will generalize to others, we simulate a scenario where we wish to deploy a machine learning model on a given dataset, but where the model was trained on a different dataset. In this scenario, we are interested in learning whether the model generalizes well to the new dataset.
We generate two datasets—one to represent the training data, and one to represent the data used in deployment—by partitioning a collection of image digits. We started with the USPS handwritten digits dataset [49], which comprises a train set of 7291 handwritten digits and a test set of 2007 handwritten digits. We split the train set into two datasets, $X_a$ and $X_b$, where $X_a$ is skewed towards the earlier digits and $X_b$ towards the later digits. Figure 12 shows the resulting label counts for each dataset: we see there is a clear distributional imbalance. Note that, in general, we will not have such a concise summary of the difference between two datasets; however, using image digits as our example allows us to get an idea of the “ground truth” difference between the two datasets.
We selected three classification algorithms to assess generalization performance. We chose classification algorithms because it is easy for us to obtain “ground truth” generalization performance by applying these algorithms to our second dataset X b , allowing us to compare our insights with the true generalization performance. In general, we may not be able to easily estimate generalization performance in this manner: we may have unlabeled data, or our task may not be easily qualitatively evaluated (e.g., evaluating quality of auto-generated captions); we expect our approach to have greatest utility in such scenarios.
We trained three classifiers—a decision tree with maximum depth of 8, a random forest with 100 trees, and a multilayer perceptron (MLP) with a single hidden layer with 100 units—on $X_a$ and the corresponding labels. In each case, we used the implementation in scikit-learn [50], with parameters chosen to have similar train set accuracy on $X_a$. These three models were chosen to have varying generalization accuracy. Table 4 shows the associated classification accuracies on the datasets $X_a$ and $X_b$, which we will use as a quantitative representation of the algorithms’ generalization performance on $X_b$. We also show confusion matrices in Figure 13 and Figure 14. We see that all three algorithms perform comparably on $X_a$, the dataset on which they were trained. However, when applied to $X_b$, we see in Figure 14 that the decision tree struggles in classifying 8s and 9s, and that the random forest struggles with 9s.
We begin our analysis by generating a dependent MMD coreset for the two datasets, with $\epsilon^2 = 0.005$. To ensure the exemplars in our coreset have not been seen in training, we let our set of candidate points $U$ be the union of $X_b$ and the USPS test set. As with the yearbook data, we use an additive squared exponential kernel, with bandwidths of the composite kernels being the median within-class pairwise distances, and the overall median pairwise distance. Distances were calculated using the raw pixel values. Figure 15 shows the resulting dependent MMD coreset, with the bars showing the weights $w_{a,i}$, $w_{b,i}$ associated with the two datasets, and the images below the x axis showing the corresponding images $u_i$. In Figure 16, the $u_i$ and $w_i$ have been grouped by number, so that if $y(u)$ is the label of image $u$, the $j$th bar for $Q_a$ has weight $\sum_{i \in S : y(u_i) = j} w_{a,i}$.
We can see that the coreset has selected points that cover the spread of the overall dataset. However, looking at Figure 16, we see that the weights assigned to these exemplars in Q a and Q b mirror the relative frequencies of each digit in the corresponding datasets X a and X b (Figure 12).
We then considered all points $u_i$ in our dependent coreset $(Q_a, Q_b)$ where $f_i = w_{b,i} / w_{a,i} > 2$—i.e., points that are much more representative of $X_b$ than $X_a$. We then looked at the class probabilities of the three algorithms on each of these points, as shown in Figure 17. We see that the decision tree misclassifies nine of the 21 exemplars, and is frequently highly confident in its misclassification. The random forest misclassifies three examples, and the MLP misclassifies two. This agrees with the ordering provided by empirically evaluating generalization in Table 4—the MLP generalizes best, and the decision tree worst. As suggested by our confusion matrices in Figure 14, we see that all algorithms generalize worst to the numbers 8 and 9—this is to be expected, since these digits are most under-represented in $X_a$. The decision tree in particular appears to fail on these digits, mirroring the quantitative results in the confusion matrix.
For comparison, in Figure 18 we show the points where $f_i < 0.5$—i.e., points that are much more representative of $X_a$ than $X_b$. Note that, since our candidate set did not include any members of $X_a$, none of these points were in our training set. Despite this, the accuracy is high, and fairly consistent between the three classifiers (the decision tree misclassifies two exemplars; the other two algorithms make no errors).
The dependent coreset only provides information about performance on “representative” members of $X_b$. Since classifiers will tend to underperform on outliers, looking only at the dependent MMD coreset does not give us a full picture of the expected performance. We can augment our dependent MMD coreset with criticisms—points that are poorly described by the dependent coreset. Figure 19 shows the performance of the three algorithms on a size-20 set of criticisms for $X_b$. Note that, overall, accuracy is lower than for the coreset—unsurprising, since these are outliers. However, as before, we see that the decision tree performs worst on these criticisms (nine mis-classifications), with the other two algorithms performing slightly better (six mis-classifications for the random forest, and seven for the MLP).
Note that, since accuracy does not correspond to a function in an RKHS, we cannot expect to use the coreset to bound the expected accuracy of an algorithm on the full dataset. Indeed, while the coresets and critics correctly suggest that the decision tree generalizes poorly, they do not give conclusive evidence on the relative generalization abilities of the other two algorithms. However, they do highlight what sort of data points are likely to be poorly modeled under each algorithm. By providing a qualitative assessment of performance modalities and failure modes on either typical points for a dataset, or points that are disproportionately representative of a dataset (vs. the original training set), dependent MMD coresets allow users to identify potential generalization concerns for further exploration.

4.4. Discussion

MMD coresets have already proven to be a useful tool for summarizing datasets and understanding models. However, as we have shown in Section 4.2, their interpretability wanes when used to compare related datasets. Dependent MMD coresets provide a tool to jointly model multiple datasets using a shared set of exemplars. This shared set of exemplars makes it easy to compare two datasets, providing an interpretable summary not just of each dataset in isolation, but also of the difference between datasets. As such, we believe they will prove useful in understanding related datasets, and summarizing such collections of datasets.
In addition to facilitating understanding of data, we have also shown that dependent MMD coresets can be used to better understand model performance. By considering the weights associated with two different datasets, we can identify areas of domain mis-match. By exploring performance of algorithms on such points, we can glean insights about the ability of a model to generalize to new datasets.
In principle, dependent MMD coresets can be applied to any number of datasets. However, as we discuss in Section 3.5, the computational cost of our algorithm will scale cubically in the number of datapoints in the union of the datasets. This cost can be reduced by representing each dataset with an independent coreset, either obtained by subsampling the data or by applying a coreset selection algorithm such as [14,15,16,17]; however, the approximation error of this coreset would need to be incorporated into the overall approximation error ϵ .
An alternative approach might be to develop streaming algorithms for constructing dependent MMD coresets. In the non-dependent setting, streaming algorithms such as [16] allow us to construct a coreset in an online manner, at a lower computational cost than batch algorithms. Such an approach would be particularly appealing in the case of time-stamped data, since it would allow us to update our dependent MMD coreset to include a new dataset.
Dependent MMD coresets are just one example of a dependent coreset that could be constructed using this framework. Future directions include exploring dependent analogues of other measure coresets [22].

Author Contributions

Conceptualization, S.A.W. and J.H.; methodology, software, and experiments, S.A.W.; data curation: S.A.W. and J.H.; writing and visualization: S.A.W. and J.H. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. Part of the work was completed while S.A.W. was employed by CognitiveScale.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Datasets and code available at https://github.com/sinead/dmmd (accessed on 22 September 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Larrazabal, A.J.; Nieto, N.; Peterson, V.; Milone, D.H.; Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl. Acad. Sci. USA 2020, 117, 12592–12594.
2. Chen, I.Y.; Johansson, F.D.; Sontag, D. Why is my classifier discriminatory? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 3543–3554.
3. Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 77–91.
4. Shankar, S.; Halpern, Y.; Breck, E.; Atwood, J.; Wilson, J.; Sculley, D. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv 2017, arXiv:1711.08536.
5. Alexander, R.G.; Schmidt, J.; Zelinsky, G.J. Are summary statistics enough? Evidence for the importance of shape in guiding visual search. Vis. Cogn. 2014, 22, 595–609.
6. Lauer, T.; Cornelissen, T.H.; Draschkow, D.; Willenbockel, V.; Võ, M.L.H. The role of scene summary statistics in object recognition. Sci. Rep. 2018, 8, 14666.
7. Kaufmann, L.; Rousseeuw, P. Clustering by means of medoids. In Statistical Data Analysis Based on the L1-Norm and Related Methods; Springer: Berlin/Heidelberg, Germany, 1987; pp. 405–416.
8. Bien, J.; Tibshirani, R. Prototype selection for interpretable classification. Ann. Appl. Stat. 2011, 5, 2403–2424.
9. Mak, S.; Joseph, V.R. Projected support points: A new method for high-dimensional data reduction. arXiv 2017, arXiv:1708.06897.
10. Mak, S.; Joseph, V.R. Support points. Ann. Stat. 2018, 46, 2562–2592.
11. Kim, B.; Khanna, R.; Koyejo, O.O. Examples are not enough, learn to criticize! Criticism for interpretability. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2280–2288.
12. Wilson, D.R.; Martinez, T.R. Reduction techniques for instance-based learning algorithms. Mach. Learn. 2000, 38, 257–286.
13. Gurumoorthy, K.S.; Dhurandhar, A.; Cecchi, G.; Aggarwal, C. Efficient data representation by selecting prototypes with importance weights. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 260–269.
14. Chen, Y.; Welling, M.; Smola, A. Super-samples from kernel herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 8–11 July 2010; pp. 109–116.
15. Phillips, J.M.; Tai, W.M. Near-optimal coresets of kernel density estimates. Discret. Comput. Geom. 2020, 63, 867–887.
16. Karnin, Z.; Liberty, E. Discrepancy, coresets, and sketches in machine learning. In Proceedings of the 32nd Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 1975–1993.
17. Tai, W.M. Optimal Coreset for Gaussian Kernel Density Estimation. arXiv 2021, arXiv:2007.08031.
18. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
19. Pratt, K.B.; Tschapek, G. Visualizing concept drift. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 735–740.
20. Hohman, F.; Wongsuphasawat, K.; Kery, M.B.; Patel, K. Understanding and visualizing data iteration in machine learning. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–13.
21. Agarwal, P.K.; Har-Peled, S.; Varadarajan, K.R. Approximating extent measures of points. J. ACM 2004, 51, 606–635.
22. Claici, S.; Solomon, J. Wasserstein coresets for Lipschitz costs. Stat 2018, 1050, 18.
23. Müller, A. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 1997, 29, 429–443.
24. Bach, F.; Lacoste-Julien, S.; Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012.
25. Lacoste-Julien, S.; Lindsten, F.; Bach, F. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 544–552.
26. Phillips, J.M. ε-samples for kernels. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 6–8 January 2013; pp. 1622–1632.
27. Lopez-Paz, D.; Muandet, K.; Schölkopf, B.; Tolstikhin, I. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1452–1461.
28. Feldman, D. Introduction to core-sets: An updated survey. arXiv 2020, arXiv:2011.09384.
29. Potter, M.C.; Wyble, B.; Hagmann, C.E.; McCourt, E.S. Detecting meaning in RSVP at 13 ms per picture. Atten. Percept. Psychophys. 2014, 76, 270–279.
30. Zheng, Y.; Ou, Y.; Lex, A.; Phillips, J.M. Visualization of big spatial data using coresets for kernel density estimates. In Proceedings of the IEEE Visualization in Data Science (VDS), Phoenix, AZ, USA, 1 October 2017; pp. 23–30.
31. Kim, B.; Rudin, C.; Shah, J.A. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1952–1960.
32. Aamodt, A.; Plaza, E. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Commun. 1994, 7, 39–59.
33. Murdock, J.W.; Aha, D.W.; Breslow, L.A. Assessing elaborated hypotheses: An interpretive case-based reasoning approach. In Case-Based Reasoning Research and Development, Proceedings of the 5th International Conference on Case-Based Reasoning, Trondheim, Norway, 23–26 June 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 332–346.
34. MacEachern, S.N. Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science; American Statistical Association: Alexandria, VA, USA, 1999; Volume 1, pp. 50–55.
35. Quintana, F.A.; Mueller, P.; Jara, A.; MacEachern, S.N. The dependent Dirichlet process and related models. arXiv 2020, arXiv:2007.06129.
36. De Iorio, M.; Müller, P.; Rosner, G.L.; MacEachern, S.N. An ANOVA model for dependent random measures. J. Am. Stat. Assoc. 2004, 99, 205–215.
37. Dubey, A.; Hefny, A.; Williamson, S.; Xing, E.P. A nonparametric mixture model for topic modeling over time. In Proceedings of the 13th SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; pp. 530–538.
38. Garreau, D.; Jitkrittum, W.; Kanagawa, M. Large sample analysis of the median heuristic. arXiv 2017, arXiv:1707.07269.
39. Kiela, D.; Bottou, L. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 36–45.
40. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the International Conference on Machine Learning, Online, 13–18 July 2020; pp. 1691–1703.
41. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 1597–1607.
42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
43. Athalye, A.; Engstrom, L.; Ilyas, A.; Kwok, K. Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 284–293.
44. Steed, R.; Caliskan, A. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 4th Conference on Fairness, Accountability, and Transparency, Online, 3–10 March 2021; pp. 701–713.
45. Ginosar, S.; Rakelly, K.; Sachs, S.; Yin, B.; Efros, A.A. A century of portraits: A visual historical record of American high school yearbooks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 1–7.
46. Marcel, S.; Rodriguez, Y. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 1485–1488.
47. Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for datasets. In Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning, Stockholm, Sweden, 13–15 July 2018.
48. Chmielinski, K.S.; Newman, S.; Taylor, M.; Joseph, J.; Thomas, K.; Yurkofsky, J.; Qiu, Y.C. The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. In Proceedings of the NeurIPS 2020 Workshop on Dataset Curation and Security, Online, 11 December 2020.
49. Hull, J.J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 550–554.
50. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Figure 1. (a) Three equally weighted Gaussians (lines show 1, 2, 3 standard deviations of each component). (b,c) Independently sampled datasets from the mixture of three Gaussians. (d,e) MMD coresets for the three-Gaussian datasets.
Figure 2. Independently learned MMD coresets for two randomly selected datasets of 250 female-identified photographs from US yearbooks in the 1990s. The area of each bubble is proportional to the weight of the corresponding exemplar.
Figure 3. A random subset of 100 images taken from the yearbook dataset.
Figure 4. Evaluating how coreset quality varies with the number of exemplars, for dependent MMD coresets generated using three algorithms, on 12 yearbook datasets.
Figure 5. Time (in seconds) taken to construct dependent MMD coresets of a given size, for three algorithms, on the 12 yearbook datasets. Algorithms ran for a maximum of one hour.
Figure 6. Coreset size required to obtain an ϵ-MMD dependent coreset on the 12 yearbook datasets, for three algorithms. Algorithms ran for a maximum of one hour; protodash failed to complete coresets for ϵ² = 0.01 and ϵ² = 0.005, and dmmd-opt failed to complete a coreset for ϵ² = 0.005.
Figure 7. Independently learned, weighted 0.1-MMD coresets based on 250 random samples from a given decade. The area of each bubble is proportional to the weight of the corresponding exemplar in the coreset.
Figure 8. Dependent 0.1-MMD coreset for a collection of 4 datasets, each including 250 random samples from a given decade. The area of each bubble is proportional to the weight of the corresponding exemplar in the marginal coreset. Positioning is constant across all four examples.
Figure 9. Summary of a dependent 0.1-MMD coreset for four datasets of yearbook faces from the 1990s and 2000s. Exemplars are shown along the x axis. The average weight for each exemplar in the coresets associated with the 1990s is shown in blue; the average weight for the 2000s is shown in red.
Figure 10. Visualization of a 0.1-MMD dependent coreset for 12 datasets, each containing yearbook photos from a given decade. Photos show the exemplars {u_i : i ∈ S}, ordered by their average weight across the 12 marginal coresets. To the left of each photo is a plot of the corresponding weight over time; a red vertical line marks the year the photo was taken.
Figure 11. Distribution over the year associated with the marginal coresets for each decade. Plot shows weighted mean ± weighted standard deviation.
Figure 12. Frequency with which each digit occurs in two datasets of handwritten digits.
Figure 13. Confusion matrices on X_a, for three classification algorithms trained on X_a.
Figure 14. Confusion matrices on X_b, for three classification algorithms trained on X_a.
Figure 15. Dependent MMD coreset for two datasets of handwritten digits. Bars show the weight in each marginal coreset, with X_a shown in blue and X_b shown in red; images along the axis show the corresponding exemplars.
Figure 16. Summary of a dependent MMD coreset for two datasets of handwritten digits. The weights and exemplars from Figure 15 have been combined based on their label. Weights for X_a are shown in blue and weights for X_b are shown in red.
Figure 17. Exemplars over-represented in Q_b, with class probabilities under three algorithms trained on X_a. The true class is shown in blue; where the highest-probability class differs from the true class, the highest-probability class is shown in red.
Figure 18. Exemplars over-represented in Q_a, with class probabilities under three algorithms trained on X_a. The true class is shown in blue; where the highest-probability class differs from the true class, the highest-probability class is shown in red.
Figure 19. Criticisms of Q_b from the dataset X_b, with class probabilities under three algorithms trained on X_a. The true class is shown in blue; where the highest-probability class differs from the true class, the highest-probability class is shown in red.
Table 1. Notation used in this paper.
T : set that indexes datasets and associated measures
X_t = (x_{t,1}, …, x_{t,n_t}) ∈ X^{n_t} : a dataset indexed by t ∈ T
P_t : true distribution at t ∈ T, with X_t ∼ P_t
U = (u_1, …, u_{n_u}) : set of candidate locations
δ_u : Dirac measure (i.e., point mass) at u
Q_t : a probability measure used to approximate P_t, of the form ∑_{i∈S} w_{t,i} δ_{u_i}, where S ⊆ [n_u]
Table 2. Some examples of integral probability metrics.
Distance : Function class F
1-Wasserstein distance : { f : ||f||_L ≤ 1 }
Maximum mean discrepancy : { f : ||f||_H ≤ 1 } for some RKHS H
Total variation : { f : ||f||_∞ ≤ 1 }
Table 3. Number of yearbook photos for each decade.
Decade:  1900s  1910s  1920s  1930s  1940s  1950s  1960s  1970s  1980s  1990s  2000s  2010s
Photos:   359    830    816    822    650   2093   2319   2806   2826   2621   2208   602
Table 4. Accuracies of three classification algorithms on datasets X_a and X_b. All algorithms were trained on X_a.
Model           Accuracy on X_a   Accuracy on X_b
MLP             0.9998            0.8531
Random Forest   1.0               0.7129
Decision Tree   0.9585            0.5880
