1. Introduction
Dataset distillation, also known as dataset condensation, refers to the process of generating a small, highly informative subset from a large dataset such that a model trained on this subset achieves predictive performance comparable to that of a model trained on the original, full dataset [1,2,3]. Based on the traditional taxonomy [2], the distilled examples are synthetically generated to better capture the characteristics of the full dataset. A similar setup, in which the resulting instances are selected from the original dataset as key representatives, has been named instance selection [4,5] or instance reduction [6,7]. Recent studies [3,8] use distillation as an umbrella term for all these methods, which reduce the original set size while preserving as much of the performance as possible, and we follow the same framework.
In terms of objectives and applications, dataset distillation helps reduce data storage requirements and can alleviate privacy and copyright concerns associated with maintaining and repeatedly using large volumes of raw data. Additionally, the smaller dataset size reduces the computational cost of model training, both in terms of runtime and memory usage, which is particularly beneficial when multiple models must be trained on the same dataset. These advantages enable a range of practical applications.
For example, in continual learning, where new tasks are learned sequentially while preserving knowledge of previous tasks, a “replay buffer” containing data from older tasks is often used to prevent catastrophic forgetting [9]. Dataset distillation can significantly reduce the memory requirements of such replay buffers, allowing models to learn a larger number of tasks without forgetting [10,11].
While dataset distillation has been extensively studied for image and prompt datasets [3,12], its application to other data modalities remains limited. In particular, tabular data distillation has received less attention, despite the fact that many real-world machine learning problems and applications rely on tabular data. This paper aims to fill this gap.
This paper makes the following contributions: (1) It presents an extensive comparison, with respect to downstream prediction, among unsupervised distillation methods for tabular data. (2) It identifies, surprisingly, the coreset methods as the strongest competitors, regardless of the problem type (classification or regression), learner, or database used. (3) It identifies the Coreset Leverage Score method as offering the best compromise between performance, duration, and numerical robustness, with G-Coreset (i.e., based on the Gonzalez algorithm) as a close competitor. (4) It identifies a moderate correlation between the expected distillation performance and a database tailness measure. (5) It emphasizes that distillation changes the database consistency, narrowing the range of hyperparameter values that yield good performance.
The remainder of the paper is organized as follows: Section 2 reviews previous studies on data distillation in general and on tabular data distillation in particular. In Section 3, we present the experimental chain, including the learners, although most of the description is dedicated to the distillation methods used. The datasets used are detailed in Section 5. Since the experimentation produced large amounts of raw results, we present those extensively in the Supplementary Materials and only integrative summaries in Section 6. We discuss the limitations of the proposed method in Section 8. The paper ends with a conclusion in Section 9.
2. Related Work
The term “dataset distillation” was coined by Wang et al. [2] and defines algorithms that “take as input a large real dataset to be distilled (training set), and outputs a small synthetic distilled dataset”. At the same time, a concurrent term, “dataset condensation”, was introduced by Zhao et al. [13] and it defines a method for “training set synthesis for data-efficient learning, that learns to condense large dataset into a small set of informative synthetic samples”. Thus, the two techniques overlap in purpose and procedure.
The concept of making the training set smaller while keeping the same or most of the performance is older. As mentioned in the Introduction, instance selection [4,5] and instance reduction [6,7] were introduced as early as 2015. Yet in this paper, we refer to an even older algorithm, by Gonzalez [14], which was introduced for clustering in 1985.
However, significant interest in compressing the training set arose with the very large data collections associated with modern deep learning. Methods in that direction mostly work with image databases; a survey of these contributions can be found in the work of Yu et al. [3], and, for more general datasets (i.e., that include text data), in [15]. However, these studies address trends and concepts in general, without any focus on tabular data distillation.
Tabular data distillation has been examined in specific studies. For instance, Xu et al. [16] proposed a technique based on generative adversarial networks and one based on variational autoencoders. Zhao et al. [8] developed a method based on distribution matching where the distribution is estimated via representation in neural networks. Kang et al. [17] worked on the same idea, namely column embedding-based representation learning, but made the explicit assumption of an encoder’s existence. While these studies promote the proposed solutions and include comparisons against baselines, these comparisons are restricted to a few datasets and a few distillation methods.
In this work, we identify an existing gap in broad comparisons of distillation methods for tabular datasets and aim to fill it. To this end, we compare seven distillation techniques across 17 classification datasets and nine regression datasets, using three off-the-shelf learners and four different reduction ratios for the distilled datasets.
3. Methodology and Learners
3.1. Overview of Approach
The general methodology (also shown in Figure 1) comprises the following key steps:
1. Select a set of relevant tabular sets.
2. Train the learners and perform the hyperparameter search.
3. Select the distillation methods and apply them to the tabular sets.
4. Run the learners over the distilled/condensed datasets, again with a hyperparameter search.
5. Compare the performance on the distilled set with the performance on the original set (baseline).
For the learning models, in all cases (distilled and original sets), training represents a single step within a broader exhaustive hyperparameter search. Each investigated hyperparameter configuration defines one complete training–testing cycle. In this study, we focus on Support Vector Machines (SVMs), Random Forests (RFs), and Gradient Boosting Machines, implemented using XGBoost. Details are presented below.
3.2. Formulation
Let the original dataset be represented by $X$, containing $n$ instances in a $d$-dimensional space: $X = \{x_1, x_2, \dots, x_n\}$, where $x_i \in \mathbb{R}^d$. An instance $x_i$ has a label $y_i$ that can be either categorical for classification or continuous for regression problems. The learning problem is to find a model $f$ that, when applied to the instance $x_i$, will produce the prediction $\hat{y}_i = f(x_i)$, which should be as close as possible to $y_i$.
The goal of a distillation method is to find a distilled set $S$ consisting of $k$ instances, where $k \ll n$: $S = \{s_1, s_2, \dots, s_k\}$ with $s_j \in \mathbb{R}^d$. In this context, each $s_j$ acts as a “prototype” or distilled representative of a subset of the original data.
3.3. Learners
This study evaluates three widely used supervised learning algorithms: Random Forest (RF), Support Vector Machine (SVM), and Gradient Boosting Machine (GBM). These models are consistently reported as strong baselines for tabular learning. Earlier large-scale comparisons across many datasets identified RF and SVM among the most competitive methods [18]. More recent studies further show that gradient boosting approaches, particularly implementations such as XGBoost, achieve state-of-the-art performance on tabular data [19].
Recent broad empirical analyses confirm that tree-based methods remain highly competitive for tabular learning. For example, Grinsztajn et al. [20] demonstrate that classical tree ensembles often outperform deep learning approaches on typical tabular datasets, while more recent work [21] provides both empirical and theoretical insights supporting the robustness of non-deep models in this domain. Based on these findings, we focus on RF, SVM, and XGBoost as representative and competitive learners for tabular tasks.
Although ensemble methods such as RF and GBM are naturally robust due to averaging and randomized feature selection [22], careful hyperparameter tuning remains necessary to obtain strong performance. Because optimal hyperparameters are dataset-dependent and cannot be determined a priori [23,24], we employ a grid search to explore combinations of candidate values and select configurations that maximize predictive performance.
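The exhaustive search described above can be illustrated with a minimal sketch. The function names `grid_search` and `toy_score`, and the candidate values, are hypothetical and not taken from the paper; `toy_score` merely stands in for one complete training and testing cycle of a learner.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Exhaustively evaluate every combination in param_grid.

    param_grid: dict mapping a hyperparameter name to a list of candidates.
    score_fn:   callable taking a dict of hyperparameters and returning a
                score (higher is better); it stands in for one train/test cycle.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for a learner (assumption): the score peaks at cost=10, gamma=0.1.
def toy_score(params):
    return -abs(params["cost"] - 10) - abs(params["gamma"] - 0.1)


best_params, best_score = grid_search(
    {"cost": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}, toy_score
)
```

Because each configuration is evaluated independently, the loop body is also a natural unit for parallel execution.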
3.3.1. Random Forest
Random Forest (RF) [25] is a learner that operates on the principles of bagging and feature randomness. As an ensemble method, Random Forest aggregates predictions from multiple decision trees trained on different bootstrap samples.
Each tree in the forest is grown using the CART (Classification and Regression Trees) algorithm. A tree partitions the feature space into disjoint regions. For a given node, the algorithm seeks the best split by minimizing an impurity measure (for classification) or variance (for regression).
For a region $R_m$, the prediction is typically the average or majority value of the training observations $x_i \in R_m$; in the averaging case:
$$\hat{y}_{R_m} = \frac{1}{|R_m|} \sum_{x_i \in R_m} y_i$$
Random Forest introduces two layers of randomness to create a diverse “forest”. The first layer is Bootstrap Sampling: for each tree, a random sample whose size is given by the “in-bag percentage” is drawn with replacement from the original training data. The second layer is Feature Subspacing: at each node of every tree, the algorithm selects a random subset of features and chooses the best split only from this subset.
The forest model F is a collection of B trees. The final prediction in the case of classification is determined by a majority vote. For regression problems, the final prediction is the average of the numerical outputs of all trees.
The key hyperparameters we tuned were the in-bag percentage and the number of features, p, considered at each split; performance is not convex with respect to these two parameters. Although the number of trees can also influence performance, accuracy generally increases monotonically (though not strictly) with the number of trees until saturation.
3.3.2. Gradient Boosting Machine
The Gradient Boosting Machine is implemented using eXtreme Gradient Boosting (XGBoost) [26], an optimized variant of the original Gradient Boosting framework [27]. XGBoost is an ensemble learning method that combines multiple weak learners (typically trees) into a strong predictive model. Each tree is trained to correct the residual errors of the preceding ensemble, a process known as boosting.
Mathematically, XGBoost is an iterative procedure that begins with an initial prediction (commonly zero) and incrementally adds trees to minimize prediction error. This process can be formalized as:
$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i)$$
where $\hat{y}_i$ is the final predicted value for the $i$-th instance, $x_i$, $M$ is the total number of trees in the ensemble, and $f_m(x_i)$ is the prediction of the $m$-th tree.
The objective function in XGBoost consists of two components: a loss function, which measures how well the model fits the training data, and a regularization term, which penalizes model complexity to prevent overfitting. Its general form is:
$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m)$$
where $l(y_i, \hat{y}_i)$ quantifies the error between the true value $y_i$ and the prediction $\hat{y}_i$, and $\Omega(f_m)$ penalizes overly complex trees.
In our case, cross-entropy is used as $l$ for classification and Mean Squared Error for the regression task. The hyperparameters tuned in our implementation were the learning rate, the minimum number of samples required to create a child node, and the subsampling ratio.
3.3.3. Support Vector Machine
Support Vector Machines (SVMs) [28] perform strongly on tabular data, particularly on weak or noisy feature sets. They are among the most robust and mathematically grounded algorithms in machine learning.
SVMs aim to find the maximum-margin hyperplane separating two classes. Given the dataset $\{x_i\}_{i=1}^{n}$ with labels $y_i \in \{-1, +1\}$, the decision function is:
$$f(x) = \operatorname{sign}(w^\top x + b)$$
The optimal hyperplane maximizes the margin between classes, which is equivalent to minimizing the norm of the weight vector:
$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1, \; i = 1, \dots, n$$
For non-separable data, slack variables are introduced and controlled by a regularization parameter $C$ (named cost).
To handle non-linear boundaries, SVM employs the kernel trick, mapping data into a higher-dimensional space through a kernel function:
$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$
Following the recommendations of LibSVM [29], we use the radial basis function (RBF) kernel: $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$. For multi-class classification, we adopt the One-vs-Rest strategy [29].
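For the regression problem, Support Vector Regression (SVR) is used. Instead of keeping points out of a margin, SVR tries to fit as many points as possible within a margin (the $\varepsilon$-insensitive tube). The objective is to find a function $f(x)$ that deviates from the actual targets $y_i$ by no more than $\varepsilon$. The optimization minimizes:
$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{subject to} \quad y_i - f(x_i) \le \varepsilon + \xi_i, \; f(x_i) - y_i \le \varepsilon + \xi_i^*, \; \xi_i, \xi_i^* \ge 0$$
Errors smaller than $\varepsilon$ are ignored, making the model robust to noise and outliers that fall within the tube.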
In our experiments, we used an aggregated one-vs-all formulation with a Gaussian (RBF) kernel. Following the recommendations in the LibSVM documentation [29], the hyperparameters we tuned were the cost parameter (which regulates the trade-off between margin width and classification errors) and the variance of the Gaussian kernel, $\sigma^2$, which controls the curvature of the decision boundary.
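To make the role of the kernel width concrete, the following sketch computes a Gaussian kernel matrix with NumPy. It assumes the kernel is parametrized by its variance `sigma2`; LibSVM itself exposes the equivalent parameter gamma, with gamma = 1 / (2 * sigma2) under this assumption. The function name `rbf_kernel` and the toy data are illustrative only.

```python
import numpy as np


def rbf_kernel(A, B, sigma2=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma2))."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma2))


X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X_demo, X_demo, sigma2=1.0)
```

A small variance makes the kernel decay quickly with distance, which yields a more curved decision boundary; a large variance flattens it.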
4. Distillation Methods
Fundamentally, distillation methods can be divided according to the following criteria:
- Realism
  - Instance selection [4]: Methods that select existing instances.
  - Generator: Methods that generate new, synthetic instances. They follow the standard distillation taxonomy [2].
- Objective
  - Distribution matching: The purpose is to produce examples that match the original population statistics. Statistics can be evaluated by:
    - Low-order moments, such as mean and variance;
    - Low-order moments and outliers.
    Actual methods here include:
    - Moments matching: mean, covariance, skewness, kurtosis.
    - Gaussian Mixture Models (GMMs).
    - Copula-based synthetic data generators.
    - Probability distribution fitting.
  - Gradient matching: These methods create a small synthetic dataset whose gradients (with respect to a model) match those produced by the real dataset. However, they depend on the learning model and are thus not used in this study.
  - Label-preserving condensation: It focuses on preserving the label distribution rather than the data distribution.
We aim to evaluate learner-independent, label-independent methods. Thus, we restricted our evaluation to a limited set of methods, detailed in the next subsections.
4.1. K-Means
As a distillation method, K-means [30,31] may be seen as a distribution matching method where the distilled set fits the means of clusters from the original set. Formally, the K-means algorithm seeks to partition the original $n$ observations into $k$ sets $C = \{C_1, \dots, C_k\}$ so as to minimize the within-cluster sum of squares.
The distilled instances (centroids), $s_1, \dots, s_k$, are defined by the following optimization:
$$\min_{C, \, s_1, \dots, s_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - s_j\|^2$$
The distilled instance $s_j$ is calculated as the mean of the points assigned to that cluster:
$$s_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$
Naturally, the mean is unlikely to be a point existing initially in the set, $X$. Thus, as an alternative to K-means, one might use K-Medoids [32]. While both aim to divide a dataset into $k$ groups, the fundamental difference lies in how they define the “center” of a cluster. In K-Medoids, the center of a cluster, called a medoid, is an actual data point from the original dataset. Specifically, it is the point within a cluster that has the minimum total dissimilarity to all other points in that same cluster.
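As an illustration of K-means used as a distillation step, the sketch below runs plain Lloyd iterations in NumPy and returns the centroids as the distilled set. It is a minimal sketch under stated assumptions (random initialization from the data, a fixed iteration cap, Euclidean distance), not the implementation used in the experiments; the name `kmeans_distill` and the toy data are hypothetical.

```python
import numpy as np


def kmeans_distill(X, k, n_iter=50, seed=0):
    """Return k centroids of X found by Lloyd's algorithm (plain K-means)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct instances drawn at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every instance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned instances.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids


# Toy data (assumption): two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X_demo = np.vstack(
    [rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))]
)
S = kmeans_distill(X_demo, k=2)
```

The returned centroids are synthetic points; a K-Medoids variant would instead restrict each center to an existing row of the data.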
4.2. Coreset Methods
A coreset is defined [33,34] as a small weighted or unweighted subset of the original dataset such that training (or evaluating) a model on the coreset approximates training on the full dataset. In general, coreset methods select real data points (as opposed to creating synthetic points), aim to preserve the geometry, coverage, or influence of the dataset, and thus are especially natural for tabular data [34]. In contrast to K-means, distilled points are from the original dataset $X$.
In this evaluation, we consider two alternatives: one based on the Gonzalez algorithm [14], named G-Coreset, and one based on Leverage Score.
4.2.1. G-Coreset
The Gonzalez algorithm [14] is a greedy strategy designed to solve the K-Center problem. It distills data by iteratively picking the instance that is furthest from the currently selected set, ensuring the “maximum coverage” of the data space.
The objective is then formulated as the minimization of the maximum distance between any point in $X$ and its nearest neighbor in the distilled set $S$:
$$\min_{S \subset X, \, |S| = k} \; \max_{x \in X} \; d(x, S)$$
where
$$d(x, S) = \min_{s \in S} \|x - s\|$$
The distillation process, further named G-Coreset (Gonzalez Coreset), is built with a greedy iterative algorithm: starting from one instance, it repeatedly adds the instance with the largest distance $d(x, S)$ to the current set until $k$ instances are selected.
Following this algorithm, $S \subset X$. Equation (11) ensures that the set maximizes the coverage.
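The greedy farthest-point selection can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the choice of the first instance (index 0 by default) and the Euclidean distance are assumptions, and the name `gonzalez_coreset` is hypothetical.

```python
import numpy as np


def gonzalez_coreset(X, k, first=0):
    """Greedy farthest-point selection; returns the indices of k rows of X."""
    X = np.asarray(X, dtype=float)
    selected = [first]
    # Distance from every instance to its nearest selected instance so far.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # farthest instance from the current set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)


X_demo = np.array(
    [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.0, 0.2], [5.0, 5.0]]
)
idx = gonzalez_coreset(X_demo, k=3)  # selects rows 0, 3 and 4
```

Keeping the running vector `min_dist` means each step costs one pass over the data, so selecting $k$ instances takes $k$ passes over the $n$ rows.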
4.2.2. Coreset Leverage Score
An alternative to Gonzalez's method is Leverage Score sampling for the coreset. The method is built in the context of linear regression, where not all data points contribute equally to the final model; some points, known as high-leverage points, have a disproportionate influence on the position of the regression line. By identifying and prioritizing these points, we can distill a dataset that maintains the predictive power of the original, while significantly reducing the computational overhead.
The statistical core of this method lies in the projection (or “hat”) matrix. For a dataset $X \in \mathbb{R}^{N \times d}$ with $N$ samples and $d$ features, the Leverage Score $\ell_i$ for each observation $x_i$ is defined as:
$$\ell_i = x_i^\top (X^\top X)^{-1} x_i$$
These scores correspond to the diagonal elements of the hat matrix $H = X (X^\top X)^{-1} X^\top$. Geometrically, $\ell_i$ measures how far an individual observation’s features are from the average of the features in the dataset. A high Leverage Score indicates that the point is an outlier in the feature space and, consequently, plays a critical role in defining the model’s parameters.
To perform the distillation, points are sampled with a probability proportional to their Leverage Scores. This process ensures that “important” points are preserved in the distilled coreset. This approach provides strong theoretical guarantees. Specifically, if we sample $m$ points where $m = O(d \log d / \varepsilon^2)$, the resulting model trained on the distilled data will be a $(1 + \varepsilon)$ approximation of the model trained on the full dataset.
However, we emphasize that this theoretical justification is valid only in the context of linear least-squares problems and low-rank matrix approximation. The guarantees concern subspace preservation and approximation of linear objectives. Since the downstream learners are non-linear, the theoretical guarantees do not transfer directly to RF, XGB, or RBF-SVM.
In practice, directly computing the hat matrix $H$ is very computationally intensive since it requires multiplication over the entire dataset. Based on the work of Drineas et al. [35], we first compute a rank-$m$ approximation of $X$ using Principal Component Analysis (PCA). This relies on the fundamental result of the Singular Value Decomposition (SVD), $X = U \Sigma V^\top$, where $U \in \mathbb{R}^{N \times d}$, $\Sigma \in \mathbb{R}^{d \times d}$, $V \in \mathbb{R}^{d \times d}$.
Now let $U_m \in \mathbb{R}^{N \times m}$ denote the matrix formed by the top-$m$ left singular vectors corresponding to the largest singular values. The Leverage Score may be computed as:
$$\ell_i = \|u_i\|_2^2$$
where $u_i$ is the $i$-th row of $U_m$. We recall [35,36] that $0 \le \ell_i \le 1$ and $\sum_{i=1}^{N} \ell_i = m$, while high Leverage Scores indicate influential or extreme samples, which are relevant in forming coresets.
We define a probability distribution over samples:
$$p_i = \frac{\ell_i}{\sum_{j=1}^{N} \ell_j}$$
Given a target reduced size $k \ll N$, we sample $k$ rows without replacement according to:
$$S = \{x_{i_1}, \dots, x_{i_k}\}, \quad i_1, \dots, i_k \sim p$$
From an intuitive point of view, we apply PCA-based Leverage Score sampling to select informative samples. Leverage Scores are computed from the top principal components and used to define a sampling distribution over data points.
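The PCA-based sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the data are centered before the SVD (as PCA implies), the number of retained components `m` is supplied by the caller, and the function name `leverage_score_sample` is hypothetical rather than taken from the paper.

```python
import numpy as np


def leverage_score_sample(X, k, m, seed=0):
    """Sample k row indices of X with probability proportional to the
    leverage scores computed from the top-m left singular vectors."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # assumption: PCA operates on centered data
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    U_m = U[:, :m]  # top-m left singular vectors
    scores = np.sum(U_m**2, axis=1)  # squared row norms of U_m
    probs = scores / scores.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False, p=probs)
    return idx, scores


# Toy data (assumption): 200 Gaussian instances with 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
idx, scores = leverage_score_sample(X_demo, k=20, m=3)
```

Because the columns of `U_m` are orthonormal, the scores lie in [0, 1] and sum to `m`, which matches the properties recalled above.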
4.3. Copula Based Distillation
We recall that a copula is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval $[0, 1]$. The development of copulas in practical applications is based on Sklar's theorem [37], which states that every multivariate cumulative distribution function can be expressed in terms of its marginals and a copula $C$.
To build an actual copula-based distillation method, we consider that each instance $x$ is a vector of $d$ random variables $(X_1, \dots, X_d)$. According to Sklar’s theorem, the joint cumulative distribution function (CDF) can be decomposed:
$$F(x_1, \dots, x_d) = C\left(F_1(x_1), \dots, F_d(x_d)\right)$$
where $F_j$ is the marginal CDF on the $j$-th dimension and $C$ is the copula function describing the dependency structure.
While different models may be used, previous distillation and synthesis studies [38,39] used a Gaussian Copula model. It assumes that dependencies between features follow a multivariate normal structure:
$$C_{\Sigma}(u_1, \dots, u_d) = \Phi_{\Sigma}\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d)\right)$$
where $\Phi^{-1}$ is the inverse CDF (quantile function) of the standard normal distribution, and $\Phi_{\Sigma}$ is the joint CDF of a multivariate normal distribution with mean 0 and correlation matrix $\Sigma$.
The distillation (synthesis) process [40] fits the marginal distributions and the correlation matrix on the original data and then samples new instances from the fitted model.
While K-means distills data into “average” points and coreset methods distill it into border points, the Gaussian Copula distills the “rules” of the data (distribution and correlation). By sampling $k$ times from this learned model, one obtains a distilled set $S$ that mimics the original density and feature relationships.
The solution used in this study is based on the SDV library [40].
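For intuition, a Gaussian copula synthesis can be sketched with NumPy and the standard library alone. This sketch is not the SDV implementation: it assumes purely numerical columns, uses empirical (rank-based) marginals instead of fitted parametric ones, and the name `gaussian_copula_distill` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def gaussian_copula_distill(X, k, seed=0):
    """Fit empirical marginals and a Gaussian copula on X, then draw k rows."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # 1. Map each column to (0, 1) through its ranks (empirical CDF).
    ranks = X.argsort(axis=0).argsort(axis=0) + 1  # ranks 1..n per column
    U = ranks / (n + 1.0)
    # 2. Map the uniforms to standard normal scores.
    Z = np.vectorize(_STD_NORMAL.inv_cdf)(U)
    # 3. Estimate the correlation matrix of the normal scores.
    corr = np.corrcoef(Z, rowvar=False)
    # 4. Sample k correlated normal vectors and map them back to uniforms.
    rng = np.random.default_rng(seed)
    Z_new = rng.multivariate_normal(np.zeros(d), corr, size=k)
    U_new = np.vectorize(_STD_NORMAL.cdf)(Z_new)
    # 5. Invert the empirical marginals through column-wise quantiles.
    return np.column_stack(
        [np.quantile(X[:, j], U_new[:, j]) for j in range(d)]
    )


# Toy data (assumption): two strongly correlated numerical columns.
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
X_demo = np.column_stack([x1, 2.0 * x1 + rng.normal(scale=0.5, size=500)])
S = gaussian_copula_distill(X_demo, k=50)
```

Since the marginals are inverted through empirical quantiles, the synthetic values stay within the observed range of each column, while the dependence between columns comes only from the estimated correlation matrix.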
4.4. Gaussian Mixture Model
Data distillation via distribution matching aims to create a synthetic set $S$ such that the statistical distribution of $S$ is as close as possible to the distribution of the original dataset $X$. By using Gaussian Mixture Models (GMMs), we can represent $X$ as a combination of $m$ probability densities and use the parameters of these densities to derive our $k$ distilled instances.
In this framework, we assume $X$ is generated by a probability density function $p(x)$. We model this density using a GMM with $m$ components:
$$p(x \mid \theta) = \sum_{j=1}^{m} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
where $\theta = \{\pi_j, \mu_j, \Sigma_j\}_{j=1}^{m}$ are the parameters of the mixture model, $\pi_j$ is the mixing coefficient for the $j$-th component ($\sum_{j=1}^{m} \pi_j = 1$), and $\mathcal{N}(x \mid \mu_j, \Sigma_j)$ is the multivariate Gaussian distribution with mean $\mu_j$ and covariance $\Sigma_j$.
The distribution is iteratively found using an Expectation-Maximization algorithm. First, one needs to find the parameters $\pi_j$, $\mu_j$, $\Sigma_j$ that maximize the likelihood of the original data $X$:
$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
Once the modes are found, they are sampled $k$ times to retrieve the distilled set $S$:
$$s_l \sim p(x \mid \hat{\theta}), \quad l = 1, \dots, k$$
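The sampling step can be sketched as ancestral sampling: first pick a component according to the mixing coefficients, then draw from that component's Gaussian. The sketch assumes the mixture parameters have already been fitted (for example by EM); the function name `sample_gmm` and the toy parameter values are hypothetical.

```python
import numpy as np


def sample_gmm(weights, means, covs, k, seed=0):
    """Draw k instances from a Gaussian mixture with known parameters."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    # Step 1: choose a mixture component for every distilled instance.
    comps = rng.choice(len(weights), size=k, p=weights)
    # Step 2: draw each instance from the Gaussian of its chosen component.
    S = np.array([rng.multivariate_normal(means[j], covs[j]) for j in comps])
    return S, comps


# Toy mixture (assumption): two tight components in two dimensions.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [10.0, 10.0]]
covs = [0.01 * np.eye(2), 0.01 * np.eye(2)]
S, comps = sample_gmm(weights, means, covs, k=200)
```

With this scheme, the share of distilled instances drawn from each component follows the mixing coefficients only on average, so small values of `k` may under-represent low-weight components.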
4.5. Conditional Tabular GAN
More recent and elaborate models have used deep architectures to synthesize data that should match a given set. In this work, we focus on the Conditional Tabular GAN (CTGAN) solution by Xu et al. [16]. The solution is a modified generative adversarial network specifically designed to produce high-quality synthetic tabular data.
In more detail, the CTGAN is designed to adapt to modeling tabular data, which often contain mixed-type columns, non-Gaussian distributions, and imbalanced categorical values. Off-the-shelf deep learning models often fall short because tabular data does not share the local structures found in images or text. To address these challenges, CTGAN implements three primary changes:
1. Mode-Specific Normalization for Continuous Columns. It starts from the observation that continuous columns in tabular data are often non-Gaussian and multimodal, meaning they have multiple “peaks” or modes in their distribution, which CTGAN addresses through mode-specific normalization. It uses a Variational Gaussian Mixture (VGM) model to estimate the number of modes and fit a Gaussian mixture to each continuous column. Each value is transformed into a representation consisting of a one-hot vector (indicating which mode the value belongs to) and a scalar (representing the value’s normalized position within that specific mode). This allows the model to focus on learning the distribution within each mode independently.
2. Conditional Generator and Training-by-Sampling. It starts from the observation that categorical columns are frequently highly imbalanced, where a single category might appear in over 90% of the rows. In standard GAN training, the generator may never learn to produce rare “minority” classes because they do not appear often enough to influence the discriminator. CTGAN solves this using a conditional generator and a “training-by-sampling” approach: the generator is given a conditional vector that specifies a particular category from a discrete column that it must produce; then, during training, CTGAN samples categories based on the logarithm of their frequency rather than their raw frequency. This forces the model to “evenly explore” and learn from minority categories that would otherwise be ignored. Next, a cross-entropy loss is added to the generator to penalize it if the generated row does not match the requested condition.
3. Specialized Network Architecture for Mixed Types. Because tabular data lacks local structure, CTGAN uses fully connected networks for both the generator and the discriminator. To handle the mixed-type nature of the output, the model employs different activation functions in the final layer, such as Tanh, which is used for the scalar values of continuous columns, and Gumbel softmax. The latter is used for both the discrete column values and the mode indicators for continuous columns, allowing the model to differentiate between continuous and categorical outputs, while remaining end-to-end differentiable. Also, CTGAN incorporates the PacGAN framework [41] (using 10 samples per “pac”) to further prevent mode collapse, a common issue where the generator produces limited varieties of data.
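The effect of the log-frequency sampling used in training-by-sampling can be illustrated with a small sketch. It is not the CTGAN code: the `+ 1` inside the logarithm is an assumption of this sketch that keeps categories with a single occurrence above zero mass, and the name `log_frequency_probs` is hypothetical.

```python
import numpy as np


def log_frequency_probs(column):
    """Sampling probabilities proportional to the log of category frequency."""
    values, counts = np.unique(np.asarray(column), return_counts=True)
    log_counts = np.log(counts + 1.0)  # assumption: +1 avoids log(1) = 0
    return values, log_counts / log_counts.sum()


# Toy column (assumption): one category covers 95% of the rows.
column = ["a"] * 95 + ["b"] * 5
values, probs = log_frequency_probs(column)
```

In this toy column, the minority category "b" holds 5% of the rows but receives roughly 28% of the sampling mass, so the generator is conditioned on it far more often than under raw-frequency sampling.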
4.6. Tabular Variational Autoencoder
The solution used in this evaluation is a secondary proposal from the same work that introduced CTGAN [16].
The TVAE solution considers a table $T$ (which, here, represents the input data space $X$) with $N_c$ continuous columns $\{C_1, \dots, C_{N_c}\}$ and $N_d$ discrete columns $\{D_1, \dots, D_{N_d}\}$, each treated as a random variable. Together, they follow an unknown joint distribution $P(C_{1:N_c}, D_{1:N_d})$.
The implementation adapts Variational Autoencoders (VAEs) for tabular data, calling the model Tabular VAE (TVAE), using similar preprocessing, but modifying the loss. TVAE uses two networks for $p_\theta(r_j \mid z_j)$ and $q_\phi(z_j \mid r_j)$, where $r_j$ is a preprocessed row and $z_j$ its latent representation, trained with the ELBO loss. The design of $p_\theta(r_j \mid z_j)$ differs to model probabilities accurately: it outputs a joint distribution over $2N_c + N_d$ variables.
The encoder uses two fully connected layers with ReLU and Gaussian generation, then two fully connected layers with softmax. The decoder also uses two ReLU fully connected layers, followed by a fully connected layer and exponentiation. Parameters are trained via gradient descent.
In our case, we used the implementations of both CTGAN and TVAE as offered by their authors and included in the SDV library [40].
5. Databases
In this paper, a collection of tabular databases was used. They were selected to have more than 2000 instances but fewer than 100,000. In addition, some popular databases, such as those about “Diabetes” and “Heart” problems, were added. The databases define both classification and regression problems. Unless separate training and testing sets were provided, the data were divided into an 80% training set and a 20% testing set. The division was done once and kept for the entire experiment. Distillation methods analyze and reduce only the training data. If the labels were multidimensional, only the first value was retained.
Databases were retrieved from three main sources: UC Irvine Machine Learning Repository, available online at https://archive.ics.uci.edu/ (accessed on 31 December 2025), Kaggle, found at https://www.kaggle.com/ (accessed on 31 December 2025), and OpenML, available online at https://www.openml.org/ (accessed on 31 December 2025). The databases have been made public for academic research and often can be found in other locations. A summary of them, the introductory paper, and some details are provided in Table 1 for classification and in Table 2 for regression.
6. Experiments
Overall, this work contains a very large volume of results. Consequently, the following strategy is adopted: we present the bulk of the raw results in the Supplementary Materials, while in the main paper, we report only the conclusive, integrative results. However, in each case, we refer to the specific results in the Supplementary Materials that were used to generate the reported findings.
6.1. Implementation
The hyperparameters were optimized independently for the baseline and for each distilled set. The grid search was also performed independently for each learner.
To reduce the duration of the experiments, we relied on CPU parallelism. Since SVM, which is based on LibSVM, does not support parallel processing, whereas XGBoost does, we concluded that it is more efficient to run different experiments in parallel threads. The simplest approach is to run experiments on two different datasets simultaneously, each on a separate thread. Another case where parallelism is applied is during the learner grid search: training and testing for different hyperparameter values are independent and can therefore be executed in parallel.
No GPU acceleration was used. The CTGAN and TVAE methods are computationally more intensive and, being based on neural network architectures, could in principle benefit from GPU computation. However, the creators of SDV, after extensive testing, concluded that GPU usage does not provide significant benefits and advised against it.
6.2. Geometry Preservation
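Distillation transforms the original set of points into a new set. To determine how the geometry changes, we computed the Wasserstein distance (WD) [67]. The WD represents the minimum “cost” required to transport one distribution into another. Mathematically, for two distributions assumed to be Gaussian, $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, the distance $W_2$ is approximated as:
$$W_2^2 = \|\mu_1 - \mu_2\|_2^2 + \operatorname{Tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\right)^{1/2}\right)$$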
The intuition is to view the dataset as a “cloud of mass”; the Wasserstein distance then measures the geometric work required to transform the original distribution into the new one. Unlike standard statistical tests, it accounts for the underlying metric structure of the space.
The data are first normalized so that each feature has zero mean and unit variance.
In general, WD is not normalized, and interpreting numerical results requires a baseline. In our case, the baseline, $WD_{\text{base}}$, was computed by measuring the distance between the original dataset and a subset obtained by randomly sampling 50% of the data. The final baseline value is obtained by averaging five such trials. Denoting by $WD_{\text{dist}}$ the distance between the original and the distilled set, the interpretation of the values is as follows:
If $WD_{\text{dist}} \approx WD_{\text{base}}$, the distilled set captures the geometry about as well as randomly sampled subsets of the raw data.
If $WD_{\text{dist}} \gg WD_{\text{base}}$, the distillation process has likely suffered from mode collapse or outlier bias. This means that the distilled points have migrated to a different region of the feature space, or that the distillation failed to capture the spread (variance) of the original data.
If $WD_{\text{dist}}$ is lower than $WD_{\text{base}}$, this suggests that the $k$ distilled points are strategically placed and represent the global mass better than a random subset would.
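The Gaussian approximation of the Wasserstein distance and the random-subset baseline can be sketched as follows. This is an illustrative sketch, not the code used in the experiments: the helper names (`_sqrtm_psd`, `gaussian_w2`, `random_baseline`) are hypothetical, the matrix square root is computed through an eigendecomposition, and the shifted copy of the data is only a toy stand-in for a poorly distilled set.

```python
import numpy as np


def _sqrtm_psd(A):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh((A + A.T) / 2.0)
    vals = np.clip(vals, 0.0, None)  # remove tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2(X, S):
    """2-Wasserstein distance between Gaussians fitted to the rows of X and S."""
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    mu1, mu2 = X.mean(axis=0), S.mean(axis=0)
    c1 = np.atleast_2d(np.cov(X, rowvar=False))
    c2 = np.atleast_2d(np.cov(S, rowvar=False))
    r1 = _sqrtm_psd(c1)
    cross = _sqrtm_psd(r1 @ c2 @ r1)
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))


def random_baseline(X, frac=0.5, trials=5, seed=0):
    """Average distance between X and random subsets holding frac of its rows."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = int(frac * n)
    dists = [
        gaussian_w2(X, X[rng.choice(n, size=m, replace=False)])
        for _ in range(trials)
    ]
    return float(np.mean(dists))


# Toy data (assumption), standardized to zero mean and unit variance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
wd_base = random_baseline(X_demo)
wd_shift = gaussian_w2(X_demo, X_demo + 3.0)  # every feature shifted by 3
ratio = wd_shift / wd_base
```

In the toy case the covariance term vanishes, so the distance reduces to the norm of the mean shift; a ratio far above 1 corresponds to the second interpretation case above.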
6.3. Prediction Metrics
To measure the efficiency of the learning and, respectively, of the distillation, we use the following metrics:
- 1.
For classification problems:
Accuracy is the most intuitive metric for classification. It represents the proportion of total predictions that the model got exactly right. If $y_i$ is the true label and $\hat{y}_i$ is the predicted label for $n$ samples, accuracy is defined as:
$$\mathrm{Acc} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(y_i = \hat{y}_i),$$
where $\mathbb{1}(\cdot)$ is the indicator function. While simple, accuracy can be misleading in highly imbalanced datasets.
Average (balanced) Accuracy is useful for evaluating imbalanced datasets. While standard accuracy rewards a model for simply predicting the majority class, Average Accuracy treats every class as equally important regardless of its size. If a classifier ignores a minority class as a lazy way to achieve high standard accuracy, the Balanced Accuracy score will drop sharply, providing a more truthful reflection of model utility. For a dataset with $C$ classes, it is defined as:
$$\mathrm{AvgAcc} = \frac{1}{C}\sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c},$$
where $TP_c$ and $FN_c$ are the numbers of true positives and false negatives for class $c$.
- 2.
For regression problems:
Mean Squared Error (MSE) measures how far predictions deviate from the truth. It is defined as:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
A low MSE indicates that the model’s predictions are, on average, very close to the actual values. However, MSE is scale-dependent; a “good” MSE in one problem might be a “bad” one in another, depending on the units of measurement. Thus, we complement it with another measure:
Pearson Correlation Coefficient, $r$: While MSE measures the magnitude of error, the Pearson Correlation Coefficient measures the strength and direction of the linear relationship between the actual values $y_i$ and predicted values $\hat{y}_i$. It is defined as:
$$r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}},$$
where $\bar{y}$ is the mean of the labels and $\bar{\hat{y}}$ is the mean of the predictions. The result ranges from $-1$ to $+1$. A value of $+1$ indicates a perfect linear trend, meaning the model has captured the general “shape” or movement of the data perfectly, even if the absolute values are scaled or shifted. A value of 0 means that the prediction is not related to the actual labels.
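The four prediction metrics can be written compactly as below. This is a minimal NumPy sketch based on the standard definitions of these metrics; the function names are illustrative and the code is not taken from the experimental pipeline.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class weighs the same."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def pearson_r(y_true, y_pred):
    """Pearson correlation; undefined if either input is constant."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    a = y_true - y_true.mean()
    b = y_pred - y_pred.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))
```

For a toy set with nine samples of class 0 and one of class 1, a classifier that always predicts class 0 obtains an accuracy of 0.9 but a balanced accuracy of 0.5. Likewise, predictions equal to $2y + 10$ give a large MSE while the Pearson coefficient is 1, which illustrates why the two regression metrics are reported together.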
6.4. Database Analysis and Baseline Performance
Different problems (encapsulated in different databases) may pose different levels of difficulty. While there is no universal metric that quantifies the difficulty of a problem, some indicators can be used. In this work, we refer to two categories: dataset tailness and stability under hyperparameter grid search.
6.4.1. Tailness
“Tailness” for a tabular dataset is not a single number with a universally accepted definition, but in practice, it refers to how heavy-tailed, skewed, or rare event-dominated the feature and label distributions are. In a recent impactful work, McElfresh et al. [
68] compared the performance of the tree-based classifiers and neural networks on tabular data and concluded that determining the better performer is related to the “tailness” of the dataset. This is relevant for our experiments as the distillation methods also span the neural processing–thresholding range.
For tabular datasets, tailness usually captures one or more of the following: (i) heavy-tailed feature distributions (e.g., log-normal, Pareto-like, extreme outliers); (ii) skewness/asymmetry, i.e., long right or left tails; (iii) rare events/extreme quantiles, where a small fraction of samples carries disproportionate mass; (iv) conditional tails, i.e., rare combinations of feature values (multivariate tails). No single metric captures all of these, so a tailness profile is usually computed.
In this work, we measure tailness by the following metrics:
- 1.
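Skewness is a third-order statistical moment that measures the asymmetry of the distribution. For a feature $x$ with mean $\mu$ and standard deviation $\sigma$, it is defined as:
$$\gamma_1 = \mathbb{E}\left[\left(\frac{x - \mu}{\sigma}\right)^{3}\right].$$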
The practical interpretation for skewness is as follows: (i) ; (ii) ; (iii) ; .
- 2.
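Kurtosis measures tail heaviness relative to a Gaussian. For a feature $x$ with mean $\mu$ and standard deviation $\sigma$, it is computed (in its excess form) as:
$$\kappa = \mathbb{E}\left[\left(\frac{x - \mu}{\sigma}\right)^{4}\right] - 3.$$
Regarding interpretation, a value around 0 indicates a Gaussian-like distribution, while values larger than 3 indicate very heavy tails.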
- 3.
The quantile tail ratio is a non-parametric measure. To control the range, we consider the logarithmic version. First, we compute the feature-wise empirical quantiles:
$$Q_j(p) = \inf\{t : \hat{F}_j(t) \geq p\},$$
meaning “the smallest value of $t$ for which the further condition becomes true”, where $\hat{F}_j$ is the empirical cumulative distribution function of feature $j$. This is used to determine
,
,
. These are developed to determine the feature-wise tail ratio and respectively the dataset-level mean tail ratio:
For this metric (in the original, non-logarithmic form), the values are theoretically in the range. When a database has extreme outliers, the numerator explodes and becomes very large (values larger than ). Thus, in the logarithmic form, values larger than 5 indicate heavy tailness (many outliers).
- 4.
Multivariate tailness (joint rarity) is computed by the
Mahalanobis tail score, which counts how many samples lie in the extreme multivariate tail. To compute it, first the mean vector and the covariance matrix are found:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \Sigma = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^{\top}.$$
Next, the Mahalanobis distance per sample is determined:
$$d_i = \sqrt{(x_i - \mu)^{\top}\,\Sigma^{-1}\,(x_i - \mu)}.$$
These distances are then used to find the empirical tail threshold, and the exceedances are accumulated over the entire dataset.
For this measure, if the data were truly Gaussian and the covariance known, the expected value is 0.01. Thus, values larger than 0.05 indicate heavy multivariate tails, while values larger than 0.1 indicate a strong rare event structure. Since we report the score as a percentage, the Gaussian reference value becomes 1 and the relevant thresholds become 5 and 10.
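Three of the measures in the tailness profile can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: the moments use population normalization, the covariance uses `np.cov` (normalized by $N-1$), and the tail threshold is taken as the 0.99 quantile of the chi-square distribution with as many degrees of freedom as features, which is consistent with the stated Gaussian reference value of 0.01. To keep the sketch free of extra dependencies, that quantile is obtained with the Wilson-Hilferty approximation; all function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def skewness(x):
    """Third standardized moment of a single feature."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

def chi2_quantile_approx(p, dof):
    """Wilson-Hilferty approximation of the chi-square quantile."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * np.sqrt(c)) ** 3

def mahalanobis_tail_score(X, p=0.99):
    """Percentage of samples whose squared Mahalanobis distance
    exceeds the (assumed) chi-square tail threshold."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariances
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    threshold = chi2_quantile_approx(p, X.shape[1])
    return float(100.0 * np.mean(d2 > threshold))
```

On purely Gaussian data the score stays close to 1 (percent), whereas contaminating a small fraction of rows with large-magnitude values raises it well above that reference.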
For the databases used in this study, the measures are provided in
Table 3 and
Table 4. The main observation is that our study includes both datasets that no measure identifies as having strong tails, such as Crop Recommendation and Diabetes, and datasets that have strong tails according to one metric, such as Connect-4, where the kurtosis exceeds 150. Such a value means that the variance is dominated by very few extreme observations: most samples are tightly clustered, but a tiny fraction are orders of magnitude larger.
In general, no database is “heavy” according to the Mahalanobis tail score, but there is a clear distinction based on other metrics. According to the quantile tail ratio, most databases for classification have heavy tails, while most used in regression do not.
6.4.2. Impact of Learners’ Hyperparameters
In machine learning, the robustness of a dataset is often revealed not by the peak accuracy achieved, but by how much that accuracy fluctuates during hyperparameter tuning. When performing a grid search on a tabular dataset using Random Forest (RF) and Radial Basis Function SVM (RBF SVM), the “spread” between the maximum and minimum accuracies may serve as a diagnostic tool for dataset consistency. As a rule of thumb, high performance variation is the sign of an “unstable” dataset, while low performance variation is the sign of a “resilient” dataset.
High variation, where slight changes in the RBF $\gamma$ or the RF feature sampling lead to massive swings in accuracy, typically indicates a dataset with low consistency or a high degree of noise. In an RBF SVM, $\gamma$ controls the “reach” of a single training example; if performance drops sharply when $\gamma$ increases, it suggests that the decision boundary is being forced to “wiggle” around outliers rather than capturing a global pattern. Similarly, if a Random Forest’s accuracy collapses when the “in-bag” percentage or the feature count is reduced, the dataset likely contains redundant or weak features that only provide signal when specific, lucky subsets are sampled. In this scenario, the model is not learning the data; it is “memorizing” specific coincidences in the feature space.
Conversely, little variation between the minimum and maximum accuracies suggests a highly consistent dataset. When an RBF SVM maintains stable performance across a wide range of Cost ($C$) and Gamma ($\gamma$) values, it implies that the classes are well-separated by a “thick” margin that is not easily disrupted by small boundary shifts. For a Random Forest, low sensitivity to the number of features or sampling rates indicates feature synergy: the underlying patterns are so prevalent that almost any random subset of features or data points is sufficient to reconstruct the logic of the target variable. This “flat” optimization landscape is a hallmark of a dataset that will generalize well to unseen production data. XGBoost performs its optimization in the data space, so changing the topology of that space has a dramatic effect on its performance.
Thus, a secondary goal of grid search, beyond maximizing the performance, is to analyze the “stability window.” A consistent dataset produces a wide “plateau” of high performance in the hyperparameter grid, whereas an inconsistent one produces “sharp peaks.”
Furthermore, looking at where the maximum performance has been achieved between the original set and the distilled set, we get an intuition of how well the distillation preserves the structure of the original data.
To combine these intuitive ideas in a single metric, we have computed the
standard deviation of the learner performance with respect to changes in the hyperparameters. This metric is comparable across datasets for the same learner, since the number of hyperparameter configurations is the same. The larger the deviation, the more unstable the dataset. The values for the classification datasets are shown in
Table 5. Such a measure is an alternative to the tailness description. Since the correlation between any tailness measure and the deviation is below 0.35, it provides a different view of the dataset.
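The stability measure can be illustrated with the sketch below. It assumes a hypothetical callable `evaluate` that trains a learner with the given hyperparameters and returns its accuracy; whether the population or the sample standard deviation was used is not stated in the text, so the population form is an assumption.

```python
import itertools
import statistics

def grid_scores(evaluate, grid):
    """Score every hyperparameter combination of the grid.

    `evaluate` is a hypothetical callable that accepts the
    hyperparameters as keyword arguments and returns a score.
    """
    names = sorted(grid)
    scores = {}
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores[values] = evaluate(**params)
    return scores

def stability(scores):
    """Summarize how much the score moves across the grid."""
    values = list(scores.values())
    return {
        "std": statistics.pstdev(values),      # deviation over the grid
        "spread": max(values) - min(values),   # max-min accuracy gap
        "best": max(scores, key=scores.get),   # location of the peak
    }
```

A dataset with a wide plateau yields a deviation near zero, while a single sharp peak in the grid produces a large deviation and spread. The `best` entry allows the location of the optimum to be compared between the original and the distilled set.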
Analysis of the variation of the performance reveals that the three learners, more often than not, identify the same databases as unstable. There are certain differences such as Crop Recommendation where SVM is more variable.
6.4.3. Visualization
To gain better insight, we visualize the data both in its original form and after the distillation process. Because tabular datasets contain many features, it is impossible to interpret their global structure through direct inspection. For practical visualization, the data must be mapped into a two-dimensional embedding. The goal is to compress the multidimensional structure while preserving key relationships, such as clusters or gradients, so they can be displayed on a standard coordinate system. We use three methods:
Principal Component Analysis (PCA) is a linear technique that projects the data onto directions of maximum variance. It is mathematically transparent and preserves global structure, meaning that points far apart in the plot are also far apart in the original space. However, due to its linear nature, it often fails to capture non-linear relationships, and local structures such as small sub-clusters may be blurred or lost. When applied, the projection matrix is computed on the original dataset and reused for the distilled sets.
T-Distributed Stochastic Neighbor Embedding (t-SNE) [
69] is a non-linear method that emphasizes local structure and excels at revealing distinct clusters that linear techniques may miss. Its main drawback is that it distorts global structure: distances between clusters in the plot do not necessarily reflect their true relationships. The results are also stochastic and sensitive to the perplexity hyperparameter. It is run separately for each dataset.
Uniform Manifold Approximation and Projection (UMAP) [
70] is a more recent non-linear method that aims to balance local and global structure. It is significantly faster than t-SNE and often preserves inter-cluster distances more faithfully. However, if not carefully tuned, it can introduce artificial structures, such as elongated “strings” or spurious clusters. It is run separately for each dataset.
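The PCA protocol, in which the projection is computed on the original dataset and reused for the distilled sets, can be sketched with a plain SVD. This is an illustrative NumPy version with hypothetical function names, not the code used to produce the figures.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Return the mean and the top principal directions of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def project(X, mean, components):
    """Map any dataset into the plane fitted on the original data."""
    return (np.asarray(X, dtype=float) - mean) @ components
```

Because the distilled set is centered and rotated with the statistics of the original data, both point clouds share the same axes and can be overlaid directly. t-SNE and UMAP do not offer this property in the same way, which is why they are run separately for each dataset.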
6.5. Baseline and Comparison with Other Works
The best results for the classification benchmarks are provided in
Table 6, while those for regression are in
Table 7. In both cases, the full results are in the
Supplementary Material; one may notice that among the three learners, performance is similar between XGBoost and Random Forest and slightly worse for SVM, in most cases. These findings are consistent with previous experiments [
68,
71].
On regression datasets, it is harder to find comparable results. The problem is not the lack of results on specific databases, but rather the broad variety of metrics used to report performance. In general, each work used a slightly different metric.
A recent work [
71] tested and improved many classifiers across multiple datasets, but there is no perfect overlap with our study. We present the corresponding results in
Table 6. In addition, we recall baseline results reported in the original studies that introduced these datasets or in the public code accompanying them. While in some cases our baseline performance is slightly better and in others slightly worse, it is never substantially worse. We therefore argue that our learners are highly competitive, which makes our findings meaningful.
6.6. Distillation—Geometry Preservation
The aggregated results capture general trends but hide many informative details. Overall, K-means and GaussCop appear to be the methods that best preserve the geometry of the datasets.
Certain datasets, such as Mushroom or Diabetes, cause almost all methods to fall into mode collapse. The Diabetes dataset is relatively small, suggesting that strong reduction damages the fidelity of geometry preservation. Mushroom, however, is not small, which indicates that the specific geometry of a dataset can be challenging for all distillation methods.
In general, the number of mode collapse cases is similar when comparing coreset-based methods with neural network-based methods.
Overall, there is no major difference between the methods that would allow us to conclude that a specific approach is clearly superior for preserving dataset geometry.
6.7. Distillation—Downstream Prediction
Each distillation method was asked to reduce the training set to a percentage of the original training set. The percentages used are: 50%, 25%, 10%, and 5%. The testing set was kept intact in all cases. For each situation, the hyperparameter grid was searched.
6.7.1. Classification
For aggregating the classification results, we used the following measures:
Sum of differences (SD). This is computed as:
$$\mathrm{SD} = \sum_{i=1}^{N_{db}} \left(m_i^{\text{base}} - m_i^{\text{dist}}\right),$$
where $N_{db}$ is the number of databases (17 for classification, nine for regression), $m_i^{\text{base}}$ is the baseline metric on database $i$, and $m_i^{\text{dist}}$ is the equivalent metric, but on the distilled dataset. This measure accumulates the loss of metric compared to the baseline, over all databases. Smaller values are better; negative values mean that there is an improvement.
Sum of normalized differences (SND), which is computed as:
$$\mathrm{SND} = \sum_{i=1}^{N_{db}} \frac{m_i^{\text{base}} - m_i^{\text{dist}}}{m_i^{\text{base}}}.$$
Compared to the previous measure, here each accumulated term is normalized by the baseline metric value. This is more relevant for MSE, in regression, where different databases have different ranges for the metric.
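The two aggregation measures can be illustrated in a few lines. The sketch assumes, as the description suggests, that each term is the baseline value minus the distilled value and that SND divides each term by the baseline; the function names and the toy numbers in the usage note are hypothetical.

```python
def sum_of_differences(baseline, distilled):
    """Accumulate baseline-minus-distilled metric values over databases."""
    if len(baseline) != len(distilled):
        raise ValueError("one metric value per database is required")
    return sum(b - d for b, d in zip(baseline, distilled))

def sum_of_normalized_differences(baseline, distilled):
    """Same accumulation, with each term divided by its baseline value."""
    if len(baseline) != len(distilled):
        raise ValueError("one metric value per database is required")
    return sum((b - d) / b for b, d in zip(baseline, distilled))
```

For two toy databases with baseline accuracies 0.90 and 0.80 and distilled accuracies 0.85 and 0.82, the first contributes a loss of 0.05 and the second an improvement of 0.02, so SD is 0.03. SND weighs the same terms by the baseline of each database, which matters when the metric ranges differ strongly between databases, as is the case for MSE.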
For classification, the aggregated SD and SND values, with respect to accuracy as a metric, are visually shown in
Figure 2 and
Figure 3 and followed numerically in
Table 10 and
Table 11. To better show the trends while doing distillation, we plot the SND variation with respect to the percentage in
Figure 4.
When analyzing the results, several conclusions stand out:
All distillation methods lose performance. The minimal loss starts at about 4% in accuracy when the dataset is reduced by half and increases to roughly 12.5% when the distilled dataset is only 5% of the full dataset. In some specific cases (i.e., particular datasets), there is a performance increase after distillation, but these cases are rare and likely incidental.
The behavior of the distillation methods is consistent across variations, whether with respect to the dataset, the learner used, or the evaluation metric. This consistency refers both to the magnitude of performance loss and to the relative ranking with respect to classifiers; these aspects can be examined in more detail in the raw results provided in the
Supplementary Materials.
The best-performing methods are clearly the coreset-based approaches. While the method based on the Gonzalez algorithm performs better when the reduction is moderate (e.g., to 50%), the overall best performer is the Leverage Score-based coreset, which preserves much of the performance even under more aggressive reductions. When considering Average Accuracy, the G-coreset emerges as a clearer winner.
On the opposite end, K-means is by far the worst performer. This is easily explained by the fact that the method is designed to capture cluster means rather than decision boundaries, which are critical for classification. Based on this finding, we do not further evaluate the K-means algorithm.
The Gaussian-based generative methods (Gaussian Copula Synthesizer and Gaussian Mixture Models) generally perform poorly when compared to the coreset methods.
Among the two modern neural network-based methods, TVAE surprisingly performs better. This is somewhat unexpected, since in the original work [
16], the main proposal was CTGAN.
When comparing results based on accuracy and Average Accuracy, the overall trends are consistent. The main difference is that the numerical values for Average Accuracy are lower and the performance loss due to distillation is larger. One may hope that supervised distillation, where class labels are used, may lead to more balanced results on Average Accuracy.
When distilling moderately, all methods lead to decent results. Coreset methods produced good results no matter the problem.
When distilling aggressively (to 5%), on specific databases, even the best method fails (i.e., performance is at random chance, or the learner does not converge). This suggests that success with aggressive distillation ratios is problem-dependent.
While some distillation methods lose performance more rapidly (e.g., G-Coreset), others, such as CTGAN, maintain nearly constant performance. However, this observation is of limited relevance because the extrapolated crossing point of the two trends occurs at a dataset percentage below zero, which is not feasible.
To further look into the behavior of the methods, we use visualization. While more plots are provided in
Figures S1–S4 from the Supplementary Materials, here we present some that are more informative. In
Figure 5, we show the original set and the versions distilled with the coreset methods, CTGAN, and TVAE. We recall that PCA ensures geometric comparability, but may miss complex non-linear structures. Note in the figure how the coreset methods, especially CoreLeverageScore, preserve the space of the original dataset, while CTGAN and TVAE dramatically compress it.
For a complementary view, we present in
Figure 6 a UMAP visualization of the Crop Recommendation dataset. We recall that UMAP is built to inspect non-linear manifold preservation. One may observe that the structures existing in the original dataset are, despite the dramatic reduction, clearly preserved by the coreset methods, while the neural methods lead to less clear separation.
Correlation with Tailness. Here we try to answer the question: Can we use a tailness metric to predict the behavior of distillation methods?
To answer this question, we compute the correlation coefficient between the considered tailness measures and difference between the accuracy on the baseline and the accuracy on the distillation set, while varying the databases. An absolute value close to 1 would signal that a specific tailness measure is a very good indicator of the distillation method behavior, while values closer to zero mean that there is no indication.
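The computation just described amounts to a Pearson correlation taken across databases. A minimal sketch follows, with a hypothetical function name; any values passed to it in an example are illustrative and are not results from this study.

```python
import numpy as np

def tailness_correlation(tailness, baseline_acc, distilled_acc):
    """Correlation, across databases, between a tailness measure
    and the accuracy lost through distillation."""
    tailness = np.asarray(tailness, dtype=float)
    drop = (np.asarray(baseline_acc, dtype=float)
            - np.asarray(distilled_acc, dtype=float))
    return float(np.corrcoef(tailness, drop)[0, 1])
```

Each argument holds one value per database. A negative result means that databases with larger tailness values tend to lose less accuracy after distillation, while a result near zero means the tailness measure carries no predictive information about the loss.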
The results are shown in
Table 14. As one may notice, the only relevant connection is with MeanTailRatio99-01, where absolute values near 0.5 indicate a medium correlation. The negative values indicate anti-correlation: the larger the MeanTailRatio value, the smaller the decrease in accuracy after distillation (a small decrease being the desired outcome). Thus, databases with larger tails are better summarized by distillation methods. Another notable point is the difference between the two coreset methods; here our finding is negative: the behavior of the best-performing method, Coreset Leverage Score, is poorly anticipated by the tailness metric.
We further investigate if the MeanTailRatio99-01 is better related to a specific distillation case (i.e., when the reduction is 50%, 25%, 10%, or 5%). The results are in
Table 15. The results are rather uniform with the exception of the G-Coreset method, where the correlation significantly increases.
6.7.2. Regression
For aggregating the regression results, we used the same measures as in the case of classification, namely sum of differences (SD) and sum of normalized differences (SND). Compared to classification, this time the measures aggregate the Mean Squared Error and the Pearson Correlation Coefficient, respectively. An additional change is that, due to its poor results on classification, we have stopped evaluating K-means as a distillation method.
For regression, the aggregated SD and SND values, with respect to MSE as metric, are in
Table 16 and
Table 17, while w.r.t Pearson Correlation are in
Table 18 and
Table 19. A graphical representation for the latter is shown in
Figure 7 and
Figure 8. The trends for each distillation method are provided in
Figure 9.
Analyzing the results, similar conclusions to the experiments for classification can be drawn. Yet there are some particular notes:
Distillation methods, in general, lose performance in the regression tasks too. This, in terms of correlation, ranges from when considering half the database to approximately when considering a set of only 5% of the original set size.
The loss is more abrupt for G-Coreset than for CTGAN, but the slower degradation of CTGAN does not compensate for its lower performance at any feasible (i.e., positive) reduction percentage.
Again the best performing methods are, clearly, the coreset methods, with a similar edge at large sets for G-Coreset; the overall best is the coreset based on Leverage Score.
On the opposite side, with K-means discounted, the worst are, again, the Gaussian-based generative methods.
Again, TVAE is better than CTGAN. Yet once more, it is not a tight competitor for coreset methods.
A particularly interesting aspect in the regression setting is the difference between the findings when performance is measured using MSE versus correlation, as well as between non-normalized and normalized metrics. In terms of raw MSE, distillation significantly increases the loss in performance. However, after normalization, the loss appears relatively small. This suggests that distillation methods introduce a significant bias in the predictions; however, they still capture the overall trend well, as indicated by the correlation-based measures.
6.8. Do Distillation Methods Change the Nature of the Dataset?
To evaluate this aspect, we look at the performance variation while doing a hyperparameter search on the dataset, with the standard deviation as the metric. The raw results are in the
Supplementary Materials in Tables S54–S57. The aggregated results for the XGBoost (best learner) classification are in
Table 20.
As one might see, with the exception of K-means, all distillation methods dramatically increase the spread of accuracies, signaling that the smaller dataset becomes less consistent. While this is expected, it also means that the set of hyperparameters that leads to the best value on the original set is no longer automatically viable when learning from a distilled set.
Another, more intriguing, observation is that the deviation, while larger than the baseline, remains almost constant or even decreases as the training set is reduced. This means that the distillation method changes the structure of the points (in a way not captured by the Wasserstein distance), but does so in a consistent way: the nature of the distillation matters more than the reduction percentage.
7. Duration
A key aspect of distillation is the reduction of the training time. For this experiment, we used a single CPU (Intel(R) Core(TM) i5-13400F, 10 cores, 16 threads, 4.60 GHz) running Python 3.10.12. Full and detailed results are in the
Supplementary Materials in Tables S58–S61. Averaged data (over all classification datasets) is shown in
Table 21.
We recall that while CTGAN and TVAE can, in theory, use a GPU, their creators advocate against it, so no GPU was used in this evaluation. All processing here is CPU-only and single-threaded (no parallelization).
The results presented here should be interpreted as relative indications rather than absolute evaluations. Parallelization and further optimization could reduce the reported runtimes. From the data, it can be observed that the most efficient method by far is the Coreset Leverage Score approach. K-means is faster in terms of runtime, but its poor performance outweighs this advantage. Neural network-based methods require significantly longer execution times.
The G-Coreset method generally requires a short runtime. Its average is pushed up by the duration on Credit Card Fraud, where computing the 50% distillation set took an astounding 3061 s, which calls into question the usability of the method, in this implementation, on large datasets. If the results on this database were ignored, G-Coreset would rank second behind Coreset Leverage Score.
In general, most methods exhibit relatively constant runtimes regardless of the distillation percentage. This is because they first construct a representation of the original dataset, which is the most computationally intensive step, and then perform sampling, which accounts for only a small fraction of the total runtime. A notable exception is G-Coreset, which computes each point in the distilled set independently and therefore scales linearly with the size of the distilled dataset. G-Coreset also scales upward with dimensionality. In contrast, Coreset Leverage Score uses PCA to set the dimensionality to a fixed size and is thus nearly independent of the database dimensionality.
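The scaling behavior of G-Coreset can be understood from the textbook form of the Gonzalez farthest-first traversal, sketched below. This is an assumption about the general shape of the method, not the implementation evaluated here, and the function name is hypothetical.

```python
import numpy as np

def gonzalez_coreset(X, k, first=0):
    """Farthest-first traversal: select k row indices of X one at a time."""
    X = np.asarray(X, dtype=float)
    selected = [first]
    # Distance from every point to its nearest selected point so far.
    dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))  # the point farthest from the current set
        selected.append(nxt)
        # One full pass over the data (N rows, d features) per selected point.
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

Each selected point requires one pass over all $N$ instances and all $d$ features, so the cost grows as $O(N k d)$. This matches the observation that the runtime increases with both the size of the distilled set and the dimensionality, and it explains why a 50% distillation of a large dataset is particularly expensive.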
Regarding specific results, the runtime of all distillation methods on the Credit Card Fraud dataset is significantly higher, as this dataset is larger both in terms of the number of instances and feature dimensionality. Consequently, for all methods, the runtime on this dataset appears as an upper-end outlier.
8. Discussion and Limitations
8.1. Relation with Prior Findings
Two main paradigms have emerged for dataset compression: synthetic data generation and instance selection (coresets). Coreset selection predates modern dataset distillation and has long been studied as a principled method for approximating large datasets with representative subsets. Indeed, dataset distillation itself is often presented as a successor to earlier coreset approaches [
74].
In contrast, much of the recent tabular synthesis literature focuses on generative approaches (e.g., GAN-based or diffusion-based models) [
3,
75], typically evaluated on a small number of datasets and often without including classical instance selection baselines [
16,
17]. At the same time, surveys of dataset compression still identify coreset selection as a fundamental technique for reducing training data while preserving predictive performance.
Our work aims precisely to bridge these two lines of research. By evaluating both generative methods and instance selection approaches across a large set of classification and regression tasks, we provide a broad empirical comparison that has been largely missing from prior work. Our results indicate that, for tabular data and the downstream predictive tasks considered, classical coreset methods remain highly competitive and outperform synthetic generators.
One possible explanation is that tabular datasets frequently involve heterogeneous feature types and non-differentiable learners such as tree ensembles [
20], which complicates the optimization of synthetic datasets and may favor methods that directly preserve the empirical data geometry.
8.2. Theoretical Support
In short, the findings of this paper argue that coreset (instance selection) methods are more suitable for distillation if the ranking criterion is downstream prediction accuracy under supervised learning.
First, let us note that instance selection approximates the original empirical distribution:
$$\hat{P}_N = \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i}.$$
Generators approximate the true distribution $P$. This approximation assumes some underlying model by which the actual samples are drawn from the true data distribution $P$. The generators aim to invert this model, which is an assumption. Since the test is based on real data, one might argue that instance selection is more suitable.
A similar rationale may be constructed for the downstream supervised evaluation. For supervised learning, coreset methods approximate the empirical sum (in the loss function) directly:
$$\frac{1}{N}\sum_{i=1}^{N} \ell\left(f_\theta(x_i), y_i\right) \approx \frac{1}{|S|}\sum_{j \in S} \ell\left(f_\theta(x_j), y_j\right),$$
where $S \subset \{1, \dots, N\}$ with $|S| \ll N$.
Generators approximate the expectation of the loss function $\mathbb{E}_{(x,y) \sim P}\left[\ell\left(f_\theta(x), y\right)\right]$. However, if $P_{\text{gen}} \neq P$, the risk landscape shifts. In finite-sample supervised settings, empirical-risk approximation may be superior, which matches the findings in this paper.
8.3. Limits of the Experiments
In this paper, we investigated seven distillation methods. All methods are unsupervised, in the sense that they do not use labels while condensing the dataset. The conclusions may differ if distillation explicitly takes labels into account. In other words, during distribution matching, we only match $P(x)$ and subsequently obtain $y$ from the data. A joint optimization over $P(x, y)$ may, therefore, lead to different conclusions.
The self-imposed restriction of investigating distillation methods independently of the learner led to the exclusion of popular methods such as those based on gradient matching [
13] or Trajectory Matching [
76]. They are powerful classes of methods, but inherently connect the distillation set to a learner trained by back-propagation.
Our conclusions primarily apply to tabular datasets of medium-to-large dimensionality and to classical machine learning learners (Random Forest, Support Vector Machine, XGBoost). Although the observed superiority of coreset methods holds consistently across all experiments and compression levels tested, we do not extend this conclusion to neural tabular models. Performance trends may differ for very-high-dimensional data or for end-to-end deep learning pipelines. We investigated both classification and regression tasks, but not clustering (explicitly), retrieval, or representation learning.
While we performed hyperparameter searches for each learner, the search ranges were limited and the learners were used largely in an off-the-shelf manner. We did not pursue extensive fine-tuning for each classifier. Nevertheless, we showed that our baseline performance is comparable to prior studies. More careful and intensive tuning may lead to slightly different results.
All methods presented in the main paper are feasible even for larger datasets. Most scale linearly with respect to the number of instances
N. In the
Supplementary Materials, we also discuss another method, namely Coreset Facility Location Submodular Optimization, which—despite being a strong performer—has quadratic memory requirements. This makes it impractical for large-scale datasets and conflicts with the core motivation of data distillation.
The relative ranking of distillation methods (for a specific dataset and reduction ratio) is influenced by the choice of the learner, suggesting that no single method is universally optimal. Nevertheless, coreset-based methods performed best in the vast majority of the cases studied.
Another limitation concerns the evaluation metrics. For classification, we used accuracy and Average (Balanced) Accuracy, while for regression, we used Mean Squared Error and the Pearson Correlation Coefficient. We did not consider task-specific metrics such as the F1-score, Area Under the Curve, or ranking-based metrics. Again, our evaluation focuses primarily on downstream predictive performance.
We also investigated some desirable properties, such as interpretability, by means of visualization and correlations with tailness metrics. Our findings are restricted to the techniques used, namely PCA, t-SNE, and UMAP for visualization, and the four tailness metrics considered. Other methods or metrics may lead to slightly different conclusions.
8.4. Stability
The reported results, being averaged over many datasets, are stable. Core experiments (best learner for each case) were repeated five times, and the aggregated results show virtually no change.
Variability appears when considering a specific learner (defined by the model and precise hyperparameter values) applied across consecutive runs of a distillation method. However, similar performance can typically be recovered through slight adjustments (i.e., tuning) of hyperparameters. Variability is somewhat larger for very aggressive distillation ratios, but even in these cases, the hyperparameter search is generally able to compensate. Overall, the aggregated results are very stable.
8.5. Efficiency Trade-Off
Distillation inevitably trades accuracy for efficiency. When averaged over multiple datasets, all distillation methods lead to some loss in performance.
In general, moderate reductions of the training dataset (e.g., to 50%) incur limited performance loss (approximately 3–5% in accuracy or about 0.03 in correlation). This can be partially attributed to the relatively large size of the original datasets, which makes moderate distillation less harmful.
In contrast, as distillation becomes more aggressive, a growing number of cases show a steeper degradation in performance, indicating a practical lower bound on dataset size for reliable learning. This bound is highly dataset-dependent. For example, when distilling to 5% of the data, accuracy losses for XGBoost with Leverage Score-based coresets ranged from minimal (e.g., a 3% accuracy loss) to a nearly complete failure to learn (e.g., on the Crop Recommendation or Red Wine datasets).
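To make the 5% setting concrete, the following is a minimal sketch of leverage score-based row sampling. It is not the implementation evaluated in this paper. It assumes that leverage scores are computed from a thin SVD of the raw feature matrix, without centering or scaling, and that rows are drawn without replacement so that the result is a plain subset of the data. The function names are hypothetical.

```python
import numpy as np

def leverage_scores(X):
    """Statistical leverage of each row: squared row norms of the left singular vectors."""
    X = np.asarray(X, dtype=float)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    # Keep only directions with non-negligible singular values.
    tol = s.max() * max(X.shape) * np.finfo(float).eps
    rank = int(np.sum(s > tol))
    return np.sum(U[:, :rank] ** 2, axis=1)

def leverage_score_coreset(X, ratio, seed=0):
    """Return sorted indices of a subset holding roughly `ratio` of the rows of X."""
    scores = leverage_scores(X)
    n = scores.shape[0]
    m = max(1, int(round(ratio * n)))
    probs = scores / scores.sum()
    rng = np.random.default_rng(seed)
    # Assumption: sampling without replacement, with probability proportional to leverage.
    idx = rng.choice(n, size=m, replace=False, p=probs)
    return np.sort(idx)

# Usage on synthetic data: keep 5% of 200 rows.
X = np.random.default_rng(1).normal(size=(200, 5))
subset = leverage_score_coreset(X, ratio=0.05)
print(subset.shape)
```

With only 10 rows retained out of 200 in this toy case, it is easy to see how a small class or a narrow region of the feature space can be missed entirely, which is consistent with the dataset-dependent failures described above.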
Overall, halving the dataset size is reliable regardless of the method used, whereas the outcome of stronger distillation is both method- and dataset-dependent. Among the evaluated approaches, G-Coreset proved to be the most stable.
9. Conclusions
In this paper, we evaluated seven unsupervised distillation methods across 17 classification and nine regression tasks. Distillation was assessed in terms of downstream predictive performance using off-the-shelf learners such as Random Forest, Support Vector Machine, and XGBoost.
This work does not propose new distillation algorithms; instead, it provides a systematic empirical evaluation of existing methods within a unified experimental framework. The investigation was conducted in a structured and disciplined manner: dataset reductions were performed at fixed and meaningful percentages, and while the learners were moderately tuned, the tuning process was applied consistently across all experiments.
We found coreset-based methods to be the most reliable from multiple perspectives, including both predictive performance and computational efficiency. More recent neural network-based methods were, somewhat surprisingly, found to be inferior in this setting. Among the coreset approaches, the Leverage Score-based coreset emerged as a robust all-around solution, while G-Coreset was able to achieve better performance under low distillation ratios, albeit with occasionally high runtimes.
We also investigated the correlation between distillation performance and dataset tailness and found that the proportion of outliers is moderately anti-correlated with performance across all methods, with a stronger effect observed for coreset-based approaches. This suggests that precomputing the outlier ratio may help form expectations regarding distillation success.
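As an illustration of such a precomputation, the sketch below estimates the proportion of outlying rows with the common 1.5 × IQR rule. This rule is an assumption made for the example and is not necessarily the outlier definition behind the tailness metrics used in this study. The function name and the parameter k are hypothetical.

```python
import numpy as np

def outlier_ratio(X, k=1.5):
    """Fraction of rows with at least one feature outside the k * IQR fences."""
    X = np.asarray(X, dtype=float)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A row counts as an outlier if any of its features falls outside the fences.
    is_outlier = np.any((X < lower) | (X > upper), axis=1)
    return float(is_outlier.mean())

# Usage: one extreme value among ten samples of a single feature.
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)
print(outlier_ratio(X))
```

A check of this kind is cheap compared with running a distillation method, so it can be computed before choosing a reduction ratio.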
In addition, we examined dataset-level learning consistency by measuring the variation in learner predictive performance with respect to hyperparameter changes. Distillation to smaller dataset sizes does introduce some variation: the variance increases roughly tenfold.
Lastly, we contribute to the community by making our code fully public, enabling others to reproduce and extend our experiments.