Dataset Condensation via Expert Subspace Projection

The rapid growth in dataset sizes in modern deep learning has significantly increased data storage costs. Furthermore, the computation and time costs of training deep neural networks are generally proportional to the dataset size. Therefore, reducing the dataset size while maintaining model performance is an urgent research problem. Dataset condensation is a technique that aims to distill the original dataset into a much smaller synthetic dataset while maintaining downstream training performance regardless of the neural network architecture used. Previous work has demonstrated that matching the training trajectory between the synthetic dataset and the original dataset is more effective than matching the instantaneous gradient, as it incorporates long-range information. Despite the effectiveness of trajectory matching, it suffers from complex gradient unrolling across iterations, which leads to significant memory and computation overhead. To address this issue, this paper proposes a novel approach called Expert Subspace Projection (ESP), which leverages long-range information while avoiding gradient unrolling. Instead of strictly enforcing the synthetic dataset's training trajectory to mimic that of the real dataset, ESP only constrains it to lie within the subspace spanned by the training trajectory of the real dataset. The memory-saving advantage offered by our method facilitates unbiased training on the complete set of synthetic images and seamless integration with other dataset condensation techniques. Extensive experiments demonstrate the effectiveness of our approach. Our method outperforms the trajectory matching method on CIFAR10 by 16.7% in the setting of 1 Image/Class, surpassing the previous state-of-the-art method by 3.2%.


Introduction
With the rapid development of the Internet, a growing number of large-scale datasets are being collected to obtain state-of-the-art machine learning models in multiple fields, including computer vision, natural language processing, and speech recognition [1]. Such rapid growth of dataset scale results in increasingly expensive model training, and at some scales, even storing and preprocessing the data are burdensome. For instance, the training of a recent language model, GPT-3 [2], consumes an astonishing 190 MWh of electricity, generating approximately 85,000 kg of CO2 according to US carbon emission standards, which is equivalent to the emissions of driving a car 700,000 km. An intuitive solution is to select a representative subset from the original dataset, commonly referred to as coreset selection. However, previous work shows that, when facing strict compression ratios, coreset selection methods suffer from severe information loss [1] and cannot compete with dataset condensation techniques [3,4], which distill the original dataset into a much smaller synthetic dataset. Typically, this type of method adopts bi-level optimization, which involves an inner optimization for model updates and an outer optimization for synthetic image updates. To distill the essential information from the original dataset into the synthetic dataset, a suitable matching objective must be defined.
The matching objective includes distribution matching [5], gradient matching [6,7], and meta-model matching [8,9]. Notably, a trajectory matching objective [10] has recently been introduced, demonstrating significant performance improvements over other matching objectives. This objective aims to align the trajectory of the network trained with the synthetic dataset images with the parameter trajectory of the network trained with the original dataset. Unlike gradient matching methods, which only consider the instantaneous training dynamics (i.e., gradients) between the synthetic and real datasets, trajectory matching recognizes that the long-range training dynamics (i.e., training trajectory) provide more informative constraints for dataset condensation.
However, the limitation of trajectory matching is the substantial cost of computation and memory involved in executing multiple unrolled gradient computations for the recursive computational graph when mimicking the training trajectory. This hampers the feasibility of training on the complete set of synthetic images without resorting to slicing and may introduce bias when conducting mini-batch training on the sliced data, ultimately affecting the final performance.
This paper presents a novel matching objective called Expert Subspace Projection (ESP), which effectively guides the dataset condensation process with long-range training dynamics while significantly reducing computational and memory costs compared to trajectory matching. Instead of strictly enforcing the synthetic dataset's training trajectory to mimic that of the real dataset, ESP only constrains it to lie within the subspace spanned by the training trajectory of the real dataset. Our core technical concept is illustrated in Figure 1. We extract an arbitrary segment from the parameter trajectory obtained through training on the original dataset $D_{real}$, whose starting point is $\theta^*_t$ and ending point is $\theta^*_{t+T}$. For training with the synthetic data, the neural network is initialized with $\theta^*_t$, and the gradient of the parameters at that point is $G$. In gradient matching [6], the objective is to align $G$ with $\theta^*_{t+1} - \theta^*_t$. Conversely, in trajectory matching [10], the objective is to align $\theta_{t+T}$ with $\theta^*_{t+T}$ after $T$ iterations. Both methods have their advantages and disadvantages. Gradient matching, although computationally simpler, cannot effectively utilize long-range information. On the other hand, trajectory matching can incorporate long-range information but requires the gradient to be unrolled through $T$ iterations during computation.
Our method, however, circumvents the drawbacks of both approaches while inheriting their respective advantages. First, we construct a subspace $S_{\tau^*}$ spanned by the training trajectory $\{\theta^*_i\}_{i=t}^{t+T}$. This subspace captures a substantial amount of the information carried by the training trajectory on the real dataset. Consequently, by confining the optimization gradient $G$ of each step within this subspace when training with the synthetic dataset, we are able to distill a significant portion of the information inherent in the real dataset into the synthetic one. Specifically, we define a new objective function $\mathcal{L}_{Proj}$ to penalize the norm of the residual vector between $G$ and its projection $G_{S_{\tau^*}}$ onto the subspace $S_{\tau^*}$. This approach not only circumvents the requirement for gradient unrolling across $T$ iterations but also effectively utilizes the long-range information embedded within the optimization process. The memory consumption of our method is not affected by the number of steps in the neural network optimization (inner optimization) since the synthetic images are updated at every step. This significantly reduces the memory footprint of training compared to previous trajectory matching methods. Consequently, our method facilitates training on the entire set of synthetic images without the need for slicing and can be seamlessly integrated with other techniques such as distribution matching [5] and KFS [1]. Our method has been extensively tested on four widely used dataset condensation benchmarks, and the results demonstrate its effectiveness. In particular, our ESP method outperforms trajectory matching by a significant margin, establishing a new state of the art.
Figure 1. Diagram illustrating the proposed Expert Subspace Projection (ESP) method. The expert trajectory consists of weight snapshots obtained during training with the original dataset $D_{real}$. Each node $\theta^*_t$ represents a saved weight snapshot at the end of step $t$. The subspace $S_{\tau^*}$ is spanned by the set of weight snapshots $\{\theta^*_i\}_{i=t}^{t+T}$. $G$ represents the parameter gradient vector generated during training with the synthetic dataset $D_{syn}$. The subspace projection loss $\mathcal{L}_{Proj}$ penalizes the norm of the residual vector between $G$ and its projection $G_{S_{\tau^*}}$ onto the subspace $S_{\tau^*}$.

Dataset Condensation
To reduce the resources required for deep neural networks, researchers commonly use knowledge distillation [11][12][13][14] to distill large, complex models into smaller ones while retaining results comparable to those before compression. The concept of knowledge distillation was later transferred to datasets. Wang et al. [4] first proposed dataset condensation, which uses meta-learning [8,9] methods to compress the knowledge of the entire training dataset into a small amount of synthetic data and achieves high accuracy through several steps of gradient descent on the synthetic data. Subsequently, many works have utilized gradient and trajectory matching [6,7,10] and distribution matching [1,5] for optimization. Dataset Condensation (DC) [6] assumes that the optimization process on the synthetic dataset is very close to that on the real dataset, so it optimizes the synthetic dataset by matching the gradients of the model trained on the synthetic dataset with those obtained on the real dataset. Differentiable Siamese Augmentation (DSA) [7] builds on DC [6] and applies a set of data augmentation strategies while learning the synthetic images, thereby exploiting the information in the real training images more fully and transferring this augmented knowledge to the synthetic images. Distribution matching (DM) [5] matches the features of the real samples and the synthetic samples that are output at the last layer of the neural network. These neural networks are randomly initialized, which keeps the method computationally efficient while still achieving high accuracy. Knowledge Factorization and Sharing (KFS) [1] also uses distribution matching [5], introducing a new latent code decoder architecture, which greatly increases the number of modalities of synthetic images with the same number of parameters, thus achieving a new state of the art.
Trajectory matching [10] encourages the synthetic dataset to mimic the long-range training dynamics of the real dataset by mimicking the expert trajectory generated by the real dataset. Although long-range information helps it achieve satisfactory results, the method requires accumulating multiple computational graphs, which greatly increases memory consumption and even introduces a subsampling bias because the batch size must be reduced to save memory. Our ESP method alleviates the memory consumption issue by projecting the model gradient onto the expert subspace instead of matching model parameters, allowing the model to be trained on the complete synthetic dataset while still utilizing the long-range information of expert trajectories.

Subspace Training
Generally, deep neural networks come with a large number of parameters, which tend to have strong correlations, thus resulting in great redundancy. Guy et al. [15] first proposed the hypothesis that, in various large-scale deep learning scenarios, gradients dynamically converge to a very small subspace after short-term training, so gradient descent in the subspace will yield a similar loss reduction. Li et al. [16] try to optimize network parameters in a small, random subspace instead of the original parameter space, then slowly increase the dimension of this subspace. Eventually, the authors find that the intrinsic dimension required for model training is smaller than one might think. While this training holds promise for more efficient and scalable optimization schemes, its practical application is limited by poor optimization performance. Gressmann et al. [17] made some optimizations to the stochastic subspace approach, achieving further improvements by applying independent projections to different parts of the network, making the approximation more efficient as the network dimensionality grows. After that, Li et al. [18] extracted the landscape by analyzing the optimization trajectories, while also verifying that many standard neural network structures can be trained well with only 40 independent variables and that the performance is almost the same as conventional training with all parameters. Inspired by [18], we span the parameters from the expert trajectory into an expert subspace and encourage the synthetic data to learn information about real data within the expert subspace, thereby optimizing synthetic data in an efficient way.

Coreset Selection
Coreset selection [19][20][21][22][23] is an approximate replacement of the original large dataset with a small dataset such that the small dataset still provides rich information, making the accuracy on the test dataset very close to the original dataset. However, such methods often come with a trade-off between performance and dataset size, as they produce a rough approximation of the full dataset. Dataset condensation is very similar to coreset selection, but dataset condensation [1,5,6,10] is more robust. It mainly uses the original dataset to synthesize some learnable pictures and then captures the rich information encoded in the original dataset to realize the compression of the original dataset. These learned images do not appear in the original dataset.

Method
The primary objective of dataset condensation is to efficiently compress a real dataset $D_{real}$ into a significantly smaller synthetic dataset $D_{syn}$ while minimizing the loss in performance during downstream training. In line with previous studies [6,7,10], our method also adopts a bi-level optimization framework. This framework consists of an inner optimization for the neural network's parameters and an outer optimization for synthetic images.
To begin with, we present the comprehensive framework of our Expert Subspace Projection (ESP) method, as illustrated in Figure 2. In the inner optimization, we utilize the binary cross-entropy (BCE) loss as the objective function. This loss penalizes the misclassification of synthetic images based on their pre-defined ground-truth labels. Notably, rather than using the original gradient descent, we update the neural network's parameters using projected gradient descent within the expert subspace. This constraint ensures that the training trajectory (student trajectory) remains confined within the expert subspace. In the outer optimization, we employ a combination of the projection loss and the distribution matching loss [5] as the objective function. The first loss term penalizes any deviation of gradients from the desired subspace, while the second term aligns the feature distribution between real and synthetic images. During this process, the synthetic images are updated using normal gradient descent.
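To make the second outer-loss term more concrete, the following minimal sketch computes a distribution-matching style discrepancy as the squared distance between the mean feature vectors of real and synthetic samples. This form is our reading of distribution matching [5] and is an assumption, since the exact loss is not spelled out in this section; the function names and the toy feature values are hypothetical, and the features are assumed to come from some embedding network.

```python
def mean_vector(features):
    """Column-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def distribution_matching_loss(real_features, syn_features):
    """Squared Euclidean distance between mean feature vectors (assumed form)."""
    mu_real = mean_vector(real_features)
    mu_syn = mean_vector(syn_features)
    return sum((r - s) ** 2 for r, s in zip(mu_real, mu_syn))


# Hypothetical 2-D features for one class: two real samples, one synthetic sample.
real_features = [[1.0, 2.0], [3.0, 4.0]]
syn_features = [[2.0, 2.0]]
dm_loss = distribution_matching_loss(real_features, syn_features)
```

The discrepancy vanishes when the two sets share the same mean feature, which is why this term constrains static feature statistics rather than training dynamics.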

Figure 2. Overview of our ESP method for dataset condensation. We train the model with the complete real dataset to obtain the expert trajectory. An informative segment of length T is extracted and flattened into one-dimensional vectors, forming a subspace. The condensation process utilizes a bi-level optimization framework. The inner stage refines the model's parameters within the expert subspace using projected gradient descent, while the outer stage updates the synthetic dataset. Distribution matching loss penalizes feature distribution discrepancy, and projection loss penalizes gradients outside the expert subspace.
In subsequent sections, we delve into the technical details of our method, providing a comprehensive explanation of its underlying mechanisms. Furthermore, we conduct a thorough analysis to demonstrate the memory consumption advantage offered by our approach.

Preliminaries
The long-range training trajectory of the original dataset, referred to as the expert trajectory, has been empirically demonstrated to be effective in guiding the condensation of the synthetic dataset in trajectory matching [10]. As illustrated in Figure 2, an expert trajectory consists of a series of parameter snapshots saved during training with $D_{real}$. Each snapshot, denoted as $\theta^*_t$, represents the parameters saved after the t-th epoch and is flattened into a one-dimensional vector. In trajectory matching, the inner optimization performs $N$ update steps on the neural network's parameters, i.e., from $\theta_t$ to $\theta_{t+N}$, supervised by the classification loss on the synthetic images $D_{syn}$. To ensure that the student trajectory begins from the same starting point as the expert trajectory, $\theta_t$ is initialized with $\theta^*_t$. The outer optimization performs a single update step on the synthetic images, supervised by the trajectory matching loss:
$$\mathcal{L}_{TM} = \frac{\lVert \theta_{t+N} - \theta^*_{t+T} \rVert_2^2}{\lVert \theta^*_t - \theta^*_{t+T} \rVert_2^2} \qquad (1)$$
where $N$ is the length of the student trajectory and $T$ is the length of the expert trajectory. The two lengths need not be equal because the synthetic dataset is much smaller than the original dataset. In practice, $N$ is intentionally set to be smaller than $T$.
In most cases, the student trajectory is indeed much shorter. Where both trajectories are written with the same length $T$ elsewhere in this paper (as in the Introduction), this is for the sake of simplicity and without loss of generality. $\mathcal{L}_{TM}$ encourages $D_{syn}$ to mimic the long-range training dynamics of $D_{real}$. However, because the inner optimization comprises $N$ update steps, computing the gradient for the outer optimization requires unrolling the gradients through all $N$ iterations. This unrolling process and the need to save the computation graph from multiple updates result in significantly higher memory consumption compared to other methods [5][6][7]. This high memory usage necessitates the use of a slice of the synthetic dataset, which may introduce a biased optimization. In Section 3.5, we discuss the impact of this issue in detail and analyze the memory usage.
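As a small illustration of the trajectory matching objective of [10], the sketch below evaluates a normalized squared distance between the end of the student trajectory and the end of the expert segment. The normalized form is stated here as our understanding of [10]; the function name and the three-dimensional toy parameter vectors are hypothetical.

```python
def trajectory_matching_loss(theta_student_end, theta_expert_start, theta_expert_end):
    """||student_end - expert_end||^2 / ||expert_start - expert_end||^2."""
    numerator = sum(
        (s - e) ** 2 for s, e in zip(theta_student_end, theta_expert_end)
    )
    denominator = sum(
        (a - e) ** 2 for a, e in zip(theta_expert_start, theta_expert_end)
    )
    return numerator / denominator


# Hypothetical flattened parameter vectors.
theta_star_start = [0.0, 0.0, 0.0]   # expert snapshot at the segment start
theta_star_end = [1.0, 2.0, 2.0]     # expert snapshot at the segment end
theta_student_end = [1.0, 2.0, 1.0]  # student parameters after its inner steps

tm_loss = trajectory_matching_loss(
    theta_student_end, theta_star_start, theta_star_end
)
```

Under this form, a student that never leaves the shared starting point scores exactly 1 and a student that reaches the expert end point scores 0. Note that the sketch only evaluates the loss: differentiating it with respect to the synthetic images is what requires unrolling through every inner step.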

Expert Subspace Projection
Our proposed ESP method leverages the benefits of long-range training dynamics obtained from expert trajectories. Importantly, it addresses the issue of linear memory growth that arises when unrolling the gradient through iterations, as the inner optimization solely entails a single-step update. The crux of the challenge lies in emulating the long-range training dynamics despite the constraint of having only one update.
Our solution is simple yet effective. During the inner optimization, we confine each optimization step to the subspace spanned by the expert trajectory. Simultaneously, during the outer optimization, we penalize the residual gradient that deviates from this subspace. The expert subspace $S_{\tau^*}$ is spanned by the snapshot vectors of the selected segment:
$$S_{\tau^*} = \mathrm{span}\{\theta^*_t, \theta^*_{t+1}, \ldots, \theta^*_{t+T}\} \qquad (2)$$
By applying Gram-Schmidt orthonormalization, we obtain an orthonormal basis for the subspace $S_{\tau^*}$:
$$E = \mathrm{OS}([\theta^*_t, \theta^*_{t+1}, \ldots, \theta^*_{t+T}]) \qquad (3)$$
where the notation OS(·) represents the orthogonal standardization operator and the columns of $E$ are the resulting basis vectors. It is worth emphasizing that the length of the expert trajectory is considerably smaller than the dimension of the parameter vector. Our experimental findings provide evidence that this expert subspace encapsulates the majority of long-range training dynamics exhibited by the expert trajectories. Therefore, it can serve as a reliable proxy for capturing these dynamics. Consistent with trajectory matching [10], the expert trajectory can be generated and stored offline before the data condensation process begins. This approach helps to conserve memory and reduce computation costs.
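The construction of the expert subspace basis can be sketched as follows. This is a minimal illustration, assuming that the snapshots are given as plain lists of floats and using a modified Gram-Schmidt procedure; the tolerance used to drop linearly dependent snapshots and all names are our own choices, not part of the method description.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis of the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the components of w that already lie along earlier basis vectors.
        for e in basis:
            c = dot(w, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        # Skip snapshots that add no new direction (assumed tolerance).
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


# Hypothetical flattened snapshots; the third is a combination of the first two.
snapshots = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
]
E = orthonormal_basis(snapshots)
```

Here `E` holds the basis vectors as rows of a list, so three snapshots in a four-dimensional parameter space yield a two-dimensional subspace. For real networks, the number of snapshots is far smaller than the parameter dimension, so the basis stays small.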

Inner Optimization
At the onset of the dataset condensation process, we initialize the synthetic images from a random Gaussian distribution, along with pre-assigned category labels. The neural network is initialized with the starting point of the expert trajectory, i.e., $\theta_t := \theta^*_t$. Following the bi-level optimization framework, we first update the neural network in the inner optimization phase and then update the synthetic images in the outer optimization phase. The gradient for the neural network in the inner optimization is computed as follows:
$$G = \nabla_{\theta_t} \mathcal{L}_{BCE}(M(D_{syn}; \theta_t)) \qquad (4)$$
where $M$ represents the neural network model, and $\mathcal{L}_{BCE}$ represents the binary cross-entropy loss computed against the pre-assigned labels of $D_{syn}$. Rather than directly utilizing this gradient to update the neural network, we project it onto the expert subspace to align with the long-range training dynamics of the original data:
$$G_{S_{\tau^*}} = E E^{T} G \qquad (5)$$
where the columns of $E$ are the orthonormal basis vectors of the expert subspace and $E^{T}$ represents the transpose of $E$. The neural network is then updated with the projected gradient in the expert subspace:
$$\theta_{t+1} = \theta_t - \alpha \, G_{S_{\tau^*}} \qquad (6)$$
where $\alpha$ represents the learning rate. Equation (6) uses plain gradient descent for illustration. Other gradient-based optimization algorithms can also be employed in this context.
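The projected update can be sketched in a few lines. The example assumes that an orthonormal basis of the expert subspace is already available as a list of vectors and that the gradient has been computed elsewhere; the helper names and the numeric values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(G, E):
    """Projection of G onto span(E): sum_i <G, e_i> e_i for orthonormal e_i."""
    G_S = [0.0] * len(G)
    for e in E:
        c = dot(G, e)
        G_S = [g + c * ei for g, ei in zip(G_S, e)]
    return G_S


def projected_step(theta, G, E, alpha):
    """One plain gradient-descent step using only the in-subspace gradient."""
    G_S = project(G, E)
    return [th - alpha * g for th, g in zip(theta, G_S)]


# Hypothetical orthonormal basis of a 2-D expert subspace in a 4-D space.
E = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
theta_t = [0.5, -0.5, 2.0, 3.0]
G = [1.0, 2.0, 3.0, 4.0]
theta_next = projected_step(theta_t, G, E, alpha=0.1)
```

Only the first two coordinates of `theta_t` move, because the last two components of `G` lie outside the subspace and are discarded by the projection. Summing `<G, e_i> e_i` over the basis avoids ever forming the large matrix product of the basis with its transpose.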

Outer Optimization
In the outer optimization, we regard the gradient component outside the expert subspace as a measure of the disparity between the synthetic dataset and the original dataset. To reduce this disparity, we introduce a penalty that minimizes the norm of the residual gradient:
$$\mathcal{L}_{Proj} = \lVert G - G_{S_{\tau^*}} \rVert \qquad (7)$$
where $\lVert \cdot \rVert$ indicates the $\ell_2$ norm. We additionally integrate the distribution matching loss $\mathcal{L}_{DM}$ [5]. In subsequent experimental analyses, we observed that the integration of this loss function resulted in improved accuracy in downstream training and enhanced the level of detail in synthesized images. These findings provide compelling evidence for the complementary nature of these two loss functions. The final objective function $\mathcal{L}_{Out}$ combines $\mathcal{L}_{Proj}$ and $\mathcal{L}_{DM}$, where $\beta$ is the hyper-parameter that balances these two losses. Finally, we update the synthetic dataset $D_{syn}$ with gradient descent on $\mathcal{L}_{Out}$. Our comprehensive data condensation process is summarized in Algorithm 1.
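The projection penalty can be illustrated with a short sketch that measures how much of a gradient falls outside the expert subspace. It assumes an orthonormal basis given as a list of vectors and an unsquared Euclidean norm; the function names and values are hypothetical, and the weighting against the distribution matching term is deliberately left out because its exact form is not given here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(G, E):
    """Projection of G onto the span of the orthonormal vectors in E."""
    G_S = [0.0] * len(G)
    for e in E:
        c = dot(G, e)
        G_S = [g + c * ei for g, ei in zip(G_S, e)]
    return G_S


def projection_loss(G, E):
    """Euclidean norm of the part of G lying outside span(E)."""
    G_S = project(G, E)
    residual = [g - gs for g, gs in zip(G, G_S)]
    return math.sqrt(dot(residual, residual))


# Hypothetical orthonormal basis and gradients in a 4-D parameter space.
E = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
G_outside = [1.0, 2.0, 3.0, 4.0]  # has components outside the subspace
G_inside = [1.0, 2.0, 0.0, 0.0]   # lies entirely inside the subspace

l_proj_outside = projection_loss(G_outside, E)
l_proj_inside = projection_loss(G_inside, E)
```

A gradient that already lies in the expert subspace incurs no penalty, while the out-of-subspace components (3, 4) give a residual norm of 5. In an actual implementation the penalty would be differentiated with respect to the synthetic images through an automatic differentiation framework, which this sketch does not attempt.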

Memory Consumption
Trajectory matching [10] involves N update steps in the inner optimization and a single update step in the outer optimization. Therefore, the calculation of the gradient for the outer optimization needs to unroll the gradient through all N steps. As a result, all intermediate computational graphs and variables must be stored in GPU memory. Accurately estimating GPU memory consumption, also known as the memory footprint, is a complex task that depends on various factors, such as the specific operators utilized in the neural network, the connectivity of the computational graph, and the choice of deep learning framework. However, in theory, the memory footprint scales approximately linearly with the number of inner optimization steps N. Our method involves a single step of inner optimization followed by a single step of outer optimization. Therefore, our memory consumption is significantly smaller compared to trajectory matching, while still allowing us to benefit from the valuable long-range expert trajectory.

Algorithm 1: Expert Subspace Projection
Input: the set of expert trajectories $\{\tau^*_k\}_{k=1}^{K}$; the length of the selected expert trajectory $T$; the length of the student trajectory $N$; the learning rate $\alpha$;
Output: synthetic dataset $D_{syn}$;
Initialize $D_{syn}$ with Gaussian distribution;
while not converged do
    Randomly select a trajectory from the set $\{\tau^*_k\}_{k=1}^{K}$;
    Randomly extract a segment of length $T$: $\tau^* = \{\theta^*_i\}_{i=t}^{t+T}$;
    Construct the expert subspace $S_{\tau^*}$;
    Initialize $\theta_t$ with $\theta^*_t$;
    for n = 1 to N do
        // Inner Optimization
        Calculate $G$ with Equation (4);
        Project $G$ onto $S_{\tau^*}$ to obtain $G_{S_{\tau^*}}$ with Equation (5);
        Calculate $\mathcal{L}_{Out}$ with Equation (9);

Experiments
We validate the effectiveness of our proposed method on various classification benchmark datasets, assessing its ability to generalize across different architectures. In addition, we conduct ablation experiments and provide visualization results to further substantiate the efficacy of our approach.

CIFAR10/100 [24]. The CIFAR10 dataset is a collection of 60,000 color images, each measuring 32 × 32 pixels. These images are classified into 10 distinct categories representing common objects and animals such as airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each category contains 6000 images, and the dataset is split into a training set with 50,000 images and a test set with 10,000 images. CIFAR10 is widely recognized as a benchmark for image classification tasks, playing a crucial role in the development and evaluation of various machine learning and deep learning algorithms.
CIFAR100 consists of 100 classes, with each class containing 600 images. These 100 classes are grouped into 20 superclasses, with each superclass containing 5 classes. The CIFAR100 dataset covers a diverse range of object categories, including animals, vehicles, household items, and natural objects. Similar to CIFAR10, CIFAR100 is divided into a training set (50,000 images) and a test set (10,000 images). It serves as a benchmark for addressing more intricate and fine-grained classification tasks, enabling researchers to delve into complex image recognition problems.
TinyImageNet [25]. TinyImageNet is extensively utilized in computer vision research and benchmarking as a downsized variant of the renowned ImageNet dataset [26]. Its purpose is to offer a more manageable alternative for training and evaluating deep learning models, given the vast scale of the original ImageNet dataset. Comprising 200 distinctive object classes, the TinyImageNet dataset encompasses an approximate quantity of 500 training images, 50 validation images, and 50 test images per class. It boasts a diverse array of object categories, encompassing animals, commonplace objects, as well as a variety of natural and human-created items. Notably, each image within the dataset maintains a resolution of 64 × 64 pixels.

SVHN [27]. The SVHN (Street View House Number) Dataset is a valuable collection of real-world images extracted from door numbers captured by Google Street View. It provides a significant amount of data, with more than 70,000 numbers specifically designated for training and an additional 20,000 numbers reserved for testing. Similar to the well-known MNIST [28] dataset, each image in SVHN is 32 × 32 pixels in size and focuses on a single character situated at the center. Notably, many images in SVHN also contain distractors placed alongside the main character of interest. This dataset is highly useful for tasks like digit recognition and character segmentation, allowing researchers to tackle challenges posed by real-world scenarios and evaluate the performance of various machine learning algorithms.

Implementation Details
For fairness and convenience of comparison, we use the same suite of differentiable augmentations as in previous work, as well as the same number of latent code decoder parameters as in [1], which matches the size of 1, 10, and 50 images per class. Prior to the condensation step, we pre-compute 1000 expert trajectories on ConvNet-3 for each dataset. These pre-computed trajectories are then utilized in all our experiments to expedite the condensation process. This approach follows a similar technique used in trajectory matching [10]. During the distillation process, we have the flexibility to randomly select one expert trajectory as the teacher, which saves time in optimizing the expert network.
For the distillation process, we use $\beta = 1 \times 10^{-5}$ to balance the losses $\mathcal{L}_{DM}$ and $\mathcal{L}_{Proj}$. The learning rate of the neural network (ConvNet-3 by default) is set to a constant value of 0.1. The learning rates of the synthetic dataset, which is represented by the latent code and decoder, follow a linear decay schedule. The initial learning rate for the latent code is set to 0.1, while the initial learning rate for the decoder is set to 0.01. To optimize the model, we employ SGD as the inner optimizer and Adam [29] as the outer optimizer. More details can be found in Algorithm 1.
For the evaluation, we train the condensed data on ConvNet-3 for accuracy, while using ResNet [30] and DenseNet [31] for cross-architecture ability. To evaluate the performance of classification models trained with the condensed datasets, we report the mean classification accuracy and its corresponding standard deviation across 5 runs with different random seeds.

Cross-Architecture Generalization
In the context of dataset condensation, an ideal synthetic dataset should produce training effects similar to those of the original dataset on downstream models with arbitrary structures. Consequently, cross-architecture generalization is a key metric. We used our synthetic data generated on ConvNet-3 (0.32 M parameters) to train different models, including ResNet-10 [30] and DenseNet-121 [31]; the results are reported in Tables 2 and 3.

Table 2. Cross-architecture experiments. Conv3, RN10, and DN121 denote ConvNet-3, ResNet-10, and DenseNet-121, respectively. We train on ConvNet-3 and evaluate on the three architectures. The results of other methods are reported in their respective papers [1,5,7,38]. For our method, we report mean accuracy and standard deviation over five runs with different random seeds.

We can see that the performance of ESP is more robust to changes in network architecture and achieves state-of-the-art performance on most of the network architectures. For the SVHN [27] dataset, the accuracy on ResNet-10 is generally higher than that of our baseline model (ConvNet-3), by up to 2.1%. These experiments provide evidence that the synthetic dataset generated by ESP generalizes better than those generated by other methods, showcasing the superior ability of our ESP method to capture representative information from the original datasets. However, the experimental results also reveal that when there is a significant architectural difference between the training and testing phases, the cross-architecture performance is weakened. This suggests that a cross-architecture generalization problem persists across all dataset condensation methods.

Memory Analysis
We perform an experimental comparison of the memory consumption between our ESP method and trajectory matching [10] on the CIFAR10 [24] dataset. In line with our analysis in Section 3.5, trajectory matching exhibits a linear increase in memory consumption with the length of the student trajectory due to its inner optimization steps being equal to the student trajectory length. This theoretical analysis is confirmed by Figure 7a. Notably, the memory consumption of trajectory matching increases approximately linearly with the length of the student trajectory, while our ESP method remains unaffected.
We conducted an additional experiment to demonstrate that our ESP method has a significantly lower growth rate in memory consumption compared to trajectory matching when increasing the size of the synthetic dataset. Figure 7b illustrates this observation, where our ESP method only exhibits a slight increase in memory usage as the IPC increases. In contrast, the memory consumption of the trajectory matching [10] method experiences a substantial surge, rendering it ineffective for training on the complete synthetic dataset with higher IPC. Conversely, our method can directly handle the complete synthetic dataset, enabling unbiased training, as we elaborate on in Section 4.6.

Synthetic Batch Size Analysis
Based on the analysis presented in [1], it is noteworthy that training the synthetic dataset using batch optimization, where the complete synthetic dataset is divided into several batches, introduces a bias in the gradient. This bias stems from disregarding the interactions between different synthetic images, resulting in a reduction in diversity among images within classes. The ablation experiments shown in Figure 8 quantify the impact of this issue. In the figure, we trained synthetic images on the CIFAR10 dataset with various batch sizes, ranging from small (batch size = 64) to large (batch size = 1024). It is evident that the accuracy increases with batch size and stabilizes with larger batch sizes, reflecting the property that the bias of the gradient of the synthetic dataset decreases as the batch size increases. Thanks to the efficient use of memory in our ESP method, we are able to train on the complete set of the synthetic dataset, which allows us to achieve better performance than trajectory matching.

Ablation Study
As discussed in Section 3.4, the projection loss and the distribution matching loss are highly complementary. The projection loss aligns the long-range training dynamics of the synthetic and original datasets, while the distribution matching loss matches the static feature distributions of the two datasets. Together, the two losses align both the dynamic and the static aspects of the datasets. This observation is supported by the results of our ablation experiments in Table 4: each loss alone yields suboptimal performance, whereas their combination yields the best results. We further visualize the synthetic images with and without the distribution loss. As can be observed in Figure 9, introducing the distribution loss produces synthetic images with more detailed textures and more recognizable visual concepts. This may be because the projection loss primarily constrains higher-order information, such as gradients, whereas low-level information such as texture is predominantly constrained by the distribution matching loss.
Figure 9. Condensed synthetic images produced by our method. The left part shows images trained without the distribution loss, which have a more abstract style. The right part shows images trained with the distribution loss, which exhibit more pronounced and detailed texture information. Experiments are conducted on CIFAR10 [24].
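To make the interplay of the two terms concrete, the following minimal sketch combines a projection-style loss with a mean-feature distribution loss. It is an illustrative assumption and not the exact formulation of Section 3.4. The function names, the use of consecutive checkpoint differences to define the expert subspace, the QR-based basis, and the `weight` parameter are all hypothetical. The expert differences are also assumed to be linearly independent.

```python
import numpy as np


def subspace_basis(expert_checkpoints):
    """Orthonormal basis of the span of consecutive expert parameter
    differences (assumed definition of the expert subspace)."""
    deltas = np.diff(expert_checkpoints, axis=0)   # shape (T-1, P)
    q, _ = np.linalg.qr(deltas.T)                  # shape (P, T-1)
    return q


def projection_loss(student_update, basis):
    """Squared norm of the component of the student update that lies
    outside the expert subspace."""
    inside = basis @ (basis.T @ student_update)
    return float(np.sum((student_update - inside) ** 2))


def distribution_loss(real_features, synthetic_features):
    """Squared distance between mean real and mean synthetic features."""
    diff = real_features.mean(axis=0) - synthetic_features.mean(axis=0)
    return float(np.sum(diff ** 2))


def total_loss(student_update, basis, real_features, synthetic_features, weight=1.0):
    """Weighted sum of the two terms; weight is a hypothetical knob."""
    return (projection_loss(student_update, basis)
            + weight * distribution_loss(real_features, synthetic_features))


# Toy expert trajectory in a 3-parameter space; its differences span the x-y plane.
checkpoints = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
basis = subspace_basis(checkpoints)
in_plane = projection_loss(np.array([2.0, -3.0, 0.0]), basis)
off_plane = projection_loss(np.array([0.0, 0.0, 2.0]), basis)
print(in_plane, off_plane)
```

In this toy setting an update inside the expert subspace incurs essentially zero projection loss regardless of its magnitude or direction within the subspace, whereas an update orthogonal to the subspace is penalized by its full squared norm. The distribution term is independent of the update and depends only on the feature statistics, which is one way to read the complementarity described above.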

Conclusions
In this paper, we have proposed a novel dataset condensation method called Expert Subspace Projection (ESP) that effectively utilizes long-range training dynamics while reducing computational overhead compared to prior trajectory matching techniques. Our key insight is to constrain model optimization to remain within the subspace spanned by expert trajectories from the original dataset. This avoids expensive unrolling of gradients across multiple steps, enabling memory-efficient training of the complete set of synthetic data. We have validated ESP extensively on image classification tasks, demonstrating state-of-the-art results on CIFAR, SVHN, and TinyImageNet datasets compared to existing condensation methods. Importantly, we have shown ESP's superior ability to transfer condensed datasets to unseen architectures, indicating it effectively distills dataset knowledge in an architecture-agnostic manner. Overall, ESP provides an effective and scalable solution for dataset condensation, resulting in the synthesis of highly informative compact datasets. This technique enables the application of modern deep learning approaches in resource-constrained settings, where memory or computational resources are limited. Moreover, ESP contributes to minimizing the energy consumption needed for training models.

Limitations and Future Work
Despite the effectiveness of our Expert Subspace Projection (ESP) approach in reducing memory usage and computational requirements compared to the previous trajectory matching [10] approach, it is crucial to acknowledge that ESP still operates within a bi-level optimization framework. Consequently, extending ESP to large datasets that contain high-resolution images presents a significant challenge, similar to previous bi-level optimization methods [1,6,7,10,36,37]. This limitation hampers the application of dataset condensation to tasks such as fine-grained classification, which heavily depends on high-resolution images for capturing intricate details. Therefore, it is imperative to focus further efforts on exploring strategies to minimize memory usage and computational requirements. Promising directions include disentangling the outer optimization from the inner optimization and approximating the inner optimization using a convex proxy model.