Xindi Wu1 Byron Zhang1 Zhiwei Deng2 Olga Russakovsky1
1Princeton University 2Google DeepMind
{xindiw, zishuoz, olgarus}@princeton.edu, zhiweideng@google.com
Abstract
Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation almost doubles that to 9.9% with just 100 training pairs, an order of magnitude fewer.
1 Introduction
Data = Information + Irrelevant Data (Wright & Ma, 2022)
Dataset distillation aims to create concise summaries of data that preserve most of the critical information of the entire dataset. It holds paramount importance in the era of big data as it addresses the challenge posed by "Data = Information + Irrelevant Data" (Wright & Ma, 2022), where we often need to extract the useful information from an ocean of non-critical data. Recent dataset distillation methods, e.g., (Wang et al., 2018; Cazenavette et al., 2022; Nguyen et al., 2020), have primarily focused on image classification datasets, capturing class-specific information to build discriminative boundaries. With the recent progress in multimodal machine learning, however, we are witnessing an explosion of vision-language datasets in which the majority of image pixels may belong to irrelevant contextual elements and may further lack corresponding textual descriptions; this creates a pressing need to distill this vast amount of data efficiently. A well-distilled multimodal dataset simplifies complex vision-language interactions and emphasizes the most salient connections, making it easier for models to learn cross-modal representations.
Why is it hard? The first key challenge, and the main difference from prior dataset distillation methods (Wang et al., 2018; Cazenavette et al., 2022), is that vision-language datasets do not contain a discrete set of classes to ground the distillation process. Instead, these datasets contain complex cross-modal connections and redundancies, requiring a co-distillation approach to capture their interdependencies effectively. Second, the complexity of cross-modal representations and vision-language models (VLMs) leads to computational challenges. Prior dataset distillation methods operate on low-resolution images (typically 28×28 or 32×32, as in MNIST (LeCun et al., 1998) or CIFAR (Krizhevsky et al., 2009)) and nevertheless incur significant computational costs, even with simple ConvNets, when creating distilled datasets. Vision-language datasets often contain higher-resolution images, and models designed for vision-language tasks are substantially more complex, such as Vision Transformers (ViTs) (Dosovitskiy et al., 2021). Lastly, text is inherently discrete and non-differentiable, making direct gradient-based optimization on text tokens impossible.
Our work. We propose the first vision-language dataset distillation method. Concretely, given a dataset of images with corresponding text descriptions, our method creates a much smaller synthetic set of (image, text embedding) pairs which can then be used to efficiently train a model that learns the image-text alignment. Since directly extracting the essential information is infeasible, our co-distillation is achieved by implicitly matching by-products of training on the target vision-language data and on the synthetic data; in our case, the by-product is the long-range training bi-trajectory. Additionally, as finetuning pretrained models is widely used in vision-language tasks, we match the trajectories of low-rank adaptation matrices (Hu et al., 2022) for complex models to effectively distill critical information.
Contributions. To the best of our knowledge, this is the first work to tackle vision-language dataset distillation. In doing so, we make the following key contributions:
- 1.
We highlight the challenges of vision-language dataset distillation and establish the first set of baselines for this task by adapting three coreset selection methods(Welling, 2009; Toneva etal., 2019; Farahani & Hekmatfar, 2009; Sener & Savarese, 2018).
- 2.
We propose the Bi-Trajectory Vision-Language Co-Distillation method. Different from prior image classification dataset distillation methods, our method is not restricted to discrete classes and distills vision-language pairs jointly. We leverage Low-Rank Adaptation (LoRA) matching to make it computationally feasible for training with complex models (e.g., ViTs) on high-resolution images.
- 3.
Our method significantly improves image-text retrieval with training set constraints on the challenging Flickr30K(Plummer etal., 2015) and COCO(Lin etal., 2014) datasets. For example, the best coreset selection method (adapted K-center) achieves 5.6% image-to-text retrieval performance (R@1) after selecting 1000 image-text pairs for training. In contrast, our method almost doubles that performance on the same task to 9.9% with an order of magnitude fewer (just 100) distilled image-text pairs.
The growing interest in multimodal datasets makes it even more crucial to develop mechanisms that efficiently and effectively distill insights from different modalities. We hope this work jump-starts further research into the important and challenging space of vision-language dataset distillation.
2 Related Works
Dataset Distillation. The concept of dataset distillation has demonstrated that a handful of synthetic images, although not drawn from the training distribution, can achieve performance comparable to that of the original dataset (Wang et al., 2018). Meta-learning based distillation approaches (Nguyen et al., 2021; Zhou et al., 2022; Deng & Russakovsky, 2022; Nguyen et al., 2020; Vicol et al., 2022) typically use bilevel optimization, where the inner loop trains on the distilled samples and the outer loop optimizes the distilled (meta) dataset. Several works (Zhao & Bilen, 2021b; a; Cazenavette et al., 2022; Jiang et al., 2023; Du et al., 2023; Liu et al., 2023) have explored by-product matching approaches, such as matching the gradients, or the trajectories of the gradients, of models trained on the real and distilled data.
Our work is mostly inspired by trajectory matching methods (Cazenavette et al., 2022; Cui et al., 2023), which are more efficient to optimize since they mostly avoid long unrolling of computation graphs. Rather than aligning model gradients, another line of work (Zhao & Bilen, 2021b; Wang et al., 2022; Lee et al., 2022) aligns feature distributions between real and distilled data using a distribution divergence metric in the latent space. While most prior work focuses on image classification dataset distillation, (Sucholutsky & Schonlau, 2021) explored dataset distillation on text datasets. Our work is the first to scale dataset distillation up to vision-language datasets, which requires creating distilled data that capture critical features and complex relationships within and between the two modalities.
Cross-modal Retrieval. Most cross-modal retrieval methods operate at the representation level and encourage a joint embedding space by measuring the similarities between learned representations across different modalities (Liang et al., 2022; Zhu et al., 2022; Pokle et al., 2022; Chun et al., 2021; Wu et al., 2023). Image-text retrieval focuses on retrieving images given captions, or captions given images (Wang et al., 2020b; Wu et al., 2019). Many techniques have been developed to produce representations that are semantically similar for image-text pairs (Huang et al., 2018; Gu et al., 2018). More advanced image-text alignment methods (Li et al., 2022; Lin et al., 2022; Pandey et al., 2022) that incorporate pretraining have shown promising results on image-text retrieval tasks. We evaluate our vision-language dataset distillation method on image-text retrieval tasks.
Vision-language Knowledge Distillation. Prior efforts on vision-language distillation are primarily centered around knowledge distillation, which transfers knowledge from a larger teacher model to a smaller student model to improve the latter's performance (Xue et al., 2023; Radenovic et al., 2023; Valverde et al., 2021). Our dataset distillation study addresses an orthogonal question and is fundamentally a pragmatic compression problem: we aim to find the equivalent bits that can represent an entire vision-language dataset.
3 Method
We propose a vision-language dataset distillation method for distilling a large-scale dataset consisting of (image, text) pairs into a smaller dataset, while maintaining much of the original dataset’s information relevant to training vision-language models (VLMs). The detailed method is in Fig.2.
3.1 Problem Formulation
Consider a large-scale dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i$ denotes an image and each $y_i$ denotes its corresponding text description; note that in practice, $y_i$ may be a set $\{y_i^{(1)}, \dots, y_i^{(M)}\}$, where $M$ is the number of descriptions associated with each image. Our goal is to learn a much smaller dataset $\hat{\mathcal{D}} = \{(\hat{x}_j, \hat{y}_j)\}_{j=1}^{\hat{N}}$, with significantly fewer data pairs ($\hat{N} \ll N$), that still captures most of the essential information needed to train a VLM effectively. For $\hat{\mathcal{D}}$, we aim to use one (instead of $M$) sentence per image in the distilled set for a more compact representation. Concretely, consider a VLM with vision encoder $f$ and language encoder $g$, parameterized jointly by $\theta$. This model can be trained by optimizing the similarity loss $\ell$, which encourages alignment between the image and text embeddings:

$$\min_{\theta}\; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_{\theta}(x_i),\, g_{\theta}(y_i)\big). \qquad (1)$$

Our goal is to distill a dataset $\hat{\mathcal{D}}$ such that the model trained with $\hat{\mathcal{D}}$ obtains vision-language matching performance comparable to the one trained on $\mathcal{D}$. More specifically, consider a metric $m$ defined to quantify the correlation between the model's representation of a given image $x$ and its representation of a given text $y$; this correlation should match the actual similarity between the image-text pair, i.e., whether the pair is a positive (matching) or a negative (non-matching) pair. Given the test dataset $\mathcal{D}_{\text{test}}$, our objective can be defined as follows:

$$\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{test}}}\Big[ m\big(f_{\theta^{\mathcal{D}}}(x),\, g_{\theta^{\mathcal{D}}}(y)\big) \Big] \;\simeq\; \mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{test}}}\Big[ m\big(f_{\theta^{\hat{\mathcal{D}}}}(x),\, g_{\theta^{\hat{\mathcal{D}}}}(y)\big) \Big], \qquad (2)$$

where $\theta^{\mathcal{D}}$ represents the optimal model parameters from training on the entire dataset $\mathcal{D}$, and $\theta^{\hat{\mathcal{D}}}$ denotes the parameters from training on the distilled dataset $\hat{\mathcal{D}}$. Importantly, even when the model is trained on the distilled dataset $\hat{\mathcal{D}}$, we still evaluate its performance on the original $\mathcal{D}_{\text{test}}$ for a fair measurement. When creating the dataset $\hat{\mathcal{D}}$, the pairs can be subsampled from the original set $\mathcal{D}$, as described in the coreset selection methods below (Sec. 3.2). We propose a much more effective strategy in Sec. 3.3 to learn synthetic image-text pairs, which can be more information-rich.
Connection with Image-only Dataset Distillation. Traditionally, dataset distillation is tailored for classification tasks with discrete labels, each of which possesses a distinctive set of distilled data that enables efficient learning while preserving important information. We take this concept a step further to the multimodal scenario, where we distill information from both vision and language data. This involves creating synthetic data that capture critical relationships within and between these two modalities. As opposed to merely classifying discrete labels, we are working with a more complex, interconnected dataset where the relation between modalities is crucial. Our method considers the image-text correlation and how the two modalities influence each other. It is worth noting that distillation is far less effective with single-modality optimization (see Sec. 4.3).
3.2 Baselines: Coreset Selection
Since, to the best of our knowledge, there is no pre-existing work in the domain of vision-language dataset distillation, we begin by formulating a set of baselines to construct the smaller dataset . These baselines are based on coreset selection methods, where a subset of the training pairs is chosen, up to a given budget of pairs, as to maximize the “informativeness” of the selected subset. We consider three such methods, adapted from prior work.
Herding (Welling, 2009). Herding selects data points based on the distance between the coreset center and the original dataset center in feature space. It greedily adds one sample at a time to the coreset so as to minimize the distance between the two centers. We use pretrained encoders to extract features from the image-text pairs, concatenate the features, and compute the dataset center in feature space by averaging all feature vectors. We start with an empty coreset and, at each iteration, add the image-text pair that is closest in Euclidean distance to the current coreset center, recalculating the coreset center after adding each data point.
K-center (Farahani & Hekmatfar, 2009; Sener & Savarese, 2018). Unlike Herding, which tracks a single center, K-center selects training examples that are maximally separated. Concretely, we concatenate the features of each image-text pair and start by randomly selecting a single data point. Then, at each iteration, until K points are selected, we add the image-text pair that is furthest in Euclidean distance from its nearest selected example. The drawback of this method is its high computational cost, especially with large datasets, as it requires heavy distance calculations between data points at each iteration.
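As a rough sketch of how we adapt this selection to image-text pairs, the following is a minimal greedy K-center routine over concatenated image-text features; the feature extraction happens outside this snippet and the function name is ours:

```python
import torch

def k_center_select(features: torch.Tensor, k: int) -> list[int]:
    """Greedy K-center selection over concatenated image-text features.

    features: (N, D) tensor, one row per image-text pair.
    Returns the indices of the k selected pairs.
    """
    n = features.size(0)
    # Start from a randomly chosen pair.
    selected = [torch.randint(n, (1,)).item()]
    # Distance from every pair to its nearest selected center.
    min_dist = torch.cdist(features, features[selected]).squeeze(1)
    for _ in range(k - 1):
        # Pick the pair that is furthest from all currently selected centers.
        idx = torch.argmax(min_dist).item()
        selected.append(idx)
        new_dist = torch.cdist(features, features[idx:idx + 1]).squeeze(1)
        min_dist = torch.minimum(min_dist, new_dist)
    return selected
```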
Forgetting (Toneva et al., 2019). The core idea is to identify reliable training data that the original model consistently learns well. During each training epoch, we check how accurately the model predicts every image-text pair for a specific task (i.e., image-text retrieval). A forgetting event is registered for an image-text pair when the model predicts it correctly in one epoch but fails in the next. Throughout training, we track these forgetting events for each pair and keep the pairs with the fewest forgetting events.
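For concreteness, a minimal sketch of how forgetting events can be tracked per image-text pair across epochs; the per-epoch correctness signal (e.g., an R@1 hit indicator) is assumed to be computed elsewhere:

```python
import torch

def update_forgetting_events(correct_now: torch.Tensor,
                             correct_prev: torch.Tensor,
                             forgetting_counts: torch.Tensor) -> torch.Tensor:
    """Register a forgetting event for pairs that were retrieved correctly
    in the previous epoch but not in the current one.

    correct_now / correct_prev: (N,) boolean tensors, one entry per pair.
    forgetting_counts: (N,) running event counts, updated in place.
    """
    forgetting_counts += (correct_prev & ~correct_now).long()
    return forgetting_counts

# After training, the coreset keeps the pairs with the fewest forgetting events,
# e.g. torch.topk(-forgetting_counts, k=budget).indices for some budget.
```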
3.3 Bi-trajectory Guided Vision-Language Co-Distillation
The coreset selection methods described above, while effective to some extent, have inherent limitations: they can only select a subset of the training dataset $\mathcal{D}$. This restriction leads to less effective results compared to our method, which has the flexibility to generate an optimized distilled dataset $\hat{\mathcal{D}}$, and whose learning process efficiently extracts the most essential information embedded in $\mathcal{D}$. Not only does this decrease storage and computational requirements, it also improves the performance of the model trained on the distilled dataset.
Here we describe our vision-language dataset distillation framework, building on the idea of matching training trajectories (MTT) (Cazenavette et al., 2022) developed for distilling image classification datasets. The core idea of trajectory matching is that, since direct information extraction is infeasible, dataset distillation can be achieved by implicitly matching a by-product: the parameter trajectories induced by the distilled dataset and by the original full dataset. We compute a loss on the cumulative discrepancy between the expert parameter trajectory obtained from the model trained on the full dataset $\mathcal{D}$ and the parameters obtained from the model trained on the distilled dataset $\hat{\mathcal{D}}$, and use that loss to guide the creation of a better $\hat{\mathcal{D}}$, one whose induced parameters match the expert trajectory more closely. The approach consists of two stages:
- 1.
Obtaining the expert training trajectories $\{\tau^{*}\}$, with each trajectory $\tau^{*} = \{\theta^{*}_{t}\}_{t=0}^{T}$, by training multiple models for $T$ epochs on the full dataset $\mathcal{D}$. For our multimodal setting, the models are trained using the bidirectional contrastive loss described below.
- 2.
Training a set of student models on the current distilled dataset $\hat{\mathcal{D}}$ using the same bidirectional contrastive loss, and then updating $\hat{\mathcal{D}}$ based on the bi-trajectory matching loss between the student models' parameter trajectories $\hat{\theta}$ and the expert trajectories $\theta^{*}$.
Bidirectional Contrastive Loss. We train both expert and student VLMs using a bidirectional contrastive loss, following the formulation of (Radford et al., 2021), as it is effective for learning a shared image-text representation. Concretely, given a batch of $n$ image-text pairs $\{(x_i, y_i)\}_{i=1}^{n}$, either from the real dataset $\mathcal{D}$ or from the synthetic distilled dataset $\hat{\mathcal{D}}$, we jointly learn the encoders $f$ and $g$ such that the cosine similarity of all correct image-text pairs is high and that of incorrect pairs is low. We define the cosine similarity between image $x_i$ and text $y_j$ as $\alpha_{ij} = \frac{f(x_i) \cdot g(y_j)}{\|f(x_i)\|\,\|g(y_j)\|}$. We then compute a bidirectional contrastive loss composed of an image-to-text matching loss and a text-to-image matching loss, following the form of the InfoNCE loss (Oord et al., 2018):

$$\mathcal{L}_{\text{contrastive}} = -\frac{1}{2n}\sum_{i=1}^{n}\left(\log\frac{\exp(\alpha_{ii}/\tau)}{\sum_{j=1}^{n}\exp(\alpha_{ij}/\tau)} + \log\frac{\exp(\alpha_{ii}/\tau)}{\sum_{j=1}^{n}\exp(\alpha_{ji}/\tau)}\right), \qquad (3)$$

where $\tau$ is a temperature parameter.
To imitate the effect of training data on parameter trajectories, we use the same objective function to guide the parameter updates during both expert training (stage 1) and distillation (stage 2). Notably, while hard negative mining is typically used in conjunction with contrastive losses, here we rely fully on the dataset distillation process itself without additional intervention. This process inherently accounts for hard negatives: it distills samples that act as hard negatives for others, and these eventually become effective samples for learning. Dataset distillation can thus potentially bypass the complexities of traditional hard negative mining through the learning process.
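A minimal PyTorch-style sketch of the bidirectional contrastive objective in Eqn. 3, with the encoders abstracted as precomputed embeddings; the temperature value here is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb: torch.Tensor,
                                   txt_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of matching image-text pairs.

    img_emb, txt_emb: (B, D) embeddings from the vision / language encoders,
    where row i of each tensor comes from the same pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image-to-text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text-to-image
    return 0.5 * (loss_i2t + loss_t2i)
```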
Bi-Trajectory Matching Loss. Following the formulation of MTT (Cazenavette et al., 2022), we randomly sample image-text pairs from $\mathcal{D}$ to initialize the distilled dataset $\hat{\mathcal{D}}$ (more details can be found in Sec. 4.1). We sample an expert trajectory $\{\theta^{*}_{t}\}$ and a random starting epoch $t$ to initialize the student parameters $\hat{\theta}_{t} = \theta^{*}_{t}$. We train the student model on the distilled dataset $\hat{\mathcal{D}}$ for $N$ steps to obtain $\hat{\theta}_{t+N}$. We then update the distilled dataset based on the bi-trajectory matching loss, computed as the accumulated difference between the student trajectory and the expert trajectory for both the image and text encoders:

$$\mathcal{L}_{\text{match}} = \frac{\|\hat{\theta}^{\,\mathrm{img}}_{t+N} - \theta^{*\,\mathrm{img}}_{t+M}\|_2^2}{\|\theta^{*\,\mathrm{img}}_{t} - \theta^{*\,\mathrm{img}}_{t+M}\|_2^2} + \frac{\|\hat{\theta}^{\,\mathrm{txt}}_{t+N} - \theta^{*\,\mathrm{txt}}_{t+M}\|_2^2}{\|\theta^{*\,\mathrm{txt}}_{t} - \theta^{*\,\mathrm{txt}}_{t+M}\|_2^2}, \qquad (4)$$

where $\theta^{*}_{t+M}$ denotes the expert parameters $M$ epochs after the starting epoch $t$.
We update the distilled dataset $\hat{\mathcal{D}}$ by back-propagating through the $N$ gradient descent updates of the student parameters, specifically into the image pixel space and the text embedding space, with respect to Eqn. 4. We initialize the continuous sentence embeddings using a pretrained BERT model and update the distilled text in this continuous embedding space. For the distilled images, we directly update the pixel values. The full details are in Algorithm 1.
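A simplified sketch of the bi-trajectory matching loss in Eqn. 4; the flattened-parameter handling and variable names are illustrative rather than the exact Algorithm 1:

```python
import torch

def bi_trajectory_matching_loss(student_end, expert_end, expert_start):
    """Eqn. 4: normalized squared parameter distance, summed over the image
    and text encoders. Each argument is a list of flattened parameter
    vectors, e.g. [image_encoder_params, text_projection_params]."""
    loss = torch.zeros(())
    for s_end, e_end, e_start in zip(student_end, expert_end, expert_start):
        loss = loss + (s_end - e_end).pow(2).sum() / (e_start - e_end).pow(2).sum()
    return loss

# Toy usage with random "parameters". In the real pipeline, the student
# parameters at step t+N are a differentiable function of the distilled image
# pixels and text embeddings, so this loss is back-propagated into them.
theta_img_start, theta_img_target = torch.randn(1000), torch.randn(1000)
theta_txt_start, theta_txt_target = torch.randn(500), torch.randn(500)
student_img, student_txt = torch.randn(1000), torch.randn(500)
loss = bi_trajectory_matching_loss(
    [student_img, student_txt],
    [theta_img_target, theta_txt_target],
    [theta_img_start, theta_txt_start])
```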
Low-Rank Adaptation Matching. Further, for complex image encoders such as Vision Transformers (ViTs) (Dosovitskiy et al., 2021), bi-trajectory matching does not work well, due to the high dimensionality of the embeddings and the large number of parameters saved in the trajectories compared to models like NFNet. To mitigate this issue, we propose Low-Rank Adaptation (LoRA) matching, which matches the trajectories of only a small subset of the model's parameters through low-rank matrices. LoRA is effective for finetuning pretrained models (Hu et al., 2022): it introduces trainable low-rank matrices into the weight matrices of specific layers of the pretrained model. LoRA matching then optimizes the trajectories of these low-rank adapters instead of the full parameters.
Given the weight matrix $W \in \mathbb{R}^{d \times d}$ of a certain layer in the ViT model, we introduce two low-rank matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ for that layer's weight matrix, where $d$ is the dimension of $W$ and $r \ll d$ represents the rank. The adaptation is performed by modifying $W$ to $W' = W + AB$, where $W'$ denotes the adapted weight matrix. We only train and save the weights of $A$ and $B$ in the expert trajectories and match the student trajectories with Eqn. 4. For example, the ViT model we used is vit_base_patch16_224, which has 86 million parameters; with LoRA, the trainable parameters are reduced to 18 million, cutting 78.71% of the parameters. This allows for efficient adaptation of the model with minimal additional parameters. With LoRA matching, we can focus on a smaller set of parameters and efficiently optimize the distilled dataset $\hat{\mathcal{D}}$ during distillation while maintaining the capacity to preserve critical information.
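A minimal sketch of attaching LoRA matrices to a linear layer so that only $A$ and $B$ are trainable and enter the saved trajectories; the rank and initialization scale are illustrative choices, not the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with trainable low-rank matrices A and B,
    so the adapted weight is W' = W + A @ B; only A and B require gradients."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze W (and bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_in))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying the adapted weight W + A @ B.
        return self.base(x) + x @ (self.A @ self.B).t()

layer = LoRALinear(nn.Linear(768, 768), rank=16)
lora_params = [p for p in layer.parameters() if p.requires_grad]  # only A, B
```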
4 Experiments
In this section, we first describe the cross-modal retrieval test-bed in Sec.4.1. We use it to evaluate our vision-language dataset co-distillation performance. We then compare our method to baseline coreset selection approaches and provide the key quantitative, qualitative results, and cross-architecture generalization results in Sec.4.2. We further conduct a set of ablation studies in Sec.4.3.
4.1 Vision-Language Distillation Setup
Datasets and Tasks. We evaluate our method on standard vision-language datasets: Flickr30K (Plummer et al., 2015) and COCO (Lin et al., 2014), which are widely used for image-text retrieval tasks. We use them for expert training (stage 1) and distillation (stage 2). We adopt the Karpathy split (Karpathy & Fei-Fei, 2015): 29k/1k/1k train/validation/test images for Flickr30K and 113k/5k/5k for COCO. Each image is paired with five captions. We retrieve the closest matches from one modality, using cosine distance, given a query from the other. We use R@K (for K = 1, 5, 10) to compute the fraction of queries for which the correct result appears among the top K retrieved items. Before moving from distilling image-only datasets to vision-language datasets, we validate in Appendix Sec. B that our method has potential in the classic image classification setting.
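For reference, a small sketch of the retrieval metric, assuming precomputed L2-normalized embeddings and, for simplicity, a single matching caption per image:

```python
import torch

def recall_at_k(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int) -> float:
    """Image-to-text R@K: fraction of images whose matching caption appears
    among the top-K captions ranked by cosine similarity (embeddings are
    assumed L2-normalized; row i of both tensors is a matching pair)."""
    sims = img_emb @ txt_emb.t()                          # (N_img, N_txt)
    topk = sims.topk(k, dim=1).indices                    # (N_img, k)
    targets = torch.arange(img_emb.size(0), device=img_emb.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```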
Network Architectures. We primarily use a pretrained and trainable NormalizerFree ResNet (NFNet) (Brock et al., 2021b) as the image backbone, following Flamingo (Alayrac et al., 2022), as well as a Vision Transformer (ViT); for the text backbone, we use a pretrained and frozen BERT (Devlin et al., 2018). Ablation studies on different backbones are in Appendix Sec. E.2. While both encoders are pretrained, they are pretrained only on unimodal data with no exposure to the other modality. Each encoder is followed by a trainable linear projection layer with random initialization. Using a trainable BERT adds complexity that is orthogonal to vision-language dataset distillation and is out of the scope of this work. Pretrained models serve as a common foundation and a good starting point; see Appendix Sec. E.3 for details.
Implementation. For expert training, we train on a single RTX 3090 GPU for 10 epochs, where a single epoch takes 40 minutes of wall-clock time. Sampling from a set of trajectories encourages the distilled dataset to include diverse information and avoids overfitting to a particular step, so we save 20 image-text bi-trajectories. Distillation takes 6-15 GPU hours depending on the setting (e.g., the number of distilled pairs) on an 8-GPU A6000 node. We initialize a trainable learning rate for the student model at 0.1. We follow the data augmentation techniques in (Li et al., 2022), including resizing, cropping, flipping, and RandomAugment. We use SGD with momentum 0.5; the learning rates for updating the trainable student learning rate, the distilled image pixels, and the distilled text embeddings are 1e-02, 1000, and 1000, respectively.
Initialization. Following prior studies (Nguyen et al., 2020; Zhou et al., 2022), we initialize the distilled set with randomly selected real samples: we randomly select image-text pairs from the original dataset, with images at 224 × 224 resolution and 768-dimensional sentence embeddings obtained via pretrained BERT. Our findings in Appendix Sec. E.1 show that initializing images from a Gaussian distribution results in significantly lower performance; the complexity of images makes learning from random initialization challenging. In contrast, there is little difference in performance between real and randomly initialized text embeddings. Surprisingly, despite the initial lack of semantic correspondence between 'noise' texts and real images, we find notable semantic similarity between the distilled text and the real images, suggesting potential applications of our method in Visual Question Answering.
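A sketch of this initialization, assuming the images and raw captions are already loaded; the mean-pooling of BERT token embeddings is an illustrative assumption:

```python
import torch
from transformers import BertTokenizer, BertModel

def init_distilled_set(images: torch.Tensor, captions: list, m: int):
    """Initialize m distilled pairs from randomly chosen real samples:
    images stay in pixel space (224x224), captions become 768-d BERT
    sentence embeddings; both become trainable distilled variables."""
    idx = torch.randperm(images.size(0))[:m]
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()
    with torch.no_grad():
        tokens = tokenizer([captions[int(i)] for i in idx], padding=True,
                           truncation=True, return_tensors="pt")
        txt_emb = bert(**tokens).last_hidden_state.mean(dim=1)   # (m, 768)
    distilled_images = images[idx].clone().requires_grad_(True)
    distilled_texts = txt_emb.clone().requires_grad_(True)
    return distilled_images, distilled_texts
```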
Table 1: Image-to-text (TR) and text-to-image (IR) retrieval R@1 for coreset selection baselines (R: random, H: herding, K: K-center, F: forgetting) vs. our distillation (Dist).

| Dataset | #pairs | TR R | TR H | TR K | TR F | TR Dist (ours) | IR R | IR H | IR K | IR F | IR Dist (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | 100 | 1.3 | 1.1 | 0.6 | 1.2 | 9.9 ± 0.3 | 1.0 | 0.7 | 0.7 | 0.7 | 4.7 ± 0.2 |
| Flickr30K | 200 | 2.1 | 2.3 | 2.2 | 1.5 | 10.2 ± 0.8 | 1.1 | 1.5 | 1.5 | 1.2 | 4.6 ± 0.9 |
| Flickr30K | 500 | 5.2 | 5.1 | 4.9 | 3.6 | 13.3 ± 0.6 | 2.4 | 3.0 | 3.5 | 1.8 | 6.6 ± 0.3 |
| Flickr30K | 1000 | 5.2 | 5.0 | 5.6 | 3.1 | 13.3 ± 1.0 | 3.8 | 4.1 | 4.4 | 3.2 | 7.9 ± 0.8 |
| COCO | 100 | 0.8 | 0.8 | 1.4 | 0.7 | 2.5 ± 0.3 | 0.3 | 0.5 | 0.4 | 0.3 | 1.3 ± 0.1 |
| COCO | 200 | 1.0 | 1.0 | 1.2 | 1.1 | 3.3 ± 0.2 | 0.6 | 0.9 | 0.7 | 0.6 | 1.7 ± 0.1 |
| COCO | 500 | 1.9 | 1.9 | 2.5 | 2.1 | 5.0 ± 0.4 | 1.1 | 1.7 | 1.1 | 0.8 | 2.5 ± 0.5 |
| COCO | 1000 | 1.9 | 2.4 | 2.4 | 1.9 | 6.8 ± 0.4 | 1.5 | 1.3 | 1.5 | 0.7 | 3.3 ± 0.1 |
4.2 Key Results
Quantitative Results. As shown in Tab. 1 and Tab. 6 in Appendix Sec. A, although there is relatively little variation in performance across the coreset selection baselines we compare to, dataset distillation outperforms the best alternative by anywhere from 138% (improving R@1 from 5.6 for K-center (Farahani & Hekmatfar, 2009) to 13.3 for our method) to 661% (improving R@1 from 1.3 for random selection to 9.9 for our method). The relative improvement increases when fewer pairs are used for training.
Moreover, as shown in Tab. 6, with 1000 pairs, almost 30 times fewer examples than in the original dataset, our distillation approach reaches 43.7 R@10 for TR, relative to a practical upper bound of 75.2, and 34.4 R@10 for IR, relative to an upper bound of 69.7. We also observe that performance among the baseline coreset selection methods varies only slightly, with no single method consistently outperforming the others across all pair sizes and retrieval metrics, and often matching or underperforming random selection. This suggests that coreset selection has inherent limitations in multimodal settings. In comparison, our bi-trajectory co-distillation method is optimized for the vision-language alignment setting and thus performs significantly better. Our results show the effectiveness of distilled data, achieving strong performance with significantly fewer examples.
Table 2: ViT distillation performance with and without LoRA trajectory matching.

| Dataset | #Pairs | w/o LoRA TR R@1 | R@5 | R@10 | w/o LoRA IR R@1 | R@5 | R@10 | w/ LoRA TR R@1 | R@5 | R@10 | w/ LoRA IR R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | 100 | 1.5 | 2.5 | 4.5 | 0.6 | 1.2 | 2.3 | 10.4 | 23.6 | 38.7 | 5.4 | 18.8 | 27.4 |
| Flickr30K | 200 | 1.8 | 3.9 | 6.4 | 0.8 | 1.5 | 2.7 | 11.2 | 24.5 | 41.5 | 6.4 | 19.4 | 29.4 |
| Flickr30K | 500 | 2.1 | 4.3 | 7.2 | 1.5 | 2.1 | 3.6 | 13.4 | 27.8 | 43.4 | 7.6 | 21.1 | 32.7 |
| Flickr30K | 1000 | 3.3 | 5.8 | 7.9 | 1.5 | 2.3 | 3.9 | 15.8 | 29.7 | 45.9 | 8.1 | 23.4 | 35.8 |
| COCO | 100 | 0.5 | 0.9 | 2.1 | 0.3 | 0.7 | 1.4 | 5.1 | 17.4 | 27.1 | 2.3 | 8.1 | 14.5 |
| COCO | 200 | 0.8 | 1.5 | 3.5 | 0.3 | 0.8 | 1.8 | 6.8 | 19.3 | 28.5 | 2.9 | 9.5 | 18.4 |
| COCO | 500 | 1.2 | 2.3 | 4.1 | 0.5 | 1.1 | 2.3 | 7.4 | 21.4 | 29.4 | 3.8 | 11.2 | 19.6 |
| COCO | 1000 | 1.5 | 2.7 | 4.5 | 0.7 | 1.5 | 2.9 | 9.9 | 22.5 | 32.8 | 4.7 | 12.7 | 20.2 |
We compare the performance of the ViT model (vit_base_patch16_224) with and without LoRA trajectory matching, using BERT as the language encoder, on the Flickr30K dataset in Tab. 2. Interestingly, vanilla ViT struggles under distillation, potentially due to its attention mechanisms. With LoRA matching, for 100 pairs the TR R@1 jumps to 10.4 and the IR R@1 to 5.4. With 1000 pairs, the improvement is even more noticeable: TR R@1 increases to 15.8 and IR R@1 to 8.1. These results show that LoRA trajectory matching is much more effective for distilling critical information. We report the practical upper- and lower-bound performance in Tab. 3.
Qualitative Results. In Fig. 3 we visualize distilled image-text pairs drawn from the 100 distilled pairs of Flickr30K after 2000 distillation steps. We visualize the distilled text embeddings via their nearest-neighbor sentences (by cosine similarity) in the training set embedding space for a more intuitive understanding. Additional visualizations are in Appendix Sec. G. Compared to the original images, the distilled images add high-frequency components that help improve generalization performance (Wang et al., 2020a). The distilled texts maintain semantic components associated with the distilled images and capture key attributes, e.g., "couple", "kiss", "man", "surf", "huge wave", but they also deviate from the original sentence embeddings, as they do not match any of the original five captions paired with the images. The improved performance indicates that both high-frequency and semantic components are perceived by the models and significantly help in aligning the vision and language modalities.
Table 3: Practical lower and upper bounds.

| Dataset | Setting | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| Flickr30K | Lower bound: random ranking | 0.1 | 0.6 | 1.1 | 0.1 | 0.5 | 1.0 |
| Flickr30K | Upper bound: NFNet + BERT | 33.9 | 65.1 | 75.2 | 27.3 | 57.1 | 69.7 |
| Flickr30K | Upper bound: ViT (LoRA) + BERT | 42.7 | 72.9 | 83.5 | 31.8 | 62.8 | 74.5 |
| COCO | Lower bound: random ranking | 0.02 | 0.1 | 0.2 | 0.02 | 0.1 | 0.2 |
| COCO | Upper bound: NFNet + BERT | 19.6 | 45.6 | 59.5 | 16.9 | 41.9 | 55.9 |
| COCO | Upper bound: ViT (LoRA) + BERT | 22.6 | 50.8 | 64.8 | 19.1 | 44.7 | 58.7 |
Table 4 (top): Cross-architecture generalization when distilling with NFNet.

| Distill | Evaluate | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| NFNet | NFNet | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| NFNet | NF-ResNet50 | 5.2 | 14.7 | 21.2 | 4.5 | 13.8 | 21.2 |
| NFNet | NF-RegNet | 3.6 | 9.7 | 15.5 | 2.5 | 8.6 | 14.0 |
| NFNet | ViT | 3.1 | 8.6 | 13.2 | 2.3 | 7.4 | 13.3 |
Table 4 (bottom): Cross-architecture generalization when distilling with ViT.

| Distill | Evaluate | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| ViT | ViT | 10.4 | 23.6 | 38.7 | 5.4 | 18.8 | 27.4 |
| ViT | NF-ResNet50 | 2.8 | 8.3 | 12.2 | 2.0 | 6.7 | 11.5 |
| ViT | NF-RegNet | 3.7 | 8.4 | 14.1 | 1.9 | 5.9 | 9.2 |
| ViT | NFNet | 4.4 | 12.6 | 20.3 | 2.6 | 7.3 | 13.9 |
Table 5: Unimodal distillation (T: text-only, I: image-only) vs. our co-distillation (Ours).

| # pairs | TR R@1 (T / I / Ours) | TR R@5 (T / I / Ours) | TR R@10 (T / I / Ours) | IR R@1 (T / I / Ours) | IR R@5 (T / I / Ours) | IR R@10 (T / I / Ours) |
|---|---|---|---|---|---|---|
| 100 | 1.3 / 3.5 / 9.9 | 3.5 / 11.5 / 28.3 | 5.9 / 17.4 / 39.1 | 0.5 / 1.6 / 4.7 | 2.1 / 5.6 / 15.7 | 3.4 / 9.7 / 24.6 |
| 200 | 1.4 / 4.5 / 10.2 | 4.8 / 12.8 / 28.7 | 8.2 / 21.7 / 41.9 | 0.7 / 2.0 / 4.6 | 2.7 / 8.1 / 16.0 | 4.7 / 13.0 / 25.5 |
| 500 | 6.6 / 6.5 / 13.3 | 19.5 / 19.4 / 32.8 | 30.4 / 28.9 / 46.8 | 3.8 / 3.8 / 6.6 | 13.5 / 12.4 / 20.2 | 20.8 / 19.9 / 30.0 |
| 1000 | 7.7 / 5.0 / 13.3 | 20.7 / 17.4 / 34.8 | 31.2 / 24.9 / 45.7 | 4.0 / 3.9 / 9.1 | 13.3 / 13.1 / 24.1 | 20.1 / 20.1 / 33.8 |
Cross-Architecture Generalization. Following previous works (Cazenavette et al., 2022; Cui et al., 2023; Zhao & Bilen, 2023), we evaluate the ability of our distilled data to generalize to training unseen architectures. The experiments are conducted on Flickr30K with 100 distilled pairs. Distilling with the NFNet model, we report the cross-architecture generalization performance on NF-ResNet50 (Brock et al., 2021a), NF-RegNet (Xu et al., 2022), and ViT (Dosovitskiy et al., 2021) (LoRA). As shown in Tab. 4, our method transfers well across different models.
4.3 Ablation Studies
We conduct a set of ablation studies to understand unimodal distillation vs. co-distillation, distilled dataset initialization (Sec.E.1), different encoder backbones (Sec.E.2), pretraining (Sec.E.3), synthetic steps (Sec.E.4), and their influence on distillation.
We compare co-distillation with unimodal distillation, where we keep one of the modalities fixed during distillation. Tab. 5 shows the retrieval performance of text-only distillation, image-only distillation, and co-distillation. Across all tasks and metrics, co-distillation clearly outperforms the others. We also observe that text-only distillation performs worse than image-only distillation. This may not be surprising: a text description typically captures only a small, salient portion of the visual information, whereas the descriptions in the evaluated datasets contain little information that cannot be inferred from the images. Distilling images toward their text-relevant aspects can therefore highlight essential image features. Thus, if we interpret each original image as carrying substantially more information than its original sentence, we would expect image-only distillation to perform better in a smaller-scale regime (removing spurious information) and text-only distillation to perform better in a larger-scale regime (adding useful details).
In contrast, co-distillation allows the synthetic dataset to further optimize for compact representation and efficient storage, removing redundant information between examples in smaller-scale contexts and adding information not present in the selected original images in larger-scale contexts. Our co-distillation method, which jointly distills the text and image modalities during training, consistently outperforms single-modality distillation across different numbers of training pairs and metrics. While the improvement is consistent, it is particularly substantial with fewer pairs: in the 100- and 200-pair rows, co-distillation outperforms its unimodal alternatives by over 2×. In fact, co-distillation with 100 pairs consistently outperforms unimodal distillation with 1000 pairs. These results demonstrate the effectiveness of jointly distilling across modalities and highlight the complementary nature of multimodal data.
5 Conclusion
In this work, we propose the first vision-language dataset distillation method. By co-distilling both vision and language modalities, we can progressively optimize and distill the most critical information from a vision-language dataset. Our experiments show that co-distilling different modalities via bi-trajectory matching and using LoRA matching for complex model finetuning hold promise. We hope that the insights we gathered can serve as a roadmap for future studies exploring more complex settings. Furthermore, we believe our work lays the groundwork for future research aimed at understanding the minimum information required for a vision-language model to achieve comparable performance quickly, thereby building a better understanding of the compositionality of compact visual-linguistic knowledge.
Limitations. We note two limitations of our approach. First, dataset distillation is not exempt from the "No Free Lunch" theorem (Wolpert & Macready, 1997). As discussed in (Sachdeva & McAuley, 2023), we also observed that the effectiveness of the distilled data is highly influenced by the learning algorithms and models used during distillation, which could lead to poor transferability. Furthermore, many dataset distillation methods are computationally intensive, e.g., the bi-level optimization in meta-learning distillation approaches, which is another major challenge. Our trajectory matching approach is significantly less computationally demanding, yet we observed that larger synthetic step counts often result in improved performance; exploring closed-form solutions, e.g., implicit gradient-based methods (Lorraine et al., 2020), could be a promising future direction.
Broader Impact Statement.Our exploration focuses on scientific understanding and practical applications of vision-language dataset distillation. While our work does not directly imply negative impacts, it may indirectly propagate existing biases in the original datasets. Therefore, it is important to incorporate rigorous bias-mitigation measurements for dataset distillation. Discussion on these critical aspects should remain a priority as we further explore the potential of vision-language dataset distillation.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grants No. 2107048 and No. 2112562. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We thank many people from Princeton Visual AI lab (Allison Chen, Jihoon Chung, Tyler Zhu, Ye Zhu, William Yang and Kaiqu Liang) and Princeton NLP group (Carlos E. Jimenez, John Yang), as well as Tiffany Ling, George Cazenavette and Ilia Sucholutsky for their helpful feedback.
References
- Alayrac etal. (2022)Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etal.Flamingo: a visual language model for few-shot learning.NeurIPS, 2022.
- Brock etal. (2021a)Andrew Brock, Soham De, and SamuelL Smith.Characterizing signal propagation to close the performance gap in unnormalized resnets.ICLR, 2021a.
- Brock etal. (2021b)Andy Brock, Soham De, SamuelL Smith, and Karen Simonyan.High-performance large-scale image recognition without normalization.In ICML, 2021b.
- Cazenavette etal. (2022)George Cazenavette, Tongzhou Wang, Antonio Torralba, AlexeiA. Efros, and Jun-Yan Zhu.Dataset distillation by matching training trajectories.In CVPR, 2022.
- Chun etal. (2021)Sanghyuk Chun, SeongJoon Oh, RafaelSampaio DeRezende, Yannis Kalantidis, and Diane Larlus.Probabilistic embeddings for cross-modal retrieval.In CVPR, 2021.
- Cui etal. (2023)Justin Cui, Ruochen Wang, SiSi, and Cho-Jui Hsieh.Scaling up dataset distillation to imagenet-1k with constant memory.In ICML, 2023.
- Deng & Russakovsky (2022)Zhiwei Deng and Olga Russakovsky.Remember the past: Distilling datasets into addressable memories for neural networks.In NeurIPS, 2022.
- Devlin etal. (2018)Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018.
- Dosovitskiy etal. (2021)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, etal.An image is worth 16x16 words: Transformers for image recognition at scale.ICLR, 2021.
- Du etal. (2023)Jiawei Du, Yidi Jiang, Vincent T.F. Tan, JoeyTianyi Zhou, and Haizhou Li.Minimizing the accumulated trajectory error to improve dataset distillation.In CVPR, 2023.
- Farahani & Hekmatfar (2009)RezaZanjirani Farahani and Masoud Hekmatfar.Facility location: concepts, models, algorithms and case studies.Springer Science & Business Media, 2009.
- Gu etal. (2018)Jiuxiang Gu, Jianfei Cai, ShafiqR Joty, LiNiu, and Gang Wang.Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models.In CVPR, 2018.
- He etal. (2016)Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition.In CVPR, 2016.
- Hu etal. (2022)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.ICLR, 2022.
- Huang etal. (2018)Yan Huang, QiWu, Chunfeng Song, and Liang Wang.Learning semantic concepts and order for image and sentence matching.In CVPR, 2018.
- Jiang etal. (2023)Zixuan Jiang, Jiaqi Gu, Mingjie Liu, and DavidZ Pan.Delving into effective gradient matching for dataset condensation.COINS, 2023.
- Karpathy & Fei-Fei (2015)Andrej Karpathy and LiFei-Fei.Deep visual-semantic alignments for generating image descriptions.In CVPR, 2015.
- Krizhevsky etal. (2009)Alex Krizhevsky, Geoffrey Hinton, etal.Learning multiple layers of features from tiny images.2009.
- LeCun etal. (1998)Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.Gradient-based learning applied to document recognition.Proceedings of the IEEE, 1998.
- Lee etal. (2022)Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon.Dataset condensation with contrastive signals.In ICML, 2022.
- Li etal. (2022)Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.In ICML, 2022.
- Liang etal. (2022)PaulPu Liang, Amir Zadeh, and Louis-Philippe Morency.Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions.arXiv preprint arXiv:2209.03430, 2022.
- Lin etal. (2014)Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and CLawrence Zitnick.Microsoft coco: Common objects in context.In ECCV, 2014.
- Lin etal. (2022)Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan.Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models.CVPR, 2022.
- Liu etal. (2023)Haoyang Liu, Tiancheng Xing, Luwei Li, Vibhu Dalal, Jingrui He, and Haohan Wang.Dataset distillation via the wasserstein metric.arXiv preprint arXiv:2311.18531, 2023.
- Lorraine etal. (2020)Jonathan Lorraine, Paul Vicol, and David Duvenaud.Optimizing millions of hyperparameters by implicit differentiation.In AISTATS, 2020.
- Nguyen etal. (2020)Timothy Nguyen, Zhourong Chen, and Jaehoon Lee.Dataset meta-learning from kernel ridge-regression.In ICLR, 2020.
- Nguyen etal. (2021)Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee.Dataset distillation with infinitely wide convolutional networks.In NeurIPS, 2021.
- Oord etal. (2018)Aaron vanden Oord, Yazhe Li, and Oriol Vinyals.Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018.
- Pandey etal. (2022)Rohan Pandey, Rulin Shao, PaulPu Liang, Ruslan Salakhutdinov, and Louis-Philippe Morency.Cross-modal attention congruence regularization for vision-language relation alignment.ACL, 2022.
- Plummer etal. (2015)BryanA Plummer, Liwei Wang, ChrisM Cervantes, JuanC Caicedo, Julia Hockenmaier, and Svetlana Lazebnik.Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.In ICCV, 2015.
- Pokle etal. (2022)Ashwini Pokle, Jinjin Tian, Yuchen Li, and Andrej Risteski.Contrasting the landscape of contrastive and non-contrastive learning.2022.
- Radenovic etal. (2023)Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, YiWen, Vignesh Ramanathan, and Dhruv Mahajan.Filtering, distillation, and hard negatives for vision-language pre-training.In CVPR, 2023.
- Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In ICML, 2021.
- Sachdeva & McAuley (2023)Noveen Sachdeva and Julian McAuley.Data distillation: A survey.TMLR, 2023.
- Sener & Savarese (2018)Ozan Sener and Silvio Savarese.Active learning for convolutional neural networks: A core-set approach.ICLR, 2018.
- Sucholutsky & Schonlau (2021)Ilia Sucholutsky and Matthias Schonlau.Soft-label dataset distillation and text dataset distillation.In IJCNN, 2021.
- Toneva etal. (2019)Mariya Toneva, Alessandro Sordoni, Remi Tachetdes Combes, Adam Trischler, Yoshua Bengio, and GeoffreyJ Gordon.An empirical study of example forgetting during deep neural network learning.ICLR, 2019.
- Valverde etal. (2021)FranciscoRivera Valverde, JuanaValeria Hurtado, and Abhinav Valada.There is more than meets the eye: Self-supervised multi-object detection and tracking with sound by distilling multimodal knowledge.In CVPR, 2021.
- Vicol etal. (2022)Paul Vicol, JonathanP Lorraine, Fabian Pedregosa, David Duvenaud, and RogerB Grosse.On implicit bias in overparameterized bilevel optimization.In ICML, 2022.
- Wang etal. (2020a)Haohan Wang, Xindi Wu, Zeyi Huang, and EricP Xing.High-frequency component helps explain the generalization of convolutional neural networks.In CVPR, 2020a.
- Wang etal. (2020b)Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, and Lin Ma.Consensus-aware visual-semantic embedding for image-text matching.In ECCV, 2020b.
- Wang etal. (2022)Kai Wang, BoZhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You.CAFE: Learning to condense dataset by aligning features.In CVPR, 2022.
- Wang etal. (2018)Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and AlexeiA Efros.Dataset distillation.arXiv preprint arXiv:1811.10959, 2018.
- Welling (2009)Max Welling.Herding dynamical weights to learn.In ICML, 2009.
- Wolpert & Macready (1997)DavidH Wolpert and WilliamG Macready.No free lunch theorems for optimization.IEEE transactions on evolutionary computation, 1997.
- Wright & Ma (2022)John Wright and YiMa.High-dimensional data analysis with low-dimensional models: Principles, computation, and applications.Cambridge University Press, 2022.
- Wu etal. (2019)Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma.Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations.In CVPR, 2019.
- Wu etal. (2023)Xindi Wu, KwunFung Lau, Francesco Ferroni, Aljoša Ošep, and Deva Ramanan.Pix2map: Cross-modal retrieval for inferring street maps from images.In CVPR, 2023.
- Xu etal. (2022)Jing Xu, YuPan, Xinglin Pan, Steven Hoi, Zhang Yi, and Zenglin Xu.Regnet: self-regulated network for image classification.IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Xue etal. (2023)Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao.The modality focusing hypothesis: Towards understanding crossmodal knowledge distillation.In ICLR, 2023.
- Zhao & Bilen (2021a)BoZhao and Hakan Bilen.Dataset condensation with differentiable siamese augmentation.In ICML, 2021a.
- Zhao & Bilen (2021b)BoZhao and Hakan Bilen.Dataset condensation with gradient matching.In ICLR, 2021b.
- Zhao & Bilen (2023)BoZhao and Hakan Bilen.Dataset condensation with distribution matching.In WACV, 2023.
- Zhou etal. (2022)Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba.Dataset distillation using neural feature regression.In NeurIPS, 2022.
- Zhu etal. (2022)YeZhu, YuWu, Nicu Sebe, and Yan Yan.Vision+ x: A survey on multimodal learning in the light of data.arXiv preprint arXiv:2210.02884, 2022.
Appendix
In this Appendix, we first provide the full baseline comparison (with R@1/5/10) on Flickr30K and COCO (Sec. A). We then illustrate the challenges of vision-language distillation (Sec. B) by transitioning the trajectory-matching pipeline from image-only classification to image-text retrieval. We provide analysis of the distilled images (Sec. D) and of lossless distillation (Sec. C). We further extend the ablation study, analyzing components of our pipeline: distilled dataset initialization (Sec. E.1), encoder backbones (Sec. E.2), pretraining (Sec. E.3), and synthetic steps (Sec. E.4). Lastly, we show additional visualizations of the distilled samples, including those obtained with different backbones (Sec. G).
Appendix A Full Details for Distilled Performance
We provide full distillation results following Section 4.2, including image-to-text and text-to-image retrieval results R@5 and R@10 with NFNet in Tab.6.
Table 6: Full retrieval results (R@1/5/10) for coreset selection baselines (R: random, H: herding, K: K-center, F: forgetting) vs. our distillation (Dist) with NFNet.

| Dataset | #pairs | Metric | TR R | TR H | TR K | TR F | TR Dist (ours) | IR R | IR H | IR K | IR F | IR Dist (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | 100 | R@1 | 1.3 | 1.1 | 0.6 | 1.2 | 9.9 ± 0.3 | 1.0 | 0.7 | 0.7 | 0.7 | 4.7 ± 0.2 |
| Flickr30K | 100 | R@5 | 5.9 | 4.7 | 5.0 | 4.2 | 28.3 ± 0.5 | 4.0 | 2.8 | 3.1 | 2.4 | 15.7 ± 0.5 |
| Flickr30K | 100 | R@10 | 10.1 | 7.9 | 7.6 | 9.7 | 39.1 ± 0.7 | 6.5 | 5.3 | 6.1 | 5.6 | 24.6 ± 1.0 |
| Flickr30K | 200 | R@1 | 2.1 | 2.3 | 2.2 | 1.5 | 10.2 ± 0.8 | 1.1 | 1.5 | 1.5 | 1.2 | 4.6 ± 0.9 |
| Flickr30K | 200 | R@5 | 8.7 | 8.4 | 8.2 | 8.4 | 28.7 ± 1.0 | 4.8 | 5.5 | 5.4 | 3.1 | 16.0 ± 1.6 |
| Flickr30K | 200 | R@10 | 13.2 | 14.4 | 13.5 | 10.2 | 41.9 ± 1.9 | 9.2 | 9.3 | 9.9 | 8.4 | 25.5 ± 2.6 |
| Flickr30K | 500 | R@1 | 5.2 | 5.1 | 4.9 | 3.6 | 13.3 ± 0.6 | 2.4 | 3.0 | 3.5 | 1.8 | 6.6 ± 0.3 |
| Flickr30K | 500 | R@5 | 18.3 | 16.4 | 16.4 | 12.3 | 32.8 ± 1.8 | 10.5 | 10.0 | 10.4 | 9.0 | 20.2 ± 1.2 |
| Flickr30K | 500 | R@10 | 25.7 | 24.3 | 23.3 | 19.3 | 46.8 ± 0.8 | 17.4 | 17.0 | 17.3 | 15.9 | 30.0 ± 2.1 |
| Flickr30K | 1000 | R@1 | 5.2 | 5.0 | 5.6 | 3.1 | 13.3 ± 1.0 | 3.8 | 4.1 | 4.4 | 3.2 | 7.9 ± 0.8 |
| Flickr30K | 1000 | R@5 | 15.6 | 14.6 | 16.1 | 14.9 | 34.8 ± 1.9 | 11.8 | 12.1 | 12.8 | 9.5 | 24.1 ± 1.6 |
| Flickr30K | 1000 | R@10 | 21.4 | 20.4 | 20.8 | 18.9 | 45.9 ± 2.5 | 19.9 | 20.0 | 20.4 | 18.7 | 33.8 ± 2.0 |
| COCO | 100 | R@1 | 0.8 | 0.8 | 1.4 | 0.7 | 2.5 ± 0.3 | 0.3 | 0.5 | 0.4 | 0.3 | 1.3 ± 0.1 |
| COCO | 100 | R@5 | 3.0 | 2.1 | 3.7 | 2.6 | 10.0 ± 0.5 | 1.3 | 1.4 | 1.4 | 1.5 | 5.4 ± 0.3 |
| COCO | 100 | R@10 | 5.0 | 4.9 | 5.5 | 4.8 | 15.7 ± 0.4 | 2.7 | 3.5 | 2.5 | 2.5 | 9.5 ± 0.5 |
| COCO | 200 | R@1 | 1.0 | 1.0 | 1.2 | 1.1 | 3.3 ± 0.2 | 0.6 | 0.9 | 0.7 | 0.6 | 1.7 ± 0.1 |
| COCO | 200 | R@5 | 4.0 | 3.6 | 3.8 | 3.5 | 11.9 ± 0.6 | 2.3 | 2.4 | 2.1 | 2.8 | 6.5 ± 0.4 |
| COCO | 200 | R@10 | 7.2 | 7.7 | 7.5 | 7.0 | 19.4 ± 1.2 | 4.4 | 4.1 | 5.8 | 4.9 | 12.3 ± 0.8 |
| COCO | 500 | R@1 | 1.9 | 1.9 | 2.5 | 2.1 | 5.0 ± 0.4 | 1.1 | 1.7 | 1.1 | 0.8 | 2.5 ± 0.5 |
| COCO | 500 | R@5 | 7.5 | 7.8 | 8.7 | 8.2 | 17.2 ± 1.3 | 5.0 | 5.3 | 6.3 | 5.8 | 8.9 ± 0.7 |
| COCO | 500 | R@10 | 12.5 | 13.7 | 14.3 | 13.0 | 26.0 ± 1.9 | 8.7 | 9.9 | 10.5 | 8.2 | 15.8 ± 1.5 |
| COCO | 1000 | R@1 | 1.9 | 2.4 | 2.4 | 1.9 | 6.8 ± 0.4 | 1.5 | 1.3 | 1.5 | 0.7 | 3.3 ± 0.1 |
| COCO | 1000 | R@5 | 7.6 | 9.0 | 9.0 | 7.7 | 21.9 ± 1.2 | 5.6 | 5.7 | 7.1 | 4.6 | 11.9 ± 0.5 |
| COCO | 1000 | R@10 | 12.7 | 14.0 | 14.1 | 13.0 | 31.0 ± 1.5 | 9.6 | 10.1 | 10.9 | 8.0 | 22.1 ± 0.9 |
Appendix B CIFAR10 Classification vs Retrieval Distillation
Prior work has shown remarkable distillation results on CIFAR10(Krizhevsky etal., 2009) classification. To move from distilling image-only datasets to vision-language datasets, we first check if our method has potential in simple settings. Concretely, we convert CIFAR10 labels to captions that pair with their corresponding images. Under this formulation, the objective of classification is equivalent to that of image-to-text retrieval (TR): finding the best text given an image.
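A sketch of the label-to-caption conversion used for this check; the multi-caption templates below are illustrative stand-ins for the prompts of (Radford et al., 2021):

```python
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

def single_caption(label: int) -> str:
    # Single-caption retrieval: one fixed template per image.
    return f"This is a {CIFAR10_CLASSES[label]}"

def multi_captions(label: int) -> list:
    # Multi-caption retrieval: several prompt templates per image
    # (illustrative templates, not the exact prompt set).
    templates = ["a photo of a {}.", "a blurry photo of a {}.",
                 "a low resolution photo of a {}.", "a photo of the small {}.",
                 "a photo of the large {}."]
    return [t.format(CIFAR10_CLASSES[label]) for t in templates]
```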
In Tab. 7, we compare CIFAR10 distillation performance for distilled dataset sizes of 1, 10, and 50 images per class (IPC), under three settings: classification, single-caption retrieval, and multi-caption retrieval. For classification, we report results from MTT (Cazenavette et al., 2022), where an image-only dataset is distilled using expert trajectories trained on image-label pairs. In single-caption TR, we distill image-caption pairs using expert trajectories trained with each image paired with the single caption "This is a {label}". In multi-caption TR, we distill image-caption pairs with expert trajectories trained with each image paired with five captions generated from varied prompts from (Radford et al., 2021). For consistency, all image trajectories are obtained with the 3-layer ConvNet backbone specified in (Cazenavette et al., 2022), and text trajectories come from linear projection layers over pretrained BERT (Devlin et al., 2018) embeddings. Although the performance of vision-language distillation trails that of image-only distillation, the gap closes at larger IPCs. The remaining gap highlights the challenge of the continuous label space in vision-language datasets. Moreover, the performance gap between single- and multi-caption retrieval demonstrates the difficulty of capturing the variability of human language descriptions.
Table 7: CIFAR10 distillation performance (accuracy / R@1).

| IPC | Classification | TR (Single Caption) | TR (Multi Caption) |
|---|---|---|---|
| 1 | 46.3 ± 0.8 | 27.4 ± 1.0 | 22.3 ± 1.0 |
| 10 | 65.3 ± 0.7 | 35.9 ± 0.7 | 33.2 ± 0.5 |
| 50 | 71.6 ± 0.2 | 66.8 ± 1.1 | 62.0 ± 0.8 |
| Full | 84.8 ± 0.1 | 79.6 ± 0.6 | 80.3 ± 0.4 |
Appendix C Upper Bound Performance
We further increase the distilled set size to 10% of the original Flickr30K dataset and compare the distillation performance with the upper-bound results (Tab. 8). The distillation performance closely approaches the upper bound.
Table 8: Distillation at 10% of the original dataset size vs. upper bound.

| Result Type | Vision Backbone | Language Backbone | Ratio | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|---|---|
| Distillation | NFNet | BERT | 10% | 32.1 | 60.0 | 73.2 | 24.1 | 53.9 | 66.5 |
| Upper Bound | NFNet | BERT | 100% | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 |
| Distillation | NFNet | CLIP | 10% | 60.0 | 86.3 | 91.4 | 47.4 | 78.2 | 86.5 |
| Upper Bound | NFNet | CLIP | 100% | 61.2 | 87.5 | 92.8 | 49.8 | 79.8 | 88.3 |
Appendix D Analysis on Distilled Images
We have found that increasing the learning rate and the distillation time leads to more noticeable changes in the images of the distilled dataset (distilled images: Fig. 4, original images: Fig. 5). However, a higher learning rate or longer distillation time does not necessarily translate to improved performance of the distilled dataset, even if the images appear to deviate more drastically from a human-perception perspective. Changes in image pixels alone are not a reliable predictor of distillation performance; they are rather a measure of the distillation strength. More distorted images suggest uneven pixel updates, while even updates yield results similar to the visualization provided earlier in Fig. 3.
In line with previous studies, we initially expected more obvious changes in the images to lead to better performance, but our findings suggest a different behavior of vision-language distillation within the trajectory matching framework, reflecting how models capture vision-language interactions. From a human-perception perspective, the distilled images appear to change less than in previous classification works, yet those small updates are still meaningful and contain useful information, as opposed to artifacts like noisy patterns. Our algorithm achieves a clear and consistent improvement over random baselines, as indicated by the results. We hope this discussion can inspire more research on vision-language dataset distillation.
Appendix E Additional Ablation Studies
In this section, we provide additional ablation studies. Unless specified, these distillation experiments are conducted on the Flickr30K dataset to distill 100 image-text pairs, and we use pretrained NFNet and BERT as backbones, with synthetic step set to 8 during distillation.
E.1 Distilled Dataset Initialization
In the main paper, we provided experiments with real-sample initialization. Here we evaluate initialization with Gaussian noise. Our findings in Tab. 9 show that initializing images from a Gaussian distribution results in significantly lower performance. The complexity of images, which encode rich information about colors, shapes, textures, and spatial relationships between objects, can make it difficult for models to learn effectively from randomly initialized images. On the other hand, using real text sampled from the training set vs. randomly initialized text embeddings does not make a significant difference. We conjecture that the pretrained language model is good at transforming 'noise' text embeddings into meaningful sentences during the learning process, partly due to the inherent structure and predictability of language. We provide visualizations of the real-image plus 'noise'-text combination in Fig. 6 and Fig. 7 and Tab. E.1. To our surprise, even though the initialized 'noise' texts are not semantically related to the initialized real images, we discovered a substantial degree of semantic similarity between the initialized real images and the learned distilled text. This suggests potential future applications of our method in Visual Question Answering (VQA).
Table 9: Distilled dataset initialization (✓ indicates initialization from real samples).

| Real Image | Real Text | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| ✓ |  | 9.0 | 27.2 | 40.1 | 3.9 | 13.2 | 20.6 |
|  | ✓ | 0.2 | 0.7 | 1.1 | 0.1 | 0.5 | 1.0 |
|  |  | 0.1 | 0.3 | 0.4 | 0.1 | 0.4 | 0.8 |
E.2 Encoder Backbone Selection
In this section, we evaluate the impact of different language/vision backbones on the distillation performance.
E.2.1 Language Backbones
Perhaps not surprisingly, CLIP(Radford etal., 2021) text encoder significantly outperforms BERT in all evaluation metrics, with a striking peak performance in TR R@10 at 92.8% for expert training. This exceptional performance can be mainly attributed to the fact that the pre-trained, off-the-shelf CLIP model is designed to learn a shared embedding space across multi-modalities. Although CLIP also shows a performance drop during distillation, it still retains a relatively high performance recovery ratio. In Sec.G we provide visualization of synthetic data distilled via NFNet and CLIP.
| Language Model | Expert TR R@1 | R@5 | R@10 | Expert IR R@1 | R@5 | R@10 | Distill TR R@1 | R@5 | R@10 | Distill IR R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| CLIP | 61.2 | 87.5 | 92.8 | 49.8 | 79.8 | 88.3 | 31.4 | 58.8 | 72.0 | 17.1 | 41.9 | 56.2 |
E.2.2 Vision Backbones
The vision encoders carry the main gradient flows for the distillation process. We experimented on several vision backbones, and found that the architecture choice strongly influences the distillation quality. Similar to dataset distillation by gradient matching(Zhao & Bilen, 2021b), batch normalization has an impact on the gradient/parameter matching framework. This is mainly because batch normalization incorporates a non-parametric component that can only be accumulated with batches and can not be trained.
| Vision Model | Expert TR R@1 | R@5 | R@10 | Expert IR R@1 | R@5 | R@10 | Distill TR R@1 | R@5 | R@10 | Distill IR R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT (LoRA) | 40.7 | 69.8 | 80.1 | 28.8 | 59.3 | 73.4 | 10.4 | 23.6 | 38.7 | 5.4 | 18.8 | 27.4 |
| NFNet-l0 | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| NFResNet50 | 28.9 | 56.6 | 71.0 | 22.8 | 50.1 | 63.4 | 6.5 | 18.2 | 28.1 | 3.5 | 11.6 | 18.7 |
| NFRegNet | 26.9 | 57.2 | 70.2 | 21.1 | 50.1 | 62.9 | 7.8 | 21.9 | 33.3 | 3.3 | 12.7 | 20.5 |
| ResNet50 | 18.0 | 43.5 | 59.5 | 13.4 | 36.6 | 49.9 | 0.5 | 2.4 | 3.8 | 0.3 | 1.6 | 3.6 |
E.3 Pretrained vs. Non-pretrained
Tab.12 demonstrates the pretraining influence of the backbone encoders. Optimal performance is observed when both language and vision backbones are pretrained. This emphasizes the importance of pretraining before the expert training stage for large models and datasets.
Table 12: Effect of pretraining the backbone encoders (expert training).

| Language Backbone Pretrained | Vision Backbone Pretrained | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 |
| ✓ |  | 4.4 | 14.1 | 20.7 | 3.5 | 11.4 | 18.8 |
|  | ✓ | 0.5 | 1.1 | 1.8 | 0.3 | 0.7 | 1.4 |
|  |  | 0.3 | 1.0 | 1.5 | 0.1 | 0.7 | 1.3 |
E.4 Synthetic Steps
The synthetic step size plays an important role in optimizing the dataset distillation performance, as shown in Tab.13. Using larger synthetic steps tends to achieve better distillation performance.
Table 13: Effect of the number of synthetic steps during distillation.

| #Pairs | #Syn Steps | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| 100 | 1 | 0.5 | 2.1 | 4.4 | 0.3 | 1.5 | 2.8 |
| 100 | 2 | 7.1 | 23.4 | 32.9 | 3.0 | 10.2 | 16.4 |
| 100 | 4 | 8.2 | 24.9 | 35.2 | 3.5 | 12.2 | 20.7 |
| 100 | 8 | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| 200 | 1 | 3.2 | 9.3 | 14.1 | 1.6 | 5.2 | 8.8 |
| 200 | 2 | 6.5 | 19.2 | 29.1 | 1.6 | 5.9 | 10.0 |
| 200 | 4 | 8.2 | 24.5 | 34.4 | 2.2 | 7.4 | 11.8 |
| 200 | 8 | 10.2 | 28.7 | 41.9 | 4.6 | 16.0 | 25.5 |
| 500 | 1 | 6.6 | 18.1 | 25.5 | 2.1 | 10.1 | 16.3 |
| 500 | 2 | 8.0 | 21.7 | 31.3 | 3.8 | 14.9 | 23.2 |
| 500 | 4 | 8.1 | 23.6 | 34.9 | 4.4 | 15.2 | 23.7 |
| 500 | 8 | 13.3 | 32.8 | 46.8 | 6.6 | 20.2 | 30.0 |
| 1000 | 1 | 7.3 | 20.6 | 29.7 | 3.9 | 13.2 | 20.7 |
| 1000 | 2 | 8.8 | 26.8 | 36.6 | 5.7 | 17.4 | 26.4 |
| 1000 | 4 | 10.4 | 29.1 | 37.9 | 6.6 | 19.5 | 29.5 |
| 1000 | 8 | 13.3 | 34.8 | 45.7 | 9.1 | 24.1 | 33.8 |
Appendix F Beyond Trajectory Matching
In this section, we further provide experimental results for a distribution matching (Zhao & Bilen, 2023) baseline adapted to the vision-language setting. Concretely, to use distribution matching for vision-language dataset distillation, we minimize the maximum mean discrepancy (MMD) between the real and distilled data distributions, sampling NFNet image encoders with different initializations together with a pretrained BERT text encoder. As in the distribution matching setting for image classification, we update the distilled data via the MMD for the vision and language modalities so as to match the original data distribution in a family of embedding spaces. We compare our method w/ DM (distribution matching) and our method w/ TM (trajectory matching) on Flickr30K (R@1) in Tab. 14.
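A simplified sketch of the adapted distribution-matching objective, reduced here to matching mean embeddings per modality (a special case of MMD with a linear kernel); sampling of the randomly initialized encoders is assumed to happen outside this function:

```python
import torch

def mmd_matching_loss(real_img_emb: torch.Tensor, syn_img_emb: torch.Tensor,
                      real_txt_emb: torch.Tensor, syn_txt_emb: torch.Tensor) -> torch.Tensor:
    """Squared distance between the mean embeddings of real and distilled data,
    summed over the vision and language modalities; minimizing it w.r.t. the
    distilled data pulls its embedding distribution toward the real one."""
    img_term = (real_img_emb.mean(dim=0) - syn_img_emb.mean(dim=0)).pow(2).sum()
    txt_term = (real_txt_emb.mean(dim=0) - syn_txt_emb.mean(dim=0)).pow(2).sum()
    return img_term + txt_term
```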
Table 14: Our method with distribution matching (DM) vs. trajectory matching (TM) on Flickr30K (R@1).

| # pairs | TR: Ours w/ DM | TR: Ours w/ TM | IR: Ours w/ DM | IR: Ours w/ TM |
|---|---|---|---|---|
| 100 | 3.2 ± 1.8 | 9.9 ± 0.3 | 1.4 ± 0.7 | 4.7 ± 0.2 |
| 200 | 3.3 ± 1.3 | 10.2 ± 0.8 | 1.4 ± 0.4 | 4.6 ± 0.9 |
| 500 | 5.8 ± 1.5 | 13.3 ± 0.6 | 4.1 ± 0.9 | 6.6 ± 0.3 |
| 1000 | 6.1 ± 2.7 | 13.3 ± 1.0 | 4.9 ± 1.8 | 7.9 ± 0.8 |
Looking forward, we hope our method could serve as a roadmap for future studies exploring more complex settings with new state-of-the-art (SOTA) methods. New SOTA dataset distillation methods can adopt low-rank adaptation matching to scale efficiently with large and complex models, and can incorporate bi-trajectory co-distillation to handle textual data more effectively. By doing so, these methods can extend their applicability to previously infeasible models for distillation, such as those involving ViTs, thus improving the scalability and efficiency of the distillation process. New approaches that distill from both text and image data can consider using methods similar to bi-trajectory matching with contrastive loss to learn the interactions and redundancies across multimodalities.
Appendix G Additional Visualizations
Here we include a number of visualizations of the data we distilled from the multimodal dataset (both Flickr30K Tab.G and Fig.8, 9 and COCO Tab.G and Fig.10,11) for a more intuitive understanding of the distilled set. We provide 50 distilled image-text paired examples including their visualization before the distillation process. Unless otherwise stated, these experiments are conducted using 100 distilled pairs, with pretrained NFNet(Brock etal., 2021b) and BERT(Devlin etal., 2018) as backbones and the synthetic step is set to 8 during distillation. We provide visualization of distilled data using NFNet and CLIP in Tab.G and Fig.12,13 in the end.