Xindi Wu1 Byron Zhang1 Zhiwei Deng2 Olga Russakovsky1
1Princeton University 2Google DeepMind
{xindiw, zishuoz, olgarus}@princeton.edu, zhiweideng@google.com
Abstract
Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation almost doubles that to 9.9% with just 100 training pairs, an order of magnitude fewer.
1 Introduction
Data = Information + Irrelevant Data (Wright & Ma, 2022)
Dataset distillation aims to create concise summaries of data that preserve most of the critical information of the entire dataset. It holds paramount importance in the era of big data as it addresses the challenge posed by "Data = Information + Irrelevant Data" (Wright & Ma, 2022), where we often need to extract the useful information from an ocean of non-critical data. Recent dataset distillation methods, e.g., (Wang et al., 2018; Cazenavette et al., 2022; Nguyen et al., 2020), have primarily focused on image classification datasets, capturing class-specific information to build discriminative boundaries. With the recent progress in multimodal machine learning, however, we are witnessing an explosion of vision-language datasets in which the majority of image pixels may belong to irrelevant contextual elements and may further lack corresponding textual descriptions; this creates a pressing need to distill this vast amount of data efficiently. A well-distilled multimodal dataset simplifies complex vision-language interactions and emphasizes the most salient connections, making it easier for models to learn cross-modal representations.
Why is it hard? The first key challenge, and the main difference from prior dataset distillation methods (Wang et al., 2018; Cazenavette et al., 2022), is that vision-language datasets do not contain a discrete set of classes to ground the distillation process. Instead, these datasets contain complex cross-modal connections and redundancies, requiring a co-distillation approach to capture their interdependencies effectively. Second, the complexity of cross-modal representations and vision-language models (VLMs) leads to computational challenges. Prior dataset distillation methods operate on low-resolution images (typically 28×28 or 32×32, as in MNIST (LeCun et al., 1998) or CIFAR (Krizhevsky et al., 2009)) and nevertheless incur significant computational costs, even with simple ConvNets, when creating distilled datasets. Vision-language datasets often contain higher-resolution images, and models designed for vision-language tasks are substantially more complex, such as Vision Transformers (ViTs) (Dosovitskiy et al., 2021). Lastly, text is inherently discrete and non-differentiable, making direct gradient-based optimization on text tokens impossible.
Our work. We propose the first vision-language dataset distillation method. Concretely, given a dataset of images with corresponding text descriptions, our method creates a much smaller synthetic set of (image, text embedding) pairs which can then be used to efficiently train a model that learns the image-text alignment. Since directly extracting the essential information is infeasible, our co-distillation is achieved by implicitly matching by-products of training on the target vision-language data and on the synthetic data; in our case, the by-product is the long-range training bi-trajectory. Additionally, as finetuning pretrained models is widely used in vision-language tasks, we match the trajectories of low-rank adaptation matrices (Hu et al., 2022) for complex models to effectively distill critical information.
Contributions. To the best of our knowledge, this is the first work to tackle vision-language dataset distillation. In doing so, we make the following key contributions:
- 1.
We highlight the challenges of vision-language dataset distillation and establish the first set of baselines for this task by adapting three coreset selection methods(Welling, 2009; Toneva etal., 2019; Farahani & Hekmatfar, 2009; Sener & Savarese, 2018).
- 2.
We propose the Bi-Trajectory Vision-Language Co-Distillation method. Different from prior image classification dataset distillation methods, our method is not restricted to discrete classes and distills vision-language pairs jointly. We leverage Low-Rank Adaptation (LoRA) matching to make it computationally feasible for training with complex models (e.g., ViTs) on high-resolution images.
- 3.
Our method significantly improves image-text retrieval with training set constraints on the challenging Flickr30K(Plummer etal., 2015) and COCO(Lin etal., 2014) datasets. For example, the best coreset selection method (adapted K-center) achieves 5.6% image-to-text retrieval performance (R@1) after selecting 1000 image-text pairs for training. In contrast, our method almost doubles that performance on the same task to 9.9% with an order of magnitude fewer (just 100) distilled image-text pairs.
The growing interest in multimodal datasets makes it even more crucial to develop mechanisms that efficiently and effectively distill insights from different modalities. We hope this work jump-starts further research into the important and challenging space of vision-language dataset distillation.
2 Related Works
Dataset Distillation. The concept of dataset distillation has demonstrated that a handful of synthetic images, although not drawn from the training distribution, can achieve performance comparable to that of the original dataset (Wang et al., 2018). Meta-learning based distillation approaches (Nguyen et al., 2021; Zhou et al., 2022; Deng & Russakovsky, 2022; Nguyen et al., 2020; Vicol et al., 2022) typically use bilevel optimization, where the inner loop trains on the distilled samples and the outer loop optimizes the distilled (meta) dataset. Several works (Zhao & Bilen, 2021b; a; Cazenavette et al., 2022; Jiang et al., 2023; Du et al., 2023; Liu et al., 2023) have explored by-product matching approaches, such as matching the gradients, or the trajectories of the gradients, of models trained on the real and distilled data.
Our work is mostly inspired by trajectory matching methods (Cazenavette et al., 2022; Cui et al., 2023), which are more efficient to optimize since they mostly avoid long unrolling of computation graphs. Rather than aligning model gradients, another line of work (Zhao & Bilen, 2021b; Wang et al., 2022; Lee et al., 2022) aligns feature distributions between real and distilled data using a distribution divergence metric in the latent space. While most prior work focuses on image classification dataset distillation, (Sucholutsky & Schonlau, 2021) explored dataset distillation on text datasets. Our work is the first to scale dataset distillation up to vision-language datasets, which requires creating distilled data that capture critical features and complex relationships within and between the two modalities.
Cross-modal Retrieval. Most cross-modal retrieval methods operate at the representation level and encourage a joint embedding space by measuring the similarities between learned representations across different modalities (Liang et al., 2022; Zhu et al., 2022; Pokle et al., 2022; Chun et al., 2021; Wu et al., 2023). Image-text retrieval focuses on retrieving images given captions, or captions given images (Wang et al., 2020b; Wu et al., 2019). Many techniques have been developed to produce representations that are semantically similar for image-text pairs (Huang et al., 2018; Gu et al., 2018). More advanced image-text alignment methods (Li et al., 2022; Lin et al., 2022; Pandey et al., 2022) that incorporate pretraining have shown promising results on image-text retrieval tasks. We evaluate our vision-language dataset distillation method on image-text retrieval tasks.
Vision-language Knowledge Distillation. Prior efforts on vision-language distillation are primarily centered around knowledge distillation, which transfers knowledge from a larger teacher model to a smaller student model to improve the latter's performance (Xue et al., 2023; Radenovic et al., 2023; Valverde et al., 2021). Our dataset distillation study addresses an orthogonal question and is fundamentally a pragmatic compression problem: we aim to find the equivalent bits that can represent an entire vision-language dataset.
3 Method
We propose a vision-language dataset distillation method for distilling a large-scale dataset consisting of (image, text) pairs into a smaller dataset, while maintaining much of the original dataset’s information relevant to training vision-language models (VLMs). The detailed method is in Fig.2.
3.1 Problem Formulation
Consider a large-scale dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i$ denotes an image and each $y_i$ denotes its corresponding text description; note that in practice, $y_i$ may be a set $\{y_i^{(1)}, \dots, y_i^{(M)}\}$, where $M$ is the number of descriptions associated with each image. Our goal is to learn a much smaller dataset $\hat{\mathcal{D}} = \{(\hat{x}_j, \hat{y}_j)\}_{j=1}^{\hat{N}}$, with significantly fewer data pairs ($\hat{N} \ll N$), that still captures most of the essential information needed to train a VLM effectively. For $\hat{\mathcal{D}}$, we aim to use one (instead of $M$) sentence per image in the distilled set for a more compact representation. Concretely, consider a VLM with vision encoder $f$ and language encoder $g$, parameterized jointly by $\theta$. This model can be trained by optimizing the similarity loss $\ell$, which encourages alignment between the image and text embeddings:

$$\min_{\theta}\; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_{\theta}(x_i),\, g_{\theta}(y_i)\big). \qquad (1)$$

Our goal is to distill a dataset $\hat{\mathcal{D}}$ such that the model trained with $\hat{\mathcal{D}}$ obtains vision-language matching performance comparable to the one trained on $\mathcal{D}$. More specifically, consider a metric $m$ defined to quantify the correlation between the model's representation of a given image $x$ and its representation of a given text $y$; this correlation should match the actual similarity between the image-text pair, i.e., whether the pair is a positive (matching) or a negative (non-matching) pair. Given the test dataset $\mathcal{D}_{\text{test}}$, our objective can be defined as follows:

$$\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{test}}}\Big[ m\big(f_{\theta^{\mathcal{D}}}(x),\, g_{\theta^{\mathcal{D}}}(y)\big) \Big] \;\simeq\; \mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{test}}}\Big[ m\big(f_{\theta^{\hat{\mathcal{D}}}}(x),\, g_{\theta^{\hat{\mathcal{D}}}}(y)\big) \Big], \qquad (2)$$

where $\theta^{\mathcal{D}}$ represents the optimal model parameters from training on the entire dataset $\mathcal{D}$, and $\theta^{\hat{\mathcal{D}}}$ denotes the parameters from training on the distilled dataset $\hat{\mathcal{D}}$. Importantly, even when the model is trained on the distilled dataset $\hat{\mathcal{D}}$, we still evaluate its performance on the original $\mathcal{D}_{\text{test}}$ for a fair measurement. When creating the dataset $\hat{\mathcal{D}}$, the pairs can be subsampled from the original set $\mathcal{D}$, as described in the coreset selection methods below (Sec. 3.2). We propose a much more effective strategy in Sec. 3.3 to learn synthetic image-text pairs, which can be more information-rich.
Connection with Image-only Dataset Distillation. Traditionally, dataset distillation is tailored for classification tasks with discrete labels, each of which possesses a distinctive set of distilled data that enables efficient learning while preserving important information. We take this concept a step further to the multimodal scenario, where we distill information from both vision and language data. This involves creating synthetic data that capture critical relationships within and between these two modalities. As opposed to merely classifying discrete labels, we are working with a more complex, interconnected dataset where the relation between modalities is crucial. Our method considers the image-text correlation and how the two modalities influence each other. It is worth noting that distillation is far less effective with single-modality optimization (see Sec. 4.3).
3.2 Baselines: Coreset Selection
Since, to the best of our knowledge, there is no pre-existing work in the domain of vision-language dataset distillation, we begin by formulating a set of baselines to construct the smaller dataset . These baselines are based on coreset selection methods, where a subset of the training pairs is chosen, up to a given budget of pairs, as to maximize the “informativeness” of the selected subset. We consider three such methods, adapted from prior work.
Herding (Welling, 2009). Herding selects data points based on the distance between the coreset center and the original dataset center in feature space. It greedily adds one sample at a time to the coreset so as to minimize the distance between the two centers. We use pretrained encoders to extract features from the image-text pairs, concatenate the features, and compute the dataset center in feature space by averaging all feature vectors. We start with an empty coreset and, at each iteration, add the image-text pair that is closest in Euclidean distance to the current coreset center, recalculating the coreset center after adding each data point.
K-center (Farahani & Hekmatfar, 2009; Sener & Savarese, 2018). Unlike Herding, which tracks a single center, K-center selects training examples that are maximally separated. Concretely, we concatenate the features of each image-text pair and start by randomly selecting a single data point. Then, at each iteration, until K points are selected, we add the image-text pair that is furthest in Euclidean distance from its nearest selected example. The drawback of this method is its high computational cost, especially with large datasets, as it requires heavy distance calculations between data points at each iteration.
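As a rough sketch of how we adapt this selection to image-text pairs, the following is a minimal greedy K-center routine over concatenated image-text features; the feature extraction happens outside this snippet and the function name is ours:

```python
import torch

def k_center_select(features: torch.Tensor, k: int) -> list[int]:
    """Greedy K-center selection over concatenated image-text features.

    features: (N, D) tensor, one row per image-text pair.
    Returns the indices of the k selected pairs.
    """
    n = features.size(0)
    # Start from a randomly chosen pair.
    selected = [torch.randint(n, (1,)).item()]
    # Distance from every pair to its nearest selected center.
    min_dist = torch.cdist(features, features[selected]).squeeze(1)
    for _ in range(k - 1):
        # Pick the pair that is furthest from all currently selected centers.
        idx = torch.argmax(min_dist).item()
        selected.append(idx)
        new_dist = torch.cdist(features, features[idx:idx + 1]).squeeze(1)
        min_dist = torch.minimum(min_dist, new_dist)
    return selected
```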
Forgetting (Toneva et al., 2019). The core idea is to identify reliable training data that the original model consistently learns well. During each training epoch, we check how accurately the model predicts every image-text pair for a specific task (i.e., image-text retrieval). A forgetting event is registered for an image-text pair when the model predicts it correctly in one epoch but fails in the next. Throughout training, we track these forgetting events for each pair and keep the pairs with the fewest forgetting events.
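For concreteness, a minimal sketch of how forgetting events can be tracked per image-text pair across epochs; the per-epoch correctness signal (e.g., an R@1 hit indicator) is assumed to be computed elsewhere:

```python
import torch

def update_forgetting_events(correct_now: torch.Tensor,
                             correct_prev: torch.Tensor,
                             forgetting_counts: torch.Tensor) -> torch.Tensor:
    """Register a forgetting event for pairs that were retrieved correctly
    in the previous epoch but not in the current one.

    correct_now / correct_prev: (N,) boolean tensors, one entry per pair.
    forgetting_counts: (N,) running event counts, updated in place.
    """
    forgetting_counts += (correct_prev & ~correct_now).long()
    return forgetting_counts

# After training, the coreset keeps the pairs with the fewest forgetting events,
# e.g. torch.topk(-forgetting_counts, k=budget).indices for some budget.
```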
3.3 Bi-trajectory Guided Vision-Language Co-Distillation
The coreset selection methods described above, while effective to some extent, have inherent limitations: they can only select a subset of the training dataset $\mathcal{D}$. This restriction leads to less effective results compared to our method, which has the flexibility to generate an optimized distilled dataset $\hat{\mathcal{D}}$, and whose learning process efficiently extracts the most essential information embedded in $\mathcal{D}$. Not only does this decrease storage and computational requirements, it also improves the performance of the model trained on the distilled dataset.
Here we describe our vision-language dataset distillation framework, building on the idea of matching training trajectories (MTT) (Cazenavette et al., 2022) developed for distilling image classification datasets. The core idea of trajectory matching is that, since direct information extraction is infeasible, dataset distillation can be achieved by implicitly matching a by-product: the parameter trajectories induced by the distilled dataset and by the original full dataset. We compute a loss on the cumulative discrepancy between the expert parameter trajectory obtained from the model trained on the full dataset $\mathcal{D}$ and the parameters obtained from the model trained on the distilled dataset $\hat{\mathcal{D}}$, and use that loss to guide the creation of a better $\hat{\mathcal{D}}$, one whose induced parameters match the expert trajectory more closely. The approach consists of two stages:
- 1.
Obtaining the expert training trajectories $\{\tau^{*}\}$, with each trajectory $\tau^{*} = \{\theta^{*}_{t}\}_{t=0}^{T}$, by training multiple models for $T$ epochs on the full dataset $\mathcal{D}$. For our multimodal setting, the models are trained using the bidirectional contrastive loss described below.
- 2.
Training a set of student models on the current distilled dataset $\hat{\mathcal{D}}$ using the same bidirectional contrastive loss, and then updating $\hat{\mathcal{D}}$ based on the bi-trajectory matching loss between the student models' parameter trajectories $\hat{\theta}$ and the expert trajectories $\theta^{*}$.
Bidirectional Contrastive Loss. We train both expert and student VLMs using a bidirectional contrastive loss, following the formulation of (Radford et al., 2021), as it is effective for learning a shared image-text representation. Concretely, given a batch of $n$ image-text pairs $\{(x_i, y_i)\}_{i=1}^{n}$, either from the real dataset $\mathcal{D}$ or from the synthetic distilled dataset $\hat{\mathcal{D}}$, we jointly learn the encoders $f$ and $g$ such that the cosine similarity of all correct image-text pairs is high and that of incorrect pairs is low. We define the cosine similarity between image $x_i$ and text $y_j$ as $\alpha_{ij} = \frac{f(x_i) \cdot g(y_j)}{\|f(x_i)\|\,\|g(y_j)\|}$. We then compute a bidirectional contrastive loss composed of an image-to-text matching loss and a text-to-image matching loss, following the form of the InfoNCE loss (Oord et al., 2018):

$$\mathcal{L}_{\text{contrastive}} = -\frac{1}{2n}\sum_{i=1}^{n}\left(\log\frac{\exp(\alpha_{ii}/\tau)}{\sum_{j=1}^{n}\exp(\alpha_{ij}/\tau)} + \log\frac{\exp(\alpha_{ii}/\tau)}{\sum_{j=1}^{n}\exp(\alpha_{ji}/\tau)}\right), \qquad (3)$$

where $\tau$ is a temperature parameter.
To imitate the effect of training data on parameter trajectories, we use the same objective function to guide the parameter updates during both expert training (stage 1) and distillation (stage 2). Notably, while hard negative mining is typically used in conjunction with contrastive losses, here we rely fully on the dataset distillation process itself without additional intervention. This process inherently accounts for hard negatives: it distills samples that act as hard negatives for others, and these eventually become effective samples for learning. Dataset distillation can thus potentially bypass the complexities of traditional hard negative mining through the learning process.
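A minimal PyTorch-style sketch of the bidirectional contrastive objective in Eqn. 3, with the encoders abstracted as precomputed embeddings; the temperature value here is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb: torch.Tensor,
                                   txt_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of matching image-text pairs.

    img_emb, txt_emb: (B, D) embeddings from the vision / language encoders,
    where row i of each tensor comes from the same pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image-to-text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text-to-image
    return 0.5 * (loss_i2t + loss_t2i)
```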
Bi-Trajectory Matching Loss. Following the formulation of MTT (Cazenavette et al., 2022), we randomly sample image-text pairs from $\mathcal{D}$ to initialize the distilled dataset $\hat{\mathcal{D}}$ (more details can be found in Sec. 4.1). We sample an expert trajectory $\{\theta^{*}_{t}\}$ and a random starting epoch $t$ to initialize the student parameters $\hat{\theta}_{t} = \theta^{*}_{t}$. We train the student model on the distilled dataset $\hat{\mathcal{D}}$ for $N$ steps to obtain $\hat{\theta}_{t+N}$. We then update the distilled dataset based on the bi-trajectory matching loss, computed as the accumulated difference between the student trajectory and the expert trajectory for both the image and text encoders:

$$\mathcal{L}_{\text{match}} = \frac{\|\hat{\theta}^{\,\mathrm{img}}_{t+N} - \theta^{*\,\mathrm{img}}_{t+M}\|_2^2}{\|\theta^{*\,\mathrm{img}}_{t} - \theta^{*\,\mathrm{img}}_{t+M}\|_2^2} + \frac{\|\hat{\theta}^{\,\mathrm{txt}}_{t+N} - \theta^{*\,\mathrm{txt}}_{t+M}\|_2^2}{\|\theta^{*\,\mathrm{txt}}_{t} - \theta^{*\,\mathrm{txt}}_{t+M}\|_2^2}, \qquad (4)$$

where $\theta^{*}_{t+M}$ denotes the expert parameters $M$ epochs after the starting epoch $t$.
We update the distilled dataset $\hat{\mathcal{D}}$ by back-propagating through the $N$ gradient descent updates of the student parameters, specifically into the image pixel space and the text embedding space, with respect to Eqn. 4. We initialize the continuous sentence embeddings using a pretrained BERT model and update the distilled text in this continuous embedding space. For the distilled images, we directly update the pixel values. The full details are in Algorithm 1.
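A simplified sketch of the bi-trajectory matching loss in Eqn. 4; the flattened-parameter handling and variable names are illustrative rather than the exact Algorithm 1:

```python
import torch

def bi_trajectory_matching_loss(student_end, expert_end, expert_start):
    """Eqn. 4: normalized squared parameter distance, summed over the image
    and text encoders. Each argument is a list of flattened parameter
    vectors, e.g. [image_encoder_params, text_projection_params]."""
    loss = torch.zeros(())
    for s_end, e_end, e_start in zip(student_end, expert_end, expert_start):
        loss = loss + (s_end - e_end).pow(2).sum() / (e_start - e_end).pow(2).sum()
    return loss

# Toy usage with random "parameters". In the real pipeline, the student
# parameters at step t+N are a differentiable function of the distilled image
# pixels and text embeddings, so this loss is back-propagated into them.
theta_img_start, theta_img_target = torch.randn(1000), torch.randn(1000)
theta_txt_start, theta_txt_target = torch.randn(500), torch.randn(500)
student_img, student_txt = torch.randn(1000), torch.randn(500)
loss = bi_trajectory_matching_loss(
    [student_img, student_txt],
    [theta_img_target, theta_txt_target],
    [theta_img_start, theta_txt_start])
```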
Low-Rank Adaptation Matching. Further, for complex image encoders such as Vision Transformers (ViTs) (Dosovitskiy et al., 2021), bi-trajectory matching does not work well, due to the high dimensionality of the embeddings and the large number of parameters saved in the trajectories compared to models like NFNet. To mitigate this issue, we propose Low-Rank Adaptation (LoRA) matching, which matches the trajectories of only a small subset of the model's parameters through low-rank matrices. LoRA is effective for finetuning pretrained models (Hu et al., 2022): it introduces trainable low-rank matrices into the weight matrices of specific layers of the pretrained model. LoRA matching then optimizes the trajectories of these low-rank adapters instead of the full parameters.
Given the weight matrix $W \in \mathbb{R}^{d \times d}$ of a certain layer in the ViT model, we introduce two low-rank matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ for that layer's weight matrix, where $d$ is the dimension of $W$ and $r \ll d$ represents the rank. The adaptation is performed by modifying $W$ to $W' = W + AB$, where $W'$ denotes the adapted weight matrix. We only train and save the weights of $A$ and $B$ in the expert trajectories and match the student trajectories with Eqn. 4. For example, the ViT model we used is vit_base_patch16_224, which has 86 million parameters; with LoRA, the trainable parameters are reduced to 18 million, cutting 78.71% of the parameters. This allows for efficient adaptation of the model with minimal additional parameters. With LoRA matching, we can focus on a smaller set of parameters and efficiently optimize the distilled dataset $\hat{\mathcal{D}}$ during distillation while maintaining the capacity to preserve critical information.
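A minimal sketch of attaching LoRA matrices to a linear layer so that only $A$ and $B$ are trainable and enter the saved trajectories; the rank and initialization scale are illustrative choices, not the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with trainable low-rank matrices A and B,
    so the adapted weight is W' = W + A @ B; only A and B require gradients."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze W (and bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_in))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying the adapted weight W + A @ B.
        return self.base(x) + x @ (self.A @ self.B).t()

layer = LoRALinear(nn.Linear(768, 768), rank=16)
lora_params = [p for p in layer.parameters() if p.requires_grad]  # only A, B
```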
4 Experiments
In this section, we first describe the cross-modal retrieval test-bed in Sec.4.1. We use it to evaluate our vision-language dataset co-distillation performance. We then compare our method to baseline coreset selection approaches and provide the key quantitative, qualitative results, and cross-architecture generalization results in Sec.4.2. We further conduct a set of ablation studies in Sec.4.3.
4.1 Vision-Language Distillation Setup
Datasets and Tasks. We evaluate our method on standard vision-language datasets: Flickr30K (Plummer et al., 2015) and COCO (Lin et al., 2014), which are widely used for image-text retrieval tasks. We use them for expert training (stage 1) and distillation (stage 2). We adopt the Karpathy split (Karpathy & Fei-Fei, 2015): 29k/1k/1k train/validation/test images for Flickr30K and 113k/5k/5k for COCO. Each image is paired with five captions. We retrieve the closest matches from one modality, using cosine distance, given a query from the other. We use R@K (for K = 1, 5, 10) to compute the fraction of queries for which the correct result appears among the top K retrieved items. Before moving from distilling image-only datasets to vision-language datasets, we validate in Appendix Sec. B that our method has potential in the classic image classification setting.
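For reference, a small sketch of the retrieval metric, assuming precomputed L2-normalized embeddings and, for simplicity, a single matching caption per image:

```python
import torch

def recall_at_k(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int) -> float:
    """Image-to-text R@K: fraction of images whose matching caption appears
    among the top-K captions ranked by cosine similarity (embeddings are
    assumed L2-normalized; row i of both tensors is a matching pair)."""
    sims = img_emb @ txt_emb.t()                          # (N_img, N_txt)
    topk = sims.topk(k, dim=1).indices                    # (N_img, k)
    targets = torch.arange(img_emb.size(0), device=img_emb.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```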
Network Architectures. We primarily use a pretrained and trainable NormalizerFree ResNet (NFNet) (Brock et al., 2021b) as the image backbone, following Flamingo (Alayrac et al., 2022), as well as a Vision Transformer (ViT); for the text backbone, we use a pretrained and frozen BERT (Devlin et al., 2018). Ablation studies on different backbones are in Appendix Sec. E.2. While both encoders are pretrained, they are pretrained only on unimodal data with no exposure to the other modality. Each encoder is followed by a trainable linear projection layer with random initialization. Using a trainable BERT adds complexity that is orthogonal to vision-language dataset distillation and is out of the scope of this work. Pretrained models serve as a common foundation and a good starting point; see Appendix Sec. E.3 for details.
Implementation. For expert training, we train on a single RTX 3090 GPU for 10 epochs, where a single epoch takes 40 minutes of wall-clock time. Sampling from a set of trajectories encourages the distilled dataset to include diverse information and avoids overfitting to a particular step, so we save 20 image-text bi-trajectories. Distillation takes 6-15 GPU hours depending on the setting (e.g., the number of distilled pairs) on an 8-GPU A6000 node. We initialize a trainable learning rate for the student model at 0.1. We follow the data augmentation techniques in (Li et al., 2022), including resizing, cropping, flipping, and RandomAugment. We use SGD with momentum 0.5; the learning rates for updating the trainable student learning rate, the distilled image pixels, and the distilled text embeddings are 1e-02, 1000, and 1000, respectively.
Initialization. Following prior studies (Nguyen et al., 2020; Zhou et al., 2022), we initialize the distilled set with randomly selected real samples: we randomly select image-text pairs from the original dataset, with images at 224 × 224 resolution and 768-dimensional sentence embeddings obtained via pretrained BERT. Our findings in Appendix Sec. E.1 show that initializing images from a Gaussian distribution results in significantly lower performance; the complexity of images makes learning from random initialization challenging. In contrast, there is little difference in performance between real and randomly initialized text embeddings. Surprisingly, despite the initial lack of semantic correspondence between 'noise' texts and real images, we find notable semantic similarity between the distilled text and the real images, suggesting potential applications of our method in Visual Question Answering.
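A sketch of this initialization, assuming the images and raw captions are already loaded; the mean-pooling of BERT token embeddings is an illustrative assumption:

```python
import torch
from transformers import BertTokenizer, BertModel

def init_distilled_set(images: torch.Tensor, captions: list, m: int):
    """Initialize m distilled pairs from randomly chosen real samples:
    images stay in pixel space (224x224), captions become 768-d BERT
    sentence embeddings; both become trainable distilled variables."""
    idx = torch.randperm(images.size(0))[:m]
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()
    with torch.no_grad():
        tokens = tokenizer([captions[int(i)] for i in idx], padding=True,
                           truncation=True, return_tensors="pt")
        txt_emb = bert(**tokens).last_hidden_state.mean(dim=1)   # (m, 768)
    distilled_images = images[idx].clone().requires_grad_(True)
    distilled_texts = txt_emb.clone().requires_grad_(True)
    return distilled_images, distilled_texts
```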
Table 1: Image-to-text (TR) and text-to-image (IR) retrieval R@1 for coreset selection baselines (R: random, H: herding, K: K-center, F: forgetting) vs. our distillation (Dist).

| Dataset | #pairs | TR R | TR H | TR K | TR F | TR Dist (ours) | IR R | IR H | IR K | IR F | IR Dist (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | 100 | 1.3 | 1.1 | 0.6 | 1.2 | 9.9 ± 0.3 | 1.0 | 0.7 | 0.7 | 0.7 | 4.7 ± 0.2 |
| Flickr30K | 200 | 2.1 | 2.3 | 2.2 | 1.5 | 10.2 ± 0.8 | 1.1 | 1.5 | 1.5 | 1.2 | 4.6 ± 0.9 |
| Flickr30K | 500 | 5.2 | 5.1 | 4.9 | 3.6 | 13.3 ± 0.6 | 2.4 | 3.0 | 3.5 | 1.8 | 6.6 ± 0.3 |
| Flickr30K | 1000 | 5.2 | 5.0 | 5.6 | 3.1 | 13.3 ± 1.0 | 3.8 | 4.1 | 4.4 | 3.2 | 7.9 ± 0.8 |
| COCO | 100 | 0.8 | 0.8 | 1.4 | 0.7 | 2.5 ± 0.3 | 0.3 | 0.5 | 0.4 | 0.3 | 1.3 ± 0.1 |
| COCO | 200 | 1.0 | 1.0 | 1.2 | 1.1 | 3.3 ± 0.2 | 0.6 | 0.9 | 0.7 | 0.6 | 1.7 ± 0.1 |
| COCO | 500 | 1.9 | 1.9 | 2.5 | 2.1 | 5.0 ± 0.4 | 1.1 | 1.7 | 1.1 | 0.8 | 2.5 ± 0.5 |
| COCO | 1000 | 1.9 | 2.4 | 2.4 | 1.9 | 6.8 ± 0.4 | 1.5 | 1.3 | 1.5 | 0.7 | 3.3 ± 0.1 |
4.2 Key Results
Quantitative Results. As shown in Tab. 1 and Tab. 6 in Appendix Sec. A, although there is relatively little variation in performance across the coreset selection baselines we compare to, dataset distillation outperforms the best alternative by anywhere from 138% (improving R@1 from 5.6 for K-center (Farahani & Hekmatfar, 2009) to 13.3 for our method) to 661% (improving R@1 from 1.3 for random selection to 9.9 for our method). The relative improvement increases when fewer pairs are used for training.
Moreover, as shown in Tab. 6, with 1000 pairs, almost 30 times fewer examples than in the original dataset, our distillation approach reaches 43.7 R@10 for TR, relative to a practical upper bound of 75.2, and 34.4 R@10 for IR, relative to an upper bound of 69.7. We also observe that performance among the baseline coreset selection methods varies only slightly, with no single method consistently outperforming the others across all pair sizes and retrieval metrics, and often matching or underperforming random selection. This suggests that coreset selection has inherent limitations in multimodal settings. In comparison, our bi-trajectory co-distillation method is optimized for the vision-language alignment setting and thus performs significantly better. Our results show the effectiveness of distilled data, achieving strong performance with significantly fewer examples.
Table 2: ViT distillation performance with and without LoRA trajectory matching.

| Dataset | #Pairs | w/o LoRA TR R@1 | R@5 | R@10 | w/o LoRA IR R@1 | R@5 | R@10 | w/ LoRA TR R@1 | R@5 | R@10 | w/ LoRA IR R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | 100 | 1.5 | 2.5 | 4.5 | 0.6 | 1.2 | 2.3 | 10.4 | 23.6 | 38.7 | 5.4 | 18.8 | 27.4 |
| Flickr30K | 200 | 1.8 | 3.9 | 6.4 | 0.8 | 1.5 | 2.7 | 11.2 | 24.5 | 41.5 | 6.4 | 19.4 | 29.4 |
| Flickr30K | 500 | 2.1 | 4.3 | 7.2 | 1.5 | 2.1 | 3.6 | 13.4 | 27.8 | 43.4 | 7.6 | 21.1 | 32.7 |
| Flickr30K | 1000 | 3.3 | 5.8 | 7.9 | 1.5 | 2.3 | 3.9 | 15.8 | 29.7 | 45.9 | 8.1 | 23.4 | 35.8 |
| COCO | 100 | 0.5 | 0.9 | 2.1 | 0.3 | 0.7 | 1.4 | 5.1 | 17.4 | 27.1 | 2.3 | 8.1 | 14.5 |
| COCO | 200 | 0.8 | 1.5 | 3.5 | 0.3 | 0.8 | 1.8 | 6.8 | 19.3 | 28.5 | 2.9 | 9.5 | 18.4 |
| COCO | 500 | 1.2 | 2.3 | 4.1 | 0.5 | 1.1 | 2.3 | 7.4 | 21.4 | 29.4 | 3.8 | 11.2 | 19.6 |
| COCO | 1000 | 1.5 | 2.7 | 4.5 | 0.7 | 1.5 | 2.9 | 9.9 | 22.5 | 32.8 | 4.7 | 12.7 | 20.2 |
We compare the performance of the ViT model (vit_base_patch16_224) with and without LoRA trajectory matching, using BERT as the language encoder, on the Flickr30K dataset in Tab. 2. Interestingly, vanilla ViT struggles under distillation, potentially due to its attention mechanisms. With LoRA matching, for 100 pairs the TR R@1 jumps to 10.4 and the IR R@1 to 5.4. With 1000 pairs, the improvement is even more noticeable: TR R@1 increases to 15.8 and IR R@1 to 8.1. These results show that LoRA trajectory matching is much more effective for distilling critical information. We report the practical upper- and lower-bound performance in Tab. 3.
Qualitative Results. In Fig. 3 we visualize distilled image-text pairs drawn from the 100 distilled pairs of Flickr30K after 2000 distillation steps. We visualize the distilled text embeddings via their nearest-neighbor sentences (by cosine similarity) in the training set embedding space for a more intuitive understanding. Additional visualizations are in Appendix Sec. G. Compared to the original images, the distilled images add high-frequency components that help improve generalization performance (Wang et al., 2020a). The distilled texts maintain semantic components associated with the distilled images and capture key attributes, e.g., "couple", "kiss", "man", "surf", "huge wave", but they also deviate from the original sentence embeddings, as they do not match any of the original five captions paired with the images. The improved performance indicates that both high-frequency and semantic components are perceived by the models and significantly help in aligning the vision and language modalities.
Table 3: Practical lower and upper bounds.

| Dataset | Setting | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| Flickr30K | Lower bound: random ranking | 0.1 | 0.6 | 1.1 | 0.1 | 0.5 | 1.0 |
| Flickr30K | Upper bound: NFNet + BERT | 33.9 | 65.1 | 75.2 | 27.3 | 57.1 | 69.7 |
| Flickr30K | Upper bound: ViT (LoRA) + BERT | 42.7 | 72.9 | 83.5 | 31.8 | 62.8 | 74.5 |
| COCO | Lower bound: random ranking | 0.02 | 0.1 | 0.2 | 0.02 | 0.1 | 0.2 |
| COCO | Upper bound: NFNet + BERT | 19.6 | 45.6 | 59.5 | 16.9 | 41.9 | 55.9 |
| COCO | Upper bound: ViT (LoRA) + BERT | 22.6 | 50.8 | 64.8 | 19.1 | 44.7 | 58.7 |
Table 4 (top): Cross-architecture generalization when distilling with NFNet.

| Distill | Evaluate | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| NFNet | NFNet | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| NFNet | NF-ResNet50 | 5.2 | 14.7 | 21.2 | 4.5 | 13.8 | 21.2 |
| NFNet | NF-RegNet | 3.6 | 9.7 | 15.5 | 2.5 | 8.6 | 14.0 |
| NFNet | ViT | 3.1 | 8.6 | 13.2 | 2.3 | 7.4 | 13.3 |
Table 4 (bottom): Cross-architecture generalization when distilling with ViT.

| Distill | Evaluate | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| ViT | ViT | 10.4 | 23.6 | 38.7 | 5.4 | 18.8 | 27.4 |
| ViT | NF-ResNet50 | 2.8 | 8.3 | 12.2 | 2.0 | 6.7 | 11.5 |
| ViT | NF-RegNet | 3.7 | 8.4 | 14.1 | 1.9 | 5.9 | 9.2 |
| ViT | NFNet | 4.4 | 12.6 | 20.3 | 2.6 | 7.3 | 13.9 |
Table 5: Unimodal distillation (T: text-only, I: image-only) vs. our co-distillation (Ours).

| # pairs | TR R@1 (T / I / Ours) | TR R@5 (T / I / Ours) | TR R@10 (T / I / Ours) | IR R@1 (T / I / Ours) | IR R@5 (T / I / Ours) | IR R@10 (T / I / Ours) |
|---|---|---|---|---|---|---|
| 100 | 1.3 / 3.5 / 9.9 | 3.5 / 11.5 / 28.3 | 5.9 / 17.4 / 39.1 | 0.5 / 1.6 / 4.7 | 2.1 / 5.6 / 15.7 | 3.4 / 9.7 / 24.6 |
| 200 | 1.4 / 4.5 / 10.2 | 4.8 / 12.8 / 28.7 | 8.2 / 21.7 / 41.9 | 0.7 / 2.0 / 4.6 | 2.7 / 8.1 / 16.0 | 4.7 / 13.0 / 25.5 |
| 500 | 6.6 / 6.5 / 13.3 | 19.5 / 19.4 / 32.8 | 30.4 / 28.9 / 46.8 | 3.8 / 3.8 / 6.6 | 13.5 / 12.4 / 20.2 | 20.8 / 19.9 / 30.0 |
| 1000 | 7.7 / 5.0 / 13.3 | 20.7 / 17.4 / 34.8 | 31.2 / 24.9 / 45.7 | 4.0 / 3.9 / 9.1 | 13.3 / 13.1 / 24.1 | 20.1 / 20.1 / 33.8 |
Cross-Architecture Generalization. Following previous works (Cazenavette et al., 2022; Cui et al., 2023; Zhao & Bilen, 2023), we evaluate the ability of our distilled data to generalize to training unseen architectures. The experiments are conducted on Flickr30K with 100 distilled pairs. Distilling with the NFNet model, we report the cross-architecture generalization performance on NF-ResNet50 (Brock et al., 2021a), NF-RegNet (Xu et al., 2022), and ViT (Dosovitskiy et al., 2021) (LoRA). As shown in Tab. 4, our method transfers well across different models.
4.3 Ablation Studies
We conduct a set of ablation studies to understand unimodal distillation vs. co-distillation, distilled dataset initialization (Sec.E.1), different encoder backbones (Sec.E.2), pretraining (Sec.E.3), synthetic steps (Sec.E.4), and their influence on distillation.
We compare co-distillation with unimodal distillation, where we keep one of the modalities fixed during distillation. Tab. 5 shows the retrieval performance of text-only distillation, image-only distillation, and co-distillation. Across all tasks and metrics, co-distillation clearly outperforms the others. We also observe that text-only distillation performs worse than image-only distillation. This may not be surprising: a text description typically captures only a small, salient portion of the visual information, whereas the descriptions in the evaluated datasets contain little information that cannot be inferred from the images. Distilling images toward their text-relevant aspects can therefore highlight essential image features. Thus, if we interpret each original image as carrying substantially more information than its original sentence, we would expect image-only distillation to perform better in a smaller-scale regime (removing spurious information) and text-only distillation to perform better in a larger-scale regime (adding useful details).
In contrast, co-distillation allows the synthetic dataset to further optimize for compact representation and efficient storage, removing redundant information between examples in smaller-scale contexts and adding information not present in the selected original images in larger-scale contexts. Our co-distillation method, which jointly distills the text and image modalities during training, consistently outperforms single-modality distillation across different numbers of training pairs and metrics. While the improvement is consistent, it is particularly substantial with fewer pairs: in the 100- and 200-pair rows, co-distillation outperforms its unimodal alternatives by over 2×. In fact, co-distillation with 100 pairs consistently outperforms unimodal distillation with 1000 pairs. These results demonstrate the effectiveness of jointly distilling across modalities and highlight the complementary nature of multimodal data.
5 Conclusion
In this work, we propose the first vision-language dataset distillation method. By co-distilling both vision and language modalities, we can progressively optimize and distill the most critical information from a vision-language dataset. Our experiments show that co-distilling different modalities via bi-trajectory matching and using LoRA matching for complex model finetuning hold promise. We hope that the insights we gathered can serve as a roadmap for future studies exploring more complex settings. Furthermore, we believe our work lays the groundwork for future research aimed at understanding the minimum information required for a vision-language model to achieve comparable performance quickly, thereby building a better understanding of the compositionality of compact visual-linguistic knowledge.
Limitations. We note two limitations of our approach. First, dataset distillation is not exempt from the "No Free Lunch" theorem (Wolpert & Macready, 1997). As discussed in (Sachdeva & McAuley, 2023), we also observed that the effectiveness of the distilled data is highly influenced by the learning algorithms and models used during distillation, which could lead to poor transferability. Furthermore, many dataset distillation methods are computationally intensive, e.g., the bi-level optimization in meta-learning distillation approaches, which is another major challenge. Our trajectory matching approach is significantly less computationally demanding, yet we observed that larger synthetic step counts often result in improved performance; exploring closed-form solutions, e.g., implicit gradient-based methods (Lorraine et al., 2020), could be a promising future direction.
Broader Impact Statement.Our exploration focuses on scientific understanding and practical applications of vision-language dataset distillation. While our work does not directly imply negative impacts, it may indirectly propagate existing biases in the original datasets. Therefore, it is important to incorporate rigorous bias-mitigation measurements for dataset distillation. Discussion on these critical aspects should remain a priority as we further explore the potential of vision-language dataset distillation.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grants No. 2107048 and No. 2112562. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We thank many people from Princeton Visual AI lab (Allison Chen, Jihoon Chung, Tyler Zhu, Ye Zhu, William Yang and Kaiqu Liang) and Princeton NLP group (Carlos E. Jimenez, John Yang), as well as Tiffany Ling, George Cazenavette and Ilia Sucholutsky for their helpful feedback.
References
- Alayrac etal. (2022)Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etal.Flamingo: a visual language model for few-shot learning.NeurIPS, 2022.
- Brock etal. (2021a)Andrew Brock, Soham De, and SamuelL Smith.Characterizing signal propagation to close the performance gap in unnormalized resnets.ICLR, 2021a.
- Brock etal. (2021b)Andy Brock, Soham De, SamuelL Smith, and Karen Simonyan.High-performance large-scale image recognition without normalization.In ICML, 2021b.
- Cazenavette etal. (2022)George Cazenavette, Tongzhou Wang, Antonio Torralba, AlexeiA. Efros, and Jun-Yan Zhu.Dataset distillation by matching training trajectories.In CVPR, 2022.
- Chun etal. (2021)Sanghyuk Chun, SeongJoon Oh, RafaelSampaio DeRezende, Yannis Kalantidis, and Diane Larlus.Probabilistic embeddings for cross-modal retrieval.In CVPR, 2021.
- Cui etal. (2023)Justin Cui, Ruochen Wang, SiSi, and Cho-Jui Hsieh.Scaling up dataset distillation to imagenet-1k with constant memory.In ICML, 2023.
- Deng & Russakovsky (2022)Zhiwei Deng and Olga Russakovsky.Remember the past: Distilling datasets into addressable memories for neural networks.In NeurIPS, 2022.
- Devlin etal. (2018)Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018.
- Dosovitskiy etal. (2021)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, etal.An image is worth 16x16 words: Transformers for image recognition at scale.ICLR, 2021.
- Du etal. (2023)Jiawei Du, Yidi Jiang, Vincent T.F. Tan, JoeyTianyi Zhou, and Haizhou Li.Minimizing the accumulated trajectory error to improve dataset distillation.In CVPR, 2023.
- Farahani & Hekmatfar (2009)RezaZanjirani Farahani and Masoud Hekmatfar.Facility location: concepts, models, algorithms and case studies.Springer Science & Business Media, 2009.
- Gu etal. (2018)Jiuxiang Gu, Jianfei Cai, ShafiqR Joty, LiNiu, and Gang Wang.Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models.In CVPR, 2018.
- He etal. (2016)Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition.In CVPR, 2016.
- Hu etal. (2022)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.ICLR, 2022.
- Huang etal. (2018)Yan Huang, QiWu, Chunfeng Song, and Liang Wang.Learning semantic concepts and order for image and sentence matching.In CVPR, 2018.
- Jiang etal. (2023)Zixuan Jiang, Jiaqi Gu, Mingjie Liu, and DavidZ Pan.Delving into effective gradient matching for dataset condensation.COINS, 2023.
- Karpathy & Fei-Fei (2015)Andrej Karpathy and LiFei-Fei.Deep visual-semantic alignments for generating image descriptions.In CVPR, 2015.
- Krizhevsky etal. (2009)Alex Krizhevsky, Geoffrey Hinton, etal.Learning multiple layers of features from tiny images.2009.
- LeCun etal. (1998)Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.Gradient-based learning applied to document recognition.Proceedings of the IEEE, 1998.
- Lee etal. (2022)Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon.Dataset condensation with contrastive signals.In ICML, 2022.
- Li etal. (2022)Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.In ICML, 2022.
- Liang etal. (2022)PaulPu Liang, Amir Zadeh, and Louis-Philippe Morency.Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions.arXiv preprint arXiv:2209.03430, 2022.
- Lin etal. (2014)Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and CLawrence Zitnick.Microsoft coco: Common objects in context.In ECCV, 2014.
- Lin etal. (2022)Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan.Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models.CVPR, 2022.
- Liu etal. (2023)Haoyang Liu, Tiancheng Xing, Luwei Li, Vibhu Dalal, Jingrui He, and Haohan Wang.Dataset distillation via the wasserstein metric.arXiv preprint arXiv:2311.18531, 2023.
- Lorraine etal. (2020)Jonathan Lorraine, Paul Vicol, and David Duvenaud.Optimizing millions of hyperparameters by implicit differentiation.In AISTATS, 2020.
- Nguyen etal. (2020)Timothy Nguyen, Zhourong Chen, and Jaehoon Lee.Dataset meta-learning from kernel ridge-regression.In ICLR, 2020.
- Nguyen etal. (2021)Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee.Dataset distillation with infinitely wide convolutional networks.In NeurIPS, 2021.
- Oord etal. (2018)Aaron vanden Oord, Yazhe Li, and Oriol Vinyals.Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018.
- Pandey etal. (2022)Rohan Pandey, Rulin Shao, PaulPu Liang, Ruslan Salakhutdinov, and Louis-Philippe Morency.Cross-modal attention congruence regularization for vision-language relation alignment.ACL, 2022.
- Plummer etal. (2015)BryanA Plummer, Liwei Wang, ChrisM Cervantes, JuanC Caicedo, Julia Hockenmaier, and Svetlana Lazebnik.Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.In ICCV, 2015.
- Pokle etal. (2022)Ashwini Pokle, Jinjin Tian, Yuchen Li, and Andrej Risteski.Contrasting the landscape of contrastive and non-contrastive learning.2022.
- Radenovic etal. (2023)Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, YiWen, Vignesh Ramanathan, and Dhruv Mahajan.Filtering, distillation, and hard negatives for vision-language pre-training.In CVPR, 2023.
- Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In ICML, 2021.
- Sachdeva & McAuley (2023)Noveen Sachdeva and Julian McAuley.Data distillation: A survey.TMLR, 2023.
- Sener & Savarese (2018)Ozan Sener and Silvio Savarese.Active learning for convolutional neural networks: A core-set approach.ICLR, 2018.
- Sucholutsky & Schonlau (2021)Ilia Sucholutsky and Matthias Schonlau.Soft-label dataset distillation and text dataset distillation.In IJCNN, 2021.
- Toneva etal. (2019)Mariya Toneva, Alessandro Sordoni, Remi Tachetdes Combes, Adam Trischler, Yoshua Bengio, and GeoffreyJ Gordon.An empirical study of example forgetting during deep neural network learning.ICLR, 2019.
- Valverde etal. (2021)FranciscoRivera Valverde, JuanaValeria Hurtado, and Abhinav Valada.There is more than meets the eye: Self-supervised multi-object detection and tracking with sound by distilling multimodal knowledge.In CVPR, 2021.
- Vicol etal. (2022)Paul Vicol, JonathanP Lorraine, Fabian Pedregosa, David Duvenaud, and RogerB Grosse.On implicit bias in overparameterized bilevel optimization.In ICML, 2022.
- Wang etal. (2020a)Haohan Wang, Xindi Wu, Zeyi Huang, and EricP Xing.High-frequency component helps explain the generalization of convolutional neural networks.In CVPR, 2020a.
- Wang etal. (2020b)Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, and Lin Ma.Consensus-aware visual-semantic embedding for image-text matching.In ECCV, 2020b.
- Wang etal. (2022)Kai Wang, BoZhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You.CAFE: Learning to condense dataset by aligning features.In CVPR, 2022.
- Wang etal. (2018)Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and AlexeiA Efros.Dataset distillation.arXiv preprint arXiv:1811.10959, 2018.
- Welling (2009)Max Welling.Herding dynamical weights to learn.In ICML, 2009.
- Wolpert & Macready (1997)DavidH Wolpert and WilliamG Macready.No free lunch theorems for optimization.IEEE transactions on evolutionary computation, 1997.
- Wright & Ma (2022)John Wright and YiMa.High-dimensional data analysis with low-dimensional models: Principles, computation, and applications.Cambridge University Press, 2022.
- Wu etal. (2019)Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma.Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations.In CVPR, 2019.
- Wu etal. (2023)Xindi Wu, KwunFung Lau, Francesco Ferroni, Aljoša Ošep, and Deva Ramanan.Pix2map: Cross-modal retrieval for inferring street maps from images.In CVPR, 2023.
- Xu etal. (2022)Jing Xu, YuPan, Xinglin Pan, Steven Hoi, Zhang Yi, and Zenglin Xu.Regnet: self-regulated network for image classification.IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Xue etal. (2023)Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao.The modality focusing hypothesis: Towards understanding crossmodal knowledge distillation.In ICLR, 2023.
- Zhao & Bilen (2021a)BoZhao and Hakan Bilen.Dataset condensation with differentiable siamese augmentation.In ICML, 2021a.
- Zhao & Bilen (2021b)BoZhao and Hakan Bilen.Dataset condensation with gradient matching.In ICLR, 2021b.
- Zhao & Bilen (2023)BoZhao and Hakan Bilen.Dataset condensation with distribution matching.In WACV, 2023.
- Zhou etal. (2022)Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba.Dataset distillation using neural feature regression.In NeurIPS, 2022.
- Zhu etal. (2022)YeZhu, YuWu, Nicu Sebe, and Yan Yan.Vision+ x: A survey on multimodal learning in the light of data.arXiv preprint arXiv:2210.02884, 2022.
Appendix
In this Appendix, we first provide the full baseline comparison (with R@1/5/10) on Flickr30K and COCO (Sec. A). We then illustrate the challenges of vision-language distillation (Sec. B) by transitioning the trajectory-matching pipeline from image-only classification to image-text retrieval. We provide analysis of the distilled images (Sec. D) and of lossless distillation (Sec. C). We further extend the ablation study, analyzing components of our pipeline: distilled dataset initialization (Sec. E.1), encoder backbones (Sec. E.2), pretraining (Sec. E.3), and synthetic steps (Sec. E.4). Lastly, we show additional visualizations of the distilled samples, including those obtained with different backbones (Sec. G).
Appendix A Full Details for Distilled Performance
We provide full distillation results following Section 4.2, including image-to-text and text-to-image retrieval results R@5 and R@10 with NFNet in Tab.6.
Table 6: Full retrieval results (R@1/5/10) for coreset selection baselines (R: random, H: herding, K: K-center, F: forgetting) vs. our distillation (Dist) with NFNet.

| Dataset | #pairs | Metric | TR R | TR H | TR K | TR F | TR Dist (ours) | IR R | IR H | IR K | IR F | IR Dist (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | 100 | R@1 | 1.3 | 1.1 | 0.6 | 1.2 | 9.9 ± 0.3 | 1.0 | 0.7 | 0.7 | 0.7 | 4.7 ± 0.2 |
| Flickr30K | 100 | R@5 | 5.9 | 4.7 | 5.0 | 4.2 | 28.3 ± 0.5 | 4.0 | 2.8 | 3.1 | 2.4 | 15.7 ± 0.5 |
| Flickr30K | 100 | R@10 | 10.1 | 7.9 | 7.6 | 9.7 | 39.1 ± 0.7 | 6.5 | 5.3 | 6.1 | 5.6 | 24.6 ± 1.0 |
| Flickr30K | 200 | R@1 | 2.1 | 2.3 | 2.2 | 1.5 | 10.2 ± 0.8 | 1.1 | 1.5 | 1.5 | 1.2 | 4.6 ± 0.9 |
| Flickr30K | 200 | R@5 | 8.7 | 8.4 | 8.2 | 8.4 | 28.7 ± 1.0 | 4.8 | 5.5 | 5.4 | 3.1 | 16.0 ± 1.6 |
| Flickr30K | 200 | R@10 | 13.2 | 14.4 | 13.5 | 10.2 | 41.9 ± 1.9 | 9.2 | 9.3 | 9.9 | 8.4 | 25.5 ± 2.6 |
| Flickr30K | 500 | R@1 | 5.2 | 5.1 | 4.9 | 3.6 | 13.3 ± 0.6 | 2.4 | 3.0 | 3.5 | 1.8 | 6.6 ± 0.3 |
| Flickr30K | 500 | R@5 | 18.3 | 16.4 | 16.4 | 12.3 | 32.8 ± 1.8 | 10.5 | 10.0 | 10.4 | 9.0 | 20.2 ± 1.2 |
| Flickr30K | 500 | R@10 | 25.7 | 24.3 | 23.3 | 19.3 | 46.8 ± 0.8 | 17.4 | 17.0 | 17.3 | 15.9 | 30.0 ± 2.1 |
| Flickr30K | 1000 | R@1 | 5.2 | 5.0 | 5.6 | 3.1 | 13.3 ± 1.0 | 3.8 | 4.1 | 4.4 | 3.2 | 7.9 ± 0.8 |
| Flickr30K | 1000 | R@5 | 15.6 | 14.6 | 16.1 | 14.9 | 34.8 ± 1.9 | 11.8 | 12.1 | 12.8 | 9.5 | 24.1 ± 1.6 |
| Flickr30K | 1000 | R@10 | 21.4 | 20.4 | 20.8 | 18.9 | 45.9 ± 2.5 | 19.9 | 20.0 | 20.4 | 18.7 | 33.8 ± 2.0 |
| COCO | 100 | R@1 | 0.8 | 0.8 | 1.4 | 0.7 | 2.5 ± 0.3 | 0.3 | 0.5 | 0.4 | 0.3 | 1.3 ± 0.1 |
| COCO | 100 | R@5 | 3.0 | 2.1 | 3.7 | 2.6 | 10.0 ± 0.5 | 1.3 | 1.4 | 1.4 | 1.5 | 5.4 ± 0.3 |
| COCO | 100 | R@10 | 5.0 | 4.9 | 5.5 | 4.8 | 15.7 ± 0.4 | 2.7 | 3.5 | 2.5 | 2.5 | 9.5 ± 0.5 |
| COCO | 200 | R@1 | 1.0 | 1.0 | 1.2 | 1.1 | 3.3 ± 0.2 | 0.6 | 0.9 | 0.7 | 0.6 | 1.7 ± 0.1 |
| COCO | 200 | R@5 | 4.0 | 3.6 | 3.8 | 3.5 | 11.9 ± 0.6 | 2.3 | 2.4 | 2.1 | 2.8 | 6.5 ± 0.4 |
| COCO | 200 | R@10 | 7.2 | 7.7 | 7.5 | 7.0 | 19.4 ± 1.2 | 4.4 | 4.1 | 5.8 | 4.9 | 12.3 ± 0.8 |
| COCO | 500 | R@1 | 1.9 | 1.9 | 2.5 | 2.1 | 5.0 ± 0.4 | 1.1 | 1.7 | 1.1 | 0.8 | 2.5 ± 0.5 |
| COCO | 500 | R@5 | 7.5 | 7.8 | 8.7 | 8.2 | 17.2 ± 1.3 | 5.0 | 5.3 | 6.3 | 5.8 | 8.9 ± 0.7 |
| COCO | 500 | R@10 | 12.5 | 13.7 | 14.3 | 13.0 | 26.0 ± 1.9 | 8.7 | 9.9 | 10.5 | 8.2 | 15.8 ± 1.5 |
| COCO | 1000 | R@1 | 1.9 | 2.4 | 2.4 | 1.9 | 6.8 ± 0.4 | 1.5 | 1.3 | 1.5 | 0.7 | 3.3 ± 0.1 |
| COCO | 1000 | R@5 | 7.6 | 9.0 | 9.0 | 7.7 | 21.9 ± 1.2 | 5.6 | 5.7 | 7.1 | 4.6 | 11.9 ± 0.5 |
| COCO | 1000 | R@10 | 12.7 | 14.0 | 14.1 | 13.0 | 31.0 ± 1.5 | 9.6 | 10.1 | 10.9 | 8.0 | 22.1 ± 0.9 |
Appendix B CIFAR10 Classification vs Retrieval Distillation
Prior work has shown remarkable distillation results on CIFAR10(Krizhevsky etal., 2009) classification. To move from distilling image-only datasets to vision-language datasets, we first check if our method has potential in simple settings. Concretely, we convert CIFAR10 labels to captions that pair with their corresponding images. Under this formulation, the objective of classification is equivalent to that of image-to-text retrieval (TR): finding the best text given an image.
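A sketch of the label-to-caption conversion used for this check; the multi-caption templates below are illustrative stand-ins for the prompts of (Radford et al., 2021):

```python
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

def single_caption(label: int) -> str:
    # Single-caption retrieval: one fixed template per image.
    return f"This is a {CIFAR10_CLASSES[label]}"

def multi_captions(label: int) -> list:
    # Multi-caption retrieval: several prompt templates per image
    # (illustrative templates, not the exact prompt set).
    templates = ["a photo of a {}.", "a blurry photo of a {}.",
                 "a low resolution photo of a {}.", "a photo of the small {}.",
                 "a photo of the large {}."]
    return [t.format(CIFAR10_CLASSES[label]) for t in templates]
```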
In Tab. 7, we compare CIFAR10 distillation performance for distilled dataset sizes of 1, 10, and 50 images per class (IPC), under three settings: classification, single-caption retrieval, and multi-caption retrieval. For classification, we report results from MTT (Cazenavette et al., 2022), where an image-only dataset is distilled using expert trajectories trained on image-label pairs. In single-caption TR, we distill image-caption pairs using expert trajectories trained with each image paired with the single caption "This is a {label}". In multi-caption TR, we distill image-caption pairs with expert trajectories trained with each image paired with five captions generated from varied prompts from (Radford et al., 2021). For consistency, all image trajectories are obtained with the 3-layer ConvNet backbone specified in (Cazenavette et al., 2022), and text trajectories come from linear projection layers over pretrained BERT (Devlin et al., 2018) embeddings. Although the performance of vision-language distillation trails that of image-only distillation, the gap closes at larger IPCs. The remaining gap highlights the challenge of the continuous label space in vision-language datasets. Moreover, the performance gap between single- and multi-caption retrieval demonstrates the difficulty of capturing the variability of human language descriptions.
Table 7: CIFAR10 distillation performance (accuracy / R@1).

| IPC | Classification | TR (Single Caption) | TR (Multi Caption) |
|---|---|---|---|
| 1 | 46.3 ± 0.8 | 27.4 ± 1.0 | 22.3 ± 1.0 |
| 10 | 65.3 ± 0.7 | 35.9 ± 0.7 | 33.2 ± 0.5 |
| 50 | 71.6 ± 0.2 | 66.8 ± 1.1 | 62.0 ± 0.8 |
| Full | 84.8 ± 0.1 | 79.6 ± 0.6 | 80.3 ± 0.4 |
Appendix C Upper Bound Performance
We further increase the distilled set size to 10% of the original Flickr30K dataset and compare the distillation performance with the upper-bound results (Tab. 8). The distillation performance closely approaches the upper bound.
Table 8: Distillation at 10% of the original dataset size vs. upper bound.

| Result Type | Vision Backbone | Language Backbone | Ratio | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|---|---|
| Distillation | NFNet | BERT | 10% | 32.1 | 60.0 | 73.2 | 24.1 | 53.9 | 66.5 |
| Upper Bound | NFNet | BERT | 100% | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 |
| Distillation | NFNet | CLIP | 10% | 60.0 | 86.3 | 91.4 | 47.4 | 78.2 | 86.5 |
| Upper Bound | NFNet | CLIP | 100% | 61.2 | 87.5 | 92.8 | 49.8 | 79.8 | 88.3 |
Appendix D Analysis on Distilled Images
We have found that increasing the learning rate and the distillation time leads to more noticeable changes in the images of the distilled dataset (distilled images: Fig. 4, original images: Fig. 5). However, a higher learning rate or longer distillation time does not necessarily translate to improved performance of the distilled dataset, even if the images appear to deviate more drastically from a human-perception perspective. Changes in image pixels alone are not a reliable predictor of distillation performance; they are rather a measure of the distillation strength. More distorted images suggest uneven pixel updates, while even updates yield results similar to the visualization provided earlier in Fig. 3.
In line with previous studies, we initially expected more obvious changes in the images to lead to better performance, but our findings suggest a different behavior of vision-language distillation within the trajectory matching framework, reflecting how models capture vision-language interactions. From a human-perception perspective, the distilled images appear to change less than in previous classification works, yet those small updates are still meaningful and contain useful information, as opposed to artifacts like noisy patterns. Our algorithm achieves a clear and consistent improvement over random baselines, as indicated by the results. We hope this discussion can inspire more research on vision-language dataset distillation.
Appendix E Additional Ablation Studies
In this section, we provide additional ablation studies. Unless specified, these distillation experiments are conducted on the Flickr30K dataset to distill 100 image-text pairs, and we use pretrained NFNet and BERT as backbones, with synthetic step set to 8 during distillation.
E.1 Distilled Dataset Initialization
In the main paper, we provided experiments with real-sample initialization. Here we evaluate initialization with Gaussian noise. Our findings in Tab. 9 show that initializing images from a Gaussian distribution results in significantly lower performance. The complexity of images, which encode rich information about colors, shapes, textures, and spatial relationships between objects, can make it difficult for models to learn effectively from randomly initialized images. On the other hand, using real text sampled from the training set vs. randomly initialized text embeddings does not make a significant difference. We conjecture that the pretrained language model is good at transforming 'noise' text embeddings into meaningful sentences during the learning process, partly due to the inherent structure and predictability of language. We provide visualizations of the real-image plus 'noise'-text combination in Fig. 6 and Fig. 7 and Tab. E.1. To our surprise, even though the initialized 'noise' texts are not semantically related to the initialized real images, we discovered a substantial degree of semantic similarity between the initialized real images and the learned distilled text. This suggests potential future applications of our method in Visual Question Answering (VQA).
Table 9: Distilled dataset initialization (✓ indicates initialization from real samples).

| Real Image | Real Text | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| ✓ |  | 9.0 | 27.2 | 40.1 | 3.9 | 13.2 | 20.6 |
|  | ✓ | 0.2 | 0.7 | 1.1 | 0.1 | 0.5 | 1.0 |
|  |  | 0.1 | 0.3 | 0.4 | 0.1 | 0.4 | 0.8 |
E.2 Encoder Backbone Selection
In this section, we evaluate the impact of different language/vision backbones on the distillation performance.
E.2.1 Language Backbones
Perhaps not surprisingly, CLIP(Radford etal., 2021) text encoder significantly outperforms BERT in all evaluation metrics, with a striking peak performance in TR R@10 at 92.8% for expert training. This exceptional performance can be mainly attributed to the fact that the pre-trained, off-the-shelf CLIP model is designed to learn a shared embedding space across multi-modalities. Although CLIP also shows a performance drop during distillation, it still retains a relatively high performance recovery ratio. In Sec.G we provide visualization of synthetic data distilled via NFNet and CLIP.
| Language Model | Expert TR R@1 | R@5 | R@10 | Expert IR R@1 | R@5 | R@10 | Distill TR R@1 | R@5 | R@10 | Distill IR R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| CLIP | 61.2 | 87.5 | 92.8 | 49.8 | 79.8 | 88.3 | 31.4 | 58.8 | 72.0 | 17.1 | 41.9 | 56.2 |
E.2.2 Vision Backbones
The vision encoders carry the main gradient flows for the distillation process. We experimented on several vision backbones, and found that the architecture choice strongly influences the distillation quality. Similar to dataset distillation by gradient matching(Zhao & Bilen, 2021b), batch normalization has an impact on the gradient/parameter matching framework. This is mainly because batch normalization incorporates a non-parametric component that can only be accumulated with batches and can not be trained.
| Vision Model | Expert TR R@1 | R@5 | R@10 | Expert IR R@1 | R@5 | R@10 | Distill TR R@1 | R@5 | R@10 | Distill IR R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT (LoRA) | 40.7 | 69.8 | 80.1 | 28.8 | 59.3 | 73.4 | 10.4 | 23.6 | 38.7 | 5.4 | 18.8 | 27.4 |
| NFNet-l0 | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| NFResNet50 | 28.9 | 56.6 | 71.0 | 22.8 | 50.1 | 63.4 | 6.5 | 18.2 | 28.1 | 3.5 | 11.6 | 18.7 |
| NFRegNet | 26.9 | 57.2 | 70.2 | 21.1 | 50.1 | 62.9 | 7.8 | 21.9 | 33.3 | 3.3 | 12.7 | 20.5 |
| ResNet50 | 18.0 | 43.5 | 59.5 | 13.4 | 36.6 | 49.9 | 0.5 | 2.4 | 3.8 | 0.3 | 1.6 | 3.6 |
E.3 Pretrained vs. Non-pretrained
Tab.12 demonstrates the pretraining influence of the backbone encoders. Optimal performance is observed when both language and vision backbones are pretrained. This emphasizes the importance of pretraining before the expert training stage for large models and datasets.
Table 12: Effect of pretraining the backbone encoders (expert training).

| Language Backbone Pretrained | Vision Backbone Pretrained | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 |
| ✓ |  | 4.4 | 14.1 | 20.7 | 3.5 | 11.4 | 18.8 |
|  | ✓ | 0.5 | 1.1 | 1.8 | 0.3 | 0.7 | 1.4 |
|  |  | 0.3 | 1.0 | 1.5 | 0.1 | 0.7 | 1.3 |
E.4 Synthetic Steps
The synthetic step size plays an important role in optimizing the dataset distillation performance, as shown in Tab.13. Using larger synthetic steps tends to achieve better distillation performance.
Table 13: Effect of the number of synthetic steps during distillation.

| #Pairs | #Syn Steps | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|
| 100 | 1 | 0.5 | 2.1 | 4.4 | 0.3 | 1.5 | 2.8 |
| 100 | 2 | 7.1 | 23.4 | 32.9 | 3.0 | 10.2 | 16.4 |
| 100 | 4 | 8.2 | 24.9 | 35.2 | 3.5 | 12.2 | 20.7 |
| 100 | 8 | 9.9 | 28.3 | 39.1 | 4.7 | 15.7 | 24.6 |
| 200 | 1 | 3.2 | 9.3 | 14.1 | 1.6 | 5.2 | 8.8 |
| 200 | 2 | 6.5 | 19.2 | 29.1 | 1.6 | 5.9 | 10.0 |
| 200 | 4 | 8.2 | 24.5 | 34.4 | 2.2 | 7.4 | 11.8 |
| 200 | 8 | 10.2 | 28.7 | 41.9 | 4.6 | 16.0 | 25.5 |
| 500 | 1 | 6.6 | 18.1 | 25.5 | 2.1 | 10.1 | 16.3 |
| 500 | 2 | 8.0 | 21.7 | 31.3 | 3.8 | 14.9 | 23.2 |
| 500 | 4 | 8.1 | 23.6 | 34.9 | 4.4 | 15.2 | 23.7 |
| 500 | 8 | 13.3 | 32.8 | 46.8 | 6.6 | 20.2 | 30.0 |
| 1000 | 1 | 7.3 | 20.6 | 29.7 | 3.9 | 13.2 | 20.7 |
| 1000 | 2 | 8.8 | 26.8 | 36.6 | 5.7 | 17.4 | 26.4 |
| 1000 | 4 | 10.4 | 29.1 | 37.9 | 6.6 | 19.5 | 29.5 |
| 1000 | 8 | 13.3 | 34.8 | 45.7 | 9.1 | 24.1 | 33.8 |
Appendix F Beyond Trajectory Matching
In this section, we further provide experimental results for a distribution matching (Zhao & Bilen, 2023) baseline adapted to the vision-language setting. Concretely, to use distribution matching for vision-language dataset distillation, we minimize the maximum mean discrepancy (MMD) between the real and distilled data distributions, sampling NFNet image encoders with different initializations together with a pretrained BERT text encoder. As in the distribution matching setting for image classification, we update the distilled data via the MMD for the vision and language modalities so as to match the original data distribution in a family of embedding spaces. We compare our method w/ DM (distribution matching) and our method w/ TM (trajectory matching) on Flickr30K (R@1) in Tab. 14.
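A simplified sketch of the adapted distribution-matching objective, reduced here to matching mean embeddings per modality (a special case of MMD with a linear kernel); sampling of the randomly initialized encoders is assumed to happen outside this function:

```python
import torch

def mmd_matching_loss(real_img_emb: torch.Tensor, syn_img_emb: torch.Tensor,
                      real_txt_emb: torch.Tensor, syn_txt_emb: torch.Tensor) -> torch.Tensor:
    """Squared distance between the mean embeddings of real and distilled data,
    summed over the vision and language modalities; minimizing it w.r.t. the
    distilled data pulls its embedding distribution toward the real one."""
    img_term = (real_img_emb.mean(dim=0) - syn_img_emb.mean(dim=0)).pow(2).sum()
    txt_term = (real_txt_emb.mean(dim=0) - syn_txt_emb.mean(dim=0)).pow(2).sum()
    return img_term + txt_term
```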
Table 14: Our method with distribution matching (DM) vs. trajectory matching (TM) on Flickr30K (R@1).

| # pairs | TR: Ours w/ DM | TR: Ours w/ TM | IR: Ours w/ DM | IR: Ours w/ TM |
|---|---|---|---|---|
| 100 | 3.2 ± 1.8 | 9.9 ± 0.3 | 1.4 ± 0.7 | 4.7 ± 0.2 |
| 200 | 3.3 ± 1.3 | 10.2 ± 0.8 | 1.4 ± 0.4 | 4.6 ± 0.9 |
| 500 | 5.8 ± 1.5 | 13.3 ± 0.6 | 4.1 ± 0.9 | 6.6 ± 0.3 |
| 1000 | 6.1 ± 2.7 | 13.3 ± 1.0 | 4.9 ± 1.8 | 7.9 ± 0.8 |
Looking forward, we hope our method could serve as a roadmap for future studies exploring more complex settings with new state-of-the-art (SOTA) methods. New SOTA dataset distillation methods can adopt low-rank adaptation matching to scale efficiently with large and complex models, and can incorporate bi-trajectory co-distillation to handle textual data more effectively. By doing so, these methods can extend their applicability to previously infeasible models for distillation, such as those involving ViTs, thus improving the scalability and efficiency of the distillation process. New approaches that distill from both text and image data can consider using methods similar to bi-trajectory matching with contrastive loss to learn the interactions and redundancies across multimodalities.
Appendix G Additional Visualizations
Here we include a number of visualizations of the data we distilled from the multimodal dataset (both Flickr30K Tab.G and Fig.8, 9 and COCO Tab.G and Fig.10,11) for a more intuitive understanding of the distilled set. We provide 50 distilled image-text paired examples including their visualization before the distillation process. Unless otherwise stated, these experiments are conducted using 100 distilled pairs, with pretrained NFNet(Brock etal., 2021b) and BERT(Devlin etal., 2018) as backbones and the synthetic step is set to 8 during distillation. We provide visualization of distilled data using NFNet and CLIP in Tab.G and Fig.12,13 in the end.