Andrey Kan

2024

Textual Dataset Distillation via Language Model Embedding
Yefan Tao | Luyang Kong | Andrey Kan | Laurent Callot
Findings of the Association for Computational Linguistics: EMNLP 2024

Dataset distillation is a process aimed at condensing datasets while preserving essential characteristics. In the text domain, prevailing methods typically generate distilled data as embedding vectors, which are not human-readable. This approach simplifies optimization but limits the transferability of distilled data across different model architectures. To address this limitation, we introduce a model-agnostic, data-efficient method that leverages Language Model (LM) embeddings. Compared to parameter-efficient methods such as LORA, our approach achieves comparable performance with significantly faster processing times. We evaluate our methodology through classification tasks on datasets like IMDB and AG-News, demonstrating performance that is on par with or exceeds previous model-dependent techniques. By utilizing LM embeddings, our method offers enhanced flexibility and improved transferability, expanding the range of potential applications.

Co-authors

Venues

Findings1

Fix author