Satoshi Kosugi


2024

pdf bib
DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation
Aru Maekawa | Satoshi Kosugi | Kotaro Funakoshi | Manabu Okumura
Findings of the Association for Computational Linguistics: NAACL 2024

Dataset distillation aims to compress a training dataset by creating a small number of informative synthetic samples such that neural networks trained on them perform as well as those trained on the original training dataset. Current text dataset distillation methods create each synthetic sample as a sequence of word embeddings instead of a text to apply gradient-based optimization; however, such embedding-level distilled datasets cannot be used for training other models whose word embedding weights are different from the model used for distillation. To address this issue, we propose a novel text dataset distillation approach, called Distilling dataset into Language Model (DiLM), which trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples. We evaluated DiLM on various text classification datasets and showed that distilled synthetic datasets from DiLM outperform those from current coreset selection methods. DiLM achieved remarkable generalization performance in training different types of models and in-context learning of large language models. Our code will be available at https://github.com/arumaekawa/DiLM.

pdf bib
Active Learning for Abstractive Text Summarization via LLM-Determined Curriculum and Certainty Gain Maximization
Dongyuan Li | Ying Zhang | Zhen Wang | Shiyin Tan | Satoshi Kosugi | Manabu Okumura
Findings of the Association for Computational Linguistics: EMNLP 2024

For abstractive text summarization, laborious data annotation and time-consuming model training become two high walls, hindering its further progress. Active Learning, selecting a few informative instances for annotation and model training, sheds light on solving these issues. However, only few active learning-based studies focus on abstractive text summarization and suffer from low stability, effectiveness, and efficiency. To solve the problems, we propose a novel LLM-determined curriculum active learning framework. Firstly, we design a prompt to ask large language models to rate the difficulty of instances, which guides the model to train on from easier to harder instances. Secondly, we design a novel active learning strategy, i.e., Certainty Gain Maximization, enabling to select instances whose distribution aligns well with the overall distribution. Experiments show our method can improve stability, effectiveness, and efficiency of abstractive text summarization backbones.