Daeil Kim


2024

Target-Aware Language Modeling via Granular Data Sampling
Ernie Chang | Pin-Jie Lin | Yang Li | Changsheng Zhao | Daeil Kim | Rastislav Rabatin | Zechun Liu | Yangyang Shi | Vikas Chandra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance elsewhere. A cost-effective and straightforward approach is sampling with low-dimensional data features, which allows large-scale pretraining data to be selected for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capability. We observe that the sampled data correlates highly with target downstream task performance *while preserving its effectiveness on other tasks*. This leads to the proposed data sampling paradigm, in which language models can be pretrained more efficiently on the selected documents. On eight benchmarks, we demonstrate that with ~1% of the data, pretrained models perform on par with those trained on the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
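
The sketch below illustrates the general idea the abstract describes: scoring raw documents by importance weights computed from hashed n-gram features (unigrams and bigrams here) of a target set versus the raw corpus, then keeping the top-scoring fraction. It is a minimal illustration only; the function names, the hashing scheme, the bucket count, and the log-likelihood-ratio scoring are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of n-gram importance sampling for pretraining data selection.
# All names (hash_ngram_features, select_top_documents) and constants are
# illustrative assumptions, not the method described in the paper.
import re
import hashlib
import numpy as np

NUM_BUCKETS = 10_000  # hashed feature dimension (assumption)

def hash_ngram_features(text, max_n=2, num_buckets=NUM_BUCKETS):
    """Count hashed n-grams (n = 1..max_n) over lowercased word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    counts = np.zeros(num_buckets)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            bucket = int(hashlib.md5(ngram.encode()).hexdigest(), 16) % num_buckets
            counts[bucket] += 1
    return counts

def importance_weights(raw_docs, target_docs, max_n=2):
    """Score each raw document by the log-ratio of target vs. raw n-gram distributions."""
    target_counts = sum(hash_ngram_features(d, max_n) for d in target_docs) + 1.0
    raw_counts = sum(hash_ngram_features(d, max_n) for d in raw_docs) + 1.0
    log_ratio = np.log(target_counts / target_counts.sum()) - np.log(raw_counts / raw_counts.sum())
    return np.array([hash_ngram_features(d, max_n) @ log_ratio for d in raw_docs])

def select_top_documents(raw_docs, target_docs, fraction=0.01):
    """Keep the top `fraction` of raw documents by importance weight."""
    weights = importance_weights(raw_docs, target_docs)
    k = max(1, int(len(raw_docs) * fraction))
    keep = np.argsort(-weights)[:k]
    return [raw_docs[i] for i in keep]
```

As a usage note, `select_top_documents(web_corpus, task_examples, fraction=0.01)` would return roughly 1% of the corpus whose n-gram profile most resembles the target task data, mirroring the ~1% selection ratio reported in the abstract.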