Rastislav Rabatin
2024
Target-Aware Language Modeling via Granular Data Sampling
Ernie Chang | Pin-Jie Lin | Yang Li | Changsheng Zhao | Daeil Kim | Rastislav Rabatin | Zechun Liu | Yangyang Shi | Vikas Chandra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which allows selecting large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance while preserving its effectiveness on other tasks. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks we demonstrate that, with ~1% of the data, pretrained models perform on par with models trained on the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
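To make the sampling idea concrete, below is a minimal Python sketch of importance scoring with hashed n-gram features: it fits simple bucket distributions over a small target set and the raw corpus, then ranks raw documents by their log-probability ratio and keeps the top ~1%. The feature dimensionality, n-gram ranges, smoothing, and all function names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of importance sampling with hashed n-gram features for
# target-aware data selection. This is NOT the authors' exact method; the
# bucket count, n-gram ranges, and smoothing are illustrative assumptions.
import hashlib
import math
from collections import Counter
from typing import Iterable, List

NUM_BUCKETS = 10_000  # hypothetical hashed-feature dimensionality


def ngram_buckets(text: str, n_values=(1, 2)) -> List[int]:
    """Map word n-grams of a document to hashed feature buckets."""
    tokens = text.lower().split()
    buckets = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
            buckets.append(h % NUM_BUCKETS)
    return buckets


def bucket_log_probs(docs: Iterable[str], alpha: float = 1.0) -> List[float]:
    """Estimate a smoothed distribution over hashed n-gram buckets."""
    counts = Counter()
    for doc in docs:
        counts.update(ngram_buckets(doc))
    total = sum(counts.values()) + alpha * NUM_BUCKETS
    return [math.log((counts[b] + alpha) / total) for b in range(NUM_BUCKETS)]


def importance_score(doc: str, log_p_target: List[float], log_p_raw: List[float]) -> float:
    """Per-feature log importance weight: log p_target(doc) - log p_raw(doc)."""
    buckets = ngram_buckets(doc)
    return sum(log_p_target[b] - log_p_raw[b] for b in buckets) / max(len(buckets), 1)


# Usage: rank raw pretraining documents by similarity to the target domain
# and keep roughly the top 1% for pretraining.
target_docs = ["example target-domain sentence ..."]
raw_docs = ["example web document ...", "another crawled page ..."]
log_p_t = bucket_log_probs(target_docs)
log_p_r = bucket_log_probs(raw_docs)
ranked = sorted(raw_docs, key=lambda d: importance_score(d, log_p_t, log_p_r), reverse=True)
selected = ranked[: max(1, len(ranked) // 100)]
```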
Scaling Parameter-Constrained Language Models with Quality Data
Ernie Chang | Matteo Paltenghi | Yang Li | Pin-Jie Lin | Changsheng Zhao | Patrick Huber | Zechun Liu | Rastislav Rabatin | Yangyang Shi | Vikas Chandra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling laws by offering a microscopic view of data quality within the original formulation, namely effective training tokens, which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens as a combination of two readily computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and accuracy scores on eight reasoning tasks. We demonstrate that the estimated constants yield a +0.83 Pearson correlation with true accuracies, and analyze them in scenarios involving widely used data techniques such as data sampling and synthesis, which aim to improve data quality.
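As a rough illustration of the idea, the sketch below plugs a quality-scaled "effective token" count into a Chinchilla-style accuracy law. The multiplicative form and every constant are hypothetical placeholders for exposition, not the paper's fitted formulation.

```python
# Minimal sketch of a quality-adjusted scaling law. The multiplicative
# "effective training tokens" form and all constants below are illustrative
# assumptions, not the paper's estimated values.

# Hypothetical fitted constants (the paper estimates such constants from
# 200+ pretrained models ranging from 25M to 1.5B parameters).
A, B = 0.3, 0.2                  # exponents weighting diversity and syntheticity
ALPHA, BETA, C = 2.1, 1.4, 0.9   # accuracy power-law coefficients


def effective_tokens(tokens: float, diversity: float, syntheticity: float) -> float:
    """Raw training tokens scaled by two readily computed text-quality indicators."""
    return tokens * (diversity ** A) * (syntheticity ** B)


def predicted_accuracy(params: float, tokens: float,
                       diversity: float, syntheticity: float) -> float:
    """Chinchilla-style form: accuracy saturates as model size and effective tokens grow."""
    d_eff = effective_tokens(tokens, diversity, syntheticity)
    return C - ALPHA * params ** -0.25 - BETA * d_eff ** -0.25


# Example: compare two 125M-parameter runs trained on the same token budget
# but with different measured data quality.
low_q = predicted_accuracy(params=125e6, tokens=2e9, diversity=0.4, syntheticity=0.5)
high_q = predicted_accuracy(params=125e6, tokens=2e9, diversity=0.9, syntheticity=0.8)
print(f"predicted accuracy (lower-quality data):  {low_q:.3f}")
print(f"predicted accuracy (higher-quality data): {high_q:.3f}")
```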