Sequence Reducible Holdout Loss for Language Model Pretraining

Raghuveer Thirukovalluru, Nicholas Monath, Bhuwan Dhingra, Sam Wiseman


Abstract
Data selection techniques, which adaptively select datapoints inside the training loop, have demonstrated empirical benefits in reducing the number of gradient steps needed to train neural models. However, these techniques have so far largely been applied to classification. In this work, we study their applicability to language model pretraining, a highly time-intensive task. We propose a simple modification to an existing data selection technique (reducible hold-out loss training) to adapt it to the sequence losses typical in language modeling. We experiment on both autoregressive and masked language modeling, and show that applying data selection to pretraining offers notable benefits: a 4.3% reduction in the total number of steps and a 21.5% average reduction in steps to reach an intermediate target perplexity over the course of pretraining an autoregressive language model. Further, language models trained with data selection generalize significantly better to out-of-domain datasets, with a 7.9% reduction in the total number of steps and a 23.2% average reduction in steps to an intermediate target perplexity.
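
For readers skimming the abstract, the following is a minimal, illustrative sketch of the general reducible-holdout-loss selection idea applied at the sequence level, not the paper's exact method: each candidate sequence is scored by its current training loss minus its loss under a small model trained on held-out data, and only the highest-scoring sequences are kept for the gradient step. Function and variable names (per_sequence_lm_loss, holdout_model, select_fraction) are illustrative assumptions, and the code assumes a HuggingFace-style causal language model interface.

# Illustrative sketch (assumptions noted above), not the paper's exact method.
import torch
import torch.nn.functional as F

def per_sequence_lm_loss(model, input_ids, attention_mask):
    """Mean token-level cross-entropy for each sequence in the batch."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so that position t predicts token t+1 (autoregressive LM).
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attention_mask[:, 1:].float()
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.size())
    return (token_loss * shift_mask).sum(dim=1) / shift_mask.sum(dim=1).clamp(min=1)

def select_by_reducible_loss(model, holdout_model, batch, select_fraction=0.5):
    """Return indices of the sequences with the largest reducible loss."""
    with torch.no_grad():
        train_loss = per_sequence_lm_loss(model, batch["input_ids"], batch["attention_mask"])
        holdout_loss = per_sequence_lm_loss(holdout_model, batch["input_ids"], batch["attention_mask"])
    # Reducible loss: high when the sequence is learnable (low holdout loss)
    # but not yet learned by the current model (high training loss).
    reducible = train_loss - holdout_loss
    k = max(1, int(select_fraction * reducible.size(0)))
    return torch.topk(reducible, k).indices

In a training loop, the selected indices would be used to subsample the batch before computing the gradient update, so that steps are spent on sequences with the highest reducible loss.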
Anthology ID:
2024.lrec-main.1281
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
14705–14716
URL:
https://aclanthology.org/2024.lrec-main.1281
Cite (ACL):
Raghuveer Thirukovalluru, Nicholas Monath, Bhuwan Dhingra, and Sam Wiseman. 2024. Sequence Reducible Holdout Loss for Language Model Pretraining. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14705–14716, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Sequence Reducible Holdout Loss for Language Model Pretraining (Thirukovalluru et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1281.pdf