BabyLM Challenge: Experimenting with Self-Distillation and Reverse-Distillation for Language Model Pre-Training on Constrained Datasets

Aakarsh Nair, Alina Hancharova, Mayank Kumar, Ali Gharaee


Abstract
Language models (LMs) exhibit significant data inefficiency compared to human learners. A child is able to master language while consuming less than 100 million words of input, while language models require orders of magnitude more tokens during training. Our submission to the BabyLM Challenge utilizes a combination of self-distillation and reverse-distillation to train a sequence of ensemble models with improved training characteristics on a fixed-size 10 million-word dataset. Self-distillation is used to generate an ensemble of models of a certain fixed size, while reverse distillation is used to train a more expressive larger model from a previously trained generation of relatively smaller models, while largely preserving learned accuracy.We find that ensembles consisting of two smaller models and one identical born-again model serve as ideal ensembles for each trained generation of model size. We demonstrate that, although our method is not novel, it provides consistent and modest performance improvements on the BLiMP and GLUE benchmarks.
Anthology ID:
2024.conll-babylm.3
Volume:
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues:
CoNLL | BabyLM | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
28–36
Language:
URL:
https://aclanthology.org/2024.conll-babylm.3/
DOI:
Bibkey:
Cite (ACL):
Aakarsh Nair, Alina Hancharova, Mayank Kumar, and Ali Gharaee. 2024. BabyLM Challenge: Experimenting with Self-Distillation and Reverse-Distillation for Language Model Pre-Training on Constrained Datasets. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 28–36, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
BabyLM Challenge: Experimenting with Self-Distillation and Reverse-Distillation for Language Model Pre-Training on Constrained Datasets (Nair et al., CoNLL-BabyLM 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.conll-babylm.3.pdf