BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data

Jean-Loup Tastet, Inar Timiryasov
Abstract
We present BabyLlama-2, a 345-million-parameter model distillation-pretrained from two teachers on a 10-million-word corpus for the BabyLM competition. On the BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10- and 100-million-word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantages of distillation cannot be attributed to suboptimal hyperparameter selection for the teachers. Our findings underscore the need for further investigation into distillation techniques, particularly in data-limited settings.
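
For orientation, the sketch below illustrates the kind of ensemble-distillation objective the abstract refers to: the student is trained on a weighted sum of the usual hard-label cross-entropy and a temperature-scaled KL divergence to the averaged teacher distribution. This is a minimal illustration, not the paper's implementation; the loss weighting `alpha`, the `temperature`, and the toy tensor shapes are assumptions chosen for readability.

```python
# Minimal sketch of an ensemble-distillation loss (illustrative, not the
# paper's exact training code): weighted sum of hard-label cross-entropy
# and a temperature-scaled KL term against the averaged teacher distribution.
import torch
import torch.nn.functional as F


def ensemble_distillation_loss(student_logits, teacher_logits_list, labels,
                               alpha=0.5, temperature=2.0):
    """student_logits: (batch, seq, vocab); teacher_logits_list: list of same-shaped tensors."""
    # Hard-label cross-entropy (standard next-token prediction loss).
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())

    # Average the teachers' temperature-softened output distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student's softened distribution and the ensemble,
    # rescaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl


# Toy shapes only, to show the call signature.
batch, seq, vocab = 2, 8, 128
student = torch.randn(batch, seq, vocab, requires_grad=True)
teachers = [torch.randn(batch, seq, vocab) for _ in range(2)]
labels = torch.randint(0, vocab, (batch, seq))
loss = ensemble_distillation_loss(student, teachers, labels)
loss.backward()
```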
Anthology ID: 2024.conll-babylm.26
Volume: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month: November
Year: 2024
Address: Miami, FL, USA
Editors: Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues: CoNLL | BabyLM | WS
Publisher: Association for Computational Linguistics
Pages: 292–301
URL: https://aclanthology.org/2024.conll-babylm.26/
Cite (ACL): Jean-Loup Tastet and Inar Timiryasov. 2024. BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 292–301, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal): BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data (Tastet & Timiryasov, CoNLL-BabyLM 2024)
PDF: https://aclanthology.org/2024.conll-babylm.26.pdf