Teaching Tiny Minds: Exploring Methods to Enhance Knowledge Distillation for Small Language Models

Hong Meng Yam, Nathan Paek


Abstract
In this paper, we build on the success of the previous BabyLM challenge winner's model, BabyLlama, to explore methods of enhancing knowledge distillation for small language models. Our main focus is on investigating how small a language model can be while still maintaining competitive performance. We experiment with three main approaches: (1) DistilledGPT-44M, which uses smaller teacher models and a more compact student model than BabyLlama; (2) ContrastiveLlama-58M, which incorporates a contrastive loss into the knowledge distillation process; and (3) MaskedAdversarialLlama-58M, which incorporates an adversarial loss into the knowledge distillation process. Using the 10M-word dataset from the BabyLM challenge's strict-small track, we evaluate our models on the BLiMP, EWoK, and GLUE benchmarks. Our results show that effective knowledge distillation can still be achieved with significantly smaller teacher and student models. In particular, DistilledGPT-44M outperforms one of last year's winning entries, LTG-BERT, and matches the performance of the other winning entry, BabyLlama, while cutting training time by around 70% and parameter count by around 25%.
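To make the distillation setup concrete, the sketch below shows the standard Hinton-style objective that BabyLlama-style distillation builds on: the student's hard-label cross-entropy blended with a temperature-softened KL divergence toward the teacher's logits. The function name, hyperparameters, and PyTorch implementation are illustrative assumptions, not code from the paper; the contrastive and adversarial variants described in the abstract add further terms on top of this base loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hypothetical sketch of a standard KD objective.

    Blends hard-label cross-entropy with the KL divergence between the
    student's and teacher's temperature-softened token distributions.
    T and alpha are illustrative hyperparameters, not values from the paper.
    """
    vocab_size = student_logits.size(-1)

    # Hard-label language-modeling loss on the gold next tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )

    # Soft-label loss: match the teacher's softened distribution.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)

    return alpha * ce + (1.0 - alpha) * kd
```

With multiple teachers, as in the BabyLlama recipe, the soft-label term is typically averaged over the teachers' distributions before being combined with the hard-label loss.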
Anthology ID:
2024.conll-babylm.27
Volume:
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues:
CoNLL | BabyLM | WS
Publisher:
Association for Computational Linguistics
Pages:
302–307
URL:
https://aclanthology.org/2024.conll-babylm.27/
Cite (ACL):
Hong Meng Yam and Nathan Paek. 2024. Teaching Tiny Minds: Exploring Methods to Enhance Knowledge Distillation for Small Language Models. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 302–307, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
Teaching Tiny Minds: Exploring Methods to Enhance Knowledge Distillation for Small Language Models (Yam & Paek, CoNLL-BabyLM 2024)
PDF:
https://aclanthology.org/2024.conll-babylm.27.pdf