When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?

Srikrishna Iyer


Abstract
We present our submission to the BabyLM Challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds on deep mutual learning and introduces a student model search for diverse initialization. To address the limitation of treating all students equally, we formulate weighted mutual learning as a bi-level optimization problem: the inner loop learns compact students through online distillation, while the outer loop optimizes per-student weights so that knowledge is distilled more effectively from the diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.
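To make the inner/outer loop structure concrete, below is a minimal sketch of weighted mutual learning in PyTorch. It is not the paper's implementation: the tiny classifier standing in for a student language model, the names (StudentNet, inner_step, outer_step, weight_logits), and the first-order simplification of the bi-level problem (the outer step updates only the softmax-parameterized student weights on held-out data, with student parameters frozen) are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentNet(nn.Module):
    """Toy stand-in for a compact student model (hypothetical)."""
    def __init__(self, vocab=100, hidden=32, seq_len=8):
        super().__init__()
        self.net = nn.Sequential(nn.Embedding(vocab, hidden),
                                 nn.Flatten(),
                                 nn.Linear(hidden * seq_len, vocab))
    def forward(self, x):          # x: (batch, seq_len) token ids
        return self.net(x)         # logits over the vocabulary

def inner_step(students, optimizers, weights, x, y, T=2.0):
    """Inner loop: each student learns from the data (cross-entropy) and from a
    weighted average of its peers' softened predictions (online distillation)."""
    with torch.no_grad():
        probs = [F.softmax(s(x) / T, dim=-1) for s in students]
    for i, (student, opt) in enumerate(zip(students, optimizers)):
        logits = student(x)
        # Weighted peer ensemble, excluding student i itself.
        peer_w = torch.cat([weights[:i], weights[i + 1:]])
        peer_w = peer_w / peer_w.sum()
        peer_p = sum(w * p for w, p in zip(peer_w, probs[:i] + probs[i + 1:]))
        ce = F.cross_entropy(logits, y)
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1), peer_p,
                      reduction="batchmean") * T * T
        loss = ce + kd
        opt.zero_grad(); loss.backward(); opt.step()

def outer_step(students, weight_logits, weight_opt, x_val, y_val):
    """Outer loop (first-order approximation): adjust the student weights so the
    weighted ensemble performs well on held-out data; students stay frozen here."""
    weights = F.softmax(weight_logits, dim=0)
    with torch.no_grad():
        probs = [F.softmax(s(x_val), dim=-1) for s in students]
    ensemble = sum(w * p for w, p in zip(weights, probs))
    loss = F.nll_loss(torch.log(ensemble + 1e-9), y_val)
    weight_opt.zero_grad(); loss.backward(); weight_opt.step()
    return F.softmax(weight_logits.detach(), dim=0)

# Toy training loop on random data, just to show the alternation.
torch.manual_seed(0)
students = [StudentNet(hidden=h) for h in (16, 32, 48)]   # diverse initializations
optims = [torch.optim.Adam(s.parameters(), lr=1e-3) for s in students]
weight_logits = torch.zeros(len(students), requires_grad=True)
weight_opt = torch.optim.Adam([weight_logits], lr=1e-2)

for step in range(5):
    x, y = torch.randint(0, 100, (8, 8)), torch.randint(0, 100, (8,))
    xv, yv = torch.randint(0, 100, (8, 8)), torch.randint(0, 100, (8,))
    weights = F.softmax(weight_logits.detach(), dim=0)
    inner_step(students, optims, weights, x, y)
    weights = outer_step(students, weight_logits, weight_opt, xv, yv)
    print(step, [round(w, 3) for w in weights.tolist()])

In this sketch the outer objective is simply the ensemble's validation loss; the paper's actual bi-level formulation and weighting objective may differ.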
Anthology ID: 2024.conll-babylm.17
Volume: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month: November
Year: 2024
Address: Miami, FL, USA
Editors: Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues: CoNLL | BabyLM | WS
Publisher: Association for Computational Linguistics
Pages: 197–211
URL: https://aclanthology.org/2024.conll-babylm.17/
Cite (ACL): Srikrishna Iyer. 2024. When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 197–211, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal): When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets? (Iyer, CoNLL-BabyLM 2024)
PDF: https://aclanthology.org/2024.conll-babylm.17.pdf