How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank


Abstract
Recently, various intermediate layer distillation (ILD) objectives have been shown to improve the compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of these objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work to comprehensively evaluate distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initialising the student from the teacher layers, finding a significant impact on performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.
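The abstract contrasts vanilla KD, hidden states transfer, and attention transfer as distillation objectives. As an illustration only, and not the authors' implementation, the sketch below shows one common way to formulate an attention-transfer loss in PyTorch: a mean-squared error between paired teacher and student attention maps. The function name, the choice of MSE, and the layer mapping are assumptions made for this example.

```python
# A minimal sketch of an attention-transfer distillation loss.
# Assumptions (not taken from the paper): MSE over attention maps, a fixed
# student-to-teacher layer mapping, and matching head counts / sequence
# lengths between the two models.
from typing import List, Sequence, Tuple

import torch
import torch.nn.functional as F


def attention_transfer_loss(
    student_attentions: List[torch.Tensor],
    teacher_attentions: List[torch.Tensor],
    layer_map: Sequence[Tuple[int, int]],
) -> torch.Tensor:
    """Mean MSE between paired teacher and student attention maps.

    Each attention tensor has shape (batch, heads, seq_len, seq_len),
    e.g. as returned by Hugging Face models with output_attentions=True.
    layer_map pairs a student layer index with the teacher layer it mimics.
    """
    losses = [
        F.mse_loss(student_attentions[s_idx], teacher_attentions[t_idx])
        for s_idx, t_idx in layer_map
    ]
    return torch.stack(losses).mean()
```

In a full distillation run, such a term would typically be combined with the vanilla KD loss (a temperature-scaled soft cross-entropy on the teacher's logits); the exact objectives, weightings, and layer mappings studied in the paper are described in the PDF linked below.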
Anthology ID:
2023.acl-short.157
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1843–1852
URL:
https://aclanthology.org/2023.acl-short.157
DOI:
10.18653/v1/2023.acl-short.157
Cite (ACL):
Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, and Barbara Plank. 2023. How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1843–1852, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives (Wang et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-short.157.pdf
Video:
https://aclanthology.org/2023.acl-short.157.mp4