Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

Yuanxin Liu; Fandong Meng; Zheng Lin; Weiping Wang; Jie Zhou

doi:10.18653/v1/2021.acl-long.228

Abstract

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher’s soft label as in conventional KD, researchers find that the rich information contained in the hidden layers of BERT is conducive to the student’s performance. To better exploit the hidden knowledge, a common practice is to force the student to deeply mimic the teacher’s hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher’s hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge for each single dimension and then jointly compress the three dimensions. In this way, we show that 1) the student’s performance can be improved by extracting and distilling the crucial HSK, and 2) using a tiny fraction of HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm to compress BERT, which does not require loading the teacher during the training of the student. For two kinds of student models and computing devices, the proposed KD paradigm gives rise to a training speedup of 2.7x to 3.4x.
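
To make the three HSK dimensions concrete, the following is a minimal sketch, not the authors' implementation. It treats the teacher's hidden states as a nested list indexed by layer (depth), token (length) and feature (width), keeps a subset along each axis, and reports what fraction of the values survives. The function names `select_hsk`, `kept_fraction` and `mse`, the toy tensor, and the particular index choices are all assumptions made for illustration; the abstract does not say which selection strategies work best.

```python
from typing import List, Sequence

Tensor3D = List[List[List[float]]]  # indexed as [layer][token][feature]


def select_hsk(hidden: Tensor3D,
               layers: Sequence[int],
               tokens: Sequence[int],
               features: Sequence[int]) -> Tensor3D:
    """Keep only the chosen layers (depth), tokens (length) and features (width)."""
    return [[[hidden[l][t][f] for f in features] for t in tokens] for l in layers]


def kept_fraction(hidden: Tensor3D, subset: Tensor3D) -> float:
    """Share of scalar values that survive the selection."""
    def size(x: Tensor3D) -> int:
        return sum(len(tok) for layer in x for tok in layer)
    return size(subset) / size(hidden)


def mse(a: Tensor3D, b: Tensor3D) -> float:
    """Mean squared error between two equally shaped tensors."""
    diffs = [(x - y) ** 2
             for la, lb in zip(a, b)
             for ta, tb in zip(la, lb)
             for x, y in zip(ta, tb)]
    return sum(diffs) / len(diffs)


# Toy teacher states: 4 layers, 6 tokens, 8 features (values are arbitrary).
teacher = [[[float(l + t + f) for f in range(8)] for t in range(6)]
           for l in range(4)]

# Hypothetical selection: every other layer, first 3 tokens, first 2 features.
subset = select_hsk(teacher, layers=[1, 3], tokens=[0, 1, 2], features=[0, 1])
fraction = kept_fraction(teacher, subset)
```

In a distillation loop, a student's correspondingly sliced hidden states would be compared against `subset` with a loss such as `mse`. Because the retained subset is small, it could plausibly be computed once and stored ahead of time, which is one way to read the abstract's claim that the teacher need not be loaded during student training.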

Anthology ID: 2021.acl-long.228
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: August
Year: 2021
Address: Online
Editors: Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 2928–2941
URL: https://aclanthology.org/2021.acl-long.228/
DOI: 10.18653/v1/2021.acl-long.228
Cite (ACL): Yuanxin Liu, Fandong Meng, Zheng Lin, Weiping Wang, and Jie Zhou. 2021. Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2928–2941, Online. Association for Computational Linguistics.
Cite (Informal): Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation (Liu et al., ACL-IJCNLP 2021)
PDF: https://aclanthology.org/2021.acl-long.228.pdf
Video: https://aclanthology.org/2021.acl-long.228.mp4
