Knowledge Distillation with Reptile Meta-Learning for Pretrained Language Model Compression

Xinge Ma, Jin Wang, Liang-Chih Yu, Xuejie Zhang


Abstract
The billions, and sometimes even trillions, of parameters involved in pre-trained language models significantly hamper their deployment on resource-constrained devices and in real-time applications. Knowledge distillation (KD) can transfer knowledge from the original model (i.e., the teacher) into a compact model (i.e., the student) to achieve model compression. However, previous KD methods have usually frozen the teacher and applied its immutable output feature maps as soft labels to guide the student's training. Moreover, the teacher's goal is to achieve the best performance on downstream tasks rather than to transfer knowledge. Such a fixed setup may limit both the teacher's teaching ability and the student's learning ability. Herein, a knowledge distillation method with Reptile meta-learning is proposed to facilitate the transfer of knowledge from the teacher to the student. Throughout the distillation process, the teacher continuously meta-learns the student's learning objective and adjusts its own parameters to maximize the student's performance. In this way, the teacher learns to teach, produces more suitable soft labels, and transfers more appropriate knowledge to the student, resulting in improved performance. Unlike previous KD methods that use meta-learning, the proposed method needs to calculate only first-order derivatives to update the teacher, leading to lower computational cost and better convergence. Extensive experiments on the GLUE benchmark show the competitive performance achieved by the proposed method. For reproducibility, the code for this paper is available at: https://github.com/maxinge8698/ReptileDistil.
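The abstract's key mechanism, a Reptile-style first-order meta-update of the teacher toward parameters adapted on the student's learning objective, can be illustrated with a minimal sketch. This is not the paper's code: the scalar parameter `theta` and the quadratic inner loss standing in for the student's distillation objective are hypothetical simplifications.

```python
# Illustrative Reptile meta-update on a toy problem (not the paper's implementation).
# theta plays the role of the teacher's parameters; the inner loss
# (theta - target)^2 stands in for the student's learning objective.

def inner_sgd(theta, target, steps=5, lr=0.1):
    """A few plain SGD steps on loss = (theta - target)^2; first-order only."""
    for _ in range(steps):
        grad = 2.0 * (theta - target)  # d/dtheta of (theta - target)^2
        theta -= lr * grad
    return theta

def reptile_step(theta, target, meta_lr=0.5):
    """One Reptile update: move theta toward the inner-loop-adapted parameters.

    Only the difference (adapted - theta) is used, so no second-order
    derivatives through the inner loop are required.
    """
    adapted = inner_sgd(theta, target)
    return theta + meta_lr * (adapted - theta)

theta = 0.0
for _ in range(20):
    theta = reptile_step(theta, target=3.0)
# theta converges toward the inner objective's optimum (3.0)
```

The absence of any backpropagation through the inner loop is what the abstract refers to as needing only first-order derivatives, in contrast to MAML-style meta-learning, which differentiates through the inner optimization.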
Anthology ID: 2022.coling-1.435
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 4907–4917
URL: https://aclanthology.org/2022.coling-1.435
Cite (ACL):
Xinge Ma, Jin Wang, Liang-Chih Yu, and Xuejie Zhang. 2022. Knowledge Distillation with Reptile Meta-Learning for Pretrained Language Model Compression. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4907–4917, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Knowledge Distillation with Reptile Meta-Learning for Pretrained Language Model Compression (Ma et al., COLING 2022)
PDF: https://aclanthology.org/2022.coling-1.435.pdf
Code: maxinge8698/reptiledistil
Data: CoLA, GLUE, MRPC, MultiNLI, QNLI, SST