Multi-Granularity Structural Knowledge Distillation for Language Model Compression

Chang Liu, Chongyang Tao, Jiazhan Feng, Dongyan Zhao


Abstract
Transferring the knowledge to a small model through distillation has raised great interest in recent years. Prevailing methods transfer the knowledge derived from mono-granularity language units (e.g., token-level or sample-level), which is not enough to represent the rich semantics of a text and may lose some vital knowledge. Besides, these methods form the knowledge as individual representations or their simple dependencies, neglecting abundant structural relations among intermediate representations. To overcome the problems, we present a novel knowledge distillation framework that gathers intermediate representations from multiple semantic granularities (e.g., tokens, spans and samples) and forms the knowledge as more sophisticated structural relations specified as the pair-wise interactions and the triplet-wise geometric angles based on multi-granularity representations. Moreover, we propose distilling the well-organized multi-granularity structural knowledge to the student hierarchically across layers. Experimental results on GLUE benchmark demonstrate that our method outperforms advanced distillation methods.
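The two structural relations named in the abstract can be illustrated concretely. Below is a minimal PyTorch sketch of pair-wise interactions (matching similarity matrices over a set of representations) and triplet-wise geometric angles (matching the angle formed at each representation by every pair of others, in the spirit of relational distillation). The function names, the use of cosine-normalized dot products, and the choice of MSE as the matching loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_interaction_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Match pair-wise similarity matrices of student and teacher.
    Inputs are (n, d) stacks of representations at one granularity
    (tokens, spans, or samples)."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def triplet_angle_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Match the cosine of the angle formed at each representation h_j
    by every pair (h_i, h_k), i.e. between (h_i - h_j) and (h_k - h_j)."""
    def angles(h: torch.Tensor) -> torch.Tensor:
        # e[j, i] is the unit direction from h_j toward h_i: shape (n, n, d).
        e = F.normalize(h.unsqueeze(0) - h.unsqueeze(1), dim=-1)
        # Dot products of directions sharing vertex j give cos(angle): (n, n, n).
        return torch.einsum('jid,jkd->jik', e, e)
    return F.mse_loss(angles(student), angles(teacher))

# Usage with random stand-ins for one granularity's representations
# (the student would typically be projected to the teacher's width first):
s_reps = torch.randn(8, 128)  # student representations
t_reps = torch.randn(8, 128)  # teacher representations
loss = pairwise_interaction_loss(s_reps, t_reps) + triplet_angle_loss(s_reps, t_reps)
```

In the framework described by the abstract, losses of this kind would be computed per granularity and distilled hierarchically across layers; the sketch shows only the relational terms themselves.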
Anthology ID:
2022.acl-long.71
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1001–1011
URL:
https://aclanthology.org/2022.acl-long.71
DOI:
10.18653/v1/2022.acl-long.71
Cite (ACL):
Chang Liu, Chongyang Tao, Jiazhan Feng, and Dongyan Zhao. 2022. Multi-Granularity Structural Knowledge Distillation for Language Model Compression. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1001–1011, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Multi-Granularity Structural Knowledge Distillation for Language Model Compression (Liu et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.71.pdf
Code
lc97-pku/mgskd
Data
CoLA, GLUE, MRPC, MultiNLI, QNLI, SST, SST-2