Weight Distillation: Transferring the Knowledge in Neural Network Parameters

Ye Lin; Yanyang Li; Ziyang Wang; Bei Li; Quan Du; Tong Xiao; Jingbo Zhu

doi:10.18653/v1/2021.acl-long.162

Abstract

Knowledge distillation has proven effective for model acceleration and compression. It transfers knowledge from a large neural network to a small one by using the large network's predictions as targets for the small network. But this approach ignores the knowledge inside the large network itself, e.g., its parameters. Our preliminary study, as well as the recent success of pre-training, suggests that transferring parameters is more effective for distilling knowledge. In this paper, we propose Weight Distillation, which transfers the knowledge in the parameters of a large neural network to a small neural network through a parameter generator. On the WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks, our experiments show that weight distillation learns a small network that is 1.88 to 2.94x faster than the large network with competitive BLEU performance. When the size of the small network is fixed, weight distillation outperforms knowledge distillation by 0.51 to 1.82 BLEU points.
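
For orientation, the knowledge distillation baseline that the abstract contrasts against (training the small network on the large network's predictions) can be sketched as a cross-entropy between the two output distributions. This is a minimal illustration of that general idea only, not the paper's weight distillation method or its parameter generator. The function names and the temperature argument are assumptions introduced here, not taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's.

    Hypothetical helper: the teacher's softened predictions act as the
    targets for the student. `temperature` is an illustrative assumption.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


# A student that matches the teacher has a lower loss than one that disagrees.
teacher = [2.0, 1.0, 0.1]
matched = distillation_loss(teacher, [2.0, 1.0, 0.1])
mismatched = distillation_loss(teacher, [0.1, 1.0, 2.0])
print(matched < mismatched)
```

The loss is smallest when the student reproduces the teacher's distribution, which is why it only carries information about the teacher's outputs and nothing about its parameters, the gap the abstract points to.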

Anthology ID: 2021.acl-long.162
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: August
Year: 2021
Address: Online
Editors: Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 2076–2088
URL: https://aclanthology.org/2021.acl-long.162
DOI: 10.18653/v1/2021.acl-long.162
Cite (ACL): Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, and Jingbo Zhu. 2021. Weight Distillation: Transferring the Knowledge in Neural Network Parameters. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2076–2088, Online. Association for Computational Linguistics.
Cite (Informal): Weight Distillation: Transferring the Knowledge in Neural Network Parameters (Lin et al., ACL-IJCNLP 2021)
PDF: https://aclanthology.org/2021.acl-long.162.pdf
Video: https://aclanthology.org/2021.acl-long.162.mp4
