Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Songming Zhang; Yunlong Liang; Shuaibo Wang; Yufeng Chen; Wenjuan Han; Jian Liu; Jinan Xu (徐金安)

doi:10.18653/v1/2023.acl-long.448

Abstract

Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, it is still unclear where the knowledge in KD resides, which may hinder the development of KD. In this work, we first investigate this question empirically and show that the knowledge comes from the top-1 predictions of teachers, which also helps us build a potential connection between word- and sequence-level KD. Based on this finding, we further point out two inherent issues in vanilla word-level KD. First, the current KD objective spreads its focus over whole distributions to learn the knowledge, yet lacks special treatment of the most crucial top-1 information. Second, the knowledge is largely covered by the golden information, because most top-1 predictions of teachers overlap with ground-truth tokens, which further restricts the potential of KD. To address these issues, we propose a new method named Top-1 Information Enhanced Knowledge Distillation (TIE-KD). Specifically, we design a hierarchical ranking loss to enforce the learning of the top-1 information from the teacher. Additionally, we develop an iterative KD procedure to infuse additional knowledge by distilling on data without ground-truth targets. Experiments on WMT’14 English-German, WMT’14 English-French and WMT’16 English-Romanian demonstrate that our method boosts Transformer_base students by +1.04, +0.60 and +1.11 BLEU, respectively, and significantly outperforms the vanilla word-level KD baseline. In addition, our method generalizes better across different teacher-student capacity gaps than existing KD techniques.
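
To make the two observations in the abstract concrete, the following minimal Python sketch computes a vanilla word-level KD loss over whole distributions and measures how often the teacher's top-1 prediction coincides with the ground-truth token. It is an illustration under stated assumptions, not the TIE-KD method: the function names (`word_level_kd_loss`, `top1_overlap`), the use of KL divergence as the distillation objective, and the toy logits are all hypothetical choices made here, and the paper's hierarchical ranking loss is not reproduced.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_kd_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student) over whole distributions.

    Assumption: KL divergence stands in for the word-level KD objective;
    it differs from teacher-student cross-entropy only by the teacher's
    entropy, which is constant with respect to the student.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher distribution at this position
        q = softmax(s_row)  # student distribution at this position
        total += sum(
            pi * (math.log(pi) - math.log(qi))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )
    return total / len(teacher_logits)


def top1_overlap(teacher_logits, gold_ids):
    """Fraction of positions where the teacher's argmax equals the gold token."""
    hits = 0
    for row, gold in zip(teacher_logits, gold_ids):
        top1 = max(range(len(row)), key=row.__getitem__)
        hits += int(top1 == gold)
    return hits / len(gold_ids)


# Toy example (hypothetical values): 3 target positions, vocabulary of size 3.
teacher = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.1, 3.0]]
student = [[1.0, 0.5, 0.0], [0.2, 0.9, 0.4], [0.5, 0.2, 1.0]]
gold = [0, 1, 1]

kd_loss = word_level_kd_loss(teacher, student)
overlap = top1_overlap(teacher, gold)
```

In this sketch the loss weights every vocabulary entry by its teacher probability and gives no special role to the top-1 token, while a high `top1_overlap` on real data would correspond to the abstract's point that teacher top-1 predictions mostly coincide with ground-truth tokens.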

Anthology ID: 2023.acl-long.448
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8062–8079
URL: https://aclanthology.org/2023.acl-long.448/
DOI: 10.18653/v1/2023.acl-long.448
Cite (ACL): Songming Zhang, Yunlong Liang, Shuaibo Wang, Yufeng Chen, Wenjuan Han, Jian Liu, and Jinan Xu. 2023. Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8062–8079, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation (Zhang et al., ACL 2023)
PDF: https://aclanthology.org/2023.acl-long.448.pdf
Video: https://aclanthology.org/2023.acl-long.448.mp4
