Selective Knowledge Distillation for Neural Machine Translation

Fusheng Wang, Jianhao Yan, Fandong Meng, Jie Zhou


Abstract
Neural Machine Translation (NMT) models achieve state-of-the-art performance on many translation benchmarks. As an active research field in NMT, knowledge distillation is widely applied to enhance a model's performance by transferring the teacher model's knowledge on each training sample. However, previous work rarely discusses the different impacts of, and connections among, these samples, which serve as the medium for transferring teacher knowledge. In this paper, we design a novel protocol that can effectively analyze the different impacts of samples by comparing various partitions of the samples. Based on this protocol, we conduct extensive experiments and find that more teacher knowledge is not always better: knowledge from specific samples may even hurt the overall performance of knowledge distillation. Finally, to address this issue, we propose two simple yet effective strategies, i.e., batch-level and global-level selection, to pick suitable samples for distillation. We evaluate our approaches on two large-scale machine translation tasks, WMT'14 English-German and WMT'19 Chinese-English. Experimental results show that our approaches yield up to +1.28 and +0.89 BLEU point improvements over the Transformer baseline, respectively.
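The two selection strategies in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the FIFO queue size, and the use of per-sample cross-entropy as the ranking criterion are assumptions; batch-level selection ranks samples within the current batch, while global-level selection compares each sample against a running pool of recent values.

```python
from collections import deque


def batch_level_select(losses, ratio=0.5):
    """Batch-level selection (sketch): pick indices of the top `ratio`
    fraction of samples, ranked by per-sample loss, within one batch.
    Only these samples would receive the distillation loss."""
    k = max(1, int(len(losses) * ratio))
    # Indices of the k largest losses, in descending order of loss.
    return sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)[:k]


class GlobalLevelSelector:
    """Global-level selection (sketch): keep a FIFO queue of recent
    per-sample losses; a sample qualifies for distillation when its
    loss reaches the queue's `ratio`-quantile threshold."""

    def __init__(self, queue_size=1000, ratio=0.5):
        self.queue = deque(maxlen=queue_size)  # drops oldest entries
        self.ratio = ratio

    def select(self, losses):
        self.queue.extend(losses)
        ordered = sorted(self.queue)
        # Threshold at the ratio-quantile of the recent-loss pool.
        threshold = ordered[int(len(ordered) * self.ratio)]
        return [i for i, l in enumerate(losses) if l >= threshold]
```

For example, `batch_level_select([0.1, 0.9, 0.5, 0.3], ratio=0.5)` keeps the two hardest samples (indices 1 and 2), whereas the global selector's threshold also reflects losses seen in earlier batches.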
Anthology ID:
2021.acl-long.504
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
6456–6466
URL:
https://aclanthology.org/2021.acl-long.504
DOI:
10.18653/v1/2021.acl-long.504
Bibkey:
Cite (ACL):
Fusheng Wang, Jianhao Yan, Fandong Meng, and Jie Zhou. 2021. Selective Knowledge Distillation for Neural Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6456–6466, Online. Association for Computational Linguistics.
Cite (Informal):
Selective Knowledge Distillation for Neural Machine Translation (Wang et al., ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-long.504.pdf
Video:
https://aclanthology.org/2021.acl-long.504.mp4
Code:
LeslieOverfitting/selective_distillation