Analyzing Knowledge Distillation in Neural Machine Translation

Dakun Zhang, Josep Crego, Jean Senellart


Abstract
Knowledge distillation has recently been successfully applied to neural machine translation. It allows for building smaller networks whose resulting systems retain most of the quality of the original model. Although many authors report on the benefits of knowledge distillation, few have discussed the actual reasons why it works, especially in the context of neural MT. In this paper, we conduct several experiments aimed at understanding why and how distillation impacts accuracy on an English-German translation task. We show that translation complexity is actually reduced when building a distilled/synthesised bi-text compared to the reference bi-text. We further remove noisy data from the synthesised translations and merge the filtered synthesised data with the original references, achieving additional gains in accuracy.
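The data-construction procedure described in the abstract (translate the training sources with a teacher, filter noisy synthesised pairs, and merge them with the original references) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function `translate_with_teacher` is a hypothetical placeholder for any trained teacher NMT system, and the length-ratio filter stands in for whichever noise-filtering criterion the authors actually use.

```python
from typing import Callable, Iterable, List, Tuple

def build_distilled_bitext(
    sources: Iterable[str],
    references: Iterable[str],
    translate_with_teacher: Callable[[str], str],
    min_len_ratio: float = 0.5,
    max_len_ratio: float = 2.0,
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs mixing filtered synthesised data with references.

    Sketch of sequence-level knowledge distillation data preparation; the filtering
    heuristic below is illustrative and may differ from the paper's criterion.
    """
    merged: List[Tuple[str, str]] = []
    for src, ref in zip(sources, references):
        hyp = translate_with_teacher(src)  # synthesised target produced by the teacher
        ratio = len(hyp.split()) / max(len(src.split()), 1)
        # Drop synthesised translations whose length ratio to the source is implausible.
        if min_len_ratio <= ratio <= max_len_ratio:
            merged.append((src, hyp))      # distilled pair for training the student
        merged.append((src, ref))          # always keep the original reference pair
    return merged
```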
Anthology ID:
2018.iwslt-1.4
Volume:
Proceedings of the 15th International Conference on Spoken Language Translation
Month:
October 29-30
Year:
2018
Address:
Brussels
Editors:
Marco Turchi, Jan Niehues, Marcello Federico
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
International Conference on Spoken Language Translation
Pages:
23–30
URL:
https://aclanthology.org/2018.iwslt-1.4
Cite (ACL):
Dakun Zhang, Josep Crego, and Jean Senellart. 2018. Analyzing Knowledge Distillation in Neural Machine Translation. In Proceedings of the 15th International Conference on Spoken Language Translation, pages 23–30, Brussels. International Conference on Spoken Language Translation.
Cite (Informal):
Analyzing Knowledge Distillation in Neural Machine Translation (Zhang et al., IWSLT 2018)
PDF:
https://aclanthology.org/2018.iwslt-1.4.pdf