Redistributing Low-Frequency Words: Making the Most of Monolingual Data in Non-Autoregressive Translation

Liang Ding, Longyue Wang, Shuming Shi, Dacheng Tao, Zhaopeng Tu


Abstract
Knowledge distillation (KD) is the preliminary step for training non-autoregressive translation (NAT) models, which eases the training of NAT models at the cost of losing important information for translating low-frequency words. In this work, we provide an appealing alternative for NAT: monolingual KD, which trains the NAT student on external monolingual data with an AT teacher trained on the original bilingual data. Monolingual KD is able to transfer both the knowledge of the original bilingual data (implicitly encoded in the trained AT teacher model) and that of the new monolingual data to the NAT student model. Extensive experiments on eight WMT benchmarks over two advanced NAT models show that monolingual KD consistently outperforms standard KD by improving low-frequency word translation, without introducing any extra computational cost. Monolingual KD enjoys desirable expandability: given more computational budget, it can be further enhanced by combining it with standard KD, applying a reverse monolingual KD, or enlarging the scale of the monolingual data. Extensive analyses demonstrate that these techniques can be used together profitably to further recall the useful information lost in standard KD. Encouragingly, combined with standard KD, our approach achieves 30.4 and 34.1 BLEU points on the WMT14 English-German and German-English datasets, respectively. Our code and trained models are freely available at https://github.com/alphadl/RLFW-NAT.mono.
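
As a rough illustration of the pipeline the abstract describes, the sketch below walks through the three steps of monolingual KD: train an AT teacher on the original bilingual data, decode external monolingual source sentences with that teacher, and train the NAT student on the resulting distilled pairs. All names (AtTeacher, NatStudent, the toy data) are illustrative placeholders, not the authors' implementation; see the linked repository for the actual code.

```python
from typing import List, Tuple

BilingualPair = Tuple[str, str]  # (source sentence, target sentence)

class AtTeacher:
    """Stand-in for an autoregressive translation (AT) model."""
    def train(self, bilingual: List[BilingualPair]) -> None:
        # In practice: fit a Transformer AT model on the bilingual corpus.
        self.memory = dict(bilingual)

    def translate(self, sources: List[str]) -> List[str]:
        # In practice: beam-search decoding; here we just echo a placeholder.
        return [self.memory.get(s, f"<translation of: {s}>") for s in sources]

class NatStudent:
    """Stand-in for a non-autoregressive translation (NAT) model."""
    def train(self, distilled: List[BilingualPair]) -> None:
        # In practice: fit the NAT model on the teacher-distilled pairs.
        self.data = distilled

# Step 1: train the AT teacher on the original bilingual data.
bilingual_data = [("ein Haus", "a house"), ("ein Baum", "a tree")]
teacher = AtTeacher()
teacher.train(bilingual_data)

# Step 2 (monolingual KD): decode *external monolingual* source sentences
# with the teacher, instead of re-decoding the bilingual source side as
# standard KD does.
monolingual_sources = ["ein Hund", "eine Katze"]
distilled_data = list(zip(monolingual_sources,
                          teacher.translate(monolingual_sources)))

# Step 3: train the NAT student on the distilled monolingual pairs.
student = NatStudent()
student.train(distilled_data)
```

Under this reading, combining monolingual KD with standard KD (as in the paper's best result) amounts to also adding the teacher-decoded bilingual source sentences to the student's training set.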
Anthology ID:
2022.acl-long.172
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
2417–2426
URL:
https://aclanthology.org/2022.acl-long.172
DOI:
10.18653/v1/2022.acl-long.172
Cite (ACL):
Liang Ding, Longyue Wang, Shuming Shi, Dacheng Tao, and Zhaopeng Tu. 2022. Redistributing Low-Frequency Words: Making the Most of Monolingual Data in Non-Autoregressive Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2417–2426, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Redistributing Low-Frequency Words: Making the Most of Monolingual Data in Non-Autoregressive Translation (Ding et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.172.pdf
Code:
alphadl/rlfw-nat.mono