Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation

Zhenhao Li; Lucia Specia

doi:10.18653/v1/D19-5543

Abstract

Neural Machine Translation (NMT) models have proven strong when translating clean text, but they are very sensitive to noise in the input. Improving the robustness of NMT models can be seen as a form of “domain” adaptation to noise. The recently created Machine Translation on Noisy Text task corpus provides noisy-clean parallel data for a few language pairs, but this data is very limited in size and diversity. State-of-the-art approaches depend heavily on large volumes of back-translated data. This paper makes two main contributions. First, we propose new data augmentation methods to extend limited noisy data and further improve NMT robustness to noise while keeping the models small. Second, we explore the effect of utilizing noise from external data in the form of speech transcripts and show that it could help robustness.

Anthology ID: D19-5543
Volume: Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue: WNUT
Publisher: Association for Computational Linguistics
Pages: 328–336
URL: https://aclanthology.org/D19-5543/
DOI: 10.18653/v1/D19-5543
Cite (ACL): Zhenhao Li and Lucia Specia. 2019. Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 328–336, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation (Li & Specia, WNUT 2019)
PDF: https://aclanthology.org/D19-5543.pdf
Code: Nickeilf/InformalMT
