Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation

Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way


Abstract
Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance.
Anthology ID:
2020.acl-main.359
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3898–3908
URL:
https://aclanthology.org/2020.acl-main.359
DOI:
10.18653/v1/2020.acl-main.359
Cite (ACL):
Xabier Soto, Dimitar Shterionov, Alberto Poncelas, and Andy Way. 2020. Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3898–3908, Online. Association for Computational Linguistics.
Cite (Informal):
Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation (Soto et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.359.pdf
Video:
http://slideslive.com/38929436