Targeted Data Augmentation Improves Context-aware Neural Machine Translation

Harritxu Gete, Thierry Etchegoyhen, Gorka Labaka


Abstract
Progress in document-level Machine Translation is hindered by the lack of parallel training data that include context information. In this work, we evaluate the potential of data augmentation techniques to circumvent this limitation, showing that significant gains can be achieved via upsampling, similar context sampling and back-translation, targeted at context-relevant data. We apply these methods to standard document-level datasets in English-German and English-French and demonstrate their relevance for improving the translation of contextual phenomena. In particular, we show that relatively small volumes of targeted data augmentation lead to significant improvements over a strong context-concatenation baseline and standard back-translation of document-level data. We also compare the accuracy of the selected methods depending on data volumes or distance to relevant context information, and explore their use in combination.
Anthology ID:
2023.mtsummit-research.25
Volume:
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Month:
September
Year:
2023
Address:
Macau SAR, China
Editors:
Masao Utiyama, Rui Wang
Venue:
MTSummit
Publisher:
Asia-Pacific Association for Machine Translation
Pages:
298–312
URL:
https://aclanthology.org/2023.mtsummit-research.25
Cite (ACL):
Harritxu Gete, Thierry Etchegoyhen, and Gorka Labaka. 2023. Targeted Data Augmentation Improves Context-aware Neural Machine Translation. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 298–312, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal):
Targeted Data Augmentation Improves Context-aware Neural Machine Translation (Gete et al., MTSummit 2023)
PDF:
https://aclanthology.org/2023.mtsummit-research.25.pdf