Generalized Data Augmentation for Low-Resource Translation

Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, Graham Neubig


Abstract
Low-resource language pairs with a paucity of parallel data pose challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing a large amount of monolingual data is regarded as an effective way to alleviate the problem. In this paper, we propose a general framework of data augmentation for low-resource machine translation not only using target-side monolingual data, but also by pivoting through a related high-resource language. Specifically, we experiment with a two-step pivoting method to convert high-resource data to the low-resource language, making best use of available resources to better approximate the true distribution of the low-resource language. First, we inject low-resource words into high-resource sentences through an induced bilingual dictionary. Second, we further edit the high-resource data injected with low-resource words using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by up to 1.5 to 8 BLEU points compared to supervised back-translation baselines.
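The first pivoting step described in the abstract, injecting low-resource words into high-resource sentences through a bilingual dictionary, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `inject_lrl_words`, the whitespace tokenization, and the hand-written toy dictionary are all assumptions made for illustration, whereas the paper uses an induced dictionary.

```python
from typing import Dict


def inject_lrl_words(hrl_sentence: str, hrl_to_lrl: Dict[str, str]) -> str:
    """Swap each high-resource token that has a dictionary entry for its
    low-resource counterpart; tokens without an entry are left unchanged."""
    tokens = hrl_sentence.split()  # assumption: simple whitespace tokenization
    return " ".join(hrl_to_lrl.get(tok, tok) for tok in tokens)


# Hypothetical toy dictionary; in the paper the dictionary is induced, not hand-written.
toy_dict = {"casa": "kasa", "grande": "granda"}

pseudo_lrl = inject_lrl_words("la casa es grande", toy_dict)
print(pseudo_lrl)
```

The output is a mixed sentence: words covered by the dictionary are replaced and the rest stay in the high-resource language. According to the abstract, the second step then edits such sentences further with a modified unsupervised machine translation framework, which this sketch does not attempt to cover.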
Anthology ID:
P19-1579
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5786–5796
URL:
https://aclanthology.org/P19-1579
DOI:
10.18653/v1/P19-1579
Cite (ACL):
Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. 2019. Generalized Data Augmentation for Low-Resource Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5786–5796, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Generalized Data Augmentation for Low-Resource Translation (Xia et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1579.pdf
Video:
https://aclanthology.org/P19-1579.mp4