Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

Lukas Edman, Antonio Toral, Gertjan van Noord


Abstract
Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we offer a potential solution: dependency-based word embeddings. These embeddings result in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
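The "dependency-based word embeddings" the abstract refers to replace the linear context window of standard word2vec with a word's syntactic neighbors in a dependency parse, in the style of Levy and Goldberg (2014). As a rough illustration of the idea (the function name and the head/label encoding below are illustrative, not taken from the paper), each word's contexts are its dependents with their relation labels, plus its head with an inverse-marked label:

```python
def dependency_contexts(tokens, heads, labels):
    """Extract Levy & Goldberg (2014)-style dependency contexts.

    tokens: the words of one sentence.
    heads:  0-indexed head position for each token (-1 for the root).
    labels: dependency relation of each token to its head.
    Returns a dict mapping each word to its list of syntactic contexts.
    """
    ctx = {t: [] for t in tokens}
    for i, (tok, h, lab) in enumerate(zip(tokens, heads, labels)):
        if h == -1:          # the root has no head context
            continue
        head_tok = tokens[h]
        # the head sees its dependent, labeled with the relation
        ctx[head_tok].append(f"{tok}/{lab}")
        # the dependent sees its head, labeled with the inverse relation
        ctx[tok].append(f"{head_tok}/{lab}-1")
    return ctx

# "australian scientist discovers star":
# australian -amod-> scientist -nsubj-> discovers <-dobj- star
tokens = ["australian", "scientist", "discovers", "star"]
heads  = [1, 2, -1, 2]
labels = ["amod", "nsubj", "root", "dobj"]
print(dependency_contexts(tokens, heads, labels))
# e.g. "scientist" gets contexts ["australian/amod", "discovers/nsubj-1"]
```

These (word, syntactic-context) pairs are then fed to a skip-gram model with negative sampling instead of (word, window-neighbor) pairs; the resulting embeddings emphasize functional similarity over topical similarity, which is why the abstract describes them as complementary to standard word2vec.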
Anthology ID: 2020.eamt-1.10
Volume: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
Month: November
Year: 2020
Address: Lisboa, Portugal
Editors: André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, Mikel L. Forcada
Venue: EAMT
Publisher: European Association for Machine Translation
Pages: 81–90
URL: https://aclanthology.org/2020.eamt-1.10
Cite (ACL): Lukas Edman, Antonio Toral, and Gertjan van Noord. 2020. Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 81–90, Lisboa, Portugal. European Association for Machine Translation.
Cite (Informal): Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution (Edman et al., EAMT 2020)
PDF: https://aclanthology.org/2020.eamt-1.10.pdf
Code: leukas/lrumt