Data Selection for Unsupervised Translation of German–Upper Sorbian

Lukas Edman, Antonio Toral, Gertjan van Noord


Abstract
This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German–Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the unsupervised neural machine translation (UNMT) system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.
Anthology ID:
2020.wmt-1.130
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Editors:
Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
1099–1103
URL:
https://aclanthology.org/2020.wmt-1.130
Cite (ACL):
Lukas Edman, Antonio Toral, and Gertjan van Noord. 2020. Data Selection for Unsupervised Translation of German–Upper Sorbian. In Proceedings of the Fifth Conference on Machine Translation, pages 1099–1103, Online. Association for Computational Linguistics.
Cite (Informal):
Data Selection for Unsupervised Translation of German–Upper Sorbian (Edman et al., WMT 2020)
PDF:
https://aclanthology.org/2020.wmt-1.130.pdf
Video:
https://slideslive.com/38939613