Handle with Care: A Case Study in Comparable Corpora Exploitation for Neural Machine Translation

Thierry Etchegoyhen, Harritxu Gete


Abstract
We present the results of a case study in the exploitation of comparable corpora for Neural Machine Translation. A large comparable corpus for Basque-Spanish was prepared, on the basis of independently-produced news by the Basque public broadcaster EiTB, and we discuss the impact of various techniques to exploit the original data in order to determine optimal variants of the corpus. In particular, we show that filtering in terms of alignment thresholds and length-difference outliers has a significant impact on translation quality. The impact of tags identifying comparable data in the training datasets is also evaluated, with results indicating that this technique might be useful to help the models discriminate noisy information, in the form of informational imbalance between aligned sentences. The final corpus was prepared according to the experimental results and is made available to the scientific community for research purposes.
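The two filtering steps and the tagging step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The `Pair` type, the `min_score` and `max_z` parameters, the z-score definition of a length-difference outlier, and the `<cc>` tag token are all assumptions made for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Pair:
    src: str      # source-side sentence
    tgt: str      # target-side sentence
    score: float  # alignment score from some aligner (hypothetical, higher is better)


def filter_pairs(pairs, min_score=0.3, max_z=2.0):
    """Keep pairs above an alignment threshold, then drop length-difference outliers.

    Both thresholds are illustrative defaults, not values from the paper.
    """
    # Step 1: alignment-threshold filtering.
    kept = [p for p in pairs if p.score >= min_score]
    if not kept:
        return []

    # Step 2: length-difference outliers, defined here (as an assumption)
    # by the z-score of the token-count difference between the two sides.
    diffs = [len(p.src.split()) - len(p.tgt.split()) for p in kept]
    mu, sigma = mean(diffs), pstdev(diffs)
    if sigma == 0:
        return kept
    return [p for p, d in zip(kept, diffs) if abs(d - mu) / sigma <= max_z]


def tag_comparable(pairs, tag="<cc>"):
    """Prefix the source side with a tag marking the pair as comparable data."""
    return [(f"{tag} {p.src}", p.tgt) for p in pairs]
```

Under these assumptions, a corpus variant would be produced by calling `filter_pairs` with a chosen threshold pair and, for the tagged condition, passing the result to `tag_comparable` before mixing it with other training data.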
Anthology ID:
2020.lrec-1.469
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3799–3807
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.469
Cite (ACL):
Thierry Etchegoyhen and Harritxu Gete. 2020. Handle with Care: A Case Study in Comparable Corpora Exploitation for Neural Machine Translation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3799–3807, Marseille, France. European Language Resources Association.
Cite (Informal):
Handle with Care: A Case Study in Comparable Corpora Exploitation for Neural Machine Translation (Etchegoyhen & Gete, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.469.pdf
Data
EiTB-ParCC