Comparing and combining tagging with different decoding algorithms for back-translation in NMT: learnings from a low resource scenario

Xabier Soto, Olatz Perez-de-Viñaspre, Gorka Labaka, Maite Oronoz

Abstract

Back-translation is a well-established approach to improving the performance of Neural Machine Translation (NMT) systems when large monolingual corpora of the target language and domain are available. Recently, diverse approaches have been proposed to improve the automatic evaluation results of NMT models trained with back-translation, including the use of sampling instead of beam search as the decoding algorithm for creating the synthetic corpus. Alternatively, it has been proposed to append a tag to the back-translated corpus to help the NMT system distinguish the synthetic bilingual corpus from the authentic one. However, not all combinations of these approaches have been tested, so it is not clear which is the best approach for developing a given NMT system. In this work, we empirically compare and combine existing techniques for back-translation in a real low-resource setting: the translation of clinical notes from Basque into Spanish. Apart from automatically evaluating the MT systems, we ask bilingual healthcare workers to perform a human evaluation, and we analyze the different synthetic corpora by measuring their lexical diversity (LD). For reproducibility and generalizability, we repeat our experiments for German-to-English translation using public data. The results suggest that in lower-resource scenarios tagging only helps when using sampling for decoding, in contrast with previous literature that used bigger corpora from the news domain. When fine-tuning with a few thousand bilingual in-domain sentences, one of our proposed methods (tagged restricted sampling) obtains the best results in terms of both automatic and human evaluation. We will publish the code upon acceptance.
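As a rough illustration of two ideas mentioned in the abstract, the sketch below marks synthetic (back-translated) sentence pairs with a tag and computes a simple lexical diversity score for a corpus. It is not the authors' code. The tag string `<BT>`, the choice to place the tag at the start of the source sentence, the whitespace tokenization, and the use of a type-token ratio as the diversity measure are all assumptions made for this example; the abstract does not specify any of them.

```python
from typing import Iterable, List, Tuple

# Hypothetical tag token; the abstract does not name the tag actually used.
BT_TAG = "<BT>"


def tag_synthetic(pairs: Iterable[Tuple[str, str]],
                  tag: str = BT_TAG) -> List[Tuple[str, str]]:
    """Mark each synthetic pair by placing `tag` before its source sentence.

    The source side is the machine-generated (back-translated) text;
    the target side is authentic monolingual text and is left unchanged.
    """
    return [(f"{tag} {src}", tgt) for src, tgt in pairs]


def type_token_ratio(sentences: Iterable[str]) -> float:
    """Distinct tokens divided by total tokens, on whitespace-split text.

    This is one simple proxy for lexical diversity (an assumption here);
    an empty corpus yields 0.0.
    """
    tokens = [tok.lower() for sent in sentences for tok in sent.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    # Toy placeholder data, not taken from the paper.
    synthetic = [("a b a", "x y z"), ("b c", "x w")]
    tagged = tag_synthetic(synthetic)
    print(tagged[0][0])
    print(type_token_ratio(src for src, _ in synthetic))
```

With a measure of this kind, a synthetic corpus produced by sampling would typically be expected to score higher than one produced by beam search, which is the sort of comparison the lexical diversity analysis is meant to support.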

Anthology ID: 2022.eamt-1.6
Volume: Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Month: June
Year: 2022
Address: Ghent, Belgium
Editors: Helena Moniz, Lieve Macken, Andrew Rufener, Loïc Barrault, Marta R. Costa-jussà, Christophe Declercq, Maarit Koponen, Ellie Kemp, Spyridon Pilos, Mikel L. Forcada, Carolina Scarton, Joachim Van den Bogaert, Joke Daems, Arda Tezcan, Bram Vanroy, Margot Fonteyne
Venue: EAMT
Publisher: European Association for Machine Translation
Pages: 31–40
URL: https://aclanthology.org/2022.eamt-1.6/
Cite (ACL): Xabier Soto, Olatz Perez-De-Viñaspre, Gorka Labaka, and Maite Oronoz. 2022. Comparing and combining tagging with different decoding algorithms for back-translation in NMT: learnings from a low resource scenario. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 31–40, Ghent, Belgium. European Association for Machine Translation.
Cite (Informal): Comparing and combining tagging with different decoding algorithms for back-translation in NMT: learnings from a low resource scenario (Soto et al., EAMT 2022)
PDF: https://aclanthology.org/2022.eamt-1.6.pdf