Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation

Muhammad N. ElNokrashy; Amr Hendy; Mohamed Maher; Mohamed Afify; Hany Hassan Awadalla

Abstract

This paper proposes a simple yet effective method to improve direct (X-to-Y) translation in both the zero-shot case and the case where direct data is available. We modify the input tokens at both the encoder and decoder to include signals for the source and target languages. We show a performance gain when training from scratch or when finetuning a pretrained model with the proposed setup. In the experiments, our method shows gains of nearly 10.0 BLEU points on in-house datasets, depending on the checkpoint selection criteria. In a WMT evaluation campaign, From-English performance improves by 4.17 and 2.87 BLEU points in the zero-shot setting and when direct data is available for training, respectively. X-to-Y improves by 1.29 BLEU over the zero-shot baseline and by 0.44 over the many-to-many baseline. In the low-resource setting, we see a 1.5 to 1.7 point improvement when finetuning on X-to-Y domain data.

Anthology ID: 2022.amta-research.6
Volume: Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month: September
Year: 2022
Address: Orlando, USA
Editors: Kevin Duh, Francisco Guzmán
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
Pages: 70–82
URL: https://aclanthology.org/2022.amta-research.6/
Cite (ACL): Muhammad ElNokrashy, Amr Hendy, Mohamed Maher, Mohamed Afify, and Hany Hassan Awadalla. 2022. Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 70–82, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal): Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation (ElNokrashy et al., AMTA 2022)
PDF: https://aclanthology.org/2022.amta-research.6.pdf
