Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation

Vivek Iyer; Arturo Oncevay; Alexandra Birch

doi:10.18653/v1/2023.findings-eacl.72

Abstract

Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can introduce significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violation of linguistic agreement, and an inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a ‘base’ NMT model. We conduct experiments on 3 different language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably to or better than massive models like mBART50 and mRASP2, depending on the size of data provided. We empirically analyse several key factors responsible for these gains, including context, many-to-many substitutions, and code-switching language count, and show that they all contribute to enhanced pretraining of multilingual NMT models.
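To make the lexicon-based baseline criticised in the abstract concrete, the following is a minimal sketch of non-contextual, one-to-one code-switched noising. It is not the authors' implementation: the function name `lexicon_code_switch`, the `switch_prob` parameter, and the toy English-to-Spanish lexicon are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy English-to-Spanish lexicon (assumption, for illustration only).
TOY_LEXICON = {
    "the": ["el"],
    "river": ["río"],
    "bank": ["banco"],
}


def lexicon_code_switch(tokens, lexicon, switch_prob=0.5, rng=None):
    """Replace each token found in the lexicon with a translation.

    Each substitution is one-to-one and ignores the surrounding words,
    which is the non-contextual behaviour the abstract describes.
    """
    rng = rng or random.Random(0)
    noised = []
    for tok in tokens:
        candidates = lexicon.get(tok.lower())
        # Switch only when the lexicon has an entry and the coin flip succeeds.
        if candidates and rng.random() < switch_prob:
            noised.append(rng.choice(candidates))
        else:
            noised.append(tok)
    return noised


if __name__ == "__main__":
    original = ["the", "river", "bank"]
    noised = lexicon_code_switch(original, TOY_LEXICON, switch_prob=1.0)
    # A denoising pretraining pair: the model reads `noised` and reconstructs `original`.
    print(noised, "->", original)
```

In this toy case "bank" becomes "banco", the financial sense, although the context calls for the riverside sense ("orilla"). This is the kind of polyseme error that a context-blind lookup cannot avoid, and it motivates generating the substitutions with a translation model instead.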

Anthology ID: 2023.findings-eacl.72
Volume: Findings of the Association for Computational Linguistics: EACL 2023
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Andreas Vlachos, Isabelle Augenstein
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 984–998
URL: https://aclanthology.org/2023.findings-eacl.72/
DOI: 10.18653/v1/2023.findings-eacl.72
Cite (ACL): Vivek Iyer, Arturo Oncevay, and Alexandra Birch. 2023. Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation. In Findings of the Association for Computational Linguistics: EACL 2023, pages 984–998, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation (Iyer et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-eacl.72.pdf
Software: 2023.findings-eacl.72.software.zip
Video: https://aclanthology.org/2023.findings-eacl.72.mp4
