Code-Switching with Word Senses for Pretraining in Neural Machine Translation

Vivek Iyer, Edoardo Barba, Alexandra Birch, Jeff Pan, Roberto Navigli


Abstract
Lexical ambiguity is a significant and pervasive challenge in Neural Machine Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to handle polysemous words (Campolungo et al., 2022). The same holds for the NMT pretraining paradigm of denoising synthetic “code-switched” text (Pan et al., 2021; Iyer et al., 2023), where word senses are ignored in the noising stage, leading to harmful sense biases in the pretraining data that are subsequently inherited by the resulting models. In this work, we introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT), an end-to-end approach for pretraining multilingual NMT models that leverages word sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality. We then show that our approach scales robustly to various challenging data and resource-scarce scenarios and, finally, report fine-grained accuracy improvements on the DiBiMT disambiguation benchmark. Our studies yield novel insights into the merits and challenges of integrating word sense information and structured knowledge into multilingual pretraining for NMT.
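To make the idea of sense-aware code-switching concrete, here is a minimal Python sketch of the general recipe the abstract describes: disambiguate polysemous words, then substitute translations keyed to the resolved sense rather than the surface form. The names `disambiguate` and `SENSE_LEXICON` are hypothetical stand-ins for a WSD system and a sense-keyed multilingual lexicon (e.g., one derived from a knowledge base such as BabelNet); this is an illustration of the technique, not the paper's actual implementation.

import random

# Hypothetical sense-keyed lexicon: (lemma, sense_id) -> {lang: translation}.
# Keeping the two senses of "bank" separate is what prevents the sense
# biases that sense-agnostic code-switching introduces.
SENSE_LEXICON = {
    ("bank", "bn:00008364n"): {"it": "banca", "de": "Bank"},  # financial institution
    ("bank", "bn:00008363n"): {"it": "riva", "de": "Ufer"},   # river bank
}

def disambiguate(tokens):
    """Hypothetical WSD step: return a sense id per token (or None)."""
    # A real pipeline would call a trained WSD model here; we hard-code
    # the financial sense of "bank" purely for illustration.
    return ["bn:00008364n" if t == "bank" else None for t in tokens]

def code_switch(tokens, target_langs=("it", "de"), p=0.3, seed=0):
    """Replace some tokens with sense-correct translations, producing
    code-switched text that a pretraining objective can then denoise."""
    rng = random.Random(seed)
    senses = disambiguate(tokens)
    out = []
    for tok, sense in zip(tokens, senses):
        entry = SENSE_LEXICON.get((tok, sense))
        if entry and rng.random() < p:
            out.append(entry[rng.choice(target_langs)])
        else:
            out.append(tok)
    return out

# e.g. ['i', 'deposited', 'cash', 'at', 'the', 'banca'], depending on the seed
print(code_switch("i deposited cash at the bank".split()))

Because the substitution is conditioned on the sense id, "bank" in a financial context can only surface as "banca"/"Bank", never as "riva"/"Ufer", which is the core difference from sense-agnostic noising.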
Anthology ID:
2023.findings-emnlp.859
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12889–12901
URL:
https://aclanthology.org/2023.findings-emnlp.859
DOI:
10.18653/v1/2023.findings-emnlp.859
Cite (ACL):
Vivek Iyer, Edoardo Barba, Alexandra Birch, Jeff Pan, and Roberto Navigli. 2023. Code-Switching with Word Senses for Pretraining in Neural Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12889–12901, Singapore. Association for Computational Linguistics.
Cite (Informal):
Code-Switching with Word Senses for Pretraining in Neural Machine Translation (Iyer et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.859.pdf