Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages

Sonal Sannigrahi, Rachel Bawden


Abstract
Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. Strategies for improving the cross-lingual ability of these models include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati and Nepali into English. We explore the trade-offs that exist in translation performance between data sampling and vocabulary size, and we explore whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements, and our analysis suggests that our multilingual MT models trained on original scripts are already robust to cross-script differences, even for relatively low-resource languages.
Anthology ID:
2023.eamt-1.18
Volume:
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Month:
June
Year:
2023
Address:
Tampere, Finland
Editors:
Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation
Pages:
181–192
URL:
https://aclanthology.org/2023.eamt-1.18
Cite (ACL):
Sonal Sannigrahi and Rachel Bawden. 2023. Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 181–192, Tampere, Finland. European Association for Machine Translation.
Cite (Informal):
Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages (Sannigrahi & Bawden, EAMT 2023)
PDF:
https://aclanthology.org/2023.eamt-1.18.pdf