Aditya Vavre

2022

Adapting Multilingual Models for Code-Mixed Translation
Aditya Vavre | Abhirut Gupta | Sunita Sarawagi
Findings of the Association for Computational Linguistics: EMNLP 2022

The scarcity of gold-standard code-mixed to pure language parallel data makes it difficult to train translation models reliably. Prior work has addressed the paucity of parallel data with data augmentation techniques. Such methods rely heavily on external resources, making systems difficult to train and scale effectively for multiple languages. We present a simple yet highly effective two-stage back-translation based training scheme for adapting multilingual models to the task of code-mixed translation, which eliminates dependence on external resources. We show a substantial improvement in translation quality (measured through BLEU), beating prior work by up to +3.8 BLEU on code-mixed Hi→En, Mr→En, and Bn→En tasks. On the LinCE Machine Translation leaderboard, we achieve the highest score for code-mixed Es→En, beating the best existing baseline by +6.5 BLEU, and our own stronger baseline by +1.1 BLEU.

2021

Training Data Augmentation for Code-Mixed Translation
Abhirut Gupta | Aditya Vavre | Sunita Sarawagi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel training data for such models by designing a strategy for converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an m-BERT based procedure whose core learnable component is a ternary sequence labeling model that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.

Co-authors

Abhirut Gupta 2
Sunita Sarawagi 2

Venues

Findings 1
NAACL 1
