Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach

Mrinal Dhar, Vaibhav Kumar, Manish Shrivastava


Abstract
Code-mixing, the use of two or more languages in a single sentence, is ubiquitous among multilingual speakers across the world. The phenomenon is especially prominent in social media discourse. Consequently, there is a growing need to translate code-mixed hybrid language into standard languages. However, due to the lack of gold parallel data, existing machine translation systems fail to translate code-mixed text properly. In an effort to initiate the task of machine translation of code-mixed content, we present a newly created parallel corpus of code-mixed English-Hindi and English. We selected previously available English-Hindi code-mixed data as the starting point for our parallel corpus. We then chose four human translators, fluent in both English and Hindi, to translate the 6088 code-mixed English-Hindi sentences into English. With the help of the created parallel corpus, we analyze the structure of English-Hindi code-mixed data and present a technique to augment run-of-the-mill machine translation (MT) approaches, which can help achieve superior translations without the need for specially designed translation systems. We present an augmentation pipeline for existing MT approaches, such as Phrase-Based MT (Moses) and Neural MT, to improve the translation of code-mixed text. The augmentation pipeline is a pre-processing step that can be plugged into any existing MT system, which we demonstrate by improving the translations produced by Moses, Google Neural Machine Translation System (NMTS), and Bing Translator for English-Hindi code-mixed content.
Anthology ID:
W18-3817
Volume:
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Peter Machonis, Anabela Barreiro, Kristina Kocijan, Max Silberztein
Venue:
LR4NLP
Publisher:
Association for Computational Linguistics
Pages:
131–140
URL:
https://aclanthology.org/W18-3817
Cite (ACL):
Mrinal Dhar, Vaibhav Kumar, and Manish Shrivastava. 2018. Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach. In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 131–140, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach (Dhar et al., LR4NLP 2018)
PDF:
https://aclanthology.org/W18-3817.pdf