PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation

Vivek Srivastava, Mayank Singh


Abstract
Code-mixing is the phenomenon of using more than one language in a sentence. In the multilingual communities, it is a very frequently observed pattern of communication on social media platforms. Flexibility to use multiple languages in one text message might help to communicate efficiently with the target audience. But, the noisy user-generated code-mixed text adds to the challenge of processing and understanding natural language to a much larger extent. Machine translation from monolingual source to the target language is a well-studied research problem. Here, we demonstrate that widely popular and sophisticated translation systems such as Google Translate fail at times to translate code-mixed text effectively. To address this challenge, we present a parallel corpus of the 13,738 code-mixed Hindi-English sentences and their corresponding human translation in English. In addition, we also propose a translation pipeline build on top of Google Translate. The evaluation of the proposed pipeline on PHINC demonstrates an increase in the performance of the underlying system. With minimal effort, we can extend the dataset and the proposed approach to other code-mixing language pairs.
Anthology ID:
2020.wnut-1.7
Volume:
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | WNUT
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
41–49
Language:
URL:
https://aclanthology.org/2020.wnut-1.7
DOI:
10.18653/v1/2020.wnut-1.7
Bibkey:
Copy Citation:
PDF:
https://aclanthology.org/2020.wnut-1.7.pdf
Data
PHINC