2022
MUCS@MixMT: IndicTrans-based Machine Translation for Hinglish Text
Asha Hegde | Shashirekha Lakshmaiah
Proceedings of the Seventh Conference on Machine Translation (WMT)
Code-mixing is the phenomenon of mixing linguistic units such as paragraphs, sentences, phrases, and words of one language with those of another language in a text. Code-mixing is prevalent among social media users who know more than one language. Processing code-mixed text is challenging because of its characteristics and the lack of tools that support such data. Further, pretrained models are built for formal text and do not handle informal text such as code-mixed text well. Developing efficient Machine Translation (MT) systems for code-mixed text is challenging due to the lack of code-mixed training data, and existing MT systems developed to translate monolingual data are not portable to code-mixed text, mainly because of its informal nature. To address the MT challenges of code-mixed text, this paper describes the MT models submitted by our team MUCS to the Code-mixed Machine Translation (MixMT) shared task in the Workshop on Machine Translation (WMT) organized in connection with Empirical Methods in Natural Language Processing (EMNLP) 2022. The shared task has two subtasks: i) subtask 1 - translating English sentences and their corresponding Hindi translations into Hinglish text, and ii) subtask 2 - translating Hinglish text into English text. The proposed models, which translate English text to Hinglish (English-Hindi code-mixed text) and vice versa, comprise i) transliterating Hinglish text from Latin to Devanagari script and vice versa, ii) generating pseudo translations using existing models, and iii) efficient target generation by combining the pseudo translations with the training data provided by the shared task organizers. The proposed models obtained 5th and 3rd rank with Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores of 0.35806 and 0.55453 for subtask 1 and subtask 2, respectively.
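The Latin-to-Devanagari transliteration step mentioned above can be illustrated with a minimal sketch using the indic_transliteration Python package; this is not necessarily the tool used in the submitted systems, and it assumes ITRANS-style romanization, whereas real Hinglish spellings on social media are noisier.

# Illustrative sketch (assumption: indic_transliteration package, ITRANS-style romanization)
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

# Romanized (Latin-script) Hindi word -> Devanagari script
latin_word = "namaste"
devanagari_word = transliterate(latin_word, sanscript.ITRANS, sanscript.DEVANAGARI)

# Devanagari -> Latin script (back-transliteration)
back_to_latin = transliterate(devanagari_word, sanscript.DEVANAGARI, sanscript.ITRANS)

print(devanagari_word, back_to_latin)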