MUCS@MixMT: IndicTrans-based Machine Translation for Hinglish Text

Asha Hegde, Shashirekha Lakshmaiah


Abstract
Code-mixing is the phenomenon of mixing linguistic units such as paragraphs, sentences, phrases, and words of one language with those of another language in a text. Code-mixing is predominantly used by social media users who know more than one language. Processing code-mixed text is challenging because of its characteristics and the lack of tools that support such data. Further, pretrained models can be used for formal text but not for informal text such as code-mixed text. Developing efficient Machine Translation (MT) systems for code-mixed text is challenging due to the lack of code-mixed training data. Further, existing MT systems developed to translate monolingual data are not portable to code-mixed text, mainly due to its informal nature. To address the MT challenges of code-mixed text, this paper describes the MT models submitted by our team, MUCS, to the Code-mixed Machine Translation (MixMT) shared task at the Workshop on Machine Translation (WMT), organized in connection with Empirical Methods in Natural Language Processing (EMNLP) 2022. This shared task has two subtasks: i) subtask 1, to translate English sentences and their corresponding Hindi translations into Hinglish text, and ii) subtask 2, to translate Hinglish text into English text. The proposed models, which translate English text to Hinglish (English-Hindi code-mixed text) and vice versa, comprise i) transliterating Hinglish text from Latin to Devanagari script and vice versa, ii) pseudo translation generation using existing models, and iii) efficient target generation by combining the pseudo translations with the training data provided by the shared task organizers. The proposed models obtained 5th and 3rd rank with Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores of 0.35806 and 0.55453 for subtask 1 and subtask 2, respectively.
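The three steps named in the abstract (transliteration, pseudo translation generation, and combination with the organizers' training data) can be illustrated with a minimal Python sketch for the Hinglish-to-English direction. This is not the authors' implementation: the function `build_augmented_corpus`, the toy word map, and the toy lookup translator are all hypothetical stand-ins, and a real system would presumably plug in an actual transliterator and an existing MT model in their place.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

# Hypothetical toy word map standing in for a real Latin-to-Devanagari transliterator.
TOY_ROMAN_TO_DEVANAGARI: Dict[str, str] = {
    "mujhe": "मुझे",
    "yeh": "यह",
    "pasand": "पसंद",
    "hai": "है",
}

# Hypothetical toy lookup standing in for an existing MT model.
TOY_TRANSLATIONS: Dict[str, str] = {
    "मुझे यह movie पसंद है": "I like this movie",
}


def toy_transliterate(sentence: str) -> str:
    """Step i (toy): map known romanized Hindi words to Devanagari, keep the rest."""
    return " ".join(TOY_ROMAN_TO_DEVANAGARI.get(w, w) for w in sentence.split())


def toy_pseudo_translate(sentence: str) -> str:
    """Step ii (toy): return a pseudo translation, or an empty string if unknown."""
    return TOY_TRANSLATIONS.get(sentence, "")


def build_augmented_corpus(
    gold_pairs: List[Pair],
    extra_sources: List[str],
    transliterate: Callable[[str], str],
    pseudo_translate: Callable[[str], str],
) -> List[Pair]:
    """Step iii: combine gold training pairs with pseudo-translated pairs."""
    pseudo_pairs: List[Pair] = []
    for src in extra_sources:
        hypothesis = pseudo_translate(transliterate(src))
        if hypothesis.strip():  # drop sources the model could not translate
            pseudo_pairs.append((src, hypothesis))

    # Keep gold pairs first and remove exact duplicates, preserving order.
    seen = set()
    combined: List[Pair] = []
    for pair in gold_pairs + pseudo_pairs:
        if pair not in seen:
            seen.add(pair)
            combined.append(pair)
    return combined


if __name__ == "__main__":
    gold = [("yeh movie achhi hai", "this movie is good")]
    extra = ["mujhe yeh movie pasand hai", "kuch aur"]
    for source, target in build_augmented_corpus(
        gold, extra, toy_transliterate, toy_pseudo_translate
    ):
        print(source, "->", target)
```

Passing the transliterator and translator in as callables keeps the combination step independent of any particular model, so the same function could be reused for the opposite translation direction with different components.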
Anthology ID:
2022.wmt-1.113
Volume:
Proceedings of the Seventh Conference on Machine Translation (WMT)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, Marcos Zampieri
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
1131–1135
URL:
https://aclanthology.org/2022.wmt-1.113
Cite (ACL):
Asha Hegde and Shashirekha Lakshmaiah. 2022. MUCS@MixMT: IndicTrans-based Machine Translation for Hinglish Text. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1131–1135, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
MUCS@MixMT: IndicTrans-based Machine Translation for Hinglish Text (Hegde & Lakshmaiah, WMT 2022)
PDF:
https://aclanthology.org/2022.wmt-1.113.pdf