MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

Rahul Gupta; Vivek Srivastava; Mayank Singh

doi:10.18653/v1/2023.findings-eacl.56

Abstract

Multi-sentential, long-sequence text opens up several interesting research directions in natural language processing and generation. Although several high-quality long-sequence datasets exist for English and other individual languages, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixed Hindi-English). In this paper, we propose a novel task: identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources to build MUTANT, a first-of-its-kind multi-sentential code-mixed Hinglish dataset. We propose a token-level language-aware pipeline and extend existing metrics that measure the degree of code-mixing to a multi-sentential framework, which lets us automatically identify MCT in multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we will make the dataset and the code publicly available upon publication.
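To make the idea of measuring the degree of code-mixing from token-level language tags concrete, the following is a minimal sketch. It assumes the Code-Mixing Index (CMI), one existing metric of this kind; the abstract does not name the metrics the paper extends. The tag names (`hi`, `en`, `other`) and the token-weighted average in `multi_sentence_cmi` are hypothetical choices made for illustration, not the authors' pipeline or their multi-sentential formulation.

```python
from collections import Counter


def cmi(tags):
    """Code-Mixing Index for one sentence, given one language tag per token.

    Assumed tag set (hypothetical): 'hi', 'en', and 'other' for
    language-independent tokens such as punctuation or named entities.
    """
    n = len(tags)
    u = sum(1 for t in tags if t == "other")
    if n == u:
        # No language-specific tokens, so there is nothing to mix.
        return 0.0
    counts = Counter(t for t in tags if t != "other")
    dominant = max(counts.values())
    return 100.0 * (1 - dominant / (n - u))


def multi_sentence_cmi(sentences):
    """Token-weighted average of sentence-level CMI over a text span.

    This aggregation is an illustrative assumption, not the paper's extension.
    """
    total = sum(len(s) for s in sentences)
    if total == 0:
        return 0.0
    return sum(cmi(s) * len(s) for s in sentences) / total


if __name__ == "__main__":
    span = [["hi", "en", "hi", "en"], ["en", "en", "other"]]
    print(multi_sentence_cmi(span))
```

A score of 0 means the span is monolingual under this metric, and higher values indicate a more even mix of languages. A threshold on such a score could then be used to flag candidate MCTs, though the threshold itself would be a further assumption.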

Anthology ID: 2023.findings-eacl.56
Volume: Findings of the Association for Computational Linguistics: EACL 2023
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Andreas Vlachos, Isabelle Augenstein
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 744–753
URL: https://aclanthology.org/2023.findings-eacl.56/
DOI: 10.18653/v1/2023.findings-eacl.56
Cite (ACL): Rahul Gupta, Vivek Srivastava, and Mayank Singh. 2023. MUTANT: A Multi-sentential Code-mixed Hinglish Dataset. In Findings of the Association for Computational Linguistics: EACL 2023, pages 744–753, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): MUTANT: A Multi-sentential Code-mixed Hinglish Dataset (Gupta et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-eacl.56.pdf
Video: https://aclanthology.org/2023.findings-eacl.56.mp4