SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing

Prashant Kodali, Anmol Goel, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru


Abstract
Code mixing is the linguistic phenomenon in which bilingual speakers switch between two or more languages within a conversation. Recent computational work on code-mixing has leveraged code-mixed social media text to train NLP models. To capture the variety of code-mixing within and across corpora, measures based on Language ID (LID) tags (e.g., CMI) have been proposed. The syntactic variety and patterns of code-mixing, and their relationship to the performance of computational models, remain underexplored. In this work, we investigate a collection of English (en)-Hindi (hi) code-mixed datasets through a syntactic lens and propose SyMCoM, an indicator of syntactic variety in code-mixed text with intuitive theoretical bounds. We train a state-of-the-art en-hi PoS tagger, with an accuracy of 93.4%, to reliably compute PoS tags on a corpus. We then demonstrate the utility of SyMCoM by applying it to various syntactic categories across a collection of datasets and by comparing datasets using the measure.
Anthology ID:
2022.findings-acl.40
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
472–480
URL:
https://aclanthology.org/2022.findings-acl.40
DOI:
10.18653/v1/2022.findings-acl.40
Cite (ACL):
Prashant Kodali, Anmol Goel, Monojit Choudhury, Manish Shrivastava, and Ponnurangam Kumaraguru. 2022. SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing. In Findings of the Association for Computational Linguistics: ACL 2022, pages 472–480, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing (Kodali et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-acl.40.pdf
Data
LinCE