PACMAN:PArallel CodeMixed dAta generatioN for POS tagging

Arindam Chatterjee, Chhavi Sharma, Ayush Raj, Asif Ekbal


Abstract
Code-mixing, or code-switching, is the mixing of languages within the same context, and is observed predominantly in multilingual societies. Existing code-mixed datasets are small and consist primarily of social media text that does not adhere to standard spelling and grammar. Computational models built on such data fail to generalise to unseen code-mixed data. To address the lack of quality annotated code-mixed datasets, we explore the combined task of generating annotated code-mixed data and building computational models from the generated data, specifically for code-mixed Part-Of-Speech (POS) tagging. We introduce PACMAN (PArallel CodeMixed dAta generatioN), a synthetically generated, POS-tagged code-mixed dataset with more than 50K samples, which is the largest annotated code-mixed dataset. We build POS taggers on the generated data using classical machine learning and deep learning techniques, and report an F1-score of 98% (8% above the current state of the art (SOTA)). To determine the efficacy of our data, we compare it against the existing benchmark in code-mixed POS tagging. PACMAN outperforms the benchmark, confirming that our dataset and, subsequently, our POS tagging models generalise and can handle even natural code-mixed and monolingual data.
Anthology ID:
2022.icon-main.29
Volume:
Proceedings of the 19th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2022
Address:
New Delhi, India
Editors:
Md. Shad Akhtar, Tanmoy Chakraborty
Venue:
ICON
Publisher:
Association for Computational Linguistics
Pages:
234–244
URL:
https://aclanthology.org/2022.icon-main.29
Cite (ACL):
Arindam Chatterjee, Chhavi Sharma, Ayush Raj, and Asif Ekbal. 2022. PACMAN:PArallel CodeMixed dAta generatioN for POS tagging. In Proceedings of the 19th International Conference on Natural Language Processing (ICON), pages 234–244, New Delhi, India. Association for Computational Linguistics.
Cite (Informal):
PACMAN:PArallel CodeMixed dAta generatioN for POS tagging (Chatterjee et al., ICON 2022)
PDF:
https://aclanthology.org/2022.icon-main.29.pdf
Software:
 2022.icon-main.29.software.zip