Processing and Understanding Mixed Language Data

Monojit Choudhury, Anirudh Srinivasan, Sandipan Dandapat


Abstract
Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. As multilingual communities are more the norm from a global perspective, it becomes essential that code-switched text and speech are adequately handled by language technologies and NUIs.Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user generated content from some geographies, like South Asia, parts of Europe, and Singapore, are code-mixed. Thus, it is very important to handle code-mixed content as a part of NLP systems and applications for these geographies.In the past 5 years, there has been an active interest in computational models for code-mixing with a substantive research outcome in terms of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial is expecting to fill this gap and provide new researchers in the area with a foundation in both linguistic and computational aspects of code-mixing. We hope that this then becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.
Anthology ID:
D19-2002
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Timothy Baldwin, Marine Carpuat
Venues:
EMNLP | IJCNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
Language:
URL:
https://aclanthology.org/D19-2002
DOI:
Bibkey:
Cite (ACL):
Monojit Choudhury, Anirudh Srinivasan, and Sandipan Dandapat. 2019. Processing and Understanding Mixed Language Data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Processing and Understanding Mixed Language Data (Choudhury et al., EMNLP-IJCNLP 2019)
Copy Citation: