Language Identification and Analysis of Code-Switched Social Media Text

Deepthi Mave, Suraj Maharjan, Thamar Solorio


Abstract
In this paper, we compare word-level language identification systems on code-switched Hindi-English data and a standard Spanish-English dataset. To this end, we build a new code-switched dataset for Hindi-English. To understand the code-switching patterns in these language pairs, we investigate different code-switching metrics. We find that the CRF model outperforms the neural network-based models by a margin of 2-5 percentage points for Spanish-English and 3-5 percentage points for Hindi-English.
Anthology ID:
W18-3206
Volume:
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venues:
ACL | WS
Publisher:
Association for Computational Linguistics
Pages:
51–61
URL:
https://aclanthology.org/W18-3206
DOI:
10.18653/v1/W18-3206
Cite (ACL):
Deepthi Mave, Suraj Maharjan, and Thamar Solorio. 2018. Language Identification and Analysis of Code-Switched Social Media Text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Language Identification and Analysis of Code-Switched Social Media Text (Mave et al., 2018)
PDF:
https://aclanthology.org/W18-3206.pdf