Comparing the Level of Code-Switching in Corpora

Björn Gambäck, Amitava Das


Abstract
Social media texts are often fairly informal and conversational, and when produced by bilinguals tend to be written in several different languages simultaneously, in the same way as conversational speech. The recent availability of large social media corpora has thus also made large-scale code-switched resources available for research. This paper addresses the issues of evaluation and comparison that these new corpora entail by defining an objective measure of the corpus-level complexity of code-switched texts. It also shows how this formal measure can be used in practice by applying it to several code-switched corpora.
Anthology ID:
L16-1292
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1850–1855
URL:
https://aclanthology.org/L16-1292
Cite (ACL):
Björn Gambäck and Amitava Das. 2016. Comparing the Level of Code-Switching in Corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1850–1855, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Comparing the Level of Code-Switching in Corpora (Gambäck & Das, LREC 2016)
PDF:
https://aclanthology.org/L16-1292.pdf