Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus

Mohamed Balabel, Injy Hamed, Slim Abdennadher, Ngoc Thang Vu, Özlem Çetinoğlu


Abstract
Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly due to the lack of available data needed for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian Arabic code-switch speech data that is fully tokenized, lemmatized, and annotated with part-of-speech tags. Besides the corpus itself, we provide annotation guidelines that address the unique challenges of annotating code-switch data. Another challenge that we address is the fact that Egyptian Arabic orthography and grammar are not standardized.
Anthology ID:
2020.lrec-1.489
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3973–3977
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.489
Cite (ACL):
Mohamed Balabel, Injy Hamed, Slim Abdennadher, Ngoc Thang Vu, and Özlem Çetinoğlu. 2020. Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3973–3977, Marseille, France. European Language Resources Association.
Cite (Informal):
Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus (Balabel et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.489.pdf