Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data

Mona Diab; Mahmoud Ghoneim; Abdelati Hawwari; Fahad AlGhamdi; Nada Almarwani; Mohamed Al-Badrashiny

Abstract

We present our effort to create a large multi-layered representational repository of linguistic code-switched Arabic data. The process involves developing clear annotation standards and guidelines, streamlining the annotation process, and implementing quality control measures. We used two main annotation protocols: in-lab gold annotation and crowdsourced annotation. We developed a web-based annotation tool to facilitate management of the annotation process. The current version of the repository contains 886,252 tokens, each tagged with one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic and represents three genres: tweets, commentaries, and discussion fora. The overall inter-annotator agreement is 93.1%.

Anthology ID: L16-1669
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4228–4235
URL: https://aclanthology.org/L16-1669/
Cite (ACL): Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Nada AlMarwani, and Mohamed Al-Badrashiny. 2016. Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4228–4235, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data (Diab et al., LREC 2016)
PDF: https://aclanthology.org/L16-1669.pdf
