Guidelines and Framework for a Large Scale Arabic Diacritized Corpus

Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, Kemal Oflazer


Abstract
This paper presents the annotation guidelines developed as part of an effort to create a large-scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.
Anthology ID:
L16-1577
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3637–3643
URL:
https://aclanthology.org/L16-1577
Cite (ACL):
Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, and Kemal Oflazer. 2016. Guidelines and Framework for a Large Scale Arabic Diacritized Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3637–3643, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Guidelines and Framework for a Large Scale Arabic Diacritized Corpus (Zaghouani et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1577.pdf