2022
Camel Treebank: An Open Multi-genre Arabic Dependency Treebank
Nizar Habash | Muhammed AbuOdeh | Dima Taji | Reem Faraj | Jamila El Gizuli | Omar Kallas
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present the Camel Treebank (CAMELTB), a 188K-word open-source dependency treebank of Modern Standard and Classical Arabic. CAMELTB 1.0 includes 13 sub-corpora comprising selections of texts from pre-Islamic poetry to online social media commentaries, and covering a range of genres from religious and philosophical texts to news, novels, and student essays. The texts are all publicly available (out of copyright, Creative Commons, or under open licenses). The texts were morphologically tokenized and syntactically parsed automatically, and then manually corrected by a team of trained annotators. The annotations follow the guidelines of the Columbia Arabic Treebank (CATiB) dependency representation. We discuss our annotation process and guideline extensions, and we present some initial observations on lexical and syntactic differences among the annotated sub-corpora. This corpus will be publicly available to support and encourage research on Arabic NLP in general and on new, previously unexplored genres that are of interest to a wider spectrum of researchers, from historical linguistics and digital humanities to computer-assisted language pedagogy.
2019
Morphologically Annotated Corpora for Seven Arabic Dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan
Faisal Alshargi | Shahd Dibas | Sakhar Alkhereyf | Reem Faraj | Basmah Abdulkareem | Sane Yagi | Ouafaa Kacha | Nizar Habash | Owen Rambow
Proceedings of the Fourth Arabic Natural Language Processing Workshop
We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated according to a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.
2018
Unified Guidelines and Resources for Arabic Dialect Orthography
Nizar Habash | Fadhl Eryani | Salam Khalifa | Owen Rambow | Dana Abdulrahim | Alexander Erdmann | Reem Faraj | Wajdi Zaghouani | Houda Bouamor | Nasser Zalmout | Sara Hassan | Faisal Al-Shargi | Sakhar Alkhereyf | Basma Abdulkareem | Ramy Eskander | Mohammad Salameh | Hind Saddiki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2008
Annotating an Arabic Learner Corpus for Error
Ghazi Abuhakema | Reem Faraj | Anna Feldman | Eileen Fitzpatrick
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.