2022
The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic
Dana Abdulrahim | Go Inoue | Latifa Shamsan | Salam Khalifa | Nizar Habash
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In recent years, the focus on developing natural language processing (NLP) tools for Arabic has shifted from Modern Standard Arabic to various Arabic dialects. Corpora of various sizes, representing different genres, have been created for a number of Arabic dialects. As far as Gulf Arabic is concerned, the Gumar Corpus (Khalifa et al., 2016) is the largest corpus to date that includes data representing the dialectal Arabic of the six Gulf Cooperation Council countries (Bahrain, Kuwait, Saudi Arabia, Qatar, United Arab Emirates, and Oman), particularly in the genre of “online forum novels”. In this paper, we present the Bahrain Corpus. Our objective is to create a specialized corpus of the Bahraini Arabic dialect that includes written texts as well as transcripts of audio files, belonging to different genres (folktales, comedy shows, plays, cooking shows, etc.). The carefully curated corpus comprises 620K words. We provide automatic morphological annotations of the full corpus using state-of-the-art morphosyntactic disambiguation for Gulf Arabic. We validate the quality of the annotations on a 7.6K-word sample. We plan to make the annotated sample as well as the full corpus publicly available to support researchers interested in Arabic NLP.
2018
The MADAR Arabic Dialect Corpus and Lexicon
Houda Bouamor | Nizar Habash | Mohammad Salameh | Wajdi Zaghouani | Owen Rambow | Dana Abdulrahim | Ossama Obeid | Salam Khalifa | Fadhl Eryani | Alexander Erdmann | Kemal Oflazer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Unified Guidelines and Resources for Arabic Dialect Orthography
Nizar Habash | Fadhl Eryani | Salam Khalifa | Owen Rambow | Dana Abdulrahim | Alexander Erdmann | Reem Faraj | Wajdi Zaghouani | Houda Bouamor | Nasser Zalmout | Sara Hassan | Faisal Al-Shargi | Sakhar Alkhereyf | Basma Abdulkareem | Ramy Eskander | Mohammad Salameh | Hind Saddiki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
A Morphologically Annotated Corpus of Emirati Arabic
Salam Khalifa | Nizar Habash | Fadhl Eryani | Ossama Obeid | Dana Abdulrahim | Meera Al Kaabi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
A Large Scale Corpus of Gulf Arabic
Salam Khalifa | Nizar Habash | Dana Abdulrahim | Sara Hassan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study on the morphological annotation of Gulf Arabic, which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.
2014
Annotating corpus data for a quantitative, constructional analysis of motion verbs in Modern Standard Arabic
Dana Abdulrahim
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)