YADAC: Yet another Dialectal Arabic Corpus

Rania Al-Sabbagh; Roxana Girju

YADAC: Yet another Dialectal Arabic Corpus

Abstract

This paper presents the first phase of building YADAC ― a multi-genre Dialectal Arabic (DA) corpus ― that is compiled using Web data from microblogs (i.e. Twitter), blogs/forums and online knowledge market services in which both questions and answers are user-generated. In addition to introducing two new genres to the current efforts of building DA corpora (i.e. microblogs and question-answer pairs extracted from online knowledge market services), the paper highlights and tackles several new issues related to building DA corpora that have not been handled in previous studies: function-based Web harvesting and dialect identification, vowel-based spelling variation, linguistic hypercorrection and its effect on spelling variation, unsupervised Part-of-Speech (POS) tagging and base phrase chunking for DA. Although the algorithms for both POS tagging and base-phrase chunking are still under development, the results are promising.

Anthology ID:: L12-1387
Volume:: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:: May
Year:: 2012
Address:: Istanbul, Turkey
Editors:: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:: 2882–2889
Language:
URL:: http://www.lrec-conf.org/proceedings/lrec2012/pdf/663_Paper.pdf
DOI:
Bibkey:
Cite (ACL):: Rania Al-Sabbagh and Roxana Girju. 2012. YADAC: Yet another Dialectal Arabic Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2882–2889, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):: YADAC: Yet another Dialectal Arabic Corpus (Al-Sabbagh & Girju, LREC 2012)
Copy Citation:
PDF:: http://www.lrec-conf.org/proceedings/lrec2012/pdf/663_Paper.pdf

PDF Cite Search Fix data