ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus

Nizar Habash; David Palfreyman


Abstract

We present ZAEBUC, an annotated Arabic-English bilingual writer corpus comprising short essays by first-year university students at Zayed University in the United Arab Emirates. We describe and discuss the various guidelines and pipeline processes we followed to create the annotations and quality-check them. The annotations include spelling and grammar correction, morphological tokenization, part-of-speech (POS) tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All of the annotations are done on Arabic and English texts using consistent guidelines as much as possible, with tracked alignments among the different annotations, and to the original raw texts. For morphological tokenization, POS tagging, and lemmatization, we use existing automatic annotation tools followed by manual correction. We also present various measurements and correlations with preliminary insights drawn from the data and annotations. The publicly available ZAEBUC corpus and its annotations are intended to be the stepping stones for additional annotations.

Anthology ID: 2022.lrec-1.9
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 79–88
URL: https://aclanthology.org/2022.lrec-1.9/
Cite (ACL): Nizar Habash and David Palfreyman. 2022. ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 79–88, Marseille, France. European Language Resources Association.
Cite (Informal): ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus (Habash & Palfreyman, LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.9.pdf
