Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing

Claire Brierley, Majdi Sawalha, Eric Atwell


Abstract
A boundary-annotated and part-of-speech tagged corpus is a prerequisite for developing phrase break classifiers. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwīd (recitation) mark-up in the Qur'an, which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive and signifies a widely used recitation style, one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur'an dataset of 77,430 words and 8,230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. In Sawalha et al. (2012), we use the dataset in phrase break prediction experiments. This research is part of a larger-scale project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
Anthology ID:
L12-1092
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1011–1016
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/240_Paper.pdf
Cite (ACL):
Claire Brierley, Majdi Sawalha, and Eric Atwell. 2012. Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1011–1016, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing (Brierley et al., LREC 2012)