Hubert Jin


2006

pdf bib
Lexicon Development for Varieties of Spoken Colloquial Arabic
David Graff | Tim Buckwalter | Mohamed Maamouri | Hubert Jin
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In Arabic speech communities, there is a diglossic gap between written/formal Modern Standard Arabic (MSA) and spoken/casual colloquial dialectal Arabic (DA): the common spoken language has no standard representation in written form, while the language observed in texts has limited occurrence in speech. Hence the task of developing language resources to describe and model DA speech involves extra work to establish conventions for orthography and grammatical analysis. We describe work being done at the LDC to develop lexicons for DA, comprising pronunciation, morphology and part-of-speech labeling for word forms in recorded speech. Components of the approach are: (a) a two-layer transcription, providing a consonant-skeleton form and a pronunciation form; (b) manual annotation of morphology, part-of-speech and English gloss, followed by development of automatic word parsers modeled on the Buckwalter Morphological Analyzer for MSA; (c) customized user interfaces and supporting tools for all stages of annotation; and (d) a relational database for storing, emending and publishing the transcription corpus as well as the lexicon.