2014
ArCADE: An Arabic Corpus of Auditory Dictation Errors
C. Anton Rytting | Paul Rodrigues | Tim Buckwalter | Valerie Novak | Aric Bills | Noah H. Silbert | Mohini Madgavkar
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications
2010
Error Correction for Arabic Dictionary Lookup
C. Anton Rytting | Paul Rodrigues | Tim Buckwalter | David Zajic | Bridget Hirsch | Jeff Carnes | Nathanael Lynn | Sarah Wayland | Chris Taylor | Jason White | Charles Blake III | Evelyn Browne | Corey Miller | Tristan Purvis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We describe a new Arabic spelling correction system which is intended for use with electronic dictionary search by learners of Arabic. Unlike other spelling correction systems, this system does not depend on a corpus of attested student errors but on student- and teacher-generated ratings of confusable pairs of phonemes or letters. Separate error modules for keyboard mistypings, phonetic confusions, and dialectal confusions are combined to create a weighted finite-state transducer that calculates the likelihood that an input string could correspond to each citation form in a dictionary of Iraqi Arabic. Results are ranked by the estimated likelihood that a citation form could be misheard, mistyped, or mistranscribed for the input given by the user. To evaluate the system, we developed a noisy-channel model trained on students' speech errors and used it to perturb citation forms from a dictionary. We compare our system to a baseline based on Levenshtein distance and find that, when evaluated on single-error queries, our system performs 28% better than the baseline (overall MRR) and is twice as good at returning the correct dictionary form as the top-ranked result. We believe this to be the first spelling correction system designed for a spoken, colloquial dialect of Arabic.
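The evaluation setup in this abstract (a Levenshtein-distance baseline scored by mean reciprocal rank) can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce their weighted finite-state transducer; the function names, the alphabetical tie-breaking rule, and the transliterated toy lexicon are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_candidates(query: str, lexicon: Sequence[str]) -> List[str]:
    """Rank citation forms by distance to the query.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    return sorted(lexicon, key=lambda form: (levenshtein(query, form), form))


def mean_reciprocal_rank(queries: Sequence[Tuple[str, str]],
                         lexicon: Sequence[str]) -> float:
    """Average of 1/rank of the gold form over (noisy_query, gold_form) pairs."""
    if not queries:
        return 0.0
    total = 0.0
    for noisy, gold in queries:
        ranking = rank_candidates(noisy, lexicon)
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(queries)


if __name__ == "__main__":
    # Hypothetical toy lexicon of transliterated forms, not data from the paper.
    toy_lexicon = ["kitab", "katib", "maktab"]
    toy_queries = [("kitap", "kitab"), ("katab", "kitab")]
    print(mean_reciprocal_rank(toy_queries, toy_lexicon))
```

A system like the one described would replace the unit costs in `levenshtein` with weights derived from the confusability ratings, so that likely confusions cost less than unlikely ones; the MRR computation itself would stay the same.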
2006
Lexicon Development for Varieties of Spoken Colloquial Arabic
David Graff | Tim Buckwalter | Mohamed Maamouri | Hubert Jin
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In Arabic speech communities, there is a diglossic gap between written/formal Modern Standard Arabic (MSA) and spoken/casual colloquial dialectal Arabic (DA): the common spoken language has no standard representation in written form, while the language observed in texts has limited occurrence in speech. Hence the task of developing language resources to describe and model DA speech involves extra work to establish conventions for orthography and grammatical analysis. We describe work being done at the LDC to develop lexicons for DA, comprising pronunciation, morphology and part-of-speech labeling for word forms in recorded speech. Components of the approach are: (a) a two-layer transcription, providing a consonant-skeleton form and a pronunciation form; (b) manual annotation of morphology, part-of-speech and English gloss, followed by development of automatic word parsers modeled on the Buckwalter Morphological Analyzer for MSA; (c) customized user interfaces and supporting tools for all stages of annotation; and (d) a relational database for storing, emending and publishing the transcription corpus as well as the lexicon.
Developing and Using a Pilot Dialectal Arabic Treebank
Mohamed Maamouri | Ann Bies | Tim Buckwalter | Mona Diab | Nizar Habash | Owen Rambow | Dalila Tabessi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our pre-existing MSA resources and the new dialectal corpus.
2004
Issues in Arabic Orthography and Morphology Analysis
Tim Buckwalter
Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages