Elabbas Benmamoun
2006
Challenges in Processing Colloquial Arabic
Alla Rozovskaya
|
Richard Sproat
|
Elabbas Benmamoun
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT
Processing of Colloquial Arabic is a relatively new area of research, and a number of interesting challenges pertaining to spoken Arabic dialects arise. On the one hand, a whole continuum of Arabic dialects exists, with linguistic differences on phonological, morphological, syntactic, and lexical levels. On the other hand, there are inter-dialectal similarities that need be explored. Furthermore, due to scarcity of dialect-specific linguistic resources and availability of a wide range of resources for Modern Standard Arabic (MSA), it is desirable to explore the possibility of exploiting MSA tools when working on dialects. This paper describes challenges in processing of Colloquial Arabic in the context of language modeling for Automatic Speech Recognition. Using data from Egyptian Colloquial Arabic and MSA, we investigate the question of improving language modeling of Egyptian Arabic with MSA data and resources. As part of the project, we address the problem of linguistic variation between Egyptian Arabic and MSA. To account for differences between MSA and Colloquial Arabic, we experiment with the following techniques of data transformation: morphological simplification (stemming), lexical transductions, and syntactic transformations. While the best performing model remains the one built using only dialectal data, these techniques allow us to obtain an improvement over the baseline MSA model. More specifically, while the effect on perplexity of syntactic transformations is not very significant, stemming of the training and testing data improves the baseline perplexity of the MSA model trained on words by 51%, and lexical transductions yield an 82% perplexity reduction. Although the focus of the present work is on language modeling, we believe the findings of the study will be useful for researchers involved in other areas of processing Arabic dialects, such as parsing and machine translation.