Hany Fazzaa


2025

pdf bib
Capturing Intra-Dialectal Variation in Qatari Arabic: A Corpus of Cultural and Gender Dimensions
Houda Bouamor | Sara Al-Emadi | Zeinab Ibrahim | Hany Fazzaa | Aisha Al-Sultan
Proceedings of The Third Arabic Natural Language Processing Conference

We present the first publicly available, multidimensional corpus of Qatari Arabic that captures intra-dialectal variation across Urban and Bedouin speakers. While often grouped under the label of “Gulf Arabic”, Qatari Arabic exhibits rich phonological, lexical, and discourse-level differences shaped by gender, age, and sociocultural identity. Our dataset includes aligned speech and transcriptions from 255 speakers, stratified by gender and age, and collected through structured interviews on culturally salient topics such as education, heritage, and social norms. The corpus reveals systematic variation in pronunciation, vocabulary, and narrative style, offering insights for both sociolinguistic analysis and computational modeling. We also demonstrate its utility through preliminary experiments in the prediction of dialects and genders. This work provides the first large-scale, demographically balanced corpus of Qatari Arabic, laying a foundation for both sociolinguistic research and the development of dialect-aware NLP systems.