A Corpus-Based Investigation of Contemporary Arabic Dialects Using the SADA Corpus

Ghada Alfattni


Abstract
Spoken Arabic exhibits substantial dialectal variation across the Arabic-speaking world. This paper presents a corpus-based analysis of that variation using the SADA corpus, examining lexical, morphosyntactic, and discourse-pragmatic patterns across dialects. We combine quantitative frequency-based measures with qualitative linguistic analysis, including keyword comparison, distributional profiling, collocational and trigram analyses, and similarity-based clustering. Our results show that Arabic dialects share a substantial common core while differing systematically in frequent discourse markers, evaluative expressions, and recurrent phraseological patterns. These findings provide empirical evidence for regional clustering among contemporary dialects and for variation relative to the standard register. The study contributes linguistic insights that support both Arabic dialectology and the development of dialect-aware NLP systems.
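As a rough illustration of the kind of frequency-based measures the abstract names (trigram counts and similarity between dialect profiles), the following minimal sketch may help. It is not the paper's pipeline: the sample tokens (`w1`, `w2`, ...), the dialect labels, and the choice of cosine similarity over relative word frequencies are all assumptions made for illustration, and no SADA data is used.

```python
from collections import Counter
from math import sqrt


def trigrams(tokens):
    """Return the consecutive word trigrams of a token sequence."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def relative_freqs(tokens):
    """Map each token to its relative frequency in the sample."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Hypothetical placeholder samples; the tokens stand in for real words.
samples = {
    "dialect_a": "w1 w2 w3 w4 w1 w2 w3".split(),
    "dialect_b": "w1 w2 w3 w5 w1 w2 w6".split(),
    "dialect_c": "w7 w8 w9 w7 w8 w9 w1".split(),
}

# One frequency profile per dialect, then the most frequent trigram in one sample.
profiles = {name: relative_freqs(tokens) for name, tokens in samples.items()}
top_trigram = Counter(trigrams(samples["dialect_a"])).most_common(1)[0]

sim_ab = cosine(profiles["dialect_a"], profiles["dialect_b"])
sim_ac = cosine(profiles["dialect_a"], profiles["dialect_c"])
```

In this toy setup `dialect_a` and `dialect_b` share most of their vocabulary, so `sim_ab` comes out higher than `sim_ac`. A pairwise similarity matrix built this way could then be passed to a clustering method of one's choice.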
Anthology ID:
2026.abjadnlp-1.35
Volume:
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
AbjadNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
276–286
URL:
https://aclanthology.org/2026.abjadnlp-1.35/
Cite (ACL):
Ghada Alfattni. 2026. A Corpus-Based Investigation of Contemporary Arabic Dialects Using the SADA Corpus. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 276–286, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
A Corpus-Based Investigation of Contemporary Arabic Dialects Using the SADA Corpus (Alfattni, AbjadNLP 2026)
PDF:
https://aclanthology.org/2026.abjadnlp-1.35.pdf