Syllable Structures Across Arabic Varieties

Abdelrahim Qaddoumi, Jordan Kodner, Salam Khalifa, Ellen Broselow, Owen Rambow


Abstract
This study compares the syllable structures of nine Arabic varieties, drawing on Wiktionary data and a computational syllabifier. It further investigates methods for learning syllable boundaries in unsyllabified words transcribed in the International Phonetic Alphabet (IPA). The syllabification algorithm is evaluated under three conditions: (i) Default, employing fixed rules; (ii) Joint, learning onsets and codas across all varieties collectively; and (iii) Per-variety, learning onsets and codas specific to each variety. Results indicate that the default configuration yields the highest accuracy, ranging from 97.05% to 100%. The per-variety approach achieves 90.64% to 100% accuracy, while the joint approach ranges from 84.63% to 94.74%. A cross-variety analysis using Jensen-Shannon divergence reveals three principal groupings: Egyptian, Hejazi, and Modern Standard Arabic are closely related; Levantine and Gulf varieties constitute a second cluster; and Juba Arabic, Maltese, and Moroccan emerge as outliers. A cleaned dataset encompassing all nine varieties is also provided.
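As an illustration of the cross-variety comparison mentioned in the abstract, the following is a minimal sketch of Jensen-Shannon divergence computed over syllable-shape frequency tables. It is not the authors' code: the function name js_divergence, the choice of base-2 logarithms, the use of CV-skeleton counts as the compared distribution, and the toy counts are all assumptions made for illustration only.

```python
import math
from collections import Counter


def js_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two count tables.

    Hypothetical helper: the paper's exact feature set and log base are not
    stated in the abstract.
    """
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    if total_p == 0 or total_q == 0:
        raise ValueError("both count tables must be non-empty")

    jsd = 0.0
    for key in set(counts_p) | set(counts_q):
        p = counts_p.get(key, 0) / total_p
        q = counts_q.get(key, 0) / total_q
        m = 0.5 * (p + q)  # mixture distribution
        # Each term is half of a KL divergence against the mixture;
        # zero-probability entries contribute nothing.
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


# Toy, made-up syllable-shape counts (NOT data from the paper).
variety_a = Counter({"CV": 50, "CVC": 40, "CVCC": 10})
variety_b = Counter({"CV": 45, "CVC": 45, "CVCC": 10})
variety_c = Counter({"CV": 20, "CVC": 30, "CCVC": 50})

if __name__ == "__main__":
    print("a vs b:", js_divergence(variety_a, variety_b))
    print("a vs c:", js_divergence(variety_a, variety_c))
```

Under this sketch, varieties with similar syllable-shape profiles get a divergence near 0 and varieties with no shared shapes get 1; pairwise values of this kind could then be fed to a clustering step to look for groupings.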
Anthology ID:
2026.vardial-1.21
Volume:
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
250–260
URL:
https://aclanthology.org/2026.vardial-1.21/
Cite (ACL):
Abdelrahim Qaddoumi, Jordan Kodner, Salam Khalifa, Ellen Broselow, and Owen Rambow. 2026. Syllable Structures Across Arabic Varieties. In Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 250–260, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Syllable Structures Across Arabic Varieties (Qaddoumi et al., VarDial 2026)
PDF:
https://aclanthology.org/2026.vardial-1.21.pdf