Ellen Broselow
2026
Syllable Structures Across Arabic Varieties
Abdelrahim Qaddoumi | Jordan Kodner | Salam Khalifa | Ellen Broselow | Owen Rambow
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
This study compares the syllable structures of nine Arabic varieties from Wiktionary, using a computational syllabifier. It further investigates methods for learning syllable boundaries in unsyllabified words transcribed in the International Phonetic Alphabet (IPA). The syllabification algorithm is evaluated under three conditions: (i) Default, employing fixed rules; (ii) Joint, learning onsets and codas across all varieties collectively; and (iii) Per-variety, learning onsets and codas specific to each variety. Results indicate that the default configuration yields the highest accuracy, ranging from 97.05% to 100%. The per-variety approach achieves 90.64% to 100% accuracy, while the joint approach ranges from 84.63% to 94.74%. A cross-variety analysis using Jensen-Shannon divergence reveals three principal groupings: Egyptian, Hejazi, and Modern Standard Arabic are closely related; Levantine and Gulf varieties constitute a second cluster; and Juba Arabic, Maltese, and Moroccan emerge as outliers. A cleaned dataset encompassing all nine varieties is also provided.
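The cross-variety comparison above relies on Jensen-Shannon divergence between syllable-structure distributions. As a minimal sketch of that computation, the snippet below implements JSD (in bits) over distributions of syllable CV templates; the variety names are from the abstract, but the toy counts are invented for illustration and are not the paper's data.

```python
# Sketch: Jensen-Shannon divergence between syllable-shape distributions.
# The counts below are illustrative only, not from the actual dataset.
from collections import Counter
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence (base 2, so 0 <= JSD <= 1) between
    two probability distributions given as dicts."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Toy syllable-template counts per variety (invented numbers).
egyptian = normalize(Counter({"CV": 50, "CVC": 30, "CVV": 15, "CVCC": 5}))
maltese = normalize(Counter({"CV": 25, "CVC": 40, "CCVC": 20, "CVCC": 15}))

print(f"JSD(Egyptian, Maltese) = {jsd(egyptian, maltese):.3f}")
```

With base-2 logarithms the divergence is bounded in [0, 1], which makes cross-variety distances directly comparable; identical distributions score 0 and distributions with disjoint support score 1.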
2024
Picking Up Where the Linguist Left Off: Mapping Morphology to Phonology through Learning the Residuals
Salam Khalifa | Abdelrahim Qaddoumi | Ellen Broselow | Owen Rambow
Proceedings of the Second Arabic Natural Language Processing Conference
Learning morphophonological mappings between the spoken form of a language and its underlying morphological structures is crucial for enriching resources for morphologically rich languages like Arabic. In this work, we focus on Egyptian Arabic as our case study and explore the integration of linguistic knowledge with a neural transformer model. Our approach involves learning to correct the residual errors from hand-crafted rules to predict the spoken form from a given underlying morphological representation. We demonstrate that using a minimal set of rules, we can effectively recover errors even in very low-resource settings.
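The pipeline described above (hand-crafted rules first, then a learned component that corrects their residual errors) can be sketched as follows. Both the rule and the correction table are toy stand-ins: the paper uses a neural transformer as the corrector, and the rule shown here is a deliberately minimal placeholder, not one of the paper's actual rules.

```python
# Hypothetical sketch of residual learning over hand-crafted rules:
# rules map an underlying morphological representation to an approximate
# surface form; a learned component then corrects the remaining errors.

def apply_rules(underlying: str) -> str:
    """Minimal hand-crafted rule: delete morpheme boundary markers ('+')."""
    return underlying.replace("+", "")

# Stand-in for the trained corrector: a memorized map from rule output to
# the attested spoken form (the paper learns these corrections neurally).
learned_corrections = {
    # e.g., loss of a final short vowel in the Egyptian Arabic spoken form
    "katabtu": "katabt",
}

def predict_surface(underlying: str) -> str:
    approx = apply_rules(underlying)          # rule-based first pass
    return learned_corrections.get(approx, approx)  # learned residual fix

print(predict_surface("katab+tu"))  # rules yield "katabtu"; corrector fixes it
```

The design point is that the learned component only has to model what the rules get wrong, which is why a minimal rule set can still recover errors effectively in very low-resource settings.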
2023
A Cautious Generalization Goes a Long Way: Learning Morphophonological Rules
Salam Khalifa | Sarah Payne | Jordan Kodner | Ellen Broselow | Owen Rambow
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Explicit linguistic knowledge, encoded by resources such as rule-based morphological analyzers, continues to prove useful in downstream NLP tasks, especially for low-resource languages and dialects. Rules are an important asset in descriptive linguistic grammars. However, creating such resources is usually expensive and non-trivial, especially for spoken varieties with no written standard. In this work, we present a novel approach for automatically learning morphophonological rules of Arabic from a corpus. Motivated by classic cognitive models for rule learning, rules are generalized cautiously. Rules that are memorized for individual items are only allowed to generalize to unseen forms if they are sufficiently reliable in the training data. The learned rules are further examined to ensure that they capture true linguistic phenomena described by domain experts. We also investigate the learnability of rules in low-resource settings across different experimental setups and dialects.
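The "cautious generalization" idea above, where a memorized rule may only extend to unseen forms once it is sufficiently reliable in the training data, can be sketched with a reliability threshold. As the criterion we assume Yang's Tolerance Principle (a rule over N items tolerates at most N / ln N exceptions), a classic cognitive model of rule learning; whether this matches the paper's exact threshold is an assumption, and the toy rule below is purely illustrative.

```python
# Sketch of cautious rule generalization with a Tolerance-Principle-style
# reliability check (assumed criterion; the toy rule is not from the paper).
from math import log

def is_productive(n_attested: int, n_exceptions: int) -> bool:
    """Tolerance Principle: a rule attested over N items is productive
    if it has at most N / ln N exceptions."""
    if n_attested < 2:
        return False  # too little evidence to generalize
    return n_exceptions <= n_attested / log(n_attested)

def predict(form, rule, memorized, n_attested, n_exceptions):
    """Seen forms use their memorized output; unseen forms get the rule
    only if it has proven reliable, and are left unchanged otherwise."""
    if form in memorized:
        return memorized[form]
    return rule(form) if is_productive(n_attested, n_exceptions) else form

# Toy rule: word-final b -> p (illustrative only).
rule = lambda w: w[:-1] + "p" if w.endswith("b") else w
memorized = {"kitab": "kitap", "darb": "darp", "qalb": "qalb"}  # 1 exception

# 3 attested items, 1 exception: 1 <= 3/ln 3 ~ 2.73, so the rule applies.
print(predict("harb", rule, memorized, n_attested=3, n_exceptions=1))
```

A rule with too many exceptions fails the check and stays item-specific, which is exactly the cautious behavior the abstract describes: memorized mappings are cheap, but generalization must be earned.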