Morphology-aware Word-Segmentation in Dialectal Arabic Adaptation of Neural Machine Translation

Ahmed Tawfik, Mahitab Emam, Khaled Essam, Robert Nabil, Hany Hassan


Abstract
Parallel corpora available for building machine translation (MT) models for dialectal Arabic (DA) are rather limited. This scarcity has prompted the use of abundant Modern Standard Arabic (MSA) resources to complement the limited dialectal resources. However, clitics often differ between MSA and DA. This paper compares morphology-aware DA word segmentation with other word segmentation approaches, such as Byte Pair Encoding (BPE) and Sub-word Regularization (SR). A set of experiments conducted on Egyptian Arabic (EA), Levantine Arabic (LA), and Gulf Arabic (GA) shows that a sufficiently accurate morphology-aware segmentation used in conjunction with BPE outperforms the other word segmentation approaches.
Anthology ID:
W19-4602
Volume:
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Wassim El-Hajj, Lamia Hadrich Belguith, Fethi Bougares, Walid Magdy, Imed Zitouni, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
11–17
URL:
https://aclanthology.org/W19-4602/
DOI:
10.18653/v1/W19-4602
Cite (ACL):
Ahmed Tawfik, Mahitab Emam, Khaled Essam, Robert Nabil, and Hany Hassan. 2019. Morphology-aware Word-Segmentation in Dialectal Arabic Adaptation of Neural Machine Translation. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 11–17, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Morphology-aware Word-Segmentation in Dialectal Arabic Adaptation of Neural Machine Translation (Tawfik et al., WANLP 2019)
PDF:
https://aclanthology.org/W19-4602.pdf