A System for Diacritizing Four Varieties of Arabic

Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish, Mohamed Eldesouki, Younes Samih, Hassan Sajjad


Abstract
Short vowels, aka diacritics, are more often omitted when writing different varieties of Arabic including Modern Standard Arabic (MSA), Classical Arabic (CA), and Dialectal Arabic (DA). However, diacritics are required to properly pronounce words, which makes diacritic restoration (a.k.a. diacritization) essential for language learning and text-to-speech applications. In this paper, we present a system for diacritizing MSA, CA, and two varieties of DA, namely Moroccan and Tunisian. The system uses a character level sequence-to-sequence deep learning model that requires no feature engineering and beats all previous SOTA systems for all the Arabic varieties that we test on.
Anthology ID:
D19-3037
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Sebastian Padó, Ruihong Huang
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Note:
Pages:
217–222
Language:
URL:
https://aclanthology.org/D19-3037
DOI:
10.18653/v1/D19-3037
Bibkey:
Cite (ACL):
Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish, Mohamed Eldesouki, Younes Samih, and Hassan Sajjad. 2019. A System for Diacritizing Four Varieties of Arabic. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 217–222, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
A System for Diacritizing Four Varieties of Arabic (Mubarak et al., EMNLP-IJCNLP 2019)
Copy Citation:
PDF:
https://aclanthology.org/D19-3037.pdf