Arabic Diacritization: Stats, Rules, and Hacks

Kareem Darwish, Hamdy Mubarak, Ahmed Abdelali


Abstract
In this paper, we present a new, fast, state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. For word diacritics, we employ a word-level Viterbi decoder with back-off to stems and morphological patterns, together with transliteration-based and sequence-labeling-based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules. We achieve a low word-level diacritization error of 3.29% without case endings and 12.77% with case endings on a new multi-genre, copyright-free test set. We are making the diacritizer freely available for research purposes.
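As a rough illustration of word-level Viterbi decoding over diacritized candidates, the sketch below picks the best-scoring sequence under a bigram model. It is a minimal sketch under stated assumptions, not the authors' implementation: the candidate lexicon, the bigram probabilities, the Buckwalter-style transliterated tokens, and the names `CANDIDATES`, `BIGRAM_LOGP`, `candidates_for`, and `viterbi_diacritize` are all hypothetical. The back-off here simply leaves unknown words undiacritized, standing in for the stem, pattern, and named-entity back-offs the abstract mentions.

```python
import math

# Hypothetical toy lexicon: undiacritized word -> diacritized candidates
# (Buckwalter-style transliteration; values are made up for illustration).
CANDIDATES = {
    "ktb": ["kataba", "kutub"],
    "Alwld": ["Alwalad"],
    "Aldrs": ["Aldars"],
}

# Hypothetical bigram log-probabilities over diacritized forms.
BIGRAM_LOGP = {
    ("<s>", "kataba"): math.log(0.6),
    ("<s>", "kutub"): math.log(0.4),
    ("kataba", "Alwalad"): math.log(0.7),
    ("kutub", "Alwalad"): math.log(0.1),
    ("Alwalad", "Aldars"): math.log(0.5),
}
UNSEEN_LOGP = math.log(1e-6)  # floor score for unseen bigrams


def candidates_for(word):
    # Placeholder back-off: an unknown word is returned undiacritized.
    return CANDIDATES.get(word, [word])


def viterbi_diacritize(words):
    # best maps each candidate of the previous word to (score, path so far).
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for cand in candidates_for(word):
            # Keep only the best-scoring predecessor for this candidate.
            score, path = max(
                (
                    (prev_score + BIGRAM_LOGP.get((prev, cand), UNSEEN_LOGP), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda t: t[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    # Return the path of the highest-scoring final candidate.
    return max(best.values(), key=lambda t: t[0])[1]


print(viterbi_diacritize(["ktb", "Alwld", "Aldrs"]))
```

With the toy scores above, the ambiguous word `ktb` resolves to `kataba` because that path has the higher cumulative bigram score. Case endings are not modeled in this sketch.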
Anthology ID:
W17-1302
Volume:
Proceedings of the Third Arabic Natural Language Processing Workshop
Month:
April
Year:
2017
Address:
Valencia, Spain
Venues:
WANLP | WS
SIG:
SEMITIC
Publisher:
Association for Computational Linguistics
Pages:
9–17
URL:
https://aclanthology.org/W17-1302
DOI:
10.18653/v1/W17-1302
PDF:
https://aclanthology.org/W17-1302.pdf