A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic

Mohamed Al-Badrashiny, Abdelati Hawwari, Mona Diab


Abstract
In this paper we present a system for automatic Arabic text diacritization that uses three levels of analysis granularity in a layered back-off manner. We build and exploit diacritized language models (LMs) for each of the three levels of granularity: surface form, morphological segmentation into prefix/stem/suffix, and character level. In each pass, we use Viterbi search to pick the most probable diacritization for each word in the input. We start with the surface-form LM, followed by the morphological level, and finally the character-level LM. Our system outperforms all published systems evaluated on the same training and test data. It achieves a 10.87% word error rate (WER) for full diacritization, including both lexical and syntactic diacritization, and a 3.0% WER for lexical diacritization alone, ignoring syntactic diacritization.
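The layered back-off idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It makes one simplifying assumption: back-off only decides where each word's candidate diacritizations come from (surface lexicon, then prefix/stem/suffix composition, then a character map), and a single bigram Viterbi search then picks the best sequence, whereas the abstract describes a Viterbi pass per level. All tables, scores, and function names (`SURFACE`, `STEMS`, `BIGRAM_LOGP`, `candidates`, `diacritize`, and so on) are hypothetical toy data in an ASCII transliteration.

```python
from itertools import product

# Hypothetical toy resources: undiacritized form -> diacritized candidates.
SURFACE = {
    "ktb": ["kataba", "kutiba"],
    "Alwld": ["Alwaladu", "Alwalada"],
}
PREFIXES = {"": [""], "w": ["wa"]}
STEMS = {"drs": ["daras", "duris"]}
SUFFIXES = {"": [""], "t": ["tu", "at"]}
CHARS = {"q": "qa", "l": "la", "m": "ma"}

# Hypothetical bigram log-probabilities over diacritized words.
BIGRAM_LOGP = {
    ("<s>", "kataba"): -0.4,
    ("kataba", "Alwaladu"): -0.3,
    ("Alwaladu", "wadarasat"): -0.6,
}
UNSEEN_LOGP = -6.0  # assumed flat penalty for unseen bigrams


def candidates(word):
    """Return (level, candidates) from the first level that covers the word."""
    if word in SURFACE:
        return "surface", SURFACE[word]
    morph = []
    # Try every prefix/stem/suffix split whose three parts are all known.
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in PREFIXES and stem in STEMS and suf in SUFFIXES:
                parts = product(PREFIXES[pre], STEMS[stem], SUFFIXES[suf])
                morph.extend("".join(p) for p in parts)
    if morph:
        return "morph", morph
    # Last resort: diacritize character by character.
    return "char", ["".join(CHARS.get(c, c) for c in word)]


def diacritize(words):
    """Bigram Viterbi search over the per-word candidate lists."""
    best = {"<s>": (0.0, [])}  # last candidate -> (score, path so far)
    for word in words:
        _, column = candidates(word)
        new_best = {}
        for cand in column:
            new_best[cand] = max(
                (
                    (score + BIGRAM_LOGP.get((prev, cand), UNSEEN_LOGP), path + [cand])
                    for prev, (score, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

With these toy tables, `ktb` and `Alwld` are covered by the surface lexicon, `wdrst` is only covered by the split `w` + `drs` + `t`, and `qlm` falls through to the character map.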
Anthology ID:
W17-1321
Volume:
Proceedings of the Third Arabic Natural Language Processing Workshop
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Nizar Habash, Mona Diab, Kareem Darwish, Wassim El-Hajj, Hend Al-Khalifa, Houda Bouamor, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue:
WANLP
SIG:
SEMITIC
Publisher:
Association for Computational Linguistics
Pages:
177–184
URL:
https://aclanthology.org/W17-1321
DOI:
10.18653/v1/W17-1321
Cite (ACL):
Mohamed Al-Badrashiny, Abdelati Hawwari, and Mona Diab. 2017. A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 177–184, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic (Al-Badrashiny et al., WANLP 2017)
PDF:
https://aclanthology.org/W17-1321.pdf