Diacritization: A Challenge to Arabic Treebank Annotation and Parsing

Mohamed Maamouri, Seth Kulick, Ann Bies


Abstract
Arabic diacritization (sometimes referred to as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or gemination), tanween (nunation or indefiniteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts is presented as a major challenge to most Arabic natural language processing tasks, including parsing. Recent studies (Messaoudi et al. 2004; Vergyri & Kirchhoff 2004; Zitouni et al. 2006; Maamouri et al. forthcoming) on the place and impact of diacritization in text-based NLP research are presented, along with an analysis of the weight of the missing diacritics on Treebank morphological and syntactic analyses and the impact on parser development.
Anthology ID:
2006.bcs-1.4
Volume:
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT
Month:
October 23
Year:
2006
Address:
London, UK
Venue:
BCS
Pages:
35–47
URL:
https://aclanthology.org/2006.bcs-1.4
Cite (ACL):
Mohamed Maamouri, Seth Kulick, and Ann Bies. 2006. Diacritization: A Challenge to Arabic Treebank Annotation and Parsing. In Proceedings of the International Conference on the Challenge of Arabic for NLP/MT, pages 35–47, London, UK.
Cite (Informal):
Diacritization: A Challenge to Arabic Treebank Annotation and Parsing (Maamouri et al., BCS 2006)
PDF:
https://aclanthology.org/2006.bcs-1.4.pdf