A Tokenization System for the Kurdish Language

Sina Ahmadi


Abstract
Tokenization is a fundamental task in natural language processing. Despite recent advances in applying unsupervised statistical methods to this task, every language, with its own writing system and orthography, presents specific challenges that should be addressed individually. In this paper, as a preliminary study of its kind, we propose an approach to tokenizing the Sorani and Kurmanji dialects of Kurdish using a lexicon and a morphological analyzer. We demonstrate how the morphological complexity of the language, along with the lack of a unified orthography, can be efficiently addressed in tokenization. We also develop an annotated dataset on which our approach outperforms unsupervised methods.
Anthology ID:
2020.vardial-1.11
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer
Venue:
VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
114–127
URL:
https://aclanthology.org/2020.vardial-1.11
Cite (ACL):
Sina Ahmadi. 2020. A Tokenization System for the Kurdish Language. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 114–127, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
A Tokenization System for the Kurdish Language (Ahmadi, VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.11.pdf
Code:
sinaahmadi/kurdishtokenization