Subdialectal Differences in Sorani Kurdish

Shervin Malmasi


Abstract
In this study, we apply classification methods to detect subdialectal differences in Sorani Kurdish texts produced in two regions, Iran and Iraq. Because Sorani is a low-resource language, no corpus of texts from different regions was readily available. We therefore identified data sources that could be leveraged for this task and used them to create a dataset of 200,000 sentences. Using surface features, we classified Sorani subdialects and show that sentences from news sources in Iraq and Iran can be distinguished with 96% accuracy. This is a first, preliminary study of a dialect that has not been widely studied in computational linguistics, and the results provide evidence for the possible existence of distinct subdialects.
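The abstract does not say which surface features or which classifier were used. As a rough illustration of the general idea only, the sketch below classifies sentences by region using character n-grams and a multinomial Naive Bayes model with add-one smoothing. Both of those choices are assumptions, not the paper's confirmed method. The training strings are invented placeholders rather than real Sorani text, and the names `char_ngrams`, `NgramNaiveBayes`, `region_a`, and `region_b` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max, with space padding."""
    padded = f" {text.strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (illustrative only)."""

    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.docs = Counter(labels)          # label -> number of sentences
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(char_ngrams(sentence))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, sentence):
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.docs:
            # Log prior plus add-one-smoothed log likelihood of each known n-gram.
            score = math.log(self.docs[label] / n_docs)
            for gram in char_ngrams(sentence):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log(
                        (self.counts[label][gram] + 1)
                        / (self.totals[label] + vocab_size)
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder strings standing in for sentences from two regions.
train_sentences = [
    "zana bana zana", "bana zana lana",   # hypothetical region_a
    "quxo fido quxo", "fido mipo quxo",   # hypothetical region_b
]
train_labels = ["region_a", "region_a", "region_b", "region_b"]

model = NgramNaiveBayes().fit(train_sentences, train_labels)
print(model.predict("zana lana"))
print(model.predict("quxo mipo"))
```

With a real corpus, the same structure would be trained on labeled sentences from each region and scored on held-out sentences; the toy data here only shows the mechanics.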
Anthology ID:
W16-4812
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
89–96
URL:
https://aclanthology.org/W16-4812
Cite (ACL):
Shervin Malmasi. 2016. Subdialectal Differences in Sorani Kurdish. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 89–96, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Subdialectal Differences in Sorani Kurdish (Malmasi, VarDial 2016)
PDF:
https://aclanthology.org/W16-4812.pdf