Building a Corpus for the Zaza–Gorani Language Family

Sina Ahmadi


Abstract
Thanks to the growth of local communities and news websites, along with the increasing accessibility of the Web, some endangered and less-resourced languages have a chance to revive in the information era. The Web can therefore serve as a large resource from which language corpora can be extracted, enabling researchers to carry out studies in linguistics and language technology. The Zaza–Gorani language family is a subgroup of the Northwestern Iranian languages for which no significant corpus is available. Motivated to create one, in this paper we present our endeavour to collect a corpus in the Zazaki and Gorani languages containing over 1.6M and 194k word tokens, respectively. The corpus is publicly available.
Anthology ID:
2020.vardial-1.7
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer
Venue:
VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
70–78
URL:
https://aclanthology.org/2020.vardial-1.7
Cite (ACL):
Sina Ahmadi. 2020. Building a Corpus for the Zaza–Gorani Language Family. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 70–78, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
Building a Corpus for the Zaza–Gorani Language Family (Ahmadi, VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.7.pdf
Code:
sinaahmadi/zazagoranicorpus
Data:
ZazaGoraniCorpus