Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

Sina Ahmadi, Zahra Azin, Sara Belelli, Antonios Anastasopoulos


Abstract
One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case for the Southern varieties of the Kurdish and Laki languages, for which very limited resources are available and progress in tools has been insubstantial. To tackle this, we provide a few approaches that rely on the content of local news websites, a local radio station that broadcasts content in Southern Kurdish, and fieldwork for Laki. In this paper, we describe some of the challenges of such under-represented languages, particularly in writing and standardization, as well as in retrieving sources of data and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and Zaza-Gorani languages.
Anthology ID:
2023.fieldmatters-1.7
Volume:
Proceedings of the Second Workshop on NLP Applications to Field Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Elena Klyachko, Ekaterina Vylomova, Tatiana Shavrina, Eric Le Ferrand, Valentin Malykh, Francis Tyers, Timofey Arkhangelskiy, Vladislav Mikhailov
Venue:
FieldMatters
Publisher:
Association for Computational Linguistics
Pages:
52–63
URL:
https://aclanthology.org/2023.fieldmatters-1.7
DOI:
10.18653/v1/2023.fieldmatters-1.7
Cite (ACL):
Sina Ahmadi, Zahra Azin, Sara Belelli, and Antonios Anastasopoulos. 2023. Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki. In Proceedings of the Second Workshop on NLP Applications to Field Linguistics, pages 52–63, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki (Ahmadi et al., FieldMatters 2023)
PDF:
https://aclanthology.org/2023.fieldmatters-1.7.pdf
Video:
https://aclanthology.org/2023.fieldmatters-1.7.mp4