Building a Clean Bartangi Language Corpus and Training Word Embeddings for Low-Resource Language Modeling

Warda Tariq, Victor Popov, Vasilii Gromov


Abstract
In this paper, we present a comprehensive end-to-end pipeline for creating a high-quality Bartangi language corpus and using it to train word embeddings. Bartangi, a critically low-resource Pamiri language spoken in Tajikistan, presents challenges such as morphological complexity, orthographic variation, and data scarcity. To overcome these obstacles, we gathered a raw corpus of roughly 6,550 phrases, applied the Uniparser-Morph-Bartangi morphological analyzer for linguistically accurate lemmatization, and implemented a thorough cleaning procedure to eliminate noise and ensure proper tokenization. The resulting lemmatized corpus greatly reduces word sparsity and raises the quality of linguistic analysis. The processed corpus was then used to train two Word2Vec models, Skip-gram and CBOW, with a vector size of 100, a context window of 5, and a minimum frequency threshold of 1. The resulting word embeddings were visualized using dimensionality reduction techniques such as PCA and t-SNE, and assessed with intrinsic methods such as nearest-neighbor similarity tests. Our experiments show that meaningful semantic representations can be obtained even from tiny datasets by combining informed morphological analysis with clean preprocessing. As one of the earliest computational datasets for Bartangi, this resource serves as a vital basis for upcoming NLP tasks, including language modeling, semantic analysis, and low-resource machine translation. To promote further research on Pamiri and other under-represented languages, we make the corpus, lemmatizer pipeline, and trained embeddings publicly available.
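The nearest-neighbor similarity tests mentioned above can be illustrated with a minimal sketch: given a table of word vectors, rank all other words by cosine similarity to a query word. The tokens and vectors below are hypothetical toy values for illustration only, not entries from the Bartangi corpus or the released embeddings.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k=3):
    # embeddings: dict mapping word -> vector; the query word is excluded
    # from its own neighbor list.
    scored = [(word, cosine(embeddings[query], vec))
              for word, vec in embeddings.items() if word != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 4-dimensional embeddings (hypothetical, for illustration; real
# vectors in the paper are 100-dimensional).
emb = {
    "w1": [0.90, 0.10, 0.05, 0.20],
    "w2": [0.85, 0.15, 0.05, 0.25],
    "w3": [0.10, 0.90, 0.30, 0.00],
}
print(nearest_neighbors("w1", emb, k=2))
```

With the toy vectors above, "w2" ranks ahead of "w3" as a neighbor of "w1" because its vector points in nearly the same direction; this is the same intrinsic check one would run on the trained Skip-gram or CBOW vectors.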
Anthology ID:
2025.ranlp-1.145
Volume:
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
1256–1262
URL:
https://aclanthology.org/2025.ranlp-1.145/
Cite (ACL):
Warda Tariq, Victor Popov, and Vasilii Gromov. 2025. Building a Clean Bartangi Language Corpus and Training Word Embeddings for Low-Resource Language Modeling. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 1256–1262, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Building a Clean Bartangi Language Corpus and Training Word Embeddings for Low-Resource Language Modeling (Tariq et al., RANLP 2025)
PDF:
https://aclanthology.org/2025.ranlp-1.145.pdf