The Parallel Corpus of Russian and Ruska Romani Languages

Kirill Koncha, Abina Kukanova, Kazakova Tatiana, Gloria Rozovskaya


Abstract
This paper presents a parallel corpus of the Ruska Romani dialect and the Russian language. Ruska Romani is the dialect of the Romani language attributed to the Ruska Roma, the largest subgroup of Romani people in Russia. The corpus contains translations of Russian literature into the Ruska Romani dialect. Corpus creation involved manually aligning a small part of the translations with the original works, fine-tuning a language model on the aligned pairs, and using the fine-tuned model to align the remaining data. Ruska Romani sentences were annotated using a morphological analyzer, with rules crafted for proper nouns and borrowings. The corpus is available in JSON and Russian National Corpus XML formats. It includes 88,742 Russian tokens and 84,635 Ruska Romani tokens, 74,291 of which were grammatically annotated. The corpus could be used for linguistic research, including comparative and diachronic studies, bilingual dictionary creation, stylometry research, and NLP/MT tool development for Ruska Romani.
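The token counts reported in the abstract imply an annotation coverage figure that the abstract does not state directly. The following is a minimal sketch of that arithmetic; the variable names are illustrative and are not taken from the paper or its data files.

```python
# Token counts as reported in the abstract.
russian_tokens = 88_742
romani_tokens = 84_635
romani_annotated = 74_291

# Share of Ruska Romani tokens that received grammatical annotation.
coverage = romani_annotated / romani_tokens

# Ruska Romani tokens left without grammatical annotation.
unannotated = romani_tokens - romani_annotated

# Relative size of the Russian side compared with the Ruska Romani side.
length_ratio = russian_tokens / romani_tokens

print(f"Annotation coverage: {coverage:.1%}")
print(f"Unannotated Ruska Romani tokens: {unannotated}")
print(f"Russian / Ruska Romani token ratio: {length_ratio:.3f}")
```

On these figures, roughly 88% of the Ruska Romani tokens carry grammatical annotation, and the two sides of the corpus are of similar size.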
Anthology ID:
2024.fieldmatters-1.1
Volume:
Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (Field Matters 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Saliha Muradoglu, Eric Le Ferrand, Elena Klyachko, Ekaterina Vylomova, Tatiana Shavrina, Francis Tyers
Venues:
FieldMatters | WS
Publisher:
Association for Computational Linguistics
Pages:
1–5
URL:
https://aclanthology.org/2024.fieldmatters-1.1
DOI:
10.18653/v1/2024.fieldmatters-1.1
Cite (ACL):
Kirill Koncha, Abina Kukanova, Kazakova Tatiana, and Gloria Rozovskaya. 2024. The Parallel Corpus of Russian and Ruska Romani Languages. In Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (Field Matters 2024), pages 1–5, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
The Parallel Corpus of Russian and Ruska Romani Languages (Koncha et al., FieldMatters-WS 2024)
PDF:
https://aclanthology.org/2024.fieldmatters-1.1.pdf