Retrieval of Parallelizable Texts Across Church Slavic Variants

Piroska Lendvai, Uwe Reichel, Anna Jouravel, Achim Rabus, Elena Renje


Abstract
The goal of our study is to identify parallelizable texts for Church Slavic across chronological and regional variants. In addition to a benchmark text, we use a recently digitized, large text collection and compile new resources for retrieving similar texts: a ground truth dataset containing a small number of manually aligned sentences in Old Church Slavic and Old East Slavic, and a large unaligned dataset in which a subset of texts is of ground truth (GT) quality while the majority of the collection contains noise from handwritten text recognition (HTR). We discuss preprocessing challenges in the data and the impact of sentence segmentation on retrieval performance. We evaluate the mapping of sentence snippets across these two diachronic variants of Church Slavic in terms of mean reciprocal rank, using embedding representations from large language models (LLMs) as well as classical string-similarity-based approaches combined with k-nearest neighbor (kNN) search. Experimental results indicate that in the current setup (short text snippets, off-the-shelf multilingual embeddings), classical string-similarity-based retrieval can still outperform embedding-based retrieval.
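
The evaluation described in the abstract (rank candidate snippets by string similarity, then score with mean reciprocal rank) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names are hypothetical, `difflib.SequenceMatcher` stands in for whichever string similarity measures the paper actually uses, and the toy Latin-script strings are placeholders, not Church Slavic data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in measure, not the paper's."""
    return SequenceMatcher(None, a, b).ratio()


def rank_candidates(query: str, candidates: list[str]) -> list[int]:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(queries: list[str], candidates: list[str], gold: list[int]) -> float:
    """Average of 1/rank of the gold candidate over all queries (ranks start at 1)."""
    total = 0.0
    for query, gold_idx in zip(queries, gold):
        ranking = rank_candidates(query, candidates)
        rank = ranking.index(gold_idx) + 1
        total += 1.0 / rank
    return total / len(queries)


# Hypothetical toy data: each query is a slightly altered copy of one candidate.
candidates = ["abcdef", "uvwxyz", "mnopqr"]
queries = ["abcdxf", "uvwxyy", "mnopqq"]
gold = [0, 1, 2]  # index of the correct candidate for each query

mrr = mean_reciprocal_rank(queries, candidates, gold)
```

Taking only the first k entries of each ranking would give the kNN view of the same retrieval step; an embedding-based variant would swap `similarity` for a vector similarity such as cosine.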
Anthology ID:
2025.vardial-1.8
Volume:
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
105–114
URL:
https://aclanthology.org/2025.vardial-1.8/
Cite (ACL):
Piroska Lendvai, Uwe Reichel, Anna Jouravel, Achim Rabus, and Elena Renje. 2025. Retrieval of Parallelizable Texts Across Church Slavic Variants. In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 105–114, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Retrieval of Parallelizable Texts Across Church Slavic Variants (Lendvai et al., VarDial 2025)
PDF:
https://aclanthology.org/2025.vardial-1.8.pdf