Instruction Finetuning to Attribute Language Stage, Dialect, and Provenance Region to Historical Church Slavic Texts

Piroska Lendvai, Uwe Reichel, Anna Jouravel, Achim Rabus, Elena Renje


Abstract
Our study addresses domain-specific text provenance classification for the historical Church Slavic language. The downstream task is to attribute the language stage and its dialectal and regional varieties to texts compiled from newly curated sources, including digitally unpublished manuscripts, in addition to established Church Slavic resources from the Universal Dependencies Treebank. We aim to harmonize previously used tag sets pertaining to textual provenance, and construct a new, hierarchical, multi-layer provenance labeling scheme. For the classification task, we finetune Vikhr (Nikolich et al., 2024), a generative LLM with knowledge of modern Russian, with the instruction to generate labels to classify the provenance of sentence-level text units. Besides gold standard manuscript transcriptions, we test the finetuned model on character-corrupted data that emulate the quality of noisy, handwritten text recognition material. The experiments show that the Vikhr base model has low provenance attribution knowledge of Church Slavic, whereas our finetuned model achieves F-scores above .9 on Language stage labeling and Dialect labeling, and an F-score above .8 on generating the label that jointly classifies all three provenance layers. The task of classifying the fine-grained geographical region from which a manuscript originates proves harder (but still performs above .8), and is negatively impacted by character-level noise injection.
Anthology ID:
2025.ranlp-1.76
Volume:
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
654–662
URL:
https://aclanthology.org/2025.ranlp-1.76/
Cite (ACL):
Piroska Lendvai, Uwe Reichel, Anna Jouravel, Achim Rabus, and Elena Renje. 2025. Instruction Finetuning to Attribute Language Stage, Dialect, and Provenance Region to Historical Church Slavic Texts. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 654–662, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Instruction Finetuning to Attribute Language Stage, Dialect, and Provenance Region to Historical Church Slavic Texts (Lendvai et al., RANLP 2025)
PDF:
https://aclanthology.org/2025.ranlp-1.76.pdf