Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages

Dmytro Pashchenko, Lisa Yankovskaya, Mark Fishel


Abstract
We develop paragraph-level machine translation for four low-resource Finno-Ugric languages: Proper Karelian, Livvi, Ludian, and Veps. The approach is based on sentence-level pre-trained translation models, which are fine-tuned on paragraph-parallel data. This allows the resulting model to develop a native ability to handle discourse-level phenomena correctly, in particular when translating from grammatically gender-neutral input in Finno-Ugric languages. We collect monolingual and parallel paragraph-level corpora for these languages. Our experiments show that paragraph-level translation models can translate sentences no worse than sentence-level systems, while handling discourse-level phenomena better. For evaluation, we manually translate part of FLORES-200 into these four languages. All our results, data, and models are released openly.
Anthology ID:
2025.nodalida-1.50
Volume:
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Richard Johansson, Sara Stymne
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
458–469
URL:
https://aclanthology.org/2025.nodalida-1.50/
Cite (ACL):
Dmytro Pashchenko, Lisa Yankovskaya, and Mark Fishel. 2025. Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 458–469, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages (Pashchenko et al., NoDaLiDa 2025)
PDF:
https://aclanthology.org/2025.nodalida-1.50.pdf