Dmytro Pashchenko
2025
Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages
Dmytro Pashchenko
|
Lisa Yankovskaya
|
Mark Fishel
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
We develop paragraph-level machine translation for four low-resource Finno-Ugric languages: Proper Karelian, Livvi, Ludian, and Veps. The approach is based on sentence-level pre-trained translation models, which are fine-tuned with paragraph-parallel data. This allows the resulting model to develop a native ability to handle discource-level phenomena correctly, in particular translating from grammatically gender-neutral input in Finno-Ugric languages. We collect monolingual and parallel paragraph-level corpora for these languages. Our experiments show that paragraph-level translation models can translate sentences no worse than sentence-level systems, while handling discourse-level phenomena better. For evaluation, we manually translate part of FLORES-200 into these four languages. All our results, data, and models are released openly.