Data-Efficient Adaptation of Multilingual LLMs to Ukrainian

Yurii Paniv; Bohdan Didenko; Mykola Haltiuk; Vladyslav Humennyy; Andrian Kravchenko; Roman Kyslyi; Viktoriia Makovska; Artem Orlovskyi; Bohdan Ruban; Maksym-Yurii Rudko; Anastasiia Senyk; Nazarii Drushchak; Dmytro Chaplynskyi; Mariana Romanyshyn

Abstract

Adapting large language models to low-resource languages presents three interconnected challenges: inefficient tokenization, scarcity of high-quality annotated data, and limited resources for instruction tuning. We present a reproducible approach that addresses each challenge using data-centric methods that primarily rely on unlabeled text corpora, parallel translation data, and a multilingual base model. Our approach combines (1) vocabulary surgery for tokenizer adaptation without full retraining, (2) cross-lingual transfer of quality classifiers via translation, enabling filtering without target-language annotations, and (3) generation of instruction data through translation, task conversion, and targeted synthesis. We validate this recipe by adapting Gemma-3-12B to Ukrainian. Our pretrained model achieves top performance on Ukrainian benchmarks, and our instruction-tuned variant demonstrates strong performance on translation (33 BLEU on FLORES), summarization, and question-answering tasks while requiring 1.5x fewer tokens than the original model for the same text. We release all models, datasets, classifiers, and code to enable replication for other languages.

Anthology ID: 2026.unlp-1.14
Volume: Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
Month: May
Year: 2026
Address: Lviv, Ukraine
Editor: Mariana Romanyshyn
Venue: UNLP
Publisher: Association for Computational Linguistics
Pages: 155–168
URL: https://aclanthology.org/2026.unlp-1.14/
Cite (ACL): Yurii Paniv, Bohdan Didenko, Mykola Haltiuk, Vladyslav Humennyy, Andrian Kravchenko, Roman Kyslyi, Viktoriia Makovska, Artem Orlovskyi, Bohdan Ruban, Maksym-Yurii Rudko, Anastasiia Senyk, Nazarii Drushchak, Dmytro Chaplynskyi, and Mariana Romanyshyn. 2026. Data-Efficient Adaptation of Multilingual LLMs to Ukrainian. In Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026), pages 155–168, Lviv, Ukraine. Association for Computational Linguistics.
Cite (Informal): Data-Efficient Adaptation of Multilingual LLMs to Ukrainian (Paniv et al., UNLP 2026)
PDF: https://aclanthology.org/2026.unlp-1.14.pdf
