Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russian

Aleksandr Nikolich; Konstantin Korolev; Sergei Bratchikov; Igor Kiselev; Artem Shelmanov

Abstract

There has been a surge in the development of various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and reduced computational performance due to the disproportionate representation of tokens in the model’s vocabulary. In this work, we address these issues by developing a pipeline for adapting English-oriented pre-trained models to other languages and constructing efficient bilingual LLMs. Using this pipeline, we construct Vikhr, a state-of-the-art bilingual open-source instruction-following LLM designed specifically for the Russian language. The name “Vikhr” is a reference to the Mistral LLM series and means a “strong gust of wind.” Unlike previous Russian-language models, which typically rely on LoRA adapters on top of English-oriented models and sacrifice performance for lower training costs, Vikhr features an adapted tokenizer vocabulary and undergoes continued pre-training and instruction tuning of all weights. This not only enhances the model’s performance but also significantly improves its computational and contextual efficiency. The remarkable performance of Vikhr across various Russian-language benchmarks can also be attributed to our efforts in expanding instruction datasets and corpora for continued pre-training. Vikhr not only sets a new state of the art among open-source LLMs for Russian but also outperforms some proprietary closed-source models on certain benchmarks. The model weights, instruction sets, and code are publicly available.

Anthology ID: 2024.mrl-1.15
Volume: Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Jonne Sälevä, Abraham Owodunni
Venue: MRL
Publisher: Association for Computational Linguistics
Pages: 189–199
URL: https://aclanthology.org/2024.mrl-1.15
Cite (ACL): Aleksandr Nikolich, Konstantin Korolev, Sergei Bratchikov, Igor Kiselev, and Artem Shelmanov. 2024. Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russian. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pages 189–199, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russian (Nikolich et al., MRL 2024)
PDF: https://aclanthology.org/2024.mrl-1.15.pdf
