Oksana Volchek
2026
Tracking the evolution of LLM capabilities for Belarusian with OpenAI Evals
Vladislav Poritski | Oksana Volchek | Maksim Aparovich | Volha Harytskaya | Pavel Smrz
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
We examine how the capabilities of large language models (LLMs) have evolved on eight Belarusian language tasks contributed in 2023 to OpenAI’s Evals framework. We evaluate state-of-the-art models both on the original development sets and on newly created test sets. Results demonstrate significant but non-uniform progress over this period: some tasks are almost saturated, while others show little improvement beyond trivial baselines. Error analysis shows that certain challenges have not yet been addressed, e.g., misidentification of non-words as legitimate vocabulary, or conversion from modern to classical orthography. We release the datasets and the generated completions (https://doi.org/10.5281/zenodo.18163825).
2025
BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian
Maksim Aparovich | Volha Harytskaya | Vladislav Poritski | Oksana Volchek | Pavel Smrz
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In the era of multilingual large language models (LLMs), it remains challenging to evaluate models’ understanding of lower-resourced languages, which motivates further development of expert-crafted natural language understanding benchmarks. We introduce BelarusianGLUE — a natural language understanding benchmark for Belarusian, an East Slavic language, with ≈15K instances across five tasks: sentiment analysis, linguistic acceptability, word in context, Winograd schema challenge, and textual entailment. A systematic evaluation of BERT models and LLMs against this novel benchmark reveals that both types of models approach human-level performance on easier tasks, such as sentiment analysis, but that a significant gap between machine and human performance remains on a harder task — the Winograd schema challenge. We find the optimal choice of model type to be task-specific: e.g., BERT models underperform on the textual entailment task but are competitive for linguistic acceptability. We release the datasets (https://hf.co/datasets/maaxap/BelarusianGLUE) and evaluation code (https://github.com/maaxap/BelarusianGLUE).