Roman Kyslyi


2026

Dialectal speech remains largely underexplored in Automatic Speech Recognition (ASR) research, particularly for Slavic languages. While Ukrainian ASR systems have rapidly improved in recent years with the adoption of Whisper, XLS-R, and Wav2Vec-based models, performance on dialectal variants is rarely reported and often significantly degraded. In this work, we present the first dedicated effort to build ASR resources for the Hutsul dialect of Ukrainian. We develop a data preparation and segmentation pipeline, evaluate multiple forced alignment strategies, and benchmark state-of-the-art ASR models under zero-shot and fine-tuned conditions. We evaluate results using WER and CER, demonstrating that large multilingual ASR models struggle with dialectal speech, while lightweight fine-tuning produces substantial improvements. All scripts, alignment tools, and training recipes are made publicly available to support future research on Ukrainian dialect speech.
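The WER and CER metrics mentioned above can be illustrated with a minimal sketch. It assumes the standard definitions (Levenshtein edit distance divided by reference length, over words for WER and over characters for CER); the function names are hypothetical, and the text normalization used in the paper is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because both rates are normalized by the reference length, they can exceed 1.0 when the hypothesis contains many insertions.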
We present UkrSL, an annotated dataset for Ukrainian Sign Language (USL), one of the most underresourced sign languages in Europe. The dataset comprises 1,456 annotated clips (1,463 with cropped video segments) totalling approximately two hours of signing, sourced from six broadcast videos from Suspilne, Ukraine’s public broadcaster. Each clip is annotated with a spoken Ukrainian transcription aligned to the corresponding signing segment. We describe the data collection pipeline, the annotation methodology, and provide a detailed analysis of the dataset’s statistics and limitations. The dataset is being actively expanded, and we release this snapshot to support the research community and invite collaboration.
We present a corpus of aligned Ukrainian–English idiomatic expressions and a comprehensive evaluation of six large language models on the task of translating sentences containing idioms. The corpus is constructed by linking entries across multiple phraseological dictionaries and the MIDAS corpus using vector similarity search, enriched with figurative meanings, contextual sentences from the UberText fiction corpus, and semantic transparency scores. We evaluate Gemini 2.5 Flash, Claude Haiku 4.5, Gemma 3 12B, Qwen3-30B-A3B, LapaLM, and Tiny Aya Global in both Ukrainian-to-English and English-to-Ukrainian directions under default and context-augmented prompting. Our evaluation of 65,723 translations reveals a pronounced direction asymmetry, with all models performing substantially worse when translating into Ukrainian. Providing figurative meaning and target idiom candidates improves quality for most models in Ukrainian-to-English but has limited effect in the reverse direction. We additionally show that semantic transparency of idioms is only weakly correlated with translation quality. We release the corpus and evaluation framework to support research on idiomatic translation for mid-resource languages.
Online discussions increasingly serve as a major venue for exchanging information and evaluating competing viewpoints. Yet most computational approaches to discourse quality focus on detecting harmful language or predicting engagement, providing limited insight into whether interactions actually improve collective understanding. We introduce a two-dimensional framework for modeling dialogic constructiveness, distinguishing between substantive contribution (SC) and relational conduct (RC). Using expert-annotated Ukrainian-language discussions, we show that collapsing rubric-level labels into these axes improves inter-annotator agreement, suggesting that constructiveness is better captured as a multidimensional judgment. We further compare nominal, regression, and ordinal prediction approaches and find that explicitly modeling constructiveness as an ordinal task yields substantially higher agreement with expert annotations under quadratic weighted kappa (QWK). These results indicate that dialogic constructiveness is better understood as an ordered interactional judgment rather than a binary label or continuous score.
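For readers unfamiliar with quadratic weighted kappa, the following is a minimal sketch of the standard definition, assuming labels are integers in the range 0 to num_classes - 1. The function name and the label encoding are illustrative assumptions, not the authors' evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), w = (i - j)^2 / (K - 1)^2."""
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    k = num_classes

    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal label counts for each rater.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0
    denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return 1.0 - numerator / denominator
```

The quadratic weight penalizes a disagreement by the squared distance between the two ordinal labels, so confusing adjacent levels costs far less than confusing the extremes.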
This paper presents the results of the UNLP 2026 Shared Task on Multi-Domain Document Understanding. This Shared Task aims to challenge and assess AI capabilities to find the right information in a stack of domain-specific documents and generalize across domains. Participants were required not only to select the correct answer, but also to localize it by predicting the corresponding document and page. A total of 54 teams registered for the competition, 15 teams submitted systems, and 513 runs were evaluated on a hidden test set via Kaggle in a code-only submission format under constrained computational resources. The Kaggle leaderboard is left open for further submissions. Summarizing the contributions of this work, we establish a Ukrainian multi-domain document understanding benchmark, which consists of: (1) a collected dataset; (2) a proposed evaluation metric; and (3) an analysis of top-performing systems evaluated under a unified framework.
We introduce UAReviews, a multi-task Ukrainian-language dataset for emotion and intent classification comprising 11,580 annotated texts. The dataset combines two sources: citizen reviews of government digital services provided by the Ministry of Digital Transformation of Ukraine and Ukrainian-language Telegram posts drawn from the COSMUS corpus. Each text is annotated with both an emotion label following the Ekman taxonomy (seven classes) and an intent label (five classes), making it the first publicly available Ukrainian resource for joint emotion and intent analysis. Annotation was performed by students at the Anonymous Institution, with a gold standard subset (20%) validated by three independent annotators achieving Krippendorff’s alpha = 0.93. We establish baselines using single-task and multi-task fine-tuned XLM-RoBERTa models and analyze emotion-to-intent correlation. Both the dataset and the baseline models are publicly available.
We present a significant expansion of ASR resources for the Hutsul dialect of Ukrainian, building on prior work that established the first aligned speech corpus from a single literary source. In this work, we scale the dataset from a single speaker to a multi-speaker corpus comprising 40 speakers and 60.63 hours of audio drawn from diverse sources: YouTube channels (with author permissions), field recordings from native speakers, linguist student recordings, and regional radio broadcasts. To obtain reference transcriptions for audio without existing text, we introduce a novel RAG-enhanced correction pipeline: audio is first transcribed using ElevenLabs, then corrected through a RAG pipeline backed by a dialect-aware language model. We evaluate a fine-tuned ASR model across five distinct speaker datasets, demonstrating that while the model achieves strong performance on in-domain speakers (CER 3.24%), cross-speaker generalization remains challenging, with CER ranging from 5.33% to 17.24% depending on speaker characteristics. All data, code, and models are released publicly to support further research on Ukrainian dialect speech technologies.
Adapting large language models to low-resource languages presents three interconnected challenges: inefficient tokenization, scarcity of high-quality annotated data, and limited resources for instruction tuning. We present a reproducible approach that addresses each challenge using data-centric methods that primarily rely on unlabeled text corpora, parallel translation data, and a multilingual base model. Our approach combines (1) vocabulary surgery for tokenizer adaptation without full retraining, (2) cross-lingual transfer of quality classifiers via translation, enabling filtering without target-language annotations, and (3) generation of instruction data through translation, task conversion, and targeted synthesis. We validate this recipe by adapting Gemma-3-12B to Ukrainian. Our pretrained model achieves top performance on Ukrainian benchmarks, and our instruction-tuned variant demonstrates strong performance on translation (33 BLEU on FLORES), summarization, and question-answering tasks, while requiring 1.5x fewer tokens than the original model for the same text. We release all models, datasets, classifiers, and code to enable replication for other languages.
This paper describes a Natural Language Processing (NLP) course taught at Kyiv School of Economics. The course consists of 16 lectures and 5 practical assignments, and focuses on modern large language models (LLMs) while preserving an introduction to classical NLP. Practical assignments are organized using Kaggle, where GPU support plays an important role in enabling students to work with complex models. A key feature of the course is the focus on Ukrainian in the practical assignments, contributing to the development of Ukrainian NLP expertise and community. The course is taught primarily in person but, due to the ongoing war in Ukraine, also includes a full online participation option and additional weekly Q&A sessions.

2025

This paper presents the results of the UNLP 2025 Shared Task on Detecting Social Media Manipulation. The task included two tracks: Technique Classification and Span Identification. The benchmark dataset contains 9,557 posts from Ukrainian Telegram channels manually annotated by media experts. A total of 51 teams registered, 22 teams submitted systems, and 595 runs were evaluated on a hidden test set via Kaggle. Performance was measured with macro F1 for classification and token‐level F1 for identification. The shared task provides the first publicly available benchmark for manipulation detection in Ukrainian social media and highlights promising directions for low‐resource propaganda research. The Kaggle leaderboard is left open for further submissions.
In this paper, we introduce the first effort to adapt large language models (LLMs) to a Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. We also addressed data shortage by proposing an advanced Retrieval-Augmented Generation (RAG) pipeline to generate synthetic parallel translation pairs, expanding the corpus with 52142 examples. We fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, also comparing with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small (7B) fine-tuned models outperform zero-shot baselines such as GPT-4o across both automatic and LLM-evaluated metrics. All data, models, and code are publicly released at: https://github.com/woters/vuyko-hutsul.

2024

This paper presents the results of the UNLP 2024 shared task, the first Shared Task on Fine-Tuning Large Language Models for the Ukrainian language. The goal of the task was to facilitate the creation of models that have knowledge of the Ukrainian language, history, and culture, as well as common knowledge, and are capable of generating fluent and accurate responses in Ukrainian. The participants were required to use models with open weights and reasonable size to ensure the reproducibility of the solutions. The participating systems were evaluated using multiple-choice exam questions and manually crafted open questions. Three teams submitted their solutions before the deadline, and two teams submitted papers that were accepted to appear in the UNLP workshop proceedings and are referred to in this report. The Codabench leaderboard is left open for further submissions.