Automated CEFR-Level Assignment for Ukrainian Texts

Olha Kanishcheva, Mikhail Kopotev


Abstract
This study evaluates CEFR-based text complexity for Ukrainian using a new dataset compiled from textbooks designed for language learners. We compare traditional machine learning, transformer-based models, and LLM-based evaluation across the A1–B2 proficiency levels. Results show that explicit linguistic features remain highly effective: a Random Forest classifier achieves the highest macro-F1 (0.576), slightly outperforming fine-tuned XLM-RoBERTa (0.574). GPT-5.5 also performs strongly (macro-F1 0.564), a significant advance over GPT-4.1, but the supervised models score slightly higher on proficiency-level assessment in this experiment. These findings suggest that structured linguistic analysis is a robust alternative to purely neural approaches for Ukrainian CEFR classification.
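
The scores above are macro-F1 values, meaning the unweighted mean of per-level F1 scores, so each of the four CEFR levels counts equally regardless of how many texts it contains. The following is a minimal sketch of that metric in plain Python. The function name `macro_f1` and the toy label lists are assumptions made for illustration; they are not the authors' code or data.

```python
from collections import Counter

# The four proficiency levels covered by the study.
LEVELS = ["A1", "A2", "B1", "B2"]

def macro_f1(gold, pred, labels=LEVELS):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted level p, but it was wrong
            fn[g] += 1  # gold level g was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); a class with no gold items
        # and no predictions is scored as 0.0 here.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Toy labels invented for illustration only (not from the paper's dataset).
gold = ["A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2"]
pred = ["A1", "A2", "A2", "A2", "B1", "B2", "B2", "B2"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because every level carries the same weight, a model that does well on frequent levels but poorly on a rare one is penalised more by macro-F1 than by plain accuracy.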
Anthology ID:
2026.unlp-1.18
Volume:
Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
Month:
May
Year:
2026
Address:
Lviv, Ukraine
Editor:
Mariana Romanyshyn
Venue:
UNLP
Publisher:
Association for Computational Linguistics
Pages:
209–222
URL:
https://aclanthology.org/2026.unlp-1.18/
Cite (ACL):
Olha Kanishcheva and Mikhail Kopotev. 2026. Automated CEFR-Level Assignment for Ukrainian Texts. In Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026), pages 209–222, Lviv, Ukraine. Association for Computational Linguistics.
Cite (Informal):
Automated CEFR-Level Assignment for Ukrainian Texts (Kanishcheva & Kopotev, UNLP 2026)
PDF:
https://aclanthology.org/2026.unlp-1.18.pdf