Teodora Mihajlov

2026

Serbian SuperGLUE: Towards an Evaluation Benchmark for South Slavic Language Models
Mitar Perovic | Teodora Mihajlov
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)

We introduce Serbian SuperGLUE, a comprehensive benchmark for evaluating natural language understanding in Serbian, adapted from the English SuperGLUE benchmark. The benchmark comprises seven tasks spanning question answering, natural language inference, and coreference resolution, created through a combination of LLM-based translation with automatic post-editing and native data generation. We evaluate seven encoder-based language models, including both Serbian-specific (BERTić, Jerteh) and multilingual models (mmBERT, XLM-RoBERTa variants). Our results reveal that multilingual models remain competitive with language-specific alternatives, with mmBERT achieving the best performance on RTE (75.7%) and XLM-R-BERTić leading on BoolQ (82.0%). We observe significant training variance on smaller datasets, with standard deviations exceeding 10% in some configurations, highlighting the importance of multi-seed evaluation for low-resource benchmarking. We release the benchmark, evaluation code, and model checkpoints to facilitate reproducible research on South Slavic language understanding.
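The multi-seed evaluation the abstract recommends can be illustrated with a minimal sketch: fine-tune the same configuration under several random seeds, then report the mean and the sample standard deviation of the scores. The function name `summarize_runs` and the accuracy values below are hypothetical placeholders, not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two values; a single run has no measurable spread.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Hypothetical accuracies from three seeds of one model/task configuration.
hypothetical_accuracies = [0.50, 0.70, 0.60]
mean_acc, std_acc = summarize_runs(hypothetical_accuracies)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

A large spread relative to the gap between two models suggests that a single-seed comparison between them is not reliable.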

2023


Automatic Student Answer Assessment using LSA
Teodora Mihajlov
Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC)

Implementing technology in a modern-day classroom is an ongoing challenge. In this paper, we created a system for the automatic assessment of student answers using Latent Semantic Analysis (LSA), a method with the underlying assumption that words with similar meanings appear in the same contexts. The system will be used within digital lexical flash-cards for L2 vocabulary acquisition in a CLIL classroom. Results presented in this paper indicate that while LSA does well in creating semantic spaces for longer texts, it somewhat struggles with detecting topics in short texts. After obtaining LSA semantic spaces, answer accuracy was assessed by calculating the cosine similarity between a student’s answer and the gold standard. The answers were classified by accuracy using KNN, for both binary and multinomial classification. The results of KNN classification are as follows: precision P = 0.73, recall R = 1.00, F1 = 0.85 for binary classification, and P = 0.50, R = 0.47, F1 = 0.46 for the multinomial classifier. The results should be interpreted with caution because the training and test datasets are small.
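The cosine-similarity step described in the abstract can be sketched as follows. This is a simplified illustration, not the paper's implementation: it compares raw term-count vectors and omits the LSA projection into a reduced semantic space. The helper names, the whitespace tokenization, and the example sentences are all assumptions made for the sketch.

```python
import math
from collections import Counter

def term_counts(text):
    """Lowercase, split on whitespace, and count terms (assumed preprocessing)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical gold-standard answer and two hypothetical student answers.
gold = term_counts("a cell is the basic unit of life")
answer_close = term_counts("the cell is the basic unit of life")
answer_far = term_counts("mountains are very tall")

score_close = cosine_similarity(gold, answer_close)
score_far = cosine_similarity(gold, answer_far)
print(score_close > score_far)
```

In an LSA pipeline the same similarity would be computed on vectors in the reduced space, and the resulting scores could then serve as features for a classifier such as KNN.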

Co-authors

Mitar Perovic 1
