Kristýna Onderková
2026
LLMs as Span Annotators: A Comparative Study of LLMs and Humans
Zdeněk Kasner | Vilém Zouhar | Patrícia Schmidtová | Ivan Kartáč | Kristýna Onderková | Ondrej Platek | Dimitra Gkatzia | Saad Mahamood | Ondrej Dusek | Simone Balloccu
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a similar rate as skilled crowdworkers. LLMs also produce annotations at a fraction of the cost per output annotation. We release the dataset of over 40k model and human span annotations for further research.
2025
FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation
Kristýna Onderková | Ondrej Platek | Zdeněk Kasner | Ondrej Dusek
Proceedings of the 18th International Natural Language Generation Conference
Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, the evaluation of existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as domain imbalance. We introduce FreshTab, a method for generating table-to-text benchmarks on the fly from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian, and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method appear clearly worse according to automatic metrics, but this does not carry over to LLM and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.
ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation
Kristýna Onderková | Mateusz Lango | Patrícia Schmidtová | Ondrej Dusek
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
We describe a reproduction of a human annotation experiment that was performed to evaluate the effectiveness of text style transfer systems (Reif et al., 2021). Despite our efforts to closely imitate the conditions of the original study, the results we obtained differ significantly from the original ones. We performed a statistical analysis of the results, discussed the sources of these discrepancies in the study design, and quantified reproducibility. The reproduction followed the common approach adopted by the ReproHum project.
2024
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation
Simone Balloccu | Zdeněk Kasner | Ondřej Plátek | Patrícia Schmidtová | Kristýna Onderková | Mateusz Lango | Ondřej Dušek | Lucie Flek | Ehud Reiter | Dimitra Gkatzia | Simon Mille
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation