Ondřej Herman
2026
Detecting Subtle Sense Shift with Polysemy-Aware Trends
Ondřej Herman | Pavel Rychlý
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Language changes faster than dictionaries can be revised, yet automatic tools still struggle to spot the subtle, short-term shifts in meaning that precede a formal update. We present a language-independent pipeline that detects word-sense shifts in large, time-stamped web corpora. The method couples a robust re-implementation of the Adaptive Skip-Gram model, which induces multiple sense vectors per lemma without any external inventory, with a second stage that tracks each sense through time under three alternative frequency normalizations. Linear regression and the robust Mann-Kendall/Theil-Sen estimator then test whether a sense’s frequency slope deviates significantly from zero, producing a ranked list of headwords whose semantics are drifting. We evaluate the system on the English (12 B tokens) and Czech (1 B tokens) Timestamped corpora for May 2023–May 2025. Expert annotation of the top-100 candidates for each model variant shows that 50.7% of Czech and 25.7% of English headwords exhibit genuine sense shifts, despite web-scale noise.
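The second-stage trend test described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it computes a Theil-Sen slope (median of pairwise slopes) and the Mann-Kendall S statistic on a hypothetical monthly sense-frequency series; the data below is invented, whereas the paper works on normalized sense counts from the Timestamped corpora.

```python
from itertools import combinations
from statistics import median

def theil_sen_slope(y):
    """Median of all pairwise slopes (y[j] - y[i]) / (j - i), i < j."""
    return median((y[j] - y[i]) / (j - i)
                  for i, j in combinations(range(len(y)), 2))

def mann_kendall_s(y):
    """Mann-Kendall S: concordant minus discordant pairs.
    S well above 0 suggests a monotonically increasing trend."""
    return sum((y[j] > y[i]) - (y[j] < y[i])
               for i, j in combinations(range(len(y)), 2))

# Relative frequency of one induced sense over 12 months (toy values).
freq = [0.010, 0.011, 0.012, 0.014, 0.013, 0.016,
        0.018, 0.017, 0.020, 0.022, 0.021, 0.024]
print(theil_sen_slope(freq), mann_kendall_s(freq))
```

In practice the S statistic would be converted to a p-value (or a significance test run on the regression slope) before ranking headwords.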
2024
ShadowSense: A Multi-annotated Dataset for Evaluating Word Sense Induction
Ondřej Herman | Miloš Jakubíček
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper we present ShadowSense, a novel bilingual (Czech, English) dataset developed for the evaluation of word sense induction (WSI). Unlike existing WSI datasets, ShadowSense is annotated by multiple annotators, whose inter-annotator agreement serves as a key reliability score for evaluating systems that automatically induce word senses. We clarify the motivation for this approach, describe the dataset in detail, and provide an evaluation of three neural WSI systems, showing substantial differences compared to traditional evaluation paradigms.
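The reliability score the dataset is built around can be illustrated with a standard multi-annotator agreement measure. The sketch below computes Fleiss' kappa for several annotators assigning sense labels to usage examples; the counts are invented for illustration and are not ShadowSense data, and the paper does not necessarily use this exact coefficient.

```python
def fleiss_kappa(table):
    """table[i][j] = number of annotators assigning item i to category j;
    every row must sum to the same annotator count n."""
    n = sum(table[0])                      # annotators per item
    big_n = len(table)                     # number of items
    # per-item observed agreement, averaged over items
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in table) / big_n
    # chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in table) / (big_n * n)
           for j in range(len(table[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three usage examples, three annotators, two induced senses (toy counts).
ratings = [[3, 0], [0, 3], [2, 1]]
print(round(fleiss_kappa(ratings), 3))  # moderate agreement
```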
2019
Benchmark Dataset for Propaganda Detection in Czech Newspaper Texts
Vít Baisa | Ondřej Herman | Ales Horak
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Propaganda of various pressure groups, ranging from big economies to ideological blocks, is often presented in the form of objective newspaper texts. However, this apparent objectivity is undermined by the promotion of imbalanced views and distorted attitudes through various manipulative stylistic techniques. Within the project Manipulative Propaganda Techniques in the Age of Internet, a new resource for automatic analysis of the stylistic mechanisms used to influence readers’ opinions is being developed. In its current version, the resource consists of 7,494 newspaper articles from four selected Czech digital news servers, annotated for the presence of specific manipulative techniques. In this paper, we present the current state of the annotations and describe the structure of the dataset in detail. We also offer an evaluation of bag-of-words classification algorithms for the annotated manipulative techniques.
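A bag-of-words baseline of the kind the paper evaluates can be sketched as a multinomial Naive Bayes classifier with Laplace smoothing. The toy documents and labels below are invented for illustration and are not taken from the annotated dataset; the paper's actual classifiers and feature sets may differ.

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: iterable of (tokens, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_lp = None, float("-inf")
    for label, n in label_counts.items():
        lp = math.log(n / total_docs)        # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:                     # add-one smoothed likelihoods
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

toy = [(["fear", "crisis", "blame"], "manipulative"),
       (["report", "facts", "data"], "neutral"),
       (["fear", "blame", "enemy"], "manipulative"),
       (["data", "report", "study"], "neutral")]
model = train_nb(toy)
print(predict_nb(model, ["fear", "blame"]))
```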
2016
DSL Shared Task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation–Maximization and Chunk-based Language Model
Ondřej Herman | Vít Suchomel | Vít Baisa | Pavel Rychlý
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
In this paper we investigate two approaches to the discrimination of similar languages: an expectation–maximization algorithm for estimating the conditional probability P(word|language), and byte-level language models similar to compression-based language modelling methods. The accuracy of these methods reached 86.6% and 88.3%, respectively, on set A of the DSL Shared Task 2016 competition.
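The first approach can be sketched as an EM mixture that estimates P(word|language) from unlabelled documents. This is a toy illustration under loose assumptions, not the authors' implementation: two disjoint pseudo-languages, a deterministic alternating initialization to break symmetry (hardcoded for two mixture components), and add-one smoothing.

```python
import math
from collections import Counter

def em_language_mixture(docs, k, iters=30):
    """docs: list of token lists. Returns per-document posteriors P(lang|doc)."""
    vocab = sorted({w for d in docs for w in d})
    # deterministic soft initialization (two components) to break symmetry
    resp = [[0.6, 0.4] if i % 2 == 0 else [0.4, 0.6]
            for i in range(len(docs))]
    for _ in range(iters):
        # M-step: mixture priors and add-one-smoothed P(word|language)
        priors = [sum(r[z] for r in resp) / len(docs) for z in range(k)]
        pw = []
        for z in range(k):
            c = Counter()
            for d, r in zip(docs, resp):
                for w in d:
                    c[w] += r[z]
            total = sum(c.values()) + len(vocab)
            pw.append({w: (c[w] + 1) / total for w in vocab})
        # E-step: posterior over languages for every document
        resp = []
        for d in docs:
            logs = [math.log(priors[z]) + sum(math.log(pw[z][w]) for w in d)
                    for z in range(k)]
            m = max(logs)
            exps = [math.exp(l - m) for l in logs]
            s = sum(exps)
            resp.append([e / s for e in exps])
    return resp

docs = [["ano", "dobre", "den"], ["yes", "good", "day"],
        ["ano", "den", "prosim"], ["yes", "day", "please"]]
posteriors = em_language_mixture(docs, k=2)
print([max(range(2), key=row.__getitem__) for row in posteriors])
```

With disjoint vocabularies the posteriors sharpen quickly, grouping the documents by pseudo-language; the real shared-task setting is far noisier, since similar languages share most of their vocabulary.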