Konstantin Vorontsov


2024

pdf bib
LomonosovMSU at SemEval-2024 Task 4: Comparing LLMs and embedder models to identifying propaganda techniques in the content of memes in English for subtasks No1, No2a, and No2b
Gleb Skiba | Mikhail Pukemo | Dmitry Melikhov | Konstantin Vorontsov
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper presents the solution of the LomonosovMSU team for the SemEval-2024 Task 4 “Multilingual Detection of Persuasion Techniques in Memes” competition for the English language task. During the task solving process, generative and BERT-like (training classifiers on top of embedder models) approaches were tested for subtask No1, as well as an BERT-like approach on top of multimodal embedder models for subtasks No2a/No2b. The models were trained using datasets provided by the competition organizers, enriched with filtered datasets from previous SemEval competitions. The following results were achieved: 18th place for subtask No1, 9th place for subtask No2a, and 11th place for subtask No2b.

2020

pdf bib
TopicNet: Making Additive Regularisation for Topic Modelling Accessible
Victor Bulatov | Vasiliy Alekseev | Konstantin Vorontsov | Darya Polyudova | Eugenia Veselova | Alexey Goncharov | Evgeny Egorov
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces TopicNet, a new Python module for topic modeling. This package, distributed under the MIT license, focuses on bringing additive regularization topic modelling (ARTM) to non-specialists using a general-purpose high-level language. The module features include powerful model visualization techniques, various training strategies, semi-automated model selection, support for user-defined goal metrics, and a modular approach to topic model training. Source code and documentation are available at https://github.com/machine-intelligence-laboratory/TopicNet

pdf bib
Topic Balancing with Additive Regularization of Topic Models
Eugeniia Veselova | Konstantin Vorontsov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

This article proposes a new approach for building topic models on unbalanced collections in topic modelling, based on the existing methods and our experiments with such methods. Real-world data collections contain topics in various proportions, and often documents of the relatively small theme become distributed all over the larger topics instead of being grouped into one topic. To address this issue, we design a new regularizer for Theta and Phi matrices in probabilistic Latent Semantic Analysis (pLSA) model. We make sure this regularizer increases the quality of topic models, trained on unbalanced collections. Besides, we conceptually support this regularizer by our experiments.

2019

pdf bib
Lexical Quantile-Based Text Complexity Measure
Maksim Eremeev | Konstantin Vorontsov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper introduces a new approach to estimating the text document complexity. Common readability indices are based on average length of sentences and words. In contrast to these methods, we propose to count the number of rare words occurring abnormally often in the document. We use the reference corpus of texts and the quantile approach in order to determine what words are rare, and what frequencies are abnormal. We construct a general text complexity model, which can be adjusted for the specific task, and introduce two special models. The experimental design is based on a set of thematically similar pairs of Wikipedia articles, labeled using crowdsourcing. The experiments demonstrate the competitiveness of the proposed approach.