Vasily Konovalov
2026
Using BERT to Explore Lexical Semantic Change of Prepositions
Liudmila Radchankava | Vasily Konovalov
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)
This paper presents a semi-supervised approach to investigating lexical semantic change in English prepositions using contextualized word embeddings from BERT. Due to their hybrid lexico-grammatical nature and high degree of polysemy, prepositions have received limited attention in computational studies of semantic change. We address this gap by first applying BERT-based embeddings in combination with a k-nearest neighbors classifier to the task of preposition sense disambiguation, achieving competitive performance without relying on external lexical resources. The trained model is then applied to diachronic data from the Corpus of Historical American English to analyze semantic change over time. By measuring classifier confidence and correlating it with usage year, we detect systematic differences between simple and compound prepositions. Our results confirm linguistic hypotheses that simple prepositions remain largely semantically stable, while compound prepositions exhibit measurable semantic change. The study demonstrates that BERT embeddings provide an effective tool for exploring diachronic semantic phenomena in functionally complex word classes and can be extended to other languages and datasets.
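The paper's first stage (contextual embeddings plus a k-nearest-neighbors sense classifier whose confidence is later correlated with usage year) can be sketched as follows; the 3-d vectors and sense labels are toy stand-ins for real BERT embeddings and annotated preposition senses, not the authors' data:

```python
# Toy sketch of kNN sense disambiguation over contextual embeddings.
# Confidence = share of the k nearest neighbours agreeing with the
# majority sense; in the paper this confidence feeds the diachronic analysis.
from collections import Counter
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn_predict(query, train, k=3):
    """Return (sense, confidence) for a query embedding."""
    ranked = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = Counter(sense for _, sense in ranked)
    sense, n = votes.most_common(1)[0]
    return sense, n / k

# Mock sense-labelled embeddings for "over" (spatial vs. quantitative use).
train = [([1.0, 0.1, 0.0], "spatial"), ([0.9, 0.2, 0.1], "spatial"),
         ([0.0, 1.0, 0.2], "more-than"), ([0.1, 0.9, 0.1], "more-than")]
sense, conf = knn_predict([0.95, 0.15, 0.05], train, k=3)
print(sense, round(conf, 2))  # → spatial 0.67
```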
Multimodal Evaluation of Russian-language Architectures
Artem Chervyakov | Ulyana Isaeva | Anton Emelyanov | Artem Safin | Maria Tikhonova | Alexander Kharitonov | Yulia Lyakh | Petr Surovtsev | Denis Shevelev | Vildan Saburov | Vasily Konovalov | Elisei Rykov | Ivan Sviridov | Amina Miftakhova | Ilseyar Alimova | Alexander Panchenko | Alexander Kapitanov | Alena Fenogenova
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce MERA Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses the text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
DeepPavlov Strikes Back: A Toolkit for Improving LLM Reliability and Trustworthiness
Evgenii Nikolaev | Timur Ionov | Anna Korzanova | Vasily Konovalov | Maksim Savkin
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
This paper introduces DeepPavlov 1.1, a new version of an open-source library for natural language processing (NLP). DeepPavlov 1.1 supports both traditional NLP tasks (such as named entity recognition and sentiment classification) and new tasks needed to enhance LLM truthfulness and reliability. These tools include a hallucination detection model, an evergreen question classifier, and a toxicity classifier. The library is easy to use, flexible, and works with many languages. It is designed to help researchers and developers build better, safer AI systems that use language. It is publicly available under the Apache 2.0 license and includes access to an interactive online demo.
2025
RAGulator: Effective RAG for Regulatory Question Answering
Islam Aushev | Egor Kratkov | Evgenii Nikolaev | Andrei Glinskii | Vasilii Krikunov | Alexander Panchenko | Vasily Konovalov | Julia Belikova
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)
Regulatory Natural Language Processing (RegNLP) is a multidisciplinary domain focused on facilitating access to and comprehension of regulations and regulatory requirements. This paper outlines our strategy for creating a system to address the Regulatory Information Retrieval and Answer Generation (RIRAG) challenge, which was conducted during the RegNLP 2025 Workshop. The objective of this competition is to design a system capable of efficiently extracting pertinent passages from regulatory texts (ObliQA) and subsequently generating accurate, cohesive responses to inquiries related to compliance and obligations. Our proposed method employs lightweight BM25 pre-filtering to retrieve relevant passages, efficiently shortlisting candidates for subsequent processing with Transformer-based embeddings and thereby optimizing the use of resources.
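A minimal pure-Python version of such a BM25 pre-filtering stage is sketched below; the corpus, query, and shortlist size are illustrative, not taken from the ObliQA data or the authors' code:

```python
# Okapi BM25 shortlister: score each document against the query, then
# keep only the top candidates for the (more expensive) embedding stage.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    df = Counter(t for d in tokenized for t in set(d))  # document frequencies
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((len(docs) - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["firms must report obligations quarterly",
        "data retention rules for licensed entities",
        "quarterly reporting obligations for firms"]
scores = bm25_scores("reporting obligations", docs)
shortlist = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)[:2]
print(shortlist)  # → [2, 0]
```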
SPY: Enhancing Privacy with Synthetic PII Detection Dataset
Maksim Savkin | Timur Ionov | Vasily Konovalov
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
We introduce **SPY Dataset**: a novel synthetic dataset for the task of **Personally Identifiable Information (PII) detection**, underscoring the significance of protecting PII in modern data processing. Our research innovates by leveraging Large Language Models (LLMs) to generate a dataset that emulates real-world PII scenarios. Through evaluation, we validate the dataset’s quality, providing a benchmark for PII detection. Comparative analyses reveal that while PII detection and Named Entity Recognition (NER) share similarities, **dedicated NER models exhibit limitations** when applied to PII-specific contexts. This work contributes to the field by making the generation methodology and the generated dataset publicly available, thereby enabling further research and development in this field.
FactDebug at SemEval-2025 Task 7: Hybrid Retrieval Pipeline for Identifying Previously Fact-Checked Claims Across Multiple Languages
Evgenii Nikolaev | Ivan Bondarenko | Islam Aushev | Vasilii Krikunov | Andrei Glinskii | Vasily Konovalov | Julia Belikova
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
The proliferation of multilingual misinformation demands robust systems for crosslingual fact-checked claim retrieval. This paper addresses SemEval-2025 Shared Task 7, which challenges participants to retrieve fact-checks for social media posts across 14 languages, even when posts and fact-checks are in different languages. We propose a hybrid retrieval pipeline that combines sparse lexical matching (BM25, BGE-m3) and dense semantic retrieval (pretrained and fine-tuned BGE-m3) with dynamic fusion and curriculum-trained rerankers. Our system achieves 67.2% crosslingual and 86.01% monolingual accuracy on the Shared Task MultiClaim dataset.
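Sparse and dense rankings can be fused in several ways; reciprocal-rank fusion is shown below as a simple stand-in for the paper's dynamic fusion (the document IDs are invented):

```python
# Reciprocal-rank fusion (RRF): each ranking contributes 1 / (k + rank)
# per document, so documents ranked well by both retrievers rise to the top.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["fc-7", "fc-2", "fc-9"]   # e.g. BM25 order (illustrative IDs)
dense  = ["fc-2", "fc-4", "fc-7"]   # e.g. embedding-similarity order
print(rrf([sparse, dense]))  # → ['fc-2', 'fc-7', 'fc-4', 'fc-9']
```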
SmurfCat at SemEval-2025 Task 3: Bridging External Knowledge and Model Uncertainty for Enhanced Hallucination Detection
Elisei Rykov | Valerii Olisov | Maksim Savkin | Artem Vazhentsev | Kseniia Titova | Alexander Panchenko | Vasily Konovalov | Julia Belikova
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
The Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in the SemEval-2025 competition aims to detect hallucination spans in the outputs of instruction-tuned LLMs in a multilingual context. In this paper, we address the detection of span hallucinations by applying an ensemble of approaches. In particular, we synthesized the PsiloQA dataset and fine-tuned an LLM to detect hallucination spans. In addition, we combined this approach with a white-box method based on uncertainty quantification techniques. Using our combined pipeline, we achieved 3rd place in detecting span hallucinations in Arabic, Catalan, Finnish, and Italian, and ranked within the top ten for the rest of the languages.
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
Elisei Rykov | Kseniia Petrushina | Maksim Savkin | Valerii Olisov | Artem Vazhentsev | Kseniia Titova | Alexander Panchenko | Vasily Konovalov | Julia Belikova
Findings of the Association for Computational Linguistics: EMNLP 2025
Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question–answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods, including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models, and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
SmurfCat at SHROOM-CAP: Factual but Awkward? Fluent but Wrong? Tackling Both in LLM Scientific QA
Timur Ionov | Evgenii Nikolaev | Artem Vazhentsev | Mikhail Chaichuk | Anton Korznikov | Elena Tutubalina | Alexander Panchenko | Vasily Konovalov | Elisei Rykov
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)
Large Language Models (LLMs) often generate hallucinations, a critical issue in domains like scientific communication where factual accuracy and fluency are essential. The SHROOM-CAP shared task addresses this challenge by evaluating Factual Mistakes and Fluency Mistakes across diverse languages, extending earlier SHROOM editions to the scientific domain. We present SmurfCat, our system for SHROOM-CAP, which integrates three complementary approaches: uncertainty estimation (white-box and black-box signals), encoder-based classifiers (Multilingual Modern BERT), and decoder-based judges (instruction-tuned LLMs with classification heads). Results show that decoder-based judges achieve the strongest overall performance, while uncertainty methods and encoders provide complementary strengths. Our findings highlight the value of combining uncertainty signals with encoder and decoder architectures for robust, multilingual detection of hallucinations and related errors in scientific publications.
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
Sergey Pletenev | Maria Marina | Nikolay Ivanov | Daria Galimzianova | Nikita Krayko | Mikhail Salnikov | Vasily Konovalov | Alexander Panchenko | Viktor Moskvoretskii
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions – whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o’s retrieval behavior.
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Sergey Pletenev | Maria Marina | Daniil Moskovskiy | Vasily Konovalov | Pavel Braslavski | Alexander Panchenko | Mikhail Salnikov
Findings of the Association for Computational Linguistics: NAACL 2025
The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model’s parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into an LLM using LoRA without compromising previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments show that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful, because the model’s performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new-knowledge integration and general model capabilities.
TabaQA at SemEval-2025 Task 8: Column Augmented Generation for Question Answering over Tabular Data
Ekaterina Antropova | Egor Kratkov | Roman Derunets | Margarita Trofimova | Ivan Bondarenko | Alexander Panchenko | Vasily Konovalov | Maksim Savkin
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
The DataBench shared task in the SemEval-2025 competition aims to tackle the problem of question answering (QA) over tabular data. Given the diversity of table structures, there are different approaches to retrieving the answer. Although Retrieval-Augmented Generation (RAG) is a viable solution, extracting relevant information from tables remains challenging. In addition, a table can be prohibitively large for direct integration into the LLM context. In this paper, we address QA over tabular data by first identifying relevant columns that might contain the answers; an LLM then generates answers given the context of those columns and finally refines them. This approach secured us 7th place in the DataBench lite category.
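The column-selection step could look roughly like the sketch below: score each column by lexical overlap between the question and the column's header and values, and pass only the top columns to the LLM. The scoring heuristic, table, and names are illustrative assumptions, not the authors' implementation:

```python
# Pick candidate columns for a question by header/value token overlap,
# so only a small slice of a large table reaches the LLM context.
def relevant_columns(question, table, top_n=2):
    q = set(question.lower().split())
    def overlap(col):
        tokens = set(col.lower().split("_"))          # header words
        for v in table[col]:
            tokens |= set(str(v).lower().split())     # cell values
        return len(q & tokens)
    return sorted(table, key=overlap, reverse=True)[:top_n]

table = {"player_name": ["Alice", "Bob"],
         "goals_scored": [12, 7],
         "team_city": ["Lyon", "Turin"]}
print(relevant_columns("how many goals were scored", table))  # 'goals_scored' ranks first
```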
Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images
Elisei Rykov | Kseniia Petrushina | Kseniia Titova | Anton Razzhigaev | Alexander Panchenko | Vasily Konovalov
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Measuring how real images look is a complex task in artificial intelligence research. For example, an image of Albert Einstein holding a smartphone violates common sense because modern smartphones were invented after Einstein’s death. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image common-sense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging an LVLM to extract atomic facts from these images, we obtain a mix of accurate and inaccurate facts. We then fine-tune a compact attention-pooling classifier over the encoded atomic facts. TLG achieves new state-of-the-art performance on the WHOOPS! and WEIRD datasets while leveraging a compact fine-tuning component.
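The attention-pooling head aggregates the per-fact encoder vectors into one vector before classification; a minimal version is sketched below, with mock 2-d fact vectors and a hand-set scoring weight standing in for learned parameters:

```python
# Attention pooling over per-fact vectors: score each fact, softmax the
# scores, and return the attention-weighted sum of the fact vectors.
import math

def attention_pool(fact_vectors, w):
    scores = [sum(wi * xi for wi, xi in zip(w, v)) for v in fact_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
    alphas = [e / sum(exps) for e in exps]
    dim = len(fact_vectors[0])
    return [sum(a * v[i] for a, v in zip(alphas, fact_vectors))
            for i in range(dim)]

facts = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]  # mock encoded atomic facts
pooled = attention_pool(facts, w=[1.0, 0.0])   # weight favours dimension 0
print([round(x, 3) for x in pooled])
```

In the paper this pooled representation would feed a small classification layer; here the point is only the pooling mechanics.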
Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home
Viktor Moskvoretskii | Maria Marina | Mikhail Salnikov | Nikolay Ivanov | Sergey Pletenev | Daria Galimzianova | Nikita Krayko | Vasily Konovalov | Irina Nikishina | Alexander Panchenko
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) improves the correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet it greatly increases computational costs. Moreover, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs’ intrinsic knowledge with external information by appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
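One simple uncertainty-based adaptive-retrieval rule, of the kind the paper compares, is to trigger retrieval only when the model's answer distribution is high-entropy. The probabilities and threshold below are toy numbers, not from the paper:

```python
# Entropy-thresholded retrieval decision: a confident (low-entropy)
# answer distribution skips retrieval; an uncertain one triggers it.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_retrieve(answer_probs, tau=0.5):
    return entropy(answer_probs) > tau

confident = [0.95, 0.03, 0.02]   # peaked distribution → answer directly
unsure    = [0.40, 0.35, 0.25]   # flat distribution → retrieve first
print(should_retrieve(confident), should_retrieve(unsure))  # → False True
```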
LLM-Independent Adaptive RAG: Let the Question Speak for Itself
Maria Marina | Nikolay Ivanov | Sergey Pletenev | Mikhail Salnikov | Daria Galimzianova | Nikita Krayko | Vasily Konovalov | Alexander Panchenko | Viktor Moskvoretskii
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are prone to hallucinations, and Retrieval-Augmented Generation (RAG) helps mitigate this, but at a high computational cost while risking misinformation. Adaptive retrieval aims to retrieve only when necessary, but existing approaches rely on LLM-based uncertainty estimation, which remains inefficient and impractical. In this study, we introduce lightweight LLM-independent adaptive retrieval methods based on external information. We investigated 27 features, organized into 7 groups, and their hybrid combinations. We evaluated these methods on 6 QA datasets, assessing QA performance and efficiency. The results show that our approach matches the performance of complex LLM-based methods while achieving significant efficiency gains, demonstrating the potential of external information for adaptive retrieval.
2024
Efficient Answer Retrieval System (EARS): Combining Local DB Search and Web Search for Generative QA
Nikita Krayko | Ivan Sidorov | Fedor Laputin | Daria Galimzianova | Vasily Konovalov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
In this work, we propose an efficient answer retrieval system **EARS**: a production-ready, factual question answering (QA) system that combines local knowledge base search with generative, context-based QA. To assess the quality of the generated content, we devise comprehensive metrics for both manual and automatic evaluation of the answers to questions. A distinctive feature of our system is the Ranker component, which ranks answer candidates based on their relevance. This feature enhances the effectiveness of local knowledge base retrieval by 23%. Another crucial aspect of our system is the LLM, which utilizes contextual information from a web search API to generate responses. This results in a substantial 92.8% boost in the usefulness of voice-based responses. **EARS** is language-agnostic and can be applied to any data domain.
DeepPavlov 1.0: Your Gateway to Advanced NLP Models Backed by Transformers and Transfer Learning
Maksim Savkin | Anastasia Voznyuk | Fedor Ignatov | Anna Korzanova | Dmitry Karpov | Alexander Popov | Vasily Konovalov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present DeepPavlov 1.0, an open-source framework for using Natural Language Processing (NLP) models by leveraging transfer learning techniques. DeepPavlov 1.0 is created for modular and configuration-driven development of state-of-the-art NLP models and supports a wide range of NLP model applications. DeepPavlov 1.0 is designed for practitioners with limited knowledge of NLP/ML. DeepPavlov is based on PyTorch and supports HuggingFace transformers. DeepPavlov is publicly released under the Apache 2.0 license and provides access to an online demo.
DeepPavlov at SemEval-2024 Task 6: Detection of Hallucinations and Overgeneration Mistakes with an Ensemble of Transformer-based Models
Ivan Maksimov | Vasily Konovalov | Andrei Glinskii
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
The inclination of large language models (LLMs) to produce mistaken assertions, known as hallucinations, can be problematic. These hallucinations are potentially harmful because sporadic factual inaccuracies within the generated text can be concealed by the overall coherence of the content, making them immensely challenging for users to identify. The goal of the SHROOM shared task is to detect grammatically sound outputs that contain incorrect or unsupported semantic information. Although many hallucination detectors for AI-generated content already exist, we found that pretrained Natural Language Inference (NLI) models nevertheless succeed in detecting hallucinations. Moreover, their ensemble outperforms more complicated models.
DeepPavlov at SemEval-2024 Task 8: Leveraging Transfer Learning for Detecting Boundaries of Machine-Generated Texts
Anastasia Voznyuk | Vasily Konovalov
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
The Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection shared task in the SemEval-2024 competition aims to tackle the problem of misusing collaborative human-AI writing. Although many detectors of AI-generated content exist, they are often designed to give a binary answer and thus may not be suitable for the more nuanced problem of finding the boundary between human-written and machine-generated text, even as hybrid human-AI writing becomes increasingly popular. In this paper, we address the boundary detection problem. In particular, we present a pipeline for augmenting data for supervised fine-tuning of DeBERTaV3. With this pipeline, we achieve the best MAE score on the competition leaderboard.
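The boundary-detection objective can be illustrated independently of the model: given per-token probabilities of being machine-generated (produced by a token classifier such as the fine-tuned DeBERTaV3; the numbers below are mock values), the predicted switch point is where those probabilities first cross a threshold:

```python
# Toy boundary prediction from per-token P(machine-generated).
# MAE between predicted and gold boundary indices is the task metric.
def predict_boundary(probs, threshold=0.5):
    for i, p in enumerate(probs):
        if p >= threshold:
            return i
    return len(probs)  # no crossing → text is fully human-written

probs = [0.05, 0.10, 0.08, 0.62, 0.81, 0.90]  # mock classifier outputs
print(predict_boundary(probs))  # → 3 (machine text starts at token 3)
```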
JellyBell at TextGraphs-17 Shared Task: Fusing Large Language Models with External Knowledge for Enhanced Question Answering
Julia Belikova | Evegeniy Beliakin | Vasily Konovalov
Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing
This work describes an approach to developing a Knowledge Graph Question Answering (KGQA) system for the TextGraphs-17 shared task. The task focuses on the fusion of Large Language Models (LLMs) with Knowledge Graphs (KGs). The goal is to select the KG entity (out of several candidates) that corresponds to the answer to a given textual question. Our approach applies an LLM to identify the correct answer among the list of possible candidates. We confirm that integrating external information is particularly beneficial when the subject entities are not well known, whereas using RAG can negatively impact LLM performance on questions about popular entities, as the retrieved context might be misleading. Our system achieved 2nd place in the post-evaluation phase.
2016
The Negochat Corpus of Human-agent Negotiation Dialogues
Vasily Konovalov | Ron Artstein | Oren Melamud | Ido Dagan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Annotated in-domain corpora are crucial to the successful development of dialogue systems of automated agents, and in particular for developing natural language understanding (NLU) components of such systems. Unfortunately, such important resources are scarce. In this work, we introduce an annotated natural language human-agent dialogue corpus in the negotiation domain. The corpus was collected using Amazon Mechanical Turk following the ‘Wizard-Of-Oz’ approach, where a ‘wizard’ human translates the participants’ natural language utterances in real time into a semantic language. Once dialogue collection was completed, utterances were annotated with intent labels by two independent annotators, achieving high inter-annotator agreement. Our initial experiments with an SVM classifier show that automatically inferring such labels from the utterances is far from trivial. We make our corpus publicly available to serve as an aid in the development of dialogue systems for negotiation agents, and suggest that analogous corpora can be created following our methodology and using our available source code. To the best of our knowledge this is the first publicly available negotiation dialogue corpus.
Co-authors
- Alexander Panchenko 11
- Maksim Savkin 6
- Julia Belikova 5
- Elisei Rykov 5
- Daria Galimzianova 4
- Nikita Krayko 4
- Maria Marina 4
- Evgenii Nikolaev 4
- Sergey Pletenev 4
- Mikhail Salnikov 4
- Andrei Glinskii 3
- Timur Ionov 3
- Nikolay Ivanov 3
- Viktor Moskvoretskii 3
- Kseniia Titova 3
- Artem Vazhentsev 3
- Islam Aushev 2
- Ivan Bondarenko 2
- Anna Korzanova 2
- Egor Kratkov 2
- Vasilii Krikunov 2
- Valerii Olisov 2
- Kseniia Petrushina 2
- Anastasia Voznyuk 2
- Ilseyar Alimova 1
- Ekaterina Antropova 1
- Ron Artstein 1
- Evegeniy Beliakin 1
- Pavel Braslavski 1
- Mikhail Chaichuk 1
- Artem Chervyakov 1
- Ido Dagan 1
- Roman Derunets 1
- Anton Emelyanov 1
- Alena Fenogenova 1
- Fedor Ignatov 1
- Ulyana Isaeva 1
- Alexander Kapitanov 1
- Dmitry Karpov 1
- Alexander Kharitonov 1
- Anton Korznikov 1
- Fedor Laputin 1
- Yulia Lyakh 1
- Ivan Maksimov 1
- Oren Melamud 1
- Amina Miftakhova 1
- Daniil Moskovskiy 1
- Irina Nikishina 1
- Alexander Popov 1
- Liudmila Radchankava 1
- Anton Razzhigaev 1
- Vildan Saburov 1
- Artem Safin 1
- Denis Shevelev 1
- Ivan Sidorov 1
- Petr Surovtsev 1
- Ivan Sviridov 1
- Maria Tikhonova 1
- Margarita Trofimova 1
- Elena Tutubalina 1