Timur Ionov
2026
DeepPavlov Strikes Back: A Toolkit for Improving LLM Reliability and Trustworthiness
Evgenii Nikolaev | Timur Ionov | Anna Korzanova | Vasily Konovalov | Maksim Savkin
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
This paper introduces DeepPavlov 1.1, a new version of an open-source library for natural language processing (NLP). DeepPavlov 1.1 supports both traditional NLP tasks (such as named entity recognition and sentiment classification) and new tasks needed to enhance LLM truthfulness and reliability. These tools include a hallucination detection model, an evergreen question classifier, and a toxicity classifier. The library is easy to use, flexible, and works with many languages. It is designed to help researchers and developers build better, safer AI systems that use language. It is publicly available under the Apache 2.0 license and includes access to an interactive online demo.
Call, Reward, Repeat: Advancing Dialog State Tracking with GRPO and Function Calling
Timur Ionov | Anna Marshalova | Valentin Malykh
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Recent advancements in Large Language Models (LLMs) have notably enhanced task-oriented dialogue systems, particularly in Dialogue State Tracking (DST), owing to their generative capabilities and strong generalization. Although recent approaches such as LDST and FnCTOD significantly improved cross-domain DST performance via supervised fine-tuning (SFT), these methods typically require substantial amounts of domain-specific data. In this paper, we address this limitation by employing Group Relative Policy Optimization (GRPO), a critic-free reinforcement learning method that efficiently guides LLMs toward improved DST accuracy even under low-resource conditions. Our results on established DST benchmarks, including MultiWOZ 2.1 and 2.4, demonstrate that the RL approach achieves superior performance to existing methods while using significantly less out-of-domain training data. In addition, we found that models pretrained specifically for tool-use tasks can be a better starting point, especially at small scales.
2025
SmurfCat at SHROOM-CAP: Factual but Awkward? Fluent but Wrong? Tackling Both in LLM Scientific QA
Timur Ionov | Evgenii Nikolaev | Artem Vazhentsev | Mikhail Chaichuk | Anton Korznikov | Elena Tutubalina | Alexander Panchenko | Vasily Konovalov | Elisei Rykov
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)
Large Language Models (LLMs) often generate hallucinations, a critical issue in domains like scientific communication where factual accuracy and fluency are essential. The SHROOM-CAP shared task addresses this challenge by evaluating Factual Mistakes and Fluency Mistakes across diverse languages, extending earlier SHROOM editions to the scientific domain. We present SmurfCat, our system for SHROOM-CAP, which integrates three complementary approaches: uncertainty estimation (white-box and black-box signals), encoder-based classifiers (Multilingual Modern BERT), and decoder-based judges (instruction-tuned LLMs with classification heads). Results show that decoder-based judges achieve the strongest overall performance, while uncertainty methods and encoders provide complementary strengths. Our findings highlight the value of combining uncertainty signals with encoder and decoder architectures for robust, multilingual detection of hallucinations and related errors in scientific publications.
SPY: Enhancing Privacy with Synthetic PII Detection Dataset
Maksim Savkin | Timur Ionov | Vasily Konovalov
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
We introduce **SPY Dataset**: a novel synthetic dataset for the task of **Personal Identifiable Information (PII) detection**, underscoring the significance of protecting PII in modern data processing. Our research innovates by leveraging Large Language Models (LLMs) to generate a dataset that emulates real-world PII scenarios. Through evaluation, we validate the dataset’s quality, providing a benchmark for PII detection. Comparative analyses reveal that while PII detection and Named Entity Recognition (NER) share similarities, **dedicated NER models exhibit limitations** when applied to PII-specific contexts. This work contributes to the field by making the generation methodology and the generated dataset publicly available, thereby enabling further research and development in this area.