Valentin Malykh - ACL Anthology

2025

StRuCom: A Novel Dataset of Structured Code Comments in Russian
Maria Dziuba | Valentin Malykh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for their generation perform poorly for Russian compared to English. To bridge this gap, we present StRuCom — the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets that distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation.

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
Adamenko Pavel | Ivanov Mikhail | Aidar Valeev | Rodion Levichev | Pavel Zadorozhny | Ivan Lopatin | Dmitrii Babaev | Alena Fenogenova | Valentin Malykh
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g., SWE-bench reports 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 728 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.

Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
Maksim Borisov | Zhanibek Kozhirbayev | Valentin Malykh
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Machine translation for low-resource language pairs is a challenging task. This task could become extremely difficult once a speaker uses code switching. We present the first code-switching Kazakh-Russian parallel corpus. Additionally, we propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. This method results in a model beating an existing commercial system in human evaluation.

CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
Nikita Sorokin | Tikhonov Anton | Dmitry Abulkhanov | Ivan Sedykh | Irina Piontkovskaya | Valentin Malykh
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

We consider the well-known and important tasks of clone detection and information retrieval for source code. The most standard setup is to search for clones among code snippets in the same language. But it is also useful to find code snippets with identical behaviour in different programming languages. Nevertheless, multi- and cross-lingual clone detection has been little studied in the literature. We present a novel training procedure, cross-consistency training (CCT), leveraging cross-lingual similarity, that we apply to train language models on source code in various programming languages. We show that this training is effective for both encoder- and decoder-based models. The trained encoder-based CCT-LM model achieves a new state of the art on POJ-104 (monolingual C++ clone detection benchmark) with 96.73% MAP and AdvTest (monolingual Python code search benchmark) with 47.18% MRR. The decoder-based CCT-LM model shows comparable performance in these tasks. In addition, we formulate the multi- and cross-lingual clone detection problem and present XCD, a new benchmark dataset produced from CodeForces submissions.

2024

Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jade Abbott | Jonathan Washington | Nathaniel Oco | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

Searching by Code: A New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets
Ivan Sedykh | Nikita Sorokin | Dmitry Abulkhanov | Sergey I. Nikolenko | Valentin Malykh
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Code search is an important and well-studied task, but it usually means searching for code by a text query. We argue that using a code snippet (and possibly an error traceback) as a query while looking for bugfixing instructions and code samples is a natural use case not covered by prior art. Moreover, existing datasets use code comments rather than full-text descriptions as text, making them unsuitable for this use case. We present a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data; we show that on SearchBySnippet, existing architectures fall short of a simple BM25 baseline even after fine-tuning. We present a new single encoder model SnippeR that outperforms several strong baselines on SearchBySnippet with a result of 0.451 Recall@10; we propose the SearchBySnippet dataset and SnippeR as a new important benchmark for code search evaluation.

2023

A System for Answering Simple Questions in Multiple Languages
Anton Razzhigaev | Mikhail Salnikov | Valentin Malykh | Pavel Braslavski | Alexander Panchenko
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Our research focuses on the most prevalent type of queries, simple questions, exemplified by questions like “What is the capital of France?”. These questions reference an entity such as “France”, which is directly connected (one hop) to the answer entity “Paris” in the underlying knowledge graph (KG). We propose a multilingual Knowledge Graph Question Answering (KGQA) technique that orders potential responses based on the distance between the question’s text embeddings and the answer’s graph embeddings. A system incorporating this novel method is also described in our work. Through comprehensive experimentation using various English and multilingual datasets and two KGs (Freebase and Wikidata), we illustrate the comparative advantage of the proposed method across diverse KG embeddings and languages. This edge is apparent even against robust baseline systems, including seq2seq QA models, search-based solutions and intricate rule-based pipelines. Interestingly, our research underscores that even advanced AI systems like ChatGPT encounter difficulties when tasked with answering simple questions. This finding emphasizes the relevance and effectiveness of our approach, which consistently outperforms such systems. We are making the source code and trained models from our study publicly accessible to promote further advancements in multilingual KGQA.

Proceedings of the Second Workshop on NLP Applications to Field Linguistics
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Elena Klyachko | Ekaterina Vylomova | Tatiana Shavrina | Eric Le Ferrand | Valentin Malykh | Francis Tyers | Timofey Arkhangelskiy | Vladislav Mikhailov
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jade Abbott | Jonathan Washington | Nathaniel Oco | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)

Large Language Models Meet Knowledge Graphs to Answer Factoid Questions
Mikhail Salnikov | Hai Le | Prateek Rajput | Irina Nikishina | Pavel Braslavski | Valentin Malykh | Alexander Panchenko
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

Proceedings of the First Workshop on NLP applications to field linguistics
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Elena Klyachko | Ekaterina Neminova | Ekaterina Vylomova | Tatiana Shavrina | Eric Le Ferrand | Valentin Malykh | Francis Tyers | Timofey Arkhangelskiy | Vladislav Mikhailov | Alena Fenogenova
Proceedings of the First Workshop on NLP applications to field linguistics

Template-based Approach to Zero-shot Intent Recognition
Dmitry Lamanov | Pavel Burnyshev | Ekaterina Artemova | Valentin Malykh | Andrey Bout | Irina Piontkovskaya
Proceedings of the 15th International Conference on Natural Language Generation

Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
Atul Kr. Ojha | Chao-Hong Liu | Ekaterina Vylomova | Jade Abbott | Jonathan Washington | Nathaniel Oco | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

Ask Me Anything in Your Native Language
Nikita Sorokin | Dmitry Abulkhanov | Irina Piontkovskaya | Valentin Malykh
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Cross-lingual question answering is a thriving field in the modern world, helping people to search for information on the web more efficiently. One of the important scenarios is to give an answer even when there is no answer in the language in which a person asks the question. We present a novel approach based on a single encoder for query and passage for retrieval from a multilingual collection, together with a cross-lingual generative reader. It achieves a new state of the art in both retrieval and end-to-end tasks on the XOR TyDi dataset, outperforming previous results by up to 10% on several languages. We find that our approach can be generalized to more than 20 languages in a zero-shot setting and outperforms all previous models by 12%.

Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP
Tatiana Shavrina | Vladislav Mikhailov | Valentin Malykh | Ekaterina Artemova | Oleg Serikov | Vitaly Protasov
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP

2021

Single Example Can Improve Zero-Shot Data Generation
Pavel Burnyshev | Valentin Malykh | Andrey Bout | Ekaterina Artemova | Irina Piontkovskaya
Proceedings of the 14th International Conference on Natural Language Generation

Sub-tasks of intent classification, such as robustness to distribution shift, adaptation to specific user groups and personalization, and out-of-domain detection, require extensive and flexible datasets for experiments and evaluation. As collecting such datasets is time- and labor-consuming, we propose to use text generation methods to gather datasets. The generator should be trained to generate utterances that belong to the given intent. We explore two approaches to the generation of task-oriented utterances: in the zero-shot approach, the model is trained to generate utterances from seen intents and is further used to generate utterances for intents unseen during training. In the one-shot approach, the model is presented with a single utterance from a test intent. We perform a thorough automatic and human evaluation of the intrinsic properties of the two generation approaches. The attributes of the generated data are close to those of the original test sets, collected via crowd-sourcing.

InFoBERT: Zero-Shot Approach to Natural Language Understanding Using Contextualized Word Embedding
Pavel Burnyshev | Andrey Bout | Valentin Malykh | Irina Piontkovskaya
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Natural language understanding is an important task in modern dialogue systems. It becomes more important with the rapid extension of the dialogue systems’ functionality. In this work, we present an approach to zero-shot transfer learning for the tasks of intent classification and slot-filling based on pre-trained language models. We use deep contextualized models, feeding them utterances and natural language descriptions of user intents to get text embeddings. These embeddings are then used by a small neural network to produce predictions for intent and slot probabilities. This architecture achieves new state-of-the-art results in two zero-shot scenarios. One is single-language adaptation to a new skill, and the other is cross-lingual adaptation.

Multiple Teacher Distillation for Robust and Greener Models
Artur Ilichev | Nikita Sorokin | Irina Piontkovskaya | Valentin Malykh
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Language models are now at the center of progress in natural language processing. These models are mostly of significant size. There are successful attempts to reduce them, but at least some of these attempts rely on randomness. We propose a novel distillation procedure leveraging multiple teachers, which alleviates random seed dependency and makes the models more robust. We show that this procedure applied to TinyBERT and DistilBERT models improves their worst-case results by up to 2% while keeping almost the same best-case ones. The latter holds even with a constraint on computational time, which is important to lessen the carbon footprint. In addition, we present the results of applying the proposed procedure to a computer vision model, ResNet, which show that the finding also holds in this totally different domain.

2020

SumTitles: a Summarization Dataset with Low Extractiveness
Valentin Malykh | Konstantin Chernis | Ekaterina Artemova | Irina Piontkovskaya
Proceedings of the 28th International Conference on Computational Linguistics

The existing dialogue summarization corpora are significantly extractive. We introduce a methodology for dataset extractiveness evaluation and present a new low-extractive corpus of movie dialogues for abstractive text summarization along with baseline evaluation. The corpus contains 153k dialogues and consists of three parts: 1) automatically aligned subtitles, 2) automatically aligned scenes from scripts, and 3) manually aligned scenes from scripts. We also present an alignment algorithm which we use to construct the corpus.

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
Tatiana Shavrina | Alena Fenogenova | Emelyanov Anton | Denis Shevelev | Ekaterina Artemova | Valentin Malykh | Vladislav Mikhailov | Maria Tikhonova | Andrey Chertok | Andrey Evlampiev
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, Russian SuperGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We also provide baselines, human-level evaluation, an open-source framework for evaluating models, and an overall leaderboard of transformer models for the Russian language. Besides, we present the first results of comparing multilingual models on the translated diagnostic test set and offer the first steps to further expanding or assessing state-of-the-art models independently of language.

Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jade Abbott | John Ortega | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Findings of the LoResMT 2020 Shared Task on Zero-Shot for Low-Resource languages
Atul Kr. Ojha | Valentin Malykh | Alina Karakanta | Chao-Hong Liu
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

This paper presents the findings of the LoResMT 2020 Shared Task on zero-shot translation for low resource languages. This task was organised as part of the 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT) at AACL-IJCNLP 2020. The focus was on the zero-shot approach as a notable development in Neural Machine Translation to build MT systems for language pairs where parallel corpora are small or even non-existent. The shared task experience suggests that back-translation and domain adaptation methods result in better accuracy for small-size datasets. We further noted that, although translation between similar languages is no cakewalk, linguistically distinct languages require more data to give better results.

Humans Keep It One Hundred: an Overview of AI Journey
Tatiana Shavrina | Anton Emelyanov | Alena Fenogenova | Vadim Fomin | Vladislav Mikhailov | Andrey Evlampiev | Valentin Malykh | Vladimir Larin | Alex Natekin | Aleksandr Vatulin | Peter Romov | Daniil Anastasiev | Nikolai Zinov | Andrey Chertok
Proceedings of the Twelfth Language Resources and Evaluation Conference

Artificial General Intelligence (AGI) is showing growing performance in numerous applications - beating human performance in Chess and Go, using knowledge bases and text sources to answer questions (SQuAD) and even pass human examination (Aristo project). In this paper, we describe the results of AI Journey, a competition of AI-systems aimed to improve AI performance on knowledge bases, reasoning and text generation. Competing systems pass the final native language exam (in Russian), including versatile grammar tasks (test and open questions) and an essay, achieving a high score of 69%, with 68% being an average human result. During the competition, a baseline for the task and essay parts was proposed, and 80+ systems were submitted, showing different approaches to task understanding and reasoning. All the data and solutions can be found on github https://github.com/sberbank-ai/combined_solution_aij2019

2019

Robust to Noise Models in Natural Language Processing Tasks
Valentin Malykh
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

People are surrounded by a great deal of noisy text in modern life. The traditional approach is to use spelling correction, yet the existing solutions are far from perfect. We propose a noise-robust word embedding model, which outperforms commonly used existing models, such as fasttext and word2vec, on different tasks. In addition, we investigate the noise robustness of current models in different natural language processing tasks. We propose extensions for modern models in three downstream tasks, i.e. text classification, named entity recognition and aspect extraction, which show improvement in noise robustness over existing solutions.

AspeRa: Aspect-Based Rating Prediction Based on User Reviews
Elena Tutubalina | Valentin Malykh | Sergey Nikolenko | Anton Alekseev | Ilya Shenbin
Proceedings of the 2019 Workshop on Widening NLP

We propose a novel Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items. It is based on aspect extraction with neural networks and combines the advantages of deep learning and topic modeling. It is mainly designed for recommendations, but an important secondary goal of AspeRa is to discover coherent aspects of reviews that can be used to explain predictions or for user profiling. We conduct a comprehensive empirical study of AspeRa, showing that it outperforms state-of-the-art models in terms of recommendation quality and produces interpretable aspects. This paper is an abridged version of our work (Nikolenko et al., 2019)

Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Valentin Malykh | Xiaobing Zhao
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

2018

Adoption of messaging communication and voice assistants has grown rapidly in recent years. This creates a demand for tools that speed up prototyping of feature-rich dialogue systems. The open-source library DeepPavlov is tailored for development of conversational agents. The library prioritises efficiency, modularity, and extensibility with the goal of making it easier to develop dialogue systems from scratch and with limited data available. It supports modular as well as end-to-end approaches to implementation of conversational agents. A conversational agent consists of skills, and every skill can be decomposed into components. Components are usually models which solve typical NLP tasks such as intent classification, named entity recognition or pre-trained word vectors. A sequence-to-sequence chit-chat skill, a question answering skill or a task-oriented skill can be assembled from components provided in the library.

Robust Word Vectors: Context-Informed Embeddings for Noisy Texts
Valentin Malykh | Varvara Logacheva | Taras Khakhulin
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

We suggest a new language-independent architecture of robust word vectors (RoVe). It is designed to alleviate the issue of typos, which are common in almost any user-generated content and hinder automatic text processing. Our model is morphologically motivated, which allows it to deal with unseen word forms in morphologically rich languages. We present the results on a number of Natural Language Processing (NLP) tasks and languages for a variety of related architectures and show that the proposed architecture is typo-proof.

Co-authors

Vladislav Mikhailov 5

Nathaniel Oco 5

Tatiana Shavrina 5

Ekaterina Vylomova 5

Jonathan Washington 5

Xiaobing Zhao 5

Alena Fenogenova 4

Flammie A. Pirinen 4

Nikita Sorokin 4

Dmitry Abulkhanov 3

Pavel Burnyshev 3

Alina Karakanta 3

Timofey Arkhangelskiy 2

Pavel Braslavski 2

Andrey Chertok 2

Andrey Evlampiev 2

Taras Khakhulin 2

Elena Klyachko 2

Surafel Melaku Lakew 2

Eric Le Ferrand 2

Alexander Panchenko 2

Anna Postnikova 2

Mikhail Salnikov 2

Francis Tyers 2

Ekaterina Voloshina 2

Rafael Airapetyan 1

Anton Alekseev 1

Daniil Anastasiev 1

Emelyanov Anton 1

Tikhonov Anton 1

Mikhail Arkhipov 1

Dmitrii Babaev 1

Dilyara Baymurzina 1

Maksim Borisov 1

Mikhail Burtsev 1

Nickolay Bushkov 1

Konstantin Chernis 1

Anton Emelyanov 1

Olga Gureenkova 1

Artur Ilichev 1

Zhanibek Kozhirbayev 1

Yurii Kuratov 1

Denis Kuznetsov 1

Dmitry Lamanov 1

Vladimir Larin 1

Rodion Levichev 1

Alexey Litinsky 1

Ivanov Mikhail 1

Ekaterina Neminova 1

Irina Nikishina 1

Sergey I. Nikolenko 1

Sergey Nikolenko 1

Adamenko Pavel 1

Vadim Polulyakh 1

Vitaly Protasov 1

Leonid Pugachev 1

Prateek Rajput 1

Anton Razzhigaev 1

Alexander Seliverstov 1

Denis Shevelev 1

Alexey Sorokin 1

Maria Tikhonova 1

Elena Tutubalina 1

Aleksandr Vatulin 1

Maria Vikhreva 1

Pavel Zadorozhny 1

Marat Zaynutdinov 1

Nikolai Zinov 1

Venues