Alena Fenogenova


2024

RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Ekaterina Taktasheva | Maxim Bazhukov | Kirill Koncha | Alena Fenogenova | Ekaterina Artemova | Vladislav Mikhailov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying linguistic perturbations to automatically annotated sentences from open text corpora and decontaminating test data. We describe the data collection protocol and present the results of evaluating 25 language models in various scenarios. We find that the widely used LMs for Russian are sensitive to morphological and agreement-oriented contrasts, but fall behind humans on phenomena requiring the understanding of structural relations, negation, transitivity, and tense. RuBLiMP, the codebase, and other materials are publicly available.

mGPT: Few-Shot Learners Go Multilingual
Oleh Shliazhko | Alena Fenogenova | Maria Tikhonova | Anastasia Kozlova | Vladislav Mikhailov | Tatiana Shavrina
Transactions of the Association for Computational Linguistics, Volume 12

This paper introduces mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 Corpus. We detail the design and pretraining procedure. The models undergo an intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. The in-context learning abilities are on par with the contemporaneous language models while covering a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and the indigenous peoples in Russia. The source code and the language models are publicly available under the MIT license.

How to tame your plotline: A framework for goal-driven interactive fairy tale generation
Marina Ermolaeva | Anastasia Shakhmatova | Alina Nepomnyashchikh | Alena Fenogenova
Proceedings of the 6th Workshop on Narrative Understanding

Automatic storytelling is a difficult NLP task that poses a challenge even for state-of-the-art large language models. This paper proposes a pipeline for interactive fairy tale generation in a mixed-initiative setting. Our approach introduces a story goal as a stopping condition, imposes minimal structure on the narrative in the form of a simple emotional arc, and controls the transition between the stages of the story via system prompt engineering. The resulting framework reconciles creating a structured and complete short-form narrative with retaining player agency and allowing users to influence the storyline through their input. We evaluate our approach with several proprietary and open-source language models and examine its transferability to different languages, specifically English and Russian.

A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Nikita Martynov | Mark Baushenko | Anastasia Kozlova | Katerina Kolomeytseva | Aleksandr Abramov | Alena Fenogenova
Findings of the Association for Computational Linguistics: EACL 2024

Large language models excel in text generation and generalization; however, they face challenges in text editing tasks, especially in correcting spelling errors and mistypings. In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian, that can potentially be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistypings in texts and studying how those errors can be emulated in correct sentences to effectively enrich generative models’ pre-training procedure. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) the first mimics human behavior when making a mistake by leveraging error statistics from a particular dataset, and 2) the second adds the most common spelling errors, keyboard misclicks, and some heuristics to the texts. We conducted experiments employing various corruption strategies, model architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).

Transformer Attention vs Human Attention in Anaphora Resolution
Anastasia Kozlova | Albina Akhmetgareeva | Aigul Khanova | Semen Kudriavtsev | Alena Fenogenova
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Motivated by human cognitive processes, the attention mechanism within the transformer architecture was developed to help neural networks allocate focus to specific aspects of the input data. Despite claims regarding the interpretability achieved by attention mechanisms, the extent of correlation and similarity between machine and human attention remains a subject requiring further investigation. In this paper, we conduct a quantitative analysis of human attention compared to neural attention mechanisms in the context of the anaphora resolution task. We collect an eye-tracking dataset based on the Winograd schema challenge task for the Russian language. Leveraging this dataset, we conduct an extensive analysis of the correlations between human and machine attention maps across various transformer architectures and network layers of pre-trained and fine-tuned models. Our aim is to investigate whether insights from human attention mechanisms can be used to enhance the performance of neural networks in tasks such as anaphora resolution. The results reveal distinctions in anaphora resolution processing, offering promising prospects for improving the performance of neural networks and understanding the cognitive nuances of human perception.

MERA: A Comprehensive LLM Evaluation in Russian
Alena Fenogenova | Artem Chervyakov | Nikita Martynov | Anastasia Kozlova | Maria Tikhonova | Albina Akhmetgareeva | Anton Emelyanov | Denis Shevelev | Pavel Lebedev | Leonid Sinev | Ulyana Isaeva | Katerina Kolomeytseva | Daniil Moskovskiy | Elizaveta Goncharova | Nikita Savushkin | Polina Mikhailova | Anastasia Minaeva | Denis Dimitrov | Alexander Panchenko | Sergey Markov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). However, despite researchers’ attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce a new instruction benchmark, MERA, oriented towards the FMs’ performance on the Russian language. The benchmark encompasses 21 evaluation tasks for generative models covering 10 skills and is supplied with private answer scoring to prevent data leakage. The paper introduces a methodology to evaluate FMs and LMs in fixed zero- and few-shot instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find they are still far behind the human level. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential ethical concerns and drawbacks.

A Family of Pretrained Transformer Language Models for Russian
Dmitry Zmitrovich | Aleksandr Abramov | Andrey Kalmykov | Vitaly Kadulin | Maria Tikhonova | Ekaterina Taktasheva | Danil Astafurov | Mark Baushenko | Artem Snegirev | Tatiana Shavrina | Sergei S. Markov | Vladislav Mikhailov | Alena Fenogenova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, developing such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we aim to broaden the scope of the NLP research directions and enable the development of industrial solutions for the Russian language.

2022

A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification
Varvara Logacheva | Daryna Dementieva | Irina Krotova | Alena Fenogenova | Irina Nikishina | Tatiana Shavrina | Alexander Panchenko
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

It is often difficult to reliably evaluate models which generate text. Among them, text style transfer is particularly difficult to evaluate, because its success depends on a number of parameters. We conduct an evaluation of a large number of models on a detoxification task. We explore the relations between the manual and automatic metrics and find that there is only a weak correlation between them, which depends on the type of model that generated the text. Automatic metrics tend to be less reliable for better-performing models. However, our findings suggest that ChrF and BertScore metrics can be used as a proxy for human evaluation of text detoxification to some extent.

TAPE: Assessing Few-shot Russian Language Understanding
Ekaterina Taktasheva | Alena Fenogenova | Denis Shevelev | Nadezhda Katricheva | Maria Tikhonova | Albina Akhmetgareeva | Oleg Zinkevich | Anastasiia Bashmakova | Svetlana Iordanskaia | Valentina Kurenshchikova | Alena Spiridonova | Ekaterina Artemova | Tatiana Shavrina | Vladislav Mikhailov
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this line of research, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE’s design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. The detailed analysis of testing the autoregressive baselines indicates that simple spelling-based perturbations affect the performance the most, while paraphrasing the input has a less pronounced effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (https://tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.

Proceedings of the first workshop on NLP applications to field linguistics
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Elena Klyachko | Ekaterina Neminova | Ekaterina Vylomova | Tatiana Shavrina | Eric Le Ferrand | Valentin Malykh | Francis Tyers | Timofey Arkhangelskiy | Vladislav Mikhailov | Alena Fenogenova
Proceedings of the first workshop on NLP applications to field linguistics

2021

Russian Paraphrasers: Paraphrase with Transformers
Alena Fenogenova
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

This paper studies generation methods for paraphrasing in the Russian language. Several transformer-based models (Russian and multilingual) are trained on a collected corpus of paraphrases. We compare different models, contrast the quality of paraphrases using different ranking methods, and apply paraphrasing methods in the context of an augmentation procedure for different tasks. The contributions of the work are the combined paraphrasing dataset, fine-tuned generative models for the Russian paraphrasing task, and an open-source tool for easy use of the paraphrasers.

2020

Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian
Alena Fenogenova | Vladislav Mikhailov | Denis Shevelev
Proceedings of the 28th International Conference on Computational Linguistics

The paper introduces two Russian machine reading comprehension (MRC) datasets, called MuSeRC and RuCoS, which require reasoning over multiple sentences and commonsense knowledge to infer the answer. The former follows the design of MultiRC, while the latter is a counterpart of the ReCoRD dataset. The datasets are included in RussianSuperGLUE, the Russian general language understanding benchmark. We provide a comparative analysis and demonstrate that the proposed tasks are more complex than the original ones for English. In addition, performance results of human solvers and BERT-based models show that MuSeRC and RuCoS represent a challenge for recent advanced neural models. We thus hope to facilitate research in the field of MRC for Russian and prompt the study of multi-hop reasoning in a cross-lingual scenario.

Humans Keep It One Hundred: an Overview of AI Journey
Tatiana Shavrina | Anton Emelyanov | Alena Fenogenova | Vadim Fomin | Vladislav Mikhailov | Andrey Evlampiev | Valentin Malykh | Vladimir Larin | Alex Natekin | Aleksandr Vatulin | Peter Romov | Daniil Anastasiev | Nikolai Zinov | Andrey Chertok
Proceedings of the Twelfth Language Resources and Evaluation Conference

Artificial General Intelligence (AGI) is showing growing performance in numerous applications: beating human performance in Chess and Go, using knowledge bases and text sources to answer questions (SQuAD), and even passing human examinations (Aristo project). In this paper, we describe the results of AI Journey, a competition of AI systems aimed at improving AI performance on knowledge bases, reasoning and text generation. Competing systems pass the final native language exam (in Russian), including versatile grammar tasks (test and open questions) and an essay, achieving a high score of 69%, with 68% being an average human result. During the competition, a baseline for the task and essay parts was proposed, and 80+ systems were submitted, showing different approaches to task understanding and reasoning. All the data and solutions can be found on GitHub: https://github.com/sberbank-ai/combined_solution_aij2019

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
Tatiana Shavrina | Alena Fenogenova | Anton Emelyanov | Denis Shevelev | Ekaterina Artemova | Valentin Malykh | Vladislav Mikhailov | Maria Tikhonova | Andrey Chertok | Andrey Evlampiev
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, Russian SuperGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We also provide baselines, a human-level evaluation, an open-source framework for evaluating models, and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the translated diagnostic test set and offer the first steps toward further expanding or assessing state-of-the-art models independently of language.