Elena Tutubalina - ACL Anthology

2026

Feature Drift: How Fine-Tuning Repurposes Representations in LLMs
Andrey V. Galichin | Anton Korznikov | Alexey Dontsov | Oleg Rogov | Elena Tutubalina | Ivan Oseledets
Findings of the Association for Computational Linguistics: EACL 2026

Fine-tuning LLMs introduces many important behaviors, such as instruction-following and safety alignment. This makes it crucial to study how fine-tuning changes models’ internal mechanisms. Sparse Autoencoders (SAEs) offer a powerful tool for interpreting neural networks by extracting concepts (features) represented in their activations. Previous work observed that SAEs trained on base models transfer effectively to instruction-tuned (chat) models, attributed to activation similarity. In this work, we propose *feature drift* as an alternative explanation: the feature space remains relevant, but the distribution of feature activations changes. In other words, fine-tuning recombines existing concepts rather than learning new ones. We validate this by showing base SAEs reconstruct both base and chat activations comparably despite systematic differences, with individual features exhibiting clear drift patterns. In a refusal behavior case study, we identify base SAE features that drift to activate on harmful instructions in chat models. Causal interventions using these features confirm that they mediate refusal. Our findings suggest that monitoring how existing features drift, rather than searching for entirely new features, may provide a more complete explanation of how fine-tuning changes model capabilities.

POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization
Usman Naseem | Robert Geislinger | Juan Ren | Sarah Kohail | Rudy Alexandro Garrido Veliz | P Sam Sahil | Yiran Zhang | Idris Abdulmumin | Marco Antonio Stranisci | Özge Alacam | Cengiz Acarturk | Aisha Jabr | Saba Anwar | Abinew Ali Ayele | Simona Frenda | Alessandra Teresa Cignarella | Elena Tutubalina | Oleg Rogov | Aung Kyaw Htet | Xintong Wang | Surendrabikram Thapa | Kritesh Rauniyar | Tanmoy Chakraborty | MD Arfeen Zeeshan | Dheeraj Kodati | Satya Keerthi | Sahar Moradizeyveh | Firoj Alam | Md Arid Hasan | Syed Ishtiaque Ahmed | Ye Kyaw Thu | Shantipriya Parida | Ihsan Ayyub Qazi | Lilian Diana Awuor Wanzare | Nelson Odhiambo Onyango | Clemencia Siro | Jane Wanjiru Kimani | Ibrahim Said Ahmad | Adem Chanie Ali | Martin Semmann | Chris Biemann | Shamsuddeen Hassan Muhammad | Seid Muhie Yimam
Findings of the Association for Computational Linguistics: ACL 2026

Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multi-event dataset with over 110K instances in 22 languages drawn from diverse online platforms and real-world events. Polarization is annotated along three axes, namely detection, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) fine-tuning six pretrained small language models; and (2) evaluating a range of open and closed large language models in few-shot and zero-shot settings. Results show that while most models perform well on binary polarization detection, they achieve substantially lower performance when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and underscore the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
Anna Borisiuk | Andrey Savchenko | Alexander Panchenko | Elena Tutubalina
Findings of the Association for Computational Linguistics: ACL 2026

Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10–50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models
Andrey Moskalenko | Daria Pugacheva | Denis Shepelev | Andrey Kuznetsov | Vlad Shakhuro | Elena Tutubalina
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Vision Language Action (VLA) models are widely used in Embodied AI, enabling robots to interpret and execute language instructions. However, their robustness to natural language variability in real-world scenarios has not been thoroughly investigated. In this work, we present a novel systematic study of the robustness of state-of-the-art VLA models under linguistic perturbations. Specifically, we evaluate model performance under two types of instruction noise: (1) human-generated paraphrasing and (2) the addition of irrelevant context. We further categorize irrelevant contexts into two groups according to their length and their semantic and lexical proximity to robot commands. In this study, we observe consistent performance degradation as context size expands. We also demonstrate that the model can exhibit relative robustness to random context, with a performance drop within 10%, while semantically and lexically similar context of the same length can trigger a quality decline of around 50%. Human paraphrases of instructions lead to a drop of nearly 20%. Our results highlight a critical gap in the safety and efficiency of modern VLA models for real-world deployment.

Confidence Leaps in LLM Reasoning: Early Stopping and Cross-Model Transfer
Pavel Tikhonov | Ivan Oseledets | Elena Tutubalina
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

We challenge the common assumption that Large Language Models (LLMs) build confidence gradually during reasoning. Instead, we find that conviction is often reached in a discrete "moment of insight", characterized by a sudden and sharp increase in an answer’s probability, a phenomenon we term a "confidence leap". Leveraging this discovery, we introduce a training-free, model-agnostic early-stopping heuristic that halts generation upon detecting such a leap, significantly reducing the generation length without sacrificing accuracy. We also demonstrate that the reasoning text leading up to this leap is semantically potent and transferable: feeding this partial reasoning to a different model family substantially boosts its performance. This suggests that the "confidence leap" marks a shared, interpretable reasoning milestone, not just a model-specific statistical artifact.
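
The early-stopping idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the probability of the answer has already been measured after each reasoning step, and the function name `find_confidence_leap`, the `min_jump` threshold, and the example trace are all hypothetical.

```python
from typing import Optional, Sequence


def find_confidence_leap(probs: Sequence[float], min_jump: float = 0.4) -> Optional[int]:
    """Return the index of the first step whose answer probability rises by
    at least `min_jump` over the previous step, or None if there is no such step.

    Both `probs` and `min_jump` are illustrative assumptions, not values
    taken from the paper.
    """
    for step in range(1, len(probs)):
        if probs[step] - probs[step - 1] >= min_jump:
            return step
    return None


# Hypothetical per-step probabilities of the final answer during reasoning.
trace = [0.05, 0.07, 0.10, 0.72, 0.80, 0.85]

# Generation would be halted at this step; None means "never stop early".
stop_at = find_confidence_leap(trace)
```

With a detector of this kind, generation would stop at the returned step instead of running to the end of the reasoning trace; how the per-step probabilities are obtained is left open here.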

Out of Distribution, Out of Luck: Process Rewards Misguide Reasoning Models
Alexey Dontsov | Anton Korznikov | Andrey V. Galichin | Elena Tutubalina
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Process Reward Models (PRMs) have emerged as a promising approach for guiding large language models (LLMs) through multi-step reasoning by providing step-level feedback during inference. However, our evaluation across 7 LLMs reveals a failure mode: while PRMs improve performance for instruct mathematical models, they fail to enhance and sometimes degrade reasoning model performance. Through systematic analysis with linear probes, we identify distinct reward prediction patterns that differentiate reasoning from non-reasoning model outputs. To understand this mechanism, we train Sparse Autoencoders on the Qwen2.5-Math-PRM and analyze reasoning features. Our analysis reveals that 80% of these features respond to formatting artifacts (whitespace patterns, Unicode tokens, punctuation) rather than mathematical content. Reasoning model outputs exhibit distinct metacognitive patterns absent from standard mathematical solutions. This explains why they lead to unreliable reward estimation. Our findings expose a fundamental limitation in applying existing reward models to reasoning systems and provide mechanistic insights into this failure mode. We release our trained SAEs to facilitate future research into reward model interpretability.

SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Viktoriia Zinkovich | Anton Antonov | Andrei Spiridonov | Denis Shepelev | Andrey Moskalenko | Daria Pugacheva | Elena Tutubalina | Andrey Kuznetsov | Vlad Shakhuro
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases—crucial in real-world applications where users express the same intent in varied ways—remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA—a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing—even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.

The Silence of the Facts: Popularity as a Barrier to Machine Unlearning
Anna Borisiuk | Andrey Savchenko | Alexander Panchenko | Elena Tutubalina
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Machine Unlearning is a valuable ability of LLMs, enabling the removal of unsafe, outdated, or private information. Existing unlearning methods, however, are often evaluated under the assumption that all facts are equally challenging to forget. Controllable knowledge removal is essential for reliable NLP systems. In this paper, we investigate whether fact popularity influences the efficiency of LLM unlearning. To answer this question, we build the **UNLamb** benchmark, designed to systematically investigate this relationship. It consists of 11.6k question-answer pairs derived from real-world knowledge in Wikidata, explicitly partitioned into rare and popular facts. Using this benchmark, we perform a comprehensive evaluation of state-of-the-art unlearning algorithms on a set of models of different sizes. We conduct a comprehensive analysis of four unlearning methods across three validation sets and two LLMs. We show that larger models struggle more to forget popular entities, often damaging related knowledge in the process. In contrast, it is much easier to remove rare facts without side effects.

One Task Vector is not Enough: A Large-Scale Study for In-Context Learning
Pavel Tikhonov | Ivan Oseledets | Elena Tutubalina
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks using few examples, with task vectors, defined as specific hidden state activations, hypothesized to encode task information. Existing studies are limited by small-scale benchmarks, restricting comprehensive analysis. We introduce QᴜɪᴛᴇAFᴇᴡ, a novel dataset of 3,096 diverse few-shot tasks, each with 30 input-output pairs derived from the Alpaca dataset. Experiments with Llama-3-8B on QᴜɪᴛᴇAFᴇᴡ reveal: (1) task vector performance peaks at an intermediate layer (e.g., 15th), (2) effectiveness varies significantly by task type, and (3) complex tasks rely on multiple, subtask-specific vectors rather than a single vector, suggesting distributed task knowledge representation.

Evolutionary Search for Automated Design of Uncertainty Quantification Methods
Mikhail Seleznyov | Daniil Korbut | Viktor Moskvoretskii | Oleg Somov | Alexander Panchenko | Elena Tutubalina
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually-designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ qualitatively distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance – Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results hint that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Nikita Afonin | Nikita Andriianov | Vahagn Hovhannisyan | Nikhil Bageshpura | Kyle Liu | Kevin Zhu | Sunishchal Dev | Ashwinee Panda | Oleg Rogov | Elena Tutubalina | Alexander Panchenko | Mikhail Seleznyov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, appearing with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM while prioritizing context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.

2025

Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA
Nikolas Evkarpidi | Elena Tutubalina
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-Code generation modules, a self-correction mechanism, and retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves a performance comparable to proprietary LLMs in QA tasks over tables.

SkipCLM: Enhancing Crosslingual Alignment of Decoder Transformer Models via Contrastive Learning and Skip Connection
Nikita Sushko | Alexander Panchenko | Elena Tutubalina
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

This paper proposes SkipCLM, a novel method for improving multilingual machine translation in Decoder Transformers. We augment contrastive learning for cross-lingual alignment with a trainable skip connection to preserve information crucial for accurate target language generation. Experiments with XGLM-564M on the Flores-101 benchmark demonstrate improved performance, particularly for en-de and en-zh direction translations, compared to direct sequence-to-sequence training and existing contrastive learning methods. Code is available at: https://github.com/s-nlp/skipclm.

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Daniil Moskovskiy | Nikita Sushko | Sergey Pletenev | Elena Tutubalina | Alexander Panchenko
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets have superior performance to those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to help further research in multilingual text detoxification.

Two Steps from Hell: Compositionality on Chemical LMs
Veronika Ganeeva | Kuzma Khrabrov | Artur Kadurin | Elena Tutubalina
Findings of the Association for Computational Linguistics: EMNLP 2025

This paper investigates compositionality in chemical language models (ChemLLMs). We introduce STEPS, a benchmark with compositional questions that reflect intricate chemical structures and reactions, to evaluate models’ understanding of chemical language. Our approach focuses on identifying and analyzing compositional patterns within chemical data, allowing us to evaluate how well existing LLMs can handle complex queries. Experiments with state-of-the-art ChemLLMs show significant performance drops in compositional tasks, highlighting the need for models that move beyond pattern recognition. By creating and sharing this benchmark, we aim to enhance the development of more capable chemical LLMs and provide a resource for future research on compositionality in chemical understanding.

When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
Mikhail Seleznyov | Mikhail Chaichuk | Gleb Ershov | Alexander Panchenko | Elena Tutubalina | Oleg Somov
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 4 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen and Gemma families across 52 tasks from the Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models’ current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.

CLEAR: Character Unlearning in Textual and Visual Modalities
Alexey Dontsov | Dmitrii Korzh | Alexey Zhavoronkin | Boris Mikheev | Denis Bobkov | Aibek Alanov | Oleg Rogov | Ivan Oseledets | Elena Tutubalina
Findings of the Association for Computational Linguistics: ACL 2025

Machine Unlearning (MU) is critical for removing private or hazardous information from deep learning models. While MU has advanced significantly in unimodal (text or vision) settings, multimodal unlearning (MMU) remains underexplored due to the lack of open benchmarks for evaluating cross-modal data removal. To address this gap, we introduce CLEAR, the first open-source benchmark designed specifically for MMU. CLEAR contains 200 fictitious individuals and 3,700 images linked with corresponding question-answer pairs, enabling a thorough evaluation across modalities. We conduct a comprehensive analysis of 11 MU methods (e.g., SCRUB, gradient ascent, DPO) across four evaluation sets, demonstrating that jointly unlearning both modalities outperforms single-modality approaches. The dataset is available at https://huggingface.co/datasets/therem/CLEAR.

RuCCoD: Towards Automated ICD Coding in Russian
Alexandr Nesterov | Andrey Sakhovskiy | Ivan Sviridov | Airat Valiev | Vladimir Makharev | Petr Anokhin | Galina Zubkova | Elena Tutubalina
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnosis) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training with the automated predicted codes leads to a significant improvement in accuracy compared to manually annotated data from physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts. Our code and dataset are available at https://github.com/auto-icd-coding/ruccod.

SmurfCat at SHROOM-CAP: Factual but Awkward? Fluent but Wrong? Tackling Both in LLM Scientific QA
Timur Ionov | Evgenii Nikolaev | Artem Vazhentsev | Mikhail Chaichuk | Anton Korznikov | Elena Tutubalina | Alexander Panchenko | Vasily Konovalov | Elisei Rykov
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)

Large Language Models (LLMs) often generate hallucinations, a critical issue in domains like scientific communication where factual accuracy and fluency are essential. The SHROOM-CAP shared task addresses this challenge by evaluating Factual Mistakes and Fluency Mistakes across diverse languages, extending earlier SHROOM editions to the scientific domain. We present Smurfcat, our system for SHROOM-CAP, which integrates three complementary approaches: uncertainty estimation (white-box and black-box signals), encoder-based classifiers (Multilingual Modern BERT), and decoder-based judges (instruction-tuned LLMs with classification heads). Results show that decoder-based judges achieve the strongest overall performance, while uncertainty methods and encoders provide complementary strengths. Our findings highlight the value of combining uncertainty signals with encoder and decoder architectures for robust, multilingual detection of hallucinations and related errors in scientific publications.

Bridging the Gap with RedSQL: A Russian Text-to-SQL Benchmark for Domain-Specific Applications
Irina Brodskaya | Elena Tutubalina | Oleg Somov
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

We present the first domain-specific text-to-SQL benchmark in Russian, targeting fields with high operational load where rapid decision-making is critical. The benchmark spans 9 domains, including healthcare, aviation, and others, and comprises 409 curated query pairs. It is designed to test model generalization under domain shift, introducing challenges such as specialized terminology and complex schema structures. Evaluation of state-of-the-art large language models (LLMs) reveals a significant performance drop in comparison to open-domain academic benchmarks, highlighting the need for domain-aware approaches in text-to-SQL. The benchmark is available at: https://github.com/BrodskaiaIrina/functional-text2sql-subsets

2024

This paper describes the results of the Knowledge Graph Question Answering (KGQA) shared task that was co-located with the TextGraphs 2024 workshop. In this task, given a textual question and a list of entities with the corresponding KG subgraphs, the participating system should choose the entity that correctly answers the question. Our competition attracted thirty teams, four of which outperformed our strong ChatGPT-based zero-shot baseline. In this paper, we overview the participating systems and analyze their performance according to a large-scale automatic evaluation. To the best of our knowledge, this is the first competition aimed at the KGQA problem using the interaction between large language models (LLMs) and knowledge graphs.

Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing
Dmitry Ustalov | Yanjun Gao | Alexander Panchenko | Elena Tutubalina | Irina Nikishina | Arti Ramesh | Andrey Sakhovskiy | Ricardo Usbeck | Gerald Penn | Marco Valentino
Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing

Biomedical Concept Normalization over Nested Entities with Partial UMLS Terminology in Russian
Natalia Loukachevitch | Andrey Sakhovskiy | Elena Tutubalina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present a new manually annotated dataset of PubMed abstracts for concept normalization in Russian. It contains over 23,641 entity mentions in 756 documents linked to 4,544 unique concepts from the UMLS ontology. Compared to existing corpora, we explore two novel annotation characteristics: the nestedness of named entities and the incompleteness of the Russian medical terminology in UMLS. 4,424 entity mentions are linked to 1,535 unique English concepts absent in the Russian part of the UMLS ontology. We present several baselines for normalization over nested named entities obtained with state-of-the-art models such as SapBERT. Our experimental results show that models pre-trained on graph structural data from UMLS achieve superior performance in a zero-shot setting on bilingual terminology.

Biomedical Entity Representation with Graph-Augmented Multi-Objective Transformer
Andrey Sakhovskiy | Natalia Semenova | Artur Kadurin | Elena Tutubalina
Findings of the Association for Computational Linguistics: NAACL 2024

Modern biomedical concept representations are mostly trained on synonymous concept names from a biomedical knowledge base, ignoring the inter-concept interactions and a concept’s local neighborhood in a knowledge base graph. In this paper, we introduce Biomedical Entity Representation with a Graph-Augmented Multi-Objective Transformer (BERGAMOT), which adopts the power of pre-trained language models (LMs) and graph neural networks to capture both inter-concept and intra-concept interactions from the multilingual UMLS graph. To obtain fine-grained graph representations, we introduce two additional graph-based objectives: (i) a node-level contrastive objective and (ii) the Deep Graph Infomax (DGI) loss, which maximizes the mutual information between a local subgraph and a high-level graph summary. We apply contrastive loss on textual and graph representations to make them less sensitive to surface forms and enable intermodal knowledge exchange. BERGAMOT achieves state-of-the-art results in zero-shot entity linking without task-specific supervision on 4 of 5 languages of the Mantra corpus and on 8 of 10 languages of the XL-BEL benchmark.

Lost in Translation: Chemical Language Models and the Misunderstanding of Molecule Structures
Veronika Ganeeva | Andrey Sakhovskiy | Kuzma Khrabrov | Andrey Savchenko | Artur Kadurin | Elena Tutubalina
Findings of the Association for Computational Linguistics: EMNLP 2024

The recent integration of chemistry with natural language processing (NLP) has advanced drug discovery. Molecule representation in language models (LMs) is crucial in enhancing chemical understanding. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for assessment of Chemistry LMs of different natures: trained solely on molecules for chemical tasks and on a combined corpus of natural language texts and string-based structures. The framework relies on molecule augmentations that preserve an underlying chemical, such as kekulization and cycle replacements. We evaluate encoder-only and generative LMs by calculating a metric based on the similarity score between distributed representations of molecules and their augmentations. Our experiments on ChEBI-20 and QM9 benchmarks show that these models exhibit significantly lower scores than graph-based molecular models trained without language modeling objectives. Additionally, our results on the molecule captioning task for cross-domain models, MolT5 and Text+Chem T5, demonstrate that the lower the representation-based evaluation metrics, the lower the classical text generation metrics like ROUGE and METEOR.
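
As a rough illustration of a similarity-based check of this kind, the sketch below scores how often a molecule's embedding is closest to the embedding of its own augmentation. It is one plausible reading of "a metric based on the similarity score", not the AMORE implementation: the use of cosine similarity and top-1 retrieval, the function names, and the toy vectors are all assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_retrieval_accuracy(
    originals: Sequence[Sequence[float]],
    augmented: Sequence[Sequence[float]],
) -> float:
    """Fraction of originals whose most similar augmented embedding is their own.

    `augmented[i]` is assumed to embed an augmentation of the molecule
    embedded by `originals[i]`.
    """
    hits = 0
    for i, emb in enumerate(originals):
        sims = [cosine(emb, aug) for aug in augmented]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(originals)


# Toy embeddings standing in for real model outputs (hypothetical values).
originals = [[1.0, 0.0], [0.0, 1.0]]
augmented = [[0.9, 0.1], [0.2, 0.8]]
score = top1_retrieval_accuracy(originals, augmented)
```

A model whose representations are stable under structure-preserving augmentations would score close to 1.0 on such a check, while a model that treats the augmented string as a different molecule would score lower.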

HSE NLP Team at MEDIQA-CORR 2024 Task: In-Prompt Ensemble with Entities and Knowledge Graph for Medical Error Correction
Airat Valiev | Elena Tutubalina
Proceedings of the 6th Clinical Natural Language Processing Workshop

This paper presents our LLM-based system designed for the MEDIQA-CORR @ NAACL-ClinicalNLP 2024 Shared Task 3, focusing on medical error detection and correction in medical records. Our approach consists of three key components: entity extraction, prompt engineering, and ensembling. First, we automatically extract biomedical entities such as therapies, diagnoses, and biological species. Next, we explore few-shot learning techniques and incorporate graph information from the MeSH database for the identified entities. Finally, we investigate two methods for ensembling: (i) combining the predictions of three previous LLMs using an AND strategy within a prompt and (ii) integrating the previous predictions into the prompt as separate ‘expert’ solutions, accompanied by trust scores representing their performance. The latter system ranked second with a BERTScore of 0.8059 and third with an aggregated score of 0.7806 among the 15 teams’ solutions in the shared task.

AIRI NLP Team at EHRSQL 2024 Shared Task: T5 and Logistic Regression to the Rescue
Oleg Somov | Alexey Dontsov | Elena Tutubalina
Proceedings of the 6th Clinical Natural Language Processing Workshop

This paper presents a system developed for the Clinical NLP 2024 Shared Task, focusing on reliable text-to-SQL modeling on Electronic Health Records (EHRs). The goal is to create a model that accurately generates SQL queries for answerable questions while avoiding incorrect responses and handling unanswerable queries. Our approach comprises three main components: a query correspondence model, a text-to-SQL model, and an SQL verifier. For the query correspondence model, we trained a logistic regression model using hand-crafted features to distinguish between answerable and unanswerable queries. As for the text-to-SQL model, we utilized T5-3B as a pretrained language model, further fine-tuned on pairs of natural language questions and corresponding SQL queries. Finally, we applied the SQL verifier to inspect the resulting SQL queries. During the evaluation stage of the shared task, our system achieved an accuracy of 68.9% (metric version without penalty), placing fifth. While our approach did not surpass solutions based on large language models (LLMs) like ChatGPT, it demonstrates the promising potential of domain-specific specialized models that are more resource-efficient. The code is publicly available at https://github.com/runnerup96/EHRSQL-text2sql-solution.

2023

Graph-Enriched Biomedical Language Models: A Research Proposal
Andrey Sakhovskiy | Alexander Panchenko | Elena Tutubalina
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

Shifted PAUQ: Distribution shift in text-to-SQL
Oleg Somov | Elena Tutubalina
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

Semantic parsing plays a pivotal role in advancing the accessibility of human-computer interaction on a large scale. Spider, a widely recognized dataset for text2SQL, contains a wide range of natural language (NL) questions in English and corresponding SQL queries. The original splits of Spider and of PAUQ, its improved version adapted to Russian, assume that training and testing data are independent and identically distributed (i.i.d. split). In this work, we propose a target length split and a multilingual i.i.d. split to measure compositionality and cross-language generalization. We present experimental results of popular text2SQL models on the original, multilingual, and target length splits. We also construct a context-free grammar for the evaluation of compositionality in text2SQL in an out-of-distribution setting. We make the splits publicly available on HuggingFace hub via https://huggingface.co/datasets/composite/pauq

Vote’n’Rank: Revision of Benchmarking with Social Choice Theory
Mark Rofin | Vladislav Mikhailov | Mikhail Florinsky | Andrey Kravchenko | Tatiana Shavrina | Elena Tutubalina | Daniel Karabekyan | Ekaterina Artemova
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow the unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes Vote’n’Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The Vote’n’Rank’s procedures are more robust than the mean average while being able to handle missing performance scores and determine conditions under which the system becomes the winner.
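
To make the contrast between mean aggregation and a social-choice rule concrete, here is a minimal sketch using the Borda count, a classical voting rule in which every task acts as a voter and a system earns one point for each competitor it strictly beats on that task. The abstract does not name the specific procedures Vote’n’Rank implements, so the choice of Borda, the function names, and the toy scores below are assumptions for illustration only.

```python
from typing import Dict, List


def mean_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Rank systems by arithmetic mean over tasks (higher is better)."""
    return sorted(scores, key=lambda s: -sum(scores[s]) / len(scores[s]))


def borda_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Rank systems by Borda points: per task, one point for every
    other system that is strictly outperformed."""
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {s: 0 for s in systems}
    for t in range(n_tasks):
        for s in systems:
            points[s] += sum(
                scores[s][t] > scores[o][t] for o in systems if o != s
            )
    # sorted() is stable, so ties keep the input order of the systems.
    return sorted(systems, key=lambda s: -points[s])


# Toy benchmark (made-up numbers): system A has one outlier task score.
scores = {
    "A": [0.99, 0.50, 0.50],
    "B": [0.60, 0.60, 0.60],
    "C": [0.55, 0.55, 0.55],
}
by_mean = mean_ranking(scores)
by_borda = borda_ranking(scores)
```

With these toy numbers the mean places A first because of its single very high score, whereas the Borda count places B first because B wins on two of the three tasks. This is the kind of disagreement between aggregation rules that motivates looking beyond the mean.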

2022

Cross-Modal Contextualized Hidden State Projection Method for Expanding of Taxonomic Graphs
Irina Nikishina | Alsu Vakhitova | Elena Tutubalina | Alexander Panchenko
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

A taxonomy is a graph of terms organized hierarchically using is-a (hypernymy) relations. We suggest a novel candidate-free task formulation for the taxonomy enrichment task. To solve the task, we leverage lexical knowledge from the pre-trained models to predict new words missing in the taxonomic resource. We propose a method that combines graph- and text-based contextualized representations from transformer networks to predict new entries to the taxonomy. We have evaluated the method suggested for this task against text-only baselines based on BERT and fastText representations. The results demonstrate that incorporation of graph embeddings is beneficial in the task of hyponym prediction using contextualized models. We hope the new challenging task will foster further research in automatic text graph construction methods.

For the past seven years, the Social Media Mining for Health Applications (#SMM4H) shared tasks have promoted the community-driven development and evaluation of advanced natural language processing systems to detect, extract, and normalize health-related information in public, user-generated content. This seventh iteration consists of ten tasks that include English and Spanish posts on Twitter, Reddit, and WebMD. Interest in the #SMM4H shared tasks continues to grow, with 117 teams that registered and 54 teams that participated in at least one task—a 17.5% and 35% increase in registration and participation, respectively, over the last iteration. This paper provides an overview of the tasks and participants’ systems. The data sets remain available upon request, and new systems can be evaluated through the post-evaluation phase on CodaLab.

SMM4H 2022 Task 2: Dataset for stance and premise detection in tweets about health mandates related to COVID-19
Vera Davydova | Elena Tutubalina
Proceedings of the Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper is an organizers’ report of the competition on argument mining systems dealing with English tweets about COVID-19 health mandates. This competition was held within the framework of the SMM4H 2022 shared tasks. During the competition, the participants were offered two subtasks: stance detection and premise classification. We present a manually annotated corpus containing 6,156 short posts from Twitter on three topics related to the COVID-19 pandemic: school closures, stay-at-home orders, and wearing masks. We hope the prepared dataset will support further research on argument mining in the health field.

Entity Linking over Nested Named Entities for Russian
Natalia Loukachevitch | Pavel Braslavski | Vladimir Ivanov | Tatiana Batura | Suresh Manandhar | Artem Shelmanov | Elena Tutubalina
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we describe entity linking annotation over nested named entities in the recently released Russian NEREL dataset for information extraction. The NEREL collection is currently the largest Russian dataset annotated with entities and relations. It includes 933 news texts with annotation of 29 entity types and 49 relation types. The paper describes the main design principles behind NEREL’s entity linking annotation, provides its statistics, and reports evaluation results for several entity linking baselines. To date, 38,152 entity mentions in 933 documents are linked to Wikidata. The NEREL dataset is publicly available.

Medical Crossing: a Cross-lingual Evaluation of Clinical Entity Linking
Anton Alekseev | Zulfat Miftahutdinov | Elena Tutubalina | Artem Shelmanov | Vladimir Ivanov | Vladimir Kokh | Alexander Nesterov | Manvel Avetisian | Andrei Chertok | Sergey Nikolenko
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Medical data annotation requires highly qualified expertise. Despite the efforts devoted to medical entity linking in different languages, available data is very sparse in terms of both data volume and languages. In this work, we establish benchmarks for cross-lingual medical entity linking using clinical reports, clinical guidelines, and medical research papers. We present a test set filtering procedure designed to analyze the “hard cases” of entity linking approaching zero-shot cross-lingual transfer learning, evaluate state-of-the-art models, and draw several interesting conclusions based on our evaluation results.

PAUQ: Text-to-SQL in Russian
Daria Bakshandaeva | Oleg Somov | Ekaterina Dmitrieva | Vera Davydova | Elena Tutubalina
Findings of the Association for Computational Linguistics: EMNLP 2022

Semantic parsing is an important task that allows to democratize human-computer interaction. One of the most popular text-to-SQL datasets with complex and diverse natural language (NL) questions and SQL queries is Spider. We construct and complement a Spider dataset for Russian, thus creating the first publicly available text-to-SQL dataset for this language. While examining its components - NL questions, SQL queries and databases content - we identify limitations of the existing database structure, fill out missing values for tables and add new requests for underrepresented categories. We select thirty functional test sets with different features that can be used for the evaluation of neural models’ abilities. To conduct the experiments, we adapt baseline architectures RAT-SQL and BRIDGE and provide in-depth query component analysis. On the target language, both models demonstrate strong results with monolingual training and improved accuracy in multilingual scenario. In this paper, we also study trade-offs between machine-translated and manually-created NL queries. At present, Russian text-to-SQL is lacking in datasets as well as trained models, and we view this work as an important step towards filling this gap.

RuCCoN: Clinical Concept Normalization in Russian
Alexandr Nesterov | Galina Zubkova | Zulfat Miftahutdinov | Vladimir Kokh | Elena Tutubalina | Artem Shelmanov | Anton Alekseev | Manvel Avetisian | Andrey Chertok | Sergey Nikolenko
Findings of the Association for Computational Linguistics: ACL 2022

We present RuCCoN, a new dataset for clinical concept normalization in Russian manually annotated by medical professionals. It contains over 16,028 entity mentions manually linked to over 2,409 unique concepts from the Russian language part of the UMLS ontology. We provide train/test splits for different settings (stratified, zero-shot, and CUI-less) and present strong baselines obtained with state-of-the-art models such as SapBERT. At present, Russian medical NLP is lacking in both datasets and trained models, and we view this work as an important step towards filling this gap. Our dataset and annotation guidelines are available at https://github.com/AIRI-Institute/RuCCoN.

A Comprehensive Evaluation of Biomedical Entity-centric Search
Elena Tutubalina | Zulfat Miftahutdinov | Vladimir Muravlev | Anastasia Shneyderman
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Biomedical information retrieval has often been studied as the task of determining whether a system correctly detects entity spans and links these entities to concepts from a given terminology. Most academic research has focused on evaluation of named entity recognition (NER) and entity linking (EL) models, which are key components to recognizing diseases and genes in PubMed abstracts. In this work, we perform a fine-grained evaluation intended to understand the efficiency of a state-of-the-art BERT-based information extraction (IE) architecture as a biomedical search engine. We present a novel manually annotated dataset of abstracts for disease and gene search. The dataset contains 23K query-abstract pairs, where 152 queries are selected from logs of our target discovery platform and PubMed abstracts are annotated with relevance judgments. Specifically, the query list also includes a subset of concepts with at least one ambiguous concept name. As a baseline, we use off-the-shelf Elasticsearch with BM25. Our experiments on NER, EL, and retrieval in a zero-shot setup show that the neural IE architecture achieves superior performance for both disease and gene concept queries.

2021

KFU NLP Team at SMM4H 2021 Tasks: Cross-lingual and Cross-modal BERT-based Models for Adverse Drug Effects
Andrey Sakhovskiy | Zulfat Miftahutdinov | Elena Tutubalina
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

This paper describes neural models developed for the Social Media Mining for Health (SMM4H) 2021 Shared Task. We participated in two tasks on classification of tweets that mention an adverse drug effect (ADE) (Tasks 1a & 2) and two tasks on extraction of ADE concepts (Tasks 1b & 1c). For classification, we investigate the impact of joint use of BERT-based language models and drug embeddings obtained by a chemical-structure BERT-based encoder. The BERT-based multimodal models ranked first and second on classification of Russian (Task 2) and English tweets (Task 1a) with F1 scores of 57% and 61%, respectively. For Tasks 1b and 1c, we utilized the previous year’s best solution based on the EnDR-BERT model with additional corpora. Our model achieved the best results in Task 1c, obtaining an F1 of 29%.

The global growth of social media usage over the past decade has opened research avenues for mining health-related information that can ultimately be used to improve public health. The Social Media Mining for Health Applications (#SMM4H) shared tasks, in their sixth iteration, sought to advance the use of social media texts such as Twitter for pharmacovigilance, disease tracking, and patient-centered outcomes. #SMM4H 2021 hosted a total of eight tasks that included reruns of adverse drug effect extraction in English and Russian and newer tasks such as detecting medication non-adherence from Twitter and WebMD forum, detecting self-reported adverse pregnancy outcomes, detecting cases and symptoms of COVID-19, identifying occupations mentioned in Spanish by Twitter users, and detecting self-reported breast cancer diagnosis. The eight tasks included a total of 12 individual subtasks spanning three languages, requiring methods for binary classification, multi-class classification, named entity recognition, and entity normalization. With a total of 97 registering teams and 40 teams submitting predictions, the interest in the shared tasks grew by 70% and participation grew by 38% compared to the previous iteration.

NEREL: A Russian Dataset with Nested Named Entities, Relations and Events
Natalia Loukachevitch | Ekaterina Artemova | Tatiana Batura | Pavel Braslavski | Ilia Denisov | Vladimir Ivanov | Suresh Manandhar | Alexander Pugachev | Elena Tutubalina
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse level. NEREL can facilitate development of novel models that can extract relations between nested named entities, as well as relations on both sentence and document levels. NEREL also contains the annotation of events involving named entities and their roles in the events. The NEREL collection is available via https://github.com/nerel-ds/NEREL.

2020

KFU NLP Team at SMM4H 2020 Tasks: Cross-lingual Transfer Learning with Pretrained Language Models for Drug Reactions
Zulfat Miftahutdinov | Andrey Sakhovskiy | Elena Tutubalina
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

This paper describes neural models developed for the Social Media Mining for Health (SMM4H) 2020 shared tasks. Specifically, we participated in two tasks. We investigate the use of a language representation model BERT pretrained on a large-scale corpus of 5 million health-related user reviews in English and Russian. The ensemble of neural networks for extraction and normalization of adverse drug reactions ranked first among 7 teams at the SMM4H 2020 Task 3 and obtained a relaxed F1 of 46%. The BERT-based multilingual model for classification of English and Russian tweets that report adverse reactions ranked second among 16 and 7 teams at two first subtasks of the SMM4H 2019 Task 2 and obtained a relaxed F1 of 58% on English tweets and 51% on Russian tweets.

Overview of the Fifth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at COLING 2020
Ari Klein | Ilseyar Alimova | Ivan Flores | Arjun Magge | Zulfat Miftahutdinov | Anne-Lyse Minard | Karen O’Connor | Abeed Sarker | Elena Tutubalina | Davy Weissenbacher | Graciela Gonzalez-Hernandez
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

The vast amount of data on social media presents significant opportunities and challenges for utilizing it as a resource for health informatics. The fifth iteration of the Social Media Mining for Health Applications (#SMM4H) shared tasks sought to advance the use of Twitter data (tweets) for pharmacovigilance, toxicovigilance, and epidemiology of birth defects. In addition to re-runs of three tasks, #SMM4H 2020 included new tasks for detecting adverse effects of medications in French and Russian tweets, characterizing chatter related to prescription medication abuse, and detecting self reports of birth defect pregnancy outcomes. The five tasks required methods for binary classification, multi-class classification, and named entity recognition (NER). With 29 teams and a total of 130 system submissions, participation in the #SMM4H shared tasks continues to grow.

Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task
Graciela Gonzalez-Hernandez | Ari Z. Klein | Ivan Flores | Davy Weissenbacher | Arjun Magge | Karen O'Connor | Abeed Sarker | Anne-Lyse Minard | Elena Tutubalina | Zulfat Miftahutdinov | Ilseyar Alimova
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models
Elena Tutubalina | Artur Kadurin | Zulfat Miftahutdinov
Proceedings of the 28th International Conference on Computational Linguistics

Linking of biomedical entity mentions to various terminologies of chemicals, diseases, genes, adverse drug reactions is a challenging task, often requiring non-syntactic interpretation. A large number of biomedical corpora and state-of-the-art models have been introduced in the past five years. However, there are no general guidelines regarding the evaluation of models on these corpora in single- and cross-terminology settings. In this work, we perform a comparative evaluation of various benchmarks and study the efficiency of state-of-the-art neural architectures based on Bidirectional Encoder Representations from Transformers (BERT) for linking of three entity types across three domains: research abstracts, drug labels, and user-generated texts on drug therapy in English. We have made the source code and results available at https://github.com/insilicomedicine/Fair-Evaluation-BERT.

Ad Lingua: Text Classification Improves Symbolism Prediction in Image Advertisements
Andrey Savchenko | Anton Alekseev | Sejeong Kwon | Elena Tutubalina | Evgeny Myasnikov | Sergey Nikolenko
Proceedings of the 28th International Conference on Computational Linguistics

Understanding image advertisements is a challenging task, often requiring non-literal interpretation. We argue that standard image-based predictions are insufficient for symbolism prediction. Following the intuition that texts and images are complementary in advertising, we introduce a multimodal ensemble of a state-of-the-art image-based classifier, a classifier based on an object detection architecture, and a fine-tuned language model applied to texts extracted from ads by OCR. The resulting system establishes a new state of the art in symbolism prediction.

Cross-lingual Transfer Learning for Semantic Role Labeling in Russian
Ilseyar Alimova | Elena Tutubalina | Alexander Kirillovich
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)

This work is devoted to the semantic role labeling (SRL) task in Russian. We investigate the role of transfer learning strategies between English FrameNet and Russian FrameBank corpora. We perform experiments with embeddings obtained from various types of multilingual language models, including BERT, XLM-R, MUSE, and LASER. For evaluation, we use a Russian FrameBank dataset. As source data for transfer learning, we experimented with the full version of FrameNet and the reduced dataset with a smaller number of semantic roles identical to FrameBank. Evaluation results demonstrate that BERT embeddings show the best transfer capabilities. The model with pretraining on the reduced English SRL data and fine-tuning on the Russian SRL data shows a macro-averaged F1-measure of 79.8%, which is above our baseline of 78.4%.

2019

Entity-level Classification of Adverse Drug Reactions: a Comparison of Neural Network Models
Ilseyar Alimova | Elena Tutubalina
Proceedings of the 2019 Workshop on Widening NLP

This paper presents our experimental work on exploring the potential of neural network models developed for aspect-based sentiment analysis for entity-level adverse drug reaction (ADR) classification. Our goal is to explore how to represent local context around ADR mentions and learn an entity representation, interacting with its context. We conducted extensive experiments on various sources of text-based information, including social media, electronic health records, and abstracts of scientific articles from PubMed. The results show that Interactive Attention Neural Network (IAN) outperformed other models on four corpora in terms of macro F-measure. This work is an abridged version of our recent paper accepted to Programming and Computer Software journal in 2019.

AspeRa: Aspect-Based Rating Prediction Based on User Reviews
Elena Tutubalina | Valentin Malykh | Sergey Nikolenko | Anton Alekseev | Ilya Shenbin
Proceedings of the 2019 Workshop on Widening NLP

We propose a novel Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items. It is based on aspect extraction with neural networks and combines the advantages of deep learning and topic modeling. It is mainly designed for recommendations, but an important secondary goal of AspeRa is to discover coherent aspects of reviews that can be used to explain predictions or for user profiling. We conduct a comprehensive empirical study of AspeRa, showing that it outperforms state-of-the-art models in terms of recommendation quality and produces interpretable aspects. This paper is an abridged version of our work (Nikolenko et al., 2019).

KFU NLP Team at SMM4H 2019 Tasks: Want to Extract Adverse Drugs Reactions from Tweets? BERT to The Rescue
Zulfat Miftahutdinov | Ilseyar Alimova | Elena Tutubalina
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

This paper describes a system developed for the Social Media Mining for Health (SMM4H) 2019 shared tasks. Specifically, we participated in three tasks. The goals of the first two tasks are to classify whether a tweet contains mentions of adverse drug reactions (ADR) and extract these mentions, respectively. The objective of the third task is to build an end-to-end solution: first, detect ADR mentions and then map these entities to concepts in a controlled vocabulary. We investigate the use of a language representation model BERT trained to obtain semantic representations of social media texts. Our experiments on a dataset of user reviews showed that BERT is superior to state-of-the-art models based on recurrent neural networks. The BERT-based system for Task 1 obtained an F1 of 57.38%, with improvements up to +7.19% F1 over a score averaged across all 43 submissions. The ensemble of neural networks with a voting scheme for named entity recognition ranked first among 9 teams at the SMM4H 2019 Task 2 and obtained a relaxed F1 of 65.8%. The end-to-end model based on BERT for ADR normalization ranked first at the SMM4H 2019 Task 3 and obtained a relaxed F1 of 43.2%.

Distant Supervision for Sentiment Attitude Extraction
Nicolay Rusnachenko | Natalia Loukachevitch | Elena Tutubalina
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

News articles often convey attitudes between the mentioned subjects, which is essential for understanding the described situation. In this paper, we describe a new approach to distant supervision for extracting sentiment attitudes between named entities mentioned in texts. Two factors (pair-based and frame-based) were used to automatically label an extensive news collection, dubbed RuAttitudes. The latter became a basis for adapting and training convolutional architectures, including piecewise max pooling and full use of information across different sentences. The results show that models trained with RuAttitudes outperform those trained with only a supervised learning approach and achieve a 13.4% increase in F1-score on the RuSentRel collection.

Detecting Adverse Drug Reactions from Biomedical Texts with Neural Networks
Ilseyar Alimova | Elena Tutubalina
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Detection of adverse drug reactions in post-approval periods is a crucial challenge for pharmacology. Social media and electronic clinical reports are becoming increasingly popular as a source for obtaining health-related information. In this work, we focus on extracting information about adverse drug reactions from various sources of biomedical text-based information, including biomedical literature and social media. We formulate the problem as a binary classification task and compare the performance of four state-of-the-art attention-based neural networks in terms of the F-measure. We show the effectiveness of these methods on four different benchmarks.

Deep Neural Models for Medical Concept Normalization in User-Generated Texts
Zulfat Miftahutdinov | Elena Tutubalina
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a health-related entity mention in a free-form text to a concept in a controlled vocabulary, usually to the standard thesaurus in the Unified Medical Language System (UMLS). This is a challenging task since medical terminology is very different when coming from health care professionals or from the general public in the form of social media texts. We approach it as a sequence learning problem with powerful neural networks such as recurrent neural networks and contextualized word representation models trained to obtain semantic representations of social media expressions. Our experimental evaluation over three different benchmarks shows that neural architectures leverage the semantic meaning of the entity mention and significantly outperform existing state-of-the-art models.

2015

Clustering-based Approach to Multiword Expression Extraction and Ranking
Elena Tutubalina
Proceedings of the 11th Workshop on Multiword Expressions

2014

Unsupervised Approach to Extracting Problem Phrases from User Reviews of Products
Elena Tutubalina | Vladimir Ivanov
Proceedings of the First AHA!-Workshop on Information Discovery in Text

Co-authors

Graciela Gonzalez 5

Davy Weissenbacher 5

Anton Alekseev 4

Alexey Dontsov 4

Vladimir Ivanov 4

Artur Kadurin 4

Natalia Loukachevitch 4

Sergey Nikolenko 4

Ivan Oseledets 4

Karen O’Connor 4

Andrey Savchenko 4

Vera Davydova 3

Anton Korznikov 3

Martin Krallinger 3

Irina Nikishina 3

Mikhail Seleznyov 3

Artem Shelmanov 3

Mohammed Ali Al-Garadi 2

Ekaterina Artemova 2

Manvel Avetisian 2

Tatiana Batura 2

Anna Borisiuk 2

Pavel Braslavski 2

Mikhail Chaichuk 2

Andrey V. Galichin 2

Veronika Ganeeva 2

Kuzma Khrabrov 2

Vladimir Kokh 2

Andrey Kuznetsov 2

Salvador Lima-López 2

Suresh Manandhar 2

Anne-Lyse Minard 2

Antonio Miranda-Escalada 2

Andrey Moskalenko 2

Alexandr Nesterov 2

Daria Pugacheva 2

Vlad Shakhuro 2

Denis Shepelev 2

Nikita Sushko 2

Pavel Tikhonov 2

Ricardo Usbeck 2

Dmitry Ustalov 2

Galina Zubkova 2

Rana Abdullah 1

Idris Abdulmumin 1

Cengiz Acarturk 1

Nikita Afonin 1

Ibrahim Said Ahmad 1

Syed Ishtiaque Ahmed 1

Adem Chanie Ali 1

Nikita Andriianov 1

Anton Antonov 1

Abinew Ali Ayele 1

Nikhil Bageshpura 1

Daria Bakshandaeva 1

Debayan Banerjee 1

Chris Biemann 1

Irina Brodskaya 1

Tanmoy Chakraborty 1

Andrei Chertok 1

Andrey Chertok 1

Alessandra Teresa Cignarella 1

Sunishchal Dev 1

Ekaterina Dmitrieva 1

Darryl Estrada Zavala 1

Nikolas Evkarpidi 1

Eulalia Farre 1

Eulàlia Farré-Maduell 1

Mikhail Florinsky 1

Simona Frenda 1

Luis Gasco Sánchez 1

Robert Geislinger 1

Md. Arid Hasan 1

Vahagn Hovhannisyan 1

Aung Kyaw Htet 1

Longquan Jiang 1

Daniel Karabekyan 1

Satya Keerthi 1

Jane Wanjiru Kimani 1

Alexander Kirillovich 1

Dheeraj Kodati 1

Vasily Konovalov 1

Daniil Korbut 1

Dmitrii Korzh 1

Angelie Kraft 1

Andrey Kravchenko 1

Mathias Leddin 1

Vladimir Makharev 1

Valentin Malykh 1

Vladislav Mikhailov 1

Boris Mikheev 1

Sahar Moradizeyveh 1

Daniil Moskovskiy 1

Viktor Moskvoretskii 1

Shamsuddeen Hassan Muhammad 1

Vladimir Muravlev 1

Evgeny Myasnikov 1

Cedric Möller 1

Alexander Nesterov 1

Evgenii Nikolaev 1

Nelson Odhiambo Onyango 1

Ashwinee Panda 1

Shantipriya Parida 1

Sergey Pletenev 1

Alexander Pugachev 1

Ihsan Ayyub Qazi 1

Kritesh Rauniyar 1

Raul Rodriguez-Esteban 1

Nicolay Rusnachenko 1

Mikhail Salnikov 1

Lucia Schmidt 1

Natalia Semenova 1

Martin Semmann 1

Tatiana Shavrina 1

Anastasia Shneyderman 1

Clemencia Siro 1

Andrei Spiridonov 1

Marco Antonio Stranisci 1

Ivan Sviridov 1

Surendrabikram Thapa 1

Aida Usmanova 1

Alsu Vakhitova 1

Marco Valentino 1

Artem Vazhentsev 1

Rudy Alexandro Garrido Veliz 1

Lilian Diana Awuor Wanzare 1

Seid Muhie Yimam 1

MD Arfeen Zeeshan 1

Alexey Zhavoronkin 1

Viktoriia Zinkovich 1

Venues