Elena Tutubalina


2022

pdf bib
RuCCoN: Clinical Concept Normalization in Russian
Alexandr Nesterov | Galina Zubkova | Zulfat Miftahutdinov | Vladimir Kokh | Elena Tutubalina | Artem Shelmanov | Anton Alekseev | Manvel Avetisian | Andrey Chertok | Sergey Nikolenko
Findings of the Association for Computational Linguistics: ACL 2022

We present RuCCoN, a new dataset for clinical concept normalization in Russian manually annotated by medical professionals. It contains over 16,028 entity mentions manually linked to over 2,409 unique concepts from the Russian language part of the UMLS ontology. We provide train/test splits for different settings (stratified, zero-shot, and CUI-less) and present strong baselines obtained with state-of-the-art models such as SapBERT. At present, Russian medical NLP is lacking in both datasets and trained models, and we view this work as an important step towards filling this gap. Our dataset and annotation guidelines are available at https://github.com/sberbank-ai-lab/RuCCoN.

2021

pdf bib
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
Arjun Magge | Ari Klein | Antonio Miranda-Escalada | Mohammed Ali Al-garadi | Ilseyar Alimova | Zulfat Miftahutdinov | Eulalia Farre-Maduell | Salvador Lima Lopez | Ivan Flores | Karen O'Connor | Davy Weissenbacher | Elena Tutubalina | Abeed Sarker | Juan M Banda | Martin Krallinger | Graciela Gonzalez-Hernandez
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

pdf bib
Overview of the Sixth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at NAACL 2021
Arjun Magge | Ari Klein | Antonio Miranda-Escalada | Mohammed Ali Al-Garadi | Ilseyar Alimova | Zulfat Miftahutdinov | Eulalia Farre | Salvador Lima López | Ivan Flores | Karen O’Connor | Davy Weissenbacher | Elena Tutubalina | Abeed Sarker | Juan Banda | Martin Krallinger | Graciela Gonzalez-Hernandez
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

The global growth of social media usage over the past decade has opened research avenues for mining health related information that can ultimately be used to improve public health. The Social Media Mining for Health Applications (#SMM4H) shared tasks in its sixth iteration sought to advance the use of social media texts such as Twitter for pharmacovigilance, disease tracking and patient centered outcomes. #SMM4H 2021 hosted a total of eight tasks that included reruns of adverse drug effect extraction in English and Russian and newer tasks such as detecting medication non-adherence from Twitter and WebMD forum, detecting self-reported adverse pregnancy outcomes, detecting cases and symptoms of COVID-19, identifying occupations mentioned in Spanish by Twitter users, and detecting self-reported breast cancer diagnosis. The eight tasks included a total of 12 individual subtasks spanning three languages requiring methods for binary classification, multi-class classification, named entity recognition and entity normalization. With a total of 97 registering teams and 40 teams submitting predictions, the interest in the shared tasks grew by 70% and participation grew by 38% compared to the previous iteration.

pdf bib
KFU NLP Team at SMM4H 2021 Tasks: Cross-lingual and Cross-modal BERT-based Models for Adverse Drug Effects
Andrey Sakhovskiy | Zulfat Miftahutdinov | Elena Tutubalina
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

This paper describes neural models developed for the Social Media Mining for Health (SMM4H) 2021 Shared Task. We participated in two tasks on classification of tweets that mention an adverse drug effect (ADE) (Tasks 1a & 2) and two tasks on extraction of ADE concepts (Tasks 1b & 1c). For classification, we investigate the impact of joint use of BERTbased language models and drug embeddings obtained by chemical structure BERT-based encoder. The BERT-based multimodal models ranked first and second on classification of Russian (Task 2) and English tweets (Task 1a) with the F1 scores of 57% and 61%, respectively. For Task 1b and 1c, we utilized the previous year’s best solution based on the EnDR-BERT model with additional corpora. Our model achieved the best results in Task 1c, obtaining an F1 of 29%.

pdf bib
NEREL: A Russian Dataset with Nested Named Entities, Relations and Events
Natalia Loukachevitch | Ekaterina Artemova | Tatiana Batura | Pavel Braslavski | Ilia Denisov | Vladimir Ivanov | Suresh Manandhar | Alexander Pugachev | Elena Tutubalina
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse level. NEREL can facilitate development of novel models that can extract relations between nested named entities, as well as relations on both sentence and document levels. NEREL also contains the annotation of events involving named entities and their roles in the events. The NEREL collection is available via https://github.com/nerel-ds/NEREL.

2020

pdf bib
Ad Lingua: Text Classification Improves Symbolism Prediction in Image Advertisements
Andrey Savchenko | Anton Alekseev | Sejeong Kwon | Elena Tutubalina | Evgeny Myasnikov | Sergey Nikolenko
Proceedings of the 28th International Conference on Computational Linguistics

Understanding image advertisements is a challenging task, often requiring non-literal interpretation. We argue that standard image-based predictions are insufficient for symbolism prediction. Following the intuition that texts and images are complementary in advertising, we introduce a multimodal ensemble of a state of the art image-based classifier, a classifier based on an object detection architecture, and a fine-tuned language model applied to texts extracted from ads by OCR. The resulting system establishes a new state of the art in symbolism prediction.

pdf bib
Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models
Elena Tutubalina | Artur Kadurin | Zulfat Miftahutdinov
Proceedings of the 28th International Conference on Computational Linguistics

Linking of biomedical entity mentions to various terminologies of chemicals, diseases, genes, adverse drug reactions is a challenging task, often requiring non-syntactic interpretation. A large number of biomedical corpora and state-of-the-art models have been introduced in the past five years. However, there are no general guidelines regarding the evaluation of models on these corpora in single- and cross-terminology settings. In this work, we perform a comparative evaluation of various benchmarks and study the efficiency of state-of-the-art neural architectures based on Bidirectional Encoder Representations from Transformers (BERT) for linking of three entity types across three domains: research abstracts, drug labels, and user-generated texts on drug therapy in English. We have made the source code and results available at https://github.com/insilicomedicine/Fair-Evaluation-BERT.

pdf bib
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task
Graciela Gonzalez-Hernandez | Ari Z. Klein | Ivan Flores | Davy Weissenbacher | Arjun Magge | Karen O'Connor | Abeed Sarker | Anne-Lyse Minard | Elena Tutubalina | Zulfat Miftahutdinov | Ilseyar Alimova
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

pdf bib
Overview of the Fifth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at COLING 2020
Ari Klein | Ilseyar Alimova | Ivan Flores | Arjun Magge | Zulfat Miftahutdinov | Anne-Lyse Minard | Karen O’Connor | Abeed Sarker | Elena Tutubalina | Davy Weissenbacher | Graciela Gonzalez-Hernandez
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

The vast amount of data on social media presents significant opportunities and challenges for utilizing it as a resource for health informatics. The fifth iteration of the Social Media Mining for Health Applications (#SMM4H) shared tasks sought to advance the use of Twitter data (tweets) for pharmacovigilance, toxicovigilance, and epidemiology of birth defects. In addition to re-runs of three tasks, #SMM4H 2020 included new tasks for detecting adverse effects of medications in French and Russian tweets, characterizing chatter related to prescription medication abuse, and detecting self reports of birth defect pregnancy outcomes. The five tasks required methods for binary classification, multi-class classification, and named entity recognition (NER). With 29 teams and a total of 130 system submissions, participation in the #SMM4H shared tasks continues to grow.

pdf bib
KFU NLP Team at SMM4H 2020 Tasks: Cross-lingual Transfer Learning with Pretrained Language Models for Drug Reactions
Zulfat Miftahutdinov | Andrey Sakhovskiy | Elena Tutubalina
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

This paper describes neural models developed for the Social Media Mining for Health (SMM4H) 2020 shared tasks. Specifically, we participated in two tasks. We investigate the use of a language representation model BERT pretrained on a large-scale corpus of 5 million health-related user reviews in English and Russian. The ensemble of neural networks for extraction and normalization of adverse drug reactions ranked first among 7 teams at the SMM4H 2020 Task 3 and obtained a relaxed F1 of 46%. The BERT-based multilingual model for classification of English and Russian tweets that report adverse reactions ranked second among 16 and 7 teams at two first subtasks of the SMM4H 2019 Task 2 and obtained a relaxed F1 of 58% on English tweets and 51% on Russian tweets.

2019

pdf bib
Distant Supervision for Sentiment Attitude Extraction
Nicolay Rusnachenko | Natalia Loukachevitch | Elena Tutubalina
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

News articles often convey attitudes between the mentioned subjects, which is essential for understanding the described situation. In this paper, we describe a new approach to distant supervision for extracting sentiment attitudes between named entities mentioned in texts. Two factors (pair-based and frame-based) were used to automatically label an extensive news collection, dubbed as RuAttitudes. The latter became a basis for adaptation and training convolutional architectures, including piecewise max pooling and full use of information across different sentences. The results show that models, trained with RuAttitudes, outperform ones that were trained with only supervised learning approach and achieve 13.4% increase in F1-score on RuSentRel collection.

pdf bib
KFU NLP Team at SMM4H 2019 Tasks: Want to Extract Adverse Drugs Reactions from Tweets? BERT to The Rescue
Zulfat Miftahutdinov | Ilseyar Alimova | Elena Tutubalina
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

This paper describes a system developed for the Social Media Mining for Health (SMM4H) 2019 shared tasks. Specifically, we participated in three tasks. The goals of the first two tasks are to classify whether a tweet contains mentions of adverse drug reactions (ADR) and extract these mentions, respectively. The objective of the third task is to build an end-to-end solution: first, detect ADR mentions and then map these entities to concepts in a controlled vocabulary. We investigate the use of a language representation model BERT trained to obtain semantic representations of social media texts. Our experiments on a dataset of user reviews showed that BERT is superior to state-of-the-art models based on recurrent neural networks. The BERT-based system for Task 1 obtained an F1 of 57.38%, with improvements up to +7.19% F1 over a score averaged across all 43 submissions. The ensemble of neural networks with a voting scheme for named entity recognition ranked first among 9 teams at the SMM4H 2019 Task 2 and obtained a relaxed F1 of 65.8%. The end-to-end model based on BERT for ADR normalization ranked first at the SMM4H 2019 Task 3 and obtained a relaxed F1 of 43.2%.

bib
AspeRa: Aspect-Based Rating Prediction Based on User Reviews
Elena Tutubalina | Valentin Malykh | Sergey Nikolenko | Anton Alekseev | Ilya Shenbin
Proceedings of the 2019 Workshop on Widening NLP

We propose a novel Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items. It is based on aspect extraction with neural networks and combines the advantages of deep learning and topic modeling. It is mainly designed for recommendations, but an important secondary goal of AspeRa is to discover coherent aspects of reviews that can be used to explain predictions or for user profiling. We conduct a comprehensive empirical study of AspeRa, showing that it outperforms state-of-the-art models in terms of recommendation quality and produces interpretable aspects. This paper is an abridged version of our work (Nikolenko et al., 2019)

bib
Entity-level Classification of Adverse Drug Reactions: a Comparison of Neural Network Models
Ilseyar Alimova | Elena Tutubalina
Proceedings of the 2019 Workshop on Widening NLP

This paper presents our experimental work on exploring the potential of neural network models developed for aspect-based sentiment analysis for entity-level adverse drug reaction (ADR) classification. Our goal is to explore how to represent local context around ADR mentions and learn an entity representation, interacting with its context. We conducted extensive experiments on various sources of text-based information, including social media, electronic health records, and abstracts of scientific articles from PubMed. The results show that Interactive Attention Neural Network (IAN) outperformed other models on four corpora in terms of macro F-measure. This work is an abridged version of our recent paper accepted to Programming and Computer Software journal in 2019.

pdf bib
Deep Neural Models for Medical Concept Normalization in User-Generated Texts
Zulfat Miftahutdinov | Elena Tutubalina
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a health-related entity mention in a free-form text to a concept in a controlled vocabulary, usually to the standard thesaurus in the Unified Medical Language System (UMLS). This is a challenging task since medical terminology is very different when coming from health care professionals or from the general public in the form of social media texts. We approach it as a sequence learning problem with powerful neural networks such as recurrent neural networks and contextualized word representation models trained to obtain semantic representations of social media expressions. Our experimental evaluation over three different benchmarks shows that neural architectures leverage the semantic meaning of the entity mention and significantly outperform existing state of the art models.

pdf bib
Detecting Adverse Drug Reactions from Biomedical Texts with Neural Networks
Ilseyar Alimova | Elena Tutubalina
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Detection of adverse drug reactions in postapproval periods is a crucial challenge for pharmacology. Social media and electronic clinical reports are becoming increasingly popular as a source for obtaining health related information. In this work, we focus on extraction information of adverse drug reactions from various sources of biomedical textbased information, including biomedical literature and social media. We formulate the problem as a binary classification task and compare the performance of four state-of-the-art attention-based neural networks in terms of the F-measure. We show the effectiveness of these methods on four different benchmarks.

2015

pdf bib
Clustering-based Approach to Multiword Expression Extraction and Ranking
Elena Tutubalina
Proceedings of the 11th Workshop on Multiword Expressions

2014

pdf bib
Unsupervised Approach to Extracting Problem Phrases from User Reviews of Products
Elena Tutubalina | Vladimir Ivanov
Proceedings of the First AHA!-Workshop on Information Discovery in Text