Irina Nikishina - ACL Anthology

2025

Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home
Viktor Moskvoretskii | Maria Marina | Mikhail Salnikov | Nikolay Ivanov | Sergey Pletenev | Daria Galimzianova | Nikita Krayko | Vasily Konovalov | Irina Nikishina | Alexander Panchenko
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval Augmented Generation (RAG) improves correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet it greatly increases computational costs. Besides, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs’ intrinsic knowledge with external information appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.

CompUGE-Bench: Comparative Understanding and Generation Evaluation Benchmark for Comparative Question Answering
Ahmad Shallouf | Irina Nikishina | Chris Biemann
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

This paper presents CompUGE, a comprehensive benchmark designed to evaluate Comparative Question Answering (CompQA) systems. The benchmark is structured around four core tasks: Comparative Question Identification, Object and Aspect Identification, Stance Classification, and Answer Generation. It unifies multiple datasets and provides a robust evaluation platform to compare various models across these sub-tasks. We also create additional all-encompassing CompUGE datasets by filtering and merging the existing ones. The benchmark for comparative question answering sub-tasks is designed as a web application available on HuggingFace Spaces: https://huggingface.co/spaces/uhhlt/CompUGE-Bench

How to Compare Things Properly? A Study of Argument Relevance in Comparative Question Answering
Irina Nikishina | Saba Anwar | Nikolay Dolgov | Maria Manina | Daria Ignatenko | Artem Shelmanov | Chris Biemann
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Comparative Question Answering (CQA) lies at the intersection of Question Answering, Argument Mining, and Summarization. It poses unique challenges due to the inherently subjective nature of many questions and the need to integrate diverse perspectives. Although the CQA task can be addressed using recently emerged instruction-following Large Language Models (LLMs), challenges such as hallucinations in their outputs and the lack of transparent argument provenance remain significant limitations. To address these challenges, we construct a manually curated dataset comprising arguments annotated with their relevance. These arguments are further used to answer comparative questions, enabling precise traceability and faithfulness. Furthermore, we define explicit criteria for an “ideal” comparison and introduce a benchmark for evaluating the outputs of various Retrieval-Augmented Generation (RAG) models with respect to argument relevance. All code and data are publicly released to support further research.

2024

This paper describes the results of the Knowledge Graph Question Answering (KGQA) shared task that was co-located with the TextGraphs 2024 workshop. In this task, given a textual question and a list of entities with the corresponding KG subgraphs, the participating system should choose the entity that correctly answers the question. Our competition attracted thirty teams, four of which outperformed our strong ChatGPT-based zero-shot baseline. In this paper, we overview the participating systems and analyze their performance according to a large-scale automatic evaluation. To the best of our knowledge, this is the first competition aimed at the KGQA problem using the interaction between large language models (LLMs) and knowledge graphs.

Low-Resource Machine Translation through the Lens of Personalized Federated Learning
Viktor Moskvoretskii | Nazarii Tupitsa | Chris Biemann | Samuel Horváth | Eduard Gorbunov | Irina Nikishina
Findings of the Association for Computational Linguistics: EMNLP 2024

We present a new approach called MeritOpt based on the Personalized Federated Learning algorithm MeritFed that can be applied to Natural Language Tasks with heterogeneous data. We evaluate it on the Low-Resource Machine Translation task, using the datasets of South East Asian and Finno-Ugric languages. In addition to its effectiveness, MeritOpt is also highly interpretable, as it can be applied to track the impact of each language used for training. Our analysis reveals that target dataset size affects weight distribution across auxiliary languages, that unrelated languages do not interfere with the training, and auxiliary optimizer parameters have minimal impact. Our approach is easy to apply with a few lines of code, and we provide scripts for reproducing the experiments (https://github.com/VityaVitalich/MeritOpt).

Industry vs Academia: Running a Course on Transformers in Two Setups
Irina Nikishina | Maria Tikhonova | Viktoriia Chekalina | Alexey Zaytsev | Artem Vazhentsev | Alexander Panchenko
Proceedings of the Sixth Workshop on Teaching NLP

This paper presents a course on neural networks based on the Transformer architecture targeted at diverse groups of people from academia and industry with experience in Python, Machine Learning, and Deep Learning but little or no experience with Transformers. The course covers a comprehensive overview of the Transformers NLP applications and their use for other data types. The course features 15 sessions, each consisting of a lecture and a practical part, and two homework assignments organized as CodaLab competitions. The first six sessions of the course are devoted to the Transformer and the variations of this architecture (e.g., encoders, decoders, encoder-decoders) as well as different techniques of model tuning. Subsequent sessions are devoted to multilingualism, multimodality (e.g., texts and images), efficiency, event sequences, and tabular data. We ran the course for different audiences: academic students and people from industry. The first run was held in 2022. During the subsequent iterations until 2024, it was constantly updated and extended with recently emerged findings on GPT-4, LLMs, RLHF, etc. Overall, it has been run six times (four times in industry and twice in academia) and received positive feedback from academic and industry students.

Sövereign at The Perspective Argument Retrieval Shared Task 2024: Using LLMs with Argument Mining
Robert Günzler | Özge Sevgili | Steffen Remus | Chris Biemann | Irina Nikishina
Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)

This paper presents the Sövereign submission for the shared task on perspective argument retrieval for the Argument Mining Workshop 2024. The main challenge is to perform argument retrieval considering socio-cultural aspects such as political interests, occupation, age, and gender. To address the challenge, we apply open-access Large Language Models (Mistral-7b) in a zero-shot fashion for re-ranking and explicit similarity scoring. Additionally, we combine different features in an ensemble setup using logistic regression. Our system ranks second in the competition for all test set rounds on average for the logistic regression approach using LLM similarity scores as a feature. In addition to the description of the approach, we also provide further results of our ablation study. Our code will be open-sourced upon acceptance.

UHH at AVeriTeC: RAG for Fact-Checking with Real-World Claims
Özge Sevgili | Irina Nikishina | Seid Muhie Yimam | Martin Semmann | Chris Biemann
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)

This paper presents UHH’s approach developed for the AVeriTeC shared task. The goal of the challenge is to verify given real-world claims with evidences from the Web. In this shared task, we investigate a Retrieval-Augmented Generation (RAG) model, which mainly contains retrieval, generation, and augmentation components. We start with the selection of the top 10k evidences via BM25 scores, and continue with two approaches to retrieve the most similar evidences: (1) to retrieve top 10 evidences through vector similarity, generate questions for them, and rerank them or (2) to generate questions for the claim and retrieve the most similar evidence, again, through vector similarity. After retrieving the top evidences, a Large Language Model (LLM) is prompted using the claim along with either all evidences or individual evidence to predict the label. Our system submission, UHH, using the first approach and individual evidence prompts, ranks 6th out of 23 systems.

Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing
Dmitry Ustalov | Yanjun Gao | Alexander Panchenko | Elena Tutubalina | Irina Nikishina | Arti Ramesh | Andrey Sakhovskiy | Ricardo Usbeck | Gerald Penn | Marco Valentino
Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing

Are Large Language Models Good at Lexical Semantics? A Case of Taxonomy Learning
Viktor Moskvoretskii | Alexander Panchenko | Irina Nikishina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recent studies on LLMs do not pay enough attention to linguistic and lexical semantic tasks, such as taxonomy learning. In this paper, we explore the capacities of Large Language Models featuring LLaMA-2 and Mistral for several Taxonomy-related tasks. We introduce a new methodology and algorithm for data collection via stochastic graph traversal leading to controllable data collection. Collected cases provide the ability to form nearly any type of graph operation. We test the collected dataset for learning taxonomy structure based on English WordNet and compare different input templates for fine-tuning LLMs. Moreover, we apply the models fine-tuned on such datasets to downstream tasks, achieving state-of-the-art results on the TexEval-2 dataset.

CAM 2.0: End-to-End Open Domain Comparative Question Answering System
Ahmad Shallouf | Hanna Herasimchyk | Mikhail Salnikov | Rudy Alexandro Garrido Veliz | Natia Mestvirishvili | Alexander Panchenko | Chris Biemann | Irina Nikishina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Comparative Question Answering (CompQA) is a Natural Language Processing task that combines Question Answering and Argument Mining approaches to answer subjective comparative questions in an efficient argumentative manner. In this paper, we present an end-to-end (full pipeline) system for answering comparative questions called CAM 2.0 as well as a public leaderboard called CompUGE that unifies the existing datasets under a single easy-to-use evaluation suite. As compared to previous web-form-based CompQA systems, it features question identification, object and aspect labeling, stance classification, and summarization using up-to-date models. We also select the most time- and memory-effective pipeline by comparing separately fine-tuned Transformer Encoder models which show state-of-the-art performance on the subtasks with Generative LLMs in few-shot and LoRA setups. We also conduct a user study for a whole-system evaluation.

TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks
Viktor Moskvoretskii | Ekaterina Neminova | Alina Lobanova | Alexander Panchenko | Irina Nikishina
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we explore the capabilities of LLMs in capturing lexical-semantic knowledge from WordNet on the example of the LLaMA-2-7b model and test it on multiple lexical semantic tasks. As the outcome of our experiments, we present TaxoLLaMA, the “all-in-one” model for taxonomy-related tasks, lightweight due to 4-bit quantization and LoRA. TaxoLLaMA achieves 11 SOTA results, and 4 top-2 results out of 16 tasks on the Taxonomy Enrichment, Hypernym Discovery, Taxonomy Construction, and Lexical Entailment tasks. Moreover, it demonstrates a very strong zero-shot performance on Lexical Entailment and Taxonomy Construction with no fine-tuning. We also explore its hidden multilingual and domain adaptation capabilities with a little tuning or few-shot learning. All datasets, code, and pre-trained models are available online (code: https://github.com/VityaVitalich/TaxoLLaMA)

On Improving Repository-Level Code QA for Large Language Models
Jan Strich | Florian Schneider | Irina Nikishina | Chris Biemann
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Large Language Models (LLMs) such as ChatGPT, GitHub Copilot, Llama, or Mistral assist programmers as copilots and knowledge sources to make the coding process faster and more efficient. This paper aims to improve the copilot performance by implementing different self-alignment processes and retrieval-augmented generation (RAG) pipelines, as well as their combination. To test the effectiveness of all approaches, we create a dataset and apply a model-based evaluation, using LLM as a judge. It is designed to check the model’s abilities to understand the source code semantics, the dependency between files, and the overall meta-information about the repository. We also compare our approach with other existing solutions, e.g. ChatGPT-3.5, and evaluate on the existing benchmarks. Code and dataset are available online (https://anonymous.4open.science/r/ma_llm-382D).

2023

Predicting Terms in IS-A Relations with Pre-trained Transformers
Irina Nikishina | Polina Chernomorchenko | Anastasiia Demidova | Alexander Panchenko | Chris Biemann
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

Large Language Models Meet Knowledge Graphs to Answer Factoid Questions
Mikhail Salnikov | Hai Le | Prateek Rajput | Irina Nikishina | Pavel Braslavski | Valentin Malykh | Alexander Panchenko
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

Cross-Modal Contextualized Hidden State Projection Method for Expanding of Taxonomic Graphs
Irina Nikishina | Alsu Vakhitova | Elena Tutubalina | Alexander Panchenko
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

Taxonomy is a graph of terms organized hierarchically using is-a (hypernymy) relations. We suggest a novel candidate-free task formulation for the taxonomy enrichment task. To solve the task, we leverage lexical knowledge from the pre-trained models to predict new words missing in the taxonomic resource. We propose a method that combines graph- and text-based contextualized representations from transformer networks to predict new entries to the taxonomy. We have evaluated the method suggested for this task against text-only baselines based on BERT and fastText representations. The results demonstrate that incorporation of graph embedding is beneficial in the task of hyponym prediction using contextualized models. We hope the new challenging task will foster further research in automatic text graph construction methods.

A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification
Varvara Logacheva | Daryna Dementieva | Irina Krotova | Alena Fenogenova | Irina Nikishina | Tatiana Shavrina | Alexander Panchenko
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

It is often difficult to reliably evaluate models which generate text. Among them, text style transfer is particularly difficult to evaluate, because its success depends on a number of parameters. We conduct an evaluation of a large number of models on a detoxification task. We explore the relations between the manual and automatic metrics and find that there is only weak correlation between them, which is dependent on the type of model which generated text. Automatic metrics tend to be less reliable for better-performing models. However, our findings suggest that ChrF and BertScore metrics can be used as a proxy for human evaluation of text detoxification to some extent.

TaxFree: a Visualization Tool for Candidate-free Taxonomy Enrichment
Irina Nikishina | Ivan Andrianov | Alsu Vakhitova | Alexander Panchenko
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations

Taxonomies are widely used in a variety of downstream NLP tasks and, therefore, should be kept up-to-date. In this paper, we present TaxFree, an open source system for taxonomy visualisation and automatic Taxonomy Enrichment without pre-defined candidates on the example of WordNet-3.0. As opposed to the traditional task formulation (where the list of new words is provided beforehand), we provide an approach for automatic extension of a taxonomy using a large pre-trained language model. As an advantage over the existing visualisation tools of WordNet, TaxFree also integrates graphic representations of synsets from ImageNet. Such a visualisation tool can be used for both updating taxonomies and inspecting them for the required modifications.

2021

Evaluation of Taxonomy Enrichment on Diachronic WordNet Versions
Irina Nikishina | Natalia Loukachevitch | Varvara Logacheva | Alexander Panchenko
Proceedings of the 11th Global Wordnet Conference

The vast majority of the existing approaches for taxonomy enrichment apply word embeddings as they have proven to accumulate contexts (in a broad sense) extracted from texts which are sufficient for attaching orphan words to the taxonomy. On the other hand, apart from being large lexical and semantic resources, taxonomies are graph structures. Combining word embeddings with graph structure of taxonomy could be of use for predicting taxonomic relations. In this paper we compare several approaches for attaching new words to the existing taxonomy which are based on the graph representations with the one that relies on fastText embeddings. We test all methods on Russian and English datasets, but they could be also applied to other wordnets and languages.

2020

Studying Taxonomy Enrichment on Diachronic WordNet Versions
Irina Nikishina | Varvara Logacheva | Alexander Panchenko | Natalia Loukachevitch
Proceedings of the 28th International Conference on Computational Linguistics

Ontologies, taxonomies, and thesauri have always been in high demand in a large number of NLP tasks. However, most studies are focused on the creation of lexical resources rather than the maintenance of the existing ones and keeping them up-to-date. In this paper, we address the problem of taxonomy enrichment. Namely, we explore the possibilities of taxonomy extension in a resource-poor setting and present several methods which are applicable to a large number of languages. We also create novel English and Russian datasets for training and evaluating taxonomy enrichment systems and describe a technique of creating such datasets for other languages.

Co-authors

Elena Tutubalina 3

Natalia Loukachevitch 2

Andrey Sakhovskiy 2

Özge Sevgili 2

Ahmad Shallouf 2

Ricardo Usbeck 2

Dmitry Ustalov 2

Alsu Vakhitova 2

Rana Abdullah 1

Ivan Andrianov 1

Debayan Banerjee 1

Pavel Braslavski 1

Viktoriia Chekalina 1

Polina Chernomorchenko 1

Daryna Dementieva 1

Anastasiia Demidova 1

Nikolay Dolgov 1

Alena Fenogenova 1

Daria Galimzianova 1

Rudy Alexandro Garrido Veliz 1

Eduard Gorbunov 1

Robert Günzler 1

Hanna Herasimchyk 1

Samuel Horváth 1

Daria Ignatenko 1

Nikolay Ivanov 1

Longquan Jiang 1

Vasily Konovalov 1

Angelie Kraft 1

Nikita Krayko 1

Irina Krotova 1

Alina Lobanova 1

Valentin Malykh 1

Natia Mestvirishvili 1

Cedric Möller 1

Ekaterina Neminova 1

Sergey Pletenev 1

Prateek Rajput 1

Steffen Remus 1

Florian Schneider 1

Martin Semmann 1

Tatiana Shavrina 1

Artem Shelmanov 1

Maria Tikhonova 1

Nazarii Tupitsa 1

Aida Usmanova 1

Marco Valentino 1

Artem Vazhentsev 1

Seid Muhie Yimam 1

Alexey Zaytsev 1
