Workshop on Insights from Negative Results in NLP (2023)


pdf (full)
bib (full)
The Fourth Workshop on Insights from Negative Results in NLP

pdf bib
The Fourth Workshop on Insights from Negative Results in NLP
Shabnam Tafreshi | Arjun Akula | João Sedoc | Aleksandr Drozd | Anna Rogers | Anna Rumshisky

pdf bib
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

pdf bib
ERATE: Efficient Retrieval Augmented Text Embeddings
Vatsal Raina | Nora Kassner | Kashyap Popat | Patrick Lewis | Nicola Cancedda | Louis Martin

Embedding representations of text are useful for downstream natural language processing tasks. Several universal sentence representation methods have been proposed with a particular focus on self-supervised pre-training approaches to leverage the vast quantities of unlabelled data. However, there are two challenges for generating rich embedding representations for a new document. 1) The latest rich embedding generators are based on very large costly transformer-based architectures. 2) The rich embedding representation of a new document is limited to only the information provided without access to any explicit contextual and temporal information that could potentially further enrich the representation. We propose efficient retrieval-augmented text embeddings (ERATE) that tackles the first issue and offers a method to tackle the second issue. To the best of our knowledge, we are the first to incorporate retrieval to general purpose embeddings as a new paradigm, which we apply to the semantic similarity tasks of SentEval. Despite not reaching state-of-the-art performance, ERATE offers key insights that encourages future work into investigating the potential of retrieval-based embeddings.

pdf bib
A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets
Iva Bojic | Josef Halim | Verena Suharman | Sreeja Tar | Qi Chwen Ong | Duy Phung | Mathieu Ravaut | Shafiq Joty | Josip Car

Low-quality data can cause downstream problems in high-stakes applications. Data-centric approach emphasizes on improving dataset quality to enhance model performance. High-quality datasets are needed for general-purpose Large Language Models (LLMs) training, as well as for domain-specific models, which are usually small in size as it is costly to engage a large number of domain experts for their creation. Thus, it is vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets. (Code and dataset are available at We applied the proposed framework to four biomedical datasets and showed relative improvement of up to 33%/40% for fine-tuning of retrieval/reader models on the BioASQ dataset when using back translation to enhance the original dataset quality.

pdf bib
Encoding Sentence Position in Context-Aware Neural Machine Translation with Concatenation
Lorenzo Lupo | Marco Dinarelli | Laurent Besacier

Context-aware translation can be achieved by processing a concatenation of consecutive sentences with the standard Transformer architecture. This paper investigates the intuitive idea of providing the model with explicit information about the position of the sentences contained in the concatenation window. We compare various methods to encode sentence positions into token representations, including novel methods. Our results show that the Transformer benefits from certain sentence position encoding methods on English to Russian translation, if trained with a context-discounted loss. However, the same benefits are not observed on English to German. Further empirical efforts are necessary to define the conditions under which the proposed approach is beneficial.

pdf bib
SocBERT: A Pretrained Model for Social Media Text
Yuting Guo | Abeed Sarker

Pretrained language models (PLMs) on domain-specific data have been proven to be effective for in-domain natural language processing (NLP) tasks. Our work aimed to develop a language model which can be effective for the NLP tasks with the data from diverse social media platforms. We pretrained a language model on Twitter and Reddit posts in English consisting of 929M sequence blocks for 112K steps. We benchmarked our model and 3 transformer-based models—BERT, BERTweet, and RoBERTa on 40 social media text classification tasks. The results showed that although our model did not perform the best on all of the tasks, it outperformed the baseline model—BERT on most of the tasks, which illustrates the effectiveness of our model. Also, our work provides some insights of how to improve the efficiency of training PLMs.

pdf bib
Edit Aware Representation Learning via Levenshtein Prediction
Edison Marrese-Taylor | Machel Reid | Alfredo Solano

We propose a novel approach that employs token-level Levenshtein operations to learn a continuous latent space of vector representations to capture the underlying semantic information with regard to the document editing process. Though our model outperforms strong baselines when fine-tuned on edit-centric tasks, it is unclear if these results are due to domain similarities between fine-tuning and pre-training data, suggesting that the benefits of our proposed approach over regular masked language-modelling pre-training are limited.

pdf bib
What changes when you randomly choose BPE merge operations? Not much.
Jonne Saleva | Constantine Lignos

We introduce two simple randomized variants of byte pair encoding (BPE) and explore whether randomizing the selection of merge operations substantially affects a downstream machine translation task. We focus on translation into morphologically rich languages, hypothesizing that this task may show sensitivity to the method of choosing subwords. Analysis using a Bayesian linear model indicates that one variant performs nearly indistinguishably compared to standard BPE while the other degrades performance less than we anticipated. We conclude that although standard BPE is widely used, there exists an interesting universe of potential variations on it worth investigating. Our code is available at:

pdf bib
Hiding in Plain Sight: Insights into Abstractive Text Summarization
Vivek Srivastava | Savita Bhat | Niranjan Pedanekar

In recent years, there has been growing interest in the field of abstractive text summarization with focused contributions in relevant model architectures, datasets, and evaluation metrics. Despite notable research advances, previous works have identified certain limitations concerning the quality of datasets and the effectiveness of evaluation techniques for generated summaries. In this context, we examine these limitations further with the help of three quality measures, namely, Information Coverage, Entity Hallucination, and Summarization Complexity. As a part of this work, we investigate two widely used datasets (XSUM and CNNDM) and three existing models (BART, PEGASUS, and BRIO) and report our findings. Some key insights are: 1) Cumulative ROUGE score is an inappropriate evaluation measure since few high-scoring samples dominate the overall performance, 2) Existing summarization models have limited capability for information coverage and hallucinate to generate factual information, and 3) Compared to the model generated summaries, the reference summaries have lowest information coverage and highest entity hallucinations reiterating the need of new and better reference summaries.

pdf bib
Annotating PubMed Abstracts with MeSH Headings using Graph Neural Network
Faizan E Mustafa | Rafika Boutalbi | Anastasiia Iurshina

The number of scientific publications in the biomedical domain is continuously increasing with time. An efficient system for indexing these publications is required to make the information accessible according to the user’s information needs. Task 10a of the BioASQ challenge aims to classify PubMed articles according to the MeSH ontology so that new publications can be grouped with similar preexisting publications in the field without the assistance of time-consuming and costly annotations by human annotators. In this work, we use Graph Neural Network (GNN) in the link prediction setting to exploit potential graph-structured information present in the dataset which could otherwise be neglected by transformer-based models. Additionally, we provide error analysis and a plausible reason for the substandard performance achieved by GNN.

pdf bib
Do not Trust the Experts: How the Lack of Standard Complicates NLP for Historical Irish
Oksana Dereza | Theodorus Fransen | John P. Mccrae

In this paper, we describe how we unearthed some fundamental problems while building an analogy dataset modelled on BATS (Gladkova et al., 2016) to evaluate historical Irish embeddings on their ability to detect orthographic, morphological and semantic similarity. The performance of our models in the analogy task was extremely poor regardless of the architecture, hyperparameters and evaluation metrics, while the qualitative evaluation revealed positive tendencies. We argue that low agreement between field experts on fundamental lexical and orthographic issues, and the lack of a unified editorial standard in available resources make it impossible to build reliable evaluation datasets for computational models and obtain interpretable results. We emphasise the need for such a standard, particularly for NLP applications, and prompt Celticists and historical linguists to engage in further discussion. We would also like to draw NLP scholars’ attention to the role of data and its (extra)linguistic properties in testing new models, technologies and evaluation scenarios.

pdf bib
Exploring the Reasons for Non-generalizability of KBQA systems
Sopan Khosla | Ritam Dutt | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah

Recent research has demonstrated impressive generalization capabilities of several Knowledge Base Question Answering (KBQA) models on the GrailQA dataset. We inspect whether these models can generalize to other datasets in a zero-shot setting. We notice a significant drop in performance and investigate the causes for the same. We observe that the models are dependent not only on the structural complexity of the questions, but also on the linguistic styles of framing a question. Specifically, the linguistic dimensions corresponding to explicitness, readability, coherence, and grammaticality have a significant impact on the performance of state-of-the-art KBQA models. Overall our results showcase the brittleness of such models and the need for creating generalizable systems.

pdf bib
An Empirical Study on Active Learning for Multi-label Text Classification
Mengqi Wang | Ming Liu

Active learning has been widely used in the task of text classification for its ability to select the most valuable samples to annotate while improving the model performance. However, the efficiency of active learning in multi-label text classification tasks has been under-explored due to the label imbalanceness problem. In this paper, we conduct an empirical study of active learning on multi-label text classification and evaluate the efficiency of five active learning strategies on six multi-label text classification tasks. The experiments show that some strategies in the single-label setting especially in imbalanced datasets.