Leon Weber-Genzel

2024

VariErr NLI: Separating Annotation Error from Human Label Variation
Leon Weber-Genzel | Siyao Peng | Marie-Catherine De Marneffe | Barbara Plank
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white.To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs.VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

pdf bib abs

Fine-grained Sexism Detection in Italian Newspapers
Federica Manzi | Leon Weber-Genzel | Barbara Plank
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

In recent years, tasks revolving around hate speech detection have experienced a growing interest in the field of Natural Language Processing. Two main trends stand out in the context of sexism recognition: the focus on overt forms of sexism such as misogyny on social media and tackling the problem as a text classification task. The main objective of this work is to introduce a new approach to tackle sexism recognition as a sequence labelling task, operating on the token level rather than the document one. To achieve this goal, we introduce (i) the FGSDI (Fine-Grained Sexism Detection in Italian) corpus, containing Italian newspaper articles annotated with fine-grained linguistic markers of sexism, and (ii) a two-step pipeline that sequentially performs sexism detection on the sentence level and sexism classification on the token one. Our primary findings include that (i) tackling the task of sexism recognition as a sequence labelling task is possible, however, a large amount of labelled data is needed; (ii) leveraging few-shot learning for sexism detection proves to be an effective solution in scenarios where only a limited amount of data are available; (iii) the proposed pipeline approach allows for better results compared to the baseline by doubling the overall precision and achieving a better F1-score.

pdf bib abs

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to model’s diverse response styles such as starting with “Sure” or refusing to answer. Consequently, first-token evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.

In this paper, we describe our submission for the NLI4CT 2024 shared task on robust Natural Language Inference over clinical trial reports. Our system is an ensemble of nine diverse models which we aggregate via majority voting. The models use a large spectrum of different approaches ranging from a straightforward Convolutional Neural Network over fine-tuned Large Language Models to few-shot-prompted language models using chain-of-thought reasoning.Surprisingly, we find that some individual ensemble members are not only more accurate than the final ensemble model but also more robust.

2023

pdf bib abs

Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
Robert Litschko | Max Müller-Eberstein | Rob van der Goot | Leon Weber-Genzel | Barbara Plank
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and pursue a more holistic view on language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model’s functional capacity, and provide recommendations for more multi-faceted evaluation protocols.

Co-authors

Venues

Fix author