Benjamin Roth - ACL Anthology

Benjamin Roth

2026

Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis
Yuxi Xia | Kinga Stańczak | Benjamin Roth
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

AI-text detectors achieve high accuracy on in-domain benchmarks, but often struggle to generalize across different generation conditions such as unseen prompts, model families, or domains. While prior work has reported these generalization gaps, there are limited insights about the underlying causes. In this work, we present a systematic study aimed at explaining generalization behavior through linguistic analysis. We construct a comprehensive benchmark that spans 6 prompting strategies, 7 large language models (LLMs), and 4 domain datasets, resulting in a diverse set of human- and AI-generated texts. Using this dataset, we fine-tune classification-based detectors on various generation settings and evaluate their cross-prompt, cross-model, and cross-dataset generalization. To explain the performance variance, we compute correlations between generalization accuracies and feature shifts of 80 linguistic features between training and test conditions. Our analysis reveals that generalization performance for specific detectors and evaluation conditions is significantly associated with linguistic features such as tense usage and pronoun frequency.
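
The correlation analysis described in this abstract can be illustrated with a minimal sketch. The abstract does not say which correlation coefficient or which definition of feature shift is used, so the Pearson coefficient, the mean-difference shift, and all numbers below are assumptions made purely for illustration.

```python
from math import sqrt


def feature_shift(train_values, test_values):
    """Absolute difference between a feature's mean in train and test texts.

    Assumed definition of 'shift'; the paper may use a different one.
    """
    train_mean = sum(train_values) / len(train_values)
    test_mean = sum(test_values) / len(test_values)
    return abs(train_mean - test_mean)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one entry per train/test condition pair.
# shifts[i] is the shift of one linguistic feature (e.g. past-tense rate),
# accuracies[i] is the detector accuracy under that condition pair.
shifts = [0.0, 0.1, 0.2, 0.3]
accuracies = [0.9, 0.8, 0.7, 0.6]
r = pearson(shifts, accuracies)
print(f"correlation between feature shift and accuracy: {r:.2f}")
```

A strongly negative value in such a setup would suggest that larger shifts in that feature go along with lower generalization accuracy; a real analysis would also need a significance test.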

Do language models accommodate their users? A study of linguistic convergence
Terra Blevins | Susanne Schmalwieser | Benjamin Roth
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to original human responses across sixteen language models, three dialogue corpora, and various stylometric features. We find that models strongly converge to the conversation’s style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained and smaller counterparts. Given the differences in human and model convergence patterns, we hypothesize that the underlying mechanisms driving these behaviors are very different.

Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions
Pedro Henrique Luz de Araujo | Michael A. Hedderich | Ali Modarressi | Hinrich Schuetze | Benjamin Roth
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation. Yet, they are typically evaluated only in short, single-round settings that do not reflect real-world usage. We introduce an evaluation protocol that combines long persona dialogues (over 100 rounds) and evaluation datasets to create dialogue-conditioned benchmarks that can robustly measure long-context effects. We then investigate the effects of dialogue length on persona fidelity, instruction-following, and safety of seven state-of-the-art open- and closed-weight LLMs. We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations, where models must sustain both persona fidelity and instruction following. We identify a trade-off between fidelity and instruction following, with non-persona baselines initially outperforming persona-assigned models; as dialogues progress and fidelity fades, persona responses become increasingly similar to baseline responses. Our findings highlight the fragility of persona applications in extended interactions and our work provides a protocol to systematically measure such failures.

2025

Influence-driven Curriculum Learning for Pre-training on Limited Data
Loris Schoenegger | Lukas Thoma | Terra Blevins | Benjamin Roth
Proceedings of the First BabyLM Workshop

Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model’s output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.

RecombiText: Compositional Data Augmentation for Enhancing LLM Pre-Training Datasets in Low-Resource Scenarios
Alexander Tampier | Lukas Thoma | Loris Schoenegger | Benjamin Roth
Proceedings of the First BabyLM Workshop

We introduce RecombiText Augmentation (RTA), a novel purely statistical NLP method for compositional data augmentation for data-efficient LLM pre-training in low-resource scenarios. RTA identifies lexically and semantically similar sentences within the corpus and generates synthetic sentence pairs from them while preserving underlying patterns from the corpus. We pre-train GPT-2 and RoBERTa language models on a domain-specific, low-resource corpus of 10 million words, with different proportions of augmented data. We compare our RTA-augmented model variants to a baseline model trained on the full original dataset. Zero-shot results show that the language models pre-trained on synthetic data improve in entity tracking, self-paced reading, and morphological generalization benchmarks. In other tasks, the performance is comparable to the baseline model. We demonstrate that it is possible to expand low-resource datasets by two- to four-fold without compromising benchmark performance, solely through statistical processing of the available data.

Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles
Yuxi Xia | Pedro Henrique Luz De Araujo | Klim Zaporojets | Benjamin Roth
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate if incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal loss improve calibration from baselines. We find that few-shot prompts are the most effective for auxiliary model-based methods, and auxiliary models demonstrate robust calibration performance across accuracy variations, outperforming LLMs’ internal probabilities and verbalized confidences. These insights deepen the understanding of influence factors in LLM calibration, supporting their reliable deployment in diverse applications.
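
To make the notion of calibration concrete, the sketch below computes the expected calibration error (ECE), a commonly used metric, and the standard binary focal loss. The abstract does not state which calibration metric is reported or which exact focal-loss variant is used, so both functions, the bin count, and the sample values are assumptions rather than the paper's setup.

```python
from math import log


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece


def binary_focal_loss(p, y, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    With gamma = 0 this reduces to binary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -((1.0 - p_t) ** gamma) * log(p_t)


# Hypothetical predictions: 90% confidence, but only 3 of 4 answers correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(f"ECE: {ece:.2f}")
```

The focal term down-weights examples the model already predicts confidently and correctly, which is one reason it is often tried for confidence estimation.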

Learn to pick the winner: Black-box ensembling for textual and visual question answering
Yuxi Xia | Klim Zaporojets | Benjamin Roth
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks
Andreas Stephan | Dawei Zhu | Matthias Aßenmacher | Xiaoyu Shen | Benjamin Roth
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags and find that we can correctly predict 70%-75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on-average better model but largely fail if we use them to improve task performance.

Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance
Pedro Henrique Luz de Araujo | Paul Röttger | Dirk Hovy | Benjamin Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Expert persona prompting—assigning roles such as expert in math to language models—is widely used for task improvement. However, prior work shows mixed results on its effectiveness, and does not consider when and why personas should improve performance. We analyze the literature on persona prompting for task improvement and distill three desiderata: 1) performance advantage of expert personas, 2) robustness to irrelevant persona attributes, and 3) fidelity to persona attributes. We then evaluate 9 state-of-the-art LLMs across 27 tasks with respect to these desiderata. We find that expert personas usually lead to positive or non-significant performance changes. Surprisingly, models are highly sensitive to irrelevant persona details, with performance drops of almost 30 percentage points. In terms of fidelity, we find that while higher education, specialization, and domain-relatedness can boost performance, their effects are often inconsistent or negligible across tasks. We propose mitigation strategies to improve robustness—but find they only work for the largest, most capable models. Our findings underscore the need for more careful persona design and for evaluation schemes that reflect the intended effects of persona usage.

2024

Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing
Matthias Aßenmacher | Andreas Stephan | Leonie Weissweiler | Erion Çano | Ingo Ziegler | Marwin Härttrich | Bernd Bischl | Benjamin Roth | Christian Heumann | Hinrich Schütze
Proceedings of the Sixth Workshop on Teaching NLP

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity
Anastasiia Sedova | Robert Litschko | Diego Frassinelli | Benjamin Roth | Barbara Plank
Findings of the Association for Computational Linguistics: EMNLP 2024

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

Text-Guided Alternative Image Clustering
Andreas Stephan | Lukas Miklautz | Collin Leiber | Pedro Henrique Luz De Araujo | Dominik Répás | Claudia Plant | Benjamin Roth
Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)

Traditional image clustering techniques only find a single grouping within visual data. In particular, they do not provide a possibility to explicitly define multiple types of clustering. This work explores the potential of large vision-language models to facilitate alternative image clustering. We propose Text-Guided Alternative Image Consensus Clustering (TGAICC), a novel approach that leverages user-specified interests via prompts to guide the discovery of diverse clusterings. To achieve this, it generates a clustering for each prompt, groups them using hierarchical clustering, and then aggregates them using consensus clustering. TGAICC outperforms image- and text-based baselines on four alternative image clustering benchmark datasets. Furthermore, using count-based word statistics, we are able to obtain text-based explanations of the alternative clusterings. In conclusion, our research illustrates how contemporary large vision-language models can transform explanatory data analysis, enabling the generation of insightful, customizable, and diverse image clusterings.

Text-Guided Image Clustering
Andreas Stephan | Lukas Miklautz | Kevin Sidak | Jan Philip Wahle | Bela Gipp | Claudia Plant | Benjamin Roth
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Image clustering divides a collection of images into meaningful groups, typically interpreted post-hoc via human-given annotations. Those are usually in the form of text, begging the question of using text as an abstraction for image clustering. Current image clustering methods, however, neglect the use of generated textual descriptions. We, therefore, propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models and subsequently clustering the generated text. Further, we introduce a novel approach to inject task- or domain knowledge for clustering by prompting VQA models. Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features. Additionally, we propose a counting-based cluster explainability method. Our evaluations show that the derived keyword-based explanations describe clusters better than the respective cluster accuracy suggests. Overall, this research challenges traditional approaches and paves the way for a paradigm shift in image clustering, using generated text.

Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)
Pedro Henrique Luz de Araujo | Andreas Baumann | Dagmar Gromann | Brigitte Krenn | Benjamin Roth | Michael Wiegand
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

Functionality learning through specification instructions
Pedro Henrique Luz De Araujo | Benjamin Roth
Findings of the Association for Computational Linguistics: EMNLP 2024

Test suites assess natural language processing models’ performance on specific functionalities: cases of interest involving model robustness, fairness, or particular linguistic capabilities. This paper introduces specification instructions: text descriptions specifying fine-grained task-specific behaviors. For each functionality in a suite, we generate an instruction that describes it. We combine the specification instructions to create specification-augmented prompts, which we feed to language models pre-trained on natural instruction data. We conduct experiments to measure how optimizing for some functionalities may negatively impact functionalities that are not covered by the specification set. Our analyses across four tasks and models of diverse sizes and families show that smaller models struggle to follow specification instructions. However, larger models (> 3B parameters) can benefit from specifications and, surprisingly, even generalize certain desirable behaviors across functionalities.

Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency
Vasiliki Kougia | Anastasiia Sedova | Andreas Joseph Stephan | Klim Zaporojets | Benjamin Roth
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

This paper presents the first study for temporal relation extraction in a zero-shot setting focusing on biomedical text. We employ two types of prompts and five Large Language Models (LLMs; GPT-3.5, Mixtral, Llama 2, Gemma, and PMC-LLaMA) to obtain responses about the temporal relations between two events. Our experiments demonstrate that LLMs struggle in the zero-shot setting, performing worse than fine-tuned specialized models in terms of F1 score. This highlights the challenging nature of this task and underscores the need for further research to enhance the performance of LLMs in this context. We further contribute a novel comprehensive temporal analysis by calculating consistency scores for each LLM. Our findings reveal that LLMs face challenges in providing responses consistent with the temporal properties of uniqueness and transitivity. Moreover, we study the relation between the temporal consistency of an LLM and its accuracy, and whether the latter can be improved by solving temporal inconsistencies. Our analysis shows that even when temporal consistency is achieved, the predictions can remain inaccurate.
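
The two temporal properties named in this abstract can be checked mechanically. The sketch below is only an illustration: the relation labels, the data layout, and the event names are hypothetical, and the abstract does not specify how the paper turns such violations into consistency scores.

```python
from itertools import permutations


def uniqueness_violations(answers):
    """Return event pairs for which the model affirmed more than one relation.

    answers maps (event_a, event_b) to the set of relations affirmed
    across prompts for that pair.
    """
    return [pair for pair, relations in answers.items() if len(relations) > 1]


def transitivity_violations(pred):
    """Return triples (a, b, c) where a BEFORE b and b BEFORE c were predicted,
    but the prediction for (a, c) exists and is not BEFORE.

    pred maps (event_a, event_b) to a single relation label.
    """
    events = sorted({event for pair in pred for event in pair})
    violations = []
    for a, b, c in permutations(events, 3):
        if (
            pred.get((a, b)) == "BEFORE"
            and pred.get((b, c)) == "BEFORE"
            and (a, c) in pred
            and pred[(a, c)] != "BEFORE"
        ):
            violations.append((a, b, c))
    return violations


# Hypothetical model outputs for three clinical events.
pred = {
    ("admission", "surgery"): "BEFORE",
    ("surgery", "discharge"): "BEFORE",
    ("admission", "discharge"): "AFTER",  # contradicts the two lines above
}
answers = {
    ("admission", "surgery"): {"BEFORE"},
    ("surgery", "discharge"): {"BEFORE", "AFTER"},  # two relations affirmed
}
print(transitivity_violations(pred))
print(uniqueness_violations(answers))
```

As the abstract notes, passing such checks does not make the predictions accurate: a model can be consistently wrong.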

Counterfactual Reasoning with Knowledge Graph Embeddings
Lena Zellinger | Andreas Stephan | Benjamin Roth
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge graph embeddings (KGEs) were originally developed to infer true but missing facts in incomplete knowledge repositories. In this paper, we link knowledge graph completion and counterfactual reasoning via our new task CFKGR. We model the original world state as a knowledge graph, hypothetical scenarios as edges added to the graph, and plausible changes to the graph as inferences from logical rules. We create corresponding benchmark datasets, which contain diverse hypothetical scenarios with plausible changes to the original knowledge graph and facts that should be retained. We develop COULDD, a general method for adapting existing knowledge graph embeddings given a hypothetical premise, and evaluate it on our benchmark. Our results indicate that KGEs learn patterns in the graph without explicit training. We further observe that KGEs adapted with COULDD solidly detect plausible counterfactual changes to the graph that follow these patterns. An evaluation on human-annotated data reveals that KGEs adapted with COULDD are mostly unable to recognize changes to the graph that do not follow learned inference rules. In contrast, ChatGPT mostly outperforms KGEs in detecting plausible changes to the graph but has poor knowledge retention. In summary, CFKGR connects two previously distinct areas, namely KG completion and counterfactual reasoning.

2023

CogMemLM: Human-Like Memory Mechanisms Improve Performance and Cognitive Plausibility of LLMs
Lukas Thoma | Ivonne Weyers | Erion Çano | Stefan Schweter | Jutta L Mueller | Benjamin Roth
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

ACTC: Active Threshold Calibration for Cold-Start Knowledge Graph Completion
Anastasiia Sedova | Benjamin Roth
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Self-supervised knowledge-graph completion (KGC) relies on estimating a scoring model over (entity, relation, entity)-tuples, for example, by embedding an initial knowledge graph. Prediction quality can be improved by calibrating the scoring model, typically by adjusting the prediction thresholds using manually annotated examples. In this paper, we attempt for the first time cold-start calibration for KGC, where no annotated examples exist initially for calibration, and only a limited number of tuples can be selected for annotation. Our new method ACTC finds good per-relation thresholds efficiently based on a limited set of annotated tuples. In addition to a few annotated tuples, ACTC also leverages unlabeled tuples by estimating their correctness with Logistic Regression or Gaussian Process classifiers. We also experiment with different methods for selecting candidate tuples for annotation: density-based and random selection. Experiments with five scoring models and an oracle annotator show an improvement of 7 percentage points when using ACTC in the challenging setting with an annotation budget of only 10 tuples, and an average improvement of 4 percentage points over different budgets.
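
The idea of choosing a per-relation threshold from a handful of annotated tuples can be sketched as follows. This is not ACTC itself: it omits the estimation of labels for unlabeled tuples and the selection of tuples for annotation, and the accuracy criterion, function names, relation name, and scores are all assumptions made for illustration.

```python
def best_threshold(scored):
    """Pick the threshold t that maximizes accuracy of predicting 'score >= t'.

    scored is a list of (score, is_correct) pairs for one relation.
    Returns (threshold, accuracy). Ties keep the lowest threshold.
    """
    if not scored:
        raise ValueError("need at least one annotated tuple")
    candidates = sorted({score for score, _ in scored})
    # One extra candidate above every score: reject all tuples.
    candidates.append(candidates[-1] + 1.0)
    best_t, best_acc = None, -1.0
    for t in candidates:
        acc = sum((score >= t) == label for score, label in scored) / len(scored)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def thresholds_per_relation(annotated):
    """Map each relation to its best threshold given its annotated tuples."""
    return {rel: best_threshold(scored)[0] for rel, scored in annotated.items()}


# Hypothetical annotation budget of four tuples for one relation.
annotated = {
    "born_in": [(0.2, False), (0.4, False), (0.6, True), (0.9, True)],
}
print(thresholds_per_relation(annotated))
```

With so few annotated tuples the chosen threshold is noisy, which is the motivation the abstract gives for also exploiting unlabeled tuples.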

Cross-functional Analysis of Generalization in Behavioral Learning
Pedro Henrique Luz de Araujo | Benjamin Roth
Transactions of the Association for Computational Linguistics, Volume 11

In behavioral testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimizing performance on the behavioral tests during training (behavioral learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, there is the risk that the model narrowly captures spurious correlations from the behavioral test suite, leading to overestimation and misrepresentation of model performance, one of the original pitfalls of traditional evaluation. In this work, we introduce BeLUGA, an analysis method for evaluating behavioral learning considering generalization across dimensions of different granularity levels. We optimize behavior-specific loss functions and evaluate models on several partitions of the behavioral test suite controlled to leave out specific phenomena. An aggregate score measures generalization to unseen functionalities (or overfitting). We use BeLUGA to examine three representative NLP tasks (sentiment analysis, paraphrase identification, and reading comprehension) and compare the impact of a diverse set of regularization and domain generalization methods on generalization performance.

Seeing through the mess: evolutionary dynamics of lexical polysemy
Andreas Baumann | Andreas Stephan | Benjamin Roth
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Evidently, words can have multiple senses. For example, the word mess refers to a place to have food or to a confusing situation. How exactly multiple senses emerge is less clear. In this work, we propose and analyze a mathematical model of the evolution of lexical meaning to investigate mechanisms leading to polysemy. This model features factors that have been discussed to impact the semantic processing and transmission of words: word frequency, non-conformism, and semantic discriminability. We formally derive conditions under which a sense of a word tends to diversify itself into multiple senses that coexist stably. The model predicts that diversification is promoted by low frequency, a strong bias for non-conformist usage, and high semantic discriminability. We statistically validate these predictions with historical language data covering semantic developments of a set of English words. Multiple alternative measures are used to operationalize each variable involved, and we confirm the predicted tendencies for twelve combinations of measures.

ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision
Anastasiia Sedova | Benjamin Roth
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

A cost-effective alternative to manual data labeling is weak supervision (WS), where data samples are automatically annotated using a predefined set of labeling functions (LFs), rule-based mechanisms that generate artificial labels for the associated classes. In this work, we investigate noise reduction techniques for WS based on the principle of k-fold cross-validation. We introduce a new algorithm ULF for Unsupervised Labeling Function correction, which denoises WS data by leveraging models trained on all but some LFs to identify and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. Evaluation on multiple datasets confirms ULF’s effectiveness in enhancing WS learning without the need for manual labeling.

Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)
Munir Georges | Aaricia Herygers | Annemarie Friedrich | Benjamin Roth
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

2022

WeaNF: Weak Supervision with Normalizing Flows
Andreas Stephan | Benjamin Roth
Proceedings of the 7th Workshop on Representation Learning for NLP

A popular approach to decrease the need for costly manual annotation of large data sets is weak supervision, which introduces problems of noisy labels, coverage and bias. Methods for overcoming these problems have relied either on discriminative models, trained with cost functions specific to weak supervision, or, more recently, on generative models that try to model the output of the automatic annotation process. In this work, we explore a novel direction of generative modeling for weak supervision: instead of modeling the output of the annotation process (the labeling function matches), we generatively model the input-side data distributions (the feature space) covered by labeling functions. Specifically, we estimate a density for each weak labeling source, or labeling function, by using normalizing flows. An integral part of our method is the flow-based modeling of multiple simultaneously matching labeling functions, and therefore phenomena such as labeling function overlap and correlations are captured. We analyze the effectiveness and modeling capabilities on various commonly used weak supervision data sets, and show that weakly supervised normalizing flows compare favorably to standard weak supervision baselines.

SepLL: Separating Latent Class Labels from Weak Supervision Noise
Andreas Stephan | Vasiliki Kougia | Benjamin Roth
Findings of the Association for Computational Linguistics: EMNLP 2022

In the weakly supervised learning paradigm, labeling functions automatically assign heuristic, often noisy, labels to data samples. In this work, we provide a method for learning from weak labels by separating two types of complementary information associated with the labeling functions: information related to the target label and information specific to one labeling function only. Both types of information are reflected to different degrees by all labeled instances. In contrast to previous works that aimed at correcting or removing wrongly labeled instances, we learn a branched deep model that uses all data as-is, but splits the labeling function information in the latent space. Specifically, we propose the end-to-end model SepLL, which extends a transformer classifier by introducing a latent space for labeling-function-specific and task-specific information. The learning signal is given only by the labeling function matches; no pre-processing or label model is required for our method. Notably, the task prediction is made from the latent layer without any direct task signal. Experiments on Wrench text classification tasks show that our model is competitive with the state-of-the-art, and yields a new best average performance.

Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection
Pedro Henrique Luz de Araujo | Benjamin Roth
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP

Behavioural testing—verifying system capabilities by validating human-designed input-output pairs—is an alternative evaluation method of natural language processing systems proposed to address the shortcomings of the standard approach: computing metrics on held-out data. While behavioural tests capture human prior knowledge and insights, there has been little exploration on how to leverage them for model training and development. With this in mind, we explore behaviour-aware learning by examining several fine-tuning schemes using HateCheck, a suite of functional tests for hate speech detection systems. To address potential pitfalls of training on data originally intended for evaluation, we train and evaluate models on different configurations of HateCheck by holding out categories of test cases, which enables us to estimate performance on potentially overlooked system properties. The fine-tuning procedure led to improvements in the classification accuracy of held-out functionalities and identity groups, suggesting that models can potentially generalise to overlooked functionalities. However, performance on held-out functionality classes and i.i.d. hate speech detection data decreased, which indicates that generalisation occurs mostly across functionalities from the same class and that the procedure led to overfitting to the HateCheck data distribution.

2021

KnowMAN: Weakly Supervised Multinomial Adversarial Networks
Luisa März | Ehsaneddin Asgari | Fabienne Braune | Franziska Zimmermann | Benjamin Roth
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The absence of labeled data for training neural models is often addressed by leveraging knowledge about the specific task, resulting in heuristic but noisy labels. The knowledge is captured in labeling functions, which detect certain regularities or patterns in the training samples and annotate corresponding labels for training. This process of weakly supervised training may result in an over-reliance on the signals captured by the labeling functions and hinder models from exploiting other signals or generalizing well. We propose KnowMAN, an adversarial scheme that makes it possible to control the influence of signals associated with specific labeling functions. KnowMAN forces the network to learn representations that are invariant to those signals and to pick up other signals that are more generally associated with an output label. KnowMAN strongly improves results compared to direct weakly supervised learning with a pre-trained transformer language model and a feature-based baseline.

Python for Linguists
Benjamin Roth | Michael Wiegand
Computational Linguistics, Volume 47, Issue 1 - March 2021

Knodle: Modular Weakly Supervised Learning with PyTorch
Anastasiia Sedova | Andreas Stephan | Marina Speranskaya | Benjamin Roth
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Strategies for improving the training and prediction quality of weakly supervised machine learning models vary in how much they are tailored to a specific task or integrated with a specific model architecture. In this work, we introduce Knodle, a software framework that treats weak data annotations, deep learning models, and methods for improving weakly supervised training as separate, modular components. This modularization gives the training process access to fine-grained information such as data set characteristics, matches of heuristic rules, or elements of the deep learning model ultimately used for prediction. Hence, our framework can encompass a wide range of training methods for improving weak supervision, ranging from methods that only look at correlations of rules and output classes (independently of the machine learning model trained with the resulting labels), to those that harness the interplay of neural networks and weakly labeled data. We illustrate the benchmarking potential of the framework with a performance comparison of several reference implementations on a selection of datasets that are already available in Knodle.

2020

Intent Recognition in Doctor-Patient Interviews
Robin Rojowiec | Benjamin Roth | Maximilian Fink
Proceedings of the Twelfth Language Resources and Evaluation Conference

Learning to interview patients to find out their disease is an essential part of the training of medical students. The practical part of this training has traditionally relied on paid actors that play the role of a patient to be interviewed. This process is expensive and severely limits the amount of practice per student. In this work, we present a novel data set and methods based on Natural Language Processing, for making progress towards modern applications and e-learning tools that support this training by providing language-based user interfaces with virtual patients. A data set of German transcriptions from live doctor-patient interviews was collected. These transcriptions are based on audio recordings of exercise sessions within the university and only the doctor’s utterances could be transcribed. We annotated each utterance with an intent inventory characterizing the purpose of the question or statement. For some intent classes, the data only contains a few samples, and we apply Information Retrieval and Deep Learning methods that are robust with respect to small amounts of training data for recognizing the intent of an utterance and providing the correct response. Our results show that the models are effective and they provide baseline performance scores on the data set for further research.

Dirichlet-Smoothed Word Embeddings for Low-Resource Settings
Jakob Jungmaier | Nora Kassner | Benjamin Roth
Proceedings of the Twelfth Language Resources and Evaluation Conference

Nowadays, classical count-based word embeddings using positive pointwise mutual information (PPMI) weighted co-occurrence matrices have been widely superseded by machine-learning-based methods like word2vec and GloVe. But these methods are usually applied using very large amounts of text data. In many cases, however, there is not much text data available, for example for specific domains or low-resource languages. This paper revisits PPMI by adding Dirichlet smoothing to correct its bias towards rare words. We evaluate on standard word similarity data sets and compare to word2vec and the recent state of the art for low-resource settings: Positive and Unlabeled (PU) Learning for word embeddings. The proposed method outperforms PU-Learning for low-resource settings and obtains competitive results for Maltese and Luxembourgish.

UniSent: Universal Adaptable Sentiment Lexica for 1000+ Languages
Ehsaneddin Asgari | Fabienne Braune | Benjamin Roth | Christoph Ringlstetter | Mohammad Mofrad
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we introduce UniSent, universal sentiment lexica for 1000+ languages. Sentiment lexica are vital for sentiment analysis in the absence of document-level annotations, a very common scenario for low-resource languages. To the best of our knowledge, UniSent is the largest sentiment resource to date in terms of the number of covered languages, including many low-resource ones. In this work, we use a massively parallel Bible corpus to project sentiment information from English to other languages for sentiment analysis on Twitter data. We introduce a method called DomDrift to mitigate the huge domain mismatch between Bible and Twitter by a confidence weighting scheme that uses domain-specific embeddings to compare the nearest neighbors for a candidate sentiment word in the source (Bible) and target (Twitter) domain. We evaluate the quality of UniSent in a subset of languages for which manually created ground truth was available: Macedonian, Czech, German, Spanish, and French. We show that the quality of UniSent is comparable to manually created sentiment resources when it is used as the sentiment seed for the task of word sentiment prediction on top of embedding representations. In addition, we show that emoticon sentiments could be reliably predicted in the Twitter domain using only UniSent and monolingual embeddings in German, Spanish, French, and Italian. With the publication of this paper, we release the UniSent sentiment lexica at http://language-lab.info/unisent.

2019

Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
Vivi Nastase | Benjamin Roth | Laura Dietz | Andrew McCallum
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

Domain adaptation for part-of-speech tagging of noisy user-generated text
Luisa März | Dietrich Trautmann | Benjamin Roth
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The performance of a Part-of-speech (POS) tagger is highly dependent on the domain of the processed text, and for many domains there is no or only very little training data available. This work addresses the problem of POS tagging noisy user-generated text using a neural network. We propose an architecture that trains an out-of-domain model on a large newswire corpus, and transfers those weights by using them as a prior for a model trained on the target domain (a data-set of German Tweets) for which very few annotations are available. The neural network has a standard bidirectional LSTM at its core. However, we find it crucial to also encode a set of task-specific features, and to obtain reliable (source-domain and target-domain) word representations. Experiments with different regularization techniques such as early stopping, dropout and fine-tuning the domain adaptation prior weights are conducted. Our best model uses external weights from the out-of-domain model, as well as feature embeddings, pre-trained word and sub-word embeddings and achieves a tagging accuracy of slightly over 90%, improving on the previous state of the art for this task.

Interpretable Question Answering on Knowledge Bases and Text
Alona Sydorova | Nina Poerner | Benjamin Roth
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Interpretability of machine learning (ML) models becomes more relevant with their increasing adoption. In this work, we address the interpretability of ML based question answering (QA) models on a combination of knowledge bases (KB) and text documents. We adapt post hoc explanation methods such as LIME and input perturbation (IP) and compare them with the self-explanatory attention mechanism of the model. For this purpose, we propose an automatic evaluation paradigm for explanation methods in the context of QA. We also conduct a study with human annotators to evaluate whether explanations help them identify better QA models. Our results suggest that IP provides better explanations than LIME or attention, according to both automatic and human evaluation. We obtain the same ranking of methods in both experiments, which supports the validity of our automatic evaluation paradigm.

2018

Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement
Nina Poerner | Hinrich Schütze | Benjamin Roth
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The behavior of deep neural networks (DNNs) is hard to understand. This makes it necessary to explore post hoc explanation methods. We conduct the first comprehensive evaluation of explanation methods for NLP. To this end, we design two novel evaluation paradigms that cover two important classes of NLP problems: small context and large context problems. Both paradigms require no manual annotation and are therefore broadly applicable. We also introduce LIMSSE, an explanation method inspired by LIME that is designed for NLP. We show empirically that LIMSSE, LRP and DeepLIFT are the most effective explanation methods and recommend them for explaining DNNs in NLP.

Joint Aspect and Polarity Classification for Aspect-based Sentiment Analysis with End-to-End Neural Networks
Martin Schmitt | Simon Steinheber | Konrad Schreiber | Benjamin Roth
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this work, we propose a new model for aspect-based sentiment analysis. In contrast to previous approaches, we jointly model the detection of aspects and the classification of their polarity in an end-to-end trainable neural network. We conduct experiments with different neural architectures and word representations on the recent GermEval 2017 dataset. We were able to show considerable performance gains by using the joint modeling approach in all settings compared to pipeline approaches. The combination of a convolutional neural network and fasttext embeddings outperformed the best submission of the shared task in 2017, establishing a new state of the art.

Joint Bootstrapping Machines for High Confidence Relation Extraction
Pankaj Gupta | Benjamin Roth | Hinrich Schütze
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Semi-supervised bootstrapping techniques for relationship extraction from text iteratively expand a set of initial seed instances. Due to the lack of labeled data, a key challenge in bootstrapping is semantic drift: if a false positive instance is added during an iteration, then all following iterations are contaminated. We introduce BREX, a new bootstrapping method that protects against such contamination by highly effective confidence assessment. This is achieved by using entity and template seeds jointly (as opposed to just one as in previous work), by expanding entities and templates in parallel and in a mutually constraining fashion in each iteration and by introducing higher-quality similarity measures for templates. Experimental results show that BREX achieves an F1 that is 0.13 (0.87 vs. 0.74) better than the state of the art for four relationships.

Interpretable Textual Neuron Representations for NLP
Nina Poerner | Benjamin Roth | Hinrich Schütze
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Input optimization methods, such as Google Deep Dream, create interpretable representations of neurons for computer vision DNNs. We propose and evaluate ways of transferring this technology to NLP. Our results suggest that gradient ascent with a Gumbel softmax layer produces n-gram representations that outperform naive corpus search in terms of target neuron activation. The representations highlight differences in syntax awareness between the language and visual models of the Imaginet architecture.

2017

Towards Bootstrapping a Polarity Shifter Lexicon using Linguistic Features
Marc Schulder | Michael Wiegand | Josef Ruppenhofer | Benjamin Roth
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We present a major step towards the creation of the first high-coverage lexicon of polarity shifters. In this work, we bootstrap a lexicon of verbs by exploiting various linguistic features. Polarity shifters, such as “abandon”, are similar to negations (e.g. “not”) in that they move the polarity of a phrase towards its inverse, as in “abandon all hope”. While there exist lists of negation words, creating comprehensive lists of polarity shifters is far more challenging due to their sheer number. On a sample of manually annotated verbs we examine a variety of linguistic features for this task. Then we build a supervised classifier to increase coverage. We show that this approach drastically reduces the annotation effort while ensuring a high-precision lexicon. We also show that our acquired knowledge of verbal polarity shifters improves phrase-level sentiment analysis.

2016

Comparing Convolutional Neural Networks to Traditional Models for Slot Filling
Heike Adel | Benjamin Roth | Hinrich Schütze
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multilingual Relation Extraction using Compositional Universal Schema
Patrick Verga | David Belanger | Emma Strubell | Benjamin Roth | Andrew McCallum
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

Compositional Vector Space Models for Knowledge Base Completion
Arvind Neelakantan | Benjamin Roth | Andrew McCallum
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Automatic Food Categorization from Large Unlabeled Corpora and Its Impact on Relation Extraction
Michael Wiegand | Benjamin Roth | Dietrich Klakow
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

RelationFactory: A Fast, Modular and Effective System for Knowledge Base Population
Benjamin Roth | Tassilo Barth | Grzegorz Chrupała | Martin Gropp | Dietrich Klakow
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Unsupervised Parsing for Generating Surface-Based Relation Extraction Patterns
Jens Illig | Benjamin Roth | Dietrich Klakow
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

Combining Generative and Discriminative Model Scores for Distant Supervision
Benjamin Roth | Dietrich Klakow
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

A Gold Standard for Relation Extraction in the Food Domain
Michael Wiegand | Benjamin Roth | Eva Lasarcyk | Stephanie Köser | Dietrich Klakow
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a gold standard for semantic relation extraction in the food domain for German. The relation types that we address are motivated by scenarios for which IT applications present a commercial potential, such as virtual customer advice in which a virtual agent assists a customer in a supermarket in finding those products that satisfy their needs best. Moreover, we focus on those relation types that can be extracted from natural language text corpora, ideally content from the internet, such as web forums, that are easy to retrieve. A typical relation type that meets these requirements are pairs of food items that are usually consumed together. Such a relation type could be used by a virtual agent to suggest additional products available in a shop that would potentially complement the items a customer has already in their shopping cart. Our gold standard comprises structural data, i.e. relation tables, which encode relation instances. These tables are vital in order to evaluate natural language processing systems that extract those relations.

2010

Machine Translation Using Overlapping Alignments and SampleRank
Benjamin Roth | Andrew McCallum | Marc Dymetman | Nicola Cancedda
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We present a conditional-random-field approach to discriminatively-trained phrase-based machine translation in which training and decoding are both cast in a sampling framework and are implemented uniformly in a new probabilistic programming language for factor graphs. In traditional phrase-based translation, decoding infers both a "Viterbi" alignment and the target sentence. In contrast, in our approach, a rich overlapping-phrase alignment is produced by a fast deterministic method, while probabilistic decoding infers only the target sentence, which is then able to leverage arbitrary features of the entire source sentence, target sentence and alignment. By using SampleRank for learning we could in principle efficiently estimate hundreds of thousands of parameters. Test-time decoding is done by MCMC sampling with annealing. To demonstrate the potential of our approach we show preliminary experiments leveraging alignments that may contain overlapping bi-phrases.

Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
Linlin Li | Benjamin Roth | Caroline Sporleder
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

A survey on the role of negation in sentiment analysis
Michael Wiegand | Alexandra Balahur | Benjamin Roth | Dietrich Klakow | Andrés Montoyo
Proceedings of the Workshop on Negation and Speculation in Natural Language Processing

Co-authors

Anastasiia Sedova 5

Andrew McCallum 4

Klim Zaporojets 3

Ehsaneddin Asgari 2

Matthias Aßenmacher 2

Andreas Baumann 2

Terra Blevins 2

Fabienne Braune 2

Vasiliki Kougia 2

Lukas Miklautz 2

Claudia Plant 2

Loris Schoenegger 2

Alexandra Balahur 1

Tassilo Barth 1

David Belanger 1

Nicola Cancedda 1

Grzegorz Chrupała 1

Marc Dymetman 1

Maximilian Fink 1

Diego Frassinelli 1

Annemarie Friedrich 1

Munir Georges 1

Dagmar Gromann 1

Michael A. Hedderich 1

Aaricia Herygers 1

Christian Heumann 1

Marwin Härttrich 1

Jakob Jungmaier 1

Brigitte Krenn 1

Stephanie Köser 1

Collin Leiber 1

Robert Litschko 1

Ali Modarressi 1

Mohammad Mofrad 1

Andrés Montoyo 1

Jutta L Mueller 1

Arvind Neelakantan 1

Barbara Plank 1

Christoph Ringlstetter 1

Robin Rojowiec 1

Josef Ruppenhofer 1

Dominik Répás 1

Paul Röttger 1

Susanne Schmalwieser 1

Martin Schmitt 1

Konrad Schreiber 1

Marc Schulder 1

Stefan Schweter 1

Marina Speranskaya 1

Caroline Sporleder 1

Kinga Stańczak 1

Simon Steinheber 1

Andreas Joseph Stephan 1

Emma Strubell 1

Alona Sydorova 1

Alexander Tampier 1

Dietrich Trautmann 1

Patrick Verga 1

Jan Philip Wahle 1

Leonie Weissweiler 1

Ivonne Weyers 1

Lena Zellinger 1

Franziska Zimmermann 1

Venues