Georg Groh - ACL Anthology

Georg Groh

2026

DIALECTIC: A Multi-Agent System for Startup Evaluation
Jae Yoon Bae | Simon Malberg | Joyce Ann Clarize Galang | Andre Retterath | Georg Groh
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Venture capital (VC) investors face a large number of investment opportunities but only invest in few of these, with even fewer ending up successful. Early-stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM-based multi-agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural-language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.

2025

Tuning Into Bias: A Computational Study of Gender Bias in Song Lyrics
Danqing Chen | Adithi Satish | Rasul Khanbayov | Carolin Schuster | Georg Groh
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

The application of text mining methods is becoming increasingly prevalent, particularly within Humanities and Computational Social Sciences, as well as in a broader range of disciplines. This paper presents an analysis of gender bias in English song lyrics using topic modeling and bias measurement techniques. Leveraging BERTopic, we cluster a dataset of 537,553 English songs into distinct topics and analyze their temporal evolution. Our results reveal a significant thematic shift in song lyrics over time, transitioning from romantic themes to a heightened focus on the sexualization of women. Additionally, we observe a substantial prevalence of profanity and misogynistic content across various topics, with a particularly high concentration in the largest thematic cluster. To further analyse gender bias across topics and genres in a quantitative way, we employ the Single Category Word Embedding Association Test (SC-WEAT) to calculate bias scores for word embeddings trained on the most prominent topics as well as individual genres. The results indicate a consistent male bias in words associated with intelligence and strength, while appearance and weakness words show a female bias. Further analysis highlights variations in these biases across topics, illustrating the interplay between thematic content and gender stereotypes in song lyrics.

Simplifications Are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions
Lukas Ellinger | Miriam Anschütz | Georg Groh
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms—words with multiple meanings—where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.

TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification
Miriam Anschütz | Ekaterina Gikalo | Niklas Herbster | Georg Groh
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the "{textit{SemEval-2025 Task-3 — Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes}}”. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
Lukas Ellinger | Georg Groh
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)

Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs’ handling of ambiguity and to ensure robust performance across diverse communication styles.

(Dis)improved?! How Simplified Language Affects Large Language Model Performance across Languages
Miriam Anschütz | Anastasiya Damaratskaya | Chaeeun Joy Lee | Arthur Schmalz | Edoardo Mosca | Georg Groh
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Simplified language enhances the accessibility and human understanding of texts. However, whether it also benefits large language models (LLMs) remains underexplored. This paper extensively studies whether LLM performance improves on simplified data compared to its original counterpart. Our experiments span six datasets and nine automatic simplification systems across three languages. We show that English models, including GPT-4o-mini, show a weak generalization and exhibit a significant performance drop on simplified data. This introduces an intriguing paradox: simplified data is helpful for humans but not for LLMs. At the same time, the performance in non-English languages sometimes improves, depending on the task and quality of the simplifier. Our findings offer a comprehensive view of the impact of simplified language on LLM performance and uncover severe implications for people depending on simple language.

Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings
Carolin M. Schuster | Maria-Alexandra Roman | Shashwat Ghatiwala | Georg Groh
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI), however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuition and use case for exposing and visualizing bias.

German4All – A Dataset and Model for Readability-Controlled Paraphrasing in German
Miriam Anschütz | Thanh Mai Pham | Eslam Nasrallah | Maximilian Müller | Cristian-George Craciun | Georg Groh
Proceedings of the 18th International Natural Language Generation Conference

The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.

Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling
Florian Eichin | Carolin M. Schuster | Georg Groh | Michael A. Hedderich
Findings of the Association for Computational Linguistics: EMNLP 2025

Topic modeling is a key method in text analysis, but existing approaches fail to efficiently scale to large datasets or are limited by assuming one topic per document. Overcoming these limitations, we introduce Semantic Component Analysis (SCA), a topic modeling technique that discovers multiple topics per sample by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. There, it achieves competetive coherence and diversity compared to BERTopic, while uncovering at least double the topics and maintaining a noise rate close to zero. We also find that SCA outperforms the LLM-based TopicGPT in scenarios with similar compute budgets. SCA thus provides an effective and efficient approach for topic modeling of large datasets.

A Comprehensive Evaluation of Cognitive Biases in LLMs
Simon Malberg | Roman Poletukhin | Carolin M. Schuster | Georg Groh
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code and dataset to encourage future research on cognitive biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms.

From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents
Tobias Lindenbauer | Georg Groh | Hinrich Schuetze
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM . We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

From Causal Parrots to Causal Prophets? Towards Sound Causal Reasoning with Large Language Models
Rahul Babu Shrestha | Simon Malberg | Georg Groh
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Causal reasoning is a fundamental property of human and machine intelligence. While large language models (LLMs) excel in many natural language tasks, their ability to infer causal relationships beyond memorized associations is debated. This study systematically evaluates recent LLMs’ causal reasoning across three levels of Pearl’s Ladder of Causation—associational, interventional, and counterfactual—as well as commonsensical, anti-commonsensical, and nonsensical causal structures using the CLadder dataset. We further explore the effectiveness of prompting techniques, including chain of thought (CoT), self-consistency (SC), and causal chain of thought (CausalCoT), in enhancing causal reasoning, and propose two new techniques causal tree of thoughts (CausalToT) and causal program of thoughts (CausalPoT). While larger models tend to outperform smaller ones and are generally more robust against perturbations, our results indicate that all tested LLMs still have difficulties, especially with counterfactual reasoning. However, our CausalToT and CausalPoT significantly improve performance over existing prompting techniques, suggesting that hybrid approaches combining LLMs with formal reasoning frameworks can mitigate these limitations. Our findings contribute to understanding LLMs’ reasoning capacities and outline promising strategies for improving their ability to reason causally as humans would. We release our code and data.

Adaptive Parameter Compression for Language Models
Jeremias Bohn | Frederic Mrozinski | Georg Groh
Findings of the Association for Computational Linguistics: NAACL 2025

Cross-lingual Text Classification Transfer: The Case of Ukrainian
Daryna Dementieva | Valeriia Khylenko | Georg Groh
Proceedings of the 31st International Conference on Computational Linguistics

Despite the extensive amount of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. To support further fair development of NLP models, exploring the possibilities of effective knowledge transfer to new languages is crucial. Ukrainian, in particular, stands as a language that still can benefit from the continued refinement of cross-lingual methodologies. Due to our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks, i.e., different types of style, or harmful speech, or texts relationships. However, the amount of resources required for such corpora collection from scratch is understandable. In this work, we leverage the state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods avoiding manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks—toxicity classification, formality classification, and natural language inference (NLI)—providing the “recipe” for the optimal setups for each task.

2024

Overview of the GermEval 2024 Shared Task on Statement Segmentation in German Easy Language (StaGE)
Thorben Schomacker | Miriam Anschütz | Regina Stodden | Georg Groh | Marina Tropmann-Frick
Proceedings of GermEval 2024 Shared Task on Statement Segmentation in German Easy Language (StaGE)

From Language to Pixels: Task Recognition and Task Learning in LLMs
Janek Falkenstein | Carolin M. Schuster | Alexander H. Berger | Georg Groh
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP

LLMs can perform unseen tasks by learning from a few in-context examples. How in-context learning works is still uncertain. We investigate the mechanisms of in-context learning on a challenging non-language task. The task requires the LLM to generate pixel matrices representing images of basic shapes. We introduce a framework to analyze if this task is solved by recognizing similar formats from the training data (task recognition) or by understanding the instructions and learning the skill de novo during inference (task learning). Our experiments demonstrate that LLMs generate meaningful pixel matrices with task recognition and fail to learn such tasks when encountering unfamiliar formats. Our findings offer insights into LLMs’ learning mechanisms and their generalization ability to guide future research on their seemingly human-like behavior.

Simpler Becomes Harder: Do LLMs Exhibit a Coherent Behavior on Simplified Corpora?
Miriam Anschütz | Edoardo Mosca | Georg Groh
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

Text simplification seeks to improve readability while retaining the original content and meaning. Our study investigates whether pre-trained classifiers also maintain such coherence by comparing their predictions on both original and simplified inputs. We conduct experiments using 11 pre-trained models, including BERT and OpenAI’s GPT 3.5, across six datasets spanning three languages. Additionally, we conduct a detailed analysis of the correlation between prediction change rates and simplification types/strengths. Our findings reveal alarming inconsistencies across all languages and models. If not promptly addressed, simplified inputs can be easily exploited to craft zero-iteration model-agnostic adversarial attacks with success rates of up to 50%.

Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication
Miriam Anschütz | Tringa Sylaj | Georg Groh
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

Explanatory images play a pivotal role in accessible and easy-to-read (E2R) texts. However, the images available in online databases are not tailored toward the respective texts, and the creation of customized images is expensive. In this large-scale study, we investigated whether text-to-image generation models can close this gap by providing customizable images quickly and easily. We benchmarked seven, four open- and three closed-source, image generation models and provide an extensive evaluation of the resulting images. In addition, we performed a user study with people from the E2R target group to examine whether the images met their requirements. We find that some of the models show remarkable performance, but none of the models are ready to be used at a larger scale without human supervision. Our research is an important step toward facilitating the creation of accessible information for E2R creators and tailoring accessible images to the target group’s needs.

Toxicity Classification in Ukrainian
Daryna Dementieva | Valeriia Khylenko | Nikolay Babakov | Georg Groh
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

The task of toxicity detection is still a relevant task, especially in the context of safe and fair LMs development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, there has been no existing toxicity classification corpus in Ukrainian. In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i)~translating from an English corpus, (ii)~filtering toxic samples using keywords, and (iii)~annotating with crowdsourcing. We compare LLMs prompting and other cross-lingual transfer approaches with and without fine-tuning offering insights into the most robust and efficient baselines.

Crafting Tomorrow’s Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian
Cem Üyük | Danica Rovó | Shaghayeghkolli Shaghayeghkolli | Rabia Varol | Georg Groh | Daryna Dementieva
Proceedings of the Third Workshop on NLP for Positive Impact

In the era dominated by information overload and its facilitation with Large Language Models (LLMs), the prevalence of misinformation poses a significant threat to public discourse and societal well-being. A critical concern at present involves the identification of machine-generated news. In this work, we take a significant step by introducing a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian. The dataset incorporates outputs from multiple multilingual generators (in both, zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4. Next, we experiment with a variety of classifiers, ranging from those based on linguistic features to advanced Transformer-based models and LLMs prompting. We present the detection results aiming to delve into the interpretablity and robustness of machine-generated texts detectors across all target languages.

Exploring Hallucinations in Task-oriented Dialogue Systems with Narrow Domains
Yan Pan | Davide Cadamuro | Georg Groh
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

2023

IFAN: An Explainability-Focused Interaction Framework for Humans and NLP Models
Edoardo Mosca | Daryna Dementieva | Tohid Ebrahim Ajdari | Maximilian Kummeth | Kirill Gringauz | Yutong Zhou | Georg Groh
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

Data-Augmented Task-Oriented Dialogue Response Generation with Domain Adaptation
Yan Pan | Davide Cadamuro | Georg Groh
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

AdamR at SemEval-2023 Task 10: Solving the Class Imbalance Problem in Sexism Detection with Ensemble Learning
Adam Rydelek | Daryna Dementieva | Georg Groh
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The Explainable Detection of Online Sexism task presents the problem of explainable sexism detection through fine-grained categorisation of sexist cases with three subtasks. Our team experimented with different ways to combat class imbalance throughout the tasks using data augmentation and loss alteration techniques. We tackled the challenge by utilising ensembles of Transformer models trained on different datasets, which are tested to find the balance between performance and interpretability. This solution ranked us in the top 40% of teams for each of the tracks.

Distinguishing Fact from Fiction: A Benchmark Dataset for Identifying Machine-Generated Scientific Papers in the LLM Era.
Edoardo Mosca | Mohamed Hesham Ibrahim Abdalla | Paolo Basso | Margherita Musumeci | Georg Groh
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

As generative NLP can now produce content nearly indistinguishable from human writing, it becomes difficult to identify genuine research contributions in academic writing and scientific publications. Moreover, information in NLP-generated text can potentially be factually wrong or even entirely fabricated. This study introduces a novel benchmark dataset, containing human-written and machine-generated scientific papers from SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica. After describing the generation and extraction pipelines, we also experiment with four distinct classifiers as a baseline for detecting the authorship of scientific text. A strong focus is put on generalization capabilities and explainability to highlight the strengths and weaknesses of detectors. We believe our work serves as an important step towards creating more robust methods for distinguishing between human-written and machine-generated scientific papers, ultimately ensuring the integrity of scientific literature.

This is not correct! Negation-aware Evaluation of Language Generation Systems
Miriam Anschütz | Diego Miguel Lozano | Georg Groh
Proceedings of the 16th International Natural Language Generation Conference

Large language models underestimate the impact of negations on how much they change the meaning of a sentence. Therefore, learned evaluation metrics based on these models are insensitive to negations. In this paper, we propose NegBLEURT, a negation-aware version of the BLEURT evaluation metric. For that, we designed a rule-based sentence negation tool and used it to create the CANNOT negation evaluation dataset. Based on this dataset, we fine-tuned a sentence transformer and an evaluation metric to improve their negation sensitivity. Evaluating these models on existing benchmarks shows that our fine-tuned models outperform existing metrics on the negated sentences by far while preserving their base models’ performances on other perturbations.

Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training
Miriam Anschütz | Joshua Oehms | Thomas Wimmer | Bartłomiej Jezierski | Georg Groh
Findings of the Association for Computational Linguistics: ACL 2023

Automatic text simplification systems help to reduce textual information barriers on the internet. However, for languages other than English, only few parallel data to train these systems exists. We propose a two-step approach to overcome this data scarcity issue. First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German. Then, we used these models as decoders in a sequence-to-sequence simplification task. We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts. Moreover, with the style-specific pre-training, we reduced the number of trainable parameters in text simplification models. Hence, less parallel data is sufficient for training. Our results indicate that pre-training on unaligned data can reduce the required parallel data while improving the performance on downstream tasks.

Adam-Smith at SemEval-2023 Task 4: Discovering Human Values in Arguments with Ensembles of Transformer-based Models
Daniel Schroter | Daryna Dementieva | Georg Groh
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper presents the best-performing approach alias “Adam Smith” for the SemEval-2023 Task 4: “Identification of Human Values behind Arguments”. The goal of the task was to create systems that automatically identify the values within textual arguments. We train transformer-based models until they reach their loss minimum or f1-score maximum. Ensembling the models by selecting one global decision threshold that maximizes the f1-score leads to the best-performing system in the competition. Ensembling based on stacking with logistic regressions shows the best performance on an additional dataset provided to evaluate the robustness (“Nahj al-Balagha”). Apart from outlining the submitted system, we demonstrate that the use of the large ensemble model is not necessary and that the system size can be significantly reduced.

2022

Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations
Lukas Huber | Marc Alexander Kühn | Edoardo Mosca | Georg Groh
Proceedings of the 7th Workshop on Representation Learning for NLP

State-of-the-art machine learning models are prone to adversarial attacks”:” Maliciously crafted inputs to fool the model into making a wrong prediction, often with high confidence. While defense strategies have been extensively explored in the computer vision domain, research in natural language processing still lacks techniques to make models resilient to adversarial text inputs. We adapt a technique from computer vision to detect word-level attacks targeting text classifiers. This method relies on training an adversarial detector leveraging Shapley additive explanations and outperforms the current state-of-the-art on two benchmarks. Furthermore, we prove the detector requires only a low amount of training samples and, in some cases, generalizes to different datasets without needing to retrain.

SHAP-Based Explanation Methods: A Review for NLP Interpretability
Edoardo Mosca | Ferenc Szigeti | Stella Tragianni | Daniel Gallagher | Georg Groh
Proceedings of the 29th International Conference on Computational Linguistics

Model explanations are crucial for the transparent, safe, and trustworthy deployment of machine learning models. The SHapley Additive exPlanations (SHAP) framework is considered by many to be a gold standard for local explanations thanks to its solid theoretical background and general applicability. In the years following its publication, several variants appeared in the literature—presenting adaptations in the core assumptions and target applications. In this work, we review all relevant SHAP-based interpretability approaches available to date and provide instructive examples as well as recommendations regarding their applicability to NLP use cases.

GrammarSHAP: An Efficient Model-Agnostic and Structure-Aware NLP Explainer
Edoardo Mosca | Defne Demirtürk | Luca Mülln | Fabio Raffagnato | Georg Groh
Proceedings of the First Workshop on Learning with Natural Language Supervision

Interpreting NLP models is fundamental for their development as it can shed light on hidden properties and unexpected behaviors. However, while transformer architectures exploit contextual information to enhance their predictive capabilities, most of the available methods to explain such predictions only provide importance scores at the word level. This work addresses the lack of feature attribution approaches that also take into account the sentence structure. We extend the SHAP framework by proposing GrammarSHAP—a model-agnostic explainer leveraging the sentence’s constituency parsing to generate hierarchical importance scores.

TUM Social Computing at GermEval 2022: Towards the Significance of Text Statistics and Neural Embeddings in Text Complexity Prediction
Miriam Anschütz | Georg Groh
Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text

In this paper, we describe our submission to the GermEval 2022 Shared Task on Text Complexity Assessment of German Text. It addresses the problem of predicting the complexity of German sentences on a continuous scale. While many related works still rely on handcrafted statistical features, neural networks have emerged as state-of-the-art in other natural language processing tasks. Therefore, we investigate how both can complement each other and which features are most relevant for text complexity prediction in German. We propose a fine-tuned German DistilBERT model enriched with statistical text features that achieved fourth place in the shared task with a RMSE of 0.481 on the competition’s test data.

Long Input Dialogue Summarization with Sketch Supervision for Summarization of Primetime Television Transcripts
Nataliia Kees | Thien Nguyen | Tobias Eder | Georg Groh
Proceedings of the Workshop on Automatic Summarization for Creative Writing

This paper presents our entry to the CreativeSumm 2022 shared task. Specifically tackling the problem of prime-time television screenplay summarization based on the SummScreen Forever Dreaming dataset. Our approach utilizes extended Longformers combined with sketch supervision including categories specifically for scene descriptions. Our system was able to produce the shortest summaries out of all submissions. While some problems with factual consistency still remain, the system was scoring highest among competitors in the ROUGE and BERTScore evaluation categories.

“That Is a Suspicious Reaction!”: Interpreting Logits Variation to Detect NLP Adversarial Attacks
Edoardo Mosca | Shreyash Agarwal | Javier Rando Ramírez | Georg Groh
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.

User Satisfaction Modeling with Domain Adaptation in Task-oriented Dialogue Systems
Yan Pan | Mingyang Ma | Bernhard Pflugfelder | Georg Groh
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

User Satisfaction Estimation (USE) is crucial in helping measure the quality of a task-oriented dialogue system. However, the complex nature of implicit responses poses challenges in detecting user satisfaction, and most datasets are limited in size or not available to the public due to user privacy policies. Unlike task-oriented dialogue, large-scale annotated chitchat with emotion labels is publicly available. Therefore, we present a novel user satisfaction model with domain adaptation (USMDA) to utilize this chitchat. We adopt a dialogue Transformer encoder to capture contextual features from the dialogue. And we reduce domain discrepancy to learn dialogue-related invariant features. Moreover, USMDA jointly learns satisfaction signals in the chitchat context with user satisfaction estimation, and user actions in task-oriented dialogue with dialogue action recognition. Experimental results on two benchmarks show that our proposed framework for the USE task outperforms existing unsupervised domain adaptation methods. To the best of our knowledge, this is the first work to study user satisfaction estimation with unsupervised domain adaptation from chitchat to task-oriented dialogue.

Explaining Neural NLP Models for the Joint Analysis of Open-and-Closed-Ended Survey Answers
Edoardo Mosca | Katharina Harmann | Tobias Eder | Georg Groh
Proceedings of the 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP 2022)

Large-scale surveys are a widely used instrument to collect data from a target audience. Beyond the single individual, an appropriate analysis of the answers can reveal trends and patterns and thus generate new insights and knowledge for researchers. Current analysis practices employ shallow machine learning methods or rely on (biased) human judgment. This work investigates the usage of state-of-the-art NLP models such as BERT to automatically extract information from both open- and closed-ended questions. We also leverage explainability methods at different levels of granularity to further derive knowledge from the analysis model. Experiments on EMS—a survey-based study researching influencing factors affecting a student’s career goals—show that the proposed approach can identify such factors both at the input- and higher concept-level.

2021

Investigating Annotator Bias in Abusive Language Datasets
Maximilian Wich | Christian Widmer | Gerhard Hagerer | Georg Groh
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Nowadays, social media platforms use classification models to cope with hate speech and abusive language. The problem of these models is their vulnerability to bias. A prevalent form of bias in hate speech and abusive language datasets is annotator bias caused by the annotator’s subjective perception and the complexity of the annotation task. In our paper, we develop a set of methods to measure annotator bias in abusive language datasets and to identify different perspectives on abusive language. We apply these methods to four different abusive language datasets. Our proposed approach supports annotation processes of such datasets and future research addressing different perspectives on the perception of abusive language.

End-to-End Annotator Bias Approximation on Crowdsourced Single-Label Sentiment Analysis
Gerhard Hagerer | David Szabo | Andreas Koch | Maria Luisa Ripoll Dominguez | Christian Widmer | Maximilian Wich | Hannah Danner | Georg Groh
Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021)

German Abusive Language Dataset with Focus on COVID-19
Maximilian Wich | Svenja Räther | Georg Groh
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

Understanding and Interpreting the Impact of User Context in Hate Speech Detection
Edoardo Mosca | Maximilian Wich | Georg Groh
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

As hate speech spreads on social media and online communities, research continues to work on its automatic detection. Recently, recognition performance has been increasing thanks to advances in deep learning and the integration of user features. This work investigates the effects that such features can have on a detection model. Unlike previous research, we show that simple performance comparison does not expose the full impact of including contextual- and user information. By leveraging explainability techniques, we show (1) that user features play a role in the model’s decision and (2) how they affect the feature space learned by the model. Besides revealing that—and also illustrating why—user features are the reason for performance gains, we show how such techniques can be combined to better understand the model and to detect unintended bias.

SocialVisTUM: An Interactive Visualization Toolkit for Correlated Neural Topic Models on Social Media Opinion Mining
Gerhard Hagerer | Martin Kirchhoff | Hannah Danner | Robert Pesch | Mainak Ghosh | Archishman Roy | Jiaxi Zhao | Georg Groh
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Recent research in opinion mining proposed word embedding-based topic modeling methods that provide superior coherence compared to traditional topic modeling. In this paper, we demonstrate how these methods can be used to display correlated topic models on social media texts using SocialVisTUM, our proposed interactive visualization toolkit. It displays a graph with topics as nodes and their correlations as edges. Further details are displayed interactively to support the exploration of large text collections, e.g., representative words and sentences of topics, topic and sentiment distributions, hierarchical topic clustering, and customizable, predefined topic labels. The toolkit optimizes automatically on custom data for optimal coherence. We show a working instance of the toolkit on data crawled from English social media discussions about organic food consumption. The visualization confirms findings of a qualitative consumer research study. SocialVisTUM and its training procedures are accessible online.

2020

An Evaluation of Progressive Neural Networksfor Transfer Learning in Natural Language Processing
Abdul Moeed | Gerhard Hagerer | Sumit Dugar | Sarthak Gupta | Mainak Ghosh | Hannah Danner | Oliver Mitevski | Andreas Nawroth | Georg Groh
Proceedings of the Twelfth Language Resources and Evaluation Conference

A major challenge in modern neural networks is the utilization of previous knowledge for new tasks in an effective manner, otherwise known as transfer learning. Fine-tuning, the most widely used method for achieving this, suffers from catastrophic forgetting. The problem is often exacerbated in natural language processing (NLP). In this work, we assess progressive neural networks (PNNs) as an alternative to fine-tuning. The evaluation is based on common NLP tasks such as sequence labeling and text classification. By gauging PNNs across a range of architectures, datasets, and tasks, we observe improvements over the baselines throughout all experiments.

Impact of Politically Biased Data on Hate Speech Classification
Maximilian Wich | Jan Bauer | Georg Groh
Proceedings of the Fourth Workshop on Online Abuse and Harms

One challenge that social media platforms are facing nowadays is hate speech. Hence, automatic hate speech detection has been increasingly researched in recent years - in particular with the rise of deep learning. A problem of these models is their vulnerability to undesirable bias in training data. We investigate the impact of political bias on hate speech classification by constructing three politically-biased data sets (left-wing, right-wing, politically neutral) and compare the performance of classifiers trained on them. We show that (1) political bias negatively impairs the performance of hate speech classifiers and (2) an explainable machine learning model can help to visualize such bias within the training data. The results show that political bias in training data has an impact on hate speech classification and can become a serious issue.

Evaluation Metrics for Headline Generation Using Deep Pre-Trained Embeddings
Abdul Moeed | Yang An | Gerhard Hagerer | Georg Groh
Proceedings of the Twelfth Language Resources and Evaluation Conference

With the explosive growth in textual data, it is becoming increasingly important to summarize text automatically. Recently, generative language models have shown promise in abstractive text summarization tasks. Since these models rephrase text and thus use similar but different words as found in the summarized text, existing metrics such as ROUGE that use n-gram overlap may not be optimal. Therefore we evaluate two embedding-based evaluation metrics that are applicable to abstractive summarization: Fr ́echet embedding distance, which has been introduced recently, and angular embedding similarity, which is our proposed metric. To demonstrate the utility of both metrics, we analyze the headline generation capacity of two state-of-the-art language models: GPT-2 and ULMFiT. In particular, our proposed metric shows close relation with human judgments in our experiments and has overall better correlations with them. To provide reproducibility, the source code plus human assessments of our experiments is available on GitHub.

Investigating Annotator Bias with a Graph-Based Approach
Maximilian Wich | Hala Al Kuwatly | Georg Groh
Proceedings of the Fourth Workshop on Online Abuse and Harms

A challenge that many online platforms face is hate speech or any other form of online abuse. To cope with this, hate speech detection systems are developed based on machine learning to reduce manual work for monitoring these platforms. Unfortunately, machine learning is vulnerable to unintended bias in training data, which could have severe consequences, such as a decrease in classification performance or unfair behavior (e.g., discriminating minorities). In the scope of this study, we want to investigate annotator bias — a form of bias that annotators cause due to different knowledge in regards to the task and their subjective perception. Our goal is to identify annotation bias based on similarities in the annotation behavior from annotators. To do so, we build a graph based on the annotations from the different annotators, apply a community detection algorithm to group the annotators, and train for each group classifiers whose performances we compare. By doing so, we are able to identify annotator bias within a data set. The proposed method and collected insights can contribute to developing fairer and more reliable hate speech classification models.

Identifying and Measuring Annotator Bias Based on Annotators’ Demographic Characteristics
Hala Al Kuwatly | Maximilian Wich | Georg Groh
Proceedings of the Fourth Workshop on Online Abuse and Harms

Machine learning is recently used to detect hate speech and other forms of abusive language in online platforms. However, a notable weakness of machine learning models is their vulnerability to bias, which can impair their performance and fairness. One type is annotator bias caused by the subjective perception of the annotators. In this work, we investigate annotator bias using classification models trained on data from demographically distinct annotator groups. To do so, we sample balanced subsets of data that are labeled by demographically distinct annotators. We then train classifiers on these subsets, analyze their performances on similarly grouped test sets, and compare them statistically. Our findings show that the proposed approach successfully identifies bias and that demographic features, such as first language, age, and education, correlate with significant performance differences.

2014

Estimating Grammar Correctness for a Priori Estimation of Machine Translation Post-Editing Effort
Nicholas H. Kirk | Guchun Zhang | Georg Groh
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

Co-authors

Carolin M. Schuster 4

Hannah Danner 3

Simon Malberg 3

Hala Al Kuwatly 2

Davide Cadamuro 2

Lukas Ellinger 2

Valeriia Khylenko 2

Christian Widmer 2

Mohamed Hesham Ibrahim Abdalla 1

Shreyash Agarwal 1

Nikolay Babakov 1

Rahul Babu Shrestha 1

Alexander H. Berger 1

Jeremias Bohn 1

Cristian-George Craciun 1

Anastasiya Damaratskaya 1

Defne Demirtürk 1

Maria Luisa Ripoll Dominguez 1

Tohid Ebrahim Ajdari 1

Florian Eichin 1

Janek Falkenstein 1

Joyce Ann Clarize Galang 1

Daniel Gallagher 1

Shashwat Ghatiwala 1

Ekaterina Gikalo 1

Kirill Gringauz 1

Sarthak Gupta 1

Katharina Harmann 1

Michael A. Hedderich 1

Niklas Herbster 1

Bartłomiej Jezierski 1

Nataliia Kees 1

Rasul Khanbayov 1

Martin Kirchhoff 1

Nicholas H. Kirk 1

Maximilian Kummeth 1

Marc Alexander Kühn 1

Chaeeun Joy Lee 1

Tobias Lindenbauer 1

Diego Miguel Lozano 1

Oliver Mitevski 1

Frederic Mrozinski 1

Margherita Musumeci 1

Maximilian Müller 1

Eslam Nasrallah 1

Andreas Nawroth 1

Bernhard Pflugfelder 1

Thanh Mai Pham 1

Roman Poletukhin 1

Fabio Raffagnato 1

Javier Rando Ramírez 1

Andre Retterath 1

Maria-Alexandra Roman 1

Archishman Roy 1

Svenja Räther 1

Adithi Satish 1

Arthur Schmalz 1

Thorben Schomacker 1

Daniel Schroter 1

Carolin Schuster 1

Hinrich Schütze 1

Shaghayeghkolli Shaghayeghkolli 1

Regina Stodden 1

Ferenc Szigeti 1

Stella Tragianni 1

Marina Tropmann-Frick 1

Thomas Wimmer 1

Venues