Georg Groh


2024

Simpler Becomes Harder: Do LLMs Exhibit a Coherent Behavior on Simplified Corpora?
Miriam Anschütz | Edoardo Mosca | Georg Groh
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

Text simplification seeks to improve readability while retaining the original content and meaning. Our study investigates whether pre-trained classifiers also maintain such coherence by comparing their predictions on both original and simplified inputs. We conduct experiments using 11 pre-trained models, including BERT and OpenAI’s GPT 3.5, across six datasets spanning three languages. Additionally, we conduct a detailed analysis of the correlation between prediction change rates and simplification types/strengths. Our findings reveal alarming inconsistencies across all languages and models. If not promptly addressed, simplified inputs can be easily exploited to craft zero-iteration model-agnostic adversarial attacks with success rates of up to 50%.
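The abstract above refers to prediction change rates between original and simplified inputs. As a minimal sketch, assuming the rate is simply the fraction of text pairs whose predicted label differs (the function name and the toy labels are hypothetical and not taken from the paper), this could be computed as:

```python
from typing import Sequence


def prediction_change_rate(
    original_preds: Sequence[str], simplified_preds: Sequence[str]
) -> float:
    """Fraction of samples whose label differs between original and simplified input."""
    if len(original_preds) != len(simplified_preds):
        raise ValueError("both prediction lists must have the same length")
    if not original_preds:
        raise ValueError("at least one prediction pair is required")
    changed = sum(o != s for o, s in zip(original_preds, simplified_preds))
    return changed / len(original_preds)


# Hypothetical classifier outputs for four original/simplified text pairs.
original = ["pos", "neg", "pos", "neu"]
simplified = ["pos", "pos", "pos", "neg"]
rate = prediction_change_rate(original, simplified)
print(rate)  # 0.5
```

A rate of 0 would mean the classifier behaves coherently on every pair; higher values indicate more label flips caused by simplification alone.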

From Language to Pixels: Task Recognition and Task Learning in LLMs
Janek Falkenstein | Carolin M. Schuster | Alexander H. Berger | Georg Groh
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP

LLMs can perform unseen tasks by learning from a few in-context examples. How in-context learning works is still uncertain. We investigate the mechanisms of in-context learning on a challenging non-language task. The task requires the LLM to generate pixel matrices representing images of basic shapes. We introduce a framework to analyze if this task is solved by recognizing similar formats from the training data (task recognition) or by understanding the instructions and learning the skill de novo during inference (task learning). Our experiments demonstrate that LLMs generate meaningful pixel matrices with task recognition and fail to learn such tasks when encountering unfamiliar formats. Our findings offer insights into LLMs’ learning mechanisms and their generalization ability to guide future research on their seemingly human-like behavior.

Overview of the GermEval 2024 Shared Task on Statement Segmentation in German Easy Language (StaGE)
Thorben Schomacker | Miriam Anschütz | Regina Stodden | Georg Groh | Marina Tropmann-Frick
Proceedings of GermEval 2024 Shared Task on Statement Segmentation in German Easy Language (StaGE)

Crafting Tomorrow’s Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian
Cem Üyük | Danica Rovó | Shaghayeghkolli Shaghayeghkolli | Rabia Varol | Georg Groh | Daryna Dementieva
Proceedings of the Third Workshop on NLP for Positive Impact

In an era dominated by information overload and its facilitation with Large Language Models (LLMs), the prevalence of misinformation poses a significant threat to public discourse and societal well-being. A critical concern at present involves the identification of machine-generated news. In this work, we take a significant step by introducing a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian. The dataset incorporates outputs from multiple multilingual generators (in both zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4. Next, we experiment with a variety of classifiers, ranging from those based on linguistic features to advanced Transformer-based models and LLM prompting. We present the detection results, aiming to delve into the interpretability and robustness of machine-generated text detectors across all target languages.

Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication
Miriam Anschütz | Tringa Sylaj | Georg Groh
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

Explanatory images play a pivotal role in accessible and easy-to-read (E2R) texts. However, the images available in online databases are not tailored toward the respective texts, and the creation of customized images is expensive. In this large-scale study, we investigated whether text-to-image generation models can close this gap by providing customizable images quickly and easily. We benchmarked seven image generation models, four open-source and three closed-source, and provide an extensive evaluation of the resulting images. In addition, we performed a user study with people from the E2R target group to examine whether the images met their requirements. We find that some of the models show remarkable performance, but none of the models are ready to be used at a larger scale without human supervision. Our research is an important step toward facilitating the creation of accessible information for E2R creators and tailoring accessible images to the target group’s needs.

Toxicity Classification in Ukrainian
Daryna Dementieva | Valeriia Khylenko | Nikolay Babakov | Georg Groh
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Toxicity detection remains a relevant task, especially in the context of safe and fair LM development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, there has been no toxicity classification corpus in Ukrainian. In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i) translating from an English corpus, (ii) filtering toxic samples using keywords, and (iii) annotating with crowdsourcing. We compare LLM prompting and other cross-lingual transfer approaches, with and without fine-tuning, offering insights into the most robust and efficient baselines.

2023

Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training
Miriam Anschütz | Joshua Oehms | Thomas Wimmer | Bartłomiej Jezierski | Georg Groh
Findings of the Association for Computational Linguistics: ACL 2023

Automatic text simplification systems help to reduce textual information barriers on the internet. However, for languages other than English, only little parallel data exists to train these systems. We propose a two-step approach to overcome this data scarcity issue. First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German. Then, we used these models as decoders in a sequence-to-sequence simplification task. We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts. Moreover, with the style-specific pre-training, we reduced the number of trainable parameters in text simplification models. Hence, less parallel data is sufficient for training. Our results indicate that pre-training on unaligned data can reduce the required parallel data while improving the performance on downstream tasks.

IFAN: An Explainability-Focused Interaction Framework for Humans and NLP Models
Edoardo Mosca | Daryna Dementieva | Tohid Ebrahim Ajdari | Maximilian Kummeth | Kirill Gringauz | Yutong Zhou | Georg Groh
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

This is not correct! Negation-aware Evaluation of Language Generation Systems
Miriam Anschütz | Diego Miguel Lozano | Georg Groh
Proceedings of the 16th International Natural Language Generation Conference

Large language models underestimate the impact of negations on how much they change the meaning of a sentence. Therefore, learned evaluation metrics based on these models are insensitive to negations. In this paper, we propose NegBLEURT, a negation-aware version of the BLEURT evaluation metric. For that, we designed a rule-based sentence negation tool and used it to create the CANNOT negation evaluation dataset. Based on this dataset, we fine-tuned a sentence transformer and an evaluation metric to improve their negation sensitivity. Evaluating these models on existing benchmarks shows that our fine-tuned models outperform existing metrics on the negated sentences by far while preserving their base models’ performances on other perturbations.

Data-Augmented Task-Oriented Dialogue Response Generation with Domain Adaptation
Yan Pan | Davide Cadamuro | Georg Groh
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Adam-Smith at SemEval-2023 Task 4: Discovering Human Values in Arguments with Ensembles of Transformer-based Models
Daniel Schroter | Daryna Dementieva | Georg Groh
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper presents the best-performing approach, alias “Adam Smith”, for the SemEval-2023 Task 4: “Identification of Human Values behind Arguments”. The goal of the task was to create systems that automatically identify the values within textual arguments. We train transformer-based models until they reach their loss minimum or F1-score maximum. Ensembling the models by selecting one global decision threshold that maximizes the F1-score leads to the best-performing system in the competition. Ensembling based on stacking with logistic regressions shows the best performance on an additional dataset provided to evaluate the robustness (“Nahj al-Balagha”). Apart from outlining the submitted system, we demonstrate that the use of the large ensemble model is not necessary and that the system size can be significantly reduced.
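The threshold-based ensembling mentioned above can be illustrated with a small sketch. It assumes that the ensemble averages per-label probabilities across models and that one shared threshold is chosen by sweeping a candidate list and scoring micro-averaged F1 on held-out data. The averaging step, the micro averaging, and all function names and numbers are assumptions for illustration; they are not confirmed details of the submitted system.

```python
from typing import List, Sequence, Tuple


def average_probabilities(
    model_probs: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average per-label probabilities over models (models x samples x labels)."""
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_labels = len(model_probs[0][0])
    return [
        [sum(m[i][j] for m in model_probs) / n_models for j in range(n_labels)]
        for i in range(n_samples)
    ]


def micro_f1(
    y_true: Sequence[Sequence[int]],
    probs: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Micro-averaged F1 over all (sample, label) decisions at one threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, probs):
        for t, p in zip(true_row, prob_row):
            pred = p >= threshold
            if pred and t:
                tp += 1
            elif pred:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_global_threshold(
    y_true: Sequence[Sequence[int]],
    probs: Sequence[Sequence[float]],
    candidates: Sequence[float],
) -> Tuple[float, float]:
    """Return the (threshold, F1) pair with the highest F1 among the candidates."""
    return max(
        ((t, micro_f1(y_true, probs, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical probabilities from two models for two samples and two labels.
model_a = [[0.9, 0.2], [0.4, 0.7]]
model_b = [[0.7, 0.4], [0.2, 0.5]]
y_true = [[1, 0], [0, 1]]

ensemble = average_probabilities([model_a, model_b])
threshold, f1 = best_global_threshold(y_true, ensemble, [0.25, 0.5, 0.75])
print(threshold, f1)  # 0.5 1.0
```

Using a single threshold for all labels keeps the number of tuned parameters at one, which is the property the abstract highlights; per-label thresholds would be a different design.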

AdamR at SemEval-2023 Task 10: Solving the Class Imbalance Problem in Sexism Detection with Ensemble Learning
Adam Rydelek | Daryna Dementieva | Georg Groh
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The Explainable Detection of Online Sexism task presents the problem of explainable sexism detection through fine-grained categorisation of sexist cases with three subtasks. Our team experimented with different ways to combat class imbalance throughout the tasks using data augmentation and loss alteration techniques. We tackled the challenge by utilising ensembles of Transformer models trained on different datasets, which are tested to find the balance between performance and interpretability. This solution ranked us in the top 40% of teams for each of the tracks.

Distinguishing Fact from Fiction: A Benchmark Dataset for Identifying Machine-Generated Scientific Papers in the LLM Era.
Edoardo Mosca | Mohamed Hesham Ibrahim Abdalla | Paolo Basso | Margherita Musumeci | Georg Groh
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

As generative NLP can now produce content nearly indistinguishable from human writing, it becomes difficult to identify genuine research contributions in academic writing and scientific publications. Moreover, information in NLP-generated text can potentially be factually wrong or even entirely fabricated. This study introduces a novel benchmark dataset, containing human-written and machine-generated scientific papers from SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica. After describing the generation and extraction pipelines, we also experiment with four distinct classifiers as a baseline for detecting the authorship of scientific text. A strong focus is put on generalization capabilities and explainability to highlight the strengths and weaknesses of detectors. We believe our work serves as an important step towards creating more robust methods for distinguishing between human-written and machine-generated scientific papers, ultimately ensuring the integrity of scientific literature.

2022

“That Is a Suspicious Reaction!”: Interpreting Logits Variation to Detect NLP Adversarial Attacks
Edoardo Mosca | Shreyash Agarwal | Javier Rando Ramírez | Georg Groh
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.

SHAP-Based Explanation Methods: A Review for NLP Interpretability
Edoardo Mosca | Ferenc Szigeti | Stella Tragianni | Daniel Gallagher | Georg Groh
Proceedings of the 29th International Conference on Computational Linguistics

Model explanations are crucial for the transparent, safe, and trustworthy deployment of machine learning models. The SHapley Additive exPlanations (SHAP) framework is considered by many to be a gold standard for local explanations thanks to its solid theoretical background and general applicability. In the years following its publication, several variants appeared in the literature—presenting adaptations in the core assumptions and target applications. In this work, we review all relevant SHAP-based interpretability approaches available to date and provide instructive examples as well as recommendations regarding their applicability to NLP use cases.

Long Input Dialogue Summarization with Sketch Supervision for Summarization of Primetime Television Transcripts
Nataliia Kees | Thien Nguyen | Tobias Eder | Georg Groh
Proceedings of The Workshop on Automatic Summarization for Creative Writing

This paper presents our entry to the CreativeSumm 2022 shared task, specifically tackling the problem of prime-time television screenplay summarization based on the SummScreen Forever Dreaming dataset. Our approach utilizes extended Longformers combined with sketch supervision including categories specifically for scene descriptions. Our system was able to produce the shortest summaries out of all submissions. While some problems with factual consistency still remain, the system scored highest among competitors in the ROUGE and BERTScore evaluation categories.

TUM Social Computing at GermEval 2022: Towards the Significance of Text Statistics and Neural Embeddings in Text Complexity Prediction
Miriam Anschütz | Georg Groh
Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text

In this paper, we describe our submission to the GermEval 2022 Shared Task on Text Complexity Assessment of German Text. It addresses the problem of predicting the complexity of German sentences on a continuous scale. While many related works still rely on handcrafted statistical features, neural networks have emerged as state-of-the-art in other natural language processing tasks. Therefore, we investigate how both can complement each other and which features are most relevant for text complexity prediction in German. We propose a fine-tuned German DistilBERT model enriched with statistical text features that achieved fourth place in the shared task with an RMSE of 0.481 on the competition’s test data.

GrammarSHAP: An Efficient Model-Agnostic and Structure-Aware NLP Explainer
Edoardo Mosca | Defne Demirtürk | Luca Mülln | Fabio Raffagnato | Georg Groh
Proceedings of the First Workshop on Learning with Natural Language Supervision

Interpreting NLP models is fundamental for their development as it can shed light on hidden properties and unexpected behaviors. However, while transformer architectures exploit contextual information to enhance their predictive capabilities, most of the available methods to explain such predictions only provide importance scores at the word level. This work addresses the lack of feature attribution approaches that also take into account the sentence structure. We extend the SHAP framework by proposing GrammarSHAP—a model-agnostic explainer leveraging the sentence’s constituency parsing to generate hierarchical importance scores.

Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations
Lukas Huber | Marc Alexander Kühn | Edoardo Mosca | Georg Groh
Proceedings of the 7th Workshop on Representation Learning for NLP

State-of-the-art machine learning models are prone to adversarial attacks: maliciously crafted inputs that fool the model into making a wrong prediction, often with high confidence. While defense strategies have been extensively explored in the computer vision domain, research in natural language processing still lacks techniques to make models resilient to adversarial text inputs. We adapt a technique from computer vision to detect word-level attacks targeting text classifiers. This method relies on training an adversarial detector leveraging Shapley additive explanations and outperforms the current state-of-the-art on two benchmarks. Furthermore, we prove the detector requires only a small number of training samples and, in some cases, generalizes to different datasets without needing to be retrained.

User Satisfaction Modeling with Domain Adaptation in Task-oriented Dialogue Systems
Yan Pan | Mingyang Ma | Bernhard Pflugfelder | Georg Groh
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

User Satisfaction Estimation (USE) is crucial in helping measure the quality of a task-oriented dialogue system. However, the complex nature of implicit responses poses challenges in detecting user satisfaction, and most datasets are limited in size or not available to the public due to user privacy policies. Unlike task-oriented dialogue, large-scale annotated chitchat with emotion labels is publicly available. Therefore, we present a novel user satisfaction model with domain adaptation (USMDA) to utilize this chitchat. We adopt a dialogue Transformer encoder to capture contextual features from the dialogue. We also reduce domain discrepancy to learn dialogue-related invariant features. Moreover, USMDA jointly learns satisfaction signals in the chitchat context with user satisfaction estimation, and user actions in task-oriented dialogue with dialogue action recognition. Experimental results on two benchmarks show that our proposed framework for the USE task outperforms existing unsupervised domain adaptation methods. To the best of our knowledge, this is the first work to study user satisfaction estimation with unsupervised domain adaptation from chitchat to task-oriented dialogue.

Explaining Neural NLP Models for the Joint Analysis of Open-and-Closed-Ended Survey Answers
Edoardo Mosca | Katharina Harmann | Tobias Eder | Georg Groh
Proceedings of the 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP 2022)

Large-scale surveys are a widely used instrument to collect data from a target audience. Beyond the single individual, an appropriate analysis of the answers can reveal trends and patterns and thus generate new insights and knowledge for researchers. Current analysis practices employ shallow machine learning methods or rely on (biased) human judgment. This work investigates the usage of state-of-the-art NLP models such as BERT to automatically extract information from both open- and closed-ended questions. We also leverage explainability methods at different levels of granularity to further derive knowledge from the analysis model. Experiments on EMS—a survey-based study researching influencing factors affecting a student’s career goals—show that the proposed approach can identify such factors both at the input- and higher concept-level.

2021

End-to-End Annotator Bias Approximation on Crowdsourced Single-Label Sentiment Analysis
Gerhard Hagerer | David Szabo | Andreas Koch | Maria Luisa Ripoll Dominguez | Christian Widmer | Maximilian Wich | Hannah Danner | Georg Groh
Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021)

German Abusive Language Dataset with Focus on COVID-19
Maximilian Wich | Svenja Räther | Georg Groh
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

SocialVisTUM: An Interactive Visualization Toolkit for Correlated Neural Topic Models on Social Media Opinion Mining
Gerhard Hagerer | Martin Kirchhoff | Hannah Danner | Robert Pesch | Mainak Ghosh | Archishman Roy | Jiaxi Zhao | Georg Groh
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Recent research in opinion mining proposed word embedding-based topic modeling methods that provide superior coherence compared to traditional topic modeling. In this paper, we demonstrate how these methods can be used to display correlated topic models on social media texts using SocialVisTUM, our proposed interactive visualization toolkit. It displays a graph with topics as nodes and their correlations as edges. Further details are displayed interactively to support the exploration of large text collections, e.g., representative words and sentences of topics, topic and sentiment distributions, hierarchical topic clustering, and customizable, predefined topic labels. The toolkit optimizes automatically on custom data for optimal coherence. We show a working instance of the toolkit on data crawled from English social media discussions about organic food consumption. The visualization confirms findings of a qualitative consumer research study. SocialVisTUM and its training procedures are accessible online.

Investigating Annotator Bias in Abusive Language Datasets
Maximilian Wich | Christian Widmer | Gerhard Hagerer | Georg Groh
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Nowadays, social media platforms use classification models to cope with hate speech and abusive language. A problem with these models is their vulnerability to bias. A prevalent form of bias in hate speech and abusive language datasets is annotator bias caused by the annotator’s subjective perception and the complexity of the annotation task. In our paper, we develop a set of methods to measure annotator bias in abusive language datasets and to identify different perspectives on abusive language. We apply these methods to four different abusive language datasets. Our proposed approach supports annotation processes of such datasets and future research addressing different perspectives on the perception of abusive language.

Understanding and Interpreting the Impact of User Context in Hate Speech Detection
Edoardo Mosca | Maximilian Wich | Georg Groh
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

As hate speech spreads on social media and online communities, research continues to work on its automatic detection. Recently, recognition performance has been increasing thanks to advances in deep learning and the integration of user features. This work investigates the effects that such features can have on a detection model. Unlike previous research, we show that simple performance comparison does not expose the full impact of including contextual and user information. By leveraging explainability techniques, we show (1) that user features play a role in the model’s decision and (2) how they affect the feature space learned by the model. Besides revealing that user features are the reason for performance gains, and illustrating why, we show how such techniques can be combined to better understand the model and to detect unintended bias.

2020

Impact of Politically Biased Data on Hate Speech Classification
Maximilian Wich | Jan Bauer | Georg Groh
Proceedings of the Fourth Workshop on Online Abuse and Harms

One challenge that social media platforms are facing nowadays is hate speech. Hence, automatic hate speech detection has been increasingly researched in recent years, in particular with the rise of deep learning. A problem of these models is their vulnerability to undesirable bias in training data. We investigate the impact of political bias on hate speech classification by constructing three politically biased data sets (left-wing, right-wing, politically neutral) and comparing the performance of classifiers trained on them. We show that (1) political bias impairs the performance of hate speech classifiers and (2) an explainable machine learning model can help to visualize such bias within the training data. The results show that political bias in training data has an impact on hate speech classification and can become a serious issue.

Identifying and Measuring Annotator Bias Based on Annotators’ Demographic Characteristics
Hala Al Kuwatly | Maximilian Wich | Georg Groh
Proceedings of the Fourth Workshop on Online Abuse and Harms

Machine learning has recently been used to detect hate speech and other forms of abusive language on online platforms. However, a notable weakness of machine learning models is their vulnerability to bias, which can impair their performance and fairness. One type is annotator bias caused by the subjective perception of the annotators. In this work, we investigate annotator bias using classification models trained on data from demographically distinct annotator groups. To do so, we sample balanced subsets of data that are labeled by demographically distinct annotators. We then train classifiers on these subsets, analyze their performances on similarly grouped test sets, and compare them statistically. Our findings show that the proposed approach successfully identifies bias and that demographic features, such as first language, age, and education, correlate with significant performance differences.

Investigating Annotator Bias with a Graph-Based Approach
Maximilian Wich | Hala Al Kuwatly | Georg Groh
Proceedings of the Fourth Workshop on Online Abuse and Harms

A challenge that many online platforms face is hate speech or any other form of online abuse. To cope with this, hate speech detection systems are developed based on machine learning to reduce manual work for monitoring these platforms. Unfortunately, machine learning is vulnerable to unintended bias in training data, which could have severe consequences, such as a decrease in classification performance or unfair behavior (e.g., discriminating against minorities). In the scope of this study, we investigate annotator bias, a form of bias that annotators cause due to differing knowledge of the task and their subjective perception. Our goal is to identify annotation bias based on similarities in the annotators’ annotation behavior. To do so, we build a graph based on the annotations from the different annotators, apply a community detection algorithm to group the annotators, and train classifiers for each group, whose performances we compare. By doing so, we are able to identify annotator bias within a data set. The proposed method and collected insights can contribute to developing fairer and more reliable hate speech classification models.

An Evaluation of Progressive Neural Networks for Transfer Learning in Natural Language Processing
Abdul Moeed | Gerhard Hagerer | Sumit Dugar | Sarthak Gupta | Mainak Ghosh | Hannah Danner | Oliver Mitevski | Andreas Nawroth | Georg Groh
Proceedings of the Twelfth Language Resources and Evaluation Conference

A major challenge in modern neural networks is the utilization of previous knowledge for new tasks in an effective manner, otherwise known as transfer learning. Fine-tuning, the most widely used method for achieving this, suffers from catastrophic forgetting. The problem is often exacerbated in natural language processing (NLP). In this work, we assess progressive neural networks (PNNs) as an alternative to fine-tuning. The evaluation is based on common NLP tasks such as sequence labeling and text classification. By gauging PNNs across a range of architectures, datasets, and tasks, we observe improvements over the baselines throughout all experiments.

Evaluation Metrics for Headline Generation Using Deep Pre-Trained Embeddings
Abdul Moeed | Yang An | Gerhard Hagerer | Georg Groh
Proceedings of the Twelfth Language Resources and Evaluation Conference

With the explosive growth in textual data, it is becoming increasingly important to summarize text automatically. Recently, generative language models have shown promise in abstractive text summarization tasks. Since these models rephrase text and thus use words that are similar to but different from those in the summarized text, existing metrics such as ROUGE that use n-gram overlap may not be optimal. Therefore, we evaluate two embedding-based evaluation metrics that are applicable to abstractive summarization: Fréchet embedding distance, which has been introduced recently, and angular embedding similarity, which is our proposed metric. To demonstrate the utility of both metrics, we analyze the headline generation capacity of two state-of-the-art language models: GPT-2 and ULMFiT. In particular, our proposed metric shows close relation with human judgments in our experiments and has overall better correlations with them. To provide reproducibility, the source code and the human assessments of our experiments are available on GitHub.
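For readers unfamiliar with angular similarity between embeddings, the sketch below uses one common definition: one minus the angle between two vectors divided by pi. This is an assumption for illustration and may differ from the exact formulation in the paper; the function name and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def angular_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - angle(u, v) / pi, which lies in [0, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angular similarity is undefined for zero vectors")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return 1.0 - math.acos(cosine) / math.pi


# Toy 2-d "embeddings"; real sentence embeddings would have many dimensions.
same = angular_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = angular_similarity([1.0, 0.0], [0.0, 1.0])
opposite = angular_similarity([1.0, 0.0], [-1.0, 0.0])
```

Under this definition, identical directions score 1, orthogonal vectors 0.5, and opposite directions 0, so the score changes linearly with the angle rather than with its cosine.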

2014

Estimating Grammar Correctness for a Priori Estimation of Machine Translation Post-Editing Effort
Nicholas H. Kirk | Guchun Zhang | Georg Groh
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation