Anne Lauscher


2022

pdf bib
Fair and Argumentative Language Modeling for Computational Argumentation
Carolin Holtermann | Anne Lauscher | Simone Ponzetto
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although much work in NLP has focused on measuring and mitigating stereotypical bias in semantic spaces, research addressing bias in computational argumentation is still in its infancy. In this paper, we address this research gap and conduct a thorough investigation of bias in argumentative language models. To this end, we introduce ABBA, a novel resource for bias measurement specifically tailored to argumentation. We employ our resource to assess the effect of argumentative fine-tuning and debiasing on the intrinsic bias found in transformer-based language models using a lightweight adapter-based approach that is more sustainable and parameter-efficient than full fine-tuning. Finally, we analyze the potential impact of language model debiasing on the performance in argument quality prediction, a downstream task of computational argumentation. Our results show that we are able to successfully and sustainably remove bias in general and argumentative language models while preserving (and sometimes improving) model performance in downstream tasks. We make all experimental code and data available at https://github.com/umanlp/FairArgumentativeLM.

pdf bib
DS-TOD: Efficient Domain Specialization for Task-Oriented Dialog
Chia-Chien Hung | Anne Lauscher | Simone Ponzetto | Goran Glavaš
Findings of the Association for Computational Linguistics: ACL 2022

Recent work has shown that self-supervised dialog-specific pretraining on large conversational datasets yields substantial gains over traditional language modeling (LM) pretraining in downstream task-oriented dialog (TOD). These approaches, however, exploit general dialogic corpora (e.g., Reddit) and thus presumably fail to reliably embed domain-specific knowledge useful for concrete downstream TOD domains. In this work, we investigate the effects of domain specialization of pretrained language models (PLMs) for TOD. Within our DS-TOD framework, we first automatically extract salient domain-specific terms, and then use them to construct DomainCC and DomainReddit – resources that we leverage for domain-specific pretraining, based on (i) masked language modeling (MLM) and (ii) response selection (RS) objectives, respectively. We further propose a resource-efficient and modular domain specialization by means of domain adapters – additional parameter-light layers in which we encode the domain knowledge. Our experiments with prominent TOD tasks – dialog state tracking (DST) and response retrieval (RR) – encompassing five domains from the MultiWOZ benchmark demonstrate the effectiveness of DS-TOD. Moreover, we show that the light-weight adapter-based specialization (1) performs comparably to full fine-tuning in single domain setups and (2) is particularly suitable for multi-domain specialization, where besides advantageous computational footprint, it can offer better TOD performance.

pdf bib
Measuring Harmful Sentence Completion in Language Models for LGBTQIA+ Individuals
Debora Nozza | Federico Bianchi | Anne Lauscher | Dirk Hovy
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Current language technology is ubiquitous and directly influences individuals’ lives worldwide. Given the recent trend in AI on training and constantly releasing new and powerful large language models (LLMs), there is a need to assess their biases and potential concrete consequences. While some studies have highlighted the shortcomings of these models, there is only little on the negative impact of LLMs on LGBTQIA+ individuals. In this paper, we investigated a state-of-the-art template-based approach for measuring the harmfulness of English LLMs sentence completion when the subjects belong to the LGBTQIA+ community. Our findings show that, on average, the most likely LLM-generated completion is an identity attack 13% of the time. Our results raise serious concerns about the applicability of these models in production environments.

2021

pdf bib
Sustainable Modular Debiasing of Language Models
Anne Lauscher | Tobias Lueken | Goran Glavaš
Findings of the Association for Computational Linguistics: EMNLP 2021

Unfair stereotypical biases (e.g., gender, racial, or religious biases) encoded in modern pretrained language models (PLMs) have negative ethical implications for widespread adoption of state-of-the-art language technology. To remedy for this, a wide range of debiasing techniques have recently been introduced to remove such stereotypical biases from PLMs. Existing debiasing methods, however, directly modify all of the PLMs parameters, which – besides being computationally expensive – comes with the inherent risk of (catastrophic) forgetting of useful language knowledge acquired in pretraining. In this work, we propose a more sustainable modular debiasing approach based on dedicated debiasing adapters, dubbed ADELE. Concretely, we (1) inject adapter modules into the original PLM layers and (2) update only the adapters (i.e., we keep the original PLM parameters frozen) via language modeling training on a counterfactually augmented corpus. We showcase ADELE, in gender debiasing of BERT: our extensive evaluation, encompassing three intrinsic and two extrinsic bias measures, renders ADELE, very effective in bias mitigation. We further show that – due to its modular nature – ADELE, coupled with task adapters, retains fairness even after large-scale downstream training. Finally, by means of multilingual BERT, we successfully transfer ADELE, to six target languages.

pdf bib
DebIE: A Platform for Implicit and Explicit Debiasing of Word Embedding Spaces
Niklas Friedrich | Anne Lauscher | Simone Paolo Ponzetto | Goran Glavaš
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Recent research efforts in NLP have demonstrated that distributional word vector spaces often encode stereotypical human biases, such as racism and sexism. With word representations ubiquitously used in NLP models and pipelines, this raises ethical issues and jeopardizes the fairness of language technologies. While there exists a large body of work on bias measures and debiasing methods, to date, there is no platform that would unify these research efforts and make bias measuring and debiasing of representation spaces widely accessible. In this work, we present DebIE, the first integrated platform for (1) measuring and (2) mitigating bias in word embeddings. Given an (i) embedding space (users can choose between the predefined spaces or upload their own) and (ii) a bias specification (users can choose between existing bias specifications or create their own), DebIE can (1) compute several measures of implicit and explicit bias and modify the embedding space by executing two (mutually composable) debiasing models. DebIE’s functionality can be accessed through four different interfaces: (a) a web application, (b) a desktop application, (c) a REST-ful API, and (d) as a command-line application. DebIE is available at: debie.informatik.uni-mannheim.de.

pdf bib
RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models
Soumya Barikeri | Anne Lauscher | Ivan Vulić | Goran Glavaš
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Text representation models are prone to exhibit a range of societal biases, reflecting the non-controlled and biased nature of the underlying pretraining data, which consequently leads to severe ethical issues and even bias amplification. Recent work has predominantly focused on measuring and mitigating bias in pretrained language models. Surprisingly, the landscape of bias measurements and mitigation resources and methods for conversational language models is still very scarce: it is limited to only a few types of bias, artificially constructed resources, and completely ignores the impact that debiasing methods may have on the final perfor mance in dialog tasks, e.g., conversational response generation. In this work, we present REDDITBIAS, the first conversational data set grounded in the actual human conversations from Reddit, allowing for bias measurement and mitigation across four important bias dimensions: gender,race,religion, and queerness. Further, we develop an evaluation framework which simultaneously 1)measures bias on the developed REDDITBIAS resource, and 2)evaluates model capability in dialog tasks after model debiasing. We use the evaluation framework to benchmark the widely used conversational DialoGPT model along with the adaptations of four debiasing methods. Our results indicate that DialoGPT is biased with respect to religious groups and that some debiasing techniques can remove this bias while preserving downstream task performance.

2020

pdf bib
Creating a Domain-diverse Corpus for Theory-based Argument Quality Assessment
Lily Ng | Anne Lauscher | Joel Tetreault | Courtney Napoles
Proceedings of the 7th Workshop on Argument Mining

Computational models of argument quality (AQ) have focused primarily on assessing the overall quality or just one specific characteristic of an argument, such as its convincingness or its clarity. However, previous work has claimed that assessment based on theoretical dimensions of argumentation could benefit writers, but developing such models has been limited by the lack of annotated data. In this work, we describe GAQCorpus, the first large, domain-diverse annotated corpus of theory-based AQ. We discuss how we designed the annotation task to reliably collect a large number of judgments with crowdsourcing, formulating theory-based guidelines that helped make subjective judgments of AQ more objective. We demonstrate how to identify arguments and adapt the annotation task for three diverse domains. Our work will inform research on theory-based argumentation annotation and enable the creation of more diverse corpora to support computational AQ assessment.

pdf bib
Specializing Unsupervised Pretraining Models for Word-Level Semantic Similarity
Anne Lauscher | Ivan Vulić | Edoardo Maria Ponti | Anna Korhonen | Goran Glavaš
Proceedings of the 28th International Conference on Computational Linguistics

Unsupervised pretraining models have been shown to facilitate a wide range of downstream NLP applications. These models, however, retain some of the limitations of traditional static word embeddings. In particular, they encode only the distributional knowledge available in raw text corpora, incorporated through language modeling objectives. In this work, we complement such distributional knowledge with external lexical knowledge, that is, we integrate the discrete knowledge on word-level semantic similarity into pretraining. To this end, we generalize the standard BERT model to a multi-task learning setting where we couple BERT’s masked language modeling and next sentence prediction objectives with an auxiliary task of binary word relation classification. Our experiments suggest that our “Lexically Informed” BERT (LIBERT), specialized for the word-level semantic similarity, yields better performance than the lexically blind “vanilla” BERT on several language understanding tasks. Concretely, LIBERT outperforms BERT in 9 out of 10 tasks of the GLUE benchmark and is on a par with BERT in the remaining one. Moreover, we show consistent gains on 3 benchmarks for lexical simplification, a task where knowledge about word-level semantic similarity is paramount, as well as large gains on lexical reasoning probes.

pdf bib
Rhetoric, Logic, and Dialectic: Advancing Theory-based Argument Quality Assessment in Natural Language Processing
Anne Lauscher | Lily Ng | Courtney Napoles | Joel Tetreault
Proceedings of the 28th International Conference on Computational Linguistics

Though preceding work in computational argument quality (AQ) mostly focuses on assessing overall AQ, researchers agree that writers would benefit from feedback targeting individual dimensions of argumentation theory. However, a large-scale theory-based corpus and corresponding computational models are missing. We fill this gap by conducting an extensive analysis covering three diverse domains of online argumentative writing and presenting GAQCorpus: the first large-scale English multi-domain (community Q&A forums, debate forums, review forums) corpus annotated with theory-based AQ scores. We then propose the first computational approaches to theory-based assessment, which can serve as strong baselines for future work. We demonstrate the feasibility of large-scale AQ annotation, show that exploiting relations between dimensions yields performance improvements, and explore the synergies between theory-based prediction and practical AQ assessment.

pdf bib
From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers
Anne Lauscher | Vinit Ravishankar | Ivan Vulić | Goran Glavaš
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Massively multilingual transformers (MMTs) pretrained via language modeling (e.g., mBERT, XLM-R) have become a default paradigm for zero-shot language transfer in NLP, offering unmatched transfer performance. Current evaluations, however, verify their efficacy in transfers (a) to languages with sufficiently large pretraining corpora, and (b) between close languages. In this work, we analyze the limitations of downstream language transfer with MMTs, showing that, much like cross-lingual word embeddings, they are substantially less effective in resource-lean scenarios and for distant languages. Our experiments, encompassing three lower-level tasks (POS tagging, dependency parsing, NER) and two high-level tasks (NLI, QA), empirically correlate transfer performance with linguistic proximity between source and target languages, but also with the size of target language corpora used in MMT pretraining. Most importantly, we demonstrate that the inexpensive few-shot transfer (i.e., additional fine-tuning on a few target-language instances) is surprisingly effective across the board, warranting more research efforts reaching beyond the limiting zero-shot conditions.

pdf bib
AraWEAT: Multidimensional Analysis of Biases in Arabic Word Embeddings
Anne Lauscher | Rafik Takieddin | Simone Paolo Ponzetto | Goran Glavaš
Proceedings of the Fifth Arabic Natural Language Processing Workshop

Recent work has shown that distributional word vector spaces often encode human biases like sexism or racism. In this work, we conduct an extensive analysis of biases in Arabic word embeddings by applying a range of recently introduced bias tests on a variety of embedding spaces induced from corpora in Arabic. We measure the presence of biases across several dimensions, namely: embedding models (Skip-Gram, CBOW, and FastText) and vector sizes, types of text (encyclopedic text, and news vs. user-generated content), dialects (Egyptian Arabic vs. Modern Standard Arabic), and time (diachronic analyses over corpora from different time periods). Our analysis yields several interesting findings, e.g., that implicit gender bias in embeddings trained on Arabic news corpora steadily increases over time (between 2007 and 2017). We make the Arabic bias specifications (AraWEAT) publicly available.

pdf bib
Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers
Anne Lauscher | Olga Majewska | Leonardo F. R. Ribeiro | Iryna Gurevych | Nikolai Rozanov | Goran Glavaš
Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work focused on injecting (structured) knowledge from external resources into these models. While on the one hand, joint pre-training (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge, on the other hand, may lead to the catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS. We also open source all our experiments and relevant code under: https://github.com/wluper/retrograph.

2019

pdf bib
Are We Consistently Biased? Multidimensional Analysis of Biases in Distributional Word Vectors
Anne Lauscher | Goran Glavaš
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Word embeddings have recently been shown to reflect many of the pronounced societal biases (e.g., gender bias or racial bias). Existing studies are, however, limited in scope and do not investigate the consistency of biases across relevant dimensions like embedding models, types of texts, and different languages. In this work, we present a systematic study of biases encoded in distributional word vector spaces: we analyze how consistent the bias effects are across languages, corpora, and embedding models. Furthermore, we analyze the cross-lingual biases encoded in bilingual embedding spaces, indicative of the effects of bias transfer encompassed in cross-lingual transfer of NLP models. Our study yields some unexpected findings, e.g., that biases can be emphasized or downplayed by different embedding models or that user-generated content may be less biased than encyclopedic text. We hope our work catalyzes bias research in NLP and informs the development of bias reduction techniques.

2018

pdf bib
UniMa at SemEval-2018 Task 7: Semantic Relation Extraction and Classification from Scientific Publications
Thorsten Keiper | Zhonghao Lyu | Sara Pooladzadeh | Yuan Xu | Jingyi Zhang | Anne Lauscher | Simone Paolo Ponzetto
Proceedings of The 12th International Workshop on Semantic Evaluation

Large repositories of scientific literature call for the development of robust methods to extract information from scholarly papers. This problem is addressed by the SemEval 2018 Task 7 on extracting and classifying relations found within scientific publications. In this paper, we present a feature-based and a deep learning-based approach to the task and discuss the results of the system runs that we submitted for evaluation.

pdf bib
ArguminSci: A Tool for Analyzing Argumentation and Rhetorical Aspects in Scientific Writing
Anne Lauscher | Goran Glavaš | Kai Eckert
Proceedings of the 5th Workshop on Argument Mining

Argumentation is arguably one of the central features of scientific language. We present ArguminSci, an easy-to-use tool that analyzes argumentation and other rhetorical aspects of scientific writing, which we collectively dub scitorics. The main aspect we focus on is the fine-grained argumentative analysis of scientific text through identification of argument components. The functionality of ArguminSci is accessible via three interfaces: as a command line tool, via a RESTful application programming interface, and as a web application.

pdf bib
An Argument-Annotated Corpus of Scientific Publications
Anne Lauscher | Goran Glavaš | Simone Paolo Ponzetto
Proceedings of the 5th Workshop on Argument Mining

Argumentation is an essential feature of scientific language. We present an annotation study resulting in a corpus of scientific publications annotated with argumentative components and relations. The argumentative annotations have been added to the existing Dr. Inventor Corpus, already annotated for four other rhetorical aspects. We analyze the annotated argumentative structures and investigate the relations between argumentation and other rhetorical aspects of scientific writing, such as discourse roles and citation contexts.

pdf bib
Investigating the Role of Argumentation in the Rhetorical Analysis of Scientific Publications with Neural Multi-Task Learning Models
Anne Lauscher | Goran Glavaš | Simone Paolo Ponzetto | Kai Eckert
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Exponential growth in the number of scientific publications yields the need for effective automatic analysis of rhetorical aspects of scientific writing. Acknowledging the argumentative nature of scientific text, in this work we investigate the link between the argumentative structure of scientific publications and rhetorical aspects such as discourse categories or citation contexts. To this end, we (1) augment a corpus of scientific publications annotated with four layers of rhetoric annotations with argumentation annotations and (2) investigate neural multi-task learning architectures combining argument extraction with a set of rhetorical classification tasks. By coupling rhetorical classifiers with the extraction of argumentative components in a joint multi-task learning setting, we obtain significant performance gains for different rhetorical analysis tasks.