Gabriella Lapesa

2025

pdf bib abs
Mining the uncertainty patterns of humans and models in the annotation of moral foundations and human values
Neele Falk | Gabriella Lapesa
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The NLP community has converged on considering disagreement in annotation (or human label variation, HLV) as a constitutive feature of subjective tasks. This paper makes a further step by investigating the relationship between HLV and model uncertainty, and the impact of linguistic features of the items on both. We focus on the identification of moral foundations (e.g., care, fairness, loyalty) and human values (e.g., be polite, be honest) in text. We select three standard datasets and proceed into two steps. First, we focus on HLV and analyze the linguistic features (complexity, polarity, pragmatic phenomena, lexical choices) that correlate with HLV. Next, we proceed to uncertainty and its relationship to HLV. We experiment with RoBERTa and Flan-T5 in a number of training setups and evaluation metrics that test the calibration of uncertainty to HLV and its relationship to performance beyond majority vote; next, we analyze the impact of linguistic features on uncertainty. We find that RoBERTa with soft loss is better calibrated to HLV, and we find alignment between calibrated models and humans in the features (textual complexity and polarity) triggering variation.

pdf bib
Proceedings of the 12th Argument mining Workshop
Elena Chistova | Philipp Cimiano | Shohreh Haddadan | Gabriella Lapesa | Ramon Ruiz-Dolz
Proceedings of the 12th Argument mining Workshop

pdf bib abs
Investigating Subjective Factors of Argument Strength: Storytelling, Emotions, and Hedging
Carlotta Quensel | Neele Falk | Gabriella Lapesa
Proceedings of the 12th Argument mining Workshop

In assessing argument strength, the notions of what makes a good argument are manifold. With the broader trend towards treating subjectivity as an asset and not a problem in NLP, new dimensions of argument quality are studied. Although studies on individual subjective features like personal stories exist, there is a lack of large-scale analyses of the relation between these features and argument strength. To address this gap, we conduct regression analysis to quantify the impact of subjective factors – emotions, storytelling, and hedging - on two standard datasets annotated for objective argument quality and subjective persuasion. As such, our contribution is twofold: at the level of contributed resources, as there are no datasets annotated with all studied dimensions, this work compares and evaluates automated annotation methods for each subjective feature. At the level of novel insights, our regression analysis uncovers different patterns of impact of subjective features on the two facets of argument strength encoded in the datasets. Our results show that storytelling and hedging have contrasting effects on objective and subjective argument quality, while the influence of emotions depends on their rhetoric utilization rather than the domain.

pdf bib abs
PerspectiveMod: A Perspectivist Resource for Deliberative Moderation
Eva Maria Vecchi | Neele Falk | Carlotta Quensel | Iman Jundi | Gabriella Lapesa
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Human moderators in online discussions face a heterogeneous range of tasks, which go beyond content moderation, or policing. They also support and improve discussion quality, which is challenging to model (and evaluate) in NLP due to its inherent subjectivity and the scarcity of annotated resources. We address this gap by introducing PerspectiveMod, a dataset of online comments annotated for the question: *“Does this comment require moderation, and why?”* Annotations were collected from both expert moderators and trained non-experts. **PerspectiveMod** is unique in its intentional variation across (a) the level of moderation experience embedded in the source data (professional vs. non-professional moderation environments), (b) the annotator profiles (experts vs. trained crowdworkers), and (c) the richness of each moderation judgment, both in terms on fine-grained comment properties (drawn from argumentation and deliberative theory) and in the representation of the individuality of the annotator (socio-demographics and attitudes towards the task). We advance understanding of the task’s complexity by providing interpretation layers that account for its subjectivity. Our statistical analysis highlights the value of collecting annotator perspectives, including their experiences, attitudes, and views on AI, as a foundation for developing more context-aware and interpretively robust moderation tools.

pdf bib abs
AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts
Esra Dönmez | Maximilian Maurer | Gabriella Lapesa | Agnieszka Falenska
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Distinguishing LLM-generated text from human-written is a key challenge for safe and ethical NLP, particularly in high-stake settings such as persuasive online discourse. While recent work focuses on detection, real-world use cases also demand interpretable tools to help humans understand and distinguish LLM-generated texts. To this end, we present an analysis framework comparing human- and LLM-authored arguments using two easily-interpretable feature sets: general-purpose linguistic features (e.g., lexical richness, syntactic complexity) and domain-specific features related to argument quality (e.g., logical soundness, engagement strategies). Applied to */r/ChangeMyView* arguments by humans and three LLMs, our method reveals clear patterns: LLM-generated counter-arguments show lower type-token and lemma-token ratios but higher emotional intensity — particularly in anticipation and trust. They more closely resemble textbook-quality arguments — cogent, justified, explicitly respectful toward others, and positive in tone. Moreover, counter-arguments generated by LLMs converge more closely with the original post’s style and quality than those written by humans. Finally, we demonstrate that these differences enable a lightweight, interpretable, and highly effective classifier for detecting LLM-generated comments in CMV.

pdf bib abs
Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection
Myrthe Reuver | Indira Sen | Matteo Melis | Gabriella Lapesa
Findings of the Association for Computational Linguistics: NAACL 2025

This paper investigates hybrid intelligence and collaboration between researchers of sexism and Large Language Models (LLMs), with afour-component pipeline. First, nine sexism researchers answer questions about their knowledge of sexism and of LLMs. They then participate in two interactive experiments involving an LLM (GPT3.5). The first experiment has experts assessing the model’s knowledgeabout sexism and suitability for use in research. The second experiment tasks them with creating three different definitions of sexism: anexpert-written definition, an LLM-written one, and a co-created definition. Lastly, zero-shot classification experiments use the three definitions from each expert in a prompt template for sexism detection, evaluating GPT4o on 2.500 texts sampled from five sexism benchmarks. We then analyze the resulting 67.500 classification decisions. The LLM interactions lead to longer and more complex definitions of sexism. Expert-written definitions on average perform poorly compared to LLM-generated definitions. However, some experts do improve classification performance with their co-created definitions of sexism, also experts who are inexperienced in using LLMs.

pdf bib abs
Towards a Perspectivist Turn in Argument Quality Assessment
Julia Romberg | Maximilian Maurer | Henning Wachsmuth | Gabriella Lapesa
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The assessment of argument quality depends on well-established logical, rhetorical, and dialectical properties that are unavoidably subjective: multiple valid assessments may exist, there is no unequivocal ground truth. This aligns with recent paths in machine learning, which embrace the co-existence of different perspectives. However, this potential remains largely unexplored in NLP research on argument quality. One crucial reason seems to be the yet unexplored availability of suitable datasets. We fill this gap by conducting a systematic review of argument quality datasets. We assign them to a multi-layered categorization targeting two aspects: (a) What has been annotated: we collect the quality dimensions covered in datasets and consolidate them in an overarching taxonomy, increasing dataset comparability and interoperability. (b) Who annotated: we survey what information is given about annotators, enabling perspectivist research and grounding our recommendations for future actions. To this end, we discuss datasets suitable for developing perspectivist models (i.e., those containing individual, non-aggregated annotations), and we showcase the importance of a controlled selection of annotators in a pilot study.

pdf bib abs
It Is Not Only the Negative that Deserves Attention! Understanding, Generation & Evaluation of (Positive) Moderation
Iman Jundi | Eva Maria Vecchi | Carlotta Quensel | Neele Falk | Gabriella Lapesa
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Moderation is essential for maintaining and improving the quality of online discussions. This involves: (1) countering negativity, e.g. hate speech and toxicity, and (2) promoting positive discourse, e.g. broadening the discussion to involve other users and perspectives. While significant efforts have focused on addressing negativity, driven by an urgency to address such issues, this left moderation promoting positive discourse (henceforth PositiveModeration) under-studied. With the recent advancements in LLMs, Positive Moderation can potentially be scaled to vast conversations, fostering more thoughtful discussions and bridging the increasing divide in online interactions.We advance the understanding of Positive Moderation by annotating a dataset on 13 moderation properties, e.g. neutrality, clarity and curiosity. We extract instructions from professional moderation guidelines and use them to prompt LLaMA to generate such moderation. This is followed by extensive evaluation showing that (1) annotators rate generated higher than professional moderation, but still slightly prefer professional moderation in pairwise comparison, and (2) LLMs can be used to estimate human evaluation as an efficient alternative.

pdf bib abs
A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance
Matteo Melis | Gabriella Lapesa | Dennis Assenmacher
Proceedings of the The 9th Workshop on Online Abuse and Harms (WOAH)

Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it and how does prompting different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 conceptual elements—building blocks that capture different aspects of hate speech definitions, such as references to the target of hate. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs, on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world). We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.

2024

pdf bib abs
GESIS-DSM at PerpectiveArg2024: A Matter of Style? Socio-Cultural Differences in Argumentation
Maximilian Maurer | Julia Romberg | Myrthe Reuver | Negash Weldekiros | Gabriella Lapesa
Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)

This paper describes the contribution of team GESIS-DSM to the Perspective Argument Retrieval Task, a task on retrieving socio-culturally relevant and diverse arguments for different user queries. Our experiments and analyses aim to explore the nature of the socio-cultural specialization in argument retrieval: (how) do the arguments written by different socio-cultural groups differ? We investigate the impact of content and style for the task of identifying arguments relevant to a query and a certain demographic attribute. In its different configurations, our system employs sentence embedding representations, arguments generated with Large Language Model, as well as stylistic features. final method places third overall in the shared task, and, in comparison, does particularly well in the most difficult evaluation scenario, where the socio-cultural background of the argument author is implicit (i.e. has to be inferred from the text). This result indicates that socio-cultural differences in argument production may indeed be a matter of style.

pdf bib
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers
Christopher Klamm | Gabriella Lapesa | Simone Paolo Ponzetto | Ines Rehbein | Indira Sen
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers

pdf bib
Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024
Annette Hautli-Janisz | Gabriella Lapesa | Lucas Anastasiou | Valentin Gold | Anna De Liddo | Chris Reed
Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024

pdf bib abs
Moderation in the Wild: Investigating User-Driven Moderation in Online Discussions
Neele Falk | Eva Vecchi | Iman Jundi | Gabriella Lapesa
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Effective content moderation is imperative for fostering healthy and productive discussions in online domains. Despite the substantial efforts of moderators, the overwhelming nature of discussion flow can limit their effectiveness. However, it is not only trained moderators who intervene in online discussions to improve their quality. “Ordinary” users also act as moderators, actively intervening to correct information of other users’ posts, enhance arguments, and steer discussions back on course.This paper introduces the phenomenon of user moderation, documenting and releasing UMOD, the first dataset of comments in whichusers act as moderators. UMOD contains 1000 comment-reply pairs from the subreddit r/changemyview with crowdsourced annotations from a large annotator pool and with a fine-grained annotation schema targeting the functions of moderation, stylistic properties(aggressiveness, subjectivity, sentiment), constructiveness, as well as the individual perspectives of the annotators on the task. The releaseof UMOD is complemented by two analyses which focus on the constitutive features of constructiveness in user moderation and on thesources of annotator disagreements, given the high subjectivity of the task.

pdf bib abs
Toeing the Party Line: Election Manifestos as a Key to Understand Political Discourse on Twitter
Maximilian Maurer | Tanise Ceron | Sebastian Padó | Gabriella Lapesa
Findings of the Association for Computational Linguistics: EMNLP 2024

Political discourse on Twitter is a moving target: politicians continuously make statements about their positions. It is therefore crucial to track their discourse on social media to understand their ideological positions and goals. However, Twitter data is also challenging to work with since it is ambiguous and often dependent on social context, and consequently, recent work on political positioning has tended to focus strongly on manifestos (parties’ electoral programs) rather than social media.In this paper, we extend recently proposed methods to predict pairwise positional similarities between parties from the manifesto case to the Twitter case, using hashtags as a signal to fine-tune text representations, without the need for manual annotation. We verify the efficacy of fine-tuning and conduct a series of experiments that assess the robustness of our method for low-resource scenarios. We find that our method yields stable positionings reflective of manifesto positionings, both in scenarios with all tweets of candidates across years available and when only smaller subsets from shorter time periods are available. This indicates that it is possible to reliably analyze the relative positioning of actors without the need for manual annotation, even in the noisier context of social media.

pdf bib abs
Argument Quality Assessment in the Age of Instruction-Following Large Language Models
Henning Wachsmuth | Gabriella Lapesa | Elena Cabrio | Anne Lauscher | Joonsuk Park | Eva Maria Vecchi | Serena Villata | Timon Ziegenbein
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The computational treatment of arguments on controversial issues has been subject to extensive NLP research, due to its envisioned impact on opinion formation, decision making, writing education, and the like. A critical task in any such application is the assessment of an argument’s quality - but it is also particularly challenging. In this position paper, we start from a brief survey of argument quality research, where we identify the diversity of quality notions and the subjectiveness of their perception as the main hurdles towards substantial progress on argument quality assessment. We argue that the capabilities of instruction-following large language models (LLMs) to leverage knowledge across contexts enable a much more reliable assessment. Rather than just fine-tuning LLMs towards leaderboard chasing on assessment tasks, they need to be instructed systematically with argumentation theories and scenarios as well as with ways to solve argument-related problems. We discuss the real-world opportunities and ethical issues emerging thereby.

pdf bib abs
Self-reported Demographics and Discourse Dynamics in a Persuasive Online Forum
Agnieszka Falenska | Eva Maria Vecchi | Gabriella Lapesa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Research on language as interactive discourse underscores the deliberate use of demographic parameters such as gender, ethnicity, and class to shape social identities. For example, by explicitly disclosing one’s information and enforcing one’s social identity to an online community, the reception by and interaction with the said community is impacted, e.g., strengthening one’s opinions by depicting the speaker as credible through their experience in the subject. Here, we present a first thorough study of the role and effects of self-disclosures on online discourse dynamics, focusing on a pervasive type of self-disclosure: author gender. Concretely, we investigate the contexts and properties of gender self-disclosures and their impact on interaction dynamics in an online persuasive forum, ChangeMyView. Our contribution is twofold. At the level of the target phenomenon, we fill a research gap in the understanding of the impact of these self-disclosures on the discourse by bringing together features related to forum activity (votes, number of comments), linguistic/stylistic features from the literature, and discourse topics. At the level of the contributed resource, we enrich and release a comprehensive dataset that will provide a further impulse for research on the interplay between gender disclosures, community interaction, and persuasion in online discourse.

pdf bib abs
Stories and Personal Experiences in the COVID-19 Discourse
Neele Falk | Gabriella Lapesa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Storytelling, i.e., the use of of anecdotes and personal experiences, plays a crucial role in everyday argumentation. This is particularly true for the highly controversial debates that spark in times of crisis - where the focus of the discussion is on heterogeneous aspects of everyday life. For individuals, stories can have a strong persuasive power; for a larger collective, stories can help decision-makers to develop strategies for addressing the challenges people are facing, especially in times of crisis. In this paper, we analyse the use of storytelling in the COVID-19 discourse. We carry out our analysis on three publicly available Reddit datasets, for a total of 367K comments. We automatically annotate the Reddit datasets by detecting spans containing storytelling and classifying them into: a) personal vs. general – is the story experienced by the speaker? b) argumentative function (Does the story clarify a problem, potentially consisting in harm to a specific group? Does it exemplify a solution to a problem, or does it establish the credibility of the speaker?), and c) topic. We then carry out an analysis which establishes the relevance of storytelling in the COVID discourse and further uncovers interactions between topics and types of stories associated to them.

pdf bib abs
Mining, Assessing, and Improving Arguments in NLP and the Social Sciences
Gabriella Lapesa | Eva Maria Vecchi | Serena Villata | Henning Wachsmuth
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries

Computational argumentation is an interdisciplinary research field, connecting Natural Language Processing (NLP) to other disciplines such as the social sciences. The focus of recent research has concentrated on argument quality assessment: what makes an argument good or bad? We present a tutorial which is an updated edition of the EACL 2023 tutorial presented by the same authors. As in the previous version, the tutorial will have a strong interdisciplinary and interactive nature, and will be structured along three main coordinates: (1) the notions of argument quality (AQ) across disciplines (how do we recognize good and bad arguments?), with a particular focus on the interface between Argument Mining (AM) and Deliberation Theory; (2) the modeling of subjectivity (who argues to whom; what are their beliefs?); and (3) the generation of improved arguments (what makes an argument better?). The tutorial will also touch upon a series of topics that are particularly relevant for the LREC-COLING audience (the issue of resource quality for the assessment of AQ; the interdisciplinary application of AM and AQ in a text-as-data approach to Political Science), in line with the developments in NLP (LLMs for AQ assessment), and relevant for the societal applications of AQ assessment (bias and debiasing). We will involve the participants in two annotation studies on the assessment and the improvement of quality.

2023

pdf bib abs
StoryARG: a corpus of narratives and personal experiences in argumentative texts
Neele Falk | Gabriella Lapesa
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Humans are storytellers, even in communication scenarios which are assumed to be more rationality-oriented, such as argumentation. Indeed, supporting arguments with narratives or personal experiences (henceforth, stories) is a very natural thing to do – and yet, this phenomenon is largely unexplored in computational argumentation. Which role do stories play in an argument? Do they make the argument more effective? What are their narrative properties? To address these questions, we collected and annotated StoryARG, a dataset sampled from well-established corpora in computational argumentation (ChangeMyView and RegulationRoom), and the Social Sciences (Europolis), as well as comments to New York Times articles. StoryARG contains 2451 textual spans annotated at two levels. At the argumentative level, we annotate the function of the story (e.g., clarification, disclosure of harm, search for a solution, establishing speaker’s authority), as well as its impact on the effectiveness of the argument and its emotional load. At the level of narrative properties, we annotate whether the story has a plot-like development, is factual or hypothetical, and who the protagonist is. What makes a story effective in an argument? Our analysis of the annotations in StoryARG uncover a positive impact on effectiveness for stories which illustrate a solution to a problem, and in general, annotator-specific preferences that we investigate with regression analysis.

pdf bib abs
Node Placement in Argument Maps: Modeling Unidirectional Relations in High & Low-Resource Scenarios
Iman Jundi | Neele Falk | Eva Maria Vecchi | Gabriella Lapesa
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Argument maps structure discourse into nodes in a tree with each node being an argument that supports or opposes its parent argument. This format is more comprehensible and less redundant compared to an unstructured one. Exploring those maps and maintaining their structure by placing new arguments under suitable parents is more challenging for users with huge maps that are typical in online discussions. To support those users, we introduce the task of node placement: suggesting candidate nodes as parents for a new contribution. We establish an upper-bound of human performance, and conduct experiments with models of various sizes and training strategies. We experiment with a selection of maps from Kialo, drawn from a heterogeneous set of domains. Based on an annotation study, we highlight the ambiguity of the task that makes it challenging for both humans and models. We examine the unidirectional relation between tree nodes and show that encoding a node into different embeddings for each of the parent and child cases improves performance. We further show the few-shot effectiveness of our approach.

pdf bib
Proceedings of the 3rd Workshop on Computational Linguistics for the Political and Social Sciences
Christopher Klamm | Gabriella Lapesa | Valentin Gold | Theresa Gessler | Simone Paolo Ponzetto
Proceedings of the 3rd Workshop on Computational Linguistics for the Political and Social Sciences

pdf bib abs
Mining, Assessing, and Improving Arguments in NLP and the Social Sciences
Gabriella Lapesa | Eva Maria Vecchi | Serena Villata | Henning Wachsmuth
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Computational argumentation is an interdisciplinary research field, connecting Natural Language Processing (NLP) to other disciplines such as the social sciences. This tutorial will focus on a task that recently got into the center of attention in the community: argument quality assessment, that is, what makes an argument good or bad? We structure the tutorial along three main coordinates: (1) the notions of argument quality across disciplines (how do we recognize good and bad arguments?), (2) the modeling of subjectivity (who argues to whom; what are their beliefs?), and (3) the generation of improved arguments (what makes an argument better?). The tutorial highlights interdisciplinary aspects of the field, ranging from the collaboration of theory and practice (e.g., in NLP and social sciences), to approaching different types of linguistic structures (e.g., social media versus parliamentary texts), and facing the ethical issues involved (e.g., how to build applications for the social good). A key feature of this tutorial is its interactive nature: We will involve the participants in two annotation studies on the assessment and the improvement of quality, and we will encourage them to reflect on the challenges and potential of these tasks.

pdf bib abs
Bridging Argument Quality and Deliberative Quality Annotations with Adapters
Neele Falk | Gabriella Lapesa
Findings of the Association for Computational Linguistics: EACL 2023

Assessing the quality of an argument is a complex, highly subjective task, influenced by heterogeneous factors (e.g., prior beliefs of the annotators, topic, domain, and application), and crucial for its impact in downstream tasks (e.g., argument retrieval or generation). Both the Argument Mining and the Social Science community have devoted plenty of attention to it, resulting in a wide variety of argument quality dimensions and a large number of annotated resources. This work aims at a better understanding of how the different aspects of argument quality relate to each other from a practical point of view. We employ adapter-fusion (Pfeiffer et al., 2021) as a multi-task learning framework which a) can improve the prediction of individual quality dimensions by injecting knowledge about related dimensions b) is efficient and modular and c) can serve as an analysis tool to investigate relations between different dimensions. We conduct experiments on 6 datasets and 20 quality dimensions. We find that the majority of the dimensions can be learned as a weighted combination of other quality aspects, and that for 8 dimensions adapter fusion improves quality prediction. Last, we show the benefits of this approach by improving the performance in an extrinsic, out-of-domain task: prediction of moderator interventions in a deliberative forum.

pdf bib
Political claim identification and categorization in a multilingual setting: First experiments
Urs Zaberer | Sebastian Pado | Gabriella Lapesa
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

2022

pdf bib abs
Reports of personal experiences and stories in argumentation: datasets and analysis
Neele Falk | Gabriella Lapesa
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reports of personal experiences or stories can play a crucial role in argumentation, as they represent an immediate and (often) relatable way to back up one’s position with respect to a given topic. They are easy to understand and increase empathy: this makes them powerful in argumentation. The impact of personal reports and stories in argumentation has been studied in the Social Sciences, but it is still largely underexplored in NLP. Our work is the first step towards filling this gap: our goal is to develop robust classifiers to identify documents containing personal experiences and reports. The main challenge is the scarcity of annotated data: our solution is to leverage existing annotations to be able to scale-up the analysis. Our contribution is two-fold. First, we conduct a set of in-domain and cross-domain experiments involving three datasets (two from Argument Mining, one from the Social Sciences), modeling architectures, training setups and fine-tuning options tailored to the involved domains. We show that despite the differences among datasets and annotations, robust cross-domain classification is possible. Second, we employ linear regression for performance mining, identifying performance trends both for overall classification performance and individual classifier predictions.

pdf bib
Proceedings of the 9th Workshop on Argument Mining
Gabriella Lapesa | Jodi Schneider | Yohan Jo | Sougata Saha
Proceedings of the 9th Workshop on Argument Mining

pdf bib abs
How to Translate Your Samples and Choose Your Shots? Analyzing Translate-train & Few-shot Cross-lingual Transfer
Iman Jundi | Gabriella Lapesa
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of an indigenous South American low-resource language spoken by millions called Quechua. Specifically, our curated corpus is created from text gathered from the southern region of Peru where a dialect of Quechua is spoken that has not traditionally been used for digital systems as a target dialect in the past. In order to make our work repeatable by others, we also offer a public, pre-trained, BERT model called QuBERT which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging by using state-of-the-art techniques where we achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT which is on on par with other state-of-the-art multilingual models for natural language processing achieving between 71 and 74% F1 score on NER and 84–87% on POS tasks.

Many tasks in text-based computational social science (CSS) involve the classification of political statements into categories based on a domain-specific codebook. In order to be useful for CSS analysis, these categories must be fine-grained. The typically skewed distribution of fine-grained categories, however, results in a challenging classification problem on the NLP side. This paper proposes to make use of the hierarchical relations among categories typically present in such codebooks:e.g., markets and taxation are both subcategories of economy, while borders is a subcategory of security. We use these ontological relations as prior knowledge to establish additional constraints on the learned model, thusimproving performance overall and in particular for infrequent categories. We evaluate several lightweight variants of this intuition by extending state-of-the-art transformer-based textclassifiers on two datasets and multiple languages. We find the most consistent improvement for an approach based on regularization.

pdf bib abs
How to Translate Your Samples and Choose Your Shots? Analyzing Translate-train & Few-shot Cross-lingual Transfer
Iman Jundi | Gabriella Lapesa
Findings of the Association for Computational Linguistics: NAACL 2022

Translate-train or few-shot cross-lingual transfer can be used to improve the zero-shot performance of multilingual pretrained language models. Few-shot utilizes high-quality low-quantity samples (often manually translated from the English corpus ). Translate-train employs a machine translation of the English corpus, resulting in samples with lower quality that could be scaled to high quantity. Given the lower cost and higher availability of machine translation compared to manual professional translation, it is important to systematically compare few-shot and translate-train, understand when each has an advantage, and investigate how to choose the shots to translate in order to increase the few-shot gain. This work aims to fill this gap: we compare and quantify the performance gain of few-shot vs. translate-train using three different base models and a varying number of samples for three tasks/datasets (XNLI, PAWS-X, XQuAD) spanning 17 languages. We show that scaling up the training data using machine translation gives a larger gain compared to using the small-scale (higher-quality) few-shot data. When few-shot is beneficial, we show that there are random sets of samples that perform better across languages and that the performance on English and on the machine-translation of the samples can both be used to choose the shots to manually translate for an increased few-shot gain.

pdf bib
Journal for Language Technology and Computational Linguistics, Vol. 35 No. 2
Ines Rehbein | Gabriella Lapesa | Goran Glavaš | Simone Paolo Ponzetto
Journal for Language Technology and Computational Linguistics, Vol. 35 No. 2

pdf bib abs
Scaling up Discourse Quality Annotation for Political Science
Neele Falk | Gabriella Lapesa
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The empirical quantification of the quality of a contribution to a political discussion is at the heart of deliberative theory, the subdiscipline of political science which investigates decision-making in deliberative democracy. Existing annotation on deliberative quality is time-consuming and carried out by experts, typically resulting in small datasets which also suffer from strong class imbalance. Scaling up such annotations with automatic tools is desirable, but very challenging. We take up this challenge and explore different strategies to improve the prediction of deliberative quality dimensions (justification, common good, interactivity, respect) in a standard dataset. Our results show that simple data augmentation techniques successfully alleviate data imbalance. Classifiers based on linguistic features (textual complexity and sentiment/polarity) and classifiers integrating argument quality annotations (from the argument mining community in NLP) were consistently outperformed by transformer-based models, with or without data augmentation.

pdf bib abs
Investigating Independence vs. Control: Agenda-Setting in Russian News Coverage on Social Media
Annerose Eichel | Gabriella Lapesa | Sabine Schulte im Walde
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Agenda-setting is a widely explored phenomenon in political science: powerful stakeholders (governments or their financial supporters) have control over the media and set their agenda: political and economical powers determine which news should be salient. This is a clear case of targeted manipulation to divert the public attention from serious issues affecting internal politics (such as economic downturns and scandals) by flooding the media with potentially distracting information. We investigate agenda-setting in the Russian social media landscape, exploring the relation between economic indicators and mentions of foreign geopolitical entities, as well as of Russia itself. Our contributions are at three levels: at the level of the domain of the investigation, our study is the first to substructure the Russian media landscape in state-controlled vs. independent outlets in the context of strategic distraction from negative economic trends; at the level of the scope of the investigation, we involve a large set of geopolitical entities (while previous work has focused on the U.S.); at the qualitative level, our analysis of posts on Ukraine, whose relationship with Russia is of high geopolitical relevance, provides further insights into the contrast between state-controlled and independent outlets.

2021

pdf bib abs
Towards Argument Mining for Social Good: A Survey
Eva Maria Vecchi | Neele Falk | Iman Jundi | Gabriella Lapesa
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This survey builds an interdisciplinary picture of Argument Mining (AM), with a strong focus on its potential to address issues related to Social and Political Science. More specifically, we focus on AM challenges related to its applications to social media and in the multilingual domain, and then proceed to the widely debated notion of argument quality. We propose a novel definition of argument quality which is integrated with that of deliberative quality from the Social Science literature. Under our definition, the quality of a contribution needs to be assessed at multiple levels: the contribution itself, its preceding context, and the consequential effect on the development of the upcoming discourse. The latter has not received the deserved attention within the community. We finally define an application of AM for Social Good: (semi-)automatic moderation, a highly integrative application which (a) represents a challenging testbed for the integrated notion of quality we advocate, (b) allows the empirical quantification of argument/deliberative quality to benefit from the developments in other NLP fields (i.e. hate speech detection, fact checking, debiasing), and (c) has a clearly beneficial potential at the level of its societal thanks to its real-world application (even if extremely ambitious).

pdf bib abs
Predicting Moderation of Deliberative Arguments: Is Argument Quality the Key?
Neele Falk | Iman Jundi | Eva Maria Vecchi | Gabriella Lapesa
Proceedings of the 8th Workshop on Argument Mining

Human moderation is commonly employed in deliberative contexts (argumentation and discussion targeting a shared decision on an issue relevant to a group, e.g., citizens arguing on how to employ a shared budget). As the scale of discussion enlarges in online settings, the overall discussion quality risks to drop and moderation becomes more important to assist participants in having a cooperative and productive interaction. The scale also makes it more important to employ NLP methods for(semi-)automatic moderation, e.g. to prioritize when moderation is most needed. In this work, we make the first steps towards (semi-)automatic moderation by using state-of-the-art classification models to predict which posts require moderation, showing that while the task is undoubtedly difficult, performance is significantly above baseline. We further investigate whether argument quality is a key indicator of the need for moderation, showing that surprisingly, high quality arguments also trigger moderation. We make our code and data publicly available.

pdf bib abs
FAST: A carefully sampled and cognitively motivated dataset for distributional semantic evaluation
Stefan Evert | Gabriella Lapesa
Proceedings of the 25th Conference on Computational Natural Language Learning

What is the first word that comes to your mind when you hear giraffe, or damsel, or freedom? Such free associations contain a huge amount of information on the mental representations of the corresponding concepts, and are thus an extremely valuable testbed for the evaluation of semantic representations extracted from corpora. In this paper, we present FAST (Free ASsociation Tasks), a free association dataset for English rigorously sampled from two standard free association norms collections (the Edinburgh Associative Thesaurus and the University of South Florida Free Association Norms), discuss two evaluation tasks, and provide baseline results. In parallel, we discuss methodological considerations concerning the desiderata for a proper evaluation of semantic representations.

The analysis of public debates crucially requires the classification of political demands according to hierarchical claim ontologies (e.g. for immigration, a supercategory “Controlling Migration” might have subcategories “Asylum limit” or “Border installations”). A major challenge for automatic claim classification is the large number and low frequency of such subclasses. We address it by jointly predicting pairs of matching super- and subcategories. We operationalize this idea by (a) encoding soft constraints in the claim classifier and (b) imposing hard constraints via Integer Linear Programming. Our experiments with different claim classifiers on a German immigration newspaper corpus show consistent performance increases for joint prediction, in particular for infrequent categories and discuss the complementarity of the two approaches.

pdf bib abs
Regression Analysis of Lexical and Morpho-Syntactic Properties of Kiezdeutsch
Diego Frassinelli | Gabriella Lapesa | Reem Alatrash | Dominik Schlechtweg | Sabine Schulte im Walde
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

Kiezdeutsch is a variety of German predominantly spoken by teenagers from multi-ethnic urban neighborhoods in casual conversations with their peers. In recent years, the popularity of Kiezdeutsch has increased among young people, independently of their socio-economic origin, and has spread in social media, too. While previous studies have extensively investigated this language variety from a linguistic and qualitative perspective, not much has been done from a quantitative point of view. We perform the first large-scale data-driven analysis of the lexical and morpho-syntactic properties of Kiezdeutsch in comparison with standard German. At the level of results, we confirm predictions of previous qualitative analyses and integrate them with further observations on specific linguistic phenomena such as slang and self-centered speaker attitude. At the methodological level, we provide logistic regression as a framework to perform bottom-up feature selection in order to quantify differences across language varieties.

2020

DEbateNet-migr15 is a manually annotated dataset for German which covers the public debate on immigration in 2015. The building block of our annotation is the political science notion of a claim, i.e., a statement made by a political actor (a politician, a party, or a group of citizens) that a specific action should be taken (e.g., vacant flats should be assigned to refugees). We identify claims in newspaper articles, assign them to actors and fine-grained categories and annotate their polarity and date. The aim of this paper is two-fold: first, we release the full DEbateNet-mig15 corpus and document it by means of a quantitative and qualitative analysis; second, we demonstrate its application in a discourse network analysis framework, which enables us to capture the temporal dynamics of the political debate

pdf bib abs
Swimming with the Tide? Positional Claim Detection across Political Text Types
Nico Blokker | Erenay Dayanik | Gabriella Lapesa | Sebastian Padó
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

Manifestos are official documents of political parties, providing a comprehensive topical overview of the electoral programs. Voters, however, seldom read them and often prefer other channels, such as newspaper articles, to understand the party positions on various policy issues. The natural question to ask is how compatible these two formats (manifesto and newspaper reports) are in their representation of party positioning. We address this question with an approach that combines political science (manual annotation and analysis) and natural language processing (supervised claim identification) in a cross-text type setting: we train a classifier on annotated newspaper data and test its performance on manifestos. Our findings show a) strong performance for supervised classification even across text types and b) a substantive overlap between the two formats in terms of party positioning, with differences regarding the salience of specific issues.

2019

pdf bib abs
An Environment for Relational Annotation of Political Debates
Andre Blessing | Nico Blokker | Sebastian Haunss | Jonas Kuhn | Gabriella Lapesa | Sebastian Padó
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This paper describes the MARDY corpus annotation environment developed for a collaboration between political science and computational linguistics. The tool realizes the complete workflow necessary for annotating a large newspaper text collection with rich information about claims (demands) raised by politicians and other actors, including claim and actor spans, relations, and polarities. In addition to the annotation GUI, the tool supports the identification of relevant documents, text pre-processing, user management, integration of external knowledge bases, annotation comparison and merging, statistical analysis, and the incorporation of machine learning models as “pseudo-annotators”.

2017

pdf bib abs
Large-scale evaluation of dependency-based DSMs: Are they worth the effort?
Gabriella Lapesa | Stefan Evert
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper presents a large-scale evaluation study of dependency-based distributional semantic models. We evaluate dependency-filtered and dependency-structured DSMs in a number of standard semantic similarity tasks, systematically exploring their parameter space in order to give them a “fair shot” against window-based models. Our results show that properly tuned window-based DSMs still outperform the dependency-based models in most tasks. There appears to be little need for the language-dependent resources and computational cost associated with syntactic analysis.

pdf bib
Are doggies really nicer than dogs? The impact of morphological derivation on emotional valence in German
Gabriella Lapesa | Sebastian Padó | Tillmann Pross | Antje Rossdeutscher
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

pdf bib
Modeling Derivational Morphology in Ukrainian
Mariia Melymuka | Gabriella Lapesa | Max Kisselew | Sebastian Padó
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2015

pdf bib
SemantiKLUE: Semantic Textual Similarity with Maximum Weight Matching
Nataliia Plotnikova | Gabriella Lapesa | Thomas Proisl | Stefan Evert
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib abs
A Large Scale Evaluation of Distributional Semantic Models: Parameters, Interactions and Model Selection
Gabriella Lapesa | Stefan Evert
Transactions of the Association for Computational Linguistics, Volume 2

This paper presents the results of a large-scale evaluation study of window-based Distributional Semantic Models on a wide variety of tasks. Our study combines a broad coverage of model parameters with a model selection methodology that is robust to overfitting and able to capture parameter interactions. We show that our strategy allows us to identify parameter configurations that achieve good performance across different datasets and tasks.

pdf bib
Contrasting Syntagmatic and Paradigmatic Relations: Insights from Distributional Semantic Models
Gabriella Lapesa | Stefan Evert | Sabine Schulte im Walde
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf bib
NaDiR: Naive Distributional Response Generation
Gabriella Lapesa | Stefan Evert
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

2013

pdf bib
Evaluating Neighbor Rank and Distance Measures as Predictors of Semantic Priming
Gabriella Lapesa | Stefan Evert
Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

2012

pdf bib abs
LexIt: A Computational Resource on Italian Argument Structure
Alessandro Lenci | Gabriella Lapesa | Giulia Bonansinga
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The aim of this paper is to introduce LexIt, a computational framework for the automatic acquisition and exploration of distributional information about Italian verbs, nouns and adjectives, freely available through a web interface at the address http://sesia.humnet.unipi.it/lexit. LexIt is the first large-scale resource for Italian in which subcategorization and semantic selection properties are characterized fully on distributional ground: in the paper we describe both the process of data extraction and the evaluation of the subcategorization frames extracted with LexIt.

2010

pdf bib abs
Building an Italian FrameNet through Semi-automatic Corpus Analysis
Alessandro Lenci | Martina Johnson | Gabriella Lapesa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

n this paper, we outline the methodology we adopted to develop a FrameNet for Italian. The main element of novelty with respect to the original FrameNet is represented by the fact that the creation and annotation of Lexical Units is strictly grounded in distributional information (statistical distribution of verbal subcategorization frames, lexical and semantic preferences of each frame) automatically acquired from a large, dependency-parsed corpus. We claim that this approach allows us to overcome some of the shortcomings of the classical lexicographic method used to create FrameNet, by complementing the accuracy of manual annotation with the robustness of data on the global distributional patterns of a verb. In the paper, we describe our method for extracting distributional data from the corpus and the way we used it for the encoding and annotation of LUs. The long-term goal of our project is to create an electronic lexicon for Italian similar to the original English FrameNet. For the moment, we have developed a database of syntactic valences that will be made freely accessible via a web interface. This represents an autonomous resource besides the FrameNet lexicon, of which we have a beginning nucleus consisting of 791 annotated sentences.