Elisa Bassignana - ACL Anthology

2026

Do Large Language Models Adapt to Language Variation across Socioeconomic Status?
Elisa Bassignana | Mike Zhang | Dirk Hovy | Amanda Cercas Curry
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects

Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.

2025

DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers
Max Müller-Eberstein | Mike Zhang | Elisa Bassignana | Peter Brunsgaard Trolle | Rob Van Der Goot
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

Probing Feminist Representations: A Study of Bias in LLMs and Word Embeddings
Arianna Muti | Elisa Bassignana | Emanuele Moscato | Debora Nozza
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

SnakModel: Lessons Learned from Training an Open Danish Large Language Model
Mike Zhang | Max Müller-Eberstein | Elisa Bassignana | Rob van der Goot
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints.

MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling
Rob van der Goot | Anette Jensen | Emil Allerslev Schledermann | Mikkel Wildner Kildeberg | Nicolaj Larsen | Mike Zhang | Elisa Bassignana
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Current language models (LMs) mostly exploit subwords as input units based on statistical co-occurrences of characters. Adjacently, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain as there is no annotated data in most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a range of unsupervised token segmenters and evaluate the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an F1 of 57.6 compared to 68.0 for an unsupervised morphological segmenter (Morfessor). Furthermore, we evaluate a range of segmenters on the task of language modeling.

The AI Gap: How Socioeconomic Status Affects Language Technology Interactions
Elisa Bassignana | Amanda Cercas Curry | Dirk Hovy
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Socioeconomic status (SES) fundamentally influences how people interact with each other and, more recently, with digital technologies like large language models (LLMs). While previous research has highlighted the interaction between SES and language technology, it was limited by reliance on proxy metrics and synthetic data. We survey 1,000 individuals from ‘diverse socioeconomic backgrounds’ about their use of language technologies and generative AI, and collect 6,482 prompts from their previous interactions with LLMs. We find systematic differences across SES groups in language technology usage (i.e., frequency, performed tasks), interaction styles, and topics. Higher SES is associated with a higher level of abstraction, more concise requests, and topics like ‘inclusivity’ and ‘travel’. Lower SES correlates with higher anthropomorphization of LLMs (using “hello” and “thank you”) and more concrete language. Our findings suggest that while generative language technologies are becoming more accessible to everyone, socioeconomic linguistic differences still stratify their use to create a digital divide. These differences underscore the importance of considering SES in developing language technologies to accommodate varying linguistic needs rooted in socioeconomic factors and limit the AI Gap across SES groups.

2024

Can Humans Identify Domains?
Maria Barrett | Max Müller-Eberstein | Elisa Bassignana | Amalie Brogaard Pauli | Mike Zhang | Rob van der Goot
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Textual domain is a crucial property within the Natural Language Processing (NLP) community due to its effects on downstream model performance. The concept itself is, however, loosely defined and, in practice, refers to any non-typological property, such as genre, topic, medium or style of a document. We investigate the core notion of domains via human proficiency in identifying related intrinsic textual properties, specifically the concepts of genre (communicative purpose) and topic (subject matter). We publish our annotations in TGeGUM: A collection of 9.1k sentences from the GUM dataset (Zeldes, 2017) with single sentence and larger context (i.e., prose) annotations for one of 11 genres (source type), and its topic/subtopic as per the Dewey Decimal library classification system (Dewey, 1979), consisting of 10/100 hierarchical topics of increased granularity. Each instance is annotated by three annotators, for a total of 32.7k annotations, allowing us to examine the level of human disagreement and the relative difficulty of each annotation task. With a Fleiss’ kappa of at most 0.53 on the sentence level and 0.66 at the prose level, it is evident that despite the ubiquity of domains in NLP, there is little human consensus on how to define them. By training classifiers to perform the same task, we find that this uncertainty also extends to NLP models.

How to Encode Domain Information in Relation Classification
Elisa Bassignana | Viggo Unmack Gascou | Frida Nøhr Laustsen | Gustav Kristensen | Marie Haahr Petersen | Rob van der Goot | Barbara Plank
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Current language models require a lot of training data to obtain high performance. For Relation Classification (RC), many datasets are domain-specific, so combining datasets to obtain better performance is non-trivial. We explore a multi-domain training setup for RC, and attempt to improve performance by encoding domain information. Our proposed models improve by more than 2 Macro-F1 points over the baseline setup, and our analysis reveals that not all labels benefit equally: the classes which occupy a similar space across domains (i.e., their interpretation is close across them, for example “physical”) benefit the least, while domain-dependent relations (e.g., “part-of”) improve the most when encoding domain information.

Dissecting Biases in Relation Extraction: A Cross-Dataset Analysis on People’s Gender and Origin
Marco Stranisci | Pere-Lluís Huguet Cabot | Elisa Bassignana | Roberto Navigli
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Relation Extraction (RE) is at the core of many Natural Language Understanding tasks, including knowledge-base population and Question Answering. However, any Natural Language Processing system is exposed to biases, and the analysis of these has not received much attention in RE. We propose a new method for inspecting bias in the RE pipeline, which is completely transparent in terms of interpretability. Specifically, in this work we analyze biases related to gender and place of birth. Our methodology includes (i) obtaining semantic triplets (subject, object, semantic relation) involving ‘person’ entities from RE resources, (ii) collecting meta-information (‘gender’ and ‘place of birth’) using Entity Linking technologies, and then (iii) analyzing the distribution of triplets across different groups (e.g., men versus women). We investigate bias at two levels: in the training data of three commonly used RE datasets (SREDFM, CrossRE, NYT), and in the predictions of a state-of-the-art RE approach (ReLiK). To enable cross-dataset analysis, we introduce a taxonomy of relation types mapping the label sets of different RE datasets to a unified label space. Our findings reveal that bias is a compounded issue affecting underrepresented groups within data and predictions for RE.

What’s wrong with your model? A Quantitative Analysis of Relation Classification
Elisa Bassignana | Rob van der Goot | Barbara Plank
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

With the aim of improving the state-of-the-art (SOTA) on a target task, a standard strategy in Natural Language Processing (NLP) research is to design a new model, or modify the existing SOTA, and then benchmark its performance on the target task. We argue in favor of enriching this chain of actions by a preliminary error-guided analysis: First, explore weaknesses by analyzing the hard cases where the existing model fails, and then target the improvement based on those. Interpretable evaluation has received little attention for structured prediction tasks. Therefore, we propose the first in-depth analysis suite for Relation Classification (RC), and show its effectiveness through a case study. We propose a set of potentially influential attributes to focus on (e.g., entity distance, sentence length). Then, we bucket our datasets based on these attributes, and weigh their importance through correlations. This allows us to identify highly challenging scenarios for the RC model. By exploiting the findings of our analysis, with a carefully targeted adjustment to our architecture, we effectively improve the performance over the baseline by >3 Micro-F1.

2023

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Elisa Bassignana | Matthias Lindemann | Alban Petit
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Silver Syntax Pre-training for Cross-Domain Relation Extraction
Elisa Bassignana | Filip Ginter | Sampo Pyysalo | Rob van der Goot | Barbara Plank
Findings of the Association for Computational Linguistics: ACL 2023

Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has been shown to be beneficial across many NLP tasks. However, this setup still requires supplementary annotated data, which is often not available. In this paper, we investigate intermediate pre-training specifically for RE. We exploit the affinity between syntactic structure and semantic RE, and identify the syntactic relations which are closely related to RE by being on the shortest dependency path between two entities. We then take advantage of the high accuracy of current syntactic parsers in order to automatically obtain large amounts of low-cost pre-training data. By pre-training our RE model on the relevant syntactic relations, we are able to outperform the baseline in five out of six cross-domain setups, without any additional annotated data.

Multi-CrossRE A Multi-Lingual Multi-Domain Dataset for Relation Extraction
Elisa Bassignana | Filip Ginter | Sampo Pyysalo | Rob van der Goot | Barbara Plank
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose Multi-CrossRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English, and covering six text domains. Multi-CrossRE is a machine translated version of CrossRE (Bassignana and Plank, 2022), with a sub-portion including more than 200 sentences in seven diverse languages checked by native speakers. We run a baseline model over the 26 new datasets and, as a sanity check, over the 26 back-translations to English. Results on the back-translated data are consistent with the ones on the original English CrossRE, indicating high quality of the translation and the resulting dataset.

2022

Experimental Standards for Deep Learning in Natural Language Processing Research
Dennis Ulmer | Elisa Bassignana | Max Müller-Eberstein | Daniel Varab | Mike Zhang | Rob van der Goot | Christian Hardmeier | Barbara Plank
Findings of the Association for Computational Linguistics: EMNLP 2022

The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, compared to more established disciplines, a lack of common experimental standards remains an open challenge to the field at large. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards in NLP into a single, widely-applicable methodology. Following these best practices is crucial to strengthen experimental evidence, improve reproducibility and enable scientific progress. These standards are further collected in a public repository to help them transparently adapt to future needs.

CrossRE: A Cross-Domain Dataset for Relation Extraction
Elisa Bassignana | Barbara Plank
Findings of the Association for Computational Linguistics: EMNLP 2022

Relation Extraction (RE) has attracted increasing attention, but current RE evaluation is limited to in-domain evaluation setups. Little is known about how well an RE system fares in challenging but realistic out-of-distribution evaluation setups. To address this gap, we propose CrossRE, a new, freely-available cross-domain benchmark for RE, which comprises six distinct text domains and includes multi-label annotations. An additional innovation is that we release meta-data collected during annotation, to include explanations and flags of difficult instances. We provide an empirical evaluation with a state-of-the-art model for relation classification. As the meta-data enables us to shed new light on the state-of-the-art model, we provide a comprehensive analysis on the impact of difficult cases and find correlations between model and human annotations. Overall, our empirical investigation highlights the difficulty of cross-domain RE. We release our dataset to spur more research in this direction.

Evidence > Intuition: Transferability Estimation for Encoder Selection
Elisa Bassignana | Max Müller-Eberstein | Mike Zhang | Barbara Plank
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori, as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of the setups. In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.

What Do You Mean by Relation Extraction? A Survey on Datasets and Study on Scientific Relation Classification
Elisa Bassignana | Barbara Plank
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Over the last five years, research on Relation Extraction (RE) witnessed extensive progress with many new dataset releases. At the same time, setup clarity has decreased, contributing to increased difficulty of reliable empirical evaluation (Taillé et al., 2020). In this paper, we provide a comprehensive survey of RE datasets, and revisit the task definition and its adoption by the community. We find that cross-dataset and cross-domain setups are particularly lacking. We present an empirical study on scientific Relation Classification across two datasets. Despite large data overlap, our analysis reveals substantial discrepancies in annotation. Annotation discrepancies strongly impact Relation Classification performance, explaining large drops in cross-dataset evaluations. Variation within further sub-domains exists but impacts Relation Classification only to limited degrees. Overall, our study calls for more rigour in reporting setups in RE and evaluation across multiple test sets.

2020

Personal-ITY: A Novel YouTube-based Corpus for Personality Prediction in Italian
Elisa Bassignana | Malvina Nissim | Viviana Patti
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Matching Theory and Data with Personal-ITY: What a Corpus of Italian YouTube Comments Reveals About Personality
Elisa Bassignana | Malvina Nissim | Viviana Patti
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media

As a contribution to personality detection in languages other than English, we rely on distant supervision to create Personal-ITY, a novel corpus of YouTube comments in Italian, where authors are labelled with personality traits. The traits are derived from one of the mainstream personality theories in psychology research, named MBTI. Using personality prediction experiments, we (i) study the task of personality prediction in itself on our corpus as well as on TWISTY, a Twitter dataset also annotated with MBTI labels; (ii) carry out an extensive, in-depth analysis of the features used by the classifier, and view them specifically in the light of the original theory that we used to create the corpus in the first place. We observe that no single model is best at personality detection, and that while some traits are easier than others to detect, and also to match back to theory, for other, less frequent traits the picture is much more blurred.

2018

Hurtlex: A Multilingual Lexicon of Words to Hurt
Elisa Bassignana | Valerio Basile | Viviana Patti
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)