Luciana Benotti - ACL Anthology

Luciana Benotti

2026

Adaptive Data Collection for Latin-American Community-sourced Evaluation of Stereotypes (LACES)
Guido Ivetta | Pietro Palombini | Sofía Martinelli | Marcos J Gomez | M Emilia Echeveste | Sunipa Dev | Vinodkumar Prabhakaran | Luciana Benotti
Findings of the Association for Computational Linguistics: ACL 2026

The evaluation of societal biases in NLP models is critically hindered by a geo-cultural gap. This leaves regions such as Latin America severely underserved, making it impossible to adequately assess or mitigate the perpetuation of harmful regional stereotypes in language technologies. This paper presents LACES, a stereotype association dataset, for 15 Latin American countries. This dataset includes 4,789 stereotype associations[The de-identified dataset can be accessed via GitHub], manually created and annotated by 83 participants. The dataset was developed through targeted community partnerships across Latin America. Additionally, in this paper, we propose a novel adaptive data collection methodology that uniquely integrates the sourcing of new stereotype entries and the validation of existing data within a single, unified workflow. This approach results in a resource with more unique stereotypes than previous static collection methods, enabling a more efficient stereotype collection. The paper further supports the quality of LACES by demonstrating reduced efficacy of debiasing methods on this dataset in comparison to existing popular stereotype benchmarks.Content Warning: This research involves the study of social biases. Consequently, the paper contains examples of discriminatory language and stereotypes that may be sensitive or upsetting to readers. These examples are included for the purpose of scientific analysis and do not reflect the views of the authors.

Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)
Vinodkumar Prabhakaran | Sunipa Dev | Luciana Benotti | Daniel Hershcovich | Yong Cao | Li Zhou | Bolei Ma | Ife Adebara
Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)

2025

Insights from a Disaggregated Analysis of Kinds of Biases in a Multicultural Dataset
Guido Ivetta | Hernán Maina | Luciana Benotti
Proceedings of the 9th Widening NLP Workshop

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting.Stereotypes vary across cultural contexts, making it essential to understand how language models encode social biases. MultiLingualCrowsPairs is a dataset of culturally adapted stereotypical and anti-stereotypical sentence pairs across nine languages. While prior work has primarily reported average fairness metrics on masked language models, this paper analyzes social biases in generative models by disaggregating results across specific bias types.We find that although most languages show an overall preference for stereotypical sentences, this masks substantial variation across different types of bias, such as gender, religion, and socioeconomic status. Our findings underscore that relying solely on aggregated metrics can obscure important patterns, and that fine-grained, bias-specific analysis is critical for meaningful fairness evaluation.

HESEIA: A community-based dataset for evaluating social biases in large language models, co-designed in real school settings in Latin America
Guido Ivetta | Marcos J Gomez | Sofía Martinelli | Pietro Palombini | M Emilia Echeveste | Nair Carolina Mazzeo | Beatriz Busaniche | Luciana Benotti
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Most resources for evaluating social biases in Large Language Models are developed without co-design from the communities affected by these biases, and rarely involve participatory approaches. We introduce HESEIA, a dataset of 46,499 sentences created in a professional development course. The course involved 370 high-school teachers and 5,370 students from 189 Latin-American schools. Unlike existing benchmarks, HESEIA captures intersectional biases across multiple demographic axes and school subjects. It reflects local contexts through the lived experience and pedagogical expertise of educators. Teachers used minimal pairs to create sentences that express stereotypes relevant to their school subjects and communities. We show the dataset diversity in term of demographic axes represented and also in terms of the knowledge areas included. We demonstrate that the dataset contains more stereotypes unrecognized by current LLMs than previous datasets. HESEIA is available to support bias assessments grounded in educational communities.

Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)
Vinodkumar Prabhakaran | Sunipa Dev | Luciana Benotti | Daniel Hershcovich | Yong Cao | Li Zhou | Laura Cabello | Ife Adebara
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

Navigating Ethical Challenges in NLP: Hands-on strategies for students and researchers
Luciana Benotti | Fanny Ducel | Karën Fort | Guido Ivetta | Zhijing Jin | Min-Yen Kan | Seunghun J. Lee | Minzhi Li | Margot Mieskes | Adriana Pagano
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)

With NLP research being rapidly productionized into real-world applications, it is important to be aware of and think through the consequences of our work. Such ethical considerations are important in both authoring and reviewing (e.g. privacy, consent, fairness, among others). This tutorial will equip participants with basic guidelines for thinking deeply about ethical issues and review common considerations that recur in NLP research. The methodology is interactive and participatory, including discussion of case studies and group work. Participants will gain practical experience on when to flag a paper for ethics review and how to write an ethical consideration section to be shared with the broader community. Most importantly, the participants will be co-creating the tutorial outcomes and extending tutorial materials to share as public outcomes.

2024

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus containing contrasting sentence pairs, one sentence that presents a stereotype concerning an underadvantaged group and another minimally changed sentence, concerning a matching advantaged group. We build on the French CrowS-Pairs corpus and guidelines to provide translations of the existing material into seven additional languages. In total, we produce 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models. We find that language models in all languages favor sentences that express stereotypes in most bias categories. The process of creating a resource that covers a wide range of language types and cultural settings highlights the difficulty of bias evaluation, in particular comparability across languages and contexts.

Selectively Answering Visual Questions
Julian Eisenschlos | Hernán Maina | Guido Ivetta | Luciana Benotti
Findings of the Association for Computational Linguistics: ACL 2024

Recently, large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks such as captioning and visual question answering (VQA) with unprecedented accuracy. Applications such as helping the blind or visually impaired have a critical need for precise answers. It is specially important for models to be well calibrated and be able to quantify their uncertainty in order to selectively decide when to answer and when to abstain or ask for clarifications. We perform the first in-depth analysis of calibration methods and metrics for VQA with in-context learning LMMs. Studying VQA on two answerability benchmarks, we show that the likelihood score of visually grounded models is better calibrated than in their text-only counterparts for in-context learning, where sampling based methods are generally superior, but no clear winner arises. We propose Avg BLEU, a calibration score combining the benefits of both sampling and likelihood methods across modalities.

Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP
Vinodkumar Prabhakaran | Sunipa Dev | Luciana Benotti | Daniel Hershcovich | Laura Cabello | Yong Cao | Ife Adebara | Li Zhou
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP

2023

Understanding Ethics in NLP Authoring and Reviewing
Luciana Benotti | Karën Fort | Min-Yen Kan | Yulia Tsvetkov
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

With NLP research now quickly being transferred into real-world applications, it is important to be aware of and think through the consequences of our scientific investigation. Such ethical considerations are important in both authoring and reviewing. This tutorial will equip participants with basic guidelines for thinking deeply about ethical issues and review common considerations that recur in NLP research. The methodology is interactive and participatory, including case studies and working in groups. Importantly, the participants will be co-building the tutorial outcomes and will be working to create further tutorial materials to share as public outcomes.

Bias assessment for experts in discrimination, not in computer science
Laura Alonso Alemany | Luciana Benotti | Hernán Maina | Lucía Gonzalez | Lautaro Martínez | Beatriz Busaniche | Alexia Halvorsen | Amanda Rojo | Mariela Rajngewerc
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)

Approaches to bias assessment usually require such technical skills that, by design, they leave discrimination experts out. In this paper we present EDIA, a tool that facilitates that experts in discrimination explore social biases in word embeddings and masked language models. Experts can then characterize those biases so that their presence can be assessed more systematically, and actions can be planned to address them. They can work interactively to assess the effects of different characterizations of bias in a given word embedding or language model, which helps to specify informal intuitions in concrete resources for systematic testing.

Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)
Sunipa Dev | Vinodkumar Prabhakaran | David Ifeoluwa Adelani | Dirk Hovy | Luciana Benotti
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)

2022

What kinds of errors do reference resolution models make and what can we learn from them?
Jorge Sánchez | Mauricio Mazuecos | Hernán Maina | Luciana Benotti
Findings of the Association for Computational Linguistics: NAACL 2022

Referring resolution is the task of identifying the referent of a natural language expression, for example “the woman behind the other woman getting a massage”. In this paper we investigate which are the kinds of referring expressions on which current transformer based models fail. Motivated by this analysis we identify the weakening of the spatial natural constraints as one of its causes and propose a model that aims to restore it. We evaluate our proposed model on different datasets for the task showing improved performance on the most challenging kinds of referring expressions. Finally we present a thorough analysis of the kinds errors that are improved by the new model and those that are not and remain future challenges for the task.

Ethics consideration sections in natural language processing papers
Luciana Benotti | Patrick Blackburn
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In this paper, we present the results of a manual classification of all ethical consideration sections for ACL 2021. We also compare how many papers had an ethics consideration section per track and per world region in ACL 2021. We classified papers according to the ethical issues covered (research benefits, potential harms, and vulnerable groups affected) and whether the paper was marked as requiring ethics review by at least one reviewer. Moreover, we discuss recurring obstacles we have observed (highlighting some interesting texts we found along the way) and conclude with three suggestions. We think that this paper may be useful for anyone who needs to write — or review — an ethics section and would like to get an overview of what others have done.

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Luciana Benotti | Naoaki Okazaki | Yves Scherrer | Marcos Zampieri
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

2021

Visually Grounded Follow-up Questions: a Dataset of Spatial Questions Which Require Dialogue History
Tianai Dong | Alberto Testoni | Luciana Benotti | Raffaella Bernardi
Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics

In this paper, we define and evaluate a methodology for extracting history-dependent spatial questions from visual dialogues. We say that a question is history-dependent if it requires (parts of) its dialogue history to be interpreted. We argue that some kinds of visual questions define a context upon which a follow-up spatial question relies. We call the question that restricts the context: trigger, and we call the spatial question that requires the trigger question to be answered: zoomer. We automatically extract different trigger and zoomer pairs based on the visual property that the questions rely on (e.g. color, number). We manually annotate the automatically extracted trigger and zoomer pairs to verify which zoomers require their trigger. We implement a simple baseline architecture based on a SOTA multimodal encoder. Our results reveal that there is much room for improvement for answering history-dependent questions.

The Impact of Answers in Referential Visual Dialog
Mauricio Mazuecos | Patrick Blackburn | Luciana Benotti
Proceedings of the Reasoning and Interaction Conference (ReInAct 2021)

In the visual dialog task GuessWhat?! two players maintain a dialog in order to identify a secret object in an image. Computationally, this is modeled using a question generation module and a guesser module for the questioner role and an answering model, the Oracle, to answer the generated questions. This raises a question: what’s the risk of having an imperfect oracle model?. Here we present work in progress in the study of the impact of different answering models in human generated questions in GuessWhat?!. We show that having access to better quality answers has a direct impact on the guessing task for human dialog and argue that better answers could help train better question generation models.

A recipe for annotating grounded clarifications
Luciana Benotti | Patrick Blackburn
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In order to interpret the communicative intents of an utterance, it needs to be grounded in something that is outside of language; that is, grounded in world modalities. In this paper, we argue that dialogue clarification mechanisms make explicit the process of interpreting the communicative intents of the speaker’s utterances by grounding them in the various modalities in which the dialogue is situated. This paper frames dialogue clarification mechanisms as an understudied research problem and a key missing piece in the giant jigsaw puzzle of natural language understanding. We discuss both the theoretical background and practical challenges posed by this problem and propose a recipe for obtaining grounding annotations. We conclude by highlighting ethical issues that need to be addressed in future work.

Region under Discussion for visual dialog
Mauricio Mazuecos | Franco M. Luque | Jorge Sánchez | Hernán Maina | Thomas Vadora | Luciana Benotti
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Visual Dialog is assumed to require the dialog history to generate correct responses during a dialog. However, it is not clear from previous work how dialog history is needed for visual dialog. In this paper we define what it means for a visual question to require dialog history and we release a subset of the Guesswhat?! questions for which their dialog history completely changes their responses. We propose a novel interpretable representation that visually grounds dialog history: the Region under Discussion. It constrains the image’s spatial features according to a semantic representation of the history inspired by the information structure notion of Question under Discussion.We evaluate the architecture on task-specific multimodal models and the visual transformer model LXMERT.

Grounding as a Collaborative Process
Luciana Benotti | Patrick Blackburn
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Collaborative grounding is a fundamental aspect of human-human dialog which allows people to negotiate meaning. In this paper we argue that it is missing from current deep learning approaches to dialog. Our central point is that making mistakes and being able to recover from them collaboratively is a key ingredient in grounding meaning. We illustrate the pitfalls of being unable to ground collaboratively, discuss what can be learned from the language acquisition and dialog systems literature, and reflect on how to move forward.

2020

Effective questions in referential visual dialogue
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the Fourth Widening Natural Language Processing Workshop

An interesting challenge for situated dialogue systems is referential visual dialog: by asking questions, the system has to identify the referent to which the user refers to. Task success is the standard metric used to evaluate these systems. However, it does not consider how effective each question is, that is how much each question contributes to the goal. We propose a new metric, that measures question effectiveness. As a preliminary study, we report the new metric for state of the art publicly available models on GuessWhat?!. Surprisingly, successful dialogues do not have a higher percentage of effective questions than failed dialogues. This suggests that a system with high task success is not necessarily one that generates good questions.

They Are Not All Alike: Answering Different Spatial Questions Requires Different Grounding Strategies
Alberto Testoni | Claudio Greco | Tobias Bianchi | Mauricio Mazuecos | Agata Marcante | Luciana Benotti | Raffaella Bernardi
Proceedings of the Third International Workshop on Spatial Language Understanding

In this paper, we study the grounding skills required to answer spatial questions asked by humans while playing the GuessWhat?! game. We propose a classification for spatial questions dividing them into absolute, relational, and group questions. We build a new answerer model based on the LXMERT multimodal transformer and we compare a baseline with and without visual features of the scene. We are interested in studying how the attention mechanisms of LXMERT are used to answer spatial questions since they require putting attention on more than one region simultaneously and spotting the relation holding among them. We show that our proposed model outperforms the baseline by a large extent (9.70% on spatial questions and 6.27% overall). By analyzing LXMERT errors and its attention mechanisms, we find that our classification helps to gain a better understanding of the skills required to answer different spatial questions.

On the role of effective and referring questions in GuessWhat?!
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the First Workshop on Advances in Language and Vision Research

Task success is the standard metric used to evaluate referential visual dialogue systems. In this paper we propose two new metrics that evaluate how each question contributes to the goal. First, we measure how effective each question is by evaluating whether the question discards objects that are not the referent. Second, we define referring questions as those that univocally identify one object in the image. We report the new metrics for human dialogues and for state of the art publicly available models on GuessWhat?!. Regarding our first metric, we find that successful dialogues do not have a higher percentage of effective questions for most models. With respect to the second metric, humans make questions at the end of the dialogue that are referring, confirming their guess before guessing. Human dialogues that use this strategy have a higher task success but models do not seem to learn it.

2018

Modeling Student Response Times: Towards Efficient One-on-one Tutoring Dialogues
Luciana Benotti | Jayadev Bhaskaran | Sigtryggur Kjartansson | David Lang
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

In this paper we investigate the task of modeling how long it would take a student to respond to a tutor question during a tutoring dialogue. Solving such a task has applications in educational settings such as intelligent tutoring systems, as well as in platforms that help busy human tutors to keep students engaged. Knowing how long it would normally take a student to respond to different types of questions could help tutors optimize their own time while answering multiple dialogues concurrently, as well as deciding when to prompt a student again. We study this problem using data from a service that offers tutor support for math, chemistry and physics through an instant messaging platform. We create a dataset of 240K questions. We explore several strong baselines for this task and compare them with human performance.

2015

Zoom: a corpus of natural language descriptions of map locations
Romina Altamirano | Thiago Ferreira | Ivandré Paraboni | Luciana Benotti
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Mining human interactions to construct a virtual guide for a virtual fair
Andrés Luna | Luciana Benotti
Proceedings of the EACL 2014 Workshop on Dialogue in Motion

A Natural Language Instructor for pedestrian navigation based in generation by selection
Santiago Avalos | Luciana Benotti
Proceedings of the EACL 2014 Workshop on Dialogue in Motion

2012

Corpus-based Interpretation of Instructions in Virtual Environments
Luciana Benotti | Martín Villalba | Tessa Lau | Julián Cerruti
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Probabilistic Refinement Algorithms for the Generation of Referring Expressions
Romina Altamirano | Carlos Areces | Luciana Benotti
Proceedings of COLING 2012: Posters

2011

CL system: Giving instructions by corpus based selection
Luciana Benotti | Alexandre Denis
Proceedings of the 13th European Workshop on Natural Language Generation

The GIVE-2.5 C Generation System
David Nicolás Racca | Luciana Benotti | Pablo Duboue
Proceedings of the 13th European Workshop on Natural Language Generation

Giving instructions in virtual environments by corpus based selection
Luciana Benotti | Alexandre Denis
Proceedings of the SIGDIAL 2011 Conference

Prototyping virtual instructors from human-human corpora
Luciana Benotti | Alexandre Denis
Proceedings of the ACL-HLT 2011 System Demonstrations

2010

Negotiating causal implicatures
Luciana Benotti | Patrick Blackburn
Proceedings of the SIGDIAL 2010 Conference

Dialogue Systems for Virtual Environments
Luciana Benotti | Paula Estrella | Carlos Areces
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2009

Clarification Potential of Instructions
Luciana Benotti
Proceedings of the SIGDIAL 2009 Conference

A computational account of comparative implicatures for a spoken dialogue agent
Luciana Benotti | David Traum
Proceedings of the Eight International Conference on Computational Semantics

Frolog: an Accommodating Text-Adventure Game
Luciana Benotti
Proceedings of the Demonstrations Session at EACL 2009

Co-authors

Vinodkumar Prabhakaran 5

Raffaella Bernardi 4

Alberto Testoni 4

Alexandre Denis 3

Daniel Hershcovich 3

Laura Alonso Alemany 2

Romina Altamirano 2

Carlos Areces 2

Beatriz Busaniche 2

Laura Cabello Piqueras 2

M Emilia Echeveste 2

Marcos J Gomez 2

Sofía Martinelli 2

Margot Mieskes 2

Pietro Palombini 2

Jorge Sánchez 2

David Ifeoluwa Adelani 1

Santiago Avalos 1

Julien Bezançon 1

Jayadev Bhaskaran 1

Tobias Bianchi 1

Marthese Borg 1

Thiago Castro Ferreira 1

Julián Cerruti 1

Yongjian Chen 1

Julian Eisenschlos 1

Paula Estrella 1

Lucía Gonzalez 1

Claudio Greco 1

Alexia Halvorsen 1

Sigtryggur Kjartansson 1

Seunghun J. Lee 1

Franco M. Luque 1

Agata Marcante 1

Lautaro Martínez 1

Nair Carolina Mazzeo 1

Aurelie Neveol 1

Naoaki Okazaki 1

Adriana Pagano 1

Ivandré Paraboni 1

David Nicolas Racca 1

Matteo Radaelli 1

Emma Raimundo Schulz 1

Mariela Rajngewerc 1

Yves Scherrer 1

Wolfgang S. Schmeisser-Nieto 1

Javier Torroba Marchante 1

Yulia Tsvetkov 1

Thomas Vadora 1

Martín Villalba 1

Marcos Zampieri 1

Sergio E. Zanotto 1

Venues