Raffaella Bernardi - ACL Anthology

Raffaella Bernardi

Also published as: R. Bernardi

2026

The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models
Sherzod Hakimov | Roland Bernard | Tim Leiber | Karl Osswald | Kristina Richert | Ruilin Yang | Raffaella Bernardi | David Schlangen
Findings of the Association for Computational Linguistics: EACL 2026

Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We present the first comprehensive study that systematically evaluates how explicit reasoning training affects the negotiation abilities of both commercial and open-weight large language models, comparing these models to their vanilla counterparts across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning (that is, scaling test-time compute) significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5’s performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while a leading commercial model maintains language consistency between reasoning and final output.

Teaching Small Language Models to Learn Logic through Meta-Learning
Leonardo Bertolazzi | Manuel Vargas Guzmán | Raffaella Bernardi | Maciej Malicki | Jakub Szymanik
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) are increasingly evaluated on reasoning tasks, yet their logical abilities remain contested. To address this, we study LLMs’ reasoning in a well-defined fragment of logic: syllogistic reasoning. We cast the problem as premise selection and construct controlled datasets to isolate logical competence. Beyond evaluation, an open challenge is enabling LLMs to acquire abstract inference patterns that generalize to novel structures. We propose to apply few-shot meta-learning to this domain, thereby encouraging models to extract rules across tasks rather than memorize patterns within tasks. Although meta-learning has been little explored in the context of logic learnability, our experiments show that it is effective: small models (1.5B–7B) fine-tuned with meta-learning demonstrate strong gains in generalization, with especially pronounced benefits in low-data regimes. These meta-learned models outperform GPT-4o and o3-mini on our syllogistic reasoning task.

2025

MLLMs Construction Company: Investigating Multimodal LLMs’ Communicative Skills in a Collaborative Building Task
Marika Sarzotti | Giovanni Duca | Chris Madge | Raffaella Bernardi | Massimo Poesio
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model’s response. In this paper, we investigate whether Dialogue Games (goal-directed and rule-governed activities driven predominantly by verbal actions) can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning; direct alignment (DPO); and reinforcement learning with Group Relative Policy Optimization (GRPO). We experiment with post-training a small LLM (Llama-3.1-8B-Instruct), evaluating performance on unseen instances of training games as well as unseen games, and on standard benchmarks. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills, while interactive learning with GRPO shows balanced improvements without loss of skills. We release the framework and the baseline training setups to foster research in this promising new direction of “learning in (synthetic) interaction”.

There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.

All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark
Davide Testa | Giovanni Bonetta | Raffaella Bernardi | Alessandro Bondielli | Alessandro Lenci | Alessio Miaschi | Lucia Passaro | Bernardo Magnini
Findings of the Association for Computational Linguistics: EMNLP 2025

We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task, and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it evaluates VLMs’ consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric revealing low results that highlight models’ fragility. Last but not least, the video collection has been carefully selected to reflect the Italian culture, and the language data are produced by native speakers. Data available at *[GitHub](https://github.com/Caput97/MAIA-Multimodal_AI_Assessment.git).*

The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
Leonardo Bertolazzi | Philipp Mondorf | Barbara Plank | Raffaella Bernardi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models’ internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads—attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models’ internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why smaller-sized LLMs struggle to detect even simple arithmetic errors.

MAIA: A Benchmark for Multimodal AI Assessment
Davide Testa | Giovanni Bonetta | Raffaella Bernardi | Alessandro Bondielli | Alessandro Lenci | Alessio Miaschi | Lucia C. Passaro | Bernardo Magnini
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
Filippo Momentè | Alessandro Suglia | Mario Giulianelli | Ambra Ferrari | Alexander Koller | Oliver Lemon | David Schlangen | Raquel Fernández | Raffaella Bernardi
Findings of the Association for Computational Linguistics: EMNLP 2025

We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two—benchmarks or games—is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.

Argumentative Analysis of Legal Rulings: A Structured Framework Using Bobbitt’s Typology
Carlotta Giacchetta | Raffaella Bernardi | Barbara Montini | Jacopo Staiano | Serena Tomasi
Proceedings of the 12th Argument mining Workshop

Legal reasoning remains one of the most complex and nuanced domains for AI, with current tools often lacking transparency and domain adaptability. While recent advances in large language models (LLMs) offer new opportunities for legal analysis, their ability to structure and interpret judicial argumentation remains unexplored. We address this gap by proposing a structured framework for AI-assisted legal reasoning, centered on argumentative analysis. In this work, we use GPT-4o for discourse-level and semantic analysis to identify argumentative units and classify them according to Philip Bobbitt’s six constitutional modalities of legal reasoning. We apply this framework to legal rulings from the Italian Court of Cassation. Our experimental findings indicate that LLM-based tools can effectively augment and streamline legal practice, e.g. by preprocessing the legal texts under scrutiny; still, the limited performance of the state-of-the-art generative model tested indicates significant room for progress in human-AI collaboration in the legal domain.

2024

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences
Leonardo Bertolazzi | Albert Gatt | Raffaella Bernardi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as content effects, avoid answering that no conclusion follows, align with human difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge and with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter can mitigate most reasoning biases while being consistent.

Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Expected Information Gain
Davide Mazzaccara | Alberto Testoni | Raffaella Bernardi
Findings of the Association for Computational Linguistics: EMNLP 2024

Questions are essential tools for acquiring the necessary information to complete information-seeking tasks. However, large language models (LLMs), especially open-source models, often perform poorly in generating informative questions, as measured by expected information gain (EIG). In this paper, we propose a method to enhance the informativeness of LLM-generated questions in 20-question game dialogues. We sample multiple questions from the same model (LLaMA 2-Chat 7B) for each game and create pairs of low-EIG and high-EIG questions to apply a Direct Preference Optimization (DPO) algorithm. Our results show that this method produces more effective questions (in terms of EIG), even in domains different from those used to train the DPO model.
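
The expected information gain used in the abstract above can be illustrated with a minimal sketch. It assumes a uniform prior over the candidate items and noise-free yes/no answers, in which case the EIG of a question reduces to the binary entropy of the fraction of candidates answering "yes". The candidate set, the questions, and the helper names (`expected_information_gain`, `preference_pair`) are hypothetical and not taken from the paper; the pairing step only mirrors the idea of contrasting a high-EIG with a low-EIG question for preference optimization.

```python
import math

def expected_information_gain(candidates, says_yes):
    """EIG (in bits) of a yes/no question, assuming a uniform prior over candidates."""
    n = len(candidates)
    k = sum(1 for c in candidates if says_yes(c))
    if n == 0 or k == 0 or k == n:
        # The answer is already known, so the question is uninformative.
        return 0.0
    p = k / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def preference_pair(questions, candidates):
    """Return (chosen, rejected): the highest-EIG and lowest-EIG question texts."""
    ranked = sorted(
        questions,
        key=lambda text: expected_information_gain(candidates, questions[text]),
    )
    return ranked[-1], ranked[0]

# Hypothetical toy hypothesis space and sampled questions.
CANDIDATES = ["dog", "cat", "car", "bus"]
QUESTIONS = {
    "Is it an animal?": lambda c: c in {"dog", "cat"},
    "Is it a dog?": lambda c: c == "dog",
    "Is it a thing?": lambda c: True,
}

if __name__ == "__main__":
    for text, oracle in QUESTIONS.items():
        print(text, round(expected_information_gain(CANDIDATES, oracle), 3))
    print(preference_pair(QUESTIONS, CANDIDATES))
```

A question that splits the candidates in half reaches the maximum of one bit, while a question every candidate answers the same way yields zero.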

2023

The Inherence of Telicity: Unveiling Temporal Reasoning in Video Question Answering
Olga Loginova | Raffaella Bernardi
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

ChatGPT’s Information Seeking Strategy: Insights from the 20-Questions Game
Leonardo Bertolazzi | Davide Mazzaccara | Filippo Merlo | Raffaella Bernardi
Proceedings of the 16th International Natural Language Generation Conference

Large Language Models, and ChatGPT in particular, have recently grabbed the attention of the community and the media. Now that these models have reached high language proficiency, attention has been shifting toward their reasoning capabilities. In this paper, our main aim is to evaluate ChatGPT’s question generation in a task where language production should be driven by an implicit reasoning process. To this end, we employ the 20-Questions game, traditionally used within the Cognitive Science community to inspect the development of information-seeking strategies. This task requires a series of interconnected skills: asking informative questions, stepwise updating the hypothesis space, and stopping asking questions when enough information has been collected. We build hierarchical hypothesis spaces, exploiting feature norms collected from humans vs. ChatGPT itself, and we inspect the efficiency and informativeness of ChatGPT’s strategy. Our results show that ChatGPT’s performance gets closer to an optimal agent only when prompted to explicitly list the updated space stepwise.

2022

ACT-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments
Michael Hanna | Federico Pedeni | Alessandro Suglia | Alberto Testoni | Raffaella Bernardi
Proceedings of the 29th International Conference on Computational Linguistics

Artificial agents are nowadays challenged to perform embodied AI tasks. To succeed, agents must understand the meaning of verbs and how their corresponding actions transform the surrounding world. In this work, we propose ACT-Thor, a novel controlled benchmark for embodied action understanding. We use the AI2-THOR simulated environment to produce a controlled setup in which an agent, given a before-image and an associated action command, has to determine what the correct after-image is among a set of possible candidates. First, we assess the feasibility of the task via a human evaluation that resulted in 81.4% accuracy, and very high inter-annotator agreement (84.9%). Second, we design both unimodal and multimodal baselines, using state-of-the-art visual feature extractors. Our evaluation and error analysis suggest that only models that have a very structured representation of the actions together with powerful visual features can perform well on the task. However, they still fall behind human performance in a zero-shot scenario where the model is exposed to unseen (action, object) pairs. This paves the way for a systematic way of evaluating embodied AI agents that understand grounded actions.

A Small but Informed and Diverse Model: The Case of the Multimodal GuessWhat!? Guessing Game
Claudio Greco | Alberto Testoni | Raffaella Bernardi | Stella Frank
Proceedings of the 2022 CLASP Conference on (Dis)embodiment

Pre-trained Vision and Language Transformers achieve high performance on downstream tasks due to their ability to transfer representational knowledge accumulated during pretraining on substantial amounts of data. In this paper, we ask whether it is possible to compete with such models using features based on transferred (pre-trained, frozen) representations combined with a lightweight architecture. We take a multimodal guessing task as our testbed, GuessWhat?!. An ensemble of our lightweight model matches the performance of the finetuned pre-trained transformer (LXMERT). An uncertainty analysis of our ensemble shows that the lightweight transferred representations close the data uncertainty gap with LXMERT, while retaining model diversity leading to ensemble boost. We further demonstrate that LXMERT’s performance gain is due solely to its extra V&L pretraining rather than because of architectural improvements. These results argue for flexible integration of multiple features and lightweight models as a viable alternative to large, cumbersome, pre-trained models.

2021

The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues
Alberto Testoni | Raffaella Bernardi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

When training a model on referential dialogue guessing games, the best model is usually chosen based on its task success. We show that in the popular end-to-end approach, this choice prevents the model from learning to generate linguistically richer dialogues, since the acquisition of language proficiency takes longer than learning the guessing task. By comparing models playing different games (GuessWhat, GuessWhich, and Mutual Friends), we show that this discrepancy is model- and task-agnostic. We investigate whether and when better language quality could lead to higher task success. We show that in GuessWhat, models could increase their accuracy if they learn to ground, encode, and decode also words that do not occur frequently in the training set.

Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy
Alberto Testoni | Raffaella Bernardi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Generating goal-oriented questions in Visual Dialogue tasks is a challenging and longstanding problem. State-Of-The-Art systems are shown to generate questions that, although grammatically correct, often lack an effective strategy and sound unnatural to humans. Inspired by the cognitive literature on information search and cross-situational word learning, we design Confirm-it, a model based on a beam search re-ranking algorithm that guides an effective goal-oriented strategy by asking questions that confirm the model’s conjecture about the referent. We take the GuessWhat?! game as a case-study. We show that dialogues generated by Confirm-it are more natural and effective than beam search decoding without re-ranking.

“I’ve Seen Things You People Wouldn’t Believe”: Hallucinating Entities in GuessWhat?!
Alberto Testoni | Raffaella Bernardi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Natural language generation systems have witnessed important progress in the last years, but they are shown to generate tokens that are unrelated to the source input. This problem affects computational models in many NLP tasks, and it is particularly unpleasant in multimodal systems. In this work, we assess the rate of object hallucination in multimodal conversational agents playing the GuessWhat?! referential game. Better visual processing has been shown to mitigate this issue in image captioning; hence, we adapt to the GuessWhat?! task the best visual processing models at disposal, and propose two new models to play the Questioner agent. We show that the new models generate few hallucinations compared to other renowned models available in the literature. Moreover, their hallucinations are less severe (affect task-accuracy less) and are more human-like. We also analyse where hallucinations tend to occur more often through the dialogue: hallucinations are less frequent in earlier turns, cause a cascade hallucination effect, and are often preceded by negative answers, which have been shown to be harder to ground.

Visually Grounded Follow-up Questions: a Dataset of Spatial Questions Which Require Dialogue History
Tianai Dong | Alberto Testoni | Luciana Benotti | Raffaella Bernardi
Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics

In this paper, we define and evaluate a methodology for extracting history-dependent spatial questions from visual dialogues. We say that a question is history-dependent if it requires (parts of) its dialogue history to be interpreted. We argue that some kinds of visual questions define a context upon which a follow-up spatial question relies. We call the question that restricts the context the trigger, and the spatial question that requires the trigger question to be answered the zoomer. We automatically extract different trigger and zoomer pairs based on the visual property that the questions rely on (e.g. color, number). We manually annotate the automatically extracted trigger and zoomer pairs to verify which zoomers require their trigger. We implement a simple baseline architecture based on a SOTA multimodal encoder. Our results reveal that there is much room for improvement in answering history-dependent questions.

2020

On the role of effective and referring questions in GuessWhat?!
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the First Workshop on Advances in Language and Vision Research

Task success is the standard metric used to evaluate referential visual dialogue systems. In this paper we propose two new metrics that evaluate how each question contributes to the goal. First, we measure how effective each question is by evaluating whether the question discards objects that are not the referent. Second, we define referring questions as those that univocally identify one object in the image. We report the new metrics for human dialogues and for state-of-the-art publicly available models on GuessWhat?!. Regarding our first metric, we find that successful dialogues do not have a higher percentage of effective questions for most models. With respect to the second metric, humans ask referring questions at the end of the dialogue, confirming their guess before guessing. Human dialogues that use this strategy have a higher task success, but models do not seem to learn it.
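
The two metrics described in the abstract above can be sketched on a toy symbolic scene. This is only an illustration under strong assumptions: objects are plain attribute dictionaries, a question is a boolean predicate, and the oracle answers truthfully for the target. The function names and the scene are hypothetical; the paper itself works on GuessWhat?! images and dialogues, and its exact operationalization may differ.

```python
def remaining_after(candidates, target, predicate):
    """Keep the candidates consistent with the oracle's answer for the target."""
    answer = predicate(target)
    return [obj for obj in candidates if predicate(obj) == answer]

def is_effective(candidates, target, predicate):
    """A question is effective if its answer discards at least one non-referent."""
    return len(remaining_after(candidates, target, predicate)) < len(candidates)

def is_referring(candidates, predicate):
    """A question is referring if it singles out exactly one object in the scene."""
    return sum(1 for obj in candidates if predicate(obj)) == 1

# Hypothetical scene with three objects; the second one is the referent.
SCENE = [
    {"category": "person", "position": "left"},
    {"category": "dog", "position": "left"},
    {"category": "person", "position": "right"},
]
TARGET = SCENE[1]

def is_person(obj):
    return obj["category"] == "person"

def is_dog(obj):
    return obj["category"] == "dog"

def is_object(obj):
    return True

if __name__ == "__main__":
    print(is_effective(SCENE, TARGET, is_person))  # discards the two persons
    print(is_referring(SCENE, is_dog))             # only one dog in the scene
    print(is_effective(SCENE, TARGET, is_object))  # discards nothing
```

Because the target is always consistent with its own answer, any shrinkage of the candidate set comes from discarding objects that are not the referent.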

Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision
Sandro Pezzelle | Claudio Greco | Greta Gandolfi | Eleonora Gualdoni | Raffaella Bernardi
Findings of the Association for Computational Linguistics: EMNLP 2020

This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made to develop universal multimodal encoders suitable for virtually any language and vision tasks. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where either modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against human speakers. We show that, while the task is relatively easy for humans, best-performing models struggle to achieve similar results.

Overprotective Training Environments Fall Short at Testing Time: Let Models Contribute to Their Own Training
Alberto Testoni | Raffaella Bernardi
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

They Are Not All Alike: Answering Different Spatial Questions Requires Different Grounding Strategies
Alberto Testoni | Claudio Greco | Tobias Bianchi | Mauricio Mazuecos | Agata Marcante | Luciana Benotti | Raffaella Bernardi
Proceedings of the Third International Workshop on Spatial Language Understanding

In this paper, we study the grounding skills required to answer spatial questions asked by humans while playing the GuessWhat?! game. We propose a classification for spatial questions dividing them into absolute, relational, and group questions. We build a new answerer model based on the LXMERT multimodal transformer and we compare a baseline with and without visual features of the scene. We are interested in studying how the attention mechanisms of LXMERT are used to answer spatial questions since they require putting attention on more than one region simultaneously and spotting the relation holding among them. We show that our proposed model outperforms the baseline by a large extent (9.70% on spatial questions and 6.27% overall). By analyzing LXMERT errors and its attention mechanisms, we find that our classification helps to gain a better understanding of the skills required to answer different spatial questions.

Effective questions in referential visual dialogue
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the Fourth Widening Natural Language Processing Workshop

An interesting challenge for situated dialogue systems is referential visual dialog: by asking questions, the system has to identify the referent to which the user refers. Task success is the standard metric used to evaluate these systems. However, it does not consider how effective each question is, that is, how much each question contributes to the goal. We propose a new metric that measures question effectiveness. As a preliminary study, we report the new metric for state-of-the-art publicly available models on GuessWhat?!. Surprisingly, successful dialogues do not have a higher percentage of effective questions than failed dialogues. This suggests that a system with high task success is not necessarily one that generates good questions.

Grounded and Ungrounded Referring Expressions in Human Dialogues: Language Mirrors Different Grounding Conditions
Eleonora Gualdoni | Raffaella Bernardi | Raquel Fernández | Sandro Pezzelle
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

2019

Preface
Raffaella Bernardi | Roberto Navigli | Giovanni Semeraro
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat
Ravi Shekhar | Aashish Venkatesh | Tim Baumgärtner | Elia Bruni | Barbara Plank | Raffaella Bernardi | Raquel Fernández
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a grounded dialogue state encoder which addresses a foundational issue on how to integrate visual grounding with dialogue system components. As a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal is to identify an object in a complex visual scene by asking a sequence of yes/no questions. Our visually-grounded encoder leverages synergies between guessing and asking questions, as it is trained jointly using multi-task learning. We further enrich our model via a cooperative learning regime. We show that the introduction of both the joint architecture and cooperative learning leads to accuracy improvements over the baseline system. We compare our approach to an alternative system which extends the baseline with reinforcement learning. Our in-depth analysis shows that the linguistic skills of the two models differ dramatically, despite approaching comparable performance levels. This points to the importance of analyzing the linguistic output of competing systems beyond numeric comparison solely based on task success.

Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)
Raffaella Bernardi | Roberto Navigli | Giovanni Semeraro
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Quantifiers in a Multimodal World: Hallucinating Vision with Language and Sound
Alberto Testoni | Sandro Pezzelle | Raffaella Bernardi
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Inspired by the literature on multisensory integration, we develop a computational model to ground quantifiers in perception. The model learns to pick, out of nine quantifiers (‘few’, ‘many’, ‘all’, etc.), the one that is most likely to describe the percentage of animals in a visual-auditory input containing both animals and artifacts. We show that relying on concurrent sensory inputs increases model performance on the quantification task. Moreover, we evaluate the model in a situation in which only the auditory modality is given, while the visual one is ‘hallucinated’ either from the auditory input itself or from a linguistic caption describing the quantity of entities in the auditory input. This way, the model exploits prior associations between modalities. We show that the model profits from the prior knowledge and outperforms the auditory-only setting.

Psycholinguistics Meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering
Claudio Greco | Barbara Plank | Raquel Fernández | Raffaella Bernardi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We study the issue of catastrophic forgetting in the context of neural multimodal approaches to Visual Question Answering (VQA). Motivated by evidence from psycholinguistics, we devise a set of linguistically-informed VQA tasks, which differ by the types of questions involved (Wh-questions and polar questions). We test what impact task difficulty has on continual learning, and whether the order in which a child acquires question types facilitates computational models. Our results show that dramatic forgetting is at play and that task difficulty and order matter. Two well-known current continual learning methods mitigate the problem only to a limited degree.

Proceedings of the Second Workshop on Shortcomings in Vision and Language
Raffaella Bernardi | Raquel Fernandez | Spandana Gella | Kushal Kafle | Christopher Kanan | Stefan Lee | Moin Nabi
Proceedings of the Second Workshop on Shortcomings in Vision and Language

Jointly Learning to See, Ask, Decide when to Stop, and then GuessWhat
Ravi Shekhar | Alberto Testoni | Raquel Fernández | Raffaella Bernardi
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Evaluating the Representational Hub of Language and Vision Models
Ravi Shekhar | Ece Takmaz | Raquel Fernández | Raffaella Bernardi
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

The multimodal models used in the emerging field at the intersection of computational linguistics and computer vision implement the bottom-up processing of the “Hub and Spoke” architecture proposed in cognitive science to represent how the brain processes and combines multi-sensory inputs. In particular, the Hub is implemented as a neural network encoder. We investigate the effect on this encoder of various vision-and-language tasks proposed in the literature: visual question answering, visual reference resolution, and visually grounded dialogue. To measure the quality of the representations learned by the encoder, we use two kinds of analyses. First, we evaluate the encoder pre-trained on the different vision-and-language tasks on an existing “diagnostic task” designed to assess multimodal semantic understanding. Second, we carry out a battery of analyses aimed at studying how the encoder merges and exploits the two modalities.

2018

A Distributional Study of Negated Adjectives and Antonyms
Laura Aina | Raffaella Bernardi | Raquel Fernández
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Grounded Textual Entailment
Hoa Trong Vu | Claudio Greco | Aliia Erofeeva | Somayeh Jafaritazehjan | Guido Linders | Marc Tanti | Alberto Testoni | Raffaella Bernardi | Albert Gatt
Proceedings of the 27th International Conference on Computational Linguistics

Capturing semantic relations between sentences, such as entailment, is a long-standing challenge for computational semantics. Logic-based models analyse entailment in terms of possible worlds (interpretations, or situations) where a premise P entails a hypothesis H iff in all worlds where P is true, H is also true. Statistical models view this relationship probabilistically, addressing it in terms of whether a human would likely infer H from P. In this paper, we wish to bridge these two perspectives, by arguing for a visually-grounded version of the Textual Entailment task. Specifically, we ask whether models can perform better if, in addition to P and H, there is also an image (corresponding to the relevant “world” or “situation”). We use a multimodal version of the SNLI dataset (Bowman et al., 2015) and we compare “blind” and visually-augmented models of textual entailment. We show that visual information is beneficial, but we also conduct an in-depth error analysis that reveals that current multimodal models are not performing “grounding” in an optimal fashion.

Ask No More: Deciding when to guess in referential visual dialogue
Ravi Shekhar | Tim Baumgärtner | Aashish Venkatesh | Elia Bruni | Raffaella Bernardi | Raquel Fernandez
Proceedings of the 27th International Conference on Computational Linguistics

Our goal is to explore how the abilities brought in by a dialogue manager can be included in end-to-end visually grounded conversational agents. We make initial steps towards this general goal by augmenting a task-oriented visual dialogue model with a decision-making component that decides whether to ask a follow-up question to identify a target referent in an image, or to stop the conversation to make a guess. Our analyses show that adding a decision-making component produces dialogues that are less repetitive and that include fewer unnecessary questions, thus potentially leading to more efficient and less unnatural interactions.

Comparatives, Quantifiers, Proportions: a Multi-Task Model for the Learning of Quantities from Vision
Sandro Pezzelle | Ionut-Teodor Sorodoc | Raffaella Bernardi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The present work investigates whether different quantification mechanisms (set comparison, vague quantification, and proportional estimation) can be jointly learned from visual scenes by a multi-task computational model. The motivation is that, in humans, these processes underlie the same cognitive, non-symbolic ability, which allows an automatic estimation and comparison of set magnitudes. We show that when information about lower-complexity tasks is available, the higher-level proportional task becomes more accurate than when performed in isolation. Moreover, the multi-task model is able to generalize to unseen combinations of target/non-target objects. Consistent with behavioral evidence showing the interference of absolute number in the proportional task, the multi-task model no longer works when asked to provide the number of target objects in the scene.

Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers
Sandro Pezzelle | Shane Steinert-Threlkeld | Raffaella Bernardi | Jakub Szymanik
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We study the role of linguistic context in predicting quantifiers (‘few’, ‘all’). We collect crowdsourced data from human participants and test various models in a local (single-sentence) and a global (multi-sentence) context condition. Models significantly outperform humans in the former setting and are only slightly better in the latter. While human performance improves with more linguistic context (especially on proportional quantifiers), model performance suffers. Models are very effective in exploiting lexical and morpho-syntactic patterns; humans are better at genuinely understanding the meaning of the (global) context.

2017

Vision and Language Integration: Moving beyond Objects
Ravi Shekhar | Sandro Pezzelle | Aurélie Herbelot | Moin Nabi | Enver Sangineto | Raffaella Bernardi
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

Be Precise or Fuzzy: Learning the Meaning of Cardinals and Quantifiers from Vision
Sandro Pezzelle | Marco Marelli | Raffaella Bernardi
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

People can refer to quantities in a visual scene by using either exact cardinals (e.g. one, two, three) or natural language quantifiers (e.g. few, most, all). In humans, these two processes underlie fairly different cognitive and neural mechanisms. Inspired by this evidence, the present study proposes two models for learning the objective meaning of cardinals and quantifiers from visual scenes containing multiple objects. We show that a model capitalizing on a ‘fuzzy’ measure of similarity is effective for learning quantifiers, whereas the learning of exact cardinals is better accomplished when information about number is provided.

FOIL it! Find One mismatch between Image and Language caption
Ravi Shekhar | Sandro Pezzelle | Yauhen Klimovich | Aurélie Herbelot | Moin Nabi | Enver Sangineto | Raffaella Bernardi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.

Can You See the (Linguistic) Difference? Exploring Mass/Count Distinction in Vision
David Addison Smith | Sandro Pezzelle | Francesca Franzon | Chiara Zanini | Raffaella Bernardi
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2016

There Is No Logical Negation Here, But There Are Alternatives: Modeling Conversational Negation with Distributional Semantics
Germán Kruszewski | Denis Paperno | Raffaella Bernardi | Marco Baroni
Computational Linguistics, Volume 42, Issue 4 - December 2016

Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics
Claire Gardent | Raffaella Bernardi | Ivan Titov
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

“Look, some Green Circles!”: Learning to Quantify from Images
Ionut Sorodoc | Angeliki Lazaridou | Gemma Boleda | Aurélie Herbelot | Sandro Pezzelle | Raffaella Bernardi
Proceedings of the 5th Workshop on Vision and Language

Building a Bagpipe with a Bag and a Pipe: Exploring Conceptual Combination in Vision
Sandro Pezzelle | Ravi Shekhar | Raffaella Bernardi
Proceedings of the 5th Workshop on Vision and Language

The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno | Germán Kruszewski | Angeliki Lazaridou | Ngoc Quan Pham | Raffaella Bernardi | Sandro Pezzelle | Marco Baroni | Gemma Boleda | Raquel Fernández
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Distributional Semantics in Use
Raffaella Bernardi | Gemma Boleda | Raquel Fernández | Denis Paperno
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

2014

TUHOI: Trento Universal Human Object Interaction Dataset
Dieu-Thu Le | Jasper Uijlings | Raffaella Bernardi
Proceedings of the Third Workshop on Vision and Language

SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment
Marco Marelli | Luisa Bentivogli | Marco Baroni | Raffaella Bernardi | Stefano Menini | Roberto Zamparelli
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Frege in Space: A Program for Compositional Distributional Semantics
Marco Baroni | Raffaella Bernardi | Roberto Zamparelli
Linguistic Issues in Language Technology, Volume 9, 2014 - Perspectives on Semantic Representations for Textual Inference

The lexicon of any natural language encodes a huge number of distinct word meanings. Just to understand this article, you will need to know what thousands of words mean. The space of possible sentential meanings is infinite: In this article alone, you will encounter many sentences that express ideas you have never heard before, we hope. Statistical semantics has addressed the issue of the vastness of word meaning by proposing methods to harvest meaning automatically from large collections of text (corpora). Formal semantics in the Fregean tradition has developed methods to account for the infinity of sentential meaning based on the crucial insight of compositionality, the idea that the meaning of sentences is built incrementally by combining the meanings of their constituents. This article sketches a new approach to semantics that brings together ideas from statistical and formal semantics to account, in parallel, for the richness of lexical meaning and the combinatorial power of sentential semantics. We adopt, in particular, the idea from statistical semantics that word meaning can be approximated by the patterns of co-occurrence of words in corpora, and the idea from formal semantics that compositionality can be captured in terms of a syntax-driven calculus of function application.

Coloring Objects: Adjective-Noun Visual Semantic Compositionality
Dat Tien Nguyen | Angeliki Lazaridou | Raffaella Bernardi
Proceedings of the Third Workshop on Vision and Language

A SICK cure for the evaluation of compositional distributional semantic models
Marco Marelli | Stefano Menini | Marco Baroni | Luisa Bentivogli | Raffaella Bernardi | Roberto Zamparelli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Shared and internationally recognized benchmarks are fundamental for the development of any computational system. We aim to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large English benchmark tailored for them. SICK consists of about 10,000 English sentence pairs that include many examples of the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but do not require dealing with other aspects of existing sentential data sets (idiomatic multiword expressions, named entities, telegraphic language) that are not within the scope of CDSMs. By means of crowdsourcing techniques, each pair was annotated for two crucial semantic tasks: relatedness in meaning (with a 5-point rating scale as gold score) and entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral). The SICK data set was used in SemEval-2014 Task 1, and it is freely available for research purposes.

Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)
Alexandre Allauzen | Raffaella Bernardi | Edward Grefenstette | Hugo Larochelle | Christopher Manning | Scott Wen-tau Yih
Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)

2013

Sentence paraphrase detection: When determiners and word order make the difference
Nghia Pham | Raffaella Bernardi | Yao Zhong Zhang | Marco Baroni
Proceedings of the IWCS 2013 Workshop Towards a Formal Distributional Semantics

A relatedness benchmark to test the role of determiners in compositional distributional semantics
Raffaella Bernardi | Georgiana Dinu | Marco Marelli | Marco Baroni
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Exploiting Language Models for Visual Recognition
Dieu-Thu Le | Jasper Uijlings | Raffaella Bernardi
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

CCG Categories for Distributional Semantic Models
Paramita Mirza | Raffaella Bernardi
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Query classification using topic models and support vector machine
Dieu-Thu Le | Raffaella Bernardi
Proceedings of ACL 2012 Student Research Workshop

Entailment above the word level in distributional semantics
Marco Baroni | Raffaella Bernardi | Ngoc-Quynh Do | Chung-chieh Shan
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Query classification via Topic Models for an art image archive
Dieu-Thu Le | Raffaella Bernardi | Ed Vald
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

2010

Towards an Empirically Motivated Typology of Follow-Up Questions: The Role of Dialogue Context
Manuel Kirschner | Raffaella Bernardi
Proceedings of the SIGDIAL 2010 Conference

Context Fusion: The Role of Discourse Structure and Centering Theory
Raffaella Bernardi | Manuel Kirschner | Zorana Ratkovic
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Questions are not asked in isolation. Their context, viz. the preceding interactions, might help to understand them and retrieve the correct answer. Previous research in Interactive Question Answering (IQA) showed that context fusion has great potential to improve the performance of answer retrieval. In this paper, we study how much context, and what elements of it, should be considered to answer Follow-Up Questions (FU Qs). Following previous research, we exploit Logistic Regression Models to learn aspects of dialogue structure relevant to answering FU Qs. We enrich existing models based on shallow features with deep features, relying on the theory of discourse structure of Chai and Jin (2004) and on Centering Theory. Using models trained on realistic IQA data, we show which of the various theoretically motivated features hold up against empirical evidence. We also show that, while these deep features do not outperform the shallow ones on their own, an IQA system's answer correctness increases if the shallow and deep features are combined.

2009

Exploring Topic Continuation Follow-up Questions using Machine Learning
Manuel Kirschner | Raffaella Bernardi
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium

2008

Context Modelling for IQA: the Role of Tasks and Entities
Raffaella Bernardi | Manuel Kirschner
Coling 2008: Proceedings of the workshop on Knowledge and Reasoning for Answering Questions

2007

An Empirical View on IQA Follow-up Questions
Manuel Kirschner | Raffaella Bernardi
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2006

Multilingual Search in Libraries. The case-study of the Free University of Bozen-Bolzano
R. Bernardi | D. Calvanese | L. Dini | V. Di Tomaso | E. Frasnelli | U. Kugler | B. Plank
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents an ongoing project that aims to enhance the OPAC (Online Public Access Catalog) search system of the Library of the Free University of Bozen-Bolzano with multilingual access. The multilingual search system we have developed (MUSIL) integrates advanced linguistic technologies in a user-friendly interface and bridges the gap between the world of free text search and the world of conceptual librarian search. In this paper we present the architecture of the system, its interface and preliminary evaluations of the precision of the search results.

POS tagset design for Italian
Raffaella Bernardi | Andrea Bolognesi | Corrado Seidenari | Fabio Tamburini
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We aim to automatically induce a PoS tagset for Italian by analysing the distributional behaviour of Italian words. To this end, we propose an algorithm that (a) extracts information from loosely labelled dependency structures that encode only basic and broadly accepted syntactic relations, namely Head/Dependent and the distinction of dependents into Argument vs. Adjunct, and (b) derives a possible set of word classes. The paper reports on some preliminary experiments carried out using the induced tagset in conjunction with state-of-the-art PoS taggers. The method proposed to design a proper tagset exploits little, if any, language-specific knowledge: hence it is in principle applicable to any language.

2005

Automatic Induction of a POS Tagset for Italian
Raffaella Bernardi | Andrea Bolognesi | Corrado Seidenari | Fabio Tamburini
Proceedings of the Australasian Language Technology Workshop 2005

2004

Categorial Type Logic meets Dependency Grammar to annotate an Italian corpus
R. Bernardi | A. Bolognesi | F. Tamburini | M. Moortgat
Proceedings of the Workshop on Recent Advances in Dependency Grammar

2000

Deriving polarity effects
Raffaella Bernardi
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

Co-authors

Leonardo Bertolazzi 5

Claudio Greco 5

Manuel Kirschner 5

Barbara Plank 5

Luciana Benotti 4

Marco Marelli 4

David Schlangen 4

Alessandro Suglia 4

Andrea Bolognesi 3

Mario Giulianelli 3

Aurélie Herbelot 3

Alexander Koller 3

Angeliki Lazaridou 3

Mauricio Mazuecos 3

Davide Mazzaccara 3

Denis Paperno 3

Fabio Tamburini 3

Roberto Zamparelli 3

Tim Baumgärtner 2

Luisa Bentivogli 2

Alessandro Bondielli 2

Giovanni Bonetta 2

Eleonora Gualdoni 2

Sherzod Hakimov 2

Michael Hanna 2

Germán Kruszewski 2

Alessandro Lenci 2

Bernardo Magnini 2

Stefano Menini 2

Alessio Miaschi 2

Filippo Momentè 2

Philipp Mondorf 2

Roberto Navigli 2

Enver Sangineto 2

Corrado Seidenari 2

Giovanni Semeraro 2

Ionut Sorodoc 2

Jakub Szymanik 2

Jasper Uijlings 2

Aashish Venkatesh 2

Alexandre Allauzen 1

Anna Bavaresco 1

Frederic Bechet 1

Roland Bernard 1

Tobias Bianchi 1

Zoraida Callejas 1

Diego Calvanese 1

Yun-Nung Chen 1

Shammur Absar Chowdhury 1

Géraldine Damnati 1

Giuseppe "Pino" Di Fabbrizio 1

Vittorio Di Tomaso 1

Georgiana Dinu 1

Ngoc-Quynh Do 1

Giovanni Duca 1

Luis Fernando D’Haro 1

Desmond Elliott 1

Aliia Erofeeva 1

Ambra Ferrari 1

Luca Franceschetti 1

Francesca Franzon 1

Elisabeth Frasnelli 1

Greta Gandolfi 1

Claire Gardent 1

Spandana Gella 1

Carlotta Giacchetta 1

Edward Grefenstette 1

Joakim Gustafson 1

Dilek Hakkani-Tur 1

Somayeh Jafaritazehjani 1

Michael Johnston 1

Christopher Kanan 1

Tatsuya Kawahara 1

Yauhen Klimovich 1

Ulrike Kugler 1

Hugo Larochelle 1

Guido Linders 1

Olga Loginova 1

Maciej Malicki 1

Christopher D. Manning 1

Agata Marcante 1

André F. T. Martins 1

John Mendonça 1

Filippo Merlo 1

Paramita Mirza 1

Barbara Montini 1

Michael Moortgat 1

Seyed Mahed Mousavi 1

Vera Neplenbroek 1

Alexandros Papangelis 1

Lucia Passaro 1

Lucia C. Passaro 1

Federico Pedeni 1

Nghia The Pham 1

Ngoc-Quan Pham 1

Massimo Poesio 1

Zorana Ratkovic 1

Giuseppe Riccardi 1

Kristina Richert 1

Philipp Sadler 1

Marika Sarzotti 1

Antonia Schmidt 1

Chung-chieh Shan 1

David A. Smith 1

Jacopo Staiano 1

Shane Steinert-Threlkeld 1

Michael Sullivan 1

Aditya K Surikuchi 1

Dat Tien Nguyen 1

Serena Tomasi 1

M. Inés Torres 1

Manuel Vargas Guzmán 1

Koichiro Yoshino 1

Chiara Zanini 1

Yao-Zhong Zhang 1

Venues