Sandro Pezzelle


2021

Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Marius Mosbach | Michael A. Hedderich | Sandro Pezzelle | Aditya Mogadala | Dietrich Klakow | Marie-Francine Moens | Zeynep Akata
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

EaSe: A Diagnostic Tool for VQA based on Answer Diversity
Shailza Jolly | Sandro Pezzelle | Moin Nabi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose EASE, a simple diagnostic tool for Visual Question Answering (VQA) which quantifies the difficulty of an (image, question) sample. EASE is based on the pattern of answers provided by multiple annotators to a given question. In particular, it considers two aspects of the answers: (i) their Entropy; (ii) their Semantic content. First, we prove the validity of our diagnostic to identify samples that are easy/hard for state-of-the-art VQA models. Second, we show that EASE can be successfully used to select the most-informative samples for training/fine-tuning. Crucially, only information that is readily available in any VQA dataset is used to compute its scores.
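
The entropy aspect mentioned in this abstract can be illustrated with a minimal sketch. It assumes the annotator answers for one question are available as a list of strings and uses normalized Shannon entropy; the function name `answer_entropy` and the normalization are illustrative assumptions, not the paper's definition, and the semantic aspect of EASE is not modeled here.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Normalized Shannon entropy of annotator answers.

    Hypothetical helper: 0.0 means all annotators agree,
    1.0 means every annotator gave a different answer.
    """
    # Light normalization so that "Yes" and "yes " count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("answers must not be empty")
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Divide by the maximum entropy reachable with this many annotators.
    max_entropy = math.log2(total) if total > 1 else 1.0
    return entropy / max_entropy
```

Under these assumptions, a question whose answers are all identical scores 0, while a question with no two matching answers scores 1, so higher values flag samples with more annotator disagreement.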

Probing Cross-Modal Representations in Multi-Step Relational Reasoning
Iuliia Parfenova | Desmond Elliott | Raquel Fernández | Sandro Pezzelle
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

We investigate the representations learned by vision and language models in tasks that require relational reasoning. Focusing on the problem of assessing the relative size of objects in abstract visual contexts, we analyse both one-step and two-step reasoning. For the latter, we construct a new dataset of three-image scenes and define a task that requires reasoning at the level of the individual images and across images in a scene. We probe the learned model representations using diagnostic classifiers. Our experiments show that pretrained multimodal transformer-based architectures can perform higher-level relational reasoning, and are able to learn representations for novel tasks and data that are very different from what was seen in pretraining.

2020

Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Aditya Mogadala | Sandro Pezzelle | Dietrich Klakow | Marie-Francine Moens | Zeynep Akata
Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts
Ece Takmaz | Mario Giulianelli | Sandro Pezzelle | Arabella Sinclair | Raquel Fernández
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Dialogue participants often refer to entities or situations repeatedly within a conversation, which contributes to its cohesiveness. Subsequent references exploit the common ground accumulated by the interlocutors and hence have several interesting properties, namely, they tend to be shorter and reuse expressions that were effective in previous mentions. In this paper, we tackle the generation of first and subsequent references in visually grounded dialogue. We propose a generation model that produces referring utterances grounded in both the visual and the conversational context. To assess the referring effectiveness of its output, we also implement a reference resolution system. Our experiments and analyses show that the model produces better, more effective referring utterances than a model not grounded in the dialogue context, and generates subsequent references that exhibit linguistic patterns akin to those of humans.

Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze
Ece Takmaz | Sandro Pezzelle | Lisa Beinborn | Raquel Fernández
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural—particularly when gaze is encoded with a dedicated recurrent component.

Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision
Sandro Pezzelle | Claudio Greco | Greta Gandolfi | Eleonora Gualdoni | Raffaella Bernardi
Findings of the Association for Computational Linguistics: EMNLP 2020

This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made to develop universal multimodal encoders suitable for virtually any language and vision task. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where either modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against human speakers. We show that, while the task is relatively easy for humans, best-performing models struggle to achieve similar results.

2019

Is the Red Square Big? MALeViC: Modeling Adjectives Leveraging Visual Contexts
Sandro Pezzelle | Raquel Fernández
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This work aims at modeling how the meaning of gradable adjectives of size (‘big’, ‘small’) can be learned from visually-grounded contexts. Inspired by cognitive and linguistic evidence showing that the use of these expressions relies on setting a threshold that is dependent on a specific context, we investigate the ability of multi-modal models to assess whether an object is ‘big’ or ‘small’ in a given visual scene. In contrast with the standard computational approach that simplistically treats gradable adjectives as ‘fixed’ attributes, we pose the problem as relational: to be successful, a model has to consider the full visual context. By means of four main tasks, we show that state-of-the-art models (but not a relatively strong baseline) can learn the function subtending the meaning of size adjectives, though their performance is found to decrease while moving from simple to more complex tasks. Crucially, models fail to develop abstract representations of gradable adjectives that can be used compositionally.
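
The idea of a context-dependent threshold described in this abstract can be sketched as follows. This is an illustrative assumption rather than the paper's definition: an object counts as ‘big’ when its size exceeds a threshold placed between the smallest and largest objects in the same scene. The function name `is_big` and the parameter `k` (with its default value) are hypothetical.

```python
def is_big(target_size, scene_sizes, k=0.3):
    """Return True if the target counts as 'big' in this scene.

    Hypothetical rule: the threshold sits a fraction k below the
    largest object, measured on the scene's own size range.
    """
    smallest, largest = min(scene_sizes), max(scene_sizes)
    threshold = largest - k * (largest - smallest)
    return target_size > threshold


# The same object size gets a different label depending on the scene.
print(is_big(8, [2, 4, 8, 10]))   # threshold is about 7.6
print(is_big(8, [8, 20, 30]))     # threshold is about 23.4
```

The point of the sketch is only that the label is relational: it cannot be computed from the target object alone, which is why a fixed-attribute treatment of ‘big’ and ‘small’ falls short.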

Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Aditya Mogadala | Dietrich Klakow | Sandro Pezzelle | Marie-Francine Moens
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Big Generalizations with Small Data: Exploring the Role of Training Samples in Learning Adjectives of Size
Sandro Pezzelle | Raquel Fernández
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

In this paper, we experiment with a recently proposed visual reasoning task dealing with quantities – modeling the multimodal, contextually-dependent meaning of size adjectives (‘big’, ‘small’) – and explore the impact of varying the training data on the learning behavior of a state-of-the-art system. In previous work, models have been shown to fail to generalize to unseen adjective-noun combinations. Here, we investigate whether, and to what extent, seeing some of these cases during training helps a model understand the rule subtending the task, i.e., that being big implies being not small, and vice versa. We show that relatively few examples are enough to understand this relationship, and that developing a specific, mutually exclusive representation of size adjectives is beneficial to the task.

Quantifiers in a Multimodal World: Hallucinating Vision with Language and Sound
Alberto Testoni | Sandro Pezzelle | Raffaella Bernardi
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Inspired by the literature on multisensory integration, we develop a computational model to ground quantifiers in perception. The model learns to pick, out of nine quantifiers (‘few’, ‘many’, ‘all’, etc.), the one that is most likely to describe the percentage of animals in a visual-auditory input containing both animals and artifacts. We show that relying on concurrent sensory inputs increases model performance on the quantification task. Moreover, we evaluate the model in a situation in which only the auditory modality is given, while the visual one is ‘hallucinated’ either from the auditory input itself or from a linguistic caption describing the quantity of entities in the auditory input. This way, the model exploits prior associations between modalities. We show that the model profits from the prior knowledge and outperforms the auditory-only setting.

2018

Comparatives, Quantifiers, Proportions: a Multi-Task Model for the Learning of Quantities from Vision
Sandro Pezzelle | Ionut-Teodor Sorodoc | Raffaella Bernardi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The present work investigates whether different quantification mechanisms (set comparison, vague quantification, and proportional estimation) can be jointly learned from visual scenes by a multi-task computational model. The motivation is that, in humans, these processes underlie the same cognitive, non-symbolic ability, which allows an automatic estimation and comparison of set magnitudes. We show that when information about lower-complexity tasks is available, the higher-level proportional task becomes more accurate than when performed in isolation. Moreover, the multi-task model is able to generalize to unseen combinations of target/non-target objects. Consistent with behavioral evidence showing the interference of absolute number in the proportional task, the multi-task model no longer works when asked to provide the number of target objects in the scene.

Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers
Sandro Pezzelle | Shane Steinert-Threlkeld | Raffaella Bernardi | Jakub Szymanik
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We study the role of linguistic context in predicting quantifiers (‘few’, ‘all’). We collect crowdsourced data from human participants and test various models in a local (single-sentence) and a global context (multi-sentence) condition. Models significantly outperform humans in the former setting and are only slightly better in the latter. While human performance improves with more linguistic context (especially on proportional quantifiers), model performance suffers. Models are very effective in exploiting lexical and morpho-syntactic patterns; humans are better at genuinely understanding the meaning of the (global) context.

2017

Be Precise or Fuzzy: Learning the Meaning of Cardinals and Quantifiers from Vision
Sandro Pezzelle | Marco Marelli | Raffaella Bernardi
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

People can refer to quantities in a visual scene by using either exact cardinals (e.g. one, two, three) or natural language quantifiers (e.g. few, most, all). In humans, these two processes underlie fairly different cognitive and neural mechanisms. Inspired by this evidence, the present study proposes two models for learning the objective meaning of cardinals and quantifiers from visual scenes containing multiple objects. We show that a model capitalizing on a ‘fuzzy’ measure of similarity is effective for learning quantifiers, whereas the learning of exact cardinals is better accomplished when information about number is provided.

FOIL it! Find One mismatch between Image and Language caption
Ravi Shekhar | Sandro Pezzelle | Yauhen Klimovich | Aurélie Herbelot | Moin Nabi | Enver Sangineto | Raffaella Bernardi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.
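
The notion of a ‘foil’ caption (a caption that differs from the correct one by a single word) can be sketched with a toy example. The helper names `make_foil` and `find_foil_position` are hypothetical, and the whitespace tokenization is a simplifying assumption; this is not the procedure used to build FOIL-COCO.

```python
def make_foil(caption, target_word, foil_word):
    """Replace the first occurrence of target_word with foil_word.

    Returns the foil caption and the index of the replaced token.
    """
    words = caption.split()
    if target_word not in words:
        raise ValueError(f"{target_word!r} does not occur in the caption")
    idx = words.index(target_word)
    foiled = words[:idx] + [foil_word] + words[idx + 1:]
    return " ".join(foiled), idx


def find_foil_position(original, foiled):
    """Return the index of the single differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(original.split(), foiled.split())):
        if a != b:
            return i
    return None


foil, position = make_foil("a dog sits on the sofa", "dog", "cat")
print(foil, position)
```

In the setting described by the abstract, a model does not see the original caption when detecting the foil word, so `find_foil_position` only illustrates what the gold label for the detection task would look like.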

Vision and Language Integration: Moving beyond Objects
Ravi Shekhar | Sandro Pezzelle | Aurélie Herbelot | Moin Nabi | Enver Sangineto | Raffaella Bernardi
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

Can You See the (Linguistic) Difference? Exploring Mass/Count Distinction in Vision
David Addison Smith | Sandro Pezzelle | Francesca Franzon | Chiara Zanini | Raffaella Bernardi
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

2016

The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno | Germán Kruszewski | Angeliki Lazaridou | Ngoc Quan Pham | Raffaella Bernardi | Sandro Pezzelle | Marco Baroni | Gemma Boleda | Raquel Fernández
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Building a Bagpipe with a Bag and a Pipe: Exploring Conceptual Combination in Vision
Sandro Pezzelle | Ravi Shekhar | Raffaella Bernardi
Proceedings of the 5th Workshop on Vision and Language

“Look, some Green Circles!”: Learning to Quantify from Images
Ionut Sorodoc | Angeliki Lazaridou | Gemma Boleda | Aurélie Herbelot | Sandro Pezzelle | Raffaella Bernardi
Proceedings of the 5th Workshop on Vision and Language