2022
What kinds of errors do reference resolution models make and what can we learn from them?
Jorge Sánchez | Mauricio Mazuecos | Hernán Maina | Luciana Benotti
Findings of the Association for Computational Linguistics: NAACL 2022
Reference resolution is the task of identifying the referent of a natural language expression, for example “the woman behind the other woman getting a massage”. In this paper we investigate the kinds of referring expressions on which current transformer-based models fail. Motivated by this analysis, we identify the weakening of natural spatial constraints as one of the causes and propose a model that aims to restore them. We evaluate the proposed model on different datasets for the task, showing improved performance on the most challenging kinds of referring expressions. Finally, we present a thorough analysis of the kinds of errors that the new model improves on and of those that remain open challenges for the task.
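To make the idea of spatial constraints concrete, here is a minimal sketch, not the model from the paper, of how a spatial relation mentioned in a referring expression (e.g. “to the left of”) can be turned into a hard filter over candidate bounding boxes; the box format and the `left_of` rule are illustrative assumptions.

```python
# Illustrative sketch only: filtering candidate boxes against a simple
# spatial constraint extracted from a referring expression.
# Boxes are assumed to be (x_min, y_min, x_max, y_max) in pixels.

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def satisfies_left_of(candidate, landmark):
    """True if the candidate's center lies to the left of the landmark's."""
    return center(candidate)[0] < center(landmark)[0]

def filter_candidates(candidates, landmark, relation="left_of"):
    """Keep only candidates compatible with the stated spatial relation."""
    checks = {"left_of": satisfies_left_of}
    keep = checks[relation]
    return [box for box in candidates if keep(box, landmark)]

# Example: "the chair to the left of the table"
chairs = [(10, 40, 60, 120), (300, 50, 360, 130)]
table = (150, 60, 280, 140)
print(filter_candidates(chairs, table))   # -> [(10, 40, 60, 120)]
```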
2021
The Impact of Answers in Referential Visual Dialog
Mauricio Mazuecos | Patrick Blackburn | Luciana Benotti
Proceedings of the Reasoning and Interaction Conference (ReInAct 2021)
In the visual dialog task GuessWhat?! two players maintain a dialog in order to identify a secret object in an image. Computationally, this is modeled with a question generation module and a guesser module for the questioner role, and an answering model, the Oracle, to answer the generated questions. This raises a question: what is the risk of having an imperfect Oracle model? Here we present work in progress on the impact of different answering models on human-generated questions in GuessWhat?!. We show that access to better quality answers has a direct impact on the guessing task for human dialogs, and we argue that better answers could help train better question generation models.
Region under Discussion for visual dialog
Mauricio Mazuecos | Franco M. Luque | Jorge Sánchez | Hernán Maina | Thomas Vadora | Luciana Benotti
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Visual Dialog is assumed to require the dialog history to generate correct responses during a dialog. However, it is not clear from previous work when dialog history is actually needed for visual dialog. In this paper we define what it means for a visual question to require dialog history, and we release a subset of GuessWhat?! questions whose responses change completely depending on their dialog history. We propose a novel interpretable representation that visually grounds dialog history: the Region under Discussion. It constrains the image’s spatial features according to a semantic representation of the history, inspired by the information-structure notion of Question under Discussion. We evaluate the architecture on task-specific multimodal models and on the visual transformer LXMERT.
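As a rough illustration of the idea, and not the authors’ implementation, the sketch below masks the visual features of region proposals that fall outside a hypothetical Region under Discussion derived from the history, e.g. the left half of the image after “Is it on the left side? Yes”. The array shapes and the containment rule are assumptions.

```python
import numpy as np

# Illustrative sketch: constrain per-region visual features to a
# Region under Discussion (RuD) derived from the dialog history.
# boxes:    (N, 4) array of (x_min, y_min, x_max, y_max) region proposals
# features: (N, D) array of visual features, one row per region

def inside_rud(boxes, rud):
    """Boolean mask of regions whose centers fall inside the RuD box."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    x0, y0, x1, y1 = rud
    return (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)

def constrain_features(features, boxes, rud):
    """Zero out features of regions lying outside the Region under Discussion."""
    mask = inside_rud(boxes, rud)
    return features * mask[:, None]

# Example: after "Is it on the left side? Yes" on a 640x480 image,
# a plausible RuD is the left half of the image.
boxes = np.array([[20, 30, 120, 200], [400, 50, 560, 220]], dtype=float)
features = np.random.rand(2, 8)
left_half = (0, 0, 320, 480)
print(constrain_features(features, boxes, left_half))
```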
2020
Effective questions in referential visual dialogue
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the Fourth Widening Natural Language Processing Workshop
An interesting challenge for situated dialogue systems is referential visual dialog: by asking questions, the system has to identify the referent the user refers to. Task success is the standard metric used to evaluate these systems. However, it does not consider how effective each question is, that is, how much each question contributes to the goal. We propose a new metric that measures question effectiveness. As a preliminary study, we report the new metric for publicly available state-of-the-art models on GuessWhat?!. Surprisingly, successful dialogues do not have a higher percentage of effective questions than failed dialogues. This suggests that a system with high task success is not necessarily one that generates good questions.
On the role of effective and referring questions in GuessWhat?!
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the First Workshop on Advances in Language and Vision Research
Task success is the standard metric used to evaluate referential visual dialogue systems. In this paper we propose two new metrics that evaluate how each question contributes to the goal. First, we measure how effective each question is by evaluating whether the question discards objects that are not the referent. Second, we define referring questions as those that univocally identify one object in the image. We report the new metrics for human dialogues and for publicly available state-of-the-art models on GuessWhat?!. Regarding the first metric, we find that successful dialogues do not have a higher percentage of effective questions for most models. With respect to the second metric, humans ask referring questions at the end of the dialogue, confirming their guess before guessing. Human dialogues that use this strategy have higher task success, but models do not seem to learn it.
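As a minimal sketch of how such metrics could be computed (the paper’s exact definitions may differ in the details), assume that for each answered question we know which candidate objects remain compatible before and after it: an effective question discards at least one non-referent candidate, and a referring question leaves exactly one compatible candidate.

```python
# Illustrative sketch of dialogue-level question metrics, assuming each
# turn records the set of candidate objects still compatible with the
# dialogue before and after the question is answered.

def is_effective(before, after, referent):
    """The answered question discarded at least one non-referent candidate."""
    discarded = before - after
    return len(discarded - {referent}) > 0

def is_referring(after):
    """The answered question leaves exactly one compatible candidate."""
    return len(after) == 1

def dialogue_metrics(turns, referent):
    """Percentage of effective and referring questions in a dialogue.

    turns: list of (candidates_before, candidates_after) sets, one per question.
    """
    n = len(turns)
    effective = sum(is_effective(b, a, referent) for b, a in turns)
    referring = sum(is_referring(a) for _, a in turns)
    return 100.0 * effective / n, 100.0 * referring / n

# Example: three candidates, the referent is "obj2".
turns = [
    ({"obj1", "obj2", "obj3"}, {"obj2", "obj3"}),  # effective
    ({"obj2", "obj3"}, {"obj2", "obj3"}),          # not effective
    ({"obj2", "obj3"}, {"obj2"}),                  # effective and referring
]
print(dialogue_metrics(turns, "obj2"))  # -> (66.66..., 33.33...)
```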
They Are Not All Alike: Answering Different Spatial Questions Requires Different Grounding Strategies
Alberto Testoni | Claudio Greco | Tobias Bianchi | Mauricio Mazuecos | Agata Marcante | Luciana Benotti | Raffaella Bernardi
Proceedings of the Third International Workshop on Spatial Language Understanding
In this paper, we study the grounding skills required to answer spatial questions asked by humans while playing the GuessWhat?! game. We propose a classification of spatial questions, dividing them into absolute, relational, and group questions. We build a new answerer model based on the LXMERT multimodal transformer, and we compare it against a baseline, with and without visual features of the scene. We are interested in how the attention mechanisms of LXMERT are used to answer spatial questions, since these require attending to more than one region simultaneously and spotting the relation that holds among them. We show that our proposed model outperforms the baseline by a large margin (9.70% on spatial questions and 6.27% overall). By analyzing LXMERT errors and its attention mechanisms, we find that our classification helps to gain a better understanding of the skills required to answer different kinds of spatial questions.
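For illustration only, a crude keyword heuristic in the spirit of that three-way classification might look like the sketch below; the keyword lists and the tie-breaking order are assumptions, not the annotation scheme used in the paper.

```python
# Illustrative heuristic, not the paper's annotation scheme: classify a
# spatial question as absolute (located w.r.t. the image frame),
# relational (located relative to another object), or group (about a set
# of objects).

ABSOLUTE = ("left side", "right side", "top of the picture",
            "bottom of the picture", "in the middle", "in the corner")
RELATIONAL = ("next to", "behind", "in front of", "to the left of",
              "to the right of", "on top of", "under", "between")
GROUP = ("one of the", "among the", "in the group", "of the two",
         "closest", "furthest", "first from the left")

def classify_spatial_question(question):
    q = question.lower()
    if any(k in q for k in RELATIONAL):
        return "relational"
    if any(k in q for k in GROUP):
        return "group"
    if any(k in q for k in ABSOLUTE):
        return "absolute"
    return "non-spatial"

print(classify_spatial_question("Is it on the left side of the image?"))     # absolute
print(classify_spatial_question("Is it behind the woman?"))                  # relational
print(classify_spatial_question("Is it one of the two dogs on the right?"))  # group
```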