Probing Cross-Modal Representations in Multi-Step Relational Reasoning

Iuliia Parfenova, Desmond Elliott, Raquel Fernández, Sandro Pezzelle


Abstract
We investigate the representations learned by vision and language models in tasks that require relational reasoning. Focusing on the problem of assessing the relative size of objects in abstract visual contexts, we analyse both one-step and two-step reasoning. For the latter, we construct a new dataset of three-image scenes and define a task that requires reasoning at the level of the individual images and across images in a scene. We probe the learned model representations using diagnostic classifiers. Our experiments show that pretrained multimodal transformer-based architectures can perform higher-level relational reasoning, and are able to learn representations for novel tasks and data that are very different from what was seen in pretraining.
Anthology ID:
2021.repl4nlp-1.16
Volume:
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Month:
August
Year:
2021
Address:
Online
Venue:
RepL4NLP
Publisher:
Association for Computational Linguistics
Pages:
152–162
URL:
https://aclanthology.org/2021.repl4nlp-1.16
DOI:
10.18653/v1/2021.repl4nlp-1.16
Cite (ACL):
Iuliia Parfenova, Desmond Elliott, Raquel Fernández, and Sandro Pezzelle. 2021. Probing Cross-Modal Representations in Multi-Step Relational Reasoning. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 152–162, Online. Association for Computational Linguistics.
Cite (Informal):
Probing Cross-Modal Representations in Multi-Step Relational Reasoning (Parfenova et al., RepL4NLP 2021)
PDF:
https://aclanthology.org/2021.repl4nlp-1.16.pdf
Code
 jig-san/multi-step-size-reasoning
Data
CLEVR
NLVR