A Corpus for Reasoning about Natural Language Grounded in Photographs

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, Yoav Artzi


Abstract
We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.
Anthology ID:
P19-1644
Original:
P19-1644v1
Version 2:
P19-1644v2
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
6418–6428
URL:
https://aclanthology.org/P19-1644
DOI:
10.18653/v1/P19-1644
PDF:
https://aclanthology.org/P19-1644.pdf
Supplementary:
 P19-1644.Supplementary.pdf
Code
 lil-lab/nlvr
Data
CLEVR, CLEVR-Humans, COCO, NLVR, Visual Question Answering