Visual Goal-Step Inference using wikiHow

Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch


Abstract
Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100M, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
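The VGSI task is a four-way multiple-choice problem: given a textual goal, score each of four candidate step images and pick the most plausible one. The following is a minimal illustrative sketch of that evaluation setup using a generic pretrained image-text model (CLIP) as a stand-in scorer; it is not the authors' exact method, and the image file names are hypothetical placeholders.

```python
# Minimal sketch of the VGSI multiple-choice setup: score a textual goal
# against four candidate step images and select the highest-scoring one.
# CLIP is used here only as an illustrative goal-image scorer; the paper
# benchmarks several multimodal models. Image paths are hypothetical.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

goal = "How to make a paper airplane"
candidate_paths = ["step_a.jpg", "step_b.jpg", "step_c.jpg", "step_d.jpg"]  # hypothetical files
images = [Image.open(p) for p in candidate_paths]

inputs = processor(text=[goal], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (1, 4): similarity of the goal to each candidate image.
scores = outputs.logits_per_text.squeeze(0)
predicted = int(scores.argmax())
print(f"Predicted step image: {candidate_paths[predicted]}")
```

Accuracy on the task is then simply the fraction of examples where the predicted image matches the ground-truth step.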
Anthology ID:
2021.emnlp-main.165
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2167–2179
URL:
https://aclanthology.org/2021.emnlp-main.165
DOI:
10.18653/v1/2021.emnlp-main.165
PDF:
https://aclanthology.org/2021.emnlp-main.165.pdf
Code
 yueyang1996/wikihow-vgsi
Data
wikiHow-image
COIN
HowTo100M