Connecting Language and Vision to Actions

Peter Anderson, Abhishek Das, Qi Wu


Abstract
A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, research at the intersection of language and vision has made remarkable progress: from generating natural language descriptions of images and videos, to answering questions about them, to holding free-form conversations about visual content. However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world. They have no recourse when a question cannot be answered from the current view, or when they are asked to move or manipulate something. Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments. To lower the entry barrier for new researchers, this tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning, visual question answering (VQA) and visual dialog, presenting the key architectural building blocks (such as co-attention) and novel algorithms (such as cooperative/adversarial games) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently released interactive 3D simulation environments designed for this purpose.
Anthology ID:
P18-5004
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Yoav Artzi, Jacob Eisenstein
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
10–14
URL:
https://aclanthology.org/P18-5004
DOI:
10.18653/v1/P18-5004
Cite (ACL):
Peter Anderson, Abhishek Das, and Qi Wu. 2018. Connecting Language and Vision to Actions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 10–14, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Connecting Language and Vision to Actions (Anderson et al., ACL 2018)
PDF:
https://aclanthology.org/P18-5004.pdf
Data:
Visual Question Answering