@inproceedings{anderson-etal-2018-connecting,
title = "Connecting Language and Vision to Actions",
author = "Anderson, Peter and
Das, Abhishek and
Wu, Qi",
editor = "Artzi, Yoav and
Eisenstein, Jacob",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts",
month = jul,
year = "2018",
address = "Melbourne, Australia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P18-5004",
doi = "10.18653/v1/P18-5004",
pages = "10--14",
abstract = "A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made incredible progress {--} from being able to generate natural language descriptions of images/videos, to answering questions about them, to even holding free-form conversations about visual content! However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments. To reduce the entry barrier for new researchers, this tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning, visual question answering (VQA) and visual dialog, presenting the key architectural building blocks (such as co-attention) and novel algorithms (such as cooperative/adversarial games) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose.",
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="anderson-etal-2018-connecting">
<titleInfo>
<title>Connecting Language and Vision to Actions</title>
</titleInfo>
<name type="personal">
<namePart type="given">Peter</namePart>
<namePart type="family">Anderson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Abhishek</namePart>
<namePart type="family">Das</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Qi</namePart>
<namePart type="family">Wu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2018-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yoav</namePart>
<namePart type="family">Artzi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jacob</namePart>
<namePart type="family">Eisenstein</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Melbourne, Australia</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made incredible progress – from being able to generate natural language descriptions of images/videos, to answering questions about them, to even holding free-form conversations about visual content! However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments. To reduce the entry barrier for new researchers, this tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning, visual question answering (VQA) and visual dialog, presenting the key architectural building blocks (such as co-attention) and novel algorithms (such as cooperative/adversarial games) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose.</abstract>
<identifier type="citekey">anderson-etal-2018-connecting</identifier>
<identifier type="doi">10.18653/v1/P18-5004</identifier>
<location>
<url>https://aclanthology.org/P18-5004</url>
</location>
<part>
<date>2018-07</date>
<extent unit="page">
<start>10</start>
<end>14</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Connecting Language and Vision to Actions
%A Anderson, Peter
%A Das, Abhishek
%A Wu, Qi
%Y Artzi, Yoav
%Y Eisenstein, Jacob
%S Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
%D 2018
%8 July
%I Association for Computational Linguistics
%C Melbourne, Australia
%F anderson-etal-2018-connecting
%X A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made incredible progress – from being able to generate natural language descriptions of images/videos, to answering questions about them, to even holding free-form conversations about visual content! However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments. To reduce the entry barrier for new researchers, this tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning, visual question answering (VQA) and visual dialog, presenting the key architectural building blocks (such as co-attention) and novel algorithms (such as cooperative/adversarial games) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose.
%R 10.18653/v1/P18-5004
%U https://aclanthology.org/P18-5004
%U https://doi.org/10.18653/v1/P18-5004
%P 10-14

Markdown (Informal)
[Connecting Language and Vision to Actions](https://aclanthology.org/P18-5004) (Anderson et al., ACL 2018)

ACL
Peter Anderson, Abhishek Das, and Qi Wu. 2018. Connecting Language and Vision to Actions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 10–14, Melbourne, Australia. Association for Computational Linguistics.