Learning to Embed Multi-Modal Contexts for Situated Conversational Agents

Haeju Lee; Oh Joon Kwon; Yunseon Choi; Minho Park; Ran Han; Yoonhyung Kim; Jinhyeon Kim; Youngjune Lee; Haebin Shin; Kangwook Lee; Kee-Eung Kim

doi:10.18653/v1/2022.findings-naacl.61

Abstract

The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e., visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned multi-modal encoder-decoder that incorporates visual inputs and performs all four subtasks at once for efficiency. Using a single unified model, this approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks at the 10th Dialog Systems Technology Challenge (DSTC10), setting a high bar for the novel task of multi-modal task-oriented dialog systems.

Anthology ID: 2022.findings-naacl.61
Volume: Findings of the Association for Computational Linguistics: NAACL 2022
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 813–830
URL: https://aclanthology.org/2022.findings-naacl.61/
DOI: 10.18653/v1/2022.findings-naacl.61
Cite (ACL): Haeju Lee, Oh Joon Kwon, Yunseon Choi, Minho Park, Ran Han, Yoonhyung Kim, Jinhyeon Kim, Youngjune Lee, Haebin Shin, Kangwook Lee, and Kee-Eung Kim. 2022. Learning to Embed Multi-Modal Contexts for Situated Conversational Agents. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 813–830, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): Learning to Embed Multi-Modal Contexts for Situated Conversational Agents (Lee et al., Findings 2022)
PDF: https://aclanthology.org/2022.findings-naacl.61.pdf
Video: https://aclanthology.org/2022.findings-naacl.61.mp4
