@inproceedings{kamezawa-etal-2020-visually,
title = "A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses",
author = "Kamezawa, Hisashi and
Nishida, Noriki and
Shimizu, Nobuyuki and
Miyazaki, Takashi and
Nakayama, Hideki",
editor = "Webber, Bonnie and
Cohn, Trevor and
He, Yulan and
Liu, Yang",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.267",
doi = "10.18653/v1/2020.emnlp-main.267",
pages = "3299--3310",
abstract = "In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understand their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents{'} verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and the production of non-verbal responses is a challenging task like that of verbal responses. Our dataset is publicly available.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="kamezawa-etal-2020-visually">
<titleInfo>
<title>A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses</title>
</titleInfo>
<name type="personal">
<namePart type="given">Hisashi</namePart>
<namePart type="family">Kamezawa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Noriki</namePart>
<namePart type="family">Nishida</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nobuyuki</namePart>
<namePart type="family">Shimizu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Takashi</namePart>
<namePart type="family">Miyazaki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hideki</namePart>
<namePart type="family">Nakayama</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2020-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Bonnie</namePart>
<namePart type="family">Webber</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Trevor</namePart>
<namePart type="family">Cohn</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yulan</namePart>
<namePart type="family">He</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yang</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Online</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understand their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents’ verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and the production of non-verbal responses is a challenging task like that of verbal responses. Our dataset is publicly available.</abstract>
<identifier type="citekey">kamezawa-etal-2020-visually</identifier>
<identifier type="doi">10.18653/v1/2020.emnlp-main.267</identifier>
<location>
<url>https://aclanthology.org/2020.emnlp-main.267</url>
</location>
<part>
<date>2020-11</date>
<extent unit="page">
<start>3299</start>
<end>3310</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses
%A Kamezawa, Hisashi
%A Nishida, Noriki
%A Shimizu, Nobuyuki
%A Miyazaki, Takashi
%A Nakayama, Hideki
%Y Webber, Bonnie
%Y Cohn, Trevor
%Y He, Yulan
%Y Liu, Yang
%S Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F kamezawa-etal-2020-visually
%X In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understand their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents’ verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and the production of non-verbal responses is a challenging task like that of verbal responses. Our dataset is publicly available.
%R 10.18653/v1/2020.emnlp-main.267
%U https://aclanthology.org/2020.emnlp-main.267
%U https://doi.org/10.18653/v1/2020.emnlp-main.267
%P 3299-3310
Markdown (Informal)
[A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses](https://aclanthology.org/2020.emnlp-main.267) (Kamezawa et al., EMNLP 2020)
ACL
Hisashi Kamezawa, Noriki Nishida, Nobuyuki Shimizu, Takashi Miyazaki, and Hideki Nakayama. 2020. A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3299–3310, Online. Association for Computational Linguistics.