MIMOQA: Multimodal Input Multimodal Output Question Answering

Hrituraj Singh, Anshul Nasery, Denil Mehta, Aishwarya Agarwal, Jatin Lamba, Balaji Vasan Srinivasan


Abstract
Multimodal research has picked up significantly in the space of question answering, with the task being extended to visual question answering, chart question answering, as well as multimodal input question answering. However, all these explorations produce a unimodal textual output as the answer. In this paper, we propose a novel task - MIMOQA - Multimodal Input Multimodal Output Question Answering, in which the output is also multimodal. Through human experiments, we empirically show that such multimodal outputs provide better cognitive understanding of the answers. We also propose a novel multimodal question-answering framework, MExBERT, that incorporates joint textual and visual attention towards producing such a multimodal output. Our method relies on a novel multimodal dataset curated for this problem from publicly available unimodal datasets. We show the superior performance of MExBERT against strong baselines on both automatic and human metrics.
Anthology ID:
2021.naacl-main.418
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5317–5332
URL:
https://aclanthology.org/2021.naacl-main.418
DOI:
10.18653/v1/2021.naacl-main.418
Cite (ACL):
Hrituraj Singh, Anshul Nasery, Denil Mehta, Aishwarya Agarwal, Jatin Lamba, and Balaji Vasan Srinivasan. 2021. MIMOQA: Multimodal Input Multimodal Output Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5317–5332, Online. Association for Computational Linguistics.
Cite (Informal):
MIMOQA: Multimodal Input Multimodal Output Question Answering (Singh et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.418.pdf
Video:
https://aclanthology.org/2021.naacl-main.418.mp4
Data
Conceptual Captions, MS MARCO, ManyModalQA, Natural Questions, SQuAD, TVQA, Visual Question Answering