Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, Byoung-Tak Zhang


Abstract
Video Question Answering is a task that requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) performing cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question’s intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.
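The question-guided fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the fusion is computed, so the dot-product scoring, the softmax weighting, and the function name `question_guided_fusion` are all assumptions chosen for illustration. The sketch only shows the general idea of weighting a motion feature and an appearance feature by their relevance to a question vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def question_guided_fusion(question, motion, appearance):
    """Hypothetical fusion: weight each modality by its similarity to the question.

    All three arguments are equal-length feature vectors (lists of floats).
    Returns the fused vector and the two modality weights (motion, appearance).
    """
    # Assumption: relevance is scored with a plain dot product.
    weights = softmax([dot(question, motion), dot(question, appearance)])
    fused = [weights[0] * m + weights[1] * a for m, a in zip(motion, appearance)]
    return fused, weights


# A question vector aligned with the motion feature shifts weight toward motion.
fused, weights = question_guided_fusion([1.0, 0.0], [2.0, 0.0], [0.0, 2.0])
```

In this toy call the motion feature receives the larger weight because it is more similar to the question vector; a question aligned with the appearance feature would shift the weight the other way.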
Anthology ID:
2021.acl-long.481
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
August
Year:
2021
Address:
Online
Editors:
Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
6167–6177
URL:
https://aclanthology.org/2021.acl-long.481
DOI:
10.18653/v1/2021.acl-long.481
Cite (ACL):
Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, and Byoung-Tak Zhang. 2021. Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6167–6177, Online. Association for Computational Linguistics.
Cite (Informal):
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering (Seo et al., ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-long.481.pdf
Video:
https://aclanthology.org/2021.acl-long.481.mp4
Code:
ahjeongseo/MASN-pytorch
Data:
Visual Question Answering