Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Hyounghun Kim, Zineng Tang, Mohit Bansal


Abstract
Videos convey rich information: dynamic spatio-temporal relationships between people and objects, as well as diverse multimodal events, are present in a single video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions about videos is one task that can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multimodal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, thus giving the model useful extra information (in explicit textual format, to allow easier matching) for answering questions. Moreover, our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier. Finally, we also cast the frame-selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word-, object-, and frame-level visualization studies.
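As a rough illustration of the two frame-selection losses named in the abstract, here is a minimal PyTorch sketch: IOFSM encourages the average score of human-annotated in-frames to exceed that of out-frames by a margin, and BBCE averages the binary cross-entropy separately over in-frames and out-frames so the (typically far more numerous) out-frames do not dominate the gradient. The unit margin, the epsilon, and the equal-weight sum in the usage example are illustrative assumptions based on the abstract's description, not the authors' exact formulation.

```python
import torch

def iofsm_loss(scores, in_mask):
    """In-and-Out Frame Score Margin (sketch): push the average sigmoid
    score of annotated in-frames above that of out-frames.
    scores: (T,) raw frame logits; in_mask: (T,) bool, True for in-frames."""
    probs = torch.sigmoid(scores)
    in_avg = probs[in_mask].mean()
    out_avg = probs[~in_mask].mean()
    # Nonnegative since probs lie in [0, 1]; smaller when in-frames outscore out-frames.
    return 1.0 + out_avg - in_avg

def bbce_loss(scores, in_mask):
    """Balanced BCE (sketch): average the cross-entropy separately over
    in-frames and out-frames to offset the class imbalance between them."""
    probs = torch.sigmoid(scores)
    eps = 1e-8  # assumed numerical-stability constant
    in_term = -torch.log(probs[in_mask] + eps).mean()
    out_term = -torch.log(1.0 - probs[~in_mask] + eps).mean()
    return in_term + out_term

# Hypothetical usage: a 10-frame clip where frames 3..6 are annotated as relevant.
scores = torch.randn(10)
in_mask = torch.zeros(10, dtype=torch.bool)
in_mask[3:7] = True
loss = iofsm_loss(scores, in_mask) + bbce_loss(scores, in_mask)
```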
Anthology ID:
2020.acl-main.435
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
4812–4822
URL:
https://aclanthology.org/2020.acl-main.435
DOI:
10.18653/v1/2020.acl-main.435
Cite (ACL):
Hyounghun Kim, Zineng Tang, and Mohit Bansal. 2020. Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4812–4822, Online. Association for Computational Linguistics.
Cite (Informal):
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA (Kim et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.435.pdf
Video:
http://slideslive.com/38929110
Code:
hyounghk/VideoQADenseCapFrameGate-ACL2020
Data:
TVQA, TVQA+, Visual Question Answering