Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering

Jianguo Mao, Wenbin Jiang, Xiangdong Wang, Zhifan Feng, Yajuan Lyu, Hong Liu, Yong Zhu


Abstract
Existing video question answering (video QA) models lack the capacity for deep video understanding and flexible multistep reasoning. We propose for video QA a novel model which performs dynamic multistep reasoning between questions and videos. It creates video semantic representation based on the video scene graph composed of semantic elements of the video and semantic relations among these elements. Then, it performs multistep reasoning for better answer decision between the representations of the question and the video, and dynamically integrate the reasoning results. Experiments show the significant advantage of the proposed model against previous methods in accuracy and interpretability. Against the existing state-of-the-art model, the proposed model dramatically improves more than 4%/3.1%/2% on the three widely used video QA datasets, MSRVTT-QA, MSRVTT multi-choice, and TGIF-QA, and displays better interpretability by backtracing along with the attention mechanisms to the video scene graphs.
Anthology ID:
2022.naacl-main.286
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3894–3904
Language:
URL:
https://aclanthology.org/2022.naacl-main.286
DOI:
10.18653/v1/2022.naacl-main.286
Bibkey:
Cite (ACL):
Jianguo Mao, Wenbin Jiang, Xiangdong Wang, Zhifan Feng, Yajuan Lyu, Hong Liu, and Yong Zhu. 2022. Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3894–3904, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering (Mao et al., NAACL 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.naacl-main.286.pdf
Video:
 https://aclanthology.org/2022.naacl-main.286.mp4