Exploring Question Guidance and Answer Calibration for Visually Grounded Video Question Answering

Yuanxing Xu, Yuting Wei, Shuai Zhong, Xinming Chen, Jinsheng Qi, Bin Wu


Abstract
Video Question Answering (VideoQA) tasks require not only correct answers but also visual evidence. The “localize-then-answer” strategy improves accuracy and interpretability, but it is hampered by the lack of temporal localization labels in VideoQA datasets. Existing methods often train a model’s localization capability indirectly using QA labels, which leads to inaccurate localization. Moreover, our experiments show that, despite high accuracy, current models depend too heavily on language shortcuts or spurious correlations with irrelevant visual context. To address these issues, we propose a Question-Guided and Answer-Calibrated TRansformer (QGAC-TR), which guides and calibrates localization using question and option texts without localization labels. We also design two self-supervised learning tasks to further refine the model’s localization capabilities. Extensive experiments on three public datasets focused on temporal and causal reasoning show that our model not only achieves accuracy comparable to large-scale pretrained models but also leads in localization performance. Code will be available on GitHub.
Anthology ID:
2024.findings-emnlp.176
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3121–3133
URL:
https://aclanthology.org/2024.findings-emnlp.176
Cite (ACL):
Yuanxing Xu, Yuting Wei, Shuai Zhong, Xinming Chen, Jinsheng Qi, and Bin Wu. 2024. Exploring Question Guidance and Answer Calibration for Visually Grounded Video Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3121–3133, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Exploring Question Guidance and Answer Calibration for Visually Grounded Video Question Answering (Xu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.176.pdf