Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models

Wei Han, Hui Chen, Min-Yen Kan, Soujanya Poria


Abstract
Image–text models (ITMs) are a prevalent architecture for video question-answering tasks; they require only a few input frames, which saves substantial computational cost compared with video–language models. However, we find that existing ITM video question-answering solutions either 1) adopt simplistic and untargeted sampling strategies, which may miss the key frames that offer answer clues; or 2) sample a large number of frames into divided groups, which available computational resources cannot accommodate. In this work, we aim at an efficient sampling method for few-frame situations. We first summarize a family of prior sampling methods based on question–frame correlation into a unified one, dubbed *Most Implied Frames* (MIF). Through preliminary results and analysis, we form the hypothesis that question-aware sampling is not necessary, from which we further propose a second method, *Most Dominant Frames* (MDF). Experimental results on four public datasets and three advanced ITMs demonstrate that our proposed strategies can boost the performance of image–text pretrained models and apply widely across model architectures and dataset types. Our code is available at https://github.com/declare-lab/Sealing.
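To make the idea of sampling by question–frame correlation concrete, the following is a minimal illustrative sketch, not the authors' MIF implementation. It assumes each frame and the question have already been embedded into the same vector space by some encoder (the embeddings here are hypothetical hand-written vectors) and simply keeps the `k` frames whose embeddings have the highest cosine similarity to the question. The function names `cosine` and `select_frames_by_question` are introduced only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_frames_by_question(frame_embs, question_emb, k):
    """Pick the k frames most similar to the question.

    Hypothetical helper for illustration only: frame_embs is a list of
    frame embedding vectors, question_emb is the question embedding.
    Returns the chosen frame indices in temporal order, plus all scores.
    """
    scores = [cosine(f, question_emb) for f in frame_embs]
    # Rank frame indices from most to least similar to the question.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the model sees frames in sequence.
    return sorted(ranked[:k]), scores


# Toy example: four 2-D "frame embeddings" and one "question embedding".
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
question = [1.0, 0.0]
chosen, scores = select_frames_by_question(frames, question, k=2)
print(chosen)  # [0, 2]
```

A question-agnostic strategy such as MDF would, by the abstract's description, drop the dependence on `question_emb`; how the dominant frames are scored is not specified in the abstract, so it is not sketched here.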
Anthology ID:
2024.findings-naacl.162
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2522–2534
URL:
https://aclanthology.org/2024.findings-naacl.162
Cite (ACL):
Wei Han, Hui Chen, Min-Yen Kan, and Soujanya Poria. 2024. Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2522–2534, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models (Han et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.162.pdf
Copyright:
 2024.findings-naacl.162.copyright.pdf