In-the-Wild Video Question Answering

Santiago Castro, Naihao Deng, Pingxuan Huang, Mihai Burzo, Rada Mihalcea


Abstract
Existing video understanding datasets mostly focus on human interactions, with little attention paid to “in the wild” settings, where videos are recorded outdoors. We propose WILDQA, a video understanding dataset of videos recorded in outdoor settings. In addition to video question answering (Video QA), we introduce the new task of identifying visual support for a given question and answer (Video Evidence Selection). Through evaluations using a wide range of baseline models, we show that WILDQA poses new challenges to the vision and language research communities. The dataset is available at https://lit.eecs.umich.edu/wildqa/.
Anthology ID:
2022.coling-1.496
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
5613–5635
URL:
https://aclanthology.org/2022.coling-1.496
Cite (ACL):
Santiago Castro, Naihao Deng, Pingxuan Huang, Mihai Burzo, and Rada Mihalcea. 2022. In-the-Wild Video Question Answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5613–5635, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
In-the-Wild Video Question Answering (Castro et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.496.pdf
Data
MovieQA, TVQA, TVQA+