Causal and Temporal Inference in Visual Question Generation by Utilizing Pre-trained Models

Zhanghao Hu, Frank Keller


Abstract
Visual Question Generation is a task at the crossroads of visual and language learning, impacting broad domains like education, medicine, and social media. While existing pre-trained models excel in fact-based queries with image pairs, they fall short of capturing human-like inference, particularly in understanding causal and temporal relationships within videos. Additionally, the computational demands of prevalent pre-training methods pose challenges. In response, our study introduces a framework that leverages vision-text matching pre-trained models to guide language models in recognizing event-entity relationships within videos and generating inferential questions. Demonstrating efficacy on the NExT-QA dataset, which is designed for causal and temporal inference in visual question answering, our method successfully guides pre-trained language models in recognizing video content. We present methodologies for abstracting causal and temporal relationships between events and entities, pointing out the importance of consistent relationships among input frames during training and inference phases and suggesting an avenue for future exploration.
Anthology ID:
2024.alvr-1.12
Volume:
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, William Wang
Venues:
ALVR | WS
Publisher:
Association for Computational Linguistics
Pages:
138–154
URL:
https://aclanthology.org/2024.alvr-1.12
DOI:
10.18653/v1/2024.alvr-1.12
Cite (ACL):
Zhanghao Hu and Frank Keller. 2024. Causal and Temporal Inference in Visual Question Generation by Utilizing Pre-trained Models. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 138–154, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Causal and Temporal Inference in Visual Question Generation by Utilizing Pre-trained Models (Hu & Keller, ALVR-WS 2024)
PDF:
https://aclanthology.org/2024.alvr-1.12.pdf