2024
Exploring Question Guidance and Answer Calibration for Visually Grounded Video Question Answering
Yuanxing Xu | Yuting Wei | Shuai Zhong | Xinming Chen | Jinsheng Qi | Bin Wu
Findings of the Association for Computational Linguistics: EMNLP 2024
Video Question Answering (VideoQA) tasks require not only correct answers but also visual evidence. The “localize-then-answer” strategy, while enhancing accuracy and interpretability, faces challenges due to the lack of temporal localization labels in VideoQA datasets. Existing methods often train the models’ localization capabilities indirectly using QA labels, leading to inaccurate localization. Moreover, our experiments show that despite high accuracy, current models depend too heavily on language shortcuts or spurious correlations with irrelevant visual context. To address these issues, we propose a Question-Guided and Answer-Calibrated TRansformer (QGAC-TR), which guides and calibrates localization using question and option texts without localization labels. Furthermore, we design two self-supervised learning tasks to further refine the model’s localization capabilities. Extensive experiments on three public datasets focused on temporal and causal reasoning show that our model not only achieves accuracy comparable to large-scale pretrained models but also excels in localization. Code will be available on GitHub.