VISREAS: Complex Visual Reasoning with Unanswerable Questions

Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, Eric Nyberg


Abstract
Verifying a question’s validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should identify the discrepancies in the query and convey them to the user rather than generating the best possible answer. To address this requirement, we introduce VisReas, a new compositional visual question-answering dataset that consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VisReas contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task, validating question answerability with respect to an image before answering, together with the poor performance of state-of-the-art models, inspired the design of a new modular baseline, Logic2Vision, which reasons by producing and executing pseudocode without any external modules to generate the answer. Logic2Vision outperforms generative models on VisReas (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in performance over classification models.
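The idea of validating a question against an image before answering can be illustrated with a minimal sketch. This is not the authors' Logic2Vision model and not the VisReas data format: the toy scene-graph layout, the function name `answer_query`, and its parameters are all hypothetical, chosen only to show a premise check that runs before an answer is produced.

```python
# Hypothetical toy scene graph: a list of objects with attribute dictionaries.
# This layout is an assumption for illustration, not the Visual Genome schema.
scene = [
    {"name": "table", "attributes": {"material": "metal", "color": "brown"}},
    {"name": "cup", "attributes": {"color": "white"}},
]


def answer_query(scene, target, premise, asked):
    """Check the question's premise against the scene, then answer.

    target  -- object name the question refers to (e.g. "table")
    premise -- attributes the question assumes (e.g. {"material": "wooden"})
    asked   -- attribute the question asks about (e.g. "color")
    """
    # Step 1: the referenced object must exist in the scene.
    candidates = [obj for obj in scene if obj["name"] == target]
    if not candidates:
        return f"unanswerable: no {target} in the image"

    # Step 2: at least one candidate must satisfy every assumed attribute.
    matches = [
        obj for obj in candidates
        if all(obj["attributes"].get(k) == v for k, v in premise.items())
    ]
    if not matches:
        return f"unanswerable: no {target} matches {premise}"

    # Step 3: only now read off the requested attribute.
    return matches[0]["attributes"].get(asked, "unknown")


# "What color is the metal table?" has a valid premise.
valid = answer_query(scene, "table", {"material": "metal"}, "color")
# "What color is the wooden table?" assumes a wooden table that is not there.
invalid = answer_query(scene, "table", {"material": "wooden"}, "color")
```

In this sketch the second question is flagged rather than answered with the color of the only table present, which is the behavior the abstract argues an ideal model should have.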
Anthology ID:
2024.findings-acl.402
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6735–6752
URL:
https://aclanthology.org/2024.findings-acl.402
DOI:
10.18653/v1/2024.findings-acl.402
Cite (ACL):
Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, and Eric Nyberg. 2024. VISREAS: Complex Visual Reasoning with Unanswerable Questions. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6735–6752, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
VISREAS: Complex Visual Reasoning with Unanswerable Questions (Akter et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.402.pdf