Selectively Answering Visual Questions

Julian Eisenschlos, Hernán Maina, Guido Ivetta, Luciana Benotti


Abstract
Recently, large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks such as captioning and visual question answering (VQA) with unprecedented accuracy. Applications such as assistance for blind or visually impaired users have a critical need for precise answers. It is especially important for models to be well calibrated and able to quantify their uncertainty in order to selectively decide when to answer and when to abstain or ask for clarification. We perform the first in-depth analysis of calibration methods and metrics for VQA with in-context learning LMMs. Studying VQA on two answerability benchmarks, we show that the likelihood score of visually grounded models is better calibrated than that of their text-only counterparts for in-context learning, where sampling-based methods are generally superior but no clear winner emerges. We propose Avg BLEU, a calibration score combining the benefits of both sampling and likelihood methods across modalities.
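To make the sampling-based side of this idea concrete, here is a minimal sketch of how an Avg BLEU-style confidence score could be computed and used to decide between answering and abstaining. The function names, whitespace tokenization, NLTK BLEU with smoothing, and abstention threshold are all illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a sampling-based "Avg BLEU"-style confidence score for
# selective VQA. Names, tokenization, and the threshold are assumptions made for
# illustration, not the authors' exact recipe.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def avg_bleu_confidence(greedy_answer: str, sampled_answers: list[str]) -> float:
    """Mean BLEU of the greedy answer against k temperature-sampled answers.

    Intuition: if the model's samples agree with its greedy answer, the score
    is high and the model is likely confident; divergent samples pull it down.
    """
    smooth = SmoothingFunction().method1  # avoid zero scores on short answers
    hypothesis = greedy_answer.lower().split()
    scores = [
        sentence_bleu([sample.lower().split()], hypothesis, smoothing_function=smooth)
        for sample in sampled_answers
    ]
    return sum(scores) / len(scores) if scores else 0.0


def selectively_answer(greedy_answer: str, sampled_answers: list[str],
                       threshold: float = 0.5) -> str:
    """Answer only when the confidence clears the threshold; otherwise abstain."""
    confidence = avg_bleu_confidence(greedy_answer, sampled_answers)
    return greedy_answer if confidence >= threshold else "[abstain]"


if __name__ == "__main__":
    # Toy usage: consistent samples keep the answer; disagreement would abstain.
    samples = ["a red bicycle", "red bike", "a blue car"]
    print(selectively_answer("a red bicycle", samples, threshold=0.3))
```

In this reading, agreement among stochastic samples stands in for the model's uncertainty, which is what lets a single threshold trade coverage against accuracy in selective prediction.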
Anthology ID: 2024.findings-acl.250
Volume: Findings of the Association for Computational Linguistics ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand and virtual meeting
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4219–4229
URL: https://aclanthology.org/2024.findings-acl.250
Cite (ACL): Julian Eisenschlos, Hernán Maina, Guido Ivetta, and Luciana Benotti. 2024. Selectively Answering Visual Questions. In Findings of the Association for Computational Linguistics ACL 2024, pages 4219–4229, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Selectively Answering Visual Questions (Eisenschlos et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.250.pdf