Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Neelabh Sinha, Vinija Jain, Aman Chadha


Abstract
Visual Question-Answering (VQA) has become central to user experience, particularly since Vision-Language Models (VLMs) gained stronger generalization capabilities. However, evaluating VLMs against an application's requirements under a standardized, practical framework remains challenging. This paper addresses that gap with an end-to-end framework. We present VQA360, a novel dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types, enabling comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric built on GPT-4o, which achieves a correlation of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, making the right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths while providing additional advantages. Our framework can also be extended to other tasks.
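The abstract does not spell out how GoEval queries GPT-4o, so the following is only a minimal sketch of the general GPT-4o-as-judge pattern it describes: the judge sees the image, question, reference answer, and candidate answer, and returns a verdict. The prompt wording, the `judge_vqa_answer` helper, and the binary scoring scheme are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a GPT-4o-as-judge VQA metric in the spirit of GoEval.
# The prompt and scoring scheme below are illustrative assumptions;
# the paper's actual GoEval prompt/protocol may differ.
import base64
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

def judge_vqa_answer(image_path: str, question: str,
                     reference: str, candidate: str) -> bool:
    """Ask GPT-4o whether a candidate VQA answer is acceptable for the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        f"Question about the image: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Is the candidate answer correct for this image? Reply YES or NO."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

Validating such a metric would follow the abstract's recipe: score a sample of question-answer pairs with both the judge and human annotators, then report the correlation between the two score vectors (e.g., with `scipy.stats.spearmanr`).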
Anthology ID:
2025.evalmg-1.7
Volume:
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Wei Emma Zhang, Xiang Dai, Desmond Elliot, Byron Fang, Mongyuan Sim, Haojie Zhuang, Weitong Chen
Venues:
EvalMG | WS
Publisher:
Association for Computational Linguistics
Pages:
76–94
URL:
https://aclanthology.org/2025.evalmg-1.7/
Cite (ACL):
Neelabh Sinha, Vinija Jain, and Aman Chadha. 2025. Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types. In Proceedings of the First Workshop of Evaluation of Multi-Modal Generation, pages 76–94, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types (Sinha et al., EvalMG 2025)
PDF:
https://aclanthology.org/2025.evalmg-1.7.pdf