@inproceedings{sinha-etal-2025-guiding,
    title = "Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types",
    author = "Sinha, Neelabh and
      Jain, Vinija and
      Chadha, Aman",
    editor = "Zhang, Wei Emma and
      Dai, Xiang and
      Elliot, Desmond and
      Fang, Byron and
      Sim, Mongyuan and
      Zhuang, Haojie and
      Chen, Weitong",
    booktitle = "Proceedings of the First Workshop of Evaluation of Multi-Modal Generation",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.evalmg-1.7/",
    pages = "76--94",
    abstract = "Visual Question-Answering (VQA) has become key to user experience, particularly after improved generalization capabilities of Vision-Language Models (VLMs). But evaluating VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper aims to solve that using an end-to-end framework. We present VQA360 - a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, for a comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71{\%} with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, thus, making a right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths, while providing additional advantages. Our framework can also be extended to other tasks."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="sinha-etal-2025-guiding">
    <titleInfo>
      <title>Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Neelabh</namePart>
      <namePart type="family">Sinha</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Vinija</namePart>
      <namePart type="family">Jain</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Aman</namePart>
      <namePart type="family">Chadha</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-01</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the First Workshop of Evaluation of Multi-Modal Generation</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Wei</namePart>
        <namePart type="given">Emma</namePart>
        <namePart type="family">Zhang</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Xiang</namePart>
        <namePart type="family">Dai</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Desmond</namePart>
        <namePart type="family">Elliot</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Byron</namePart>
        <namePart type="family">Fang</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Mongyuan</namePart>
        <namePart type="family">Sim</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Haojie</namePart>
        <namePart type="family">Zhuang</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Weitong</namePart>
        <namePart type="family">Chen</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Abu Dhabi, UAE</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Visual Question-Answering (VQA) has become key to user experience, particularly after improved generalization capabilities of Vision-Language Models (VLMs). But evaluating VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper aims to solve that using an end-to-end framework. We present VQA360 - a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, for a comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, thus, making a right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths, while providing additional advantages. Our framework can also be extended to other tasks.</abstract>
    <identifier type="citekey">sinha-etal-2025-guiding</identifier>
    <location>
      <url>https://aclanthology.org/2025.evalmg-1.7/</url>
    </location>
    <part>
      <date>2025-01</date>
      <extent unit="page">
        <start>76</start>
        <end>94</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
%A Sinha, Neelabh
%A Jain, Vinija
%A Chadha, Aman
%Y Zhang, Wei Emma
%Y Dai, Xiang
%Y Elliot, Desmond
%Y Fang, Byron
%Y Sim, Mongyuan
%Y Zhuang, Haojie
%Y Chen, Weitong
%S Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
%D 2025
%8 January
%I Association for Computational Linguistics
%C Abu Dhabi, UAE
%F sinha-etal-2025-guiding
%X Visual Question-Answering (VQA) has become key to user experience, particularly after improved generalization capabilities of Vision-Language Models (VLMs). But evaluating VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper aims to solve that using an end-to-end framework. We present VQA360 - a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, for a comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, thus, making a right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths, while providing additional advantages. Our framework can also be extended to other tasks.
%U https://aclanthology.org/2025.evalmg-1.7/
%P 76-94
Markdown (Informal)
[Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types](https://aclanthology.org/2025.evalmg-1.7/) (Sinha et al., EvalMG 2025)
ACL
Neelabh Sinha, Vinija Jain, and Aman Chadha. 2025. Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types. In Proceedings of the First Workshop of Evaluation of Multi-Modal Generation, pages 76–94, Abu Dhabi, UAE. Association for Computational Linguistics.