Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation

Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, Minjoon Seo


Abstract
Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging. It requires not only checking whether the VLM follows the given instruction but also verifying whether the text output is properly grounded in the given image. Inspired by the recent approach of evaluating LMs with LMs, in this work, we propose to evaluate VLMs with VLMs. For this purpose, we present a new feedback dataset called the Perception Collection, encompassing 15K customized score rubrics that users might care about during assessment. Using the Perception Collection, we train Prometheus-Vision, the first open-source VLM evaluator model that can understand user-defined score criteria during evaluation. Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models, demonstrating its effectiveness for transparent and accessible evaluation of VLMs. We open-source our code, dataset, and model.
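The rubric-conditioned judging setup the abstract describes can be sketched as assembling an evaluation prompt from an instruction, a candidate response, and a user-defined score rubric. This is a minimal illustration only; the function name, field labels, and template wording are assumptions, not the paper's exact prompt format.

```python
def build_judge_prompt(instruction: str, response: str, rubric: dict) -> str:
    """Assemble a judge prompt from a customized score rubric (illustrative)."""
    # Render the rubric as one "Score N: description" line per level.
    rubric_text = "\n".join(
        f"Score {score}: {description}"
        for score, description in sorted(rubric.items())
    )
    return (
        "You are evaluating a vision-language model's response to an image.\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response to evaluate:\n{response}\n\n"
        f"### Score rubric:\n{rubric_text}\n\n"
        "Write feedback, then output a final score from 1 to 5."
    )

# Example user-defined rubric focusing on image grounding.
rubric = {
    1: "The response ignores the image content entirely.",
    3: "The response is partially grounded in the image.",
    5: "The response is fully grounded and follows the instruction.",
}
prompt = build_judge_prompt(
    "Describe the chart's trend.", "Sales rise steadily.", rubric
)
```

The resulting string would be passed, together with the image, to the evaluator model; a new rubric can be swapped in per evaluation without retraining.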
Anthology ID:
2024.findings-acl.672
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11286–11315
URL:
https://aclanthology.org/2024.findings-acl.672
DOI:
10.18653/v1/2024.findings-acl.672
Cite (ACL):
Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. 2024. Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation (Lee et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.672.pdf