Seiji Gobara
2025
IRR: Image Review Ranking Framework for Evaluating Vision-Language Models
Kazuki Hayashi | Kazuma Onishi | Toma Suzuki | Yusuke Ide | Seiji Gobara | Shigeki Saito | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics
Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models are strong at generating factual content, their ability to generate and evaluate texts that reflect different perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, highlighting the need for further advancements. These findings also point to the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.