LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model

Tao Sun, Oliver Liu, JinJin Li, Lan Ma

Abstract
Multimodal generative AI usually involves generating image or text responses given inputs in another modality. Evaluating image-text relevancy is essential for measuring response quality and for ranking candidate responses. In particular, binary relevancy evaluation, i.e., “Relevant” vs. “Not Relevant”, is a fundamental problem. It is a challenging task, however, since texts come in diverse formats and the definition of relevancy varies across scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice for building such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with MLLMs. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. Furthermore, we propose a novel binary relevancy dataset covering diverse tasks. Experimental results validate the effectiveness of our framework.
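For orientation, below is a minimal sketch of how binary relevancy evaluation can be posed as a prompt to an off-the-shelf LLaVA-style model via Hugging Face transformers. The checkpoint, prompt wording, and answer parsing are illustrative assumptions, not the paper's implementation; LLaVA-RE itself is fine-tuned for this task and additionally uses detailed task instructions and multimodal in-context samples.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative base checkpoint only; the paper's LLaVA-RE weights are separate.
MODEL_ID = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def judge_relevancy(image: Image.Image, text: str, task_instruction: str) -> str:
    """Prompt the model for a binary relevancy verdict on one image-text pair."""
    prompt = (
        "USER: <image>\n"
        f"{task_instruction}\n"
        f"Text: {text}\n"
        'Answer "Relevant" or "Not Relevant".\nASSISTANT:'
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    reply = processor.decode(output[0], skip_special_tokens=True)
    verdict = reply.split("ASSISTANT:")[-1].strip()  # keep only the model's answer
    return "Relevant" if verdict.lower().startswith("relevant") else "Not Relevant"

# Example usage (hypothetical inputs):
# image = Image.open("example.jpg")
# print(judge_relevancy(image, "A dog playing in the park.",
#                       "Judge whether the caption describes the image."))
```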
Anthology ID:
2025.evalmg-1.4
Volume:
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Month:
Jan
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Wei Emma Zhang, Xiang Dai, Desmond Elliot, Byron Fang, Mongyuan Sim, Haojie Zhuang, Weitong Chen
Venues:
EvalMG | WS
Publisher:
Association for Computational Linguistics
Pages:
40–51
URL:
https://aclanthology.org/2025.evalmg-1.4/
Cite (ACL):
Tao Sun, Oliver Liu, JinJin Li, and Lan Ma. 2025. LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model. In Proceedings of the First Workshop of Evaluation of Multi-Modal Generation, pages 40–51, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model (Sun et al., EvalMG 2025)
PDF:
https://aclanthology.org/2025.evalmg-1.4.pdf