VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models

Jingtao Cao, Zhang Zheng, Hongru Wang, Kam-Fai Wong


Abstract
Progress in Text-to-Image (T2I) models has significantly advanced the generation of images from textual descriptions. Existing metrics, such as CLIP, effectively measure the semantic alignment between single prompts and their corresponding images. However, they fall short in evaluating a model’s ability to generalize across a broad spectrum of textual inputs. To address this gap, we propose the VLEU (Visual Language Evaluation Understudy) metric. VLEU leverages the power of Large Language Models (LLMs) to sample from the visual text domain, encompassing the entire range of potential inputs for the T2I task, to generate a wide variety of visual text. The images generated by T2I models from these prompts are then assessed for their alignment with the input text using the CLIP model. VLEU quantitatively measures a model’s generalizability by computing the Kullback-Leibler (KL) divergence between the visual text marginal distribution and the conditional distribution over the images generated by the model. This provides a comprehensive metric for comparing the overall generalizability of T2I models, beyond single-prompt evaluations, and offers valuable insights during the finetuning process. Our experimental results demonstrate VLEU’s effectiveness in evaluating the generalizability of various T2I models, positioning it as an essential metric for future research and development in image synthesis from text prompts. Our code and data will be publicly available at https://github.com/mio7690/VLEU.
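The abstract describes VLEU as an Inception-Score-like quantity: an LLM samples a set of visual prompts, a T2I model generates an image per prompt, CLIP scores text-image alignment, and the score is the exponentiated mean KL divergence between the per-image conditional text distribution and the marginal text distribution. The sketch below is an illustrative reconstruction of that computation, not the authors' released code; the softmax-over-texts conditional and the `vleu_score` helper are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vleu_score(sim):
    """Hypothetical VLEU computation from a CLIP similarity matrix.

    sim: array of shape (num_images, num_texts), where sim[i, t] is the
    CLIP similarity between generated image i and sampled prompt t.
    """
    # Conditional distribution P(text | image): softmax over texts per image.
    p_t_given_i = softmax(sim, axis=1)
    # Marginal distribution P(text): average the conditionals over images.
    p_t = p_t_given_i.mean(axis=0)
    # KL(P(text|image) || P(text)) for each image, then exponentiate the mean.
    kl = (p_t_given_i * (np.log(p_t_given_i + 1e-12) - np.log(p_t + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Under this formulation, a model whose images are indistinguishable to CLIP (uniform similarities) scores 1, while a model whose images are each sharply aligned with their own prompt scores higher, reflecting broader coverage of the visual text domain.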
Anthology ID:
2024.emnlp-main.618
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11034–11049
URL:
https://aclanthology.org/2024.emnlp-main.618
Cite (ACL):
Jingtao Cao, Zhang Zheng, Hongru Wang, and Kam-Fai Wong. 2024. VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11034–11049, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models (Cao et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.618.pdf
Software:
 2024.emnlp-main.618.software.zip
Data:
 2024.emnlp-main.618.data.zip