Generating Contextual Images for Long-Form Text

Avijit Mitra, Nalin Gupta, Chetan Naik, Abhinav Sethy, Kinsey Bice, Zeynab Raeesy


Abstract
We investigate the problem of synthesizing relevant visual imagery from generic long-form text, leveraging Large Language Models (LLMs) and Text-to-Image Models (TIMs). Current TIMs require short prompts that describe the image content and style explicitly. Unlike such image prompts, general long-form text requires the image synthesis system to derive the visual content and style elements from the text itself. In this paper, we study zero-shot prompting and supervised fine-tuning approaches that use LLMs and TIMs jointly for synthesizing images. We present an empirical study on generating images for Wikipedia articles covering a broad spectrum of topics and image styles. We compare these systems using a suite of metrics, including a novel metric specifically designed to evaluate the semantic correctness of generated images. Our study offers a preliminary understanding of existing models’ strengths and limitations for the task of image generation from long-form text, sets up an evaluation framework, and establishes baselines for future research.
Anthology ID:
2024.lrec-main.673
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
7623–7633
URL:
https://aclanthology.org/2024.lrec-main.673
Cite (ACL):
Avijit Mitra, Nalin Gupta, Chetan Naik, Abhinav Sethy, Kinsey Bice, and Zeynab Raeesy. 2024. Generating Contextual Images for Long-Form Text. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7623–7633, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Generating Contextual Images for Long-Form Text (Mitra et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.673.pdf