Generating Contextual Images for Long-Form Text

Avijit Mitra; Nalin Gupta; Chetan Naik; Abhinav Sethy; Kinsey Bice; Zeynab Raeesy

Abstract

We investigate the problem of synthesizing relevant visual imagery from generic long-form text, leveraging Large Language Models (LLMs) and Text-to-Image Models (TIMs). Current Text-to-Image models require short prompts that describe the image content and style explicitly. Unlike generation from image prompts, generation of images from general long-form text requires the image synthesis system to derive the visual content and style elements from the text. In this paper, we study zero-shot prompting and supervised fine-tuning approaches that use LLMs and TIMs jointly for synthesizing images. We present an empirical study on generating images for Wikipedia articles covering a broad spectrum of topics and image styles. We compare these systems using a suite of metrics, including a novel metric specifically designed to evaluate the semantic correctness of generated images. Our study offers a preliminary understanding of existing models’ strengths and limitations for the task of image generation from long-form text, sets up an evaluation framework, and establishes baselines for future research.

Anthology ID: 2024.lrec-main.673
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 7623–7633
URL: https://aclanthology.org/2024.lrec-main.673
Cite (ACL): Avijit Mitra, Nalin Gupta, Chetan Naik, Abhinav Sethy, Kinsey Bice, and Zeynab Raeesy. 2024. Generating Contextual Images for Long-Form Text. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7623–7633, Torino, Italia. ELRA and ICCL.
Cite (Informal): Generating Contextual Images for Long-Form Text (Mitra et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.673.pdf