ITERATE: Image-Text Enhancement, Retrieval, and Alignment for Transmodal Evolution with LLMs

Chenhan Fu, Guoming Wang, Juncheng Li, Wenqiao Zhang, Rongxing Lu, Siliang Tang


Abstract
Inspired by human cognitive behavior and enabled by the development of multimodal models, we introduce the visual modality to improve performance on purely text-based question-answering tasks. However, obtaining corresponding images through manual annotation is often costly. An intuitive alternative is to use search engines or web scraping to obtain relevant images automatically. Yet images obtained this way may be of low quality and may not match the context of the original task, and so may fail to improve, or may even decrease, performance on downstream tasks. In this paper, we propose a novel framework named “ITERATE”, which retrieves images and optimizes their quality to improve the alignment between text and images. Inspired by evolutionary algorithms in reinforcement learning and driven by the synergy of large language models (LLMs) and multimodal models, ITERATE employs a series of strategic actions such as filtering, optimizing, and retrieving to acquire higher-quality images, and repeats this process over multiple generations to enhance the quality of the entire image cluster. Experimental results on the ScienceQA, ARC-Easy, and OpenDataEval datasets verify the effectiveness of our method, showing improvements of 3.5%, 5%, and 7%, respectively.
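The generational loop described in the abstract (filter, retrieve, repeat) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `Candidate`, `alignment_score`, `retrieve`, and `evolve` are hypothetical, a caption word-overlap score stands in for a multimodal alignment model, an in-memory list stands in for a search engine, and the "optimizing" action is omitted because the abstract does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    image_id: str
    caption: str  # assumption: a caption stands in for the image content


def alignment_score(question: str, cand: Candidate) -> float:
    """Toy text-image alignment: Jaccard overlap of question and caption words.

    A real system would presumably score with a multimodal model instead.
    """
    q = set(question.lower().split())
    c = set(cand.caption.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def retrieve(pool: list, seen: set, n: int) -> list:
    """Return up to n candidates from the pool that have not been seen yet."""
    return [c for c in pool if c.image_id not in seen][:n]


def evolve(question: str, pool: list, generations: int = 3,
           population_size: int = 4, keep: int = 2) -> list:
    """Run a filter-then-retrieve loop for a fixed number of generations."""
    population = retrieve(pool, set(), population_size)
    seen = {c.image_id for c in population}
    for _ in range(generations):
        # Filter: keep only the best-aligned candidates of this generation.
        population.sort(key=lambda c: alignment_score(question, c), reverse=True)
        survivors = population[:keep]
        # Retrieve: refill the population with candidates not yet examined.
        fresh = retrieve(pool, seen, population_size - keep)
        seen.update(c.image_id for c in fresh)
        population = survivors + fresh
    population.sort(key=lambda c: alignment_score(question, c), reverse=True)
    return population[:keep]


if __name__ == "__main__":
    toy_pool = [
        Candidate("a", "a red car on a road"),
        Candidate("b", "plants growing in sunlight"),
        Candidate("c", "why plants need sunlight diagram"),
    ]
    for cand in evolve("why do plants need sunlight", toy_pool,
                       population_size=2, keep=1):
        print(cand.image_id, cand.caption)
```

Keeping the survivors of each generation and replacing only the discarded slots means the best score in the population never decreases from one generation to the next, which is one simple way to read "enhance the quality of the entire image cluster".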
Anthology ID:
2025.coling-main.91
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1365–1376
URL:
https://aclanthology.org/2025.coling-main.91/
Cite (ACL):
Chenhan Fu, Guoming Wang, Juncheng Li, Wenqiao Zhang, Rongxing Lu, and Siliang Tang. 2025. ITERATE: Image-Text Enhancement, Retrieval, and Alignment for Transmodal Evolution with LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 1365–1376, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
ITERATE: Image-Text Enhancement, Retrieval, and Alignment for Transmodal Evolution with LLMs (Fu et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.91.pdf