Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, Sungroh Yoon


Abstract
In this paper, we primarily address the issue of dialogue-form context query within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on the information of retrieval candidate images in the current context. This approach mitigates the issues of noisiness and redundancy in the generated questions. Beyond our methodology, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of the interactive retrieval system. PlugIR demonstrates superior performance compared to both zero-shot and fine-tuned baselines in various benchmarks. Additionally, the two methodologies comprising PlugIR can be flexibly applied together or separately in various situations.
Anthology ID:
2024.luhme-long.46
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
791–809
Language:
URL:
https://aclanthology.org/2024.luhme-long.46/
DOI:
10.18653/v1/2024.acl-long.46
Bibkey:
Cite (ACL):
Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, and Sungroh Yoon. 2024. Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 791–809, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach (Lee et al., ACL 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.acl-long.46.pdf