@inproceedings{yi-ge-etal-2025-drum,
title = "{DRUM}: Learning Demonstration Retriever for Large {MU}lti-modal Models",
author = "Yi-Ge, Ellen and
Gao, Jiechao and
Han, Wei and
Zhu, Wei",
editor = "Zhao, Jin and
Wang, Mingyang and
Liu, Zhu",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-srw.83/",
doi = "10.18653/v1/2025.acl-srw.83",
pages = "1051--1063",
ISBN = "979-8-89176-254-1",
abstract = "Recently, large language models (LLMs) have demonstrated impressive capabilities in dealing with new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopt the naive strategies like fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee the configured demonstrations fit the need of the LVLMs. To address this issue, we propose a novel framework, demonstration retriever for large multi-modal model (DRUM), which fine-tunes the CLIP embedding model to better meet the LVLM{'}s needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given. And we propose to concate the image and text embeddings to enhance the retrieval performance. Second, we propose to re-rank the the embedding model{'}s retrieved demonstrations via the LVLM{'}s feedbacks, and calculate a list-wise ranking loss for training the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks, 7 benchmark datasets, our DRUM framework is proven to be effective in boosting the LVLM{'}s in-context learning performance via retrieving more proper demonstrations."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="yi-ge-etal-2025-drum">
<titleInfo>
<title>DRUM: Learning Demonstration Retriever for Large MUlti-modal Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ellen</namePart>
<namePart type="family">Yi-Ge</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jiechao</namePart>
<namePart type="family">Gao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wei</namePart>
<namePart type="family">Han</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wei</namePart>
<namePart type="family">Zhu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Jin</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mingyang</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhu</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-254-1</identifier>
</relatedItem>
<abstract>Recently, large language models (LLMs) have demonstrated impressive capabilities in handling new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopt naive strategies such as fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLMs. To address this issue, we propose a novel framework, demonstration retriever for large multi-modal models (DRUM), which fine-tunes the CLIP embedding model to better meet the LVLM’s needs. First, we discuss retrieval strategies for a visual-language task, assuming an embedding model is given, and propose to concatenate the image and text embeddings to enhance retrieval performance. Second, we propose to re-rank the embedding model’s retrieved demonstrations via the LVLM’s feedback, and calculate a list-wise ranking loss for training the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks and 7 benchmark datasets, our DRUM framework proves effective in boosting the LVLM’s in-context learning performance by retrieving more suitable demonstrations.</abstract>
<identifier type="citekey">yi-ge-etal-2025-drum</identifier>
<identifier type="doi">10.18653/v1/2025.acl-srw.83</identifier>
<location>
<url>https://aclanthology.org/2025.acl-srw.83/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>1051</start>
<end>1063</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T DRUM: Learning Demonstration Retriever for Large MUlti-modal Models
%A Yi-Ge, Ellen
%A Gao, Jiechao
%A Han, Wei
%A Zhu, Wei
%Y Zhao, Jin
%Y Wang, Mingyang
%Y Liu, Zhu
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-254-1
%F yi-ge-etal-2025-drum
%X Recently, large language models (LLMs) have demonstrated impressive capabilities in handling new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopt naive strategies such as fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLMs. To address this issue, we propose a novel framework, demonstration retriever for large multi-modal models (DRUM), which fine-tunes the CLIP embedding model to better meet the LVLM’s needs. First, we discuss retrieval strategies for a visual-language task, assuming an embedding model is given, and propose to concatenate the image and text embeddings to enhance retrieval performance. Second, we propose to re-rank the embedding model’s retrieved demonstrations via the LVLM’s feedback, and calculate a list-wise ranking loss for training the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks and 7 benchmark datasets, our DRUM framework proves effective in boosting the LVLM’s in-context learning performance by retrieving more suitable demonstrations.
%R 10.18653/v1/2025.acl-srw.83
%U https://aclanthology.org/2025.acl-srw.83/
%U https://doi.org/10.18653/v1/2025.acl-srw.83
%P 1051-1063
Markdown (Informal)
[DRUM: Learning Demonstration Retriever for Large MUlti-modal Models](https://aclanthology.org/2025.acl-srw.83/) (Yi-Ge et al., ACL 2025)
ACL
Ellen Yi-Ge, Jiechao Gao, Wei Han, and Wei Zhu. 2025. [DRUM: Learning Demonstration Retriever for Large MUlti-modal Models](https://aclanthology.org/2025.acl-srw.83/). In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)*, pages 1051–1063, Vienna, Austria. Association for Computational Linguistics.
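
As a quick illustration of the techniques the abstract names (concatenating image and text embeddings for demonstration retrieval, plus a list-wise ranking loss derived from the LVLM's feedback), here is a minimal PyTorch sketch. It is a reading aid under stated assumptions, not the paper's implementation: the helper names, embedding dimensions, and the Plackett-Luce formulation of the list-wise loss are all illustrative.

```python
# Illustrative sketch only: tensor shapes, helper names, and the
# ListMLE-style loss are assumptions, not the authors' released code.
import torch
import torch.nn.functional as F


def joint_embed(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate L2-normalized image and text embeddings into one vector."""
    return torch.cat([F.normalize(img_emb, dim=-1),
                      F.normalize(txt_emb, dim=-1)], dim=-1)


def retrieve_top_k(query: torch.Tensor, pool: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k most cosine-similar demonstrations in the pool."""
    sims = F.normalize(query, dim=-1) @ F.normalize(pool, dim=-1).T
    return sims.topk(k, dim=-1).indices


def listwise_ranking_loss(scores: torch.Tensor, lvlm_ranks: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce (ListMLE-style) loss: push retriever scores toward the
    ordering induced by LVLM feedback (rank 0 = most helpful demonstration)."""
    s = scores.gather(-1, lvlm_ranks.argsort(dim=-1))  # reorder best-first
    # negative log-likelihood of the observed permutation
    return (torch.logcumsumexp(s.flip(-1), dim=-1).flip(-1) - s).sum(-1).mean()


# Toy usage with 512-d random stand-ins for CLIP image/text embeddings.
pool = joint_embed(torch.randn(100, 512), torch.randn(100, 512))
query = joint_embed(torch.randn(1, 512), torch.randn(1, 512))
top8 = retrieve_top_k(query, pool, k=8)                    # candidates for the LVLM
scores = F.normalize(query, dim=-1) @ F.normalize(pool[top8[0]], dim=-1).T
feedback = torch.randperm(8).unsqueeze(0)                  # stand-in for LVLM ranks
loss = listwise_ranking_loss(scores, feedback)
print(top8, loss.item())
```

In the pipeline the abstract describes, such a loss would fine-tune the CLIP embedding model so that retrieval scores track the LVLM's observed preferences, and the iterative mining step would then re-run retrieval with the updated retriever to collect fresh training lists.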