Zero-Shot Multimodal Retrieval with Multi-Scale Contextual Representations

Sourajit Saha; Tejas Gokhale

doi:10.18653/v1/2026.acl-long.930

Abstract

In multimodal information retrieval (MMIR), candidates relevant to an input query need to be retrieved from a database, where the query and database items span different modalities. As real-world databases evolve, repeatedly annotating and indexing data and re-optimizing domain-specific models across modalities is impractical. We present MULTI-SCORE, a fine-tuning-free, two-stage MMIR approach that couples efficient candidate filtering with fine-grained multimodal re-ranking. Stage-1 adopts Matryoshka representations to efficiently filter out low-relevance candidates without expensive similarity computations on full-scale representations for the entire database. Stage-2 re-ranks the filtered candidates by computing their fine-grained multimodal contextual representations with two scoring functions for semantic alignment using chain-of-thought prompting and question-answering. Experiments demonstrate state-of-the-art zero-shot retrieval on 12 MMIR tasks across 32 datasets while outperforming supervised methods on 23 datasets.
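The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`stage1_filter`, `stage2_rerank`, `retrieve`) and parameters (`prefix_dim`, `keep`, `score_fn`) are hypothetical, embeddings are assumed to be Matryoshka-style so that a leading slice of each vector is itself a usable low-dimensional representation, and plain cosine similarity on full-length vectors stands in for the paper's Stage-2 scoring functions, which rely on chain-of-thought prompting and question-answering.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stage1_filter(query: Vector, database: List[Vector],
                  prefix_dim: int, keep: int) -> List[int]:
    """Coarse filtering: score every item using only the first
    `prefix_dim` dimensions (assumed Matryoshka-style prefix) and
    return the indices of the `keep` best-scoring items."""
    q = query[:prefix_dim]
    scored = [(cosine(q, item[:prefix_dim]), idx)
              for idx, item in enumerate(database)]
    # Highest score first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:keep]]


def stage2_rerank(query: Vector, database: List[Vector],
                  candidates: List[int],
                  score_fn: Callable[[Vector, Vector], float]
                  ) -> List[Tuple[int, float]]:
    """Fine-grained re-ranking of the surviving candidates only.
    `score_fn` is a placeholder for a more expensive scorer."""
    rescored = [(score_fn(query, database[i]), i) for i in candidates]
    rescored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in rescored]


def retrieve(query: Vector, database: List[Vector],
             prefix_dim: int = 2, keep: int = 2,
             score_fn: Callable[[Vector, Vector], float] = cosine
             ) -> List[Tuple[int, float]]:
    """Run the cheap filter over the whole database, then apply the
    expensive scorer to the filtered candidates."""
    candidates = stage1_filter(query, database, prefix_dim, keep)
    return stage2_rerank(query, database, candidates, score_fn)
```

The point of the split is cost: the expensive scorer runs on `keep` items rather than on the whole database, while the cheap prefix comparison touches every item but only a fraction of each vector.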

Anthology ID: 2026.acl-long.930
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 20304–20324
URL: https://aclanthology.org/2026.acl-long.930/
DOI: 10.18653/v1/2026.acl-long.930
Cite (ACL): Sourajit Saha and Tejas Gokhale. 2026. Zero-Shot Multimodal Retrieval with Multi-Scale Contextual Representations. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20304–20324, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Zero-Shot Multimodal Retrieval with Multi-Scale Contextual Representations (Saha & Gokhale, ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.930.pdf
Checklist: 2026.acl-long.930.checklist.pdf