PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers

Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne

Abstract
Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA), which involves retrieving relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR, intended to be useful in future developments in general-purpose multi-modal retrievers.
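For readers unfamiliar with late interaction, the sketch below illustrates the MaxSim relevance scoring used by ColBERT-style retrievers, the family FLMR builds on. This is a minimal illustration assuming PyTorch; the function name, tensor shapes, and usage values are ours for illustration, not the paper's implementation.

import torch
import torch.nn.functional as F

def late_interaction_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: (num_query_tokens, dim) query-side token embeddings; in a
    # multi-modal retriever these would concatenate text and visual tokens.
    # d_emb: (num_doc_tokens, dim) document token embeddings.
    q = F.normalize(q_emb, dim=-1)  # unit-normalise so dot products are cosine similarities
    d = F.normalize(d_emb, dim=-1)
    sim = q @ d.T                   # token-level similarity matrix: (num_query_tokens, num_doc_tokens)
    # MaxSim: each query token takes its best-matching document token,
    # and the per-token maxima are summed into a single relevance score.
    return sim.max(dim=-1).values.sum()

# Illustrative usage with random embeddings (hypothetical sizes):
query = torch.randn(32, 128)   # e.g. question tokens plus visual tokens
doc = torch.randn(200, 128)    # document tokens
score = late_interaction_score(query, doc)

Because scoring happens at the token level rather than on a single pooled vector, late interaction preserves fine-grained matches between individual query tokens (textual or visual) and document tokens, which is the "fine-grained" property the paper's title refers to.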
Anthology ID:
2024.acl-long.289
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5294–5316
URL:
https://aclanthology.org/2024.acl-long.289
Cite (ACL):
Weizhe Lin, Jingbiao Mei, Jinghong Chen, and Bill Byrne. 2024. PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5294–5316, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers (Lin et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.289.pdf