DocMMIR: A Framework for Document Multi-modal Information Retrieval

Zirui Li, Siwei Wu, Yizhi Li, Xingyu Wang, Yi Zhou, Chenghua Lin


Abstract
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains—including Wikipedia articles, scientific papers (arXiv), and presentation slides—within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal dataset, comprising 450K training, 19.2K validation, and 19.2K test documents, serving as both a benchmark to reveal the shortcomings of existing MMIR models and a training set for further improvement. The dataset systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our task, with only CLIP (ViT-L/14) demonstrating reasonable zero-shot performance. Through systematic investigation of cross-modal fusion strategies and loss function selection on the CLIP (ViT-L/14) model, we develop an optimised approach that achieves a +31% improvement in MRR@10 over the zero-shot baseline after fine-tuning. Our findings offer crucial insights and practical guidance for future development in unified multimodal document retrieval tasks.
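To make the retrieval setup concrete, below is a minimal sketch (not the authors' released code) of document-level retrieval with CLIP ViT-L/14: each document's text and image embeddings are fused by simple averaging, documents are ranked for a text query by cosine similarity, and results are scored with MRR@10. The fusion choice and the helper names (`fuse_document`, `mrr_at_10`) are illustrative assumptions, not necessarily the paper's best-performing strategy.

```python
# Minimal sketch, assuming mean fusion of CLIP text/image embeddings per document
# and an MRR@10 evaluation; this is not the DocMMIR reference implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

def embed_text(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def fuse_document(texts, images):
    """One embedding per document: mean of its text- and image-segment embeddings
    (average fusion is an illustrative assumption)."""
    parts = [embed_text(texts).mean(dim=0)]
    if images:
        parts.append(embed_images(images).mean(dim=0))
    doc = torch.stack(parts).mean(dim=0)
    return torch.nn.functional.normalize(doc, dim=-1)

def mrr_at_10(query_embs, doc_embs, gold_idx):
    """Mean Reciprocal Rank@10 for text queries against fused document embeddings."""
    sims = query_embs @ doc_embs.T                 # cosine similarity (unit vectors)
    ranks = sims.argsort(dim=-1, descending=True)  # ranked document indices per query
    rr = []
    for q, gold in enumerate(gold_idx):
        pos = (ranks[q] == gold).nonzero(as_tuple=True)[0].item()
        rr.append(1.0 / (pos + 1) if pos < 10 else 0.0)
    return sum(rr) / len(rr)

# Toy usage: two "documents" with placeholder images, one query whose gold document is index 1.
docs = [
    (["A Wikipedia article about transformers."], [Image.new("RGB", (224, 224))]),
    (["Slides introducing contrastive learning."], [Image.new("RGB", (224, 224))]),
]
doc_embs = torch.stack([fuse_document(t, i) for t, i in docs])
queries = embed_text(["Which document covers contrastive learning?"])
print("MRR@10:", mrr_at_10(queries, doc_embs, gold_idx=[1]))
```

In this toy setup, fine-tuning the fused representation with a contrastive loss over query–document pairs is the kind of intervention the paper reports as driving the +31% MRR@10 gain over the zero-shot baseline.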
Anthology ID:
2025.findings-emnlp.705
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13117–13130
URL:
https://aclanthology.org/2025.findings-emnlp.705/
Cite (ACL):
Zirui Li, Siwei Wu, Yizhi Li, Xingyu Wang, Yi Zhou, and Chenghua Lin. 2025. DocMMIR: A Framework for Document Multi-modal Information Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13117–13130, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
DocMMIR: A Framework for Document Multi-modal Information Retrieval (Li et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.705.pdf
Checklist:
https://aclanthology.org/attachments/2025.findings-emnlp.705.checklist.pdf