SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Siwei Wu; Yizhi Li; Kang Zhu; Ge Zhang; Yiming Liang; Kaijing Ma; Chenghao Xiao; Haoran Zhang; Bohao Yang; Wenhu Chen; Wenhao Huang; Noura Al Moubayed; Jie Fu; Chenghua Lin

doi:10.18653/v1/2024.findings-acl.746

Abstract

Multi-modal information retrieval (MMIR) is a rapidly evolving field in which significant progress has been made through advanced representation learning and cross-modality alignment research, particularly in image-text pairing. However, current benchmarks for evaluating MMIR performance on image-text pairings overlook the scientific domain, which differs notably from generic data: captions of scientific charts and tables usually describe the analysis of experimental results or scientific principles, rather than the human activity or scenery depicted in generic images. To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging open-access research paper corpora to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions from scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of the baselines. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2. Our findings offer critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the effects of different visual and textual encoders.

Anthology ID: 2024.findings-acl.746
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12560–12574
URL: https://aclanthology.org/2024.findings-acl.746/
DOI: 10.18653/v1/2024.findings-acl.746
Cite (ACL): Siwei Wu, Yizhi Li, Kang Zhu, Ge Zhang, Yiming Liang, Kaijing Ma, Chenghao Xiao, Haoran Zhang, Bohao Yang, Wenhu Chen, Wenhao Huang, Noura Al Moubayed, Jie Fu, and Chenghua Lin. 2024. SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12560–12574, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval (Wu et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.746.pdf
