Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

You Li (李铀); Heyu Huang; Chi Chen; Kaiyu Huang (黄锴宇); Chao Huang; Zonghao Guo; Zhiyuan Liu; Jinan Xu (徐金安); Yuhua Li; Ruixuan Li; Maosong Sun

doi:10.18653/v1/2025.findings-acl.512

Abstract

The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.

Anthology ID: 2025.findings-acl.512
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9845–9867
URL: https://aclanthology.org/2025.findings-acl.512/
DOI: 10.18653/v1/2025.findings-acl.512
Cite (ACL): You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, and Maosong Sun. 2025. Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9845–9867, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models (Li et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.512.pdf
