Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM

Dingjie Song, Sicheng Lai, Mingxuan Wang, Shunian Chen, Lichao Sun, Benyou Wang


Abstract
The rapid advancement of multimodal large language models (MLLMs) has significantly improved performance across benchmarks. However, data contamination, in which part or all of a benchmark is included in a model's training set, poses a critical challenge for fair evaluation. Existing detection methods designed for unimodal large language models (LLMs) are inadequate for MLLMs because of the complexity of multimodal data and the multi-phase nature of MLLM training. We systematically analyze multimodal data contamination using our analytical framework, MM-DETECT, which defines two contamination categories (unimodal and cross-modal) and quantifies contamination severity on multiple-choice and caption-based Visual Question Answering tasks. Evaluations of twelve MLLMs on five benchmarks reveal significant contamination, particularly in proprietary models and on older benchmarks. Crucially, contamination sometimes originates during unimodal pre-training rather than solely during multimodal fine-tuning. These findings refine the understanding of contamination and can guide evaluation practice and improve the reliability of multimodal models.
Anthology ID:
2025.findings-emnlp.556
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10527–10542
URL:
https://aclanthology.org/2025.findings-emnlp.556/
Cite (ACL):
Dingjie Song, Sicheng Lai, Mingxuan Wang, Shunian Chen, Lichao Sun, and Benyou Wang. 2025. Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10527–10542, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM (Song et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.556.pdf
Checklist:
2025.findings-emnlp.556.checklist.pdf