When Image and Text Disagree: Cross-Modal Evidence Conflict in Multimodal Retrieval-Augmented Generation

Jasper Kyle Catapang

Abstract

This paper introduces the Cross-Modal Conflict Benchmark (CMC-Bench) to evaluate how multimodal retrieval-augmented generation (RAG) systems handle contradictory evidence across retrieved text and images. Using 3,768 instances from the ChartQA and MMMU evaluation splits, the study benchmarks four open vision-language models (VLMs) across four conflict types (factual, temporal, entity, and granularity) and four evidence conditions: aligned (both modalities support the gold answer), image-correct (the image supports the gold answer and the text contradicts it), text-correct (the text supports the gold answer and the image is wrong or swapped), and both-wrong (neither modality supports the gold answer). Cross-modal disagreement severely degrades performance, with accuracy dropping by 0.17 to 0.46 relative to aligned evidence. Models often exhibit a modality lean rather than reliable arbitration, and text-leaning systems are particularly vulnerable when only the image is correct. Furthermore, merging abstention and fabrication into a single hallucination score obscures critical behavioral differences: Qwen3-VL-4B abstains on 31.7% of conflicts, while Gemma-3n-E2B fabricates unsupported answers on 51.9% of conflicts. Multimodal RAG evaluation should therefore explicitly distinguish abstention from fabrication to assess reliability accurately.
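The scoring idea in the abstract, reporting abstention and fabrication separately and measuring the accuracy drop against the aligned condition, can be illustrated with a minimal Python sketch. This is not the paper's evaluation code. The `Record` fields, the `ABSTAIN_MARKERS` phrases, the exact-match comparison, and the extra `followed_wrong_evidence` outcome are all assumptions made for illustration; the paper's own definitions of abstention and fabrication may differ.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Hypothetical phrases treated as abstention; a real setup would need a
# more careful detector.
ABSTAIN_MARKERS = ("cannot determine", "insufficient evidence")

# Outcome labels used in this sketch (assumed, not taken from the paper).
OUTCOMES = ("correct", "abstain", "followed_wrong_evidence", "fabricate")


@dataclass(frozen=True)
class Record:
    condition: str    # "aligned", "image-correct", "text-correct", "both-wrong"
    gold: str         # gold answer
    text_claim: str   # answer asserted by the retrieved text
    image_claim: str  # answer asserted by the retrieved image
    prediction: str   # model output


def _norm(value: str) -> str:
    return value.strip().lower()


def classify(record: Record) -> str:
    """Assign one outcome label to a single model response."""
    pred = _norm(record.prediction)
    if any(marker in pred for marker in ABSTAIN_MARKERS):
        return "abstain"
    if pred == _norm(record.gold):
        return "correct"
    # Wrong, but copied from one of the retrieved modalities.
    if pred in {_norm(record.text_claim), _norm(record.image_claim)}:
        return "followed_wrong_evidence"
    # Wrong and supported by neither modality.
    return "fabricate"


def summarize(records):
    """Return per-condition outcome rates as {condition: {outcome: rate}}."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record.condition][classify(record)] += 1
    summary = {}
    for condition, counter in counts.items():
        total = sum(counter.values())
        summary[condition] = {name: counter[name] / total for name in OUTCOMES}
    return summary


def accuracy_drop(summary, condition, baseline="aligned"):
    """Accuracy lost in `condition` relative to the baseline condition."""
    return summary[baseline]["correct"] - summary[condition]["correct"]
```

Keeping `abstain` and `fabricate` as separate rates, rather than summing every non-correct response into one hallucination score, is what lets two models with similar accuracy be told apart by how they fail.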

Anthology ID: 2026.magmar-main.3
Volume: Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026)
Month: July
Year: 2026
Address: San Diego, USA
Editors: Kenton Murray, Reno Kriz
Venues: MAGMaR | WS
Publisher: Association for Computational Linguistics
Pages: 1–10
URL: https://aclanthology.org/2026.magmar-main.3/
Cite (ACL): Jasper Kyle Catapang. 2026. When Image and Text Disagree: Cross-Modal Evidence Conflict in Multimodal Retrieval-Augmented Generation. In Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026), pages 1–10, San Diego, USA. Association for Computational Linguistics.
Cite (Informal): When Image and Text Disagree: Cross-Modal Evidence Conflict in Multimodal Retrieval-Augmented Generation (Catapang, MAGMaR 2026)
PDF: https://aclanthology.org/2026.magmar-main.3.pdf
