@inproceedings{tan-etal-2026-blend,
title = "{BLE}n{D}-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models",
author = "Tan, Bryan Chen Zhengyu and
Zheng, Weihua and
Liu, Zhengyuan and
Chen, Nancy F. and
Lee, Hwaran and
Choo, Kenny Tsu Wei and
Lee, Roy Ka-Wei",
editor = "Demberg, Vera and
Inui, Kentaro and
Marquez, Llu{\'i}s",
booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.eacl-long.215/",
pages = "4647--4669",
ISBN = "979-8-89176-380-7",
abstract = "As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce `BLEnD-Vis{`}, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, `BLEnD-Vis{`} constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\rightarrow$ Entity, (ii) an inverted text-only variant (Entity $\rightarrow$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice questions (MCQ) instances, validated through human annotation. `BLEnD-Vis{`} reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing. While visual cues often aid performance, low cross-modal consistency highlights the challenges of robustly integrating textual and visual understanding, particularly in lower-resource regions. `BLEnD-Vis{`} thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs. Code is available at https://github.com/Social-AI-Studio/BLEnD-Vis."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="tan-etal-2026-blend">
<titleInfo>
<title>BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Bryan</namePart>
<namePart type="given">Chen</namePart>
<namePart type="given">Zhengyu</namePart>
<namePart type="family">Tan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Weihua</namePart>
<namePart type="family">Zheng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhengyuan</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nancy</namePart>
<namePart type="given">F</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hwaran</namePart>
<namePart type="family">Lee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kenny</namePart>
<namePart type="given">Tsu</namePart>
<namePart type="given">Wei</namePart>
<namePart type="family">Choo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Roy</namePart>
<namePart type="given">Ka-Wei</namePart>
<namePart type="family">Lee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-03</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Vera</namePart>
<namePart type="family">Demberg</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kentaro</namePart>
<namePart type="family">Inui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lluís</namePart>
<namePart type="family">Marquez</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Rabat, Morocco</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-380-7</identifier>
</relatedItem>
<abstract>As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce ‘BLEnD-Vis‘, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, ‘BLEnD-Vis‘ constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region \rightarrow Entity, (ii) an inverted text-only variant (Entity \rightarrow Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice questions (MCQ) instances, validated through human annotation. ‘BLEnD-Vis‘ reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing. While visual cues often aid performance, low cross-modal consistency highlights the challenges of robustly integrating textual and visual understanding, particularly in lower-resource regions. ‘BLEnD-Vis‘ thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs. Code is available at https://github.com/Social-AI-Studio/BLEnD-Vis.</abstract>
<identifier type="citekey">tan-etal-2026-blend</identifier>
<location>
<url>https://aclanthology.org/2026.eacl-long.215/</url>
</location>
<part>
<date>2026-03</date>
<extent unit="page">
<start>4647</start>
<end>4669</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
%A Tan, Bryan Chen Zhengyu
%A Zheng, Weihua
%A Liu, Zhengyuan
%A Chen, Nancy F.
%A Lee, Hwaran
%A Choo, Kenny Tsu Wei
%A Lee, Roy Ka-Wei
%Y Demberg, Vera
%Y Inui, Kentaro
%Y Marquez, Lluís
%S Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2026
%8 March
%I Association for Computational Linguistics
%C Rabat, Morocco
%@ 979-8-89176-380-7
%F tan-etal-2026-blend
%X As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce ‘BLEnD-Vis‘, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, ‘BLEnD-Vis‘ constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region \rightarrow Entity, (ii) an inverted text-only variant (Entity \rightarrow Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice questions (MCQ) instances, validated through human annotation. ‘BLEnD-Vis‘ reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing. While visual cues often aid performance, low cross-modal consistency highlights the challenges of robustly integrating textual and visual understanding, particularly in lower-resource regions. ‘BLEnD-Vis‘ thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs. Code is available at https://github.com/Social-AI-Studio/BLEnD-Vis.
%U https://aclanthology.org/2026.eacl-long.215/
%P 4647-4669

Markdown (Informal)
[BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models](https://aclanthology.org/2026.eacl-long.215/) (Tan et al., EACL 2026)
ACL
Bryan Chen Zhengyu Tan, Weihua Zheng, Zhengyuan Liu, Nancy F. Chen, Hwaran Lee, Kenny Tsu Wei Choo, and Roy Ka-Wei Lee. 2026. BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4647–4669, Rabat, Morocco. Association for Computational Linguistics.