TaiwanVQA: A Benchmark for Visual Question Answering for Taiwanese Daily Life

Hsin-Yi Hsieh, Shang Wei Liu, Chang Chih Meng, Shuo-Yueh Lin, Chen Chien-Hua, Hung-Ju Lin, Hen-Hsen Huang, I-Chen Wu


Abstract
We introduce TaiwanVQA, a novel visual question answering benchmark designed to evaluate vision-language models' (VLMs) ability to recognize and reason about Taiwan-specific multimodal content. TaiwanVQA comprises 2,000 image-question pairs covering diverse topics relevant to Taiwanese culture and daily life. We categorize the questions into recognition and reasoning tasks, further sub-classifying reasoning questions by the level of external knowledge required. We conduct extensive experiments on state-of-the-art VLMs, including GPT-4o, Llama-3.2, LLaVA, Qwen2-VL, and InternVL2 models. Our findings reveal significant limitations in current VLMs when handling culturally specific content: a substantial performance gap separates recognition tasks (top score 73.60%) from reasoning tasks (top score 49.80%), indicating challenges in cultural inference and contextual understanding. These results highlight the need for more culturally diverse training data and improved model architectures that can better integrate visual and textual information within specific cultural contexts. By providing TaiwanVQA, we aim to contribute to the development of more inclusive and culturally aware AI models, facilitating their deployment in diverse real-world settings. TaiwanVQA can be accessed on our GitHub page.
Anthology ID:
2025.evalmg-1.6
Volume:
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Month:
Jan
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Wei Emma Zhang, Xiang Dai, Desmond Elliot, Byron Fang, Mongyuan Sim, Haojie Zhuang, Weitong Chen
Venues:
EvalMG | WS
Publisher:
Association for Computational Linguistics
Pages:
57–75
URL:
https://aclanthology.org/2025.evalmg-1.6/
Cite (ACL):
Hsin-Yi Hsieh, Shang Wei Liu, Chang Chih Meng, Shuo-Yueh Lin, Chen Chien-Hua, Hung-Ju Lin, Hen-Hsen Huang, and I-Chen Wu. 2025. TaiwanVQA: A Benchmark for Visual Question Answering for Taiwanese Daily Life. In Proceedings of the First Workshop of Evaluation of Multi-Modal Generation, pages 57–75, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
TaiwanVQA: A Benchmark for Visual Question Answering for Taiwanese Daily Life (Hsieh et al., EvalMG 2025)
PDF:
https://aclanthology.org/2025.evalmg-1.6.pdf