SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering

Norawit Urailertprasert; Peerat Limkonchotiwat; Supasorn Suwajanakorn; Sarana Nutanong

doi:10.18653/v1/2024.alvr-1.15

Abstract

Visual Question Answering (VQA) is a critical task that requires the simultaneous understanding of visual and textual information. While significant advancements have been made with multilingual datasets, these often lack cultural specificity, especially in the context of Southeast Asia (SEA). In this paper, we introduce SEA-VQA, which aims to highlight the challenges and gaps in existing VQA models when confronted with culturally specific content. Our dataset includes images from eight SEA countries, curated from the UNESCO Cultural Heritage collection. Our evaluation, comparing GPT-4 and GEMINI models, demonstrates substantial performance drops on culture-centric questions compared to the A-OKVQA dataset, a commonsense and world-knowledge VQA benchmark comprising approximately 25,000 questions. Our findings underscore the importance of cultural diversity in VQA datasets and reveal substantial gaps in the ability of current VQA models to handle culturally rich contexts. SEA-VQA serves as a crucial benchmark for identifying these gaps and guiding future improvements in VQA systems.

Anthology ID: 2024.alvr-1.15
Volume: Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, William Wang
Venues: ALVR | WS
Publisher: Association for Computational Linguistics
Pages: 173–185
URL: https://aclanthology.org/2024.alvr-1.15/
DOI: 10.18653/v1/2024.alvr-1.15
Cite (ACL): Norawit Urailertprasert, Peerat Limkonchotiwat, Supasorn Suwajanakorn, and Sarana Nutanong. 2024. SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 173–185, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering (Urailertprasert et al., ALVR 2024)
PDF: https://aclanthology.org/2024.alvr-1.15.pdf
