Kanishk Jain
2024
Benchmarking Vision Language Models for Cultural Understanding
Shravan Nayak | Kanishk Jain | Rabiul Awal | Siva Reddy | Sjoerd Van Steenkiste | Lisa Anne Hendricks | Karolina Stanczak | Aishwarya Agrawal
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Foundation models and vision-language pre-training have notably advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding (recognizing objects, attributes, and actions) rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a diverse collection of 2,378 image-question pairs, with 1-5 answers per question, representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparities in their level of cultural understanding across regions, with strong capabilities for North America but significantly weaker capabilities for Africa. We also observe disparities in their performance across cultural facets, with clothing, rituals, and traditions seeing higher performance than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.
2022
Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Kanishk Jain | Vineet Gandhi
Findings of the Association for Computational Linguistics: ACL 2022
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a natural language description. Addressing RIS efficiently requires considering both the interactions happening across the visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, showing considerable performance gains over the existing state-of-the-art (SOTA) methods.