Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, Shih-Fu Chang


Abstract
Visual commonsense understanding requires vision-language (VL) models not only to understand image and text but also to cross-reference between them to fully comprehend the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, because evaluation data resources are limited, it is unclear whether the models really understand the visual scene and the underlying commonsense knowledge. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models’ understanding of the visual scene, text, and related knowledge. We then take a step further to show that training with the ME data boosts the model’s performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not the reverse; (2) visual information is generally underutilized compared with text.
Anthology ID:
2022.emnlp-main.626
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9212–9224
URL:
https://aclanthology.org/2022.emnlp-main.626
DOI:
10.18653/v1/2022.emnlp-main.626
Cite (ACL):
Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, and Shih-Fu Chang. 2022. Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9212–9224, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense (Wang et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.626.pdf