Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Yixin Yang; Zheng Li; Qingxiu Dong; Heming Xia; Zhifang Sui

Abstract

Understanding the deep semantics of images is essential in an era dominated by social media. However, current research focuses primarily on the superficial description of images, revealing a notable lack of systematic investigation into their inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark for assessing the capacity of Large Multimodal Models (LMMs) to understand visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Using DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and those of humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies according to the specific facets of deep semantics explored, indicating the fundamental challenges that remain in developing LMMs.

Anthology ID: 2024.findings-acl.113
Volume: Findings of the Association for Computational Linguistics ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand and virtual meeting
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1898–1912
URL: https://aclanthology.org/2024.findings-acl.113
Cite (ACL): Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, and Zhifang Sui. 2024. Can Large Multimodal Models Uncover Deep Semantics Behind Images? In Findings of the Association for Computational Linguistics ACL 2024, pages 1898–1912, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Can Large Multimodal Models Uncover Deep Semantics Behind Images? (Yang et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.113.pdf
