Toward Interactive Regional Understanding in Vision-Large Language Models

Jungbeom Lee, Sanghyuk Chun, Sangdoo Yun


Abstract
Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information about an image, limiting their regional understanding ability. In this work, we introduce RegionVLM, a model equipped with explicit regional modeling capabilities, allowing it to understand user-indicated image regions. To achieve this, we design a simple yet innovative approach that requires no modification to the model architecture or objective function. Additionally, we leverage a dataset containing a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only enables an interactive dialogue system but also exhibits superior performance on various zero-shot region understanding tasks, without compromising its ability for global image understanding.
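As context for the abstract's claim that regions can be handled without changing the model architecture, the following is a minimal, hypothetical sketch of one way a user-indicated region (e.g., a mouse trace like those in Localized Narratives) might be serialized into plain text and combined with a question, so an otherwise unmodified vision-language model can be prompted about it. All function names, the token format, and the parameters here are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: serialize a region trace into text tokens for a
# vision-language model prompt. Names and formats are assumptions.

from typing import List, Tuple

Point = Tuple[float, float]  # normalized (x, y) coordinates in [0, 1]

def quantize_trace(trace: List[Point], num_bins: int = 100) -> List[Tuple[int, int]]:
    """Bucket normalized coordinates into integer bins so the region can be
    expressed with a small, fixed vocabulary of location tokens."""
    clamp = lambda v: max(0, min(num_bins - 1, int(v * num_bins)))
    return [(clamp(x), clamp(y)) for x, y in trace]

def region_prompt(trace: List[Point], question: str, max_points: int = 8) -> str:
    """Subsample the trace and render it as plain text the language model can
    read alongside the question; no architectural change is required."""
    step = max(1, len(trace) // max_points)
    points = quantize_trace(trace[::step])
    region = " ".join(f"({x},{y})" for x, y in points)
    return f"Region: {region}. Question: {question}"

if __name__ == "__main__":
    # A short diagonal scribble over the upper-left quadrant of the image.
    scribble = [(0.10, 0.10), (0.15, 0.18), (0.22, 0.25), (0.30, 0.33)]
    print(region_prompt(scribble, "What object is here?"))
    # -> Region: (10,10) (15,18) (22,25) (30,33). Question: What object is here?
```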
Anthology ID:
2024.naacl-long.356
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
6416–6429
URL:
https://aclanthology.org/2024.naacl-long.356
Cite (ACL):
Jungbeom Lee, Sanghyuk Chun, and Sangdoo Yun. 2024. Toward Interactive Regional Understanding in Vision-Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6416–6429, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Toward Interactive Regional Understanding in Vision-Large Language Models (Lee et al., NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.356.pdf
Copyright:
2024.naacl-long.356.copyright.pdf