Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step Parsing

HaiXiang Zhu, Lixian Su, ShuangMing Mao, Jing Ye


Abstract
Visual grounding (VG) is an important task in vision and language that involves understanding the mutual relationship between query terms and images. However, existing VG datasets typically use simple and intuitive textual descriptions, with limited attribute and spatial information between images and text. Recently, the Scene Knowledge Visual Grounding (SK-VG) task has been introduced, which constructs VG datasets using visual knowledge and relational referential expressions. Due to the length of textual visual knowledge and the complexity of the referential relationships between entities, previous models have struggled with this task. Therefore, we propose ReadVG, a zero-shot, plug-and-play method that leverages the robust language understanding capabilities of Large Language Models (LLMs) to transform long visual knowledge texts into concise, information-dense visual descriptions. To improve the accuracy of target localisation, we employ a multi-step parsing algorithm that can progressively extract the query targets and their features from the visual knowledge and relational referencing expressions, thereby assisting multimodal models to more accurately localise the target for grounding purposes. Extensive experiments and case studies show that our approach can significantly improve the performance of multimodal grounding models.
Anthology ID:
2025.coling-main.76
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1136–1149
URL:
https://aclanthology.org/2025.coling-main.76/
Cite (ACL):
HaiXiang Zhu, Lixian Su, ShuangMing Mao, and Jing Ye. 2025. Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step Parsing. In Proceedings of the 31st International Conference on Computational Linguistics, pages 1136–1149, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step Parsing (Zhu et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.76.pdf