DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes

Zhaowei Wang; Hongming Zhang; Tianqing Fang; Ye Tian; Yue Yang; Kaixin Ma; Xiaoman Pan; Yangqiu Song; Dong Yu (于东)

doi:10.18653/v1/2025.findings-emnlp.513

Abstract

Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DivScene, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluate various methods with LVLMs and LLMs on our dataset and find that current models still fall short in open-vocabulary object navigation. We then fine-tune LVLMs to predict the next action with chain-of-thought (CoT) explanations. We observe that LVLMs' navigation ability can be improved substantially using only BFS-generated shortest paths, without any human supervision, surpassing GPT-4o by over 20% in success rate.
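The abstract says the training trajectories are BFS-generated shortest paths. As a rough illustration only, the sketch below runs breadth-first search on a small occupancy grid. The grid encoding (0 = free, 1 = blocked), the four-connected moves, and the function name `bfs_shortest_path` are assumptions made for this example and are not taken from the paper, which may discretize houses and actions differently.

```python
from collections import deque


def bfs_shortest_path(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    Hypothetical helper: grid is a list of lists where 0 marks a free cell
    and 1 marks an obstacle. Moves are limited to up, down, left, right.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            in_bounds = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if in_bounds and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal is unreachable


# Example: a wall in the middle row forces a detour along the right edge.
house = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = bfs_shortest_path(house, (0, 0), (2, 0))
```

Because BFS expands cells in order of distance from the start, the first time it reaches the goal the recovered path has the minimum number of steps. Consecutive cells on such a path could then be converted into per-step action labels for supervised fine-tuning, though how the paper performs that conversion is not stated in the abstract.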

Anthology ID: 2025.findings-emnlp.513
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9666–9686
URL: https://aclanthology.org/2025.findings-emnlp.513/
DOI: 10.18653/v1/2025.findings-emnlp.513
Cite (ACL): Zhaowei Wang, Hongming Zhang, Tianqing Fang, Ye Tian, Yue Yang, Kaixin Ma, Xiaoman Pan, Yangqiu Song, and Dong Yu. 2025. DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9666–9686, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes (Wang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.513.pdf
Checklist: 2025.findings-emnlp.513.checklist.pdf
