NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM

Zihan Wang; Yaohui Zhu; Gim Hee Lee; Yachun Fan

doi:10.18653/v1/2025.findings-acl.442

Abstract

Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions. High-performance navigation models require a large amount of training data, but the high cost of manual annotation has seriously hindered the field. Some previous methods therefore translate trajectory videos into step-by-step instructions to expand the data, but such instructions do not match users’ communication styles well, since users tend to briefly describe destinations or state specific needs. Moreover, local navigation trajectories overlook global context and high-level task planning. To address these issues, we propose NavRAG, a retrieval-augmented generation (RAG) framework that generates user demand instructions for VLN. NavRAG leverages an LLM to build a hierarchical scene description tree for 3D scene understanding, from global layout to local details, then simulates various user roles with specific demands to retrieve from the scene tree, generating diverse instructions with the LLM. We annotate over 2 million navigation instructions across 861 scenes and evaluate the data quality and navigation performance of trained models. The model trained on our NavRAG dataset achieves SOTA performance on the REVERIE benchmark.
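
The retrieval idea in the abstract (a scene tree searched from global layout down to local details, driven by a simulated user demand) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the `SceneNode` class, the example house, and the word-overlap score, which stands in for the LLM-based scene understanding and retrieval the paper describes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One node of a hypothetical scene description tree (scene > room > object)."""
    name: str
    description: str
    children: list[SceneNode] = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Count shared lowercase words; a crude stand-in for LLM relevance scoring."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def subtree_score(query: str, node: SceneNode) -> int:
    """Best overlap found anywhere in the subtree rooted at `node`."""
    scores = [overlap(query, node.description)]
    scores += [subtree_score(query, child) for child in node.children]
    return max(scores)


def retrieve(query: str, root: SceneNode) -> list[str]:
    """Walk from global layout to local detail, following the best-scoring child."""
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: subtree_score(query, child))
        path.append(node.name)
    return path


def make_instruction(role: str, demand: str, path: list[str]) -> str:
    """Pair a simulated user's demand with the retrieved target location."""
    return f"[{role}] {demand} (target: {' > '.join(path)})"


# Hypothetical example scene, invented for this sketch.
house = SceneNode("house", "two-floor family home", [
    SceneNode("kitchen", "cooking area on the ground floor", [
        SceneNode("fridge", "fridge holding cold drinks"),
        SceneNode("coffee machine", "espresso maker on the counter"),
    ]),
    SceneNode("bedroom", "sleeping area upstairs", [
        SceneNode("bed", "double bed with pillows"),
        SceneNode("lamp", "reading lamp beside the bed"),
    ]),
])

demand = "I need cold drinks"
path = retrieve(demand, house)
print(make_instruction("guest", demand, path))
```

In this toy version the demand is matched by word overlap only, so it fails on paraphrases such as "I am thirsty"; the abstract states that NavRAG uses an LLM for both tree construction and instruction generation, which is what this scoring function merely approximates.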

Anthology ID: 2025.findings-acl.442
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8430–8440
URL: https://aclanthology.org/2025.findings-acl.442/
DOI: 10.18653/v1/2025.findings-acl.442
Cite (ACL): Zihan Wang, Yaohui Zhu, Gim Hee Lee, and Yachun Fan. 2025. NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8430–8440, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM (Wang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.442.pdf
