OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models

Yuxuan Kuang, Hai Lin, Meng Jiang


Abstract
Object navigation (ObjectNav) requires an agent to navigate through unseen environments to find queried objects. Many previous methods attempted to solve this task by relying on supervised or reinforcement learning, where they are trained on limited household datasets with closed-set objects. However, two key challenges remain unsolved: understanding free-form natural language instructions that demand open-set objects, and generalizing to new environments in a zero-shot manner. To address these two challenges, in this paper, we propose **OpenFMNav**, an **Open**-set **F**oundation **M**odel based framework for zero-shot object **Nav**igation. We first unleash the reasoning abilities of large language models (LLMs) to extract proposed objects from natural language instructions that meet the user’s demand. We then leverage the generalizability of large vision-language models (VLMs) to actively discover and detect candidate objects in the scene, building a *Versatile Semantic Score Map (VSSM)*. Then, by conducting commonsense reasoning on the *VSSM*, our method can perform effective language-guided exploration and exploitation of the scene and finally reach the goal. By leveraging the reasoning and generalizing abilities of foundation models, our method can understand free-form human instructions and perform effective open-set zero-shot navigation in diverse environments. Extensive experiments on the HM3D ObjectNav benchmark show that our method surpasses all strong baselines on all metrics, demonstrating its effectiveness. Furthermore, we perform real-robot demonstrations to validate our method’s open-set capability and generalizability to real-world environments.
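To make the pipeline described in the abstract concrete, the sketch below outlines its three stages: LLM-based object proposal, VLM-based detection accumulated into a *VSSM*, and reasoning-driven frontier selection. This is a minimal illustration under assumed interfaces, not the authors' code: every name (`llm.propose_objects`, `vlm.detect`, `llm.select_frontier`, `env` and its methods) is a hypothetical placeholder.

```python
# Minimal sketch of the OpenFMNav pipeline from the abstract.
# All interfaces (llm, vlm, env and their methods) are hypothetical
# placeholders, not the authors' actual API.
from dataclasses import dataclass, field

@dataclass
class VSSM:
    """Versatile Semantic Score Map: per-cell semantic scores for
    candidate objects, as described in the abstract."""
    scores: dict = field(default_factory=dict)  # (x, y) -> {label: score}

    def update(self, cell, label, score):
        # Keep the highest confidence seen for each (cell, label) pair.
        cell_scores = self.scores.setdefault(cell, {})
        cell_scores[label] = max(cell_scores.get(label, 0.0), score)

def navigate(instruction, env, llm, vlm, max_steps=500):
    # 1) LLM reasoning: extract proposed objects that meet the user's
    #    demand from the free-form instruction.
    proposed = llm.propose_objects(instruction)
    vssm = VSSM()
    for _ in range(max_steps):
        obs = env.observe()
        # 2) VLM generalization: discover and detect candidate objects
        #    in the current view, projecting detections onto the map.
        for det in vlm.detect(obs.rgb, proposed):
            vssm.update(obs.project(det.bbox), det.label, det.confidence)
        # 3) Commonsense reasoning on the VSSM: decide whether to explore
        #    a new frontier or exploit a detected candidate.
        target = llm.select_frontier(vssm, proposed)
        if env.reached(target):
            return True  # goal reached
        env.step_toward(target)
    return False  # step budget exhausted
```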
Anthology ID:
2024.findings-naacl.24
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
338–351
URL:
https://aclanthology.org/2024.findings-naacl.24
Cite (ACL):
Yuxuan Kuang, Hai Lin, and Meng Jiang. 2024. OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 338–351, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models (Kuang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.24.pdf
Copyright:
2024.findings-naacl.24.copyright.pdf