Masked Path Modeling for Vision-and-Language Navigation

Zi-Yi Dou, Feng Gao, Nanyun Peng


Abstract
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments based on natural language instructions. A major challenge in VLN is the limited available training data, which hinders the models’ ability to generalize effectively. Previous approaches have attempted to alleviate this issue by using external tools to generate pseudo-labeled data or integrating web-scaled image-text pairs during training. However, these methods often rely on automatically-generated or out-of-domain data, leading to challenges such as suboptimal data quality and domain mismatch. In this paper, we introduce a masked path modeling (MPM) objective. MPM pretrains an agent using self-collected data for subsequent navigation tasks, eliminating the need for external tools. Specifically, our method allows the agent to explore navigation environments and record the paths it traverses alongside the corresponding agent actions. Subsequently, we train the agent on this collected data to reconstruct the original action sequence when given a randomly masked subsequence of the original path. This approach enables the agent to accumulate a diverse and substantial dataset, facilitating the connection between visual observations of paths and the agent’s actions, which is the foundation of the VLN task. Importantly, the collected data are in-domain, and the training process avoids synthetic data with uncertain quality, addressing previous issues. We conduct experiments on various VLN datasets and demonstrate the applications of MPM across different levels of instruction complexity. Our results exhibit significant improvements in success rates, with enhancements of 1.3%, 1.1%, and 1.2% on the val-unseen split of the Room-to-Room, Room-for-Room, and Room-across-Room datasets, respectively. Additionally, we underscore the adaptability of MPM as well as the potential for additional improvements when the agent is allowed to explore unseen environments prior to testing.
Anthology ID:
2023.findings-emnlp.1019
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
15255–15269
Language:
URL:
https://aclanthology.org/2023.findings-emnlp.1019
DOI:
10.18653/v1/2023.findings-emnlp.1019
Bibkey:
Cite (ACL):
Zi-Yi Dou, Feng Gao, and Nanyun Peng. 2023. Masked Path Modeling for Vision-and-Language Navigation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15255–15269, Singapore. Association for Computational Linguistics.
Cite (Informal):
Masked Path Modeling for Vision-and-Language Navigation (Dou et al., Findings 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.findings-emnlp.1019.pdf