Explicit Object Relation Alignment for Vision and Language Navigation

Yue Zhang, Parisa Kordjamshidi


Abstract
In this paper, we investigate the problem of vision and language navigation. Solving this problem requires grounding the landmarks and spatial relations mentioned in textual instructions into the visual modality. We propose a neural agent named Explicit Object Relation Alignment Agent (EXOR), which explicitly aligns the spatial information in the instruction with the visual environment, including landmarks and the spatial relationships between the agent and those landmarks. Empirically, our proposed method surpasses the baseline by a large margin on the R2R dataset. We provide a comprehensive analysis demonstrating our model's spatial reasoning ability and explainability.
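For intuition only, the sketch below shows one plausible way to realize explicit landmark grounding of the kind the abstract describes: computing a soft alignment between embeddings of landmark phrases from the instruction and embeddings of objects detected in the agent's current view. This is an illustrative sketch, not the authors' released implementation; the function names, variable names, and embedding dimensions are hypothetical.

```python
# Illustrative sketch (not the paper's code): soft alignment between landmark
# phrases and detected objects via cosine similarity. All names and dimensions
# here are hypothetical placeholders.
import torch
import torch.nn.functional as F

def align_landmarks_to_objects(landmark_emb: torch.Tensor,
                               object_emb: torch.Tensor) -> torch.Tensor:
    """Return a soft alignment matrix of shape (num_landmarks, num_objects).

    landmark_emb: (num_landmarks, d) embeddings of landmark phrases
                  extracted from the instruction.
    object_emb:   (num_objects, d) embeddings of objects detected in the
                  agent's current panoramic view.
    """
    # Normalize so the dot product equals cosine similarity.
    lm = F.normalize(landmark_emb, dim=-1)
    ob = F.normalize(object_emb, dim=-1)
    sim = lm @ ob.t()  # (num_landmarks, num_objects)
    # Softmax over objects gives, for each landmark phrase, a distribution
    # over the candidate objects it may be grounded to.
    return sim.softmax(dim=-1)

# Toy usage with random 512-dimensional embeddings.
alignment = align_landmarks_to_objects(torch.randn(3, 512), torch.randn(8, 512))
print(alignment.shape)  # torch.Size([3, 8])
```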
Anthology ID:
2022.acl-srw.24
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Samuel Louvan, Andrea Madotto, Brielen Madureira
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
322–331
URL:
https://aclanthology.org/2022.acl-srw.24
DOI:
10.18653/v1/2022.acl-srw.24
Cite (ACL):
Yue Zhang and Parisa Kordjamshidi. 2022. Explicit Object Relation Alignment for Vision and Language Navigation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 322–331, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Explicit Object Relation Alignment for Vision and Language Navigation (Zhang & Kordjamshidi, ACL 2022)
PDF:
https://aclanthology.org/2022.acl-srw.24.pdf
Code:
hlr/object-grounding-for-vln