HOLM: Hallucinating Objects with Language Models for Referring Expression Recognition in Partially-Observed Scenes

Volkan Cirik, Louis-Philippe Morency, Taylor Berg-Kirkpatrick


Abstract
AI systems embodied in the physical world face a fundamental challenge of partial observability: they operate with only a limited view and knowledge of the environment. This creates challenges when AI systems try to reason about language and its relationship with the environment: objects referred to through language (e.g., when giving many instructions) are not immediately visible. Actions by the AI system may be required to bring these objects into view. A good benchmark for studying this challenge is the Dynamic Referring Expression Recognition (dRER) task, where the goal is to find a target location by dynamically adjusting the field of view (FoV) in partially observed 360° scenes. In this paper, we introduce HOLM, Hallucinating Objects with Language Models, to address the challenge of partial observability. HOLM uses large pre-trained language models (LMs) to infer object hallucinations for the unobserved part of the environment. Our core intuition is that if a pair of objects frequently co-appear in an environment, our usage of language should reflect this fact about the world. Based on this intuition, we prompt language models to extract knowledge about object affinities, which gives us a proxy for the spatial relationships of objects. Our experiments show that HOLM performs better than state-of-the-art approaches on two datasets for dRER, allowing us to study generalization in both indoor and outdoor settings.
Anthology ID:
2022.acl-long.373
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5440–5453
URL:
https://aclanthology.org/2022.acl-long.373
DOI:
10.18653/v1/2022.acl-long.373
Cite (ACL):
Volkan Cirik, Louis-Philippe Morency, and Taylor Berg-Kirkpatrick. 2022. HOLM: Hallucinating Objects with Language Models for Referring Expression Recognition in Partially-Observed Scenes. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5440–5453, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
HOLM: Hallucinating Objects with Language Models for Referring Expression Recognition in Partially-Observed Scenes (Cirik et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.373.pdf
Data
Visual Genome