ARKitSceneRefer: Text-based Localization of Small Objects in Diverse Real-World 3D Indoor Scenes

Shunya Kato, Shuhei Kurita, Chenhui Chu, Sadao Kurohashi


Abstract
3D referring expression comprehension is the task of grounding text representations onto objects in 3D scenes. It is crucial for indoor household robots and augmented reality devices, which must localize the objects referred to in user instructions. However, existing indoor 3D referring expression comprehension datasets typically cover larger object classes that are easy to localize, such as chairs, tables, or doors, and often overlook small objects, such as cooking tools or office supplies. Building on ARKitScenes, a recently proposed diverse and high-resolution 3D scene dataset, we construct the ARKitSceneRefer dataset, which focuses on small daily-use objects that frequently appear in real-world indoor scenes. ARKitSceneRefer contains 15k objects in 1,605 indoor scenes, significantly more than existing 3D referring datasets, and covers 583 diverse object classes from the LVIS dataset. In empirical experiments with state-of-the-art 2D and 3D referring expression comprehension models, we observed how difficult localization is across these diverse small object classes.
Anthology ID:
2023.findings-emnlp.56
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
784–799
URL:
https://aclanthology.org/2023.findings-emnlp.56
DOI:
10.18653/v1/2023.findings-emnlp.56
Cite (ACL):
Shunya Kato, Shuhei Kurita, Chenhui Chu, and Sadao Kurohashi. 2023. ARKitSceneRefer: Text-based Localization of Small Objects in Diverse Real-World 3D Indoor Scenes. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 784–799, Singapore. Association for Computational Linguistics.
Cite (Informal):
ARKitSceneRefer: Text-based Localization of Small Objects in Diverse Real-World 3D Indoor Scenes (Kato et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.56.pdf