Jose L. Part
2024
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
Alessandro Suglia
|
Claudio Greco
|
Katie Baker
|
Jose L. Part
|
Ioannis Papaioannou
|
Arash Eshghi
|
Ioannis Konstas
|
Oliver Lemon
Findings of the Association for Computational Linguistics: EMNLP 2024
AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present , a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate ‘s capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%.Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the advancement of next-generation Embodied AI.
2022
A Visually-Aware Conversational Robot Receptionist
Nancie Gunson
|
Daniel Hernandez Garcia
|
Weronika Sieińska
|
Angus Addlesee
|
Christian Dondrup
|
Oliver Lemon
|
Jose L. Part
|
Yanchao Yu
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Socially Assistive Robots (SARs) have the potential to play an increasingly important role in a variety of contexts including healthcare, but most existing systems have very limited interactive capabilities. We will demonstrate a robot receptionist that not only supports task-based and social dialogue via natural spoken conversation but is also capable of visually grounded dialogue; able to perceive and discuss the shared physical environment (e.g. helping users to locate personal belongings or objects of interest). Task-based dialogues include check-in, navigation and FAQs about facilities, alongside social features such as chit-chat, access to the latest news and a quiz game to play while waiting. We also show how visual context (objects and their spatial relations) can be combined with linguistic representations of dialogue context, to support visual dialogue and question answering. We will demonstrate the system on a humanoid ARI robot, which is being deployed in a hospital reception area.