Katie Baker


2024

pdf bib
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
Alessandro Suglia | Claudio Greco | Katie Baker | Jose L. Part | Ioannis Papaioannou | Arash Eshghi | Ioannis Konstas | Oliver Lemon
Findings of the Association for Computational Linguistics: EMNLP 2024

AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present , a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate ‘s capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%.Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the advancement of next-generation Embodied AI.

2021

pdf bib
The Spoon Is in the Sink: Assisting Visually Impaired People in the Kitchen
Katie Baker | Amit Parekh | Adrien Fabre | Angus Addlesee | Ruben Kruiper | Oliver Lemon
Proceedings of the Reasoning and Interaction Conference (ReInAct 2021)

Visual Question Answering (VQA) systems are increasingly adept at a variety of tasks, and this technology can be used to assist blind and partially sighted people. To do this, the system’s responses must not only be accurate, but usable. It is also vital for assistive technologies to be designed with a focus on: (1) privacy, as the camera may capture a user’s mail, medication bottles, or other sensitive information; (2) transparency, so that the system’s behaviour can be explained and trusted by users; and (3) controllability, to tailor the system for a particular domain or user group. We have therefore extended a conversational VQA framework, called Aye-saac, with these objectives in mind. Specifically, we gave Aye-saac the ability to answer visual questions in the kitchen, a particularly challenging area for visually impaired people. Our system can now answer questions about quantity, positioning, and system confidence in regards to 299 kitchen objects. Questions about the spatial relations between these objects are particularly helpful to visually impaired people, and our system output more usable answers than other state of the art end-to-end VQA systems.