Show and Guide: Instructional-Plan Grounded Vision and Language Model

Diogo Glória-Silva, David Semedo, Joao Magalhaes


Abstract
Guiding users through complex procedural plans is an inherently multimodal task in which visually illustrated plan steps are crucial to delivering effective guidance. However, existing plan-following language models (LMs) are often not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we introduce cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to the semantic layers of multimodal instructional plans, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations that align textual plan steps with instructional video moments.
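As a rough illustration only (not the paper's actual architecture), the Conversational Video Moment Retrieval task can be framed as nearest-neighbor search in a shared text-video embedding space: the user query is embedded, compared against embeddings of each step-aligned video segment, and the best-matching segment's time span is returned. All function names, embeddings, and time spans below are hypothetical toy values:

```python
import numpy as np

def cosine_sim(query, segments):
    # Cosine similarity between one query vector and each row of a matrix.
    query = query / np.linalg.norm(query)
    segments = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    return segments @ query

def retrieve_step_moment(query_emb, segment_embs, segment_spans):
    """Return the (start, end) span of the video segment whose embedding
    is most similar to the user-query embedding, plus its score."""
    scores = cosine_sim(query_emb, segment_embs)
    best = int(np.argmax(scores))
    return segment_spans[best], float(scores[best])

# Toy example: three step-aligned video segments in a 4-d embedding space.
segments = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "preheat the oven"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "mix the batter"
    [0.0, 0.0, 1.0, 0.0],   # e.g. "bake for 30 minutes"
])
spans = [(0, 12), (12, 45), (45, 80)]  # (start_sec, end_sec) per segment

query = np.array([0.1, 0.9, 0.1, 0.0])  # user asks about the mixing step
span, score = retrieve_step_moment(query, segments, spans)
print(span)  # (12, 45)
```

In MM-PlanLLM this retrieval is performed by the model itself over learned cross-modal representations; the sketch only conveys the input/output shape of the task.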
Anthology ID:
2024.emnlp-main.1191
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
21371–21389
URL:
https://aclanthology.org/2024.emnlp-main.1191
Cite (ACL):
Diogo Glória-Silva, David Semedo, and Joao Magalhaes. 2024. Show and Guide: Instructional-Plan Grounded Vision and Language Model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21371–21389, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Show and Guide: Instructional-Plan Grounded Vision and Language Model (Glória-Silva et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1191.pdf