AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou


Abstract
It is still a pipe dream that personal AI assistants on phones and AR glasses can help in our daily lives by answering questions like “how to adjust the date for this watch?” and “how to set its heating duration? (while pointing at an oven)”. The queries used in conventional tasks (i.e. Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based purely on text. In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR). Each of our questions is an image-box-text query that focuses on the affordances of items in our daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of the TQVSR task, we construct a new dataset called AssistSR, designing novel guidelines to create high-quality samples. The dataset contains 3.2k multimodal questions on 1.6k video segments from instructional videos on diverse daily-used items. To address TQVSR, we develop a simple yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still leaving large room for future improvement. Moreover, we present detailed ablation analyses. Code and data are available at https://github.com/StanLei52/TQVSR.
Anthology ID:
2022.findings-emnlp.24
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
319–338
URL:
https://aclanthology.org/2022.findings-emnlp.24
DOI:
10.18653/v1/2022.findings-emnlp.24
Cite (ACL):
Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, and Mike Zheng Shou. 2022. AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 319–338, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant (Lei et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.24.pdf
Video:
https://aclanthology.org/2022.findings-emnlp.24.mp4