Multimodal Intent Discovery from Livestream Videos

Adyasha Maharana, Quan Tran, Franck Dernoncourt, Seunghyun Yoon, Trung Bui, Walter Chang, Mohit Bansal


Abstract
Individuals, educational institutions, and businesses are prolific at generating instructional video content such as “how-to” and tutorial guides. While significant progress has been made in basic video understanding tasks, identifying procedural intent within these instructional videos is a challenging and important task that remains unexplored, yet it is essential to video summarization, search, and recommendations. This paper introduces the problem of instructional intent identification and extraction from software instructional livestreams. We construct and present a new multimodal dataset consisting of software instructional livestreams and containing manual annotations for both detailed and abstract procedural intent that enable training and evaluation of joint video and text understanding models. We then introduce a multimodal cascaded cross-attention model to efficiently combine the weaker and noisier video signal with the more discriminative text signal. Our experiments show that our proposed model brings significant gains compared to strong baselines, including large-scale pretrained multimodal models. Our analysis further identifies that the task benefits from spatial as well as motion features extracted from videos, and provides insight into how the video signal is preferentially used for intent discovery. We also show that current models struggle to comprehend the nature of abstract intents, revealing important gaps in multimodal understanding and paving the way for future work.
Anthology ID:
2022.findings-naacl.36
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Venues:
Findings | NAACL
Publisher:
Association for Computational Linguistics
Pages:
476–489
URL:
https://aclanthology.org/2022.findings-naacl.36
DOI:
10.18653/v1/2022.findings-naacl.36
Cite (ACL):
Adyasha Maharana, Quan Tran, Franck Dernoncourt, Seunghyun Yoon, Trung Bui, Walter Chang, and Mohit Bansal. 2022. Multimodal Intent Discovery from Livestream Videos. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 476–489, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Multimodal Intent Discovery from Livestream Videos (Maharana et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-naacl.36.pdf
Software:
2022.findings-naacl.36.software.zip