Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, Nanyun Peng


Abstract
The ability to sequence unordered events is evidence of comprehension and reasoning about real-world tasks and procedures, and it is essential for applications such as task planning and multi-source instruction summarization. Sequencing often requires a thorough understanding of temporal common sense and multimodal information, since these procedures are often conveyed by a combination of text and images. While humans are capable of reasoning about and sequencing unordered procedural instructions, the extent to which current machine learning methods possess this capability is still an open question. In this work, we benchmark models’ capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from online instructional manuals and collecting comprehensive human annotations. We find that current state-of-the-art models not only perform significantly worse than humans but also seem incapable of efficiently utilizing multimodal information. To improve machines’ performance on multimodal event sequencing, we propose sequence-aware pretraining techniques that exploit the sequential alignment properties of both texts and images, resulting in improvements of more than 5% in perfect match ratio.
Anthology ID:
2022.acl-long.310
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
4525–4542
URL:
https://aclanthology.org/2022.acl-long.310
DOI:
10.18653/v1/2022.acl-long.310
Cite (ACL):
Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, and Nanyun Peng. 2022. Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4525–4542, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals (Wu et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.310.pdf
Software:
2022.acl-long.310.software.zip
Video:
https://aclanthology.org/2022.acl-long.310.mp4
Data:
RecipeQA, WikiHow