%0 Conference Proceedings
%T MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences
%A Yang, Jianing
%A Wang, Yongxin
%A Yi, Ruitao
%A Zhu, Yuying
%A Rehman, Azaan
%A Zadeh, Amir
%A Poria, Soujanya
%A Morency, Louis-Philippe
%Y Toutanova, Kristina
%Y Rumshisky, Anna
%Y Zettlemoyer, Luke
%Y Hakkani-Tur, Dilek
%Y Beltagy, Iz
%Y Bethard, Steven
%Y Cotterell, Ryan
%Y Chakraborty, Tanmoy
%Y Zhou, Yichao
%S Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
%D 2021
%8 June
%I Association for Computational Linguistics
%C Online
%F yang-etal-2021-mtag
%X Human communication is multimodal in nature; it is through multiple modalities such as language, voice, and facial expressions, that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Modal-Temporal Attention Graph (MTAG). MTAG is an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data. We first introduce a procedure to convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions across modalities and through time. Then, a novel graph fusion operation, called MTAG fusion, along with a dynamic pruning and read-out technique, is designed to efficiently process this modal-temporal graph and capture various interactions. By learning to focus only on the important interactions within the graph, MTAG achieves state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks, while utilizing significantly fewer model parameters.
%R 10.18653/v1/2021.naacl-main.79
%U https://aclanthology.org/2021.naacl-main.79
%U https://doi.org/10.18653/v1/2021.naacl-main.79
%P 1009-1021