Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions

Torsten Wörtwein, Lisa Sheeber, Nicholas Allen, Jeffrey Cohn, Louis-Philippe Morency


Abstract
Multimodal fusion addresses the problem of analyzing spoken words in their multimodal context, including visual expressions and prosodic cues. Even when multimodal models lead to performance improvements, it is often unclear whether bimodal and trimodal interactions are learned or whether modalities are processed independently of each other. We propose Multimodal Residual Optimization (MRO) to separate unimodal, bimodal, and trimodal interactions in a multimodal model. This improves interpretability because the contribution of each multimodal interaction can be quantified. Inspired by Occam’s razor, the main intuition of MRO is that (simpler) unimodal contributions should be learned before (more complex) bimodal and trimodal interactions. For example, bimodal predictions should learn to correct the mistakes (residuals) of unimodal predictions, thereby letting the bimodal predictions focus on the remaining bimodal interactions. Empirically, we observe that MRO successfully separates unimodal, bimodal, and trimodal interactions without degrading predictive performance. We complement our empirical results with a human perception study and observe that MRO learns multimodal interactions that align with human judgments.
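The residual intuition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it swaps in stage-wise ordinary least squares on synthetic data, and the names `fit_least_squares`, `residual_fit`, `x_a`, and `x_b` are hypothetical. It only shows the idea that a bimodal term is fit to whatever the unimodal (additive) predictions leave unexplained, so that each part's contribution can be measured separately.

```python
import numpy as np


def fit_least_squares(features, targets):
    """Return the weights minimizing ||features @ w - targets||^2."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights


def residual_fit(x_a, x_b, y):
    """Fit an additive unimodal model first, then a bimodal term on its residuals.

    Assumption: each modality is a single scalar feature and the only
    bimodal interaction is the product x_a * x_b (toy setting).
    """
    # Stage 1: unimodal predictions, each modality contributes independently.
    unimodal_features = np.column_stack([np.ones_like(x_a), x_a, x_b])
    w_uni = fit_least_squares(unimodal_features, y)
    y_uni = unimodal_features @ w_uni

    # Stage 2: the bimodal term only sees the mistakes (residuals) of stage 1.
    bimodal_features = (x_a * x_b)[:, None]
    w_bi = fit_least_squares(bimodal_features, y - y_uni)
    y_bi = bimodal_features @ w_bi
    return y_uni, y_bi, w_bi


# Synthetic data: mostly additive signal plus a weaker multiplicative interaction.
rng = np.random.default_rng(0)
x_a = rng.normal(size=2000)
x_b = rng.normal(size=2000)
y = 2.0 * x_a - 1.0 * x_b + 0.5 * x_a * x_b

y_uni, y_bi, w_bi = residual_fit(x_a, x_b, y)

# Quantify how much of the prediction variance comes from the bimodal part.
bimodal_share = np.var(y_bi) / (np.var(y_uni) + np.var(y_bi))
mse = np.mean((y - (y_uni + y_bi)) ** 2)
print(f"bimodal weight: {w_bi[0]:.3f}, bimodal share: {bimodal_share:.3f}, mse: {mse:.5f}")
```

Because the bimodal term never has to re-explain what the additive model already captured, its share of the prediction variance serves as a rough measure of how much non-additive interaction the data contains in this toy setting.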
Anthology ID:
2022.findings-emnlp.344
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4681–4696
URL:
https://aclanthology.org/2022.findings-emnlp.344
DOI:
10.18653/v1/2022.findings-emnlp.344
Cite (ACL):
Torsten Wörtwein, Lisa Sheeber, Nicholas Allen, Jeffrey Cohn, and Louis-Philippe Morency. 2022. Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4681–4696, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions (Wörtwein et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.344.pdf