MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

Haofei Yu, Zhengyang Qi, Lawrence Jang, Russ Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang


Abstract
Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today’s multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train a separate expert model for each type of multimodal interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE can also be applied to various types of models to improve their performance.
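The idea of one expert per interaction type can be illustrated with a minimal Python sketch. The abstract does not say how a sample is assigned to an interaction type, so the rule in `interaction_type` (comparing two unimodal predictions with a fused prediction) is an assumption made purely for illustration, and the names `interaction_type`, `mmoe_predict`, `Sample`, and `Expert` are hypothetical rather than taken from the paper.

```python
from typing import Callable, Dict, Tuple

Sample = Tuple[str, str]            # (text, description of the other modality)
Expert = Callable[[Sample], float]  # hypothetical expert: returns a score in [0, 1]


def interaction_type(pred_a: int, pred_b: int, pred_fused: int) -> str:
    """Assign a sample to an interaction type.

    ASSUMPTION: this agreement-based rule is illustrative only; the
    abstract does not specify how samples are categorized.
    """
    if pred_a != pred_b:
        # The two modalities disagree: one carries information the other lacks.
        return "uniqueness"
    if pred_a == pred_fused:
        # Both modalities and the fused view agree.
        return "redundancy"
    # The modalities agree with each other, but fusing them changes the answer.
    return "synergy"


def mmoe_predict(
    sample: Sample,
    preds: Tuple[int, int, int],
    experts: Dict[str, Expert],
) -> Tuple[str, float]:
    """Dispatch the sample to the expert for its interaction type."""
    kind = interaction_type(*preds)
    return kind, experts[kind](sample)


if __name__ == "__main__":
    # Dummy experts returning fixed scores, standing in for trained models.
    experts: Dict[str, Expert] = {
        "redundancy": lambda s: 0.1,
        "uniqueness": lambda s: 0.5,
        "synergy": lambda s: 0.9,
    }
    sample = ("Oh great, another meeting.", "eye roll, flat tone")
    print(mmoe_predict(sample, (0, 0, 1), experts))
```

Here the experts are placeholder functions; in practice each would be a model trained only on samples of its own interaction type.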
Anthology ID:
2024.emnlp-main.558
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10006–10030
URL:
https://aclanthology.org/2024.emnlp-main.558
Cite (ACL):
Haofei Yu, Zhengyang Qi, Lawrence Jang, Russ Salakhutdinov, Louis-Philippe Morency, and Paul Pu Liang. 2024. MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10006–10030, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts (Yu et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.558.pdf