Word-Aware Modality Stimulation for Multimodal Fusion

Shuhei Tateishi, Makoto Nakatsuji, Yasuhito Osugi


Abstract
Multimodal learning is generally expected to make more accurate predictions than text-only analysis. Although various methods for fusing multimodal inputs have been proposed for sentiment analysis tasks, we found that fusion methods based on attention-based language models may be inhibited from learning non-verbal modalities. Non-verbal modalities are isolated from linguistic semantics and contexts and do not include them, which makes them unsuitable for applying attention to the text modality during the fusion phase. To address this issue, we propose Word-aware Modality Stimulation Fusion (WA-MSF), which facilitates integration of non-verbal modalities with the text modality. The Modality Stimulation Unit layer (MSU-layer) is the core concept of WA-MSF; it integrates language contexts and semantics into non-verbal modalities, thereby instilling linguistic essence into them. Moreover, WA-MSF uses aMLP in the fusion phase to utilize spatial and temporal representations of non-verbal modalities more effectively than transformer fusion. In our experiments, WA-MSF set a new state of the art on sentiment prediction tasks.
Anthology ID:
2024.lrec-main.1536
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
17664–17674
URL:
https://aclanthology.org/2024.lrec-main.1536
Cite (ACL):
Shuhei Tateishi, Makoto Nakatsuji, and Yasuhito Osugi. 2024. Word-Aware Modality Stimulation for Multimodal Fusion. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17664–17674, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Word-Aware Modality Stimulation for Multimodal Fusion (Tateishi et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1536.pdf
Optional supplementary material:
 2024.lrec-main.1536.OptionalSupplementaryMaterial.zip