Mi Tian


2025

Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis
Kaiwei Sun | Mi Tian
Proceedings of the 31st International Conference on Computational Linguistics

Multimodal Sentiment Analysis (MSA) aims to identify human attitudes from diverse modalities such as visual, audio and text. Recent studies suggest that the text modality tends to be the most effective, which has encouraged models to treat text as their core modality. However, previous methods primarily concentrate on projecting the non-text modalities into a space close to the text modality and learning an identical representation, which does not make full use of the auxiliary information provided by the audio and visual modalities. In this paper, we propose a framework, Sequential Fusion of Text-close and Text-far Representations (SFTTR), which aims to refine multimodal representations so that they contain both representations close to and representations far from the text modality. Specifically, we employ contrastive learning to thoroughly explore the similarities and differences in information between the text and audio/visual modalities. Moreover, to fuse the extracted representations more effectively, we design a sequential cross-modal encoder that fuses the text-close and text-far representations in sequence.
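
As an illustration of the kind of contrastive objective the abstract refers to, the sketch below computes a generic InfoNCE-style loss between text embeddings and embeddings from another modality. It is not the SFTTR loss, which the abstract does not specify: the function names `cosine` and `info_nce`, the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(text, other, temperature=0.1):
    """Mean InfoNCE loss (hypothetical helper, not from the paper).

    text[i] and other[i] are treated as a positive pair; every other[j]
    with j != i serves as a negative for text[i].
    """
    total = 0.0
    for i, t in enumerate(text):
        logits = [cosine(t, o) / temperature for o in other]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(text)

# Toy 2-D embeddings (assumed values, for illustration only).
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each sample lies near its text partner
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each sample lies near the wrong partner

loss_aligned = info_nce(text, aligned)
loss_swapped = info_nce(text, swapped)
print(loss_aligned < loss_swapped)
```

Minimizing a loss of this form pulls each audio or visual embedding toward its paired text embedding, which corresponds only to the text-close side of the idea. How the framework obtains text-far representations is not described in the abstract and is not modeled here.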