@inproceedings{sun-tian-2025-sequential,
title = "Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis",
author = "Sun, Kaiwei and
Tian, Mi",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.4/",
pages = "40--49",
abstract = "Multimodal Sentiment Analysis (MSA) aims to identify human attitudes from diverse modalities such as visual, audio and text modalities. Recent studies suggest that the text modality tends to be the most effective, which has encouraged models to consider text as its core modality. However, previous methods primarily concentrate on projecting modalities other than text into a space close to the text modality and learning an identical representation, which does not fully make use of the auxiliary information provided by audio and visual modalities. In this paper, we propose a framework, Sequential Fusion of Text-close and Text-far Representations (SFTTR), aiming to refine multimodal representations from multimodal data which should contain both representations close to and far from the text modality. Specifically, we employ contrastive learning to sufficiently explore the information similarities and differences between text and audio/visual modalities. Moreover, to fuse the extracted representations more effectively, we design a sequential cross-modal encoder to sequentially fuse representations that are close to and far from the text modality."
}
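As a reading aid for the abstract above: the contrastive-learning component can be pictured as a standard symmetric InfoNCE-style objective between utterance-level text and audio/visual embeddings, pulling matched pairs together and pushing mismatched pairs apart. The sketch below is only an illustration of that idea; the batch size, embedding dimension, temperature, and function name are assumptions, not the SFTTR authors' implementation.

```python
# Minimal PyTorch sketch of symmetric InfoNCE-style contrastive learning
# between text and audio/visual utterance embeddings. All names, shapes,
# and the temperature are illustrative assumptions, not the SFTTR code.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, other_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Pull matched (text, audio/visual) pairs together and push apart
    mismatched pairs within the batch."""
    text_emb = F.normalize(text_emb, dim=-1)           # (B, D)
    other_emb = F.normalize(other_emb, dim=-1)          # (B, D)
    logits = text_emb @ other_emb.t() / temperature     # (B, B) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Symmetric loss: text->other and other->text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in embeddings (batch of 8, dimension 128).
if __name__ == "__main__":
    t = torch.randn(8, 128)   # text utterance embeddings
    a = torch.randn(8, 128)   # audio (or visual) utterance embeddings
    print(info_nce(t, a).item())
```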
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="sun-tian-2025-sequential">
<titleInfo>
<title>Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kaiwei</namePart>
<namePart type="family">Sun</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mi</namePart>
<namePart type="family">Tian</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-01</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 31st International Conference on Computational Linguistics</title>
</titleInfo>
<name type="personal">
<namePart type="given">Owen</namePart>
<namePart type="family">Rambow</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Leo</namePart>
<namePart type="family">Wanner</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Marianna</namePart>
<namePart type="family">Apidianaki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hend</namePart>
<namePart type="family">Al-Khalifa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Barbara</namePart>
<namePart type="given">Di</namePart>
<namePart type="family">Eugenio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Steven</namePart>
<namePart type="family">Schockaert</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Abu Dhabi, UAE</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Multimodal Sentiment Analysis (MSA) aims to identify human attitudes from diverse modalities such as the visual, audio, and text modalities. Recent studies suggest that the text modality tends to be the most effective, which has encouraged models to treat text as their core modality. However, previous methods primarily concentrate on projecting the non-text modalities into a space close to the text modality and learning an identical representation, which does not fully exploit the auxiliary information provided by the audio and visual modalities. In this paper, we propose a framework, Sequential Fusion of Text-close and Text-far Representations (SFTTR), which aims to refine, from multimodal data, representations that contain both components close to and far from the text modality. Specifically, we employ contrastive learning to thoroughly explore the similarities and differences in information between the text and audio/visual modalities. Moreover, to fuse the extracted representations more effectively, we design a sequential cross-modal encoder that fuses the text-close and text-far representations in turn.</abstract>
<identifier type="citekey">sun-tian-2025-sequential</identifier>
<location>
<url>https://aclanthology.org/2025.coling-main.4/</url>
</location>
<part>
<date>2025-01</date>
<extent unit="page">
<start>40</start>
<end>49</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis
%A Sun, Kaiwei
%A Tian, Mi
%Y Rambow, Owen
%Y Wanner, Leo
%Y Apidianaki, Marianna
%Y Al-Khalifa, Hend
%Y Eugenio, Barbara Di
%Y Schockaert, Steven
%S Proceedings of the 31st International Conference on Computational Linguistics
%D 2025
%8 January
%I Association for Computational Linguistics
%C Abu Dhabi, UAE
%F sun-tian-2025-sequential
%X Multimodal Sentiment Analysis (MSA) aims to identify human attitudes from diverse modalities such as the visual, audio, and text modalities. Recent studies suggest that the text modality tends to be the most effective, which has encouraged models to treat text as their core modality. However, previous methods primarily concentrate on projecting the non-text modalities into a space close to the text modality and learning an identical representation, which does not fully exploit the auxiliary information provided by the audio and visual modalities. In this paper, we propose a framework, Sequential Fusion of Text-close and Text-far Representations (SFTTR), which aims to refine, from multimodal data, representations that contain both components close to and far from the text modality. Specifically, we employ contrastive learning to thoroughly explore the similarities and differences in information between the text and audio/visual modalities. Moreover, to fuse the extracted representations more effectively, we design a sequential cross-modal encoder that fuses the text-close and text-far representations in turn.
%U https://aclanthology.org/2025.coling-main.4/
%P 40-49
Markdown (Informal)
[Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis](https://aclanthology.org/2025.coling-main.4/) (Sun & Tian, COLING 2025)
ACL
Kaiwei Sun and Mi Tian. 2025. [Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis](https://aclanthology.org/2025.coling-main.4/). In *Proceedings of the 31st International Conference on Computational Linguistics*, pages 40–49, Abu Dhabi, UAE. Association for Computational Linguistics.
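The sequential cross-modal encoder described in the abstract can likewise be pictured as two stacked cross-attention stages: text features first attend to the text-close audio/visual representations, and the fused result then attends to the text-far representations. The PyTorch sketch below only illustrates that two-stage idea under assumed layer sizes and ordering; it does not reproduce the actual SFTTR architecture.

```python
# Minimal PyTorch sketch of sequential cross-modal fusion: text attends to
# "text-close" features, then the fused result attends to "text-far" features.
# Dimensions, head counts, and the residual/LayerNorm layout are assumptions.
import torch
import torch.nn as nn

class SequentialCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn_close = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_far = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text, close_feats, far_feats):
        # Stage 1: text queries attend over text-close audio/visual features.
        fused, _ = self.attn_close(text, close_feats, close_feats)
        x = self.norm1(text + fused)
        # Stage 2: the result attends over text-far (complementary) features.
        fused, _ = self.attn_far(x, far_feats, far_feats)
        return self.norm2(x + fused)

# Usage with random stand-in sequences (batch 2, 20 text tokens, 50 frames).
if __name__ == "__main__":
    enc = SequentialCrossModalFusion()
    out = enc(torch.randn(2, 20, 128),
              torch.randn(2, 50, 128),
              torch.randn(2, 50, 128))
    print(out.shape)  # torch.Size([2, 20, 128])
```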