@inproceedings{chu-etal-2022-self,
    title = "Self-supervised Cross-modal Pretraining for Speech Emotion Recognition and Sentiment Analysis",
    author = "Chu, Iek-Heng and
      Chen, Ziyi and
      Yu, Xinlu and
      Han, Mei and
      Xiao, Jing and
      Chang, Peng",
    editor = "Goldberg, Yoav and
      Kozareva, Zornitsa and
      Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.375",
    doi = "10.18653/v1/2022.findings-emnlp.375",
    pages = "5105--5114",
abstract = "Multimodal speech emotion recognition (SER) and sentiment analysis (SA) are important techniques for human-computer interaction. Most existing multimodal approaches utilize either shallow cross-modal fusion of pretrained features, or deep cross-modal fusion with raw features. Recently, attempts have been made to fuse pretrained feature representations in a deep fusion manner during fine-tuning stage. However those approaches have not led to improved results, partially due to their relatively simple fusion mechanisms and lack of proper cross-modal pretraining. In this work, leveraging single-modal pretrained models (RoBERTa and HuBERT), we propose a novel deeply-fused audio-text bi-modal transformer with carefully designed cross-modal fusion mechanism and a stage-wise cross-modal pretraining scheme to fully facilitate the cross-modal learning. Our experiment results show that the proposed method achieves state-of-the-art results on the public IEMOCAP emotion and CMU-MOSEI sentiment datasets, exceeding the previous benchmarks by a large margin.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="chu-etal-2022-self">
    <titleInfo>
      <title>Self-supervised Cross-modal Pretraining for Speech Emotion Recognition and Sentiment Analysis</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Iek-Heng</namePart>
      <namePart type="family">Chu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Ziyi</namePart>
      <namePart type="family">Chen</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Xinlu</namePart>
      <namePart type="family">Yu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Mei</namePart>
      <namePart type="family">Han</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Jing</namePart>
      <namePart type="family">Xiao</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Peng</namePart>
      <namePart type="family">Chang</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2022-12</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Findings of the Association for Computational Linguistics: EMNLP 2022</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Yoav</namePart>
        <namePart type="family">Goldberg</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Zornitsa</namePart>
        <namePart type="family">Kozareva</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Yue</namePart>
        <namePart type="family">Zhang</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Abu Dhabi, United Arab Emirates</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Multimodal speech emotion recognition (SER) and sentiment analysis (SA) are important techniques for human-computer interaction. Most existing multimodal approaches utilize either shallow cross-modal fusion of pretrained features or deep cross-modal fusion with raw features. Recently, attempts have been made to fuse pretrained feature representations in a deep fusion manner during the fine-tuning stage. However, those approaches have not led to improved results, partially due to their relatively simple fusion mechanisms and a lack of proper cross-modal pretraining. In this work, leveraging single-modal pretrained models (RoBERTa and HuBERT), we propose a novel deeply fused audio-text bi-modal transformer with a carefully designed cross-modal fusion mechanism and a stage-wise cross-modal pretraining scheme to fully facilitate cross-modal learning. Our experimental results show that the proposed method achieves state-of-the-art results on the public IEMOCAP emotion and CMU-MOSEI sentiment datasets, exceeding the previous benchmarks by a large margin.</abstract>
<identifier type="citekey">chu-etal-2022-self</identifier>
<identifier type="doi">10.18653/v1/2022.findings-emnlp.375</identifier>
<location>
<url>https://aclanthology.org/2022.findings-emnlp.375</url>
</location>
<part>
<date>2022-12</date>
<extent unit="page">
<start>5105</start>
<end>5114</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Self-supervised Cross-modal Pretraining for Speech Emotion Recognition and Sentiment Analysis
%A Chu, Iek-Heng
%A Chen, Ziyi
%A Yu, Xinlu
%A Han, Mei
%A Xiao, Jing
%A Chang, Peng
%Y Goldberg, Yoav
%Y Kozareva, Zornitsa
%Y Zhang, Yue
%S Findings of the Association for Computational Linguistics: EMNLP 2022
%D 2022
%8 December
%I Association for Computational Linguistics
%C Abu Dhabi, United Arab Emirates
%F chu-etal-2022-self
%X Multimodal speech emotion recognition (SER) and sentiment analysis (SA) are important techniques for human-computer interaction. Most existing multimodal approaches utilize either shallow cross-modal fusion of pretrained features or deep cross-modal fusion with raw features. Recently, attempts have been made to fuse pretrained feature representations in a deep fusion manner during the fine-tuning stage. However, those approaches have not led to improved results, partially due to their relatively simple fusion mechanisms and a lack of proper cross-modal pretraining. In this work, leveraging single-modal pretrained models (RoBERTa and HuBERT), we propose a novel deeply fused audio-text bi-modal transformer with a carefully designed cross-modal fusion mechanism and a stage-wise cross-modal pretraining scheme to fully facilitate cross-modal learning. Our experimental results show that the proposed method achieves state-of-the-art results on the public IEMOCAP emotion and CMU-MOSEI sentiment datasets, exceeding the previous benchmarks by a large margin.
%R 10.18653/v1/2022.findings-emnlp.375
%U https://aclanthology.org/2022.findings-emnlp.375
%U https://doi.org/10.18653/v1/2022.findings-emnlp.375
%P 5105-5114
Markdown (Informal)
[Self-supervised Cross-modal Pretraining for Speech Emotion Recognition and Sentiment Analysis](https://aclanthology.org/2022.findings-emnlp.375) (Chu et al., Findings 2022)
ACL
Iek-Heng Chu, Ziyi Chen, Xinlu Yu, Mei Han, Jing Xiao, and Peng Chang. 2022. Self-supervised Cross-modal Pretraining for Speech Emotion Recognition and Sentiment Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5105–5114, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.