Taihao Li
2024
Amanda: Adaptively Modality-Balanced Domain Adaptation for Multimodal Emotion Recognition
Xinxin Zhang | Jun Sun | Simin Hong | Taihao Li
Findings of the Association for Computational Linguistics: ACL 2024
This paper investigates unsupervised multimodal domain adaptation for multimodal emotion recognition, which is a solution for data scarcity yet remains understudied. Because the distribution discrepancy between source and target domains varies across modalities, the primary challenge is to balance domain alignment across modalities so that all of them are well aligned. To achieve this, we first develop our model based on information bottleneck theory to learn an optimal representation for each modality independently. Then, we align the domains by matching the label distributions and the representations. To balance the representation alignment, we propose to minimize a surrogate of the alignment losses, which is equivalent to adaptively adjusting the weights of the modalities throughout training, thus achieving balanced domain alignment across modalities. Overall, the proposed approach features Adaptively modality-balanced domain adaptation, dubbed Amanda, for multimodal emotion recognition. Extensive empirical results on commonly used benchmark datasets demonstrate that Amanda significantly outperforms competing approaches. The code is available at https://github.com/sunjunaimer/Amanda.
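The abstract above does not say which surrogate is used, so the following is only an illustrative sketch of how minimizing a single surrogate can behave like adaptive per-modality weighting. It assumes a log-sum-exp surrogate, which is a swapped-in choice and not necessarily the one used in Amanda. The function names, the temperature parameter, the modality names, and the loss values are all hypothetical. The gradient of the log-sum-exp with respect to each alignment loss is that loss's softmax weight, so the modality that is currently worst aligned automatically receives the largest weight.

```python
import math

def lse_surrogate(losses, temperature=1.0):
    """Log-sum-exp surrogate of per-modality alignment losses (assumed form)."""
    m = max(losses)  # subtract the max for numerical stability
    total = sum(math.exp((l - m) / temperature) for l in losses)
    return m + temperature * math.log(total)

def implicit_weights(losses, temperature=1.0):
    """Softmax weights, equal to the gradient of lse_surrogate w.r.t. each loss."""
    m = max(losses)
    exps = [math.exp((l - m) / temperature) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical alignment losses for three modalities at one training step.
modalities = ["text", "audio", "video"]
alignment_losses = [0.9, 0.3, 0.1]

weights = implicit_weights(alignment_losses)
for name, loss, weight in zip(modalities, alignment_losses, weights):
    print(f"{name}: loss={loss:.2f}, implicit weight={weight:.3f}")
```

Under this assumption the weights are recomputed from the current losses at every step, so no weighting schedule has to be tuned by hand. A lower temperature concentrates the weight on the worst-aligned modality, while a higher one approaches uniform weighting.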
DetectiveNN: Imitating Human Emotional Reasoning with a Recall-Detect-Predict Framework for Emotion Recognition in Conversations
Simin Hong | Jun Sun | Taihao Li
Findings of the Association for Computational Linguistics: EMNLP 2024
Emotion recognition in conversations (ERC) involves an internal cognitive process that interprets emotional cues by drawing on a collection of past emotional experiences. However, many existing methods struggle to decipher emotional cues in dialogues because they do not adequately capture the rich historical emotional context. In this work, we introduce the Detective Network (DetectiveNN), a novel model that is grounded in the cognitive theory of emotion and uses a “recall-detect-predict” framework to imitate human emotional reasoning. This process begins by ‘recalling’ past interactions of a specific speaker to collect emotional cues. It then ‘detects’ relevant emotional patterns by interpreting these cues in the context of the ongoing conversation. Finally, it ‘predicts’ the speaker’s current emotional state. Tested on three benchmark datasets, our approach significantly outperforms existing methods. This highlights the advantages of incorporating cognitive factors into deep learning for ERC, enhancing task efficacy and prediction accuracy.
2023
Layer-wise Fusion with Modality Independence Modeling for Multi-modal Emotion Recognition
Jun Sun | Shoukang Han | Yu-Ping Ruan | Xiaoning Zhang | Shu-Kai Zheng | Yulong Liu | Yuxin Huang | Taihao Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-modal emotion recognition has gained increasing attention in recent years due to its widespread applications and the advances in multi-modal learning approaches. However, previous studies primarily focus on developing models that exploit the unification of multiple modalities. In this paper, we propose that maintaining modality independence is beneficial for model performance. Following this principle, we construct a dataset and devise a multi-modal transformer model. The new dataset, CHinese Emotion Recognition dataset with Modality-wise Annotations, abbreviated as CHERMA, provides uni-modal labels for each individual modality, and multi-modal labels for all modalities jointly observed. The model consists of uni-modal transformer modules that learn representations for each modality, and a multi-modal transformer module that fuses all modalities. All the modules are supervised by their corresponding labels separately, and the forward information flow is unidirectional, from the uni-modal modules to the multi-modal module. The supervision strategy and the model architecture guarantee that each individual modality learns its representation independently, while the multi-modal module aggregates all information. Extensive empirical results demonstrate that our proposed scheme outperforms state-of-the-art alternatives, corroborating the importance of modality independence in multi-modal emotion recognition. The dataset and code are available at https://github.com/sunjunaimer/LFMIM.
Co-authors
- Jun Sun 3
- Simin Hong 2
- Shoukang Han 1
- Yu-Ping Ruan 1
- Xiaoning Zhang 1