Multimodal sentiment analysis has attracted increasing attention and lots of models have been proposed. However, the performance of the state-of-the-art models decreases sharply when they are deployed in the real world. We find that the main reason is that real-world applications can only access the text outputs by the automatic speech recognition (ASR) models, which may be with errors because of the limitation of model capacity. Through further analysis of the ASR outputs, we find that in some cases the sentiment words, the key sentiment elements in the textual modality, are recognized as other words, which makes the sentiment of the text change and hurts the performance of multimodal sentiment analysis models directly. To address this problem, we propose the sentiment word aware multimodal refinement model (SWRM), which can dynamically refine the erroneous sentiment words by leveraging multimodal sentiment clues. Specifically, we first use the sentiment word position detection module to obtain the most possible position of the sentiment word in the text and then utilize the multimodal sentiment word refinement module to dynamically refine the sentiment word embeddings. The refined embeddings are taken as the textual inputs of the multimodal feature fusion module to predict the sentiment labels. We conduct extensive experiments on the real-world datasets including MOSI-Speechbrain, MOSI-IBM, and MOSI-iFlytek and the results demonstrate the effectiveness of our model, which surpasses the current state-of-the-art models on three datasets. Furthermore, our approach can be adapted for other multimodal feature fusion models easily.
Neural networks are widely used in various NLP tasks for their remarkable performance. However, the complexity makes them difficult to interpret, i.e., they are not guaranteed right for the right reason. Besides the complexity, we reveal that the model pathology - the inconsistency between word saliency and model confidence, further hurts the interpretability. We show that the pathological inconsistency is caused by the representation collapse issue, which means that the representation of the sentences with tokens in different saliency reduced is somehow collapsed, and thus the important words cannot be distinguished from unimportant words in terms of model confidence changing. In this paper, to mitigate the pathology and obtain more interpretable models, we propose Pathological Contrastive Training (PCT) framework, which adopts contrastive learning and saliency-based samples augmentation to calibrate the sentences representation. Combined with qualitative analysis, we also conduct extensive quantitative experiments and measure the interpretability with eight reasonable metrics. Experiments show that our method can mitigate the model pathology and generate more interpretable models while keeping the model performance. Ablation study also shows the effectiveness.
Emotion recognition in conversations (ERC) has received much attention recently in the natural language processing community. Considering that the emotions of the utterances in conversations are interactive, previous works usually implicitly model the emotion interaction between utterances by modeling dialogue context, but the misleading emotion information from context often interferes with the emotion interaction. We noticed that the gold emotion labels of the context utterances can provide explicit and accurate emotion interaction, but it is impossible to input gold labels at inference time. To address this problem, we propose an iterative emotion interaction network, which uses iteratively predicted emotion labels instead of gold emotion labels to explicitly model the emotion interaction. This approach solves the above problem, and can effectively retain the performance advantages of explicit modeling. We conduct experiments on two datasets, and our approach achieves state-of-the-art performance.