Context or Knowledge is Not Always Necessary: A Contrastive Learning Framework for Emotion Recognition in Conversations



Introduction
Emotion recognition in conversations (ERC) has received active research attention (Tu et al., 2022c; Li et al., 2022b; Xie et al., 2021; Mao et al., 2021; Lian et al., 2021; Xiao et al., 2020) because of its wide applications in fields such as opinion mining (Cortis and Davis, 2021) and recommender systems (Zheng et al., 2020). Existing works in ERC conventionally model context-sensitive dependencies (Wang et al., 2020; Jiao et al., 2020) and knowledge-sensitive dependencies (Li et al., 2021a,b; Ghosal et al., 2020; Zhong et al., 2019). In knowledge-sensitive ERC models, two kinds of external knowledge are mainly used: concepts retrieved from external knowledge bases such as ConceptNet (Speer et al., 2017) or SenticNet (Cambria et al., 2020), and knowledge generated by the pre-trained commonsense transformer (COMET) (Bosselut et al., 2019). In context-sensitive ERC models, recent studies have proposed various methods, including memory networks (Kumar et al., 2022b; Xing et al., 2020; Hazarika et al., 2018) and graph-based models (Nie et al., 2021; Li et al., 2020; Ghosal et al., 2019; Zhang et al., 2019). However, these works do not ask whether a model actually needs context and external knowledge for the current utterance; they focus instead on improving the modeling methods.
In Fig. 1, intuitively, the emotion of example 1 can be recognized without leveraging context or external knowledge. Example 2 (context-independent) is the first utterance in a conversation and therefore lacks any context; without establishing a relationship between patio furniture and 'joyful', it is challenging to detect its emotion. For example 3 (knowledge-independent), the literal meaning of the utterance is opposite to the conveyed emotion, so it is difficult to detect the emotion correctly without context. Although such examples exist, simply removing context or external knowledge would leave a semantic gap between utterance representations. Therefore, differentiating context- and knowledge-independent utterances from other utterances is a challenge.
Based on the above, we propose CKCL, a framework based on contrastive learning (CL), to distinguish context- and knowledge-independent utterances during training.
Concretely, context-independent and knowledge-independent utterances are labeled '1', while context-dependent and knowledge-dependent utterances are labeled '0'. CKCL then pulls utterances with the same label together and pushes utterances with different labels further apart. For ERC models, this alleviates the degradation of context- and knowledge-independent utterance representations during training, and CKCL can also denoise irrelevant context and knowledge to improve model robustness. In addition, inspired by Li et al. (2022a), we introduce a weighted supervised CL (SCL), named Emotion SCL, into CKCL to further distinguish similar emotions while accounting for the uneven class distribution in ERC. Our main contributions can be summarized as follows:
• We are the first to explore self-supervised CL in the ERC task.
• We propose a CKCL framework to differentiate context-and knowledge-independent utterances, which promotes the robustness of ERC models against irrelevant context and knowledge during training.
• Experimental results demonstrate that our proposed method can boost various baselines and outperforms state-of-the-art ERC methods.

Related Work
Emotion Recognition in Conversations
Context-sensitive Models The emotion generation theory (Gross and Barrett, 2011) indicates the importance of contextual information for emotion identification. RNN-based models (Poria et al., 2017) are often used to model context dependencies. However, they are unable to capture the distinction between historical utterances (Lian et al., 2021) when modeling context. To solve this problem, most works began to focus on memory networks (Hazarika et al., 2018; Jiao et al., 2020).
In addition, the role of participants in ERC is also important to the speaker's emotional state (Wen et al., 2023). To model the speaker-level context, researchers have placed greater emphasis on speaker-specific models (Kim and Vossen, 2021), graph-based models (Nie et al., 2021), and so on. For example, Majumder et al. (2019) (Speer et al., 2017). Subsequently, Zhao et al. (2022) proposed a causality-aware model using generated knowledge to capture context information. However, both context- and knowledge-sensitive methods suffer from performance degradation: their performance on some utterances is even worse than that of models without context and knowledge.

Contrastive Learning
Proposed CKCL Framework

Task Definition
Let a conversation C consist of utterances u_1, u_2, ..., u_n, where n is the number of utterances. Each utterance u_i = {w_1, w_2, ..., w_m} consists of m tokens, and there are g participants S = {s_1, s_2, ..., s_g} (g ≥ 2) in C; each utterance u_i is uttered by one of S. The ERC task then aims to predict, for each utterance in C, a label from the pre-defined emotion label set Y = {y_1, y_2, ..., y_e}. However, unlike vanilla emotion recognition, ERC models need to focus on modeling context- and knowledge-sensitive dependencies because of the interaction between participants. Accordingly, the task can be expressed as ŷ_i = f(u_i, {u_{i−w}, ..., u_{i−1}}, k_i), where w is the size of the context window and K = {k_1, k_2, ..., k_n} denotes the external knowledge of utterances.
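For concreteness, the inputs of this formulation can be sketched as a simple data structure (the field names and the toy utterances are our illustration, not the paper's):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Conversation:
    utterances: List[str]   # u_1 ... u_n
    speakers: List[str]     # speaker of each utterance, drawn from S (g >= 2)
    knowledge: List[str]    # external knowledge k_i attached to each u_i
    emotions: List[str]     # gold labels from the emotion label set Y

conv = Conversation(
    utterances=["We just got new patio furniture!", "Oh, that's great."],
    speakers=["s1", "s2"],
    knowledge=["patio furniture -> leisure", ""],
    emotions=["joy", "neutral"],
)
```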

Overview
CL was first proposed to augment data for improving visual representations. Afterward, due to the poor performance of BERT on semantic text similarity tasks, researchers began to introduce self-supervised CL to capture the correlation and difference between utterances. Subsequently, more and more CL-based methods appeared in the NLP field (Kumar et al., 2022a). However, there is no related work based on unsupervised CL in ERC.
In this section, given the performance degradation of ERC models when modeling context and external knowledge, we introduce a CL-based framework, CKCL, to refine utterance representations during training. Additionally, inspired by Li et al. (2022a), we incorporate a weighted supervised CL into CKCL to distinguish unevenly distributed samples with similar emotions.

Context CL
Context is the core of NLP-related tasks and significantly improves the performance of NLP systems.
In ERC, the surrounding utterances (at time < t) of the current utterance (at time t) are treated as its context (Poria et al., 2017). However, modeling context is challenging, mainly because of (1) emotional dynamics: self- and inter-personal dependency modeling (Poria et al., 2019), and (2) differences between local and distant historical utterances (Ghosal et al., 2021). Although existing ERC models improve classification performance by modeling context, there is also a marked degradation caused by low-quality context. Specifically, the performance of a model on certain utterances is even worse than that of the same model without contextual information, which highlights the significance of denoising irrelevant context in ERC. Furthermore, the efficacy of denoising low-quality context has been demonstrated in other NLP tasks (Zhang et al., 2021).
Algorithm 1: Calculation of CKCL for each mini-batch B.
Based on this, we design a Context CL to capture the correlation and difference between context-independent and context-dependent utterances. We first copy the model M and, for each mini-batch B, feed the input data, masking the context representation x_i of each u_i with 0, into the replica model M†. The context representation of an utterance conventionally lives in the hidden layer of ERC models (Ghosal et al., 2020; Majumder et al., 2019), but some models use context as input (Zhong et al., 2019). Then, we conduct self-supervised pseudo labeling, as represented in Lines 6-12 of Algorithm 1. Finally, we calculate the contrastive loss term L_c according to the pseudo labels z^c = {z_i^c}_{i=1}^{N_b}, as described in Lines 14-24 of Algorithm 1.
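A minimal sketch of the masking-and-labeling step above. Algorithm 1 is not reproduced in this text, so the comparison rule below (an utterance is tagged context-independent when the masked replica's prediction is still correct) is our assumption, as is the hidden-state layout:

```python
import numpy as np

def mask_context(hidden: np.ndarray, ctx_dims: slice) -> np.ndarray:
    """Zero out the context slice of a hidden state (assumed layout)."""
    masked = hidden.copy()
    masked[:, ctx_dims] = 0.0
    return masked

def context_pseudo_labels(preds_masked, gold):
    """z_i^c = 1 (context-independent) if the masked model is still correct."""
    return [1 if p == g else 0 for p, g in zip(preds_masked, gold)]

hidden = np.ones((2, 6))                 # toy batch: [context(3) | other(3)]
masked = mask_context(hidden, slice(0, 3))
z_c = context_pseudo_labels(["joy", "neutral"], ["joy", "sad"])
```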

Knowledge CL
In conversations, humans usually rely on commonsense knowledge to convey emotions (Zhong et al., 2019). However, in knowledge-sensitive ERC models, knowledge irrelevant to understanding the utterance may be absorbed as noise (Tu et al., 2022b). Although some works (Jiang et al., 2022; Tu et al., 2022a; Zhu et al., 2021) strive for knowledge selection, they remain limited on knowledge-independent utterances, whose emotions can be identified without external knowledge. To distinguish knowledge-independent from knowledge-dependent utterances and to denoise irrelevant knowledge, we also design a CL-based method, Knowledge CL. Its process is similar to that of Context CL, except that Knowledge CL masks the knowledge representation rather than the context representation. As a result, we obtain another loss term L_k, described in Algorithm 1.
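The loss terms L_c and L_k share the same contrastive form over pseudo labels. The NumPy sketch below is our simplified InfoNCE-style rendering of that step, not the paper's exact implementation (Lines 14-24 of Algorithm 1 are not reproduced in this text):

```python
import numpy as np

def cl_loss(reps: np.ndarray, pseudo, tau: float = 0.07) -> float:
    """Pull utterances with the same pseudo label together, push others apart."""
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = np.exp(reps @ reps.T / tau)
    np.fill_diagonal(sim, 0.0)                        # exclude self-pairs
    loss, n = 0.0, 0
    for i in range(len(reps)):
        for p in (j for j in range(len(reps))
                  if j != i and pseudo[j] == pseudo[i]):
            loss += -np.log(sim[i, p] / sim[i].sum())  # InfoNCE term
            n += 1
    return loss / max(n, 1)

# Well-separated pseudo-label clusters should score lower than mixed ones.
good = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
bad = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
z = [1, 1, 0, 0]
```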

Emotion SCL
Considering the characteristics of the ERC task, namely an extremely uneven class distribution and highly similar emotion labels, we propose a class-weighted SCL, Emotion SCL, to clarify the representations of utterances with similar emotions. Emotion SCL pulls samples with different emotion labels further apart and alleviates the class imbalance problem to some extent. For each mini-batch B, Emotion SCL computes
L_e = − (1/N_b) Σ_{i=1}^{N_b} (α_{y_i} / |P(i)|) Σ_{j=1}^{N_b} 1[y_i = y_j, i ≠ j] log ( F(x_i, x_j, τ) / Σ_{k=1}^{N_b} 1[k ≠ i] F(x_i, x_k, τ) ),
where B denotes a mini-batch, N_b is the size of B, 1[·] ∈ {0, 1} is an indicator function, P(i) = {j | y_j = y_i, j ≠ i} is the set of positives of the i-th utterance, and α_{y_i} is the class weight of the i-th utterance's emotion. Here x_i = EmbeddingLayer(u_i), where EmbeddingLayer(·) represents the word embedding method; ERC models usually leverage BERT (Devlin et al., 2019), GloVe (Pennington et al., 2014), or RoBERTa (Liu et al., 2019) to encode utterance representations. F(x_i, x_k, τ) = e^{simi(x_i, x_k)/τ}, where τ is the temperature parameter and simi(x_i, x_j) = (x_i · x_j)/(∥x_i∥∥x_j∥) denotes cosine similarity. Finally, {y_i}_{i=1}^{N_b} is the set of emotion labels of the utterances in B.
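A NumPy sketch of the Emotion SCL term under these definitions. The averaging over positives and the placement of the class weight α on the anchor are our assumptions where the text leaves them implicit:

```python
import numpy as np

def emotion_scl(reps, labels, alpha, tau=0.07):
    """Class-weighted supervised contrastive loss over a mini-batch B."""
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    F = np.exp(reps @ reps.T / tau)       # F(x_i, x_k, tau)
    np.fill_diagonal(F, 0.0)              # the k != i indicator
    total = 0.0
    for i, y in enumerate(labels):
        pos = [j for j in range(len(labels)) if j != i and labels[j] == y]
        if not pos:
            continue
        inner = sum(-np.log(F[i, p] / F[i].sum()) for p in pos) / len(pos)
        total += alpha[y] * inner         # class weight of the anchor's emotion
    return total / len(labels)

reps = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["happy", "happy", "sad", "sad"]
loss = emotion_scl(reps, labels, alpha={"happy": 1.0, "sad": 2.0})
```

Up-weighting a class (here 'sad') increases its contribution to the loss, so rare emotions are not drowned out by frequent ones.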

Model Training
We jointly train our proposed framework by minimizing the classification loss together with the three losses above:
L = L′ + γ_e L_e + γ_c L_c + γ_k L_k + λ∥Θ∥_2,
where γ_e, γ_c, and γ_k are tuned hyper-parameters, L′ is the classification loss, Θ is the set of learnable parameters of the CKCL framework, and λ represents the coefficient of L_2-regularization.
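The joint objective can be sketched as follows (a minimal rendering of the sum above; in practice the L_2 term is usually realized as optimizer weight decay):

```python
def total_loss(l_cls, l_e, l_c, l_k,
               gamma_e=1.0, gamma_c=1.0, gamma_k=1.0,
               lam=0.0, theta_norm=0.0):
    """L = L' + gamma_e*L_e + gamma_c*L_c + gamma_k*L_k + lam*||Theta||_2."""
    return (l_cls + gamma_e * l_e + gamma_c * l_c + gamma_k * l_k
            + lam * theta_norm)
```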
Datasets

IEMOCAP consists of dyadic sessions in which actors perform improvisations or scripted scenarios. Each utterance is labeled with one of the emotions: happy, angry, neutral, sad, excited, or frustrated.
Dailydialog is a dyadic conversation dataset of human-written daily communications. Each utterance is annotated with one of the emotions: happiness, surprise, sadness, anger, disgust, neutral, or fear, and one of the sentiments: neutral, negative, or positive.
MELD is a multi-party conversation dataset collected from the TV show Friends and is an extension of the EmotionLines dataset (Hsu et al., 2018). Each utterance is annotated with one of the emotions: surprise, fear, disgust, anger, sadness, neutral, or joy, and one of the sentiments: neutral, negative, or positive.
EmoryNLP consists of multi-party sessions from the TV show Friends. Each utterance is labeled with one of the emotions: surprise, fear, disgust, anger, sadness, neutral, or joy, and one of the sentiments: neutral, negative, or positive.

Experimental Settings
All of the baselines have released their source code, so we keep settings identical to the original papers. For CKCL, γ_e, γ_c, γ_k, and τ are tuned manually on each dataset with hold-out validation. Specifically, the hyperparameter settings of COSMIC⋆ + CKCL are reported in Table 2; for the baselines, γ_e : γ_c : γ_k is 1 : 1 : 1 on each dataset and τ is always 0.07. The reported results are the average scores of 5 random runs on the test set. Additionally, because the models differ in how they represent context and knowledge, as shown in Table 3, the masked objects also differ across our experiments.
Figure 3: Ablation studies on CKCL. Numbers_C and Numbers_K represent the changes in the number of context- and knowledge-independent utterances, and F1_C and F1_K denote the weighted avg F1 scores of context- and knowledge-independent utterances on the validation set. '_o', '_c', '_k', and '_all' denote COSMIC⋆, COSMIC⋆ + Context CL, COSMIC⋆ + Knowledge CL, and COSMIC⋆ + CKCL, respectively.
COSMIC⋆ achieves its best F1 improvement, 2.18%, on the Dailydialog dataset, and COSMIC⋆ + CKCL outperforms all compared methods on every dataset except IEMOCAP. This verifies the effectiveness of our CKCL framework.

Ablation Study
To investigate the impact of each component of our proposed CKCL framework, we conducted an ablation study on COSMIC⋆; the results are shown in Table 4. This may be attributed to the model's inherent ability to denoise irrelevant context to some extent while struggling to handle irrelevant knowledge effectively. In addition, considering the difference in data size between the IEMOCAP and Dailydialog datasets, we analyze CKCL on the IEMOCAP and EmoryNLP datasets for better visualization. The number of context- or knowledge-independent utterances, shown in Fig. 3, converges over the training iterations, consistent with the model's convergence; this is primarily because CKCL's annotation relies heavily on the model's prediction results. In particular, Fig. 3(c-d) also demonstrates the capability of CKCL to enhance the performance on context- or knowledge-independent utterances, which further proves the effectiveness of the CKCL framework in denoising irrelevant context or knowledge.

Analysis of Emotion SCL:
To better understand the differences between utterance representations with different emotions, we show t-SNE (van der Maaten and Hinton, 2008) visualizations of the intermediate representations of COSMIC⋆ and COSMIC⋆ + Emotion SCL on the IEMOCAP and Dailydialog datasets. Overall, the differences in utterance representations derived from the latter are clearer than those from the former, as shown in Fig. 5. Specifically, as shown in Fig. 5(a), Emotion SCL alleviates the difficulty of distinguishing similar emotions such as 'happy' and 'excited' to some extent. Additionally, the utterance representations of the emotion 'happy' are effectively differentiated from the others.

Analysis of Performance Degradation
Although modeling context and knowledge can enhance performance, it also leads to performance degradation on certain utterances. As shown in Fig. 4, the theoretical performance assumes that the model experiences no degradation: after modeling context and knowledge, it can still correctly identify every utterance it identified correctly before (i.e., without modeled context and knowledge). The gap between theoretical and actual performance implies that existing ERC systems cannot effectively denoise irrelevant context and knowledge. This is the primary motivation behind this paper.

Case Study
Table 6 shows three utterances sampled from the IEMOCAP and Dailydialog datasets. These utterances were initially recognized correctly by COSMIC⋆ without modeling context or external knowledge, but once context and external knowledge were considered, they were recognized incorrectly. Intuitively, the emotions of these utterances can be recognized even without context and knowledge, yet the model performed disappointingly, because blindly modeling context and knowledge may deteriorate utterance representations. Fortunately, CKCL can effectively distinguish such utterances, treating irrelevant context and knowledge as noise and thus improving the robustness of ERC models. As a result, the CKCL-based model correctly identifies these cases as expected.

Generalizability Analysis
To evaluate the generalizability of our CKCL framework, following Yang et al. (2022), we conduct experiments on various ERC baselines, as shown in Table 5. We can see that the improvement on the Dailydialog dataset is not in line with expectations, which shows that the influence of CKCL differs considerably across models. Nevertheless, even without hyperparameter adjustment, CKCL still boosts the performance of all models on emotion or sentiment classification, which verifies the effectiveness and generalizability of CKCL in ERC.

Analysis of Static Pseudo Labels
Because CKCL needs to run inference three times to adaptively annotate dynamic pseudo labels, it increases time complexity. Therefore, we explored a lower-complexity alternative: in the first epoch, a trained model annotates static pseudo labels that remain unchanged during subsequent training, so no additional inference is needed afterwards. As a trade-off, model performance declines to some extent, but CKCL still benefits the model, as shown in Table 7.
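The static variant amounts to caching the pseudo labels produced once after the first epoch; a hypothetical sketch (class and function names are ours):

```python
class StaticPseudoLabels:
    """Annotate each utterance once, then reuse the cached label forever."""
    def __init__(self, annotate_fn):
        self.annotate_fn = annotate_fn   # expensive masked-inference pass
        self.cache = {}
        self.calls = 0                   # counts actual annotation passes

    def get(self, utt_id):
        if utt_id not in self.cache:     # only the first epoch pays for inference
            self.cache[utt_id] = self.annotate_fn(utt_id)
            self.calls += 1
        return self.cache[utt_id]

store = StaticPseudoLabels(lambda uid: 1)
labels = [store.get("u1") for _ in range(3)] + [store.get("u2")]
```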

Error Analysis
In this section, we conduct an error analysis on the reported results for each dataset and find that most errors are attributable to the following three points:
• Class imbalance problem: The unbalanced class distribution is the primary cause of errors. In the training set of the MELD dataset, the sample counts are: 'fear': 268, 'disgust': 271, 'sadness': 683, 'anger': 1109, 'surprise': 1205, 'joy': 1743, 'neutral': 4710, which causes the F1 score for the emotion 'fear' to be as low as 0.0806.
• Diversity of context modeling: Unlike knowledge representations, masking the context representation as described in Table 3 does not completely eliminate the influence of contextual information, mainly because even RNNs or their variants can capture latent contextual information. As a result, generating pseudo labels for Context CL becomes challenging.
• Limitation of pseudo labeling: The quality of pseudo-label annotation primarily relies on the model's predictions, so the model's performance directly impacts the quality of the labeled samples. Consequently, incorrectly labeled samples may be used, which can hinder the model's training. For example, in a specific training epoch, Example 3 in Fig. 1 might be mistakenly labeled as a knowledge-dependent utterance. Such situations can cause fluctuations in the model's performance during training, potentially even below that of the original model. Nevertheless, the benefits still outweigh the drawbacks, as shown in Fig. 3, because the annotations stabilize as the model converges.
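On the class-imbalance point above, one common remedy, and a plausible choice for the α weights in Emotion SCL (the paper's exact weighting scheme is not specified here, so this is an assumption), is inverse-frequency weighting over the quoted MELD counts:

```python
# MELD training-set counts quoted in the error analysis above.
counts = {"fear": 268, "disgust": 271, "sadness": 683, "anger": 1109,
          "surprise": 1205, "joy": 1743, "neutral": 4710}
total = sum(counts.values())
# Inverse-frequency weights: rare classes like 'fear' get the largest alpha.
alpha = {c: total / (len(counts) * n) for c, n in counts.items()}
```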

Conclusion
In this paper, we propose a novel CKCL framework to enhance utterance representations in ERC. More concretely, we employ Context (or Knowledge) CL to capture the correlation and difference between context-independent and context-dependent (or knowledge-independent and knowledge-dependent) utterance representations, which also enhances the ability of models to denoise irrelevant context or knowledge. Additionally, Emotion SCL pulls utterances with different labels further apart, yielding clearer differences among utterances with similar emotions. Experimental results show that our CKCL framework significantly boosts various ERC models and outperforms state-of-the-art methods.

Figure 1 :
Figure 1: Examples of utterances, reflecting that context and knowledge are not always necessary in ERC.

Figure 2 :
Figure 2: The proposed CKCL framework. The dotted line represents data flows without backpropagation. CE denotes cross-entropy, which is widely adopted in ERC.

Fig. 5(b): Similarly, the differentiation between similar emotions like 'happiness' and 'surprise' is also alleviated, and the differences among the other emotions become more pronounced.

Table 1 :
Statistics of experimental datasets. Since the IEMOCAP dataset does not provide a predefined train/validation split, we use 10% of the training dialogues as the validation split.
Table 4 reports the experimental results on the different datasets. We can observe that the performance of COSMIC on sentiment and emotion identification is significantly boosted by the CKCL framework.

Table 3 :
The masked objects in the various models, consistent with the original papers.
'w/o L_con', 'w/o L_kno', and 'w/o L_emo' denote removing the corresponding loss terms. CKCL's pseudo labeling builds on the prediction results, which enriches utterance representations and the denoising ability of the model in ERC. The results reported in Table 4 also demonstrate this, and the effectiveness of Knowledge CL is superior to that of Context CL.

Table 4 :
Experimental results of different methods. The best scores are in bold. ⋆ denotes our replication results; ♮ and ♯ represent results from the original papers and from (Ghosal et al., 2020), respectively. ♠ denotes knowledge-sensitive models. W-Avg F1 denotes the weighted avg F1 score. The depth of color symbolizes the declining or rising value.

Table 5 :
Experimental results of generalizability analysis on different baselines and datasets.

Table 6 :
Examples of utterances from the IEMOCAP and Dailydialog datasets for the case study.

Table 7 :
Comparison results of dynamic and static pseudo labels on different methods.⋆ denotes the CKCL framework with static pseudo labels.