Enhancing Emotion Recognition in Conversation via Multi-view Feature Alignment and Memorization



Introduction
Emotional intelligence is an advanced capability of conversational AI systems. A fundamental and critical task in this domain is emotion recognition in conversation (ERC), which aims to identify the emotions conveyed in each utterance within the dialogue context (Poria et al., 2019b). Unlike the basic emotion classification (EC) task (Yin and Shang, 2022), ERC is a more practical endeavor that involves predicting the emotion label of each utterance based on the surrounding context. Previous methods (Ghosal et al., 2019; Ishiwatari et al., 2020; Shen et al., 2021b) commonly follow a two-step paradigm of first extracting semantic-view features via fine-tuning PLMs and then modeling context-view features on top of them with various graph neural networks, such as GCN (Kipf and Welling, 2016), RGCN (Ishiwatari et al., 2020), and GAT (Veličković et al., 2017). Considering the complexity of emotions, some recent works (Song et al., 2022; Hu et al., 2023) use a supervised contrastive loss as an auxiliary loss to improve the feature discriminability between different classes of samples. Although these methods have achieved excellent performance on the ERC task, three issues still remain: (1) As illustrated in Figure 1, the interaction between utterances in a conversation is very complex, and it is difficult to fully model this interaction simply through a graph neural network. (2) Both the semantic view and the context view are important for ERC. The former focuses on the emotion conveyed by the independent utterance, while the latter offers clues about the emotional context. These two views of information work together in a complementary manner, contributing to ERC from distinct perspectives. However, as illustrated in Figure 2, semantic-view and context-view features are not well aligned. (3) Due to the smaller number of samples in the tail classes, it is difficult for a parametric model to learn the patterns of the tail classes during training. How to effectively recognize samples of the tail classes remains an open issue.
To address these issues, we propose Multi-view Feature Alignment and Memorization (MFAM) for ERC. Firstly, we treat a pre-trained conversation model (PCM) (Gu et al., 2021) as a prior knowledge base from which we elicit correlations between utterances through a probing procedure. These correlations form the weights of edges in the graph neural network and participate in the interactions between utterances. Secondly, we adopt supervised contrastive learning (SCL) to align the features at the semantic view and the context view and to distinguish semantically similar emotion categories. Unlike regular SCL, both semantic-view and context-view features participate in the SCL computation. Finally, we propose a new semi-parametric paradigm of inference through memorization to address the recognition of tail-class samples. We construct two knowledge stores, one from the semantic view and the other from the context view. The semantic-view knowledge store memorizes semantic-view features and their corresponding emotion labels as key-value pairs, and the context-view knowledge store is constructed in the same way. During inference, our model not only infers emotion through the weights of the trained model but also assists decision-making by retrieving examples memorized in the two knowledge stores. It is worth noting that the semantic-view and context-view features have been well aligned, which facilitates joint retrieval over the two views.
The main contributions of this work are summarized as follows: (1) We propose multi-view feature alignment for ERC, which aligns semantic-view and context-view features; these two views of features work together in a complementary manner, contributing to ERC from distinct perspectives.
(2) We propose a new semi-parametric paradigm of inference through memorization to address the recognition of tail-class samples. (3) We treat the PCM as a prior knowledge base from which we elicit correlations between utterances through a probing procedure. (4) We achieve state-of-the-art results on four widely used benchmarks, and extensive experiments demonstrate the effectiveness of the proposed multi-view feature alignment and memorization.

Emotion Recognition in Conversation
Most existing works (Ghosal et al., 2019; Ishiwatari et al., 2020; Shen et al., 2021b) first extract semantic-view features via fine-tuning PLMs and then model context-view features on top of them with various graph neural networks. There are also works (Zhong et al., 2019; Shen et al., 2021a; Majumder et al., 2019) that use transformer-based and recurrence-based methods to model context-view features. It is worth noting that self-attention (Vaswani et al., 2017) in transformer-based methods can be viewed as a fully connected graph in some sense. Some recent works (Li et al., 2021; Hu et al., 2023) use a supervised contrastive loss as an auxiliary loss to enhance feature discriminability between samples of different classes. In the following paragraphs, we describe graph-based methods and methods using supervised contrastive loss in more detail.
DialogueGCN (Ghosal et al., 2019) treats the dialogue as a directed graph in which each utterance is connected with the surrounding utterances. DAG-ERC (Shen et al., 2021b) uses a directed acyclic graph to model the dialogue, where each utterance only receives information from past utterances. CoG-BART (Li et al., 2021) adapts supervised contrastive learning to make different emotions mutually exclusive and thus identify similar emotions better. SACL (Hu et al., 2023) proposes a supervised adversarial contrastive learning framework for learning generalized and robust emotional representations.

Memorization
Memorization-based (or non-/semi-parametric) methods perform well in low-resource scenarios and have been applied to various NLP tasks such as language modeling (Khandelwal et al., 2019), question answering (Kassner and Schütze, 2020), knowledge graph embedding (Zhang et al., 2022), and relation extraction (Chen et al., 2022).

Probing PLMs
Prior work has shown that PLMs such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020) store rich knowledge. Based on this, PROTON (Wang et al., 2022) elicits relational structures for schema linking from PLMs through an unsupervised probing procedure. Unlike PROTON, which operates at the word level, our probing procedure is based on a PCM and operates at the utterance level.

Methodology
We introduce the definition of the ERC task in Section 3.1, and from Section 3.2 to Section 3.5 we introduce the proposed MFAM. The overall framework of MFAM is shown in Figure 3.

Definition
Given a collection of all speakers S, an emotion label set Y, and a conversation C, our goal is to identify the speaker's emotion label at each conversation turn. A conversation is denoted as C = {(s_1, u_1), (s_2, u_2), ..., (s_N, u_N)}, where s_i ∈ S is the speaker and u_i is the utterance of the i-th turn. Each utterance u_i consists of n_i tokens, u_i = {w_{i,1}, w_{i,2}, ..., w_{i,n_i}}.
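For illustration only, a conversation can be stored as an ordered list of (speaker, utterance) turns with one gold label per turn; the field names in the following minimal sketch are hypothetical and not part of any dataset's official schema.

from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str      # s_i in S
    utterance: str    # u_i, a sequence of n_i tokens
    emotion: str      # y_i in Y (gold label, available only at training time)

# A conversation C is an ordered list of turns; the model predicts
# the emotion of every turn given the turns around it.
conversation: List[Turn] = [
    Turn("A", "I finally got the job offer!", "excited"),
    Turn("B", "That's wonderful news.", "happy"),
]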

Multi-view Feature Extraction
In this section, we introduce how to extract the semantic-view and context-view features of each utterance.

Semantic-view Feature Extraction
Semantic Feature Extraction We employ a PLM to extract the semantic feature of utterance u_i. Following convention, the PLM is first fine-tuned on each ERC dataset, and its parameters are then frozen during the subsequent training. Following Ghosal et al. (2020), we employ RoBERTa-Large (Liu et al., 2019) as our feature extractor. More specifically, for each utterance u_i, we prepend a special token [CLS] to its tokens, making the input of the form {[CLS], w_{i,1}, w_{i,2}, ..., w_{i,n_i}}. Then, we average the [CLS] embeddings from the last 4 layers as u_i's semantic feature vector x_i ∈ R^{d_u}.
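As an illustration, here is a minimal sketch of this extraction step using the HuggingFace Transformers API; the fine-tuning of the encoder on each ERC dataset is omitted, and RoBERTa's start token <s> plays the role of [CLS].

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")  # assumed already fine-tuned on the ERC dataset, then frozen
encoder.eval()

@torch.no_grad()
def semantic_feature(utterance: str) -> torch.Tensor:
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    outputs = encoder(**inputs, output_hidden_states=True)
    # hidden_states is a tuple: (embeddings, layer_1, ..., layer_24)
    last4 = torch.stack(outputs.hidden_states[-4:])   # (4, 1, seq_len, 1024)
    # average the [CLS] position over the last 4 layers -> x_i in R^{1024}
    return last4[:, 0, 0, :].mean(dim=0)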

Commonsense Feature Extraction We extract six types (xIntent, xAttr, xNeed, xWant, xEffect, xReact) of commonsense feature vectors related to utterance u_i from COMET (Bosselut et al., 2019):

c_i^j = COMET(u_i, r_j), 1 ≤ j ≤ 6,

where r_j is the token of the j-th relation type and c_i^j ∈ R^{d_c} denotes the feature vector of the j-th commonsense type for u_i.
The semantic-view feature is obtained by concatenating the semantic feature vector with its six corresponding commonsense feature vectors:

g_i = W_g ( x_i ⊕ SVD(c_i^1, c_i^2, ..., c_i^6) ),

where SVD is used to extract key features from the commonsense vectors and W_g is used to reduce the feature dimension.
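Since the exact form of this projection is not spelled out above, the following sketch only assumes that the six COMET vectors are compressed with a truncated SVD and concatenated with x_i before a linear layer W_g reduces the dimension; all dimensions are illustrative.

import torch
import torch.nn as nn

d_u, d_c, d_out, r = 1024, 768, 1024, 4   # illustrative sizes; r = retained SVD components
W_g = nn.Linear(d_u + 6 * r, d_out)       # dimension-reduction layer

def semantic_view_feature(x_i: torch.Tensor, commonsense: torch.Tensor) -> torch.Tensor:
    # x_i: (d_u,) semantic vector; commonsense: (6, d_c) COMET vectors for the six relations
    _, _, Vh = torch.linalg.svd(commonsense, full_matrices=False)
    compressed = (commonsense @ Vh[:r].T).reshape(-1)   # project onto top-r right singular vectors -> (6*r,)
    return W_g(torch.cat([x_i, compressed], dim=-1))    # g_i in R^{d_out}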
Context-view Feature Extraction

The goal of our probing technique is to derive a function f(•, •) that captures the correlation between an arbitrary pair of utterances. To this end, we employ a PCM (Gu et al., 2021), which is pre-trained on masked utterance generation, next utterance generation, and distributed utterance order ranking tasks. As shown in Figure 4, we first feed the N utterances into the PCM. We use h_{u_j} to denote the contextualized representation of the j-th utterance u_j, where 1 ≤ j ≤ N. Then, we replace u_i with a mask utterance [CLS, MASK, SEP] and feed the corrupted N utterances into the PCM again. Accordingly, we use h_{u_j\i} to denote the new representation of the j-th utterance u_j when u_i is masked out. Formally, we measure the distance between h_{u_j} and h_{u_j\i} to induce the correlation between u_i and u_j:

f(u_i, u_j) = d(h_{u_j}, h_{u_j\i}),

where d(•, •) is the distance metric measuring the difference between two vectors. We implement d(•, •) with the Euclidean distance:

d(h_{u_j}, h_{u_j\i}) = d_Euc(h_{u_j}, h_{u_j\i}) = || h_{u_j} − h_{u_j\i} ||_2,

where d_Euc(•, •) denotes a distance function in Euclidean space. By repeating the above process on each pair of utterances (u_i, u_j) and calculating f(u_i, u_j), we obtain a correlation matrix X ∈ R^{N×N}, where x_ij denotes the correlation between the utterance pair (u_i, u_j).
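A schematic implementation of the probing procedure is given below; pcm_encode is a hypothetical stand-in for the pre-trained conversation model, assumed to map a list of utterances to one contextualized vector per utterance.

import torch

def correlation_matrix(utterances, pcm_encode, mask_utterance="[CLS] [MASK] [SEP]"):
    # f(u_i, u_j): how far u_j's representation moves when u_i is masked out
    n = len(utterances)
    h = pcm_encode(utterances)                      # (n, d) clean representations
    corr = torch.zeros(n, n)
    for i in range(n):
        corrupted = list(utterances)
        corrupted[i] = mask_utterance               # mask out u_i
        h_masked = pcm_encode(corrupted)            # (n, d) representations with u_i masked
        corr[i] = torch.norm(h - h_masked, p=2, dim=-1)   # Euclidean distance per utterance
    return corr                                     # corr[i, j] = f(u_i, u_j)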

Graph Construction
Following Ghosal et al. (2019), a conversation having N utterances is represented as a directed graph G = (V, E, R, W), with vertices/nodes v_i ∈ V and labeled edges (relations) r_ij ∈ E, where r ∈ R is the relation type of the edge between v_i and v_j, and α_ij is the weight of the labeled edge r_ij, with 0 ⩽ α_ij ⩽ 1, α_ij ∈ W, and i, j ∈ [1, 2, ..., N]. Each utterance in the conversation is represented as a vertex v_i ∈ V in G. Each vertex v_i is initialized with the corresponding semantic-view feature g_i, for all i ∈ [1, 2, ..., N], and has an edge with the immediate p utterances of the past: v_{i−1}, ..., v_{i−p}, the f utterances of the future: v_{i+1}, ..., v_{i+f}, and itself: v_i. Considering speaker dependency and temporal dependency, there exist multiple relation types of edges. The edge weights W_ij are obtained by combining the results of two computations. One uses a similarity-based attention module to calculate edge weights α_ij; these weights keep changing during training, which can be regarded as adaptive training. The other takes the correlation x_ij between the utterance pair (u_i, u_j) computed by Eq. (3) as edge weights; these weights are kept frozen during training, which can be regarded as correlation-based knowledge injection:

W_ij = α_ij + b · x_ij,

where j = i − p, ..., i + f, and b denotes the predefined weight coefficient of x_ij, reflecting the injection intensity of correlation-based knowledge.
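The sketch below shows how the two kinds of edge weights might be combined over the window j ∈ [i − p, i + f]; the additive combination with coefficient b mirrors the description above, though the exact attention form and combination rule are assumptions.

import torch
import torch.nn.functional as F

def edge_weights(g, corr, i, p, f, b=0.5):
    # g: (N, d) semantic-view node features; corr: (N, N) correlation matrix from the probe
    n = g.size(0)
    neigh = list(range(max(0, i - p), min(n, i + f + 1)))
    # similarity-based attention over the past/future window (trainable path)
    alpha = F.softmax(g[neigh] @ g[i], dim=0)
    # frozen correlation-based knowledge injected with intensity b
    w = alpha + b * corr[i, neigh]
    return neigh, w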
Interaction between Utterances Based on the constructed graph, we utilize an RGCN to implement interactions between utterances and obtain the context-view feature f_i of utterance u_i. The detailed calculation process can be found in Appendix A.

Multi-view Feature Alignment
We adopt SCL to align semantic-view and context-view features. Specifically, in a batch of M training samples, for all m ∈ [1, 2, ..., M], both g_m and f_m are involved in the SCL computation: we can either incorporate g_m and f_m into the computation separately, forming 2M samples, or concatenate them. Taking the former approach as an example, the supervised contrastive loss over the 2M samples in a multi-view batch is

L_SCL = Σ_{i∈I} (−1 / |P(i)|) Σ_{p∈P(i)} log [ exp(z_i · z_p / τ) / Σ_{a∈A(i)} exp(z_i · z_a / τ) ],

where I = {1, ..., 2M} indexes the samples in a multi-view batch, z_i denotes the feature of the i-th sample (a g_m or an f_m), τ ∈ R^+ denotes the temperature coefficient used to control the distance between instances, P(i) = {j ∈ I : y_j = y_i} − {i} represents the samples with the same category as i while excluding i itself, and A(i) = I − {i} is the set of all other samples. To further enrich the samples under each category, we incorporate the prototype vectors corresponding to each category into the SCL computation. The prototype vectors can correspond to features from different views, forming multiple prototypes, or to the concatenation of features from different views, forming a single prototype.
For instance, with multiple prototypes, the updates for P(i) and A(i) are as follows:

P(i) ← P(i) ∪ {l_c, o_c},  A(i) ← A(i) ∪ L ∪ O,

where c represents the emotion category corresponding to sample i, and L = {l_1, ..., l_|Y|} and O = {o_1, ..., o_|Y|} respectively denote the index collections of the semantic-view and context-view prototype vectors over all emotion categories.
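A compact sketch of the multi-view supervised contrastive loss is shown below, where both g_m and f_m enter the 2M-sample batch; prototype enrichment is omitted for brevity.

import torch
import torch.nn.functional as F

def multiview_scl(g, f, labels, tau=0.1):
    # g, f: (M, d) semantic-view / context-view features; labels: (M,) emotion ids
    z = F.normalize(torch.cat([g, f], dim=0), dim=-1)        # 2M samples in the multi-view batch
    y = torch.cat([labels, labels], dim=0)
    sim = z @ z.T / tau
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # A(i) excludes i itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask    # P(i): same label, excluding i
    loss = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()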
Knowledge Store and Retrieval

Given a training set (u, y) ∈ (U, Y), the parametric model computes the semantic-view feature x and the context-view feature f of each input utterance u, yielding semantic-view features {x_i}_{i=1}^P and context-view features {f_i}_{i=1}^P for all P training inputs {u_i}_{i=1}^P. We use BERT-whitening (Su et al., 2021) to reduce the feature dimension, which accelerates retrieval.

We construct two knowledge stores, one from the semantic view and the other from the context view. The semantic-view knowledge store memorizes the semantic-view features x_i and corresponding emotion labels y_i as key-value pairs; the context-view knowledge store is constructed in the same way.
During inference, given an utterance u, the parametric model computes the corresponding semantic-view feature x and context-view feature f. We apply the same whitening transformation to x and f to obtain x̃ and f̃, which are used to retrieve the k nearest neighbors (kNN) N from the semantic-view and context-view knowledge stores, respectively, according to the L2 distance. Each retrieved neighbor set is converted into a distribution over emotion labels:

p_kNN(y | u) ∝ Σ_{(k_i, v_i) ∈ N} 1[y = v_i] exp(−d(k_i, q) / T),

where q is the whitened query feature (x̃ or f̃) and T is the temperature.
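Below is a sketch of one retrieval step with the faiss library, assuming the whitened keys of a knowledge store have already been added to a faiss.IndexFlatL2 and that neighbour votes are weighted by a softmax over negative L2 distances with temperature T, as formulated above. Each view would be backed by its own index, built once from the whitened training features and their labels.

import faiss
import numpy as np

def knn_distribution(query, index, store_labels, num_classes, k=16, T=10.0):
    # query: (d,) whitened feature; index: faiss.IndexFlatL2 built over the stored keys
    dists, ids = index.search(query.reshape(1, -1).astype("float32"), k)   # squared L2 distances
    weights = np.exp(-dists[0] / T)
    weights /= weights.sum()
    probs = np.zeros(num_classes)
    for w, j in zip(weights, ids[0]):
        probs[store_labels[j]] += w        # aggregate neighbour votes per emotion label
    return probs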

Training and Inference
Training Since the semantic-view and context-view features have been well aligned, we concatenate them for emotion classification and use the cross-entropy loss to compute the classification loss. Meanwhile, we use logit compensation (Menon et al., 2020) to eliminate the bias in the classification layer caused by class imbalance:

L_CE = − Σ_i log [ exp(φ_{y_i}(w_i) + δ_{y_i}) / Σ_{y∈Y} exp(φ_y(w_i) + δ_y) ],

where w_i = g_i ⊕ f_i, ⊕ represents feature concatenation, φ represents the classification layer, and δ_y is the compensation for class y, whose value is related to the class frequency. Finally, the batch-wise training loss combines the classification loss L_CE and the supervised contrastive loss L_SCL.
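A sketch of the compensated classification head follows; in line with the logit-adjustment view of Menon et al. (2020), δ_y is taken here to be proportional to the log class frequency, which is an assumption about the exact compensation used.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CompensatedClassifier(nn.Module):
    def __init__(self, dim, class_counts, scale=1.0):
        super().__init__()
        self.phi = nn.Linear(dim, len(class_counts))       # classification layer phi
        prior = torch.tensor(class_counts, dtype=torch.float)
        prior /= prior.sum()
        self.register_buffer("delta", scale * prior.log()) # delta_y from class frequencies (assumed form)

    def forward(self, g_i, f_i, labels=None):
        w_i = torch.cat([g_i, f_i], dim=-1)                # w_i = g_i (+) f_i
        logits = self.phi(w_i)
        if labels is not None:                             # training: compensated cross-entropy
            return F.cross_entropy(logits + self.delta, labels)
        return logits                                      # inference: plain logits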
Inference During inference, we not only infer the utterance's emotion through the trained model but also assist decision-making by retrieving examples memorized in the two knowledge stores:

p(y | u) = λ p_model(y | u) + µ p_sem(y | u) + γ p_ctx(y | u),

where λ, µ, and γ are the interpolation hyper-parameters between the model output distribution and the two kNN distributions.
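A minimal sketch of this interpolation is given below; the default weights correspond to the best setting found on IEMOCAP in the later analysis (λ = 0.7, µ = 0.2, γ = 0.1).

import numpy as np

def interpolate(p_model, p_sem_knn, p_ctx_knn, lam=0.7, mu=0.2, gamma=0.1):
    # p_*: probability distributions over the emotion labels; the weights sum to 1
    p = lam * p_model + mu * p_sem_knn + gamma * p_ctx_knn
    return int(np.argmax(p))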

Datasets and Evaluation metrics
We evaluate our method on four benchmark datasets: IEMOCAP (Busso et al., 2008), MELD (Poria et al., 2019a), DailyDialog (Li et al., 2017), and EmoryNLP (Zahiri and Choi, 2018). Detailed descriptions of each dataset can be found in Appendix B, and their statistics are presented in Table 1. We use only the textual modality of these datasets in our experiments. For evaluation metrics, we follow Shen et al. (2021b) and report micro-averaged F1 (excluding the majority class, neutral) for DailyDialog and weighted-average F1 for the other datasets.

Implementation Details
The initial weights of RoBERTa come from Huggingface's Transformers (Wolf et al., 2020). We use the Adam (Kingma and Ba, 2014) optimizer to optimize the network parameters of our model; the learning rate is set to 0.0003 and remains constant during training. We adopt the faiss (Johnson et al., 2019) library to conduct retrieval, and the dimensions of the semantic-view and context-view features become 384 and 64, respectively, after the BERT-whitening operation. We search the hyper-parameters on the validation set. All experiments are conducted on an A6000 GPU with 48GB memory.

Compared Methods
We compare our model with the following baselines in our experiments.
In addition to the above-mentioned baselines, we also consider the performance of ChatGPT (Brown et al., 2020) on the ERC task as evaluated by Yang et al. (2023) and Zhao et al. (2023).

Overall Performance
The comparison results between the proposed MFAM and other baseline methods are reported in Table 2. We can observe from the results that our proposed MFAM consistently achieves state-of-the-art results on the four widely used benchmarks.
As a graph-based method, MFAM achieves an average performance improvement of +4.36% over preceding graph-based methods. Moreover, considering the four benchmarks individually, MFAM achieves performance improvements of +2.38%, +1.82%, +4.82%, and +2.04%, respectively, marking a significant enhancement over earlier graph-based methods.
MFAM also shows a clear advantage over other types of methods, achieving average performance gains of +2.06% and +7.60% over transformer-based and recurrence-based methods, respectively. Moreover, we observe that the performance of ChatGPT in zero-shot and few-shot scenarios still lags significantly behind that of models trained on the full dataset.

Ablation Study
To study the effectiveness of the modules in MFAM, we evaluate MFAM after removing the knowledge module, the alignment module, and the memorization module. The results of the ablation study are shown in Table 3.
The knowledge module includes commonsense knowledge and correlation-based knowledge. Its removal results in a sharp performance drop on IEMOCAP and DailyDialog and a slight drop on MELD and EmoryNLP; we can therefore infer that the conversations in IEMOCAP and DailyDialog involve more commonsense knowledge and more complex utterance interactions. Removing the alignment module leads to a similar and significant drop in performance on all datasets, demonstrating the importance of multi-view feature alignment in the ERC task. Removing the memorization module results in a drastic decrease in performance across all datasets, highlighting the effectiveness of inference through memorization in addressing class imbalance. Moreover, removing the alignment and memorization modules simultaneously causes a performance decline greater than the sum of the declines caused by removing each module individually, indicating that aligning semantic-view and context-view features facilitates joint retrieval over the two views.

Analysis on λ, µ, γ
λ, µ, and γ are important hyper-parameters that respectively represent the weights of the model output, semantic-view retrieval, and context-view retrieval during inference. Determining the appropriate weight combination to reinforce their interplay is therefore crucial. Table 4 shows the test F1 scores on the IEMOCAP dataset for different values of λ, µ, and γ.
We can observe that when µ is fixed at 0.2, dynamically adjusting λ and γ leads to continuous changes in performance: when λ rises to 0.7, with γ correspondingly dropping to 0.1, the best performance is achieved. Continuing to increase λ and reduce γ results in a performance decline. In addition, varying µ while keeping λ and γ at 0.7 and 0.1, respectively, also decreases performance.

Visualization on Multi-view Feature
To conduct a qualitative analysis of multi-view feature alignment, we use t-SNE (Van der Maaten and Hinton, 2008) to visualize the prototype vectors corresponding to the semantic-view and context-view features of each emotion category, as shown in Figure 5. It can be observed that the semantic-view and context-view features of the same emotion category are well aligned. Meanwhile, positive emotions such as "happy" and "excited" are close to each other, while negative emotions like "frustrated", "angry", and "sad" are also close to each other.

Conclusion
In this paper, we propose Multi-view Feature Alignment and Memorization (MFAM) for ERC. Firstly, in order to obtain accurate context-view features, we treat the PCM as a prior knowledge base from which we elicit correlations between utterances through a probing procedure. Then we adopt SCL to align semantic-view and context-view features. Moreover, we improve the recognition of tail-class samples by retrieving examples memorized in the two knowledge stores during inference. We achieve state-of-the-art results on four widely used benchmarks, and the ablation study and visualization results demonstrate the effectiveness of multi-view feature alignment and memorization.

Limitations
There are two major limitations in this study. Firstly, semantic-view and context-view retrieval based on the training set may suffer from dataset and model bias. Secondly, during inference we need to combine three probability distributions: semantic-view retrieval, context-view retrieval, and the model output. Determining the appropriate weight combination to reinforce their interplay is important, and additional computational resources are required to search for these parameters. We leave these limitations for future research.

A Interaction between Utterances
Following Ghosal et al. (2019), we use a two-step graph convolution process to obtain context-view features. In the first step, a new feature vector h_i^(1) is computed for vertex v_i by aggregating local neighbourhood information with the relation-specific transformation inspired by Schlichtkrull et al. (2018):

h_i^(1) = σ( Σ_{r∈R} Σ_{j∈N_i^r} (W_ij / c_{i,r}) W_r^(1) g_j + W_ii W_0^(1) g_i ),

where W_ij and W_ii are the edge weights, N_i^r denotes the neighbouring indices of vertex i under relation r ∈ R, c_{i,r} = |N_i^r| is a normalization constant, σ is an activation function such as ReLU, and W_r^(1) and W_0^(1) are learnable parameters of the transformation. In the second step, another local-neighbourhood-based transformation is applied over the output of the first step:

h_i^(2) = σ( Σ_{j∈N_i} W^(2) h_j^(1) + W_0^(2) h_i^(1) ),

where W^(2) and W_0^(2) are the parameters of this transformation and σ is the activation function. The output h_i^(2) is taken as the context-view feature f_i of utterance u_i.
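The following is a simplified dense-matrix sketch of this two-step relational graph convolution; a practical implementation would typically rely on a sparse graph library such as torch_geometric, so this is an approximation rather than the exact implementation.

import torch
import torch.nn as nn

class TwoStepRGCN(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, num_rel):
        super().__init__()
        self.W_r = nn.ModuleList([nn.Linear(in_dim, hid_dim, bias=False) for _ in range(num_rel)])
        self.W0_1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W_2 = nn.Linear(hid_dim, out_dim, bias=False)
        self.W0_2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, g, adj):
        # g: (N, in_dim) node features; adj: (num_rel, N, N) edge weights W_ij per relation
        h1 = self.W0_1(g)
        for r, W in enumerate(self.W_r):
            deg = adj[r].gt(0).sum(dim=1, keepdim=True).clamp(min=1)   # c_{i,r} = |N_i^r|
            h1 = h1 + (adj[r] / deg) @ W(g)                            # relation-specific aggregation
        h1 = torch.relu(h1)
        adj_all = adj.sum(dim=0)
        h2 = torch.relu(adj_all @ self.W_2(h1) + self.W0_2(h1))        # second, relation-agnostic step
        return h2                                                      # context-view features f_i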

B Detailed Descriptions of ERC Datasets
IEMOCAP (Busso et al., 2008): A multimodal ERC dataset. Each conversation in IEMOCAP comes from performances by two actors based on scripts. Models are evaluated on samples with 6 emotion types: neutral, happiness, sadness, anger, frustrated, and excited.
MELD (Poria et al., 2019a): A multimodal ERC dataset gathered from the TV show Friends. 7 emotion labels are included: neutral, happiness, surprise, sadness, anger, disgust, and fear.
DailyDialog (Li et al., 2017): Human-written dialogues collected from communications of English learners. 7 emotion labels are included: neutral, happiness, surprise, sadness, anger, disgust, and fear.
EmoryNLP (Zahiri and Choi, 2018): This dataset comprises TV show scripts obtained from the television series Friends, with variations in scene selection and emotion labeling. 7 emotion labels are included: neutral, sad, mad, scared, powerful, peaceful, and joyful.

C.2 Case Study
After eliciting correlations between utterances from a pre-trained conversation model through the probing procedure, an utterance attends to the coherence of the context and the overall logic flow of the dialogue. Table 6 shows a case where u_3 nicely picks up the emotion of u_1, and the emotions of u_17 and u_18 also correspond well. Meanwhile, examining the relevance scores between u_17 and the other utterances shows that u_17 has noticed the topic change at u_6 (a more sorrowful topic turning into an entertaining one).

Figure 1: Major challenges in the ERC task.

Figure 3: The overall framework of MFAM, which mainly consists of four parts: multi-view feature extraction, multi-view feature alignment, knowledge store and retrieval, and training and inference.

Figure 5: Visualization of the prototype vectors corresponding to the semantic-view and context-view features of each emotion category. Proto1 and proto2 correspond to semantic-view and context-view features, respectively.

Table 1: Statistics of the four benchmark datasets.


Table 2: Overall results (%) against various baseline methods for ERC on the four benchmarks. † and ‡ represent different prompt templates; zs and fs respectively represent the zero-shot and few-shot scenarios. * represents models with RoBERTa utterance features. The results reported in the table are from the original papers or their official repositories. Best results are highlighted in bold. The improvement of our model over all baselines is statistically significant with p < 0.05 under a t-test.

Table 3: Results of the ablation study on the four benchmarks.

Table 5: Generalization ability of the proposed model.

C.1 Generalization Ability of the Model

Based on the IEMOCAP dataset, we apply our proposed method to two graph network models, RGAT and DAG-ERC, and two non-graph network models, RoBERTa and CoG-BART. The experimental results are shown in Table 5. For the RoBERTa model, since it does not model context-view features, we only apply the memorization technique. Based on the experimental results, applying our method to other graph networks and non-graph structured models brings significant performance improvements, indicating that our proposed method has good generalization capabilities.