Context-Dependent Embedding Utterance Representations for Emotion Recognition in Conversations

Emotion Recognition in Conversations (ERC) has been gaining importance as conversational agents become increasingly common. Recognizing emotions is key to effective communication, making it a crucial component in the development of effective and empathetic conversational agents. Knowledge and understanding of the conversational context are extremely valuable for identifying the emotions of the interlocutor. We thus approach Emotion Recognition in Conversations by leveraging the conversational context, i.e., taking into account previous conversational turns. The usual approach to modeling the conversational context has been to produce context-independent representations of each utterance and subsequently perform contextual modeling of these. Here we propose context-dependent embedding representations of each utterance, leveraging the contextual representational power of pre-trained transformer language models. In our approach, we feed the conversational context, appended to the utterance to be classified, as input to the RoBERTa encoder, to which we append a simple classification module. This discards the need to deal with context after obtaining the embeddings, since these already constitute an efficient representation of that context. We also investigate how the number of introduced conversational turns influences model performance. The effectiveness of our approach is validated on the open-domain DailyDialog dataset and on the task-oriented EmoWOZ dataset.


I. INTRODUCTION
Emotion Recognition in Conversations (ERC) is useful in automatic opinion mining, emotion-aware conversational agents, and assistive modules for therapeutic practices. There is thus increasing interest in endowing machines with efficient emotion recognition modules.
Knowledge and understanding of the conversational context, i.e., of the previous conversation turns, are extremely valuable in identifying the emotions of the interlocutors [20] [3] [18].
Research in automatic emotion recognition using machine learning techniques dates back to the end of the 20th century. However, the use of conversational context as auxiliary information for classifiers did not appear until publicly available conversational datasets became more common.
State-of-the-art ERC works leverage not only state-of-the-art pre-trained language models such as BERT [4] and RoBERTa [15], but also deep, complex architectures to model several factors that influence the emotions in a conversation [18]. Such factors usually pertain to self- and inter-speaker emotional influence and to the context and emotion of preceding utterances.
In this paper we argue that the powerful representation capabilities of pre-trained language models can be leveraged to model context without the need for additional elaborate classifier architectures, allowing for much simpler and more efficient models. Furthermore, it is our contention that the Transformer, the backbone of our chosen language model, is better at preserving contextual information, since it has a shorter path of information flow than the RNNs typically used for context modelling. Accordingly, we rely on the RoBERTa language model and resort to a simple classification module to preserve the contextual information.
The usual approach to modeling the conversational context has been to produce context-independent representations of each utterance and subsequently perform contextual modeling of those representations. State-of-the-art approaches start by resorting to embedding representations from language models and employ gated or graph neural network architectures to perform contextual modelling of these embedding representations at a later step. In our much simpler and more efficient proposed approach, we produce context-dependent embedding representations of each utterance by feeding not only the utterance but also its conversational context to the language model. We thus discard the need to deal with context after obtaining the embeddings, since these already constitute an efficient representation of that context. Our experiments show that by leveraging context in this way, one can obtain state-of-the-art results with RoBERTa and a simple classification module, surpassing more complex state-of-the-art models.
II. RELATED WORK
Amongst the first works considering contextual interdependences among utterances is that of Poria et al. [19], which uses LSTMs to extract contextual features from the utterances. These gated recurrent networks make it possible to share information between consecutive utterances while preserving their order. A more elaborate model also leveraging gated recurrent networks is DialogueRNN [16], which uses GRUs to model the speaker, context, and emotion of preceding utterances, keeping a party state and a global state that are used to model the final emotion representation.
Gated recurrent networks have a long path of information flow, which makes it difficult to capture long-term dependencies. These can be better captured with the Transformer, which has a shorter path of information flow. Its invention in 2017 [23] led to a new state of the art in several Natural Language Processing tasks.
Amongst the first works leveraging the Transformer is the Knowledge-Enriched Transformer (KET) [25]. It uses self-attention to model context and response, and also makes use of an external knowledge base, a graph of concepts that is retrieved for each word.
Following the invention of Transformers, pre-trained language models brought about another new state of the art in 2019. Since their introduction, most state-of-the-art ERC works have resorted to encoder pre-trained language models [21] [8] [13].
COSMIC [8] leverages RoBERTa Large as a feature extractor. Furthermore, it makes use of the commonsense transformer model COMET [1] to extract commonsense features. Five bi-directional GRUs model a context state, internal state, external state, intent state, and emotion state that influence the final emotion classification.
Psychological [13] also uses RoBERTa Large for utterance encoding, along with COMET. For conversation-level encoding, it constructs a graph of utterances to model the actions and intentions of the speaker along with the interactions with other utterances. It uses COMET to introduce commonsense knowledge into the graph edge representations and processes this graph with a graph transformer network.

III. METHODOLOGY
We describe how we obtain a contextual embedding representation of the sentence and its context with RoBERTa, how we pool the contextual embeddings, our classification module and how we obtain the emotion labels. These processes can be observed in Figure 2.

A. Task definition
Given a conversation, i.e., a sequence of utterances u_i with corresponding emotions e_i from a predefined set of emotions, the aim of the task of ERC is to correctly assign an emotion to each utterance of the conversation. An utterance consists of a sequence of tokens w_{it}, t = 1, ..., T_i, representing its T_i words. The usual approach for this task has been to produce context-independent representations of each utterance and perform contextual modeling of these. In our approach, we produce context-dependent representations of each utterance that represent not only the utterance but also a given number of previous utterances from the conversation.

B. Context-dependent feature extraction
For context-dependent feature extraction, we feed as input to RoBERTa the utterance we intend to classify, u_i, concatenated with its conversational context, i.e., the c previous utterances in the conversation:

input_i = [u_{i-c}; ...; u_{i-1}; u_i]. (2)
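As a concrete illustration, the input construction can be sketched as follows. This is a minimal sketch, assuming RoBERTa's </s> token as the separator between turns; the paper does not specify the exact separator, and the function name is ours:

```python
def build_input(utterances, i, c, sep=" </s> "):
    """Concatenate the c previous utterances with utterance i.

    utterances: list of utterance strings in conversation order.
    The </s> separator is an assumption: RoBERTa uses it between
    text segments, but the exact formatting here is illustrative.
    """
    start = max(0, i - c)          # clamp at the conversation start
    context = utterances[start:i]  # the c (or fewer) previous turns
    return sep.join(context + [utterances[i]])
```

The resulting string is then tokenized and fed to the encoder as a single sequence, so the self-attention layers can attend across utterance boundaries.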

C. Pooling
The RoBERTa encoder outputs several layers of embeddings representing the utterance and, in our approach, also the preceding utterances it receives as input. Each layer comprises as many token embeddings as there are input tokens, each token being a vector whose dimension corresponds to the RoBERTa hidden size.
From these embeddings, one can extract a suitable representation of the sentence. Choosing all tokens from all layers would yield an extremely memory-demanding classification layer and may not yield the best model performance. We thus choose the first embedding of the last layer L, the [CLS] token used for classification, as in Equation 3.
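The pooling step amounts to selecting a single vector from the encoder output. A minimal sketch, with a NumPy array standing in for the encoder's last hidden states (the function name is ours):

```python
import numpy as np

def cls_pool(last_hidden_state: np.ndarray) -> np.ndarray:
    """Pool a batch of encoder outputs to one vector per sequence by
    taking the first token of the last layer, i.e. RoBERTa's <s>
    ([CLS]-equivalent) token.

    last_hidden_state: (batch, seq_len, hidden_size)
    returns:           (batch, hidden_size)
    """
    return last_hidden_state[:, 0, :]
```

With the Hugging Face Transformers library, the same selection would be applied to the model output's last hidden state.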

D. Emotion Classification
The classification module that follows RoBERTa is a linear fully connected layer, applying a linear transformation to the pooled encoder output data. Its input size is the RoBERTa encoder hidden size and its output size is the number of emotion classes.
The final label probability distribution is yielded by applying the softmax operation to the output of the classification head, and the predicted label is the one with the highest probability:

ŷ_i = argmax_k softmax(W h_i + b)_k,

where h_i is the pooled [CLS] embedding of utterance u_i, and W and b are the weight matrix and bias of the classification layer.
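The head described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation: the weights here are random stand-ins, the hidden size 768 corresponds to RoBERTa-base, and the 7 classes match DailyDialog's label set:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, N_CLASSES = 768, 7   # RoBERTa-base hidden size; 7 emotion labels

# Hypothetical parameters of the single fully connected layer.
W = rng.normal(size=(HIDDEN, N_CLASSES))
b = np.zeros(N_CLASSES)

def classify(pooled, W, b):
    """Linear transformation of the pooled embedding followed by a
    softmax; the predicted label is the argmax of the distribution."""
    logits = pooled @ W + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()
    return probs, int(np.argmax(probs))
```

During training, the softmax would normally be folded into a cross-entropy-with-logits loss rather than applied explicitly.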

IV. EXPERIMENTAL SETUP
A. Training
Our model is based on RoBERTa-base from the Transformers library by Hugging Face [24]. It is trained with the cross-entropy loss with logits. The Adam [11] optimizer is used with initial learning rates of 1e-5 and 5e-5 for the encoder and the classification head, respectively, with a decay rate of 0.95 applied to the encoder after each training epoch. The encoder is frozen for the first epoch. The batch size is set to 4 and gradient clipping to 1.0. As a stopping criterion, early stopping terminates training if there is no improvement in macro-F1 on the validation set after 5 consecutive epochs, for a maximum of 10 epochs. The checkpoint used in testing is the one that achieves the highest macro-F1 score on the validation set.
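The schedule and stopping logic can be sketched as follows. This is a sketch under stated assumptions, with function names ours: the paper's phrasing of the 0.95 decay is ambiguous, so the sketch assumes the encoder learning rate decays once per epoch, and `validate` stands in for a full train-plus-validation epoch:

```python
ENC_LR, HEAD_LR = 1e-5, 5e-5   # initial learning rates from the paper
DECAY = 0.95                   # per-epoch decay factor (encoder only)

def encoder_lr(epoch: int) -> float:
    """Encoder learning rate after `epoch` completed epochs; the
    classification head keeps its fixed rate of 5e-5.  Assumes the
    0.95 decay is applied once per epoch."""
    return ENC_LR * DECAY ** epoch

def train_with_early_stopping(validate, max_epochs=10, patience=5):
    """validate(epoch) -> macro-F1 on the validation set after that
    epoch.  Stops once the score fails to improve for `patience`
    consecutive epochs; the best epoch is the checkpoint kept for
    testing."""
    best_epoch, best_f1, stale = -1, float("-inf"), 0
    for epoch in range(max_epochs):
        f1 = validate(epoch)
        if f1 > best_f1:
            best_epoch, best_f1, stale = epoch, f1, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_f1
```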

B. Evaluation
We evaluate the performance of our model with the macro F1-score. The reported results are the average of 5 runs, corresponding to 5 distinct random seeds that are kept fixed for a meaningful comparison across all experiments. This averaging is motivated by the fact that results for the same experiment obtained with different random seeds can vary by about 3 points in macro F1-score, a large deviation given that our proposed approach yields an improvement of that magnitude and that comparisons between state-of-the-art models are based on improvements of less than 1 point. This procedure is in line with several authors who also resort to 5-run averages [13] [25] [21] [22].
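Macro-F1, unlike micro-averaged F1, weighs every class equally regardless of its frequency. A self-contained sketch of the metric (the function name is ours; in practice a library implementation such as scikit-learn's would be used):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores, so minority emotion
    classes weigh as much as the dominant Neutral class."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

The reported figure would then be this score averaged over the 5 seeds.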
Our code is publicly available.¹

C. Datasets
We evaluate our approach on the chit-chat DailyDialog [14] dataset and on the task-oriented EmoWOZ [7] dataset.
1) DailyDialog: DailyDialog is built from websites used to practice English dialogue in daily life. It is labelled with Ekman's six basic emotions [5], anger, disgust, fear, happiness, sadness, and surprise, or with neutral. The publicly available splits of Yanran are used.
2) EmoWOZ: EmoWOZ is derived from MultiWOZ [2], one of the largest multi-domain benchmark corpora for various dialogue tasks. User utterances are annotated with one of the emotions fear, dissatisfaction, apologetic, abusive, excited, satisfied, or neutral.
The statistics and the proportion of labels in the datasets are presented in Tables I and II, respectively. From Table II it can be observed that both datasets are imbalanced, not only because of their dominant majority Neutral class, but also because of the relative imbalance among the minority classes. We have therefore opted to use the macro-F1 score for evaluation, in order to promote consistent performance across all classes.

V. RESULTS AND ANALYSIS
A. Iterating towards the ideal approach
We have performed extensive experiments in order to arrive at our ideal model architecture, from experimenting with different approaches to pooling the various layers of embeddings RoBERTa provides, to choosing which classification module to employ from within a wide variety of deep learning architectures. We put forward these experiments in this subsection.
1) Fine-tuning: Fine-tuning, i.e., modifying the pre-trained RoBERTa weights along with the classification head during training on the target dataset, is a determining factor in the success of our approach.
In our experiments, we observed that if we did not fine-tune the language model and only trained the classification head, the model would always predict the majority Neutral class. This supports the notion that pre-trained language models are useful for a wide variety of tasks but need to be fine-tuned for the specific task at hand.

2) Pooling:
We have performed experiments with several pooling alternatives: average pooling, max pooling, concatenating the [CLS] tokens of more than one of the last layers, and concatenating the [CLS] token with the result of average pooling. All of these alternatives resulted in lower performance than choosing the [CLS] token of the last layer alone. This suggests a high representational power for the [CLS] token, which is designed for classification, and discards the need to directly consider other tokens for this task.
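For reference, the two simplest alternatives can be sketched as follows, with NumPy arrays standing in for encoder outputs and function names ours. Both variants are mask-aware so that padding tokens do not distort the result, an assumption about the implementation detail rather than something the paper states:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the real (non-padding) tokens.
    hidden: (seq_len, hidden_size); mask: (seq_len,), 1 = real token."""
    m = mask[:, None]
    return (hidden * m).sum(axis=0) / m.sum()

def max_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise maximum over the real tokens only."""
    return np.where(mask[:, None] == 1, hidden, -np.inf).max(axis=0)
```

In our experiments, both underperformed simply taking the last layer's [CLS] embedding.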

3) Classification module:
We have also performed alternative experiments with classification modules other than our simple classification head. These consisted of passing the pooled embeddings through Recurrent Neural Networks [6], uni- [10] and bi-directional [9] Long Short-Term Memory networks, and a Conditional Random Field [12] before feeding them to the classification head. Performance was lower in all alternative experiments when compared to our main approach of using a simple classification head. These results indicate that leveraging RoBERTa's representational power for context suffices, and that there is no apparent need to model the context with complex classification modules after obtaining our context-dependent embedding utterance representations.

B. Overall Performance
For each of the two datasets, we have performed experiments ranging from introducing no context (c = 0) to introducing 4 previous conversation turns (c = 4); the overall performance, operationalized by the macro-F1 metric, is reported in Table III. Our results are an average of 5 runs. It can be observed that introducing previous conversational context turns leads to an increase in macro-F1 score; as hypothesised, providing no context is never the best option. This shows that introducing an adequate number of context turns directly as the language model input significantly improves model performance. In general, performance increases with each additional context turn up to the ideal number of turns and then decreases. Regarding overall performance, it can be concluded that the ideal number of introduced context turns for ERC in both datasets is 3.

C. Performance on each emotion label
For each dataset, we report the results on each individual emotion label and also present the confusion matrices for the best determined c value. Our results are an average of 5 runs.
The individual emotion label F1-scores for the DailyDialog dataset are presented in Table IV. It can be observed that for more than half of the labels, Anger, Fear, Sadness, and Neutral, the ideal context is 3 turns, which maximises their F1-scores as well as the macro-F1 score in Table III. For the remaining labels, the ideal context is 4 turns for Disgust, 2 turns for Happiness, and 1 turn for Surprise. As expected, providing no context is never the best option.
The confusion matrix for c = 3, corresponding to the highest macro-F1 score, is displayed in Figure 3, in which the label nomenclature and order are the same as in Table IV but with Neutral as the first label.
This matrix indicates that the majority of the errors are due to classifying utterances as Neutral instead of assigning a non-neutral emotion. The classifier also displays some confusion in discerning between Happiness and Surprise.
The individual emotion label F1-scores for the EmoWOZ dataset are presented in Table V. It can be observed that for 4 of the labels, Dissatisfied, Excited, Satisfied, and Neutral, the ideal context is 4 turns, which maximises their F1-scores. Regarding the other labels, the ideal context is 2 turns for Fear, 3 for Abusive, and, surprisingly, 0 turns for Apologetic, which might indicate that this emotion is expressed very explicitly in this dataset.
The confusion matrix for c = 3, corresponding to the highest macro-F1 score, is displayed in Figure 4, in which the label nomenclature and order are the same as in Table V but with Neutral as the first label.
This matrix indicates that the majority of the errors are due to classifying utterances as Neutral instead of assigning a non-neutral emotion, as happens with the DailyDialog dataset.
It is worth noting that our results are an average of 5 runs and that the final model is selected via performance on the validation set. Therefore, the fluctuation in individual label F1-scores does not hinder the representativeness of our results, and similar fluctuations may occur in the results of the other reported state-of-the-art models.

D. Comparison with state-of-the-art
We further compare our approach to other state-of-the-art approaches that also resort to the RoBERTa or BERT pre-trained language models, which allows for a fair comparison between approaches built on the same underlying encoders. Regarding performance on the DailyDialog dataset, our approach outperforms not only plain RoBERTa/BERT, but also RoBERTa/BERT embedded in more elaborate gated neural network models such as DialogueRNN and COSMIC. The Psychological model has a slightly higher performance than ours, which may be due to the fact that it leverages a large commonsense knowledge base and an elaborate classifier architecture, while we opted for a minimalistic classification module. Concerning performance on the EmoWOZ dataset, our approach outperforms all baselines by a wide margin, setting a new state of the art for task-oriented emotion datasets.

E. Case Studies
In Table VII we compare the performance of our contextual classifier when considering the ideal 3 context turns on both datasets versus not considering any context at all.
In the first example, from the DailyDialog dataset, A offers B assistance, so B asks A to view the apartment, to which A sadly apologizes, informing B that B will not be able to view it. The classifier that does not consider context classifies this last apology as neutral. However, given the context of the conversation, A should not be neutral, since A is unable to assist B, which was A's initial purpose. The contextual classifier is able to take this into account, thus correctly classifying A's utterance with the emotion Sadness.
In the second example, also from the DailyDialog dataset, A gives B a good idea, to which B happily reacts and thanks A. A happily reacts to B's acknowledgements, especially since B described A's suggestion as a "wonderful idea". The classifier that does not consider context classifies A's final reaction to B as neutral, since A's utterance is merely "No problem. Good luck", and it is unable to recognize A's positive reaction to B's acknowledgements. The contextual classifier, however, taking these utterances into account, correctly classifies A's final reaction with the emotion Happiness.
In the last example, from the EmoWOZ dataset, B is merely answering A's question of what day B would like to travel. The classifier that does not consider context takes into account the words "please" and "vacation" which bias the classification towards the emotion Excited. The contextual classifier might grasp that "please" is used as a polite expression and "vacation" is just the object of the phrase, thus correctly classifying the utterance as neutral.

VI. CONCLUSIONS AND FUTURE WORK
In this work, we have leveraged context-dependent embedding utterance representations for Emotion Recognition in Conversations. Our approach of producing context-dependent representations of each utterance contrasts with the usual approach of producing context-independent representations of each utterance and subsequently performing contextual modeling of these. It consists of feeding a variable number of previous conversational turns, appended to the utterance to be classified, as input to the state-of-the-art pre-trained language model RoBERTa, to which we append a simple classification module. We further investigated how the number of introduced conversational turns influences model performance, concluding that introducing an adequate number of context turns directly as the language model input significantly improves it.
Furthermore, we attained state-of-the-art results on the widely used DailyDialog dataset and established a new state of the art by a wide margin on the EmoWOZ dataset, results that are usually attained by more elaborate classifiers resorting to larger state-of-the-art pre-trained language models and more complex classification modules.
For future work, beyond adequately capturing the conversational context, the focus of our approach, several other factors that influence the emotions in a conversation could be captured, such as self- and inter-speaker emotional influence and the emotion of preceding utterances. To this end, various architectures can be used, comprising not only state-of-the-art language models for embeddings but also combining our context-dependent embedding utterance representations with more elaborate classification modules.
Finally, we put forward important ethical aspects pertaining to Emotion Recognition in Conversations. These include, but are not limited to, whether an ERC module should be developed or used for a certain purpose, which data to collect, the subjects behind the data, diversity, inclusiveness, privacy, control, and possible biases and misuses of the application [17]. Research taking these aspects into account will benefit the community with better ERC modules for current and novel applications.