Few-Shot Emotion Recognition in Conversation with Sequential Prototypical Networks

Several recent studies on dyadic human-human interactions have been done on conversations without specific business objectives. However, many companies might benefit from studies dedicated to more precise environments such as after sales services or customer satisfaction surveys. In this work, we place ourselves in the scope of a live chat customer service in which we want to detect emotions and their evolution in the conversation flow. This context leads to multiple challenges that range from exploiting restricted, small and mostly unlabeled datasets to finding and adapting methods for such context. We tackle these challenges by using Few-Shot Learning while making the hypothesis it can serve conversational emotion classification for different languages and sparse labels. We contribute by proposing a variation of Prototypical Networks for sequence labeling in conversation that we name ProtoSeq. We test this method on two datasets with different languages: daily conversations in English and customer service chat conversations in French. When applied to emotion classification in conversations, our method proved to be competitive even when compared to other ones.


Introduction
There has been a recent surge in research focusing on analyzing dyadic human-to-human interactions. Many of these studies (Poria et al., 2017; Zadeh et al., 2018a,b) focus on emotion recognition in conversations (ERC), taking into account multiple data modalities. Moreover, most of the progress made in ERC has been achieved without factoring in constraints corresponding to specific but prominent industrial applications, like customer service. This is partly due to studies relying on artificial datasets (Li et al., 2017; Busso et al., 2008) made of mock-up conversations to facilitate result replication and comparison. A few existing studies address customer service applications (Mundra et al., 2017; Yom-Tov et al., 2018; Maslowski et al., 2017) and show how difficult it is to deal with such in-the-wild and domain-specific data.
In this work, we focus on data from a live chat support service in which we want to detect emotions and their evolution in the conversational flow. This setting corresponds to a human dyadic conversation, albeit with a specific business-related objective. We make the hypothesis that the emotion flows of the visitor and the operator will bring information on the quality of the service and help operators better assist customers. This hypothesis is close to relevant studies on the importance of emotions and empathy in dyadic call center conversations (Alam, 2017; Alam et al., 2018). This specific setting leads to multiple challenges: it is difficult and costly to label this kind of data, and even then, these exchanges are very sparse in emotions, most of the labels associated with utterances being neutral. To maximize data efficiency, we use Few-Shot Learning (FSL), and adapt a popular approach to our highly unbalanced data. By setting up this approach in an episodic fashion (Ravi and Larochelle, 2016), we join studies on ERC and studies on FSL to tackle this industrial use-case.
We contribute by proposing a variant of Prototypical Networks (Snell et al., 2017) dedicated to ERC on data produced by company services, framing it as a sequence labeling task. We modify the original model by allowing it to consider the whole conversational context when making predictions, through a sequential context encoder and the use of Conditional Random Fields (CRF) on top of the model. We test our method on two datasets, in two different languages. The first one, made of daily conversations in English, allows us to compare ourselves to previous methods, while the second one, made of private data from a live chat customer service, allows us to conduct a performance analysis in our target setting. We also present the latter dataset, along with its annotation process.
This paper is organized as follows. First, we sum up the related work on textual ERC and FSL in conversations (Section 2). Then we present the datasets along with the emotional annotation scheme and the annotation campaign set up for the customer service live chats dataset (Section 3). We continue by thoroughly presenting the Sequential Prototypical Networks (Section 4) before looking at the results achieved on both datasets (Section 5). Finally, we present the limitations of such a system (Section 6) and conclude (Section 7).

Related Work
Emotion Recognition in Conversations. In recent years, the widening scope of emotion detection tasks led to the rise of another sub-topic: detecting emotions in conversations. This research topic, commonly referred to as ERC, gained popularity when Poria et al. (2017) first applied recurrent neural networks (RNN) (Jordan, 1997) to multi-modal emotion recognition in conversations. This led to many improvements (Zadeh et al., 2018a,b; Hazarika et al., 2018). Among those, one approach used three Gated Recurrent Units (GRU) (Cho et al., 2014), one for each context representation target (speaker, utterance, emotion). Studies on ERC applied to text followed, mainly built on an artificial conversation dataset named DailyDialog (Li et al., 2017). Zhong et al. (2019) incorporated a knowledge base into the network using context-aware attention and hierarchical self-attention based on Transformers (Vaswani et al., 2017). Ghosal et al. (2019) use graph neural networks to deal with context propagation limitations. These approaches consider the conversational context surrounding the current utterance; on the other hand, some recent studies treat the conversation as a sequence and tackle ERC as a sequence labeling task. We follow this last approach. However, these supervised approaches are difficult to use, as it is hard to find a sufficient number of conversations labeled with emotions. Hence, in this paper, we approach ERC as a few-shot learning problem.
Few-Shot Learning. FSL (Miller et al., 2000; Fei-Fei et al., 2006; Lake, 2015) is suitable to tackle this data limitation. It aims at generalizing faster, leading to a lower dependency on data quantity. It is mainly set up through episodic composition (Ravi and Larochelle, 2016), which recreates the few-shot learning setting by working with small training episodes.
Several learning methods are based on metric learning. Siamese Networks, which share some weights, are used to learn a metric between examples (Koch et al., 2015). Matching Networks (Vinyals et al., 2017) use the training examples to find the weighted nearest neighbors. Prototypical Networks (Snell et al., 2017) average the representations of the training examples of each class and use a Euclidean distance to compare elements to these class representations. Relation Networks replace the Euclidean distance with a deep neural network that learns the distance metric (Sung et al., 2018). In this work, we consider approaches based on Prototypical Networks. As Al-Shedivat et al. (2021) recently showed, such approaches are the most efficient when working with a small number of training samples. Many variants have been proposed, on different tasks and topics such as relation classification in text (Gao et al., 2019; Hui et al., 2020; Ren et al., 2020), sentiment classification in Amazon comments (Bao et al., 2020), named entity recognition (Fritzler et al., 2019; Hou et al., 2020; Perl et al., 2020; Safranchik et al., 2020), or even speech classification in conversation (Koluguri et al., 2020). This surge of interest in applying few-shot learning to these topics can be attributed to specific datasets, such as FewRel (Han et al., 2018) for relation classification. While ERC is mainly considered in a fully supervised learning setting, we intend to view it as a few-shot learning sequence labeling task. In this paper, we propose the first few-shot learning approach to ERC using sequence labeling, by adapting Prototypical Networks. We compare our method to the original Prototypical Networks (Snell et al., 2017) and to a variant dedicated to named entity recognition (Fritzler et al., 2019) that is easily applicable to our task.

Data
To both study the behavior of our model in its targeted industrial use-case and allow performance comparison with baselines, we work with two very different corpora: our proprietary live chat customer service dataset, and DailyDialog (Li et al., 2017). In both datasets, messages are labeled with emotions while considering the context of the conversation. However, they vary considerably in their topics and lexical fields: ordinary matters for DailyDialog and railway-related customer service for the live chats. They also vary in the assumptions they make about the speakers: while the topics discussed in DailyDialog imply a sense of proximity, the live chat customer service involves complete strangers with pre-existing emotional states (e.g. the visitor is already stressed due to a refund issue). Both datasets' statistics can be found in Table 1.

DailyDialog
DailyDialog is a dyadic conversation dataset in English whose purpose is to represent casual, everyday interactions between people, in order to facilitate training and sharing of dialog systems. The exchanges in DailyDialog are artificial conversations which are neither dedicated to a specific topic nor task-oriented: they mainly deal with relationships, everyday life, and work. Each utterance corresponds to a speaker turn, and is labeled with one of 7 labels: the 6 basic emotions (anger, disgust, fear, joy, sadness, and surprise) and "no_emotion" denoting the absence of one. The "no_emotion" label represents 80% of the corpus, leading to a very unbalanced dataset with an average length of 8 messages per conversation and a maximum of 35 messages. For this dataset, the inter-annotator agreement reached 78.9%. We choose DailyDialog for comparison and reproducibility purposes, as it is often used for ERC. In this work, we use the train/val/test splits provided by Zhong et al. (2019).

Live chat customer service
Our primary objective is to detect emotions in conversations from a customer service live chat involving a visitor (i.e. the customer looking for help) and an operator (i.e. an employee whose role is to assist the visitor and ensure their satisfaction). The corpus is written in French and is made of 5,000 conversations, from which we annotate a subset of 1,500 conversations, leading to a total of 20,754 messages. Conversations are longer on average than in DailyDialog, with 15.14 messages per conversation. We do not have a way to identify real speaker turns. Indeed, a speaker turn is not necessarily the sequence of contiguous segments from the same speaker, because a time delay between two messages of the same speaker could indicate that the speaker is changing the topic. Because all our messages are separated by very short time differences, we prefer not to infer speaker turns automatically and consider the message as the unit of analysis. This means the conversation context is a sequence of messages instead of a sequence of speaker turns, which could have contained one or more messages artificially glued together.
Two annotators were involved in the process, which unfolded as follows: first, each message is labeled with an emotion. Once all the messages in a conversation have been assigned an emotion label, the conversation is labeled with a visitor satisfaction score (ranging from -3 to 3), and the status of the customer request ("solved", "test_required", "out of scope", or "aborted"). After a preliminary study of the corpus, we identify 10 emotion labels as relevant in this corpus: neutral, surprise, amusement, satisfaction, relief, fear-anxiety-stress, sadness, disappointment, anger, and frustration. Compared to Chowdhury et al. (2016), we consider the satisfaction at the conversation level and we are more precise, with not only positive, neutral, and negative levels, but also 4 additional intermediate levels (from -3 to +3 included).
We also have a higher number of emotions, with 10 emotions instead of 4, including more precise emotions such as relief. In our customer service interface, some messages are automatically prompted for specific actions such as "user x left the chat" or "operator sent a link". We call these "alerts", and they are labeled as "no_emotion". The "neutral" label means that the emotional content of the message, written by a human, has been considered as neutral by the annotator. Figure 1 illustrates the distribution of emotion labels in the Live Chat Customer Service dataset. We can see that the neutral label is the most frequent by a large margin. Cohen's κ is given for 3 label types: 1) the emotions at the message level (κ = 0.65); 2) the visitor's satisfaction at the conversation level (κ = 0.45); and 3) the request's status at the conversation level (κ = 0.46). These scores correspond to substantial agreement at the message level and moderate agreement at the conversation level (Landis and Koch, 1977). Similarly to DailyDialog, the "neutral" label represents 81.5% of the corpus, resulting in another very unbalanced dataset in terms of emotions, as Figure 1 makes obvious. Excluding this label gives a slightly more balanced label set, with "satisfaction" representing 44.9% of the other emotions, and "frustration" 20.8%. To test our hypothesis that the conversational emotion flow can define the overall visitor satisfaction, we calculate the Pearson correlation between the emotions at the message level and the global satisfaction of the visitor at the conversation level. These scores show that the more extreme the emotion, the greater its correlation with the satisfaction score.
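To make the agreement figures easier to interpret, here is a minimal Python sketch of Cohen's κ for two annotators labeling the same items. The function name `cohen_kappa` and the toy labels in the usage note are illustrative assumptions; the sketch is not the tooling used during the annotation campaign.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)
```

For example, calling `cohen_kappa` on two hypothetical annotations of six messages that differ on a single message gives a value well above the raw disagreement rate would suggest being penalized, because κ discounts the agreement expected by chance from the label frequencies.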

Methodology
Formally, our dataset $D$ is comprised of conversations $(C_1, C_2, \ldots, C_{|D|})$, which are in turn made of utterances: $C_i = (u_1, u_2, \ldots, u_{|C_i|})$. To each of these utterances is associated an emotion label, giving a sequence of labels per conversation: $Y_i = (y_1, y_2, \ldots, y_{|C_i|})$. Finally, an utterance is a sequence of words, $u_j = (w^j_1, w^j_2, \ldots, w^j_{|u_j|})$.

Episodic learning
We use the episodic approach (Ravi and Larochelle, 2016), which simulates a context where only a few examples per class are available during training and the model must adapt during testing. This approach fits our need for FSL. The episodic composition is defined by setting the number of classes $N_C$ (ways), the number of examples per class $N_S$ (shots) and the number of elements to label $N_Q$ (queries). In our experiments with DailyDialog, the task is 5-shot 7-way 10-query, and when using our customer service chats, the number of classes changes, making it a 5-shot 11-way 10-query task. In the context of sequential ERC, this means that for each episode we train the model on 5 conversations (i.e. sequences to label) per emotion and apply it to 10 conversations per emotion. We identify a sequence as belonging to the target class set if at least one message is labeled with the target class in the sequence. This means that the number of example messages in each support set $S_k$ of class $k$ can vary (with a minimum of $N_S$ elements), while the number of sequences is fixed.
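The episode composition described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `sample_episode`, the representation of a conversation as its list of message labels, and the choice to keep support and query conversations disjoint within a class are all hypothetical and do not describe the actual implementation.

```python
import random
from typing import Dict, List, Sequence, Tuple

Conversation = List[str]  # assumed representation: one emotion label per message


def sample_episode(
    conversations: Sequence[Conversation],
    classes: Sequence[str],
    n_shots: int = 5,
    n_queries: int = 10,
    seed: int = 0,
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Return (support, query): for each class, indices of sampled conversations."""
    rng = random.Random(seed)
    support: Dict[str, List[int]] = {}
    query: Dict[str, List[int]] = {}
    for k in classes:
        # A conversation belongs to class k if at least one message carries label k.
        candidates = [i for i, labels in enumerate(conversations) if k in labels]
        if len(candidates) < n_shots + n_queries:
            raise ValueError(f"not enough conversations for class {k!r}")
        picked = rng.sample(candidates, n_shots + n_queries)
        support[k] = picked[:n_shots]
        query[k] = picked[n_shots:]
    return support, query
```

Because membership is decided at the conversation level, the number of messages carrying label k in a support set varies from one episode to the next, while the number of conversations per class stays fixed.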

ProtoSeq: Prototypical Networks for Emotion Sequence Labeling
In order to apply FSL to ERC, we choose to base our model on Prototypical Networks (Snell et al., 2017), which create prototypes from the average of the embeddings of the words forming the utterance. Our proposed model, ProtoSeq, builds on this by factoring in conversational context and performing sequence labeling, thus allowing the use of both input and output dependencies when applying FSL to ERC. ProtoSeq is divided into four main components, applied at each consecutive level of granularity of the data.
Utterance Encoder: Similar to the encoder of the Prototypical Networks, our utterance encoder $f_u$ reduces the utterance $u_i$ to a single vector, $f_u(u_i)$. The architecture of our encoder is based on the Convolutional Neural Network (CNN) described by Kim (2014), which passes tokens through different convolution filters and merges the representations through max-over-time pooling.
Context Encoder: After applying a non-linear activation (ReLU), we use a bi-directional LSTM layer (BiLSTM) (Huang et al., 2015) to integrate contextual information from the conversation, thus following the trend initiated by Poria et al. (2017) in ERC to use a recurrent context encoder. We obtain contextual utterance representations $v_j$, one for each utterance $u_j$ of the conversation. As we work in a few-shot learning setting, we try not to make our model overly complex, hence we do not add a transformer-based global context encoder on top of the BiLSTM.
Prototypes Creation: We feed the output of the context encoder to a multi-layer perceptron made of 2 fully connected layers with dropout and ReLU. The resulting representations are then used to create prototypes $c$:
$$c_k = \frac{1}{|S_k|} \sum_{u_j \in S_k} \mathrm{MLP}(v_j)$$
for the class $k \in \{1, \ldots, N_C\}$, where $N_C$ is the number of classes, and MLP refers to Multi-Layer Perceptron.
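As a minimal sketch of this averaging step, the Python function below builds one prototype per class from already-encoded support vectors. The name `build_prototypes` and the plain-list vector type are assumptions for illustration; the vectors stand in for the MLP outputs, which are not computed here.

```python
from typing import Dict, List

Vector = List[float]


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Average the support representations of each class into one prototype."""
    prototypes: Dict[str, Vector] = {}
    for label, vectors in support.items():
        if not vectors:
            raise ValueError(f"class {label!r} has no support vector")
        count = len(vectors)
        # zip(*vectors) iterates over dimensions, so each sum is one coordinate.
        prototypes[label] = [sum(dim) / count for dim in zip(*vectors)]
    return prototypes
```

Since the number of messages carrying a given label varies across episodes, each class is averaged over its own count rather than over a fixed number of shots.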
Sequence Prediction: We compute the Euclidean distance $d$ from the contextual representation of the utterance to each class prototype. The predicted label $\hat{y}_j$ for each utterance $u_j$ is the class corresponding to its closest prototype:
$$\hat{y}_j = \arg\min_{k} \, d(\mathrm{MLP}(v_j), c_k)$$
To allow our model to consider dependencies between the labels, we add a final CRF layer on top of label prediction, the emission scores being the Euclidean distances for each utterance. Overall, our model is a variation of the traditional BiLSTM-CRF model, based on Prototypical Networks.
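The interplay between prototype distances and label dependencies can be illustrated with a short Python sketch of sequence decoding. Several choices here are assumptions made for the example: the function names `emission_scores` and `viterbi`, the use of negative distances so that a higher score is better, and the hand-written transition table standing in for learned CRF transition scores.

```python
import math
from typing import Dict, List

Vector = List[float]


def emission_scores(v: Vector, prototypes: Dict[str, Vector]) -> Dict[str, float]:
    """Score each label by the negative Euclidean distance to its prototype (assumption)."""
    return {label: -math.dist(v, c) for label, c in prototypes.items()}


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the label sequence with the highest summed emission and transition score."""
    labels = list(emissions[0])
    # best[t][y]: best score of any path ending with label y at position t.
    best = [dict(emissions[0])]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With all transitions set to zero, the decoder reduces to picking the closest prototype for each utterance independently; non-zero transitions let a neighboring label override a weak local preference.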

Experimental protocol
We follow the setting used by Bao et al. (2020) by considering a training epoch as a set of 100 random episodes from the training set, and applying a validation step made of 100 random episodes from the validation set after each epoch. We test our model using 1,000 random episodes from the test set. The maximum number of epochs is set to 1,000, but when the micro F1-score does not improve for 100 consecutive epochs, we stop the training and reload the best model's weights. We use the Adam optimizer (Kingma and Ba, 2017; Loshchilov and Hutter, 2019) to train the model while maximizing the log-likelihood of the correct emotion sequences in the query set $Q_k$:
$$L = \sum_{C \in Q_k} \sum_{j=1}^{|C|} \log p(\hat{y}_j \mid u_j, C)$$
During inference, we apply the Viterbi algorithm to output the best-scoring sequence of labels. We do not truncate utterances or conversations to a maximum length. To obtain an initial token representation, we use pre-trained FastText (Bojanowski et al., 2017) embeddings from Wiki News for English (DailyDialog), and from Common Crawl for French (customer service live chats). Both sets of embeddings are of dimension 300 and both datasets are tokenized with NLTK. We choose our hyperparameters using a very targeted grid search for the learning rate (set to 1e-3 for all the experiments) and manual tuning for the other parameters. In the following, we experiment with several variants of our model, each having dedicated hyper-parameters.
• ProtoSeq: We use hyper-parameters from Kim (2014) for the CNN: 50 filters with 3 different window sizes (3, 4 and 5). We use one BiLSTM layer with 150 hidden units per direction, so that the concatenation of the two directions matches the 300 dimensions of the inputs.
• ProtoSeq-CNN: A lighter version of our model, without the BiLSTM context-encoder. The CNN configuration follows the same parameters from Kim (2014).
• ProtoSeq-Tr: A ProtoSeq with a 2-layer Transformer-based utterance encoder with 4 attention heads and a hidden size of 300. The global dropout is set to 0.2 while the position encoder dropout is set to 0.1.
• ProtoSeq-AVG: A ProtoSeq where the utterance encoder is just an average of the token representations. However, it should be noted that the averaging process excludes the padding elements in the utterances.
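To illustrate the padding-aware averaging mentioned for ProtoSeq-AVG, here is a minimal Python sketch. The function name `masked_average` and the boolean padding mask are assumptions for this example; an actual implementation would typically operate on batched tensors.

```python
from typing import List, Sequence


def masked_average(
    token_vectors: Sequence[Sequence[float]],
    is_padding: Sequence[bool],
) -> List[float]:
    """Average token vectors while skipping the positions flagged as padding."""
    kept = [vec for vec, pad in zip(token_vectors, is_padding) if not pad]
    if not kept:
        raise ValueError("utterance contains only padding")
    # Divide by the number of real tokens, not by the padded length.
    return [sum(dim) / len(kept) for dim in zip(*kept)]
```

Dividing by the padded length instead would shrink the representation of short utterances toward zero, which would distort the distances to the class prototypes.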

Results
Tables 3 and 4 show the performance of the model using the micro F1-score. We use the protocol usually followed by the literature and do not take into account the majority class "no_emotion" as it represents 80% of the DailyDialog corpus. This allows performance comparison with related work on ERC through supervised learning. We do the same for the Live Chat Customer Service corpus by ignoring the "neutral" label.
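The evaluation protocol described above can be sketched in Python as a micro F1-score restricted to the labels of interest. The function name `micro_f1` and its `ignored` parameter are hypothetical, and the counting convention (a wrong prediction counts as a false positive only if the predicted label is evaluated, and as a false negative only if the gold label is evaluated) is an assumption about how the majority class is left out.

```python
from typing import Iterable, Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str], ignored: Iterable[str] = ()) -> float:
    """Micro-averaged F1 over all labels except those listed in `ignored`."""
    skip = set(ignored)
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g not in skip:
                tp += 1
        else:
            if p not in skip:
                fp += 1  # an evaluated label was predicted wrongly
            if g not in skip:
                fn += 1  # an evaluated gold label was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

When no label is ignored, every error counts once as a false positive and once as a false negative, so the score equals plain accuracy; ignoring the majority label removes the many easy matches that would otherwise dominate it.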
Comparison to supervised learning. DailyDialog is used to compare our FSL approach with recent supervised learning results on ERC. As expected, our best FSL model, ProtoSeq, yields lower performance than supervised approaches. The latter presuppose the availability of a sufficiently large amount of annotated data, and their performance thus represents the upper bound of the expected results. More precisely, we focus on the difference between ProtoSeq and a state-of-the-art supervised model, CESTa, which is computation-heavy. Indeed, CESTa is a contextualized emotion sequence tagging model which fuses a global context encoder, combining a transformer and a BiLSTM, with a recurrent individual context encoder before feeding a CRF layer. CESTa achieves a 63% micro F1-score in a fully supervised learning approach. ProtoSeq, much lighter, achieves a 31% micro F1-score, demonstrating the potential of FSL for sequence labeling when available data is scarce, especially given that many supervised approaches obtain F1-scores around 50%. When using the Live Chat Customer Service dataset, we only change the initial embeddings from English to French, and apply the two best models according to Table 3: CESTa and KET (Zhong et al., 2019). The CESTa implementation yielded inconclusive results, which is why we present the KET results on our specific corpus in Table 4. KET relies on ConceptNet (Speer et al., 2017), a multilingual knowledge base. Thus, we only switch from GloVe embeddings (Pennington et al., 2014) to French FastText ones in order to ensure comparison with our ProtoSeq model. As expected, performance is lower on the Live Chat Customer Service corpus.
Few-shot learning baselines. We consider two baselines. We apply the original Prototypical Networks (Snell et al., 2017), only retrieving the labels using the Euclidean distance to class prototypes. We also apply WarmProto-CRF (Fritzler et al., 2019), a variant of Prototypical Networks designed for sequence labeling by integrating CRFs. We implement it without including the bias they created for the O label in the BIO sequence labeling task. This method uses a BiLSTM utterance encoder to further compute the prototypes with the Euclidean distance.
Few-shot learning on DailyDialog. The bottom section of Table 3 shows the FSL results. All these models are trained in an episodic fashion, with the same episode constitution (5-shot 7-way 10-query). We can see the micro F1-score is very low, at only 16.43%. By considering a ProtoSeq using only a CNN-based utterance encoder (ProtoSeq-CNN) or an utterance encoder based on a 2-layer, 4-head Transformer (ProtoSeq-Tr), we can see the score improve. The addition of the BiLSTM context encoder enables the model to capture more information: these variants show the importance of integrating a context encoder in the model.

Few-shot learning on Customer Service Live Chats
We also apply this approach to the Customer Service Live Chats, further motivated by the high annotation cost and the fact that supervised approaches on clean data such as DailyDialog did not achieve an acceptable score for this use case (acceptable scores starting from 70% in micro F1-score). Besides, new conversations with evolving contents (e.g., due to the evolution of company services) are created every day. As a consequence, even an ideally annotated training corpus would become obsolete at some point. This FSL prediction leads to lower scores, but with the same hierarchy among variants. ProtoSeq, using a BiLSTM context encoder, again yields the best scores. The higher number of classes (11 classes including 9 emotions, versus 7 classes including 6 emotions) may explain the overall lower numbers we observe here, compared to those we obtain on DailyDialog.
Artificial versus Real Data. DailyDialog is an artificial corpus which follows standard, idealized conversations. We can see that ERC performance is quite sensitive to the conversation length, which seems to confirm conclusions drawn in recent literature. Customer Service Live Chats being real use-case data, their length varies a lot, ranging from 2 to 85 messages (whereas conversations from DailyDialog range from 2 to 35 messages). However, ERC also seems to be impacted by the textual content of the utterances, as our data contains a lot of spelling mistakes, shortcuts, or slurs. More importantly, the visitor may often use several small messages rather than only one to transmit information; this flow may be interrupted by a message from the operator, making it impossible to detect the whole set of messages as an utterance. This is specific to online instant conversations, where speakers do not necessarily wait for the complete message to be written or sent by the addressee. By contrast, DailyDialog is made of clean and perfect exchanges, where one waits for the other to send the answer. Here is an example with the following clean conversation subset from DailyDialog.
A: Does your family have a record of your ancestors?
B: Sure. My mom has been working on our family tree for years.

An exchange would often look as follows in real data from instant chat:

Operator: Did you make the simulation using the promo code?
Visitor: I did it 5 minutes ago
Operator: Ok, you have to wait 30min
Visitor: but as said before, I didn't finished the "simulation" because I had to pay a 10€ ticket even th
Visitor: ....even though the right one is 11.5 €
Operator: And the code will be available again

Moreover, specific lexical fields, relevant to the customer service being provided, can also make the task more difficult for the model.
Quantifying the impact of the CRF layer. Our model benefits from the addition of a final CRF layer to compute the best possible output sequence. This allows the model to generalize faster and to achieve a higher score despite the few examples. However, prediction stability decreases, as shown by the standard deviation across episodes in Table 5. The drop in performance when omitting the CRF layer may be due to the label dependency it emphasizes. Indeed, without the CRF, label dependency can only be inferred from the BiLSTM context encoder. The CRF layer accentuates in-episode label dependency by allowing the prediction to be further adapted to the conversation context for each query conversation.
Tables 6 and 7 show additional information on ProtoSeq's performance for each label. These tables present averaged scores from all episodes' query sets. We can see the predictions differ a lot depending on the target label. When applied to DailyDialog, the model has no difficulty in detecting the absence of emotion. This is to be expected, as this label accounts for most of the messages in the conversations. However, the prediction scores for emotion labels are imbalanced, with recall scores higher than precision on both datasets. On DailyDialog, the anger and sadness labels clearly hinder the overall prediction. On the Customer Service Live Chats, however, Table 7 shows very poor predictions for the disappointment (translated from the French "déception" label) and fear labels. In this dataset, precision seems to be the main issue, with only the frustration and satisfaction labels being somewhat correctly labeled. Given the model and the task, the detailed results obtained on both datasets show that the evaluation may benefit from reporting the macro F1-score along with the micro F1-score.
Indeed, be it DailyDialog or Customer Service Live Chats, the multi-class prediction of sequence tagging is really sparse, and thus leads to imbalanced prediction, even while using an episodic strategy.
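To illustrate why a macro F1-score is a useful complement under such imbalance, the following Python sketch computes per-label precision, recall and F1, then averages the F1 values without weighting. The function names `per_label_scores` and `macro_f1` and the toy labels in the usage note are illustrative assumptions, not the evaluation code behind Tables 6 and 7.

```python
from typing import Dict, Sequence, Tuple


def per_label_scores(
    gold: Sequence[str], pred: Sequence[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for every label seen in gold or pred."""
    scores: Dict[str, Tuple[float, float, float]] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_scores(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)
```

On a toy set of eight "joy" and two "fear" messages where everything is predicted as "joy", the micro score equals the 80% accuracy, while `macro_f1` drops to about 44% because the missed "fear" label weighs as much as "joy".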

Emotion Predictions
Moreover, the gap between results on DailyDialog and the ones on the Customer Service Live Chats confirms the necessity for ERC-related studies to focus on real conversation datasets whenever possible.

Table 7: Additional results on customer service live chats with our ProtoSeq prediction. We define the "fear" label as "fear/anxiety/stress". "no emotion" is only used for automatic chat prompts.

Limitations
While the ProtoSeq model seems to be suitable for FSL in ERC, it still has inherent limitations related to its architecture. ProtoSeq uses a CRF as its final layer, leading to a sequence labeling optimizer that does not take the order into account. While this yields better performance, it does not guarantee that the order information retrieved from the context encoder is wisely used, especially since we use the euclidean distances to class prototypes as emission scores for the sequence labeling. An ordered-prediction approach may allow the model to better assist operators in real-time during their decision process.
Another limitation of our model is that it may be difficult to adapt to changes in the context in which customer service is provided. Indeed, the type of service or the platform used may lead to lexical field changes or very different emotional states for the incoming visitors.

Conclusion
In this paper, we presented the first study on emotion recognition in conversations using few-shot learning. We proposed a variant of Prototypical Networks that treats emotion recognition as a sequence labeling task while allowing fast convergence. When compared to other prototypical networks for few-shot sequence labeling, our model obtained higher scores on both DailyDialog and Customer Service Live Chats. Through this work, we showed that few-shot learning is possible for this task even though it is still difficult to achieve the same performance as supervised learning approaches. This study also shows the challenges that remain when tackling in-the-wild data collected in the context of a real application.
Future work will be dedicated to the improvement of the current few-shot ERC approach by adding unlabeled elements in the support set and by investigating the addition of external business knowledge to such an approach.

Figure 3 presents the Pearson correlation scores between the visitor's emotions and satisfaction for the Customer Service Live Chats. While emotions are labeled for each utterance in a conversation, satisfaction is a global label for the whole conversation. This figure shows that the correlation scores are higher when the emotion is extreme within a given polarity. For instance, anger is more strongly correlated with a negative satisfaction score (vsent -3) than fear or disappointment, while "Satisfaction" is more correlated with a positive overall satisfaction score (vsent +3) than "Amusement" or "Relief" are with intermediate satisfaction scores (vsent_1 or vsent_2).
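For reference, the Pearson correlation coefficient itself can be sketched in Python as below. How message-level emotions are aggregated before being correlated with the conversation-level satisfaction is not detailed here; the usage note assumes, purely for illustration, one binary indicator per conversation for the presence of an emotion and one for a given satisfaction level. The function name `pearson` is also an assumption.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of the same length (at least 2 values)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)
```

Under the stated assumption, `x` could hold 1 for each conversation containing at least one "anger" message and 0 otherwise, and `y` could hold 1 for each conversation annotated with the lowest satisfaction level and 0 otherwise.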