A Large-Scale Dataset for Empathetic Response Generation

Recent development in NLP shows a strong trend towards refining pre-trained models with a domain-specific dataset. This is especially the case for response generation where emotion plays an important role. However, existing empathetic datasets remain small, delaying research efforts in this area, for example, the development of emotion-aware chatbots. One main technical challenge has been the cost of manually annotating dialogues with the right emotion labels. In this paper, we describe a large-scale silver dataset consisting of 1M dialogues annotated with 32 fine-grained emotions, eight empathetic response intents, and the Neutral category. To achieve this goal, we have developed a novel data curation pipeline starting with a small seed of manually annotated data and eventually scaling it to a satisfactory size. We compare its quality against a state-of-the-art gold dataset using both offline experiments and visual validation methods. The resultant procedure can be used to create similar datasets in the same domain as well as in other domains.


Introduction
Researchers increasingly refine pre-trained language models with domain-specific datasets to accomplish particular tasks (Devlin et al., 2019; Liu et al., 2019; Rashkin et al., 2018). One such area is the development of empathetic conversational agents that can understand human emotions and respond appropriately. The aim of the empathetic response generation task is to generate syntactically correct, contextually relevant, and, more importantly, emotionally appropriate responses to the preceding dialogue turns. Such tasks require large dialogue datasets in which each utterance is annotated with the correct intents and emotions. Though many such datasets have been developed in the past (Busso et al., 2008; Poria et al., 2019; Li et al., 2017; Rashkin et al., 2018), they are limited in size because of the cost of manual labor and are thus insufficient to train robust conversational agents. Since collecting and manually annotating such gold standard data is expensive, replacing it with automatically annotated silver standard data has attracted growing interest (Filannino and Di Bari, 2015). We show how such a large-scale silver standard dataset with sufficient quality can be curated and used to fine-tune pre-trained language models for the generation of empathetic responses.
Emotions revealed in social chitchat are rather complex. There are many categories of emotion to distinguish because of the subtle variations present in human emotion. For example, Sadness and Disappointment are pursued and dealt with differently in human conversations even though both are negative emotions. Also, the listener's reaction to emotion is not always a straightforward mirroring of the speaker's emotion. Rather, it can be more neutral and convey a specific intent, as is evident from the dialogue example in Table 1. Welivita and Pu (2020) have analyzed listener responses in the EmpatheticDialogues dataset (Rashkin et al., 2018) and discovered eight listener-specific empathetic response intents contained in emotional dialogues: Questioning; Agreeing; Acknowledging; Sympathizing; Encouraging; Consoling; Suggesting; and Wishing. They have annotated the EmpatheticDialogues dataset with 32 fine-grained emotions, eight empathetic response intents, and the Neutral category, and discovered frequent emotion-intent exchange patterns in empathetic conversations. They observe that this type of dataset, tagged with fine-grained emotions and intents, can be used to train neural chatbots to generate empathetically appropriate responses. For this purpose, however, a large-scale emotion- and intent-labeled dataset is even more desirable. Curating such a dataset is technically challenging since 1) annotating such a large-scale dataset requires costly human labor, and 2) given the fine granularity of the emotion and intent labels, the human labeling task is more difficult and error-prone than labeling with the more coarse-grained Angry-Happy-Sad emotion categories. As a result, existing manually labeled emotional dialogue datasets such as IEMOCAP (Busso et al., 2008), MELD (Poria et al., 2019), and DailyDialogue (Li et al., 2017) are smaller in scale and contain only a limited set of emotions (emotions derived from basic emotion models such as Ekman's).
Most importantly, existing datasets fail to distinguish between Neutral and Questioning, or any of the other eight empathetic response intents. They combine everything into a big label Neutral or Other when the utterance is not emotional. But Questioning, Agreeing, Acknowledging, Sympathizing, Encouraging, Consoling, Suggesting, and Wishing are important details in constructing empathetic dialogues. These eight response intents, which we call the plus categories, are novel in our work and contribute to the model's learning of important response patterns in the data.
To fill the above gap, we curate a novel large-scale silver dialogue dataset, EDOS (Emotional Dialogues in OpenSubtitles), containing 1M emotional dialogues from movie subtitles, in which each dialogue turn is automatically annotated with one of 32 fine-grained emotions, eight plus categories, or the Neutral category. Movie subtitles have been used extensively for emotion analysis in text in both earlier and recent research (Kayhani et al., 2020; Merdivan et al., 2020; Giannakopoulos et al., 2009). The Nature article "How movies mirror our mimicry" (Ball, 2011) states that "screenwriters mine everyday discourse to make dialogues appear authentic" and that "audiences use language devices in movies to shape their own discourse". Hence, movie subtitles can be one of the major sources for training chatbots and for learning emotional variations and the corresponding response strategies in dialogues. To reduce the cost of human labeling and the complexity of labeling dialogues with fine-grained emotions and intents, we devised a semi-automated human computation task to collect fine-grained emotion and intent labels for a small set of movie dialogues (9K). We then applied automatic data augmentation techniques to expand the labeled data and trained a dialogue emotion classifier to automatically annotate 1M emotional dialogues. The process of curating the dataset involved several stages. First, we applied automatic turn and dialogue segmentation, data cleaning, and duplicate removal to the movie subtitles in the OpenSubtitles (OS) corpus (Lison et al., 2019) and obtained close to 4M dialogues. Then, we applied a weak labeler (a BERT-based sentence-level classifier) trained on the EmpatheticDialogues dataset (Rashkin et al., 2018) to label utterances in OS dialogues and filtered 1M emotional dialogues (EDOS initial).
Thereafter, we applied data augmentation techniques to a small set of human-annotated data and used the manually annotated and extended labels to train a strong labeler, which we used to annotate the dialogues in EDOS initial and obtain the final 1M EDOS dataset. We evaluated the quality of the resultant dataset by comparing it against the EmpatheticDialogues dataset by means of offline experiments and visual validation methods. Figure 1 summarizes the process of creating EDOS. The data curation pipeline we followed substantially reduced the cost of human labor while ensuring quality annotations. Our contributions in this paper are three-fold. 1) We curate a large-scale dialogue dataset, EDOS, containing 1M emotional dialogues labeled with 32 fine-grained emotions, eight empathetic response intents (the plus categories), and Neutral. Compared to existing dialogue datasets tagged with emotions, EDOS is significantly larger (≈ 40 times larger than EmpatheticDialogues) and contains more fine-grained emotions and empathetic response strategies. 2) We outline the complex pipeline used to derive this dataset. 3) We evaluate the quality of the dataset against a state-of-the-art gold standard dataset using offline experiments and visual validation methods.

Literature review
IEMOCAP (Busso et al., 2008), MELD (Poria et al., 2019), DailyDialogue (Li et al., 2017), EmotionLines (Hsu et al., 2018), and EmoContext (Chatterjee et al., 2019) are some existing state-of-the-art dialogue datasets with emotion labels. However, these datasets are limited in size and are labeled with only a small set of emotions without any response strategies. Table 2 shows a summary of the size and the labels in these datasets. All the datasets compared here are in the English language. Herzig et al. (2016) detected customer emotions and agent emotional techniques (e.g., Apology, Empathy) in customer support dialogues. They curated a dialogue dataset from two customer support Twitter accounts and manually annotated the customer turns with one of 9 emotions and the agent turns with one of 4 emotional techniques. But emotions expressed by customers in social media service dialogues are mainly negative (e.g., anger, frustration), and the customer service agents also respond in a restricted manner, which limits the utility of this dataset, in addition to its small size.
The EmpatheticDialogues dataset (Rashkin et al., 2018) contains 25K open-domain dialogues grounded on 32 emotions. The 32 emotions range from basic emotions derived from biological responses (Ekman, 1992; Plutchik, 1984) to larger sets of subtle emotions derived from contextual situations (Skerry and Saxe, 2015). Welivita and Pu (2020) manually analyzed a subset of the listener turns in EmpatheticDialogues and identified eight listener-specific response intents. They developed a sentence-level weak labeler, with which they annotated the entire dataset with 32 emotions, eight empathetic response intents, and the Neutral category. However, due to the limited size of EmpatheticDialogues, it is difficult to use for data-intensive applications. To address the above limitations, we curate EDOS, containing 1M movie dialogues. We label each dialogue turn with one of 32 emotions, eight empathetic response intents, or Neutral using our own dialogue emotion and intent classifier. Table 2 compares EDOS to state-of-the-art emotion-annotated dialogue datasets.

Methodology
This section describes the dialogue selection process, the design of the human annotation task, the data augmentation techniques used to expand human-labeled dialogues, and the development of a strong labeler to annotate the dataset.

Dialogue curation from movie subtitles
The OpenSubtitles 2018 corpus consists of 3.7M movie and TV subtitles. It comprises 3.4B sentences and 22.2B tokens. It is an excellent source to learn emotional variations in dialogue and corresponding response mechanisms. But due to the absence of speaker markers, movie subtitles do not contain an explicit dialogue turn structure (who speaks what) and specific indicators where one dialogue ends and the next dialogue begins. To overcome the first issue, we reproduced the work by Lison and Meena (2016) to build an SVM-based classifier that determines if two consecutive sentences are part of the same dialogue turn. Our classifier achieved a segmentation accuracy of 76.69%, which is close to the accuracy of 78% that the authors claim. The set of features that gave the best turn segmentation accuracy are: 1) unigram and bi-gram features of adjacent sentences after lemmatization; 2) first and final tokens of adjacent sentences; 3) first and final bi-grams of adjacent sentences; 4) whether the two sentences belong to the same subtitle block or not (boolean); 5) genre of the movie (Drama, Crime, Musical etc.); 6) sentence density of the subtitles file (no. of sentences/subtitle duration); and 7) quadratic combinations of the above features with itself and the rest.
After performing turn segmentation on the OpenSubtitles corpus, we divided the turns into separate dialogues based on a simple heuristic. If the difference between the end time of the previous turn and the start time of the current turn is more than 5 seconds, we take these two turns as belonging to two different dialogues. An exception occurs if the timestamp information is missing in at least one of the turns. In this case, we assume that the two turns appear in the same subtitle block and consider them as belonging to the same dialogue. In this way, we formed 9M dialogues from the OpenSubtitles corpus altogether. The choice of 5 seconds to separate dialogues is explained in Appendix C.
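The time-gap heuristic above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Turn` tuple layout and the function name `split_into_dialogues` are hypothetical, and timestamps are assumed to be given in seconds, with `None` marking missing values.

```python
from typing import List, Optional, Tuple

# Hypothetical turn layout: (text, start_seconds, end_seconds).
# A timestamp is None when the subtitle file does not provide it.
Turn = Tuple[str, Optional[float], Optional[float]]


def split_into_dialogues(turns: List[Turn], max_gap: float = 5.0) -> List[List[Turn]]:
    """Group consecutive turns into dialogues using the time-gap heuristic."""
    dialogues: List[List[Turn]] = []
    for turn in turns:
        if not dialogues:
            dialogues.append([turn])
            continue
        prev_end = dialogues[-1][-1][2]
        cur_start = turn[1]
        # Missing timestamp on either side: treat both turns as part of
        # the same subtitle block, and therefore the same dialogue.
        if prev_end is None or cur_start is None:
            dialogues[-1].append(turn)
        # A gap of more than max_gap seconds starts a new dialogue.
        elif cur_start - prev_end > max_gap:
            dialogues.append([turn])
        else:
            dialogues[-1].append(turn)
    return dialogues
```

Note that a gap of exactly 5 seconds keeps the turns together, since the text separates dialogues only when the gap is "more than 5 seconds".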
To further clean the dialogues, we removed character names, the repetitive dialogue turns, turns that start with "previous on..." (monologue at the beginning of TV episodes), turns with character length less than 2 or greater than 100, turns with an alphabetic proportion less than 60%, and turns with a lot of repetitive tokens. When a dialogue turn was removed, all the turns following that turn were also removed from the dialogue to maintain consistency. After that, all the dialogues left with only one turn were removed from the corpus. We removed dialogues from movies of the genre 'Documentary' since they do not correspond to actual dialogues. This resulted in a cleaned OS dialogue dataset consisting of 4M dialogues.
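A minimal sketch of the length and alphabetic-proportion filters, together with the rule that a removed turn also removes every turn after it, is given below. It covers only part of the cleaning described above (character names, repeated turns, and genre filtering are omitted). The function names are hypothetical, and computing the alphabetic proportion over all characters, including spaces, is an assumption.

```python
from typing import List


def keep_turn(text: str, min_len: int = 2, max_len: int = 100,
              min_alpha: float = 0.6) -> bool:
    """Return True if a turn passes the length and alphabetic-proportion filters."""
    n = len(text)
    if n == 0 or n < min_len or n > max_len:
        return False
    # Assumption: proportion of alphabetic characters over all characters.
    alpha_ratio = sum(ch.isalpha() for ch in text) / n
    return alpha_ratio >= min_alpha


def clean_dialogue(turns: List[str]) -> List[str]:
    """Truncate a dialogue at the first rejected turn; drop it if one turn or fewer remains."""
    kept: List[str] = []
    for turn in turns:
        if not keep_turn(turn):
            break  # the rejected turn and all following turns are removed
        kept.append(turn)
    return kept if len(kept) > 1 else []
```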
To select dialogues containing emotional statements and empathetic responses from the cleaned OS dialogues dataset, we employed a weak labeler (a BERT transformer-based sentence-level classifier) trained on 25K situation descriptions from EmpatheticDialogues (Rashkin et al., 2018) tagged with 32 emotion classes, and 7K listener utterances tagged with eight empathetic response intents and the Neutral category (Welivita and Pu, 2020). The classifier had a high top-1 classification accuracy of 65.88%. We call it a weak labeler since it predicts emotion or intent only at the sentence level and is trained on a dataset other than OS. We selected the 1M dialogues with the highest label confidence as predicted by this classifier to form the 1M EDOS (initial) dataset. The statistics of the EDOS dataset are given in Table 3. More detailed statistics, including the number of dialogues per emotion, are included in Appendix D.

Human computation
To train a dialogue emotion classifier that can identify both fine-grained emotions and empathetic response intents, we devised an Amazon Mechanical Turk (AMT) experiment to collect an initial set of ground truth labels for OS dialogues. But annotating dialogue turns with one of 41 labels is a daunting task. To make the task less exhausting, we devised a semi-automated approach using our weak labeler. By applying the weak labeler to each turn of the cleaned OS dialogue dataset, we selected the turns having prediction confidence ≥ 0.9, along with their dialogue history. Next, we ranked these dialogues according to their readability and selected the most readable dialogues from each class to be labeled. This reduces the time workers spend reading long and complicated dialogues. The steps followed in computing the readability of dialogues are included in Appendix A. Workers had to select a label from the top-3 predictions made by the weak labeler. If none of the top-3 predictions matched, they could manually specify the correct class. The main purpose of incorporating a weak labeler here was to make the task less daunting for the crowd worker. Otherwise, having to choose one label out of 41 may lead to even worse results due to the complicated nature of the task. The risk of reduced data reliability is mitigated by keeping only the labels that received a majority vote. The AMT task's user interface design is included in Appendix B.
After ranking the dialogues according to readability, we selected the top 250 dialogues in each category for the AMT task. We bundled 15 dialogues in a HIT with 5 quiz questions that served as checkpoints to evaluate the crowd workers' quality. Situation descriptions from the EmpatheticDialogues dataset, for which we already knew the emotion labels, were used to formulate the quiz questions. Finally, we kept the dialogues on which at least 2 out of 3 workers agreed, which resulted in 8,913 dialogues altogether. Table 4 shows the results of the AMT task.
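The 2-out-of-3 agreement rule can be illustrated with a short helper. This is only a sketch: the function name `majority_label` is hypothetical, and returning `None` for dialogues without agreement is an assumption about how discarded items are represented.

```python
from collections import Counter
from typing import List, Optional


def majority_label(labels: List[str], min_agree: int = 2) -> Optional[str]:
    """Return the label chosen by at least min_agree workers, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None
```

For example, `majority_label(["Sad", "Sad", "Angry"])` yields `"Sad"`, while three different labels yield `None` and the dialogue would be discarded.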

Data augmentation and annotation
To scale up the training data obtained from the AMT task, we utilized a distant learning technique using dialogue embeddings (Reimers and Gurevych, 2019) and self-labeling (Triguero et al., 2015), a semi-supervised learning technique. The first approach relies on Sentence-BERT (SBERT), proposed by Reimers and Gurevych (2019), which uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Using this approach, we obtained dialogues semantically similar to those annotated by crowd workers and tagged them with the same class label. Among the several models the authors have proposed, we used the roberta-base-nli-stsb-mean-tokens model, fine-tuned on the NLI (Bowman et al., 2015) and STS benchmark (STSb) (Cer et al., 2017) datasets, since it has reported a high Spearman's rank correlation of 84.79 ± 0.38 between the cosine similarity of the sentence embeddings and the gold labels in the STS benchmark test set, outperforming the existing state-of-the-art. It is also more efficient to use than roberta-large. Before proceeding, we left out 20% of the crowd-annotated dialogues, balanced across all class labels, as testing data. Then, we took the following steps to extend the rest of the dialogues using SBERT.
1) Using the SBERT model, we first computed dialogue turn embeddings (each a 768-dimensional vector) for all the turns (≈19M) in the cleaned OS dataset. 2) Then, we calculated dialogue embeddings for the human-annotated and unlabeled dialogues from the cleaned OS dialogues dataset. For this, we applied a decaying weight starting from the last turn and took the weighted average of the turn embeddings of each dialogue. We used half decaying, i.e., if we have a dialogue with turn embeddings $v_1$, $v_2$, and $v_3$, the final dialogue embedding would be $\frac{4}{7}v_3 + \frac{2}{7}v_2 + \frac{1}{7}v_1$. 3) Next, we calculated the cosine similarity between annotated and unlabeled dialogue embeddings and ranked the results. 4) Finally, we applied a similarity threshold, obtained all the unlabeled dialogues with a cosine similarity exceeding this threshold, and tagged them with the same crowd-annotated class label. Here, we used a threshold of 0.92 after manually inspecting a random subset of the results obtained for a range of thresholds (examples from this stage are shown in Appendix C).
We extended the original crowd-annotated dialogue dataset by 3,196 more dialogues with distantly annotated class labels using the above method. Thereafter, using the crowd-annotated and extended labels, we trained an initial classifier, used it to annotate the rest of the dialogues, and added to our dataset the labels that had annotation confidence over 0.9. This method is termed self-labeling (Triguero et al., 2015), a semi-supervised learning technique that can be used to grow labeled data. With this, we were able to extend the labeled data by 4,100 more dialogues. Next, we again applied SBERT over the self-labeled data and extended them by 2,118 more dialogues. In the end, we had ≈ 14K labeled dialogues altogether. We used this data to train a final dialogue emotion classifier to annotate the rest of the unlabeled data. This resulted in a classifier with precision 64.11%, recall 64.59%, macro F1-score 63.86%, and accuracy 65.00%, which is comparable with the state-of-the-art dialogue emotion classifiers (as shown in Table 5). The design of the dialogue emotion classifier we utilized to annotate the dataset is explained in Section 3.3.1.
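Steps 2 to 4 above can be sketched in plain Python as follows, assuming the turn embeddings have already been computed by an SBERT model. The function names are hypothetical, and two details are assumptions not stated in the text: a similarity equal to the threshold counts as a match, and an unlabeled dialogue that matches several annotated dialogues takes the label of the most similar one.

```python
import math
from typing import Dict, List, Sequence, Tuple


def dialogue_embedding(turn_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Weighted average of turn embeddings; the weight halves for each step back in time."""
    n = len(turn_embeddings)
    weights = [2.0 ** i for i in range(n)]  # 1, 2, 4, ... (last turn weighs most)
    total = sum(weights)
    dim = len(turn_embeddings[0])
    return [
        sum(w * v[d] for w, v in zip(weights, turn_embeddings)) / total
        for d in range(dim)
    ]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propagate_labels(
    labeled: Sequence[Tuple[Sequence[float], str]],
    unlabeled: Sequence[Sequence[float]],
    threshold: float = 0.92,
) -> Dict[int, str]:
    """Map the index of each unlabeled dialogue to the label of its best match."""
    assigned: Dict[int, str] = {}
    for idx, emb in enumerate(unlabeled):
        best_sim, best_label = -1.0, None
        for ref_emb, label in labeled:
            sim = cosine_similarity(emb, ref_emb)
            if sim > best_sim:
                best_sim, best_label = sim, label
        # Assumption: keep only the most similar annotated dialogue.
        if best_label is not None and best_sim >= threshold:
            assigned[idx] = best_label
    return assigned
```

For three turns the weights are 1, 2, and 4, which normalize to 1/7, 2/7, and 4/7 and reproduce the worked example in step 2. At the scale of ≈19M turns, an exhaustive pairwise loop like this would be replaced by batched matrix operations or an approximate nearest-neighbor index.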

Design of the dialogue emotion classifier
Our dialogue emotion classifier consists of a representation network that uses the BERT architecture, an attention layer that aggregates all hidden states at each time step, a hidden layer, and a softmax layer. We used the BERT-base architecture with 12 layers, 768 dimensions, 12 heads, and 110M parameters as the representation network. It was initialized with weights from RoBERTa (Liu et al., 2019). We fed a dialogue turn, along with its preceding context in reverse order, as input to the representation network. To give more importance to the dialogue turn for which the prediction has to be made and to the turns that immediately precede it, we multiplied the token embeddings belonging to each turn by a decreasing weight factor. The input representation of each token is constructed by summing its token embedding, multiplied by the weight factor, and its position embedding. More details, including the hyper-parameters used, are given in Appendix C.

EDOS quality analysis and comparison with the state-of-the-art gold standard
Table 6 shows some example dialogues taken from the EDOS dataset along with annotations and confidence scores. The examples show that, even for less confident predictions, the label describes the emotion or intent of the corresponding dialogue turn quite accurately. We also conducted a qualitative comparison of the annotations in the EDOS dataset with EmpatheticDialogues (Rashkin et al., 2018; Welivita and Pu, 2020), a state-of-the-art gold standard dataset for empathetic conversations. Figure 2 compares the distributions of emotions and intents in the two datasets. In both datasets, intent categories take prominence over individual emotion classes. This is in line with the observations of Welivita and Pu (2020), who notice that one or more intents from the taxonomy of empathetic intents are mostly utilized when responding to emotions in dialogue, rather than similar or opposite emotions. In particular, the intent Questioning accounts for the highest percentage of the annotations in both EmpatheticDialogues and EDOS. We also computed the KL-divergence (≥ 0) of the emotion and intent distribution of EDOS with respect to that of EmpatheticDialogues, which measures how one probability distribution differs from a second, reference probability distribution (Kullback and Leibler, 1951). It resulted in a KL-divergence value of 0.2447, which indicates a considerable similarity between the two distributions (the lower the KL-divergence, the more similar the distributions are). Figure 3 compares the emotion-intent flow patterns in EmpatheticDialogues and EDOS. In the visualization corresponding to EmpatheticDialogues, the 1st and 3rd dialogue turns correspond to the speaker and the 2nd and 4th dialogue turns correspond to the listener. However, in EDOS, we cannot distinguish the dialogue turns as speaker and listener turns due to the absence of speaker annotations.
Though this is the case, we could still observe some conversational dynamics present in EmpatheticDialogues are preserved in EDOS. For example, in both datasets, the speaker mostly starts the conversation with some emotional statement and in the subsequent turn, the response tends to be of the intent Questioning. In both datasets, intents Agreeing and Acknowledging follow emotions seen in the first turn irrespective of whether they are positive or negative. As the dialogues proceed, it could be seen in both datasets the emotions deescalate as more empathetic response intents emerge.
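The KL-divergence used in the label distribution comparison above can be computed from per-label counts as in the following sketch. The function name is hypothetical, and the natural logarithm and the small `eps` floor for labels absent from the reference distribution are assumptions; the text does not state the log base or any smoothing.

```python
import math
from typing import Dict


def kl_divergence(p_counts: Dict[str, float], q_counts: Dict[str, float],
                  eps: float = 1e-12) -> float:
    """D_KL(P || Q) over label distributions given as counts or probabilities.

    P is the distribution being evaluated (e.g. EDOS) and Q is the
    reference distribution (e.g. EmpatheticDialogues).
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    divergence = 0.0
    for label, count in p_counts.items():
        p = count / p_total
        if p == 0.0:
            continue  # terms with p = 0 contribute nothing
        # Assumption: floor the reference probability to avoid division by zero.
        q = max(q_counts.get(label, 0.0) / q_total, eps)
        divergence += p * math.log(p / q)
    return divergence
```

Identical distributions give a value of 0, and the value grows as the label proportions diverge.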

Experimental baselines
We propose some experimental baselines using the curated dataset for empathetic response generation and compare their performance against a dialogue model trained on the EmpatheticDialogues dataset. For this purpose, we trained a transformer (Vaswani et al., 2017) model with various training settings. Specifically, the following datasets were involved: 1) OS dialogues (as described in Section 3.1, these dialogues were obtained by segmenting the movie subtitles; note that for the purpose of pre-training, we excluded the EDOS dialogues, resulting in around 3M dialogues); 2) EDOS (1M dialogues); and 3) EmpatheticDialogues (25K dialogues). All three datasets were split into training (80%), validation (10%), and test (10%) sets. Based on the training strategies, we have the following models: 1) Pre-trained: to take advantage of transfer learning, we pre-trained the transformer model on the 3M OS dialogues. The large scale of this training set is expected to provide a good starting point for fine-tuning; 2) Fine-tuned: we took the pre-trained transformer and then fine-tuned it on the EDOS and EmpatheticDialogues datasets, respectively. All the models have 4 layers, 6 attention heads, and a hidden size of 300, and were trained until the minimum validation loss was reached. For inference, we used beam search with beam size 32 and 4-gram repeat blocking.
To evaluate the performance of the dialogue models, we adopted the following metrics: 1) perplexity; 2) distinct-1 and -2 metrics (Li et al., 2016), which measure the diversity of the generated responses; and 3) sentence embedding similarity, for which we used SBERT (Reimers and Gurevych, 2019) to obtain an embedding for the generated response as well as for the ground truth and then calculated the cosine similarity between the two embeddings. The performance of the dialogue models was tested in held-out and zero-shot settings. The evaluation results are shown in Table 7.
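As an illustration of the diversity metrics, distinct-n is commonly computed as the number of unique n-grams divided by the total number of n-grams across all generated responses. The sketch below follows that definition; whitespace tokenization and the function name `distinct_n` are assumptions, and the experiments reported here may have used a different tokenizer.

```python
from typing import List


def distinct_n(responses: List[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over a set of responses."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()  # assumption: whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

For the two responses "i am fine" and "i am sad", distinct-1 is 4/6 and distinct-2 is 3/4; higher values indicate less repetition across responses.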
In the held-out setting, where the model is evaluated on data from the same domain as the training data, all three models achieved good performance, and the perplexity values are much lower compared with the zero-shot setting, where the model is evaluated on data from a different domain. We also observe that the model fine-tuned on OS and EDOS dialogues achieves much higher Distinct-1 and -2 scores, even in the zero-shot setting when evaluated on EmpatheticDialogues. This indicates that by training on our curated OpenSubtitles dialogues, the model gains more diversity in the generated responses. It might be due to the larger size of the datasets containing many diverse responses. Out of the two, EDOS performs the best in terms of diversity, which reflects the quality of dialogues filtered from OpenSubtitles.

Discussion and conclusion
In this work, we curated a large-scale dialogue dataset, EDOS, comprising 1M emotional dialogues from movie subtitles. This dataset is significantly larger and contains more fine-grained emotion categories and empathetic response intents than existing emotional dialogue datasets. To facilitate annotation, we utilized data augmentation techniques to extend a small set of manually annotated data and trained a dialogue emotion classifier with accuracy comparable to the state-of-the-art. The data augmentation and automatic annotation procedure we employed significantly reduced the manual annotation cost and time.
Obtaining a large dataset is important only if its quality can be assured. The qualitative comparison conducted between EDOS and the state-of-the-art EmpatheticDialogues dataset by means of visual validation was one way to confirm that. The results of the comparison confirmed that most of the conversational dynamics present in EmpatheticDialogues were also observed in EDOS. We also proposed some experimental baselines by training a transformer model for empathetic response generation on the OS, EDOS, and EmpatheticDialogues datasets and tested them in held-out and zero-shot settings.
The results showed that the model fine-tuned on EDOS scored the best in terms of diversity metrics. This dataset can be readily utilized to develop empathetic conversational agents and for fine-grained emotion analysis in dialogues. The pipeline we present can be used to create similar large-scale datasets in similar or even different domains.
As future work, we plan to utilize this dataset to further conduct experiments on empathetic response generation. Since it is annotated with emotions and intents, we will use it for experiments involving controllable and interpretable response generation. Particularly, the plus categories present in the dataset can be utilized to condition the chatbot's response generation process, making it possible to control and interpret the generated responses. The dataset can also be used to train state-of-the-art dialogue emotion classifiers.

Ethical considerations
EDOS contains dialogues derived from the Open-Subtitles corpus (Lison et al., 2019), which is publicly available. 2 It is part of the OPUS (Open Parallel corpUS), which is based on open source products and is delivered as an open content package. The workers annotating the dataset were compensated with $0.4 per HIT, which takes 4.12 minutes on average to complete (excluding the time taken by workers who took an unusually long time to complete the task) and a bonus of $0.1 if they completed at least 3 out of 5 quiz questions correctly. Fair compensation was determined based on the US minimum wage of $7.12 per hour. Since the dataset is in English, the annotators recruited from AMT were restricted the majority native Englishspeaking countries: US; UK; Canada; Australia; and New Zealand. The fact that the dataset is Using this dataset to directly train end-to-end chatbot models can involve certain risks. Though we have taken steps to remove profanity from the responses in the dataset, due to the lack of controllability and interpretability in end-to-end neural response generation models, there exists the risk of generating inappropriate or biased responses for certain emotional prompts. A recent example is Microsoft's Taybot that started producing unintended and offensive tweets denying the Holocaust as a result of learning from offensive information from Twitter (Lee, 2016). To mitigate this, researchers have recently focussed on inducing controllability in these end-to-end response generation models by means of jointly modeling dialogue intent selection and response generation (Wu et al., 2018;Sankar and Ravi, 2019;Hedayatnia et al., 2020;Santhanam et al., 2020;Ke et al., 2018;Lee et al., 2020). We encourage the readers to look into these approaches when developing conversational agents using this dataset.
Though human-like chatbots with emotion recognition and empathetic responding abilities can be beneficial in a number of situations such as in the medical domain, crisis management, customer service, and elderly care, it should not be underestimated that they involve some potential harms. For example, a chatbot can be used to impersonate a real human being and used for cybercrimes such as scamming and phishing. It is also important to note that one could get emotionally attached to a bot, or even become codependent, distracting him or herself from relationships with humans and causing distress if the chatbot becomes dysfunctional. Users may tend to reveal their private and confidential information such as certain health conditions and private attributes during such interaction, which could be misused when in the hands of the wrong people. Developers should take these risks into account when deploying such chatbots in the real world to ensure safe and ethical use. C Choice of hyper-parameters and additional training details regarding the dialogue emotion classifier used for annotation The choice of 5 seconds to separate dialogues is based on a histogram of time intervals between adjacent subtitle blocks in the OpenSubtitles corpus, which is denoted in Figure 5. As it can be observed in the histogram, most of the time gaps fall below 3 seconds. A clear drop in count was observed between 3-5 seconds. Therefore, we chose 5 seconds as the time interval to separate dialogues. The choice a threshold of 0.92 to select dialogues similar to those that were already annotated was based on manually inspecting a random subset of the results obtained after using a range of similarity thresholds. Table 8 shows some example dialogues discovered at this threshold.
Using decreasing weights for context utterances is based on the intuition that, in human dialogues, more attention is paid to the most recent utterances in the dialogue history. This idea is supported by the time-decay functions used in neural dialogue understanding approaches (See et al., 2019). We conducted an ablation study with and without decreasing weights in the model. The unweighted model performed worse than the weighted model, with final F1 scores of 63.44 and 64.86 for the unweighted and weighted models, respectively.
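The idea of weighting recent context utterances more heavily can be sketched as follows. This section does not state the exact weighting scheme, so the geometric decay, the decay factor of 0.5, the normalization, and the function names below are assumptions chosen purely for illustration.

```python
def decaying_weights(num_utterances, decay=0.5):
    """Return normalized weights ordered from oldest to most recent utterance.

    The most recent utterance gets the largest weight; each step further
    back in the history multiplies the weight by `decay` (an assumed scheme).
    """
    raw = [decay ** (num_utterances - 1 - i) for i in range(num_utterances)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_context(utterance_vecs, decay=0.5):
    """Combine per-utterance vectors (oldest first) into one context vector."""
    if not utterance_vecs:
        raise ValueError("at least one utterance vector is required")
    weights = decaying_weights(len(utterance_vecs), decay)
    dim = len(utterance_vecs[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, utterance_vecs))
        for d in range(dim)
    ]
```

Setting `decay=1.0` gives uniform weights, which corresponds to the unweighted setting of an ablation like the one described above.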
We used the same hyper-parameter settings as RoBERTa (Liu et al., 2019) when training the dialogue emotion classifier used for annotation. We used the Adam optimizer with β₁ = 0.9, β₂ = 0.98, ε = 1 × 10⁻⁶, and a learning rate of 2 × 10⁻⁵. A dropout of 0.1 was applied to all layers and attention weights, and we used the GELU activation function (Hendrycks and Gimpel, 2016). We limited the maximum number of input tokens to 100 and used a batch size of 256. All the experiments were conducted on a machine with 2x12cores@2.5GHz, 256 GB RAM, 2x240 GB SSD, and 2xGPU (NVIDIA Titan X Maxwell). Training the final emotion classifier took 546.84 seconds in total. The optimal model was selected based on the average cross-entropy loss calculated between the ground-truth and predicted labels of the validation set.
Table 9 shows more descriptive statistics of the EDOS dataset: the number of dialogues and the number of dialogue turns per emotion and intent category. A dialogue is counted under an emotion or an intent if the beginning dialogue prompt is annotated with that emotion or intent.
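The counting rule for the per-category statistics can be sketched as below. The data layout (each dialogue as a list of `(utterance, label)` turns) and the function name are hypothetical, and attributing all turns of a dialogue to the label of its opening prompt is an assumption about how the turn counts are tallied.

```python
from collections import Counter


def count_by_prompt_label(dialogues):
    """Count dialogues and turns per label of the opening prompt.

    `dialogues` is a list of dialogues, each a list of (utterance, label)
    tuples (a hypothetical layout). A dialogue, and all of its turns, are
    counted under the label of its first turn.
    """
    dialogue_counts = Counter()
    turn_counts = Counter()
    for turns in dialogues:
        if not turns:
            continue  # skip empty dialogues
        prompt_label = turns[0][1]
        dialogue_counts[prompt_label] += 1
        turn_counts[prompt_label] += len(turns)
    return dialogue_counts, turn_counts
```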

Manually annotated dialogues | Dialogues discovered using similarity matching (with similarity ≥ 0.92)
-That 's beautiful !. (Acknowledging) -Now , let 's take a look at this beautiful piece of work -Oh , my God . It 's beautiful .
-I thought the coils were closer to me .
-Oh , well ... It was a good one nonetheless .
-I 'm so happy ! (Joyful) -Actually , I just wanted to say I love you . And I 'm sorry if I 'm a bit edgy about my book , but all that counts for me is you . You becoming my wife .
-That 's what really matters .
-I 'm very happy .
-Hey ! Don 't eat at my house anymore .
-You 're disgusting . (Disgusted) -I thought I told you to stay the fuck away from me if you were back on that shit .
-Was the team mad , then ? -I wasn 't happy ! -That 's pretty bad . (Acknowledging) -It 's starting to hurt so bad .
-Really ? That bad ? -Really bad .
Table 8: Examples of similar dialogues discovered above a cosine similarity threshold of 0.92. The last turn in each dialogue discovered through similarity matching was labeled with the emotion or intent of the last turn of the manually labeled dialogue.

E Additional training details about the experimental baselines
Here we summarize some of the parameters of the model implementation. We used the RoBERTa tokenizer to tokenize the input utterances; the vocabulary size is 50,265. We allowed a maximum of 100 tokens as input to the model. We used 4 sub-layers in the encoder and decoder, with 6 heads in the multi-head attention. The dimension of the hidden units is 300, and the dimension of the pointwise feed-forward layers is 1200. We used a dropout rate of 0.1 and the GELU (Hendrycks and Gimpel, 2016) activation function for the hidden layers. The loss function was optimized with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 5 × 10⁻⁵. For inference, we used beam search with a beam size of 32. To prevent the models from generating repetitive tokens or n-grams, we modified the beam search algorithm so that, at each time step, any branch containing a repeated 4-gram has its log probability set to negative infinity, which stops it from being expanded further. All the models were trained with a batch size of 512 on machines with 4 Nvidia Titan X Pascal GPUs, 2 Intel Xeon E5-2680 v3 CPUs, and 256 GB RAM. Table 10 lists the training details as well as the validation performance for all the models.
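The repetition-blocking rule used during beam search can be illustrated with a short sketch. It is not the implementation used in the experiments: the branch representation (a list of `(tokens, log_prob)` pairs) and the function names are hypothetical, and the sketch covers only the pruning step, not the full beam search.

```python
import math


def has_repeated_ngram(tokens, n=4):
    """Return True if any n-gram occurs more than once in `tokens`."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


def block_repetitive_branches(branches, n=4):
    """Set the log probability of repetitive branches to negative infinity.

    `branches` is a list of (tokens, log_prob) pairs (a hypothetical
    representation of beam hypotheses). Branches scored -inf will never
    be ranked above any non-repetitive branch at the next expansion step.
    """
    return [
        (tokens, -math.inf if has_repeated_ngram(tokens, n) else log_prob)
        for tokens, log_prob in branches
    ]
```

A beam search loop would call `block_repetitive_branches` on the candidate hypotheses at each time step, before keeping the top-scoring ones.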