Learning Dialogue Representations from Consecutive Utterances

Learning high-quality dialogue representations is essential for solving a variety of dialogue-oriented tasks, especially considering that dialogue systems often suffer from data scarcity. In this paper, we introduce Dialogue Sentence Embedding (DSE), a self-supervised contrastive learning method that learns effective dialogue representations suitable for a wide range of dialogue tasks. DSE learns from dialogues by taking consecutive utterances of the same dialogue as positive pairs for contrastive learning. Despite its simplicity, DSE achieves significantly better representation capability than other dialogue representation and universal sentence representation models. We evaluate DSE on five downstream dialogue tasks that examine dialogue representation at different semantic granularities. Experiments in few-shot and zero-shot settings show that DSE outperforms baselines by a large margin; for example, it achieves a 13% average performance improvement over the strongest unsupervised baseline in 1-shot intent classification on 6 datasets. We also provide analyses of the benefits and limitations of our model.


Introduction
Due to the variety of domains and the high cost of data annotation, labeled data for task-oriented dialogue systems is often scarce or even unavailable. Therefore, learning universal dialogue representations that effectively capture dialogue semantics at different granularities (Hou et al., 2020; Krone et al., 2020; Yu et al., 2021) provides a good foundation for solving various downstream tasks (Snell et al., 2017; Vinyals et al., 2016).

Figure 1 (caption): Left: each color indicates one intent category, while the black circles represent out-of-scope samples. Right: items with the same color stand for query-response pairs, where triangles represent queries. The black circles represent randomly sampled responses.
Contrastive learning (Chen et al., 2020;He et al., 2020) has achieved widespread success in representations learning in both the image domain (Hjelm et al., 2018;Lee et al., 2020;Bachman et al., 2019) and the text domain (Gao et al., 2021;Zhang et al., 2021a,b;Wu et al., 2020a). Contrastive learning aims to reduce the distance between semantically similar (positive) pairs and increase the distance between semantically dissimilar (negative) pairs. These positive pairs can be either human-annotated or obtained through various data augmentations, while negative pairs are often collected through negative sampling in the mini-batch.
Figure 2 (caption): Positive pairs constructed from two example dialogues.
Dialogue 1: I am looking for restaurants. / What type of food do you like? / I want some pizza. / Domino's is a good place for pizza.
Pair 1: I am looking for restaurants. | What type of food do you like?
Pair 2: What type of food do you like? | I want some pizza.
Pair 3: I want some pizza. | Domino's is a good place for pizza.
Dialogue 2: Find me some restaurants. / What type of food do you like? / Korean food, please. / There is no Korean restaurant.
Pair 4: Find me some restaurants. | What type of food do you like?
Pair 5: What type of food do you like? | Korean food, please.
Pair 6: Korean food, please. | There is no Korean restaurant.

In the supervised learning regime, Gao et al. (2021) and Zhang et al. (2021a) demonstrate the effectiveness of leveraging the Natural Language Inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2018) to support contrastive learning. Inspired by their success, a natural choice for dialogue representation learning is utilizing the Dialogue-NLI dataset (Welleck et al., 2018), which consists of both semantically entailed and contradicted pairs. However, due to its relatively limited scale and diversity, we found that learning from this dataset leads to less satisfying performance, while the high cost of collecting additional human annotations precludes its scalability. On the other extreme, unsupervised representation learning has achieved encouraging results recently, among which SimCSE (Gao et al., 2021) and TOD-BERT (Wu et al., 2020a) set new state-of-the-art results on general texts and dialogues, respectively. SimCSE uses Dropout (Srivastava et al., 2014) to construct positive pairs from any text by passing a sentence through the encoder twice to generate two different embeddings. Although SimCSE outperforms common data augmentations that directly operate on discrete text, we find it performs poorly in the dialogue domain (see Sec. 4.3). This motivates us to seek better positive pair constructions by leveraging the intrinsic properties of dialogue data. On the other hand, TOD-BERT takes an utterance and the concatenation of all the previous utterances in the dialogue as a positive pair. Despite promising performance on some tasks, we found TOD-BERT struggles on many other dialogue tasks where the semantic granularities or data statistics differ from those evaluated in their paper.
In this paper, inspired by the fact that dialogues consist of consecutive utterances that are often semantically related, we use consecutive utterances within the same dialogue as positive pairs for contrastive learning (see Figure 2). This simple strategy works surprisingly well. We evaluate DSE on a wide range of task-oriented dialogue applications, including intent classification, out-of-scope detection, response selection, and dialogue action prediction. We demonstrate that DSE substantially outperforms TOD-BERT, SimCSE, and several other sentence representation learning models in most scenarios. We assess the effectiveness of our approach by comparing DSE against variants trained on other types of positive pairs (e.g., Dropout and Dialogue-NLI). We also discuss the trade-offs in learning dialogue representations for tasks focusing on different semantic granularities and provide insights on the benefits and limitations of the proposed method. Additionally, we empirically demonstrate that using consecutive utterances as positive pairs effectively improves training stability (Appendix A.3).

Why Contrastive Learning on Consecutive Utterances?
When performing contrastive learning on consecutive utterances, we encourage the model to treat an utterance as similar to its adjacent utterances and dissimilar to utterances that are not consecutive to it or that belong to other dialogues. On the one hand, this training process directly increases an utterance's similarity with its true response and decreases its similarities with other randomly sampled utterances. The ability to identify the appropriate response among many similar utterances is beneficial for dialogue ranking tasks (e.g., response selection). On the other hand, consecutive utterances also contain implicit categorical information, which benefits dialogue classification tasks (e.g., intent classification and out-of-scope detection). Consider pairs 1 and 4 in Figure 2: we implicitly learn similar representations for I am looking for restaurants and Find me some restaurants, since both are consecutive with What type of food do you like?
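The pair construction described above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation; the function name is ours.

```python
def consecutive_pairs(dialogue):
    """Build contrastive positive pairs from consecutive utterances.

    Each adjacent (utterance, next utterance) pair within one dialogue is
    treated as a positive pair; negatives come later from the mini-batch.
    """
    return [(dialogue[i], dialogue[i + 1]) for i in range(len(dialogue) - 1)]

dialogue = [
    "I am looking for restaurants.",
    "What type of food do you like?",
    "I want some pizza.",
    "Domino's is a good place for pizza.",
]
pairs = consecutive_pairs(dialogue)
# Yields three positive pairs, matching Pairs 1-3 in Figure 2.
```

Applying the same construction to Dialogue 2 in Figure 2 yields Pairs 4-6.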
In contrast, SimCSE does not enjoy these benefits, as it simply uses Dropout as data augmentation. Although TOD-BERT also leverages intrinsic dialogue semantics by combining an utterance with its dialogue context as a positive pair, the context is often the concatenation of 5 to 15 utterances. Due to the large discrepancy in both semantics and data statistics between each utterance and its context, simply optimizing the similarity between them leads to less satisfying representations on many dialogue tasks. As shown in Section 4, TOD-BERT can even produce degenerate representations on some downstream tasks compared to the original BERT model.

Notation
Let $\{(x_i, x_i^+)\}_{i=1}^{M}$ be a batch of positive pairs, where M is the batch size. In our setting, each (x_i, x_i^+) denotes a pair of consecutive utterances sampled from a dialogue. Let e_i denote the representation of the text instance x_i obtained through an encoder. In this paper, we use mean pooling over token embeddings to obtain representations.
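Mean pooling over token embeddings, masking out padding, can be sketched as follows. This is a generic illustration under the usual convention of a 0/1 attention mask, not code from the paper.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mean-pool token embeddings into one sentence vector, ignoring padding.

    token_embeddings: (seq_len, hidden) array from the encoder.
    attention_mask:   (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens
    count = mask.sum()                              # number of real tokens
    return summed / count

tokens = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])  # last row is padding
mask = np.array([1, 1, 0])
e = mean_pool(tokens, mask)  # -> [2.0, 3.0]
```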

Training Target
Contrastive learning aims to maximize the similarity between positive samples and minimize the similarity between negative samples. For a contrastive anchor x i , the contrastive loss aims to increase its similarity with its positive sample x i + and decrease its similarity with the other 2M − 2 negative samples within the same batch.
We adopt the Hard-Negative sampling strategy proposed by Zhang et al. (2021a), which puts higher weights on the samples that are close to the anchor in the representation space. The underlying hypothesis is that hard negatives are more likely to occur among those located close to the anchor in the representation space. Specifically, the Hard-Negative sampling based contrastive loss for anchor x_i is defined as:

\ell_{i,i^+} = -\log \frac{\exp(\mathrm{sim}(e_i, e_{i^+})/\tau)}{\exp(\mathrm{sim}(e_i, e_{i^+})/\tau) + \sum_{j \neq i, i^+} \alpha_{ij}\, \exp(\mathrm{sim}(e_i, e_j)/\tau)}    (1)

Here i and i^+ represent the indices of the anchor and its positive sample, τ denotes the temperature hyperparameter, and sim(e_i, e_j) is the cosine similarity of e_i and e_j. The weight α_{ij} is defined as:

\alpha_{ij} = \frac{\exp(\mathrm{sim}(e_i, e_j)/\tau)}{\frac{1}{2M-2} \sum_{k \neq i, i^+} \exp(\mathrm{sim}(e_i, e_k)/\tau)}    (2)

Note that the denominator is averaged over all the other 2M−2 negatives of x_i. Intuitively, samples that are close to the anchor in the representation space are assigned higher weights; in other words, α_{ij} denotes the relative importance of instance x_j among all 2M−2 negatives for optimizing the contrastive loss of the anchor x_i. For every positive pair (x_i, x_i^+), we respectively take x_i and x_i^+ as the contrastive anchor, so the contrastive loss over the batch is:

\mathcal{L} = \frac{1}{2M} \sum_{i=1}^{M} \left( \ell_{i,i^+} + \ell_{i^+,i} \right)    (3)

where \ell_{i^+,i} is defined by exchanging the roles of instances i and i^+ in Equation (1).
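The per-anchor loss with hard-negative weighting can be sketched in NumPy. This is our own illustrative reconstruction from the definitions above (the paper trains with a deep-learning framework; τ = 0.05 here is an arbitrary choice for the sketch, not the paper's value).

```python
import numpy as np

def hard_negative_loss(emb, i, i_pos, tau=0.05):
    """Contrastive loss for anchor i with positive i_pos.

    emb: (2M, d) embeddings of the full batch (anchors and positives).
    The other 2M-2 instances serve as negatives; each negative j is
    re-weighted by alpha_ij, the ratio of its exp-similarity to the mean
    exp-similarity over all negatives of the anchor.
    """
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # for cosine similarity
    sims = e @ e[i]                                       # (2M,) similarities to anchor
    exp_sims = np.exp(sims / tau)
    neg = np.ones(len(emb), dtype=bool)
    neg[[i, i_pos]] = False                               # mask out anchor and positive
    alpha = exp_sims[neg] / exp_sims[neg].mean()          # hard-negative weights
    denom = exp_sims[i_pos] + (alpha * exp_sims[neg]).sum()
    return -np.log(exp_sims[i_pos] / denom)

# The batch loss averages loss(i, i+) and loss(i+, i) over all M pairs.
```

Note that the weights α average to 1 by construction, so with uniform similarities the loss reduces to the standard InfoNCE objective.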

Experiments
We run experiments with five backbones: BERT-base, BERT-large (Devlin et al., 2018), RoBERTa-base, RoBERTa-large (Liu et al., 2019b), and DistilBERT-base (Sanh et al., 2019). Due to the space limit, we only present the results on BERT-base in the main text; the results for the other models are summarized in Appendix D. We use the same training data as TOD-BERT for a fair comparison. We summarize the implementation details and data statistics of both pre-training and evaluation in Appendices A and B, respectively.

Baselines
We compare DSE against several representation learning models that attain state-of-the-art results on general text and on dialogues, and categorize them into two groups. The evaluations of DSE-dropout and DSE-dianli allow us to fairly compare our approach against the state-of-the-art approaches in both the supervised and the unsupervised learning regimes.

Evaluation Setting
Because obtaining a large number of annotations is often time-consuming and expensive for task-oriented dialogue applications, especially given the variety of domains and potential privacy concerns, we mainly focus on few-shot and zero-shot evaluations.

Evaluation Methods
Considering that only a few annotations are available in our setting, we mainly focus on similarity-based evaluations, where predictions are made with similarity metrics applied in the embedding space, without updating the model.
We use different random seeds to independently construct multiple few-shot train and validation sets (see Table 1) from the original training data, and use the original test data for performance evaluation. To examine whether the performance gap reported in the similarity-based evaluations is consistent with the associated fine-tuning approaches, we also report fine-tuning results. We perform early stopping according to the validation set and report the test performance averaged over the different data splits.

Tasks and Metrics
We evaluate all models considered in this paper on two types of tasks: utterance-level and dialogue-level. The utterance-level tasks take a single dialogue utterance as input, while the dialogue-level tasks take the dialogue history as input. These two types of tasks assess representation quality for dialogue understanding at different semantic granularities, which are shared across a variety of downstream tasks.
Intent Classification is an utterance-level task that aims to classify user utterances into one of the pre-defined intent categories. We use Prototypical Networks (Snell et al., 2017) to perform the similarity-based evaluation. Specifically, we calculate a prototype embedding for each category by averaging the embedding of all the training samples that belong to this category. A sample is classified into the category whose prototype embedding is the most similar to its own. We report the classification accuracy for this task.
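The prototype-based classification described above can be sketched as follows. This is a generic illustration of the evaluation protocol, not the paper's code; variable names are ours.

```python
import numpy as np

def prototype_classify(support_emb, support_labels, query_emb):
    """1-nearest-prototype intent classification.

    Prototypes are the mean embedding of each class's support (few-shot
    training) examples; a query is assigned to the class whose prototype
    is most cosine-similar to the query embedding.
    """
    classes = sorted(set(support_labels))
    labels = np.array(support_labels)
    protos = np.stack([support_emb[labels == c].mean(axis=0) for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = protos @ q                      # cosine similarity to each prototype
    return classes[int(np.argmax(sims))], sims
```

Averaging the 0/1 correctness of this prediction over the test set gives the reported classification accuracy.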
Out-of-scope Detection advances intent classification by detecting whether a sample is out-of-scope, i.e., does not belong to any pre-defined category. We adapt the aforementioned Prototypical Networks to solve it. For a test sample, if its similarity with its most similar category is lower than a threshold, we classify it as out-of-scope; otherwise, we assign it to its most similar category. For each model, we calculate the mean and std (standard deviation) of the similarity scores between every sample and its most similar prototype embedding, and take mean−std and mean as the threshold, respectively. The evaluation set contains both in-scope and out-of-scope examples. We evaluate this task with four metrics: 1) Accuracy: accuracy on both in-scope and out-of-scope samples. 2) In-Accuracy: accuracy on the 150 in-scope intents. 3) OOS-Accuracy: out-of-scope detection accuracy. 4) OOS-Recall: recall of out-of-scope detection.
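The thresholding step can be sketched as below; a hypothetical illustration of the mean−std and mean thresholds described above, with names of our choosing.

```python
import numpy as np

def oos_threshold(max_sims, subtract_std=True):
    """Threshold from the distribution of best-prototype similarities.

    max_sims: similarity of every sample to its most similar prototype.
    Returns mean - std (subtract_std=True) or mean, the two thresholds
    compared in the paper.
    """
    mean, std = max_sims.mean(), max_sims.std()
    return mean - std if subtract_std else mean

def detect(max_sim, best_class, threshold):
    """Keep the best in-scope class if confident enough, else flag out-of-scope."""
    return "oos" if max_sim < threshold else best_class
```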

Utterance-level Response Selection is an utterance-level task that aims to find the most appropriate response from a pool of candidates for the input user query, where both the query and response are single dialogue utterances. We formulate it as a ranking problem and evaluate it with Top-k-100 accuracy (a.k.a., k-to-100 accuracy), a standard metric for this ranking problem (Wu et al., 2020a). For every query, we combine its ground-truth response with 99 randomly sampled responses and rank these 100 responses based on their similarities with the query in the embedding space. The Top-k-100 accuracy represents the ratio of the ground-truth response being ranked at top-k, where k is an integer between 1 and 100. We report the Top-1, Top-3, and Top-10 accuracy of the models.
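The Top-k-100 indicator for one query can be sketched as follows. This is an illustrative reconstruction of the metric described above (with only two distractors in the usage example rather than 99).

```python
import numpy as np

def top_k_100_hit(query_emb, true_resp_emb, distractor_embs, k):
    """Whether the ground-truth response ranks in the top-k of the candidates.

    Candidates = 1 true response + sampled distractors (99 in the paper),
    ranked by cosine similarity to the query.
    """
    cands = np.vstack([true_resp_emb[None, :], distractor_embs])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = cands @ q
    rank = int((sims > sims[0]).sum())  # distractors beating the ground truth
    return rank < k

# Averaging this indicator over all queries gives the Top-k-100 accuracy.
```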
Dialogue-Level Response Selection is a dialogue-level task. The only difference from Utterance-level Response Selection is that the query in this task is a dialogue history (e.g., the concatenation of multiple dialogue utterances from different speakers). We also report the Top-1, Top-3, and Top-10 accuracy for this task.
Dialogue Action Prediction is a dialogue-level task that aims to predict the appropriate system action given the most recent dialogue history. We formulate it as a multi-label text classification problem and evaluate it with model fine-tuning. We report the Macro and Micro F1 scores for this task.

Main Results
Intent Classification & Out-of-scope Detection Tables 2 and 3 show the results of similarity-based intent classification and out-of-scope detection. The fine-tuning based results are presented in Appendix C. As we can see, DSE substantially outperforms all the baselines. In intent classification, it attains a 13% average accuracy improvement over the strongest unsupervised baseline. More importantly, DSE achieves a 5%-10% average accuracy improvement over the supervised baselines that are trained on a large amount of expensively annotated data. The same trend is observed in out-of-scope detection, where DSE achieves a 13%-20% average performance improvement over the strongest unsupervised baseline. The comparison between DSE, DSE-dropout, and DSE-dianli further demonstrates the effectiveness of using consecutive utterances as positive pairs for learning dialogue embeddings.

The left panel of Figure 1 visualizes the embeddings on the Clinc150 dataset given by TOD-BERT, SimCSE, and DSE, which provides more intuitive insight into the performance gap. With the DSE embeddings, in-scope samples belonging to the same category are closely clustered together, clusters of different categories are separated by a large margin, and the out-of-scope samples lie far away from the in-scope clusters.

Response Selection Table 4 shows the results of similarity-based 0-shot response selection at the utterance level (AmazonQA) and the dialogue level (DSTC7-Ubuntu). Results of the fine-tuning based evaluation on AmazonQA show a similar trend and are summarized in Table 9 in the Appendix. In Table 4, the large improvements attained by DSE over the baselines indicate our model's capability in dialogue response selection, whether with a single-utterance query or a long dialogue history as the query. The right panel of Figure 1 visualizes the embeddings of questions and candidate answers in the AmazonQA dataset given by DSE, SimCSE, and TOD-BERT. With the DSE embeddings, each question is placed close to its real answer and far away from the other candidates.

Dialogue Action Prediction Table 5 shows that DSE outperforms all baselines except TOD-BERT, which indicates its capability in capturing dialogue-level semantics. To better understand TOD-BERT's superiority over DSE on this task, we further investigate it and find that its data format is special.

Trade-off in Query Construction
To understand the impact of using multiple utterances as queries, we train three new variants of DSE. (We use "query" to refer to the first utterance in a positive pair and "response" to refer to the other.) Specifically, we construct positive pairs as (u_i ⊕ u_{i+1}, u_{i+2}), where u_i represents the i-th utterance in a dialogue and ⊕ denotes concatenation with the [SEP] token; that is, we concatenate two consecutive utterances as the query. We refer to DSE trained with this data as DSE-2-1, since it uses 2 utterances as the query and 1 utterance as the response. Similarly, we train another variant, DSE-3-1. Lastly, we also combine the positive pairs constructed for training DSE, DSE-2-1, and DSE-3-1 to train another variant named DSE-123-1.
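The variant pair construction can be sketched as follows; an illustration of the scheme just described, with a function name of our choosing.

```python
def multi_utterance_pairs(dialogue, n_query=2, sep=" [SEP] "):
    """Positive pairs for the DSE n-1 variants.

    The query is the concatenation of n_query consecutive utterances and
    the response is the utterance that immediately follows, e.g. for
    n_query=2 the pair is (u_i [SEP] u_{i+1}, u_{i+2}).
    """
    pairs = []
    for i in range(len(dialogue) - n_query):
        query = sep.join(dialogue[i:i + n_query])
        pairs.append((query, dialogue[i + n_query]))
    return pairs
```

With n_query=1 this reduces to the original DSE construction; pooling the outputs for n_query in {1, 2, 3} yields the training set of DSE-123-1.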
As shown in Table 5, by simply increasing the number of utterances within each query to three, DSE again outperforms TOD-BERT, and the improvement further expands when trained with the combined set, i.e., DSE-123-1. Our results demonstrate that using long queries consisting of 5 to 15 utterances, as TOD-BERT does, is unnecessary even for dialogue action prediction. We further demonstrate this by evaluating DSE and its variants on all the other four tasks in Table 6, where our model outperforms TOD-BERT by a large margin. As the table indicates, by using a single utterance as the query, DSE achieves a good balance among different dialogue tasks. In cases where dialogue action prediction is of great importance, augmenting the original training set of DSE with positive pairs whose queries consist of 2 to 3 utterances is enough to attain better performance on that task while incurring only a slight performance drop on the other tasks.

Potential Limitation
Given the effectiveness of using consecutive utterances as positive pairs, a natural yet important question is: what are the potential limitations of the proposed approach? When using consecutive utterances as positive pairs for contrastive learning, an underlying assumption is that responses to the same query are semantically similar and, conversely, that queries prompting the same answer are similar. This assumption holds in many scenarios, yet it sometimes fails.
It may fail when answers have different semantic meanings. Take pairs 2 and 5 in Figure 2 as an example. Through our data construction, we implicitly consider I want some pizza and Korean food, please to be semantically similar, since they are both positively paired with What type of food do you like? Although this may be acceptable in some coarse-grained classification tasks, since the two sentences represent the same general intent (e.g., order food), using them as positive pairs can introduce noise when more fine-grained semantics are considered. The problem is exacerbated when answers are generic and ubiquitous, e.g., Thank you. Since such utterances can respond to countless dissimilar queries, e.g., I have booked a ticket for you vs. Happy birthday, training on these samples may implicitly increase the similarities among highly dissimilar utterances, which is undesirable.
We verify this on the NLI datasets, where the task is to identify whether one sentence semantically entails or contradicts an anchor sentence. For each anchor sentence, we calculate its cosine similarities with both the true entailment and the contradiction sentence in the representation space. We classify the sentence with the higher cosine similarity to the anchor as the entailment and the other as the contradiction. Although DSE achieves better classification accuracy (76.62) than BERT (69.40) and TOD-BERT (70.51), it underperforms SimCSE-unsup (80.31). Although using dropout to construct positive pairs is not as effective as our approach in many dialogue scenarios, it better avoids introducing fine-grained semantic noise.
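This similarity-based NLI evaluation can be sketched as follows; a minimal illustration of the protocol, assuming embeddings are already computed.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_entailment(anchor, sent_a, sent_b):
    """Return 0 if sent_a is predicted as the entailment, else 1.

    The candidate closer to the anchor in cosine similarity is labeled
    entailment; the other is labeled contradiction.
    """
    return 0 if cosine(anchor, sent_a) >= cosine(anchor, sent_b) else 1
```

Accuracy is the fraction of anchors for which the true entailment is picked.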
Despite the limitations, using consecutive utterances as positive pairs still leads to better dialogue representation than the elaborately labeled NLI datasets, indicating the great value of the information contained in dialogue utterances.

Related Work
Positive Pair Construction Popular supervised sentence representation learning methods often take advantage of human-annotated natural language inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2018) for contrastive learning (Gao et al., 2021; Zhang et al., 2021a; Reimers and Gurevych, 2019; Cer et al., 2018). These sentence pairs either entail or contradict each other, making them a great choice for constructing positive and negative training pairs. Unsupervised sentence representation learning often relies on various data augmentation strategies. Logeswaran and Lee (2018) and Giorgi et al. (2020) propose using sentences and their surrounding context as positive pairs. Other works resort to popular NLP augmentation methods such as word permutation (Wu et al., 2020b) and back-translation (Fang et al., 2020). Recently, Gao et al. (2021) demonstrate the superiority of using Dropout over other data augmentations that directly operate on discrete texts.

Dialogue Representation Learning Pre-training on dialogue corpora has also been explored for task-oriented systems (Peng et al., 2020). For dialogue understanding, Henderson et al. (2019b) propose a response selection approach using a dual-encoder model. They pre-train the response selection model on Reddit and then fine-tune it for different response selection tasks. Following this, Henderson et al. (2019a) introduce a more efficient conversational model that is pre-trained with a response selection objective on the Reddit corpus. However, they did not release code or pre-trained models for comparison. Wu et al. (2020a) combine nine dialogue datasets to obtain a large and high-quality task-oriented dialogue corpus. They introduce the TOD-BERT model by further pre-training BERT on this corpus with both the masked language modeling loss and a contrastive response selection loss.

Conclusion
In this paper, we introduce DSE, a simple contrastive learning method that learns dialogue representations by leveraging consecutive utterances in dialogues as positive pairs. We conduct extensive experiments on five dialogue tasks to show that the proposed method greatly outperforms other state-of-the-art dialogue representation models and universal sentence representation methods. We provide an ablation study and analysis of our proposed data construction from different perspectives, investigate the trade-offs between different data construction variants, and discuss the potential limitations to motivate further exploration of representation learning on unlabeled dialogues. We believe DSE can serve as a drop-in replacement for the dialogue representation model (e.g., the text encoder) in a wide range of dialogue systems.

A Pre-training
In this section, we present the training data, implementation details, and training stability of model pre-training.

A.1 Data
We utilize the corpus collected by TOD-BERT (Wu et al., 2020a) to construct positive pairs. This dataset is the combination of nine publicly available task-oriented datasets, including MetaLWOZ, Schema (Rastogi et al., 2020), Taskmaster, WOZ (Mrkšić et al., 2016), and CamRest676 (Wen et al., 2017). The combined dataset contains 100,707 dialogues with 1,388,152 utterances over 60 domains. We filter out sentences with three or fewer words and end up with 892,835 consecutive-utterance pairs (for DSE) and 879,185 unique sentences (for DSE-dropout). Note that the training data of SimCSE-unsup consists of 1 million sentences from Wikipedia. That is, on the one hand, we use the same dataset as TOD-BERT but with our proposed data construction; on the other hand, we use a similar number of training samples as SimCSE-unsup. We believe this makes the comparisons fair.
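The filtering and pairing step can be sketched as below. This is an assumption-laden illustration: the paper does not specify whether short utterances are dropped before or after pairing, and this sketch filters first.

```python
def build_training_pairs(dialogues, min_words=4):
    """Consecutive-utterance pairs, dropping utterances of three or fewer words.

    dialogues: list of dialogues, each a list of utterance strings.
    NOTE: filtering before pairing (as done here) can make originally
    non-adjacent utterances adjacent; the paper's exact order is unstated.
    """
    pairs = []
    for dlg in dialogues:
        kept = [u for u in dlg if len(u.split()) >= min_words]
        pairs += list(zip(kept[:-1], kept[1:]))
    return pairs
```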

A.2 Hyperparameters
We add a contrastive head after the Transformer model and use the outputs of the contrastive head to perform contrastive learning. We use a two-layer MLP with size (d × d, d × 128) as the contrastive head. We use Adam (Kingma and Ba, 2014) with a batch size of 1024 and a constant learning rate as the optimizer. We set the learning rate for the contrastive head as 3e-4 and the learning rate for the Transformer model as 3e-6.
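The head's forward pass can be sketched as follows. The (d × d, d × 128) sizes come from the paper; the ReLU between the two linear layers is our assumption, since the activation is not specified.

```python
import numpy as np

def contrastive_head(x, w1, b1, w2, b2):
    """Two-layer MLP projection head: d -> d -> 128.

    x: (batch, d) mean-pooled sentence embeddings from the Transformer.
    The 128-d outputs are what the contrastive loss is computed on.
    """
    h = np.maximum(x @ w1 + b1, 0.0)  # (batch, d), assumed ReLU activation
    return h @ w2 + b2                # (batch, 128)

d = 768  # hidden size of BERT-base
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(d, d)) * 0.02, np.zeros(d)
w2, b2 = rng.normal(size=(d, 128)) * 0.02, np.zeros(128)
z = contrastive_head(rng.normal(size=(4, d)), w1, b1, w2, b2)  # (4, 128)
```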

A.3 Training Stability
In this section, we analyze the model's stability with respect to training steps when training with different types of positive pairs. We compare two data construction methods: consecutive utterances (DSE) and dropout (DSE-dropout). We train each model for 15 epochs, save a checkpoint at the end of each epoch, and evaluate every checkpoint with the similarity-based methods. Figure 3 shows the two models' average performances on intent classification, out-of-scope detection, utterance-level response selection, and dialogue-level response selection. As shown in the figure, DSE's performance on all the tasks consistently improves during training and stabilizes after about 5 epochs. In contrast, DSE-dropout achieves its best performance at the first epoch, drops dramatically afterwards, and never surpasses DSE. This further illustrates the effectiveness of using consecutive utterances as positive pairs for learning dialogue representations. We therefore report DSE-dropout's performance at the first epoch in all the tables.

B Evaluation Setup
In this section, we present the evaluation details and an introduction to the evaluation datasets. Throughout this paper, we use cosine similarity as the similarity metric and the mean pooling of token embeddings as the sentence representation. For baseline models, we report the better result between using the model's default setting (e.g., the last hidden state of the [CLS] token as the sentence embedding for SimCSE) and mean pooling.

B.1 Hyperparameters
We use the same hyperparameters for all the models. For similarity-based methods, the only hyperparameter is the max sequence length; we empirically choose a value that fits at least 99% of the samples. We set it as 64, 64, 128, and 128 for intent classification, out-of-scope detection, utterance-level response selection, and dialogue-level response selection, respectively. Hyperparameters for the fine-tuning evaluations are listed as follows:

Intent Classification We fine-tune all the models for 50 epochs with a batch size of 16 and a learning rate of 3e-05. We evaluate the model on the few-shot validation set every 10 steps. Early stopping is applied based on the model's validation results. The max sequence length is set as 64 and the dropout at the classification layer is set as 0.1.

Utterance-level Response Selection
In this task, we set the max sequence length as 128 and the batch size as 100. Other hyperparameters are the same as those in Intent Classification. We use the original SimCLR loss (Chen et al., 2020) to optimize the model.

Dialogue Action Prediction
In this task, we fine-tune all the models for 100 epochs with a batch size of 32 and a learning rate of 5e-05. We evaluate the model on the few-shot validation set every 30 steps, and early stopping is also applied. The max sequence length is set as 32, since we find shorter inputs lead to much better performance for all the models. We truncate sentences from the head to keep the most recent dialogue utterances as model input. We set the dropout at the classification layer as 0.

Intent Classification For intent classification, we remove all the out-of-scope samples. We also use an internal dataset named Appen, whose texts are transcribed from customer recordings. This dataset contains 30 categories and 310 test samples. There are two versions of each sentence: one is transcribed by Automatic Speech Recognition (ASR), which includes some ASR noise (e.g., transcription errors), and the other is transcribed by human annotators. We refer to them as Appen-A and Appen-H, respectively.
Out-of-scope Detection We use the entire Clinc150 dataset, which contains 150 in-scope intents and one out-of-scope intent. There are 5500 test samples in total (4500 in-scope and 1000 out-of-scope).

Dialogue-Level Response Selection
We use the DSTC7-Ubuntu dataset (Lowe et al., 2017), which contains conversations about the Ubuntu system. Each query of this dataset comes together with one ground-truth response and 100 candidate responses. We combine the validation and test sets together for evaluation, which results in 6000 evaluation samples.

C Results of BERT-base
In this section, we present additional evaluation results for the BERT-base model, including 1-shot and 5-shot fine-tuning on intent classification (Table 7), 5-shot similarity-based out-of-scope detection (Table 8), and 500-shot and 1000-shot fine-tuning on AmazonQA response selection (Table 9).

D Results of Other Backbone Models
In this section, we present similarity-based evaluation results on the other four backbone models: BERT-large, RoBERTa-base, RoBERTa-large, and DistilBERT-base. Table 10 shows the results of similarity-based intent classification, and Table 11 shows the results of similarity-based response selection at both the utterance level and the dialogue level. As shown in the tables, DSE leads to consistent and significant performance boosts on all the backbone models.

Table 11: Results on 0-shot response selection on AmazonQA (utterance-level) and DSTC7-Ubuntu (dialogue-level). DSE leads to significant and consistent performance improvements on all the models.