DialogueCSE: Dialogue-based Contrastive Learning of Sentence Embeddings

Learning sentence embeddings from dialogues has drawn increasing attention due to its low annotation cost and high domain adaptability. Conventional approaches employ the siamese-network for this task, which obtains the sentence embeddings through modeling the context-response semantic relevance by applying a feed-forward network on top of the sentence encoders. However, as the semantic textual similarity is commonly measured through the element-wise distance metrics (e.g. cosine and L2 distance), such architecture yields a large gap between training and evaluating. In this paper, we propose DialogueCSE, a dialogue-based contrastive learning approach to tackle this issue. DialogueCSE first introduces a novel matching-guided embedding (MGE) mechanism, which generates a context-aware embedding for each candidate response embedding (i.e. the context-free embedding) according to the guidance of the multi-turn context-response matching matrices. Then it pairs each context-aware embedding with its corresponding context-free embedding and finally minimizes the contrastive loss across all pairs. We evaluate our model on three multi-turn dialogue datasets: the Microsoft Dialogue Corpus, the Jing Dong Dialogue Corpus, and the E-commerce Dialogue Corpus. Evaluation results show that our approach significantly outperforms the baselines across all three datasets in terms of MAP and Spearman’s correlation measures, demonstrating its effectiveness. Further quantitative experiments show that our approach achieves better performance when leveraging more dialogue context and remains robust when less training data is provided.


Introduction
Sentence embeddings are used with success for a variety of NLP applications, and many prior methods have been proposed with different learning schemes. Logeswaran and Lee (2018) and Hill et al. (2016) train sentence encoders in a self-supervised manner with web pages and books. Conneau et al. (2017) and Reimers and Gurevych (2019) propose to learn sentence embeddings on supervised datasets such as SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). Although the supervised-learning approaches achieve better performance, they suffer from the high cost of annotating the training dataset, which makes them hard to adapt to other domains or languages.
Recently, learning sentence embeddings from dialogues has begun to attract increasing attention. Dialogues provide strong semantic relationships among conversational utterances and are usually easy to collect in large amounts. These advantages make dialogue-based self-supervised learning methods promising candidates for achieving competitive or even superior performance against supervised-learning methods, especially under low-resource conditions. While promising, the issue of how to effectively exploit dialogues for this task has not been sufficiently explored. Earlier work trains an input-response prediction model on the Reddit dataset (Al-Rfou et al., 2016). Since this architecture is built on single-turn dialogue, the multi-turn dialogue history is not fully exploited. Henderson et al. (2020) demonstrate that introducing the multi-turn dialogue context can improve the sentence embedding performance. However, they concatenate the multi-turn dialogue context into a long token sequence, failing to model inter-sentence semantic relationships among the utterances. Recently, more advanced methods such as that of Reimers and Gurevych (2019) achieve better performance by employing BERT (Devlin et al., 2019) as the sentence encoder. These works have in common that they employ a feed-forward network with a non-linear activation on top of the sentence encoders to model the context-response semantic relevance, thereby learning the sentence embeddings. However, such an architecture presents two limitations: (1) It yields a large gap between training and evaluating, since semantic textual similarity is commonly measured by element-wise distance metrics such as cosine and L2 distance. (2) Concatenating all the utterances in the dialogue context inevitably introduces noise and redundant information, which degrades the results.
In this paper, we propose DialogueCSE, a dialogue-based contrastive learning approach to tackle these issues. We hold that the semantic matching relationships between the context and the response can be implicitly modeled through contrastive learning, thus making it possible to eliminate the gap between training and evaluating. To this end, we introduce a novel matching-guided embedding (MGE) mechanism. Specifically, MGE first pairs each utterance in the context with the response and performs a token-level dot-product operation across all the utterance-response pairs to obtain the multi-turn matching matrices. Then the multi-turn matching matrices are used as guidance to generate a context-aware embedding for the response embedding (i.e. the context-free embedding). Finally, the context-aware embedding and the context-free embedding are paired as a training sample, whose label is determined by whether the context and the response are originally from the same dialogue. Our motivation is that once the context semantically matches the response, it has the ability to distill the context-aware information from the context-free embedding, which is exactly the learning objective of the sentence encoder that aims to produce context-aware sentence embeddings. We train our model on three multi-turn dialogue datasets: the Microsoft Dialogue Corpus (MDC) , the Jing Dong Dialogue Corpus (JDDC) (Chen et al., 2020), and the E-commerce Dialogue Corpus (ECD) (Zhang et al., 2018). To evaluate our model, we introduce two types of tasks: the semantic retrieval (SR) task and the dialogue-based semantic textual similarity (D-STS) task. Here we do not adopt the standard semantic textual similarity (STS) task (Cer et al., 2017) for two reasons: (1) As revealed in (Zhang et al., 2020), the sentence embedding performance varies greatly as the domain of the training data changes. 
As a dialogue dataset always covers a few specific domains, evaluating on the STS benchmark may mislead the evaluation of the model. (2) The dialogue-based sentence embeddings focus on context-aware rather than context-free semantic meanings, which may not be suitable for evaluation through context-free benchmarks. Since previous dialogue-based works have not set up a uniform benchmark, we construct two evaluation datasets for each dialogue corpus. A total of 18,964 retrieval samples and 4,000 sentence pairs are annotated by seven native speakers through a crowd-sourcing platform. The evaluation results indicate that DialogueCSE significantly outperforms the baselines on the three datasets in terms of both MAP and Spearman's correlation metrics, demonstrating its effectiveness. Further quantitative experiments show that DialogueCSE achieves better performance when leveraging more dialogue context and remains robust when less training data is provided. To sum up, our contributions are threefold:
• We propose DialogueCSE, a dialogue-based contrastive learning approach with the MGE mechanism for learning sentence embeddings from dialogues. As far as we know, this is the first attempt to apply contrastive learning in this area.
• We construct dialogue-based sentence embedding evaluation benchmarks for three dialogue corpora. All of the datasets will be released to facilitate follow-up research.
• Extensive experiments show that DialogueCSE significantly outperforms the baselines, establishing state-of-the-art results.

Related Work

Self-supervised Learning Approaches
Early works on sentence embeddings mainly focus on the self-supervised learning approaches. One early approach trains a seq2seq network by decoding the token-level sequences of the context in the corpus. Hill et al. (2016) propose to predict the neighboring sentences as bag-of-words instead of step-by-step decoding. Logeswaran and Lee (2018) perform sentence-level modeling by retrieving the ground-truth sentence from candidates under the given context, achieving consistently better performance compared to the previous token-level modeling approaches. The datasets used in these works are typically built upon corpora of web pages and books. As the semantic connections are relatively weak in these corpora, the performance of these models is inherently limited and hard to improve further.
Recently, pre-trained language models such as BERT (Devlin et al., 2019) and GPT (Radford et al.) yield strong performance across many downstream tasks. However, BERT's embeddings show poor performance without fine-tuning, and many efforts have been devoted to alleviating this issue. Zhang et al. (2020) propose a self-supervised learning approach that derives meaningful BERT sentence embeddings by maximizing the mutual information between the global sentence embedding and all its local context embeddings. Li et al. (2020) argue that BERT induces a non-smooth anisotropic semantic space. They propose to use a flow-based generative module to transform BERT's embeddings into an isotropic semantic space. Similar to this work, Su et al. (2021) replace the flow-based generative module with a simple but efficient linear mapping layer, achieving results competitive with those reported for BERT-flow.
Lately, contrastive self-supervised learning approaches have shown their effectiveness and merit in this area. Giorgi et al. (2020) and Meng et al. (2021) incorporate data augmentation methods, including word-level deletion, reordering, and substitution as well as sentence-level corruption, into the pre-training of deep Transformer models to improve the sentence representation ability, achieving significantly better performance than BERT, especially on sentence-level tasks (Cer et al., 2017; Conneau and Kiela, 2018). Gao et al. (2021) apply dropout twice, independently, to obtain two same-source embeddings from a single input sentence. By optimizing their cosine distance, SimCSE achieves remarkable gains over the previous baselines. Yan et al. (2021) empirically study more data augmentation strategies for learning sentence embeddings and also achieve remarkable performance, comparable to SimCSE. In this work, we propose the MGE mechanism to generate a context-aware embedding for each candidate response based on its context-free embedding. Different from previous methods built upon data augmentation strategies, MGE leverages the context to accomplish this goal without any text corruption.
For dialogue, prior work trains a siamese transformer network with single-turn input-response pairs extracted from Reddit. Such an architecture is further extended in (Reimers and Gurevych, 2019) by replacing the transformer encoder with BERT. Henderson et al. (2020) propose to leverage the dialogue context to improve the sentence embedding performance. They concatenate the multi-turn dialogue context into a long word sequence and adopt a similar siamese architecture to model the context-response matching relationships. Our work is closely related to these works. We propose a novel dialogue-based contrastive learning approach, which directly models the context-response matching relationships without an intermediate MLP. We also consider the interactions between each utterance in the dialogue context and the response instead of simply treating the dialogue context as a long sequence.

Supervised Learning Approaches
The supervised learning approaches mainly focus on training classification models with the SNLI and the MNLI datasets (Bowman et al., 2015; Williams et al., 2018). Conneau et al. (2017) demonstrate the superior performance of the supervised learning model on both the STS-benchmark (Cer et al., 2017) and the SICK-R tasks (Marelli et al., 2014). Based on this observation, subsequent work extends supervised learning to multi-task learning by introducing the QA prediction task, the Skip-Thought-like task (Henderson et al., 2017), and the NLI classification task, achieving significant improvement over InferSent. Reimers and Gurevych (2019) employ BERT as the sentence encoder in the siamese-network and fine-tune it with the SNLI and the MNLI datasets, achieving new state-of-the-art performance.

Problem Formulation
Let $D$ be a dialogue dataset, where $S_i = \{u_1, \cdots, u_{k-1}, r, u_{k+1}, \cdots, u_t\}$ is the $i$-th dialogue session in $D$ with $t$ turn utterances. $r$ is the response and $C_i = \{u_1, \cdots, u_{k-1}, u_{k+1}, \cdots, u_t\}$ is the bi-directional context around $r$. We omit the subscript $i$ in the following paragraphs and use $S$, $C$ instead of $S_i$, $C_i$ for brevity.
To generate the contrastive training pairs, we introduce two embedding matrices for $r$, named the context-free embedding matrix and the context-aware embedding matrix. Specifically, we first encode $r$ as an embedding matrix $\bar{R}$. Since $\bar{R}$ is encoded independently of the dialogue context, it is treated as the context-free embedding matrix. Then we generate a corresponding embedding matrix $\tilde{R}$ based on $\bar{R}$ according to the guidance of $C$. $\tilde{R}$ is treated as the context-aware embedding matrix. As $C$ and $r$ are derived from the same dialogue, $(\bar{R}, \tilde{R})$ naturally forms a positive training pair. To construct a negative training pair, we first sample an utterance $r'$ from a dialogue randomly selected from $D$. $r'$ is encoded as the context-free embedding matrix $\bar{R}'$, based on which a context-aware embedding matrix $\tilde{R}'$ is generated through the identical process. $(\bar{R}', \tilde{R}')$ is treated as a negative training pair. For each response $r$, we generate a positive training pair (since there is only one ground-truth response for each context) and multiple negative training pairs. All the training pairs are then passed through the contrastive learning module.
Figure 1: (1) We use BERT to encode the multi-turn dialogue context and the responses; all of the BERT encoders share the same parameters. (2) The matching-guided embedding (MGE) mechanism performs the token-level matching between each utterance and a response, and generates multiple refined embeddings across turns. (3) All refined embedding matrices are aggregated to form a context-aware embedding matrix, which is further pooled along the sequence dimension.
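The pair construction described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `make_pairs`, the tuple layout, and the choice to draw negatives only from other sessions are assumptions made for the example.

```python
import random

def make_pairs(dialogues, session_idx, k, num_negatives, rng):
    """Build one positive and several negative (context, response, label) samples.

    dialogues: list of sessions, each a list of utterance strings.
    k: position of the response r inside the chosen session.
    Hypothetical helper; drawing negatives from *other* sessions is an assumption.
    """
    session = dialogues[session_idx]
    response = session[k]
    # Bi-directional context: every utterance except the response itself.
    context = session[:k] + session[k + 1:]
    samples = [(context, response, 1)]  # positive pair, label 1

    other_sessions = [i for i in range(len(dialogues)) if i != session_idx]
    for _ in range(num_negatives):
        negative_session = dialogues[rng.choice(other_sessions)]
        negative_response = rng.choice(negative_session)
        samples.append((context, negative_response, 0))  # negative pair, label 0
    return samples

dialogues = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]
samples = make_pairs(dialogues, session_idx=0, k=1, num_negatives=3,
                     rng=random.Random(0))
```

Each sample keeps the same context and varies only the response, which matches the choice stated in the text of sampling responses rather than contexts.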
It is worth mentioning that there is no difference between sampling the response and sampling the context, as they are symmetrical in constructing the negative training pairs. We prefer the former as it is more straightforward and in accordance with previous retrieval-based works for dialogues. With all the training samples at hand, our goal is to minimize their contrastive loss, thus fine-tuning BERT as a context-aware sentence encoder. Figure 1 shows the model architecture. Our model is divided into three stages: sentence encoding, matching-guided embedding, and turn aggregation. We describe each stage below.

Sentence Encoding
We adopt BERT (Devlin et al., 2019) as the sentence encoder. Let $u$ represent a certain utterance in $C$. $u$ and $r$ are first encoded as two sequences of output embeddings, which is formulated as:
$$\{u_1, u_2, \cdots, u_n\} = \mathrm{BERT}(u), \quad \{r_1, r_2, \cdots, r_n\} = \mathrm{BERT}(r), \tag{1}$$
where $u_i$, $r_j$ represent the $i$-th and the $j$-th output embedding derived from $u$ and $r$ respectively. $n$ is the maximum sequence length of both input sentences. For all $i, j \in \{1, 2, \cdots, n\}$, the shapes of $u_i$ and $r_j$ are $1 \times d$, where $d$ is the dimension of BERT's outputs. We stack $\{u_1, u_2, \cdots, u_n\}$ and $\{r_1, r_2, \cdots, r_n\}$ to obtain the context-free embedding matrices $\bar{U}$ and $\bar{R}$, whose shapes are both $n \times d$.

Matching-Guided Embedding
The matching-guided embedding mechanism performs a token-level matching operation on $\bar{U}$ and $\bar{R}$ to form a matching matrix $M$. It then generates a refined embedding matrix $\hat{R}$ based on the context-free embedding matrix $\bar{R}$ under the guidance of $M$. $\hat{R}$ is a new representation of $r$ from the perspective of the utterance $u$. Note that as $u$ is only a single-turn utterance in $C$, we generate $t - 1$ refined embedding matrices for $r$ in total.
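To make the shapes in this step concrete, the following NumPy sketch shows one plausible instance of the matching step. The exact formulas are assumptions: the matching matrix is taken to be a scaled token-level dot product, and the refined embedding is taken to be the matching matrix applied to the context-free response embedding. Random arrays stand in for BERT outputs.

```python
import numpy as np

def matching_guided_embedding(U_bar, R_bar):
    """One utterance-response pair.

    U_bar: (n, d) token embeddings of the utterance u.
    R_bar: (n, d) context-free token embeddings of the response r.
    Assumed forms: M = U_bar R_bar^T / sqrt(d) and R_hat = M R_bar.
    """
    d = U_bar.shape[1]
    M = U_bar @ R_bar.T / np.sqrt(d)  # (n, n) token-level matching matrix
    R_hat = M @ R_bar                 # (n, d) refined embedding of r w.r.t. u
    return M, R_hat

rng = np.random.default_rng(0)
n, d = 5, 8                           # toy sizes, not the paper's settings
U_bar = rng.normal(size=(n, d))
R_bar = rng.normal(size=(n, d))
M, R_hat = matching_guided_embedding(U_bar, R_bar)
print(M.shape, R_hat.shape)           # (5, 5) (5, 8)
```

Under these assumptions every row of the refined matrix is a linear combination of the response's own token embeddings, so it stays in the same space as the context-free embedding, and the step introduces no trainable parameters.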

Turn Aggregation
After obtaining all of the refined embedding matrices across turns, we consider two strategies to fuse them to obtain the final context-aware embedding matrix $\tilde{R}$. The first strategy adopts a weighted sum operation based on the attention mechanism, formulated by:
$$\tilde{R} = \sum_{i} \alpha_i \hat{R}_i,$$
where $i \in \{1, \cdots, k-1, k+1, \cdots, t\}$ and $\hat{R}_i$ is the refined embedding matrix corresponding to the $i$-th turn utterance in the context. The attention weight $\alpha_i$ is computed with FFN, a two-layer feed-forward network with ReLU (Nair and Hinton, 2010) activation function. We denote this strategy as $I_1$. The second strategy $I_2$ directly sums up all the refined embeddings across turns, which is defined as:
$$\tilde{R} = \sum_{i} \hat{R}_i.$$
For the negative sample $r'$, we apply the same procedure to generate the context-free embedding matrix $\bar{R}'$ and the context-aware embedding matrix $\tilde{R}'$. Each context-aware embedding matrix is then paired with its corresponding context-free embedding matrix to form a training pair.
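The two aggregation strategies can be sketched in NumPy as below. The plain sum follows the description of $I_2$ directly. The attention variant is only an assumed instance of $I_1$: the source states that a two-layer ReLU feed-forward network produces the weights, but the input to that network (mean-pooled refined embeddings here), the softmax normalization, and the parameter names `W1`, `b1`, `W2`, `b2` are hypothetical.

```python
import numpy as np

def aggregate_sum(refined):
    """Strategy I2: refined has shape (T, n, d); sum over the T context turns."""
    return refined.sum(axis=0)

def aggregate_attention(refined, W1, b1, W2, b2):
    """Assumed instance of strategy I1: softmax-weighted sum over turns.

    Scores come from a two-layer ReLU network applied to the mean-pooled
    refined embedding of each turn (the pooling choice is an assumption).
    """
    pooled = refined.mean(axis=1)                   # (T, d)
    hidden = np.maximum(0.0, pooled @ W1 + b1)      # (T, h), ReLU
    scores = (hidden @ W2 + b2).ravel()             # (T,)
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights
    return np.tensordot(alpha, refined, axes=1), alpha

rng = np.random.default_rng(0)
T, n, d, h = 3, 5, 8, 4                             # toy sizes
refined = rng.normal(size=(T, n, d))
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)), np.zeros(1)
R_tilde_sum = aggregate_sum(refined)
R_tilde_att, alpha = aggregate_attention(refined, W1, b1, W2, b2)
```

With uniform weights the attention variant reduces to the plain sum divided by the number of turns, which is one way to see that $I_2$ treats every turn equally.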
As mentioned in the introduction, MGE holds several advantages in modeling the context-response semantic relationships. Firstly, the token-level matching operation acts as a guide to distill the context-aware information from the context-free embedding matrix. Meanwhile, it provides rich semantic matching information to assist the generation of the context-aware embedding matrix. Secondly, MGE is lightweight and computationally efficient, which makes the model easier to train than the siamese-network-based models. Finally and most importantly, the context-aware embedding $\tilde{R}$ shares the same semantic space with $\bar{R}$, which enables us to directly measure their cosine similarity. This is the key to successfully modeling the semantic matching relationships between the context and the response through contrastive learning.

Learning Objective
We adopt the NT-Xent loss proposed in (Oord et al., 2018) to train our model. In the loss $L$, $N$ is the number of all the positive training samples and $M$ is the number of all the training pairs associated with each positive training sample $r$. $\tau$ is the temperature hyper-parameter. $\mathrm{sim}(\cdot, \cdot)$ is the similarity function, defined as a token-level pooling operation followed by the cosine similarity. Once the model is trained, we take the mean pooling of BERT's output embeddings as the sentence embedding.
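A minimal NumPy sketch of a loss of this kind is given below. It is an illustration under stated assumptions rather than the paper's exact objective: the pooling inside the similarity function is taken to be mean pooling, the positive pair is stored at index 0 of each group, and the loss is averaged over the positive samples.

```python
import numpy as np

def pooled_cosine(A, B):
    """sim(A, B): pool each (n, d) matrix over tokens (mean pooling assumed),
    then take the cosine similarity of the two pooled vectors."""
    a, b = A.mean(axis=0), B.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nt_xent(groups, tau=0.1):
    """groups: list of N groups; each group is a list of M
    (context_free, context_aware) pairs with the positive pair at index 0."""
    losses = []
    for pairs in groups:
        logits = np.array([pooled_cosine(cf, ca) for cf, ca in pairs]) / tau
        logits = logits - logits.max()               # numerical stability
        log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
        losses.append(-log_prob_positive)
    return float(np.mean(losses))

# Toy check: a perfectly aligned positive pair and a perfectly opposed negative.
A = np.ones((3, 4))
loss = nt_xent([[(A, A), (A, -A)]], tau=0.5)
```

Because the training signal is a cosine similarity between pooled embeddings, it is the same kind of quantity that is used to compare sentence embeddings at evaluation time, which is the motivation the paper gives for the contrastive formulation.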

Experiments
We conduct experiments on three multi-turn dialogue datasets: the Microsoft Dialogue Corpus (MDC), the Jing Dong Dialogue Corpus (JDDC) (Chen et al., 2020), and the E-commerce Dialogue Corpus (ECD) (Zhang et al., 2018). Each utterance in these three datasets is originally assigned an intent label, which we further leverage in the heuristic strategy to construct the evaluation datasets.

Training
The Jing Dong Dialogue Corpus is collected from JD. Although this dataset, collected from a real-world scenario, is quite large, it contains much noise, which brings great challenges for our model. The E-commerce Dialogue Corpus is a large-scale dialogue dataset collected from Taobao. The released dataset takes the form of the response selection task. We recover the dialogue sessions by dropping the negative samples and splitting the context into multiple utterances. We pre-process these datasets by the following steps:
(1) We combine the consecutive utterances of the same speaker.
(2) We discard the dialogues with fewer than 4 turns in JDDC and ECD since such dialogues are usually incomplete in practice.
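The two pre-processing steps above can be sketched as follows. The data layout (a dialogue as a list of `(speaker, text)` tuples), the function names, and joining merged utterances with a space are assumptions made for illustration.

```python
def merge_consecutive(turns):
    """Step (1): merge adjacent utterances spoken by the same speaker.

    turns: list of (speaker, text) tuples in dialogue order.
    Joining with a single space is an assumption of this sketch.
    """
    merged = []
    for speaker, text in turns:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged

def preprocess(dialogues, min_turns=4):
    """Step (2): after merging, drop dialogues with fewer than min_turns turns."""
    merged = [merge_consecutive(d) for d in dialogues]
    return [d for d in merged if len(d) >= min_turns]

dialogues = [
    [("A", "hi"), ("A", "anyone there?"), ("B", "yes"), ("A", "order status?"), ("B", "shipped")],
    [("A", "hello"), ("B", "hi"), ("B", "how can I help?")],
]
cleaned = preprocess(dialogues)
```

In this toy input the first dialogue shrinks from five utterances to four turns and is kept, while the second has only two turns after merging and is discarded. The source applies the length filter to JDDC and ECD only.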

Evaluation
We introduce the semantic retrieval (SR) and the dialogue-based STS (D-STS) tasks to evaluate our model. For the SR task, we construct evaluation datasets by the following steps:
(1) We sample a large number of sentences with the intent labels as candidates.
(2) The candidates are annotated with binary labels indicating whether the given sentence and its intent label are consistent. The inconsistent instances are directly discarded from the candidates.
(3) For each sentence, we retrieve 100 sentences through BM25 (Robertson and Zaragoza, 2009) from the candidates, and assign each candidate sentence a label according to whether its intent is consistent with that of the target sentence.
We limit the number of positive samples to a maximum of 30 and keep approximately 7k, 7k, and 4k samples for MDC, JDDC, and ECD respectively. For the D-STS task, we sample the sentence pairs from the dialogues following the heuristic strategies proposed by Cer et al. (2017) to ensure there are enough semantically similar samples. The heuristic strategies include unigram-based and w2v-based KNN retrieval methods and random sampling from the candidates with the same intent labels. The sentence pairs are further annotated through the crowd-sourcing platform, with five degrees ranging from 1 to 5 according to their semantic relevance. We use the median of the annotated results as the semantic relevance degree, obtaining 1k, 2k, and 1k sentence pairs for MDC, JDDC, and ECD respectively.
All annotations are carried out by seven native speakers. For the SR task, we adopt the mean average precision (MAP) and the mean reciprocal rank (MRR) metrics. Following previous works, we adopt Spearman's correlation metric for the D-STS task to assess the quality of the dialogue-based sentence embeddings.
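For readers less familiar with these metrics, the following NumPy sketch shows standard textbook definitions of average precision, reciprocal rank, and Spearman's correlation. It is not the authors' evaluation script, and the function names are chosen for this example. MAP and MRR are the means of the per-query values over all queries.

```python
import numpy as np

def average_precision(ranked_labels):
    """ranked_labels: binary relevance labels, ordered best-ranked first."""
    labels = np.asarray(ranked_labels, dtype=float)
    if labels.sum() == 0:
        return 0.0
    hits = np.cumsum(labels)
    precision_at_k = hits / np.arange(1, len(labels) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())

def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant item, or 0 if none is relevant."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

ap = average_precision([1, 0, 1, 0])      # (1/1 + 2/3) / 2
rr = reciprocal_rank([0, 0, 1])           # 1/3
rho = spearman([1, 2, 3, 4], [10, 20, 30, 40])
```

The tie handling matters for D-STS, since the annotated relevance degrees take only five values and many sentence pairs share the same score.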

Baselines
We evaluate our model against the two groups of baselines: self-supervised learning methods and dialogue-based self-supervised learning methods. The former is not designed for dialogues while the latter is.

Self-supervised learning methods
In this line, we consider the BERT-based methods, which include BERT (Devlin et al., 2019), domain-adaptive BERT (Gururangan et al., 2020), BERT-flow (Li et al., 2020), and BERT-whitening (Su et al., 2021). "Domain-adaptive BERT" means that we continue pre-training BERT on the dialogue datasets. BERT-flow and BERT-whitening are two BERT-based variants that transform BERT's sentence embeddings into an isotropic semantic space.
For BERT, we use the [CLS] token embedding (denoted as BERT-CLS) and the average of the sequence output embeddings (denoted as BERT-avg) as the sentence embedding, and the same is true for domain-adaptive BERT. It should be noted that in related sentence embedding research, domain-adaptive BERT is rarely considered since the training datasets are relatively small. Fortunately, the large-scale dialogue datasets allow us to explore whether domain-adaptive pre-training is helpful for our tasks. We also adopt the average of GloVe word embeddings (Pennington et al., 2014) (denoted as Avg. GloVe) as the sentence embedding to compare with our results.
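The two pooling choices can be illustrated with a short NumPy sketch operating on a matrix of token output embeddings. The assumptions here are that the [CLS] token sits at position 0 and that padded positions are excluded from the average through an attention mask; the source does not state how padding or special tokens are handled.

```python
import numpy as np

def cls_embedding(token_embeddings):
    """BERT-CLS style: take the output embedding at position 0 ([CLS] assumed first)."""
    return token_embeddings[0]

def mean_embedding(token_embeddings, attention_mask):
    """BERT-avg style: average output embeddings over non-padded positions.

    token_embeddings: (n, d) array; attention_mask: length-n array of 0/1.
    Masking out padding is an assumption of this sketch.
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]   # (n, 1)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

tokens = np.array([[1.0, 1.0], [2.0, 4.0], [9.0, 9.0]])   # last row is padding
mask = [1, 1, 0]
cls_vec = cls_embedding(tokens)
avg_vec = mean_embedding(tokens, mask)
```

The same mean pooling is what the paper applies to the trained encoder's outputs to obtain the final sentence embedding.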

Dialogue-based self-supervised learning methods
In this line, we mainly consider the siamese-networks commonly applied in dialogue-based research. Considering that none of the previous works (Henderson et al., 2020) employs a pre-trained language model as the encoder, we re-implement two BERT-based siamese-network models according to their original approaches. The first baseline, SiameseBERT$_s$, is a siamese-network which shares the architecture with (Reimers and Gurevych, 2019). It is equipped with a non-linear activation function in the matching layer to model the heterogeneous matching relationships between the context and the response. The second baseline, SiameseBERT$_m$, has a similar architecture to (Henderson et al., 2020). It flattens the multi-turn context and takes the token sequence as input. There is also an MLP layer on top of the sentence encoders.
Table 2: Evaluation results on the dialogue-based semantic textual similarity (D-STS) task and the semantic retrieval (SR) task. Corr. refers to Spearman's correlation metric for the D-STS task. MAP and MRR are metrics for the SR task. Reported numbers are in percentages.

Implementation Details
Our approach is implemented in TensorFlow (Abadi et al., 2016) with CUDA 10.0 support. For all datasets, we continue pre-training BERT for approximately 0.5 epochs to improve its domain adaptation ability while keeping as much general-domain information as possible. During the continued pre-training stage, we use a masking probability of 0.15, a learning rate of 2e-5, a batch size of 50, and a maximum of 10 masked LM predictions per sequence. During the contrastive learning stage, we freeze the bottom 6 layers of BERT to prevent catastrophic forgetting, which simultaneously enables the model to be trained with a larger batch size. Such a setting achieves the best performance in our experiments. The batch size, the learning rate, and the number of context turns are set to 20, 5e-5, and 3 respectively. The maximum sequence length is set to 100, 50, and 50 for JDDC, MDC, and ECD respectively, for both the continued pre-training stage and the contrastive learning stage. All models are trained on 4 Tesla V100 GPUs.

Evaluation Results
There are even larger improvements between DialogueCSE and the domain-adaptive baselines including BERT(adapt) and its variants. We attribute this improvement to two main reasons: First, by introducing contrastive learning, DialogueCSE eliminates the gap between training and evaluating, gaining significant improvements on both SR and D-STS tasks. Second, DialogueCSE models the semantic relationships in each utterance-response pair, which distills the important information at turn-level from the multi-turn dialogue context and achieves better performance. Moreover, by comparing the performances of DialogueCSE $I_1$ and DialogueCSE $I_2$, we find that the weighted sum aggregation strategy surprisingly brings a significant deterioration on all metrics. We consider that this is because the weighted sum operation breaks down the turn-level unbiased aggregation process. Since the attention mechanism tends to provide shortcuts for the model to achieve its learning objective, the long-tail utterances in the context may be partially ignored, thus leading to a decline in embedding performance. We hold that we can completely dismiss the weighted sum aggregation strategy in DialogueCSE since the token-level matching operation in MGE has implicitly served this role.
We also notice that BERT(adapt) achieves significantly better performance than the original BERT, especially on JDDC and ECD. It demonstrates the importance of continued pre-training with the in-domain training data. Without such a procedure, the in-domain data cannot be fully exploited, making it difficult for the model to achieve satisfactory performance. This also indicates that the MLM pre-training task is indeed a powerful task for learning effective sentence embeddings from texts, especially when the domain training data is sufficient.

Discussion
We conduct comparison and hyper-parameter experiments in the following section to study how our model performs with different numbers of turns, data scales, temperature hyper-parameter, and numbers of negative samples.

Comparison with Baseline
In this section, we choose SiameseBERT$_m$ as a comparison method. MAP and Spearman's correlation metrics are adopted in these experiments.
Impact of turn number. Figure 2 shows the performance of our model and the baseline under different numbers of turns on all datasets. From the results, we observe that our model indeed benefits from the multi-turn dialogue context, and it exhibits consistently better performance than the baseline. The performance of our model increases as the turn number increases until it reaches approximately 3. When the turn number grows further, the performance of both models begins to drop. We believe that in this case, adding more dialogue context brings too much noise. Since MGE acts as a noise filter at both token and turn level, it makes the model more robust when using more context turns.
Impact of data scale. We further explore whether our model is robust when fewer training samples are given. We select JDDC and ECD in this experiment since they are large-scale and topically diverse, which is suitable for simulating a few-shot learning scenario. Figure 3 shows the performances of our model and the baseline under different numbers of training dialogues. As the figure reveals, the performance gaps between our model and the baseline are even larger when fewer training dialogue sessions are given. In particular, when using only a few dialogues, our model can achieve even superior performance over the SiameseBERT trained on larger datasets, especially on the D-STS task. We think this is reasonable since the siamese-networks introduce a large number of parameters to model the semantic matching relationships, while our model accomplishes this goal without introducing any additional parameters.

Hyper-parameter Evaluations
We further conduct experiments on JDDC and ECD to study how our model is influenced by the temperature τ and the number of negative samples. The MDC dataset is excluded here since the semantics of its utterances are highly centralized around a few top intents.
Impact of temperature. Table 3 shows the experimental results with different τ values. We find that the Spearman's correlations increase monotonically as τ increases until 0.1 for JDDC and 0.2 for ECD, then they begin to drop. The MAP metrics also increase as τ increases until 0.1 for both datasets, but they remain stable as τ varies from 0.1 to 0.5. We consider this is due to the coarse-grained nature of the SR task. When τ approaches 0.1, our model can gradually distinguish among different fine-grained semantics, thus achieving better performance on both SR and D-STS tasks. As τ continues to increase, the model forces the sentence embeddings to be closer, resulting in a decrease in Spearman's correlation. However, as all positive samples in the candidates have identical labels, such degradation may not be fully reflected through the ranking metric (e.g. MAP) or may even be masked as the number of retrieved positive samples changes.
Impact of negative samples. We vary the number of negative samples for each positive sample within {1, 4, 9, 19}. Table 4 shows the experimental results, from which we find that both metrics improve slightly when the number of negative samples increases. Considering the similar observation in (Gao et al., 2021;Yan et al., 2021), we conclude this phenomenon may be related to the discrete nature of language. Specifically, as the generation of the sentence embeddings in our approach is guided and constrained by the token-level interaction mechanism, our model is more robust than the other contrastive learning approaches and is even effective when only one negative sample is provided.

Conclusion
In this work, we propose DialogueCSE, a dialogue-based contrastive learning approach to learn sentence embeddings from dialogues. We also propose uniform evaluation benchmarks for evaluating the quality of the dialogue-based sentence embeddings. Evaluation results show that DialogueCSE achieves the best results among the compared methods while adding no additional parameters. In the next step, we will study how to introduce more interaction information to learn the sentence embeddings and try to incorporate the contrastive learning method into the pre-training stage.