Dialogue-oriented Pre-training

Pre-trained language models (PrLMs) have been shown to be powerful in enhancing a broad range of downstream tasks, including various dialogue-related ones. However, PrLMs are usually trained on general plain text with common language model (LM) objectives, which cannot sufficiently capture dialogue-exclusive features, so there is an immediate need to fill the gap between a specific dialogue task and the LM task. As it is unlikely that huge dialogue corpora can be collected for dialogue-oriented pre-training, in this paper we propose three strategies to simulate conversation features on general plain text. Our method differs from existing post-training methods in that it yields a general-purpose PrLM that is not specialized to any particular task, while still learning dialogue-related features including speaker awareness, continuity, and consistency. The resulting Dialog-PrLM is fine-tuned on three public multi-turn dialogue datasets and achieves significant and consistent improvements over the plain PrLMs.


Introduction
Recently, pre-trained language models (PrLMs) have shown impressive improvements on various downstream NLP tasks (Ouyang et al., 2021; Radford et al., 2018; Yang et al., 2019; Zhang et al., 2020c; Clark et al., 2020; Li et al., 2021), including the response selection task for multi-turn dialogues, which takes a dialogue history as input and aims to select the most suitable response from a collection of candidates (Zhou et al., 2018b; Zhu et al., 2018; Tao et al., 2019; Gu et al., 2019).

Table 1: A multi-turn dialogue example with interleaved or continuous utterances between two speakers.
The pre-training tasks of these PrLMs mostly concentrate on two aspects: token prediction and sentence relation prediction. For example, the generic BERT model (Devlin et al., 2019) uses masked language modeling (MLM) and next sentence prediction (NSP) objectives; ALBERT (Lan et al., 2020) predicts sentence order rather than NSP; ELECTRA (Clark et al., 2020) turns MLM into a generate-then-discriminate process similar to a GAN (Goodfellow et al., 2014). However, these tasks only incorporate token-level and sentence-level semantic information into embeddings and are not sufficiently compatible with dialogue-oriented characteristics. Table 1 shows a multi-turn dialogue example. Compared with plain text, the utterance turn and speaker role keep shifting as a conversation goes on, and the next utterance should stay continuous and consistent with the context. Besides, the two speakers may not follow a strict alternation: one speaker may continuously produce multiple utterances. Although some existing works have noticed this non-linear nature of multi-turn dialogues, they are limited to post-training or pre-training in a specific domain and do not provide general-purpose dialogue-oriented PrLMs that fundamentally solve this problem (Xu et al., 2021a; Whang et al., 2021; Wolf et al., 2019; Zhang et al., 2020b; Henderson et al., 2020; Bao et al., 2020).
In this work, we make the first attempt to train a general-purpose dialogue-oriented PrLM. However, such a PrLM would ordinarily have to be trained on huge dialogue data, which is hard to collect. Thus we propose three novel pre-training strategies (i.e., Insertion, Deletion, Replacement) so that plain text originally used for common PrLM training can simulate dialogue-like features. The resulting model, which we denote Dialog-PrLM, is then capable of effectively learning speaker awareness, continuity, and consistency in a general way. In particular, for convenient use in downstream dialogue tasks, we introduce a special token [SOT] before each utterance to mark the start of a turn, and the three strategies learn from its representation. These targeted pre-training tasks enable [SOT] to better represent each context utterance. Since we mimic dialogue-related features on conventional plain text, similar techniques could potentially be adopted in domains other than dialogue.
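For illustration, a minimal helper that builds such a [SOT]-marked sequence might look like the following. This is a sketch: the exact [CLS]/[SEP] framing around the [SOT]-prefixed turns is our assumption, not a detail confirmed by the text.

```python
def add_sot_markers(utterances, response=None):
    """Prefix every turn (and optionally a candidate response) with the special
    [SOT] token so the encoder sees an explicit start-of-turn marker.
    The [CLS]/[SEP] framing shown here is an assumption for illustration."""
    turns = list(utterances) + ([response] if response is not None else [])
    body = " ".join("[SOT] " + t.strip() for t in turns)
    return "[CLS] " + body + " [SEP]"


# Example: two context turns plus one candidate response.
print(add_sot_markers(["hi, any update?", "still testing"], "ok, thanks"))
```

In a real pipeline this string would be passed to the PrLM's subword tokenizer, with [SOT] registered as an additional special token so it is never split.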
Our pre-trained Dialog-PrLM is fine-tuned on three multi-turn dialogue response selection benchmarks, and obtains significant and consistent improvements over the plain PrLMs.

Related Work
For the multi-turn dialogue response selection task, earlier works conduct single-turn matching, which concatenates all the utterances in the dialogue history, or considers only the last one, to match with the candidate response (Lowe et al., 2015; Kadlec et al., 2015; Tan et al., 2016; Wan et al., 2016; Wang and Jiang, 2016). More recently, works tend to model the interaction between each dialogue utterance and the response, usually adopting the encoding-matching-aggregation paradigm (Zhou et al., 2018a,b; Tao et al., 2019; Yuan et al., 2019). After encoding, distinct matching networks generate features for each utterance, which are usually passed to a GRU (Cho et al., 2014) to be aggregated into a final matching score. Besides, some works adopt topic information (Wu et al., 2018; Xu et al., 2021b) or conversation disentanglement to select the proper response.
More and more work uses powerful PrLMs as the model encoder, such as BERT (Devlin et al., 2019), RoBERTa, ALBERT (Lan et al., 2020), and ELECTRA (Clark et al., 2020). Considering the domain difference between target tasks and the general corpus used for PrLM pre-training, recent studies conduct post-training on target multi-turn dialogue datasets to incorporate in-domain knowledge (Whang et al., 2020; Lu et al., 2020; Gu et al., 2020; Xu et al., 2021a; Whang et al., 2021). Whang et al. (2020) conduct post-training with the MLM and NSP tasks as in BERT. Rather than reusing the PrLM tasks, Xu et al. (2021a) and Whang et al. (2021) both consider auxiliary tasks through post-training to enhance response selection.
Although PrLMs trained on plain text have learned contextual semantic representations from token-level and sentence-level pre-training tasks such as MLM and NSP, they do not consider dialogue-related features like speaker role, continuity, and consistency. Some existing works (Xu et al., 2021a; Whang et al., 2021) do consider these when conducting post-training, but they are limited to a specific domain. Other works (Wolf et al., 2019; Zhang et al., 2020b; Henderson et al., 2020; Bao et al., 2020) train on open-domain conversational data such as Reddit for response selection or generation tasks, but they are limited to the original plain-text pre-training tasks and ignore dialogue-related features. Besides, some works conduct task-specific training on collected dialogue corpora, but they also suffer from biased and limited amounts of dialogue data. Different from all previous studies, we aim to obtain a general-purpose PrLM rather than targeting any specific task as post-training methods do. Meanwhile, our proposed dialogue-oriented pre-training enables the resulting PrLM to capture dialogue-related features in a general way.

Dialogue-oriented Pre-training Strategies

We propose three strategies (i.e., Insertion, Deletion, Replacement) to jointly learn dialogue-related characteristics based on the plain PrLMs. A special token [SOT] is added before each "utterance", marking the start of a turn to match the realistic scene of turn shift. The three tasks use the embedding of [SOT] to represent each utterance and conduct targeted pre-training, which enables [SOT] to learn dialogue-related representations of speaker awareness, continuity, and consistency respectively. Figure 1 shows an overview of the three strategies.
Insertion. In a real scenario, the speaker role might or might not shift at each turn as a conversation goes on: two speakers may take strict turns, or one speaker may continuously produce multiple utterances. Considering a conversation session of four sentences between speakers A and B, we consider three possible cases: AABB, ABAB, and ABBA; the next time A speaks happens 0, 1, or 2 turns after A's previous utterance. To make Dialog-PrLM aware of speaker role information, we first simulate a two-party conversation on Wikipedia. We sample two continuous sentences {u_A1, u_A2} from one paragraph of an article as the two utterances of A, and two continuous sentences {u_B1, u_B2} from the following paragraph of the same article as what B says. Sampling from the same article ensures they are roughly about one topic, which is in line with the realistic scenario and increases the difficulty of prediction. The "continuous" sentences simulate A continuing to express an opinion after being interrupted by B. We insert u_A2 into {u_B1, u_B2} and add [SOT] before each utterance. We never disrupt the utterance order within one speaker, and we keep the overall order of A first, then B; in this way we obtain the three cases above. One possible input case (ABAB) is:

[CLS] [SOT] u_A1 [SOT] u_B1 [SOT] u_A2 [SOT] u_B2 [SEP]

The insertion task is to predict the next utterance of A. The whole sequence is encoded by the PrLM, and we compute the cosine similarity of the [SOT] embedding of u_A1 with those of the other three utterances as matching scores to predict u_A2. These three scores are passed to a softmax layer, and cross entropy is used as the insertion loss L_ins.
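The insertion scoring step (cosine similarities from u_A1's [SOT] embedding, softmax, cross entropy) can be sketched as follows. This is a simplified NumPy illustration over fixed embedding vectors; in real training the loss is backpropagated through the PrLM encoder.

```python
import numpy as np

def insertion_loss(sot_embs, anchor, target):
    """Cross-entropy loss over cosine similarities between the anchor
    utterance's [SOT] embedding (u_A1) and every other [SOT] embedding.
    `target` is the index of the true next utterance of A (u_A2)."""
    e = np.asarray(sot_embs, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)   # unit-normalize rows
    candidates = [j for j in range(len(e)) if j != anchor]
    scores = np.array([e[anchor] @ e[j] for j in candidates])  # cosine sims
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                # softmax over scores
    return -np.log(probs[candidates.index(target)])     # cross entropy
```

When the target utterance's embedding is close to the anchor's, the loss is lower than the uniform baseline of ln 3 over three candidates.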
Deletion. Plain pre-training tasks like MLM, NSP, or SOP only enable the model to learn token-level or sentence-level semantic information; they fail to catch dialogue-related signals like continuity, which also helps choose an answer that is coherent with the context. We sample k continuous sentences {u_1, u_2, ..., u_k} from one paragraph and randomly delete a u_i from the first k-1 (we do not choose u_k, as there is no [SOT] after u_{k-1}). The input sequence of the PrLM is:

[CLS] [SOT] u_1 ... [SOT] u_{i-1} [SOT] u_{i+1} ... [SOT] u_k [SEP] [SOT] u_i [SEP]

That is, we append u_i at the end and use [SEP] for separation. Similarly, we compute the cosine similarity of the [SOT] embedding of u_i with those of the remaining k-1 utterances to predict the [SOT] of u_{i+1}, i.e., the position where u_i should be inserted back. These k-1 scores are passed to a softmax layer, and cross entropy is used as the deletion loss L_del.
Replacement. The replacement task makes Dialog-PrLM recognize an inconsistent utterance within a dialogue session, so that it can select a response that is consistent with the context in both style and content. Similar to deletion, we sample k continuous sentences {u_1, u_2, ..., u_k} from one paragraph, then sample one sentence u_r from another article and use it to replace a randomly chosen u_i in {u_1, u_2, ..., u_k}. The input sequence is:

[CLS] [SOT] u_1 ... [SOT] u_{i-1} [SOT] u_r [SOT] u_{i+1} ... [SOT] u_k [SEP]

Each [SOT] embedding is gathered after encoding and passed to a linear layer to get a score:

s_j = W_r E_j + b_r,  for j = 1, ..., i-1, r, i+1, ..., k,

where W_r and b_r are trainable parameters and E_j is the embedding of the j-th [SOT]. These k scores are passed to a softmax layer, and cross entropy is used as the replacement loss L_rep. We adopt multi-task learning and define the final dialogue-oriented pre-training loss L_gen as:

L_gen = L_ins + L_del + L_rep
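A minimal sketch of the replacement scorer and the joint pre-training loss follows. It assumes a shared linear layer producing one scalar score per [SOT] embedding, and an equal-weight sum of the three losses; the weighting scheme is our assumption, not stated in the text.

```python
import numpy as np

def replacement_loss(sot_embs, W_r, b_r, replaced_idx):
    """Score every [SOT] embedding with a shared linear layer
    (s_j = W_r . E_j + b_r), softmax over the k scores, then cross entropy
    against the position of the replaced (inconsistent) utterance."""
    E = np.asarray(sot_embs, dtype=float)
    scores = E @ W_r + b_r                 # one scalar score per utterance
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                   # softmax
    return -np.log(probs[replaced_idx])    # cross entropy

def pretrain_loss(l_ins, l_del, l_rep):
    """Equal-weight multi-task sum of the three losses (an assumption)."""
    return l_ins + l_del + l_rep
```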

Use of Dialogue-oriented Pre-training
Our Dialog-PrLM may be used with domain fine-tuning or multi-task learning: (1) Domain fine-tuning: Dialog-PrLM is fine-tuned on the target response selection task. (2) Specific post-training: our pre-training strategies are slightly adjusted and applied to specific multi-turn dialogue datasets. (3) Domain multi-task learning: the target response selection task is learned jointly with the three auxiliary post-training tasks from (2) on Dialog-PrLM.

Domain Fine-tuning
After our dialogue-oriented pre-training, the target response selection task can be fine-tuned on Dialog-PrLM. We denote the dataset as D = {(C, R, Y)}, where C is the dialogue context, R is the candidate response, and Y ∈ {0, 1} is the label indicating whether R is a proper response for C. Here C = {U_1, ..., U_n}, and U_i (1 ≤ i ≤ n) is the i-th utterance in context C. We concatenate all utterances {U_i}, i = 1, ..., n, as well as the response R, adding [SOT] before each, to obtain the sequence:

[CLS] [SOT] U_1 ... [SOT] U_n [SOT] R [SEP]

With this pre-training, Dialog-PrLM becomes capable of effectively representing each utterance. Therefore, rather than directly using [CLS], we pass all the [SOT] embeddings E to a GRU to model the sequential interaction of the context and response, whose final hidden state H is used to generate the matching score s for C and R.
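This matching head can be sketched with a minimal single-cell NumPy GRU. The hidden size, the sigmoid output layer s = sigmoid(w·H + b), and all dimensions are illustrative assumptions; the paper's exact head is not specified in this excerpt.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_matching_score(sot_embs, params):
    """Run the sequence of [SOT] embeddings through one GRU cell and map the
    final hidden state H to a matching score s = sigmoid(w.H + b)."""
    Wz, Uz, Wr, Ur, Wh, Uh, w, b = params
    h = np.zeros(Uz.shape[0])
    for x in np.asarray(sot_embs, dtype=float):
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return float(sigmoid(w @ h + b))              # score in (0, 1)
```

In practice this head would be a framework GRU (e.g., a recurrent layer in a deep-learning library) trained jointly with the encoder; the hand-rolled cell here only makes the computation explicit.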

Specific Post-training
Because the target dataset usually concentrates on a specific domain, existing works tend to introduce self-supervised post-training on the target domain to incorporate in-domain knowledge. Here we also apply the three strategies to the target multi-turn dialogue dataset as self-supervised auxiliary tasks, which are learned jointly with the response selection task of Section 4.1.
To conduct the three auxiliary tasks, we first sample from the multi-turn dialogues to build post-training datasets. For the insertion task, there is a small difference from Wikipedia. We randomly choose k continuous utterances {u_1, u_2, ..., u_k} from a dialogue, fix u_1, and randomly insert u_2 into any interval among {u_3, ..., u_k}. The input sequence is:

[CLS] [SOT] u_1 [SOT] u_3 ... [SOT] u_2 ... [SOT] u_k [SEP]

We compute the cosine similarity of the [SOT] embedding of u_1 with those of the following utterances as matching scores to predict u_2; the loss is denoted L_ins. Since the following turn (u_2) tends to be more related to u_1 than the next utterance by u_1's speaker (denoted u_t), we do not predict u_t as we do on Wikipedia. Both settings, however, are expected to recognize the utterance most related to u_1, which helps select the proper response.
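The in-domain insertion sampling can be sketched as follows. The exact convention for which intervals u_2 may land in is our reading of the description, and all names are illustrative.

```python
import random

def sample_insertion_example(utterances, k=5, rng=None):
    """Take k continuous utterances from a dialogue, fix u1, and move u2 into
    a random slot among u3..uk. Returns the reordered turns and the true index
    of u2 (the prediction target). The slot range is our interpretation."""
    rng = rng or random
    assert len(utterances) >= k
    start = rng.randrange(len(utterances) - k + 1)
    u = utterances[start:start + k]
    rest, u2 = [u[0]] + u[2:], u[1]          # fix u1, set u2 aside
    slot = rng.randrange(2, len(rest) + 1)   # somewhere among u3..uk
    rest.insert(slot, u2)
    return rest, slot
```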
For the deletion and replacement tasks, we sample k continuous utterances from one dialogue and conduct deletion or replacement in the same way as on Wikipedia. The post-training losses for the two tasks on the target domain are denoted L_del and L_rep respectively.

Domain Multi-task Learning
We apply the multi-task learning framework on the target domain to our Dialog-PrLM. Carrying dialogue-related features from the general corpus, Dialog-PrLM is expected to learn from the target domain together with the target task. We train the response selection task in the same way as in Section 4.1 and denote its loss as L_reselect. The final loss sums the response selection loss and the three auxiliary task losses:

L = L_reselect + L_ins + L_del + L_rep

Implementation of Dialogue-oriented Pre-training
For dialogue-oriented pre-training, we sample training and validation sets for the insertion, deletion, and replacement tasks from both English and Chinese Wikipedia. To prevent information leakage (e.g., the model peeking at the correct utterance order from deletion samples while conducting replacement), we divide all articles into three disjoint, equal-sized sets and sample from one set per task. Data statistics are in Table 2. For English Wikipedia (1,060,131 articles in total), we sample from 330,000/10,000 articles for training/evaluation for each task. To ensure data quality, we omit the "References" and "Literature" parts. For insertion, we sample twice disjointly from one paragraph that has more than 4 sentences, then twice from the next satisfactory one, to construct two training samples; this continues until the end of an article. For deletion and replacement, we sample k continuous sentences as a training sample from each paragraph with more than k sentences. We limit each sample to a maximum of 400 words to prevent overflow after tokenization.
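The deletion-style sampling described above can be sketched like this. It is a simplified version under stated assumptions: the word-budget check only approximates the 400-word pre-tokenization limit, and the function names are ours.

```python
import random

def sample_deletion_example(paragraph, k=5, max_words=400, rng=None):
    """From a paragraph (list of sentences) with more than k sentences, take k
    continuous ones and delete one of the first k-1. Returns (context, deleted
    sentence, true slot), or None if the paragraph is too short or too long."""
    rng = rng or random
    if len(paragraph) <= k:
        return None
    start = rng.randrange(len(paragraph) - k + 1)
    window = paragraph[start:start + k]
    if sum(len(s.split()) for s in window) > max_words:
        return None                        # approximate the 400-word budget
    i = rng.randrange(k - 1)               # never delete the last sentence
    deleted = window.pop(i)
    return window, deleted, i
```

The replacement sampler differs only in that it swaps a random window sentence for one drawn from a different article rather than deleting it.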
For Chinese Wikipedia (262,405 articles in total), we sample from 67,468/20,000 articles for training/evaluation for each task. As the Chinese corpus is much smaller than the English one, for the deletion and replacement tasks we sample as many disjoint spans of k continuous sentences as possible from each paragraph with more than k sentences. For insertion, we conduct sampling the same way as for English. The maximum length of each sample is limited to 450.

Datasets
For the target response selection task, our Dialog-PrLM is fine-tuned on three widely used benchmark datasets: (1) E-commerce Corpus: conversations between customers and shopkeepers from Taobao, the largest e-commerce platform in China. (2) Douban Corpus: multi-turn conversations from the Douban group, a popular social networking service in China. (3) Ubuntu Corpus (v1.0) (Lowe et al., 2015): English multi-turn conversations about technical support collected from chat logs of the Ubuntu forum.
As to the three auxiliary tasks for domain multi-task learning, we sample from the batch while training the response selection task. Dialogues with fewer than 3 utterances are discarded. Different from general pre-training, we do not require every dialogue to have at least k sentences for all three tasks.
For evaluation, we use the same R_n@k metric as previous works, which selects the k best-matching candidate responses among n and computes the recall of the true ones. We also report MAP (Mean Average Precision), MRR (Mean Reciprocal Rank), and precision at one (P@1), as in previous works.
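R_n@k can be computed with a straightforward sketch like the following (MAP, MRR, and P@1 follow the same ranking-based pattern):

```python
def recall_at_k(score_lists, label_lists, k):
    """R_n@k: the fraction of test examples whose true response appears among
    the k highest-scored of its n candidates (labels are 1 for true ones)."""
    hits = 0
    for scores, labels in zip(score_lists, label_lists):
        top_k = sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
        hits += int(any(labels[j] == 1 for j in top_k))
    return hits / len(score_lists)
```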
For our dialogue-oriented pre-training on Wikipedia, the maximum input sequence length is set to 512 after WordPiece tokenization. We set the learning rate to 2e-5 with a warmup proportion of 10%. The plain PrLMs are continuously pre-trained with a batch size of 8 per task for BERT and 16 per task for ELECTRA. Each model is trained for 1 epoch and evaluated every 10,000 steps. The model with the best average accuracy over the three tasks is saved as Dialog-PrLM. The pre-training experiments require 4 NVIDIA RTX 2080 GPUs.
For fine-tuning on our Dialog-PrLMs, the batch size is 32 and the maximum sequence length is 350. The model is trained for 5 epochs and evaluated after each epoch, on all three datasets and both Dialog-PrLMs. Other settings are the same as in dialogue-oriented pre-training. For domain multi-task learning on our Dialog-PrLMs, the batch size is 16, and the number of epochs is 3 for Douban on BERT, 4 for Ubuntu on ELECTRA, and 5 for the other cases. Other settings are the same as in fine-tuning. The k value for both pre-training and domain multi-task learning is 5. The fine-tuning/multi-task learning experiments require 1/2 NVIDIA RTX 2080 GPUs.

Results
To verify the effectiveness of our method, we conduct extensive empirical studies on three multi-turn dialogue benchmarks. We are aware that applying complicated matching networks, speaker embeddings (Gu et al., 2020), or various other auxiliary tasks (Whang et al., 2020; Xu et al., 2021a; Whang et al., 2021) could achieve further improvement, but to fairly evaluate the general-purpose pre-training for dialogue tasks, we follow the standard fine-tuning procedure on Dialog-PrLM and exclude such advanced auxiliary techniques. Dialog-BERT: this model conducts dialogue-oriented pre-training on the original BERT; we fine-tune our Dialog-BERT by feeding the [SOT] embeddings to a GRU.
BERT+multi-task: the response selection task is trained with the three auxiliary tasks on the target datasets. We also add the special token [SOT]; the only difference from Section 4.3 is that the joint learning is conducted on the original BERT. Dialog-BERT+multi-task: as described in Section 4.3, we conduct domain multi-task learning on our pre-trained Dialog-BERT.
We also conduct fine-tuning on the English Ubuntu dataset with two dialogue-related models: (1) DialoGPT (Zhang et al., 2020b) is an extension of GPT-2 pre-trained from scratch on Reddit data.
(2) TOD-BERT is trained over BERT on a combination of 9 task-oriented dialogue datasets and incorporates a response selection objective. Experiments on ELECTRA are conducted in the same way as with BERT. The results are in Table 3; below, PrLM denotes BERT or ELECTRA. The unsatisfactory results from both DialoGPT and TOD-BERT demonstrate the power and universality of our proposed dialogue-oriented pre-training. Compared with PrLM-[CLS], PrLM-[SEP] performs better in general, except for a small decrease on Ubuntu and Douban with ELECTRA, which shows that modelling the sequential interaction of the dialogue context and response helps improve performance.
After conducting dialogue-oriented pre-training on Wikipedia, our Dialog-PrLM achieves further improvement on the three datasets and the two PrLMs, which shows that the three targeted training strategies enable the [SOT] token in Dialog-PrLM to grasp the dialogue-related nature (e.g., speaker awareness, continuity, consistency) and thus to represent an utterance better than [SEP] in the plain PrLM (PrLM-[SEP]).
When we train the response selection task jointly with the three auxiliary tasks on the target datasets, domain multi-task learning on our Dialog-PrLM (Dialog-PrLM+multi-task) still consistently outperforms that on the plain PrLM (PrLM+multi-task). Having re-learned broader representations on the general corpus, domain post-training further equips [SOT] with dialogue-related features from in-domain multi-turn dialogues and thus helps choose the correct response.
Comparing PrLM-[SEP] with PrLM+multi-task, and Dialog-PrLM with Dialog-PrLM+multi-task, domain multi-task learning indeed achieves improvements due to the incorporated in-domain dialogue-related knowledge, which verifies the effectiveness of our three strategies when applied to in-domain multi-turn dialogue datasets.
In conclusion, conducting dialogue-related feature pre-training with our three strategies on Wikipedia (Dialog-PrLM) yields improvements when fine-tuning, and the gains grow further when these strategies are also applied to in-domain multi-turn dialogues (Dialog-PrLM+multi-task).

Ablation Study
To investigate the contribution of each strategy, we conduct ablation experiments for both pre-training and domain multi-task learning; results are shown in Tables 4 and 5 respectively. The results in Table 4 indicate that the insertion, deletion, and replacement tasks jointly contribute to the final gains. The influence on Ubuntu appears smaller than on Douban and E-commerce, as the Ubuntu corpus contains many terminologies that rarely appear in general corpora (e.g., apt-get, lsmod, and grep) (Whang et al., 2020); accordingly, as BERT+multi-task in Table 3 shows, conducting domain post-training is more effective there.
We also conduct an ablation study for Douban on Dialog-BERT in Table 5 to explore the performance of the three auxiliary tasks when applied to the target multi-turn dialogue datasets. Similarly, removing any part leads to worse performance, showing the necessity of each task.

Utterance Representation Test
We add a special token [SOT] before each utterance or response to represent the following sequence. After pre-training on Wikipedia, the [SOT] of our Dialog-PrLM is expected to capture the respective utterance representation through the three dialogue-oriented strategies. To explore the semantic information in the [SOT] of our Dialog-BERT, we calculate the cosine similarity of the correct response with each utterance in the dialogue context ([SOT]). Table 6 lists examples from the three target datasets. For comparison, we use BERT to encode each utterance or response and use [CLS] for the calculation ([CLS]). We also concatenate the utterances and the response separated by [SEP] on BERT and then split the sequence to calculate similarity ([SEP]).
We observe that for all examples, neither BERT ([CLS]) nor BERT ([SEP]) can discriminate which utterance is related to the correct answer: all utterances, including irrelevant ones, are treated much the same, which introduces considerable noise for response selection.
After dialogue-oriented pre-training on Wikipedia, Dialog-BERT learns a stark sense of "relevant" versus "irrelevant": it is able to concentrate on the most critical utterances and separate them from the irrelevant ones by a large margin.
For the Douban example, Dialog-BERT realizes that the second and last utterances are the most relevant: the response asks for someone to travel together, and it relates to the second utterance, which expresses a wish to go, and to the last one, which mentions a group to travel with. Dialog-BERT ignores the noise about accommodation and transportation, and selects the related utterances among the noise rather than just using the last one. For Ubuntu, Dialog-BERT concentrates on the utterances about the script and ignores the earlier background information. For E-commerce, it recognizes the related last few utterances about delivery and ignores the earlier packaging information. These examples from the target datasets show that our Dialog-BERT has absorbed related knowledge from our proposed dialogue-oriented pre-training. The [SOT] token can better represent each utterance, which could also be utilized in other representation-centric tasks such as multi-party dialogue disentanglement.

Conclusion

This paper presents a novel general-purpose solution for dialogue tasks with pre-trained language models. To fill the gap between a specific task and the LM task of a PrLM, we propose dialogue-oriented pre-training on a large scale of artificially built dialogue data, which lets the resulting Dialog-PrLM enjoy both the merits of being general-purpose and of capturing key dialogue-related features, including speaker awareness, continuity, and consistency. Our models are evaluated on three benchmark response selection datasets and achieve consistent performance improvements over the plain PrLMs.