PRAL: A Tailored Pre-Training Model for Task-Oriented Dialog Generation

Large pre-trained language generation models such as GPT-2 have demonstrated their effectiveness as language priors by reaching state-of-the-art results in various language generation tasks. However, the performance of pre-trained models on task-oriented dialog tasks is still under-explored. We propose a Pre-trained Role Alternating Language model (PRAL), explicitly designed for task-oriented conversational systems. We design several techniques, namely start position randomization, knowledge distillation, and history discount, to improve pre-training performance. In addition, we introduce a high-quality large-scale task-oriented dialog pre-training dataset by post-processing 13 dialog datasets. We effectively adapt PRAL on three downstream tasks. The results show that PRAL outperforms or is on par with state-of-the-art models.


Introduction and Related Work
Current approaches to building task-oriented dialog systems still require a substantial amount of annotation and are therefore labor-intensive. On the other hand, large-scale pre-trained language models such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019) have achieved great success on various NLP tasks. There have been several attempts to apply these language models to dialog systems directly. For example, TransferTransfo (Wolf et al., 2019) fine-tuned GPT on the Persona-Chat dataset (Zhang et al., 2018b) and achieved state-of-the-art performance on chitchat dialog generation. DialoGPT (Zhang et al., 2020) utilizes a large Reddit corpus to further pre-train GPT-2. All of these studies point to a promising direction: building dialog systems with large-scale language models and less annotation.
* Equal contribution
However, these language models still have some limitations when applied to dialog systems. First, further pre-training language models for dialog systems requires a considerable amount of training data. Small pre-training dialog datasets cannot provide the large amount of commonsense knowledge needed for dialog generation, yet a diverse collection of high-quality dialog datasets is difficult to obtain. Second, these language models usually do not consider dialog features in their structures.
To tackle these issues, we propose the Pre-trained Role Alternating Language model (PRAL), a language model designed explicitly for dialog generation. First, we collect and process 13 dialog datasets, ranging from TV transcripts to pizza ordering dialogs, to enrich the pre-training data with high-quality dialog corpora. Second, we adopt ARDM and use two separate GPT-2 models to model the two speakers in the dialog. Next, we apply Start Position Randomization (SPR) to cope with the variable lengths of dialogs, which prevents the language model from binding the position index to the text information. Additionally, we utilize a Teacher model to perform knowledge distillation and incorporate commonsense knowledge into dialog generation. Finally, we re-weight each utterance with discount factors and emphasize the later part of a dialog to better incorporate contextual information.
In summary, we propose PRAL and design several effective techniques to improve dialog model pre-training. Our pre-trained model improves the success rate on the CamRest676 and MultiWOZ datasets, and the coherence and diversity scores on PersuasionForGood. Our model is data-efficient, using 10x less training data than SOLOIST and 1000x less than DialoGPT. We also process and present a collection of high-quality dialog datasets for pre-training.

Figure 1: An overview of PRAL's architecture. PRAL has separate language models for each speaker. The representation of the user utterance $u_t$ or the system utterance $u_s$ comes from the word embedding $E$ and the randomized position embedding $SPR$. $HD(t, T)$ is the history discount weight of each utterance. Teacher GPT provides supervision for the two language models. $\mathrm{Loss}_{LM}$ and $\mathrm{Loss}_{KL}$ denote the language model loss and the KL divergence loss.

PretrainDial Dataset for Pre-training
Large clean dialog datasets are difficult to find. Therefore, we constructed PretrainDial, a large-scale multi-domain dialog corpus for dialog pre-training. PretrainDial is a large-scale pre-training dataset and can only be collected from existing dialogs. We carefully selected 13 existing dialog corpora, ranging from chitchat such as TV transcripts to task-oriented dialogs, and designed a sophisticated text processing pipeline.

Methods
In this section, we will first briefly introduce ARDM, our base dialog model, and then describe a set of techniques proposed in PRAL. Figure 1 shows the main structure of PRAL.

Alternating Roles Dialog Model
The basic idea behind ARDM is to simultaneously model the user and the system with two separate GPT-2 models to capture the different language styles of the speakers. A dialog can be considered as a sequence of utterances $d = \{u_1, s_1, u_2, s_2, \ldots, u_T, s_T\}$, where $T$ is the total number of turns. We use $p_u$ and $p_s$ to represent the probability of the user utterance and the system utterance, respectively. The dialog distribution is defined as:
$$p(d) = \prod_{t=1}^{T} p_u(u_t \mid u_{1:t-1}, s_{1:t-1}) \, p_s(s_t \mid u_{1:t}, s_{1:t-1}) \qquad (1)$$
However, ARDM does not contain prior knowledge about dialog. In contrast, PRAL is designed for dialog systems and absorbs abundant dialog knowledge during the pre-training process. To further improve ARDM and other dialog generation models, we propose three effective techniques that improve pre-training efficiency.
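As a rough illustration of the role-alternating idea, the sketch below sums log-probabilities over a dialog while switching between a user scorer and a system scorer. The `Scorer` interface and the toy scorers are hypothetical stand-ins for the two GPT-2 models and are not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: maps (history, utterance) to log p(utterance | history).
Scorer = Callable[[List[str], str], float]


def dialog_log_prob(turns: List[Tuple[str, str]],
                    user_lm: Scorer,
                    system_lm: Scorer) -> float:
    """Sum the log-probabilities of a dialog, alternating between two role scorers."""
    history: List[str] = []
    total = 0.0
    for user_utt, sys_utt in turns:
        total += user_lm(history, user_utt)    # user model scores the user turn
        history.append(user_utt)
        total += system_lm(history, sys_utt)   # system model scores the system turn
        history.append(sys_utt)
    return total


# Toy scorers for illustration only: every word costs a fixed log-probability.
def toy_user_lm(history: List[str], utt: str) -> float:
    return -1.0 * len(utt.split())


def toy_system_lm(history: List[str], utt: str) -> float:
    return -0.5 * len(utt.split())
```

In a real setup each scorer would be a language model conditioned on the full history; the toy versions ignore the history and only make the alternation visible.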

Start Position Randomization
We propose to randomize the start position to improve the quality of the pre-trained model. Transformer-based language models use a position embedding to encode the location of each token. The position embedding supports a fixed maximum position, and the position index always starts from 0. However, since most dialogs contain fewer than 1024 tokens, most vectors in the positional embedding would not be updated during pre-training. Moreover, the position embedding should only provide location information, but a fixed start position binds specific text to a particular position index. For example, "hi" is bound to index 0 because "hi" usually appears at the beginning of a dialog. Therefore, the model is likely to overfit on the first several positional embeddings.
To address this issue, we propose Start Position Randomization (SPR). Let $L$ be the total number of tokens in a dialog, so that the maximum start position index is $1024 - L$. We randomize the start position to be any number between 0 and $1024 - L$. This disentangles the positional information from the textual meaning and forces the model to update all the positional embeddings.
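A minimal sketch of how position indices could be drawn under SPR is given below. The function name and the constant `MAX_POSITIONS` are illustrative, and uniform sampling over the allowed range is an assumption, since the text only says the start position is randomized.

```python
import random
from typing import List, Optional

MAX_POSITIONS = 1024  # size of the position-embedding table mentioned in the text


def randomized_position_ids(num_tokens: int,
                            rng: Optional[random.Random] = None) -> List[int]:
    """Return consecutive position ids that begin at a random start index."""
    if num_tokens > MAX_POSITIONS:
        raise ValueError("dialog is longer than the position-embedding table")
    rng = rng or random
    # Assumption: the start is sampled uniformly from 0..(1024 - L), inclusive.
    start = rng.randint(0, MAX_POSITIONS - num_tokens)
    return list(range(start, start + num_tokens))
```

With a 10-token dialog, the ids might run from 37 to 46 in one epoch and from 912 to 921 in the next, so every row of the embedding table can receive gradient updates over time.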

Teacher GPT
Neural network models suffer from catastrophic forgetting (Kirkpatrick et al., 2016). Since we fine-tune GPT-2 on a new dialog corpus, the updated model is at risk of forgetting the prior knowledge in the original GPT-2. The Teacher Model is used to calculate the distillation loss (Hinton et al., 2015) between the fixed GPT and our two language models. It constrains each language model from generating a token distribution that is too different from the original token distribution. The Teacher Model has two functions. First, it prevents the language model from catastrophically forgetting the knowledge in the original GPT-2 weights (Kirkpatrick et al., 2016). Second, when GPT-2 Large is used as the Teacher Model, it imparts additional knowledge to our language models. The ablation study in Table 2a validates both functions.
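The distillation term can be illustrated for a single token position as a KL divergence between the teacher's and the student's next-token distributions. This is only a sketch: the direction KL(teacher || student) and the absence of a softening temperature are assumptions, and the helper names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float]) -> float:
    # Assumption: the fixed teacher is the reference distribution.
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions drift apart, which is the constraint the paragraph above describes.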

History Discount
In dialog generation, historical utterances closer to the current utterance should have a more significant impact on generation than those further away, because in human conversations we also tend to prioritize local coherence over coherence with distant history. Therefore, we introduce a discount factor $\gamma$ to re-weight the importance of each utterance based on its turn number. For a dialog with $T$ utterances in total and current utterance index $t$, we weight the language model loss by $\gamma^{T-t}$. By incorporating the discount factor $\gamma$, the model focuses more on recent history in the generation process.
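To make the weighting concrete, the sketch below computes the per-utterance weights for a dialog of a given length. The function name is illustrative, and utterances are assumed to be indexed from 1 to the total count so that the last utterance receives weight 1.

```python
from typing import List


def history_discount_weights(num_utterances: int, gamma: float = 0.95) -> List[float]:
    """Weight gamma ** (T - t) for utterance t = 1..T, where T is num_utterances."""
    return [gamma ** (num_utterances - t) for t in range(1, num_utterances + 1)]


# Example with the discount factor reported in the experiments (0.95).
weights = history_discount_weights(4)
for turn, weight in enumerate(weights, start=1):
    print(f"utterance {turn}: weight {weight:.4f}")
```

The weights increase monotonically toward the end of the dialog, so earlier utterances contribute less to the loss.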

Optimization
We use a language modeling loss to optimize our model, shown in Equation 2.
$$\mathrm{Loss}_{LM} = \sum_{t=1}^{T} \gamma^{T-t} \sum_{l=1}^{L_t} \mathrm{CE}\left(P_{t(l+1)}, G_{t(l+1)}\right) \qquad (2)$$
CE denotes the cross-entropy loss. $T$ is the total number of utterances in a dialog, and $L_t$ is the total number of tokens in the $t$-th utterance. The loss of each utterance $t$ in the dialog is weighted by the discount factor described in Section 3.4. We combine the loss from all words as the cross-entropy between the output probability distribution $P_{t(l+1)}$ and the ground truth $G_{t(l+1)}$.
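The final loss combines the language model loss and the KL divergence loss:
$$\mathrm{Loss} = \mathrm{Loss}_{LM} + \alpha \, \mathrm{Loss}_{KL} \qquad (3)$$
The factor $\alpha$ is used to expedite model convergence, and it decreases exponentially as the number of iterations increases, i.e., $\alpha = \alpha_0 \lambda^{\mathrm{iter}}$.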
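The exponential schedule for the factor can be sketched as follows, using the values reported in the experiments (an initial value of 0.1 and a decay rate of 0.9999). The function names are illustrative, and the sketch assumes that the factor scales the KL term while the language model loss is left unscaled.

```python
def alpha_at(iteration: int, alpha0: float = 0.1, lam: float = 0.9999) -> float:
    """Exponentially decayed factor: alpha0 * lam ** iteration."""
    return alpha0 * lam ** iteration


def total_loss(lm_loss: float, kl_loss: float, iteration: int) -> float:
    # Assumption: alpha weights the KL (distillation) term only.
    return lm_loss + alpha_at(iteration) * kl_loss


# The factor shrinks slowly; after 10,000 iterations it is roughly 0.1 / e.
for step in (0, 1000, 10000, 50000):
    print(step, round(alpha_at(step), 5))
```

Under this assumption the teacher's influence is strongest early in pre-training and fades as the language modeling objective takes over.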

Experiments
We pre-train PRAL on PretrainDial. We use GPT-2 Large as the Teacher model. We use the AdamW optimizer with warm-up steps set to 10 percent of the training steps. The learning rate is set to $1 \times 10^{-4}$. For the loss calculation, we set $\alpha_0$ to 0.1 and $\lambda$ to 0.9999. The discount factor $\gamma$ is set to 0.95. To show generalizability, we fine-tune PRAL on three downstream dialog generation tasks, CamRest676, MultiWOZ, and PersuasionForGood, as shown in Table 2. Refer to Appendix B for more experiment settings.
CamRest676 (Rojas-Barahona et al., 2016) is a dialog dataset for restaurant recommendation containing 676 dialogs. We use BLEU-4 to measure the quality of generated sentences, and Success F1 to evaluate the responses on specific slots, such as address, phone, and postcode. Sequicity is the state-of-the-art method among task-oriented dialog models that require annotations. PRAL outperforms all other models, including the concurrent work SOLOIST (Peng et al., 2020).
Ablation studies on CamRest676 show that the Teacher GPT plays the most important role. The fact that PRAL with Teacher GPT (Small) in Table 2a outperforms PRAL without Teacher GPT (Small) shows the importance of the knowledge in the original model weights. Using GPT-2 Large as the Teacher Model yields better performance than using GPT-2 small, which validates the effect of knowledge distillation.
MultiWOZ (Budzianowski et al., 2018) contains around 10k dialogues covering various domains. We evaluate the models on BLEU-4, Inform Rate, and Success Rate, which measures whether the system provides the requested information. PRAL outperforms the attention seq2seq model, which is used as the baseline in MultiWOZ (Budzianowski et al., 2018), on all metrics. Without using any annotation, PRAL also outperforms or achieves comparable results to HDSA (Budzianowski et al., 2018), LaRL (Zhao and Kawahara, 2019), and SOLOIST. Except for HDSA, which requires both dialog state and dialog act annotations, PRAL achieves a better BLEU score than all other models. PRAL outperforms ARDM on all metrics, which further validates the effectiveness of the pre-training process.
PersuasionForGood. We also evaluate our method on PersuasionForGood (Wang et al., 2019), where a persuader tries to persuade users to donate money to children. There are a total of 1,017 dialogues. Although it is not a traditional task-oriented dialog benchmark, it is a good benchmark for human evaluation. Automatic metrics are efficient to compute but can fail to capture deeper and more complicated aspects of text quality. We also choose this task because it benefits children. Unlike CamRest676 and MultiWOZ, the language in the PersuasionForGood dataset is so diverse that the BLEU-4 scores of all models are too low to be meaningful metrics. Therefore, we use BLEU-1 and BLEU-2 instead. Our model achieves significantly higher scores on the BLEU metrics, especially on BLEU-2 (63% up). In the human evaluation, we ask evaluators how much they are willing to donate after the conversation and collect their ratings in terms of fluency, logic, coherence, and diversity. The results suggest that PRAL outperforms ARDM on all the metrics. For human evaluation details, please refer to Appendix C.
Case studies reveal some linguistic problems in ARDM, such as repetition and unnaturalness. In contrast, with pre-training, PRAL is more natural and persuasive. Please refer to Appendix D for an example of PRAL and ARDM.

Conclusion
We propose PRAL, a large pre-trained dialog system for task-oriented generation. We incorporate methods designed for large dialog systems into PRAL and achieve good performance on three downstream tasks. The model generates more fluent, coherent, diverse, and logical dialogs according to human evaluation results. We also release a high-quality dialog dataset for the pre-training process.

A Dataset sources
Our dataset contains high-quality dialogues selected from the 13 other datasets listed in Table 3.
PretrainDial is a large-scale pre-training dataset and can only be collected from existing dialogs. Due to the page limit of a short paper, we did not elaborate on the process in the main text. First, we collected dialog datasets that have been commonly used in recent years. Then we filtered out datasets according to various standards such as content appropriateness. For example, we filtered out the "Conversations Gone Awry" dataset because its conversations depend on background knowledge. Then, we processed the text in the selected datasets. This step is essential since these datasets contain unnecessary noise, especially those that contain raw text, such as the Friends dataset. The processing includes:
(1) Replace less informative entities. For example, replace a long URL link with the word "URL".
(3) Delete responses that are not written in English.
(4) Delete offensive language.
(5) In some datasets such as Reddit, the conversation involves more than two people, so we extract a complete conversation flow involving only two people.
Note that there are more detailed processing steps than we can describe here. We will release the text processing script, which we believe will be helpful for the community when collecting dialog datasets.
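Two of the steps above lend themselves to a short illustration: replacing URLs with a placeholder and checking that a conversation involves exactly two speakers. The sketch below is not the authors' released script; the regular expression and both function names are assumptions made for illustration.

```python
import re
from typing import List

# Assumption: a URL is any whitespace-free run starting with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")


def replace_urls(text: str) -> str:
    """Replace every URL-like span with the literal word 'URL'."""
    return URL_PATTERN.sub("URL", text)


def is_two_party(speakers: List[str]) -> bool:
    """True when a conversation flow involves exactly two distinct speakers."""
    return len(set(speakers)) == 2


print(replace_urls("see https://example.com/a?b=1 now"))
print(is_two_party(["alice", "bob", "alice"]))
```

This simple pattern also swallows punctuation that directly follows a URL, so a production pipeline would likely need a more careful matcher.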

B.1 Training Details
We initialize PRAL with the pre-trained language model GPT-2 small, which has 117M parameters (Radford et al., 2019). We follow a special format in GPT-2 as the "trigger" so that the model can perform zero-shot dialog response generation, by prefixing each utterance with the role token "A:" or "B:" and suffixing it with the end-of-utterance token "\n\n\n". We first pre-train PRAL on PretrainDial and then further fine-tune PRAL on the specific task dataset. We apply the AdamW optimizer (Loshchilov and Hutter, 2019), and the warm-up ratio is set to 0.1. The learning rate is $1 \times 10^{-4}$ in the pre-training process and $3 \times 10^{-5}$ in the fine-tuning process. The dropout rate is set to 0.1 for all tasks. For the loss calculation in the pre-training process, we set $\alpha_0$ to 0.1 and $\lambda$ to 0.9999. The discount factor $\gamma$ is set to 0.95.
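The trigger format described above might be produced as in the following sketch. The role tokens and the end-of-utterance token are taken from the text; the function name, the strict alternation of roles, and the absence of a space after the role token are assumptions.

```python
from typing import List

ROLE_TOKENS = ("A:", "B:")     # role prefixes named in the text
END_OF_UTTERANCE = "\n\n\n"    # end-of-utterance suffix named in the text


def format_dialog(utterances: List[str]) -> str:
    """Serialize alternating utterances into a single training string."""
    parts = []
    for index, utterance in enumerate(utterances):
        role = ROLE_TOKENS[index % 2]  # assumption: speakers strictly alternate
        parts.append(f"{role}{utterance}{END_OF_UTTERANCE}")
    return "".join(parts)


print(repr(format_dialog(["hi", "hello, how can I help?"])))
```

At inference time, appending the next role token to such a string would prompt the model to generate that speaker's reply up to the end-of-utterance token.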

B.2 Decoding Details
In the downstream tasks, we decode utterances by nucleus sampling (Holtzman et al., 2020) with different hyper-parameters (top-p, top-k). We also vary the temperature, keeping it below 1, to find the best setting for each downstream dialog task. We use nucleus sampling for all methods. For the CamRest676 task, we set top-p to 0.2 and the temperature to 0.7 for our model. For the MultiWOZ task, we set top-p to 0.2 and the temperature to 0.7. For the PersuasionForGood task, to generate diverse responses, we use a top-p of 0.9 and a temperature of 0.7.
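For readers unfamiliar with the decoding procedure, the sketch below shows temperature scaling followed by nucleus (top-p) filtering for a single decoding step, in plain Python. It is a generic illustration of the technique rather than the decoding code used in the experiments, and the function names are hypothetical.

```python
import math
import random
from typing import List, Optional


def nucleus_filter(probs: List[float], top_p: float) -> List[int]:
    """Indices of the smallest set of tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_next_token(logits: List[float], top_p: float = 0.9,
                      temperature: float = 0.7,
                      rng: Optional[random.Random] = None) -> int:
    """Sample one token id after temperature scaling and nucleus filtering."""
    rng = rng or random
    scaled = [x / temperature for x in logits]   # temperature below 1 sharpens
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    kept = nucleus_filter(probs, top_p)
    # random.choices renormalizes the kept probabilities implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

A small top-p such as 0.2 usually keeps only the most likely token and behaves almost greedily, while 0.9 leaves room for the more diverse responses wanted in PersuasionForGood.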

C Human Evaluation Detail
Twenty people participated in the human evaluation. ARDM is the state-of-the-art model for the PersuasionForGood task. Each person has ten conversations with PRAL and ARDM in random order, five conversations with each model. 1) For the donation task, the participants are asked, "How much will you donate after talking to the bot? Please choose from 0-2$". 2) For fluency, logic, and coherence, the participants are asked, "Which one do you think is more fluent/logic/coherent?" and choose the model with better performance on the corresponding metric. 3) For diversity, participants compare performance across dialogs, so they are asked, "Which bot do you think generates more diverse responses?" after talking with each model five times. Table 4 and Table 5 show a PRAL example and an ARDM example.

E Ethical Statement
As the knowledge in a language model comes from its training corpus, we used a strict pipeline to avoid offensive and other inappropriate text. We believe such a selective training source could contribute to a less biased dialog system. However, the Teacher Model could be a potential source of biased knowledge, although we did not observe this in our experiments. In the future, we will also explore how to select the knowledge to import from the Teacher Model.

SYS: Save the Children is an international non-governmental organization that promotes children's rights, provides relief and helps support children in developing countries.
USR: It is a good organization. I am glad to know it.
SYS: Yes, they are a very professional organization. Would you be interested in donating some of your task money to this charity?
USR: Sure. I would like to donate some money to it.
SYS: Thank you so much!
USR: no problem.
SYS: How much do you like to donate to the charity now? Your donation will be directly deducted from your task payment.
USR: I would like to give $2 to them.
SYS: Thank you so much!
USR: You are welcome
SYS: I hope you have a great day!
USR: You too.
Table 4: An example conversation between a machine persuader trained by PRAL and a human persuadee.

SYS: Save the Children is an international non-governmental organization that promotes children's rights, provides relief and helps support children in developing countries.
USR: It is a good organization. I would like to make some contributions.
SYS: Would you be interested in donating some money to this charity?
USR: Sure. I think I can do that.
SYS: How much do you like to donate to the charity now? Your donation will be directly deducted from your task payment.
USR: I would like to give $1 to them.
SYS: Thank you so much! Bye!
USR: Bye!
Table 5: An example conversation between a machine persuader trained by ARDM and a human persuadee.