Different Strokes for Different Folks: Investigating Appropriate Further Pre-training Approaches for Diverse Dialogue Tasks

Loading models pre-trained on large-scale general-domain corpora and fine-tuning them on specific downstream tasks is gradually becoming a paradigm in Natural Language Processing. Previous investigations show that introducing a further pre-training phase between the pre-training and fine-tuning phases, to adapt the model to domain-specific unlabeled data, can bring positive effects. However, most of these further pre-training works simply keep running the conventional pre-training task, e.g., the masked language model, which can be regarded as domain adaptation that bridges the data distribution gap. After observing diverse downstream tasks, we suggest that different tasks may also need a further pre-training phase with appropriate training tasks to bridge the task formulation gap. To investigate this, we carry out a study on improving multiple task-oriented dialogue downstream tasks by designing various tasks for the further pre-training phase. The experiments show that different downstream tasks prefer different further pre-training tasks, with which they are intrinsically correlated, and that most further pre-training tasks significantly improve certain target tasks rather than all of them. Our investigation indicates that it is important and effective to design appropriate further pre-training tasks that model the specific information benefiting downstream tasks. In addition, we present multiple constructive empirical conclusions for enhancing task-oriented dialogues.


Introduction
Pre-trained models, e.g., BERT (Devlin et al., 2019a), RoBERTa (Liu et al., 2019) and GPT2 (Radford et al., 2019), have been widely used in many NLP tasks. These models are pre-trained on large-scale general text corpora, such as Wikipedia or books, with self-supervised training objectives. Fine-tuning these models on downstream tasks can achieve excellent performance.
* Jinchao Zhang is the corresponding author.
Recently, Gururangan et al. (2020) proposed a domain-adaptive pre-training method: they further pre-train RoBERTa on a large corpus of unlabeled domain-specific text, e.g., biomedical papers and computer science papers, before fine-tuning on downstream tasks, and achieve strong performance. They also showed that continuing pre-training on task-specific text is helpful. Wu et al. (2020) applied this method to task-oriented dialogue and proposed a new self-supervised pre-training objective on a dialogue corpus. Although they achieved performance improvements, the improvements vary considerably across downstream tasks, and some tasks obtain no improvement at all, which indicates that different downstream tasks may need different further pre-training tasks.
To investigate this issue, we carry out experiments in the area of task-oriented dialogue. We choose one popular pre-trained language model, BERT (Devlin et al., 2019a), as our base model, and construct a large-scale domain-specific dialogue corpus consisting of nine task-oriented datasets for further pre-training (Wu et al., 2020). We also select four core task-oriented dialogue tasks, namely intent recognition, dialogue act prediction, response selection, and dialogue state tracking, as the downstream tasks used in the fine-tuning phase. We aim to explore the following questions: 1) In the area of task-oriented dialogue, can further pre-training using the masked language model improve the performance of all downstream tasks? 2) Do different further pre-training tasks have different effects on different downstream tasks? 3) Which factors affect whether a further pre-training task can achieve improvement on a certain downstream task? 4) Does combining different further pre-training tasks benefit more downstream tasks?
To answer these questions, we design five self-supervised pre-training tasks according to different characteristics of the downstream tasks. Specifically, we first use these specially designed pre-training tasks to further pre-train BERT on the domain-specific corpus, obtaining multiple new pre-trained models, denoted as BERT's variants. Then, we fine-tune these variants on all downstream tasks and observe the effect of different pre-training tasks on different downstream tasks. From the experimental results, we find that: 1) Further pre-training with the masked language model does not achieve improvements on all downstream tasks, so it is necessary to design special further pre-training tasks according to the characteristics of dialogue data. 2) Different pre-training tasks do have different effects on different downstream tasks, and there is a need to design a specific pre-training task for a given downstream task. 3) The model's ability and structure are two key factors influencing the effectiveness of further pre-training on a given downstream task. 4) Training two further pre-training tasks in a multi-task paradigm does not lead to incremental performance improvements on downstream tasks.
The main contribution of our work is a set of empirical principles for designing effective further pre-training tasks that enhance task-oriented dialogue. The key points of the design are to make the model structures of the pre-training task and the downstream task similar, and to let the model learn the abilities required by downstream tasks in the pre-training phase while maintaining the masked language model's training. We release the source code on GitHub (https://github.com/FFYYang/DSDF.git).

Background

Pre-trained Models
Two-stage Training. Large pre-trained models, such as BERT (Devlin et al., 2019b), RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), XLNet (Yang et al., 2019), and T5 (Raffel et al., 2020), are trained on massive general-domain text with self-supervised training objectives, such as the masked language model (Devlin et al., 2019b) and the permutation language model (Yang et al., 2019). These models learn strong and general word representations, and fine-tuning them on downstream tasks has proved to be effective.
Three-stage Training. Recently, further pre-training large language models on a domain-specific corpus before fine-tuning on downstream tasks has become a popular and effective paradigm. Gururangan et al. (2020) proposed domain-adaptive pre-training and task-adaptive pre-training methods, and showed that such a second phase of pre-training in a specific domain leads to performance gains. Wu et al. (2020) applied the second-phase pre-training to task-oriented dialogue; in addition to the masked language modeling objective, they proposed a new self-supervised objective based on the characteristics of dialogue corpora. However, the performance improvement gained from their proposed methods varies considerably across downstream tasks, which indicates that different downstream tasks may need different further pre-training tasks rather than only the conventional one, such as MLM.

Task-oriented Dialogue
A task-oriented dialog system aims to assist the user in completing certain tasks in one or several specific domains, such as restaurant booking, weather query, and flight booking. The entire system usually consists of four modules: natural language understanding (NLU), dialog state tracking (DST), dialog policy, and natural language generation (NLG). In this work, we focus on four core tasks:
• Intent recognition: The model is required to predict the intent type given the user utterance. Intent type is a high-level classification label of the user utterance, such as Query and Inform, which indicates the function of the user utterance.
• Dialog act prediction: The model is required to predict the dialog act (e.g., Question, Statement) of the next response given the whole dialog history.
• Response selection: The model is required to select the proper response from many candidate responses given the whole dialog history. The negative candidate responses are randomly sampled.
• Dialog state tracking: The dialog state tracker estimates the user's goal in each time step by taking the entire dialog context as input. The dialog state at time t can be regarded as an abstracted representation of the previous turns until t.
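To make the notion of a dialog state concrete, the sketch below represents it as a mapping from (domain, slot) pairs to values and accumulates it turn by turn. The domain, slot names, and values are invented for illustration, and the per-turn updates are written by hand rather than predicted by a model.

```python
from typing import Dict, List, Tuple

Slot = Tuple[str, str]    # (domain, slot)
State = Dict[Slot, str]   # (domain, slot) -> value


def track_states(turn_updates: List[State]) -> List[State]:
    """Return the accumulated dialog state after each turn.

    A later turn overwrites the value of a (domain, slot) pair set earlier.
    """
    states: List[State] = []
    current: State = {}
    for update in turn_updates:
        current = {**current, **update}
        states.append(current)
    return states


# Hypothetical slot updates for a restaurant-booking dialog.
updates = [
    {("restaurant", "food"): "italian"},
    {("restaurant", "area"): "centre"},
    {("restaurant", "food"): "chinese"},  # the user changes their mind
]
states = track_states(updates)
```

Here `states[t]` plays the role of the dialog state at time t: it summarizes every turn up to and including t.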

Approaches
In this section, we first present the three-stage training framework, and then introduce five purpose-designed further pre-training tasks and the downstream tasks. Finally, we present a heuristic analysis of the relations between the tasks in the further pre-training stage and those in the fine-tuning stage.

Three-stage Training for the Task-oriented Dialogue
We design a three-stage training framework, consisting of the general pre-training stage, the task-level further pre-training stage, and the task-specific fine-tuning stage, for enhancing the various tasks in task-oriented dialogue, as shown in Figure 1. The general pre-training stage aims to learn general word representations. The task-level further pre-training stage contains multiple optional tasks trained on the unlabeled dialogue corpus. The task-specific fine-tuning stage trains specific models for solving a downstream task such as intent recognition. We emphasize that our further pre-training stage attempts to bridge the task-level gap between the pre-training and fine-tuning stages, rather than to perform domain adaptation at the data level (Gururangan et al., 2020).

Task-level Further Pre-training
To enhance the task-oriented dialogue through bridging the task-level gap between pre-training and fine-tuning, we design multiple optional tasks which can be trained on dialogue corpus without any human annotation.
Dialog Speaker Prediction (DSP). The model is required to predict the speaker (user or agent) of a given utterance. The model can learn a better single-utterance representation from this task. The input of the model is a single utterance U = u_1, u_2, ..., u_K, where K is the utterance length. The model outputs a binary result indicating whether the speaker is the user or the agent.
The utterance is encoded with F_bert, the forward function of BERT, and its [CLS] representation is used as the utterance representation; W_DSP, a trainable linear mapping matrix, maps this representation to the prediction. The task is trained with the cross-entropy loss.
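The following is a minimal sketch of how DSP supervision can be read directly off a dialog and how a linear head could be scored, not the authors' implementation. It assumes a two-way softmax output; the label ids, the function names, and the toy vectors standing in for the [CLS] representation and for W_DSP are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

USER, AGENT = 0, 1  # label ids chosen for this sketch


def build_dsp_examples(dialog: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Turn (speaker, utterance) pairs into (utterance, speaker label) pairs."""
    label_of = {"user": USER, "agent": AGENT}
    return [(utterance, label_of[speaker]) for speaker, utterance in dialog]


def speaker_loss(cls_vec: Sequence[float],
                 w_dsp: Sequence[Sequence[float]],
                 label: int) -> float:
    """Cross-entropy of softmax(w_dsp @ cls_vec) against the gold label."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) for row in w_dsp]
    top = max(logits)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_z - logits[label]


dialog = [
    ("user", "I need a cheap hotel in the north."),
    ("agent", "Sure, for how many nights?"),
    ("user", "Two nights, please."),
]
examples = build_dsp_examples(dialog)

# Toy 3-dimensional stand-in for the [CLS] vector and a 2 x 3 matrix for W_DSP.
loss = speaker_loss([0.5, -1.0, 2.0],
                    [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1]],
                    examples[0][1])
```

The labels come for free from the speaker turns, which is why the task needs no human annotation.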
Context Response Matching (CRM). Given a dialog context, the model selects the proper response from many randomly sampled candidate responses. This task is the same as the response contrastive loss proposed by Wu et al. (2020). The model can learn dialogue coherence information from this task.
Dialogue Coherence Verification (DCV). This task asks the model to predict whether a dialog is coherent. Incoherent dialogs are constructed by randomly replacing some utterances in a dialog. The model can learn a better multi-turn dialog representation from this task. We first randomly select half of the training data and randomly replace some utterances in each selected dialogue to destroy its coherence. The input of the model is the whole dialog with all utterances concatenated together, denoted as S = x_1, x_2, ..., x_N, where N is the sequence length. The model outputs a binary prediction.
The dialog is encoded with F_bert, the forward function of BERT, and its [CLS] representation is used as the dialog representation; W_DCV, a trainable linear mapping matrix, maps this representation to the prediction. The task is trained with the cross-entropy loss.
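The construction of DCV training data can be sketched as follows. This is an illustration under stated assumptions rather than the released code: the function name is hypothetical, exactly one utterance is replaced per corrupted dialog by default, and replacements are drawn from utterances of other dialogs.

```python
import random
from typing import List, Tuple


def build_dcv_dataset(dialogs: List[List[str]],
                      rng: random.Random,
                      n_replace: int = 1) -> List[Tuple[List[str], int]]:
    """Corrupt half of the dialogs; label 1 = coherent, 0 = incoherent.

    Assumes every dialog can draw a replacement from some other dialog.
    """
    pool = [utt for dialog in dialogs for utt in dialog]
    corrupt_ids = set(rng.sample(range(len(dialogs)), k=len(dialogs) // 2))
    dataset = []
    for idx, dialog in enumerate(dialogs):
        utterances = list(dialog)
        if idx not in corrupt_ids:
            dataset.append((utterances, 1))
            continue
        # Only utterances that do not occur in this dialog may replace one.
        candidates = [utt for utt in pool if utt not in dialog]
        k = min(n_replace, len(utterances))
        for pos in rng.sample(range(len(utterances)), k=k):
            utterances[pos] = rng.choice(candidates)
        dataset.append((utterances, 0))
    return dataset


dialogs = [
    ["I want to book a table.", "For how many people?"],
    ["Is it going to rain today?", "Yes, take an umbrella."],
    ["Find me a flight to Paris.", "Which date do you prefer?"],
    ["Play some jazz.", "Playing a jazz playlist now."],
]
dataset = build_dcv_dataset(dialogs, random.Random(0))
```

Each returned utterance list would then be concatenated into the single input sequence S before being fed to the encoder.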
Entity Number Prediction (ENP). The model predicts the number of entities contained in an utterance. Entities are extracted using the open-source tool stanza. From this task, the model can learn a better single-utterance representation as well as entity information. The task is formulated as a multi-class classification problem: we input a single utterance U, and the model predicts a single class indicating how many entities the utterance contains.
As before, F_bert is the forward function of BERT, and W_ENP is a trainable linear mapping matrix applied to its output. The task is trained with the cross-entropy loss.
Dialog Utterances Reordering (DUR). The model reorders a group of shuffled utterances, from which it can learn dialog coherence information. The input of the model is the whole dialog, with the positions of some utterances shuffled. We put the special tokens [USR] and [SYS] at the front of each utterance to indicate whether it is spoken by the user or the agent. We concatenate all utterances together, feed them to BERT, and take the representations of [USR] and [SYS] as the representations of the utterances. The model predicts the correct relative positions of the shuffled utterances. For example, if utterances U_i, U_{i+1}, U_{i+2} are shuffled, we first use BERT to get their representations R_i, R_{i+1}, R_{i+2}, and then use an FFN and a softmax function to get the probability distribution over their relative positions. The loss is computed between this predicted distribution and y_p, the correct probability distribution of the utterances' relative positions; for example, if the correct relative position is [2, 1, 3], then y_p = Softmax([2, 1, 3]).
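The target construction for DUR can be illustrated with a short sketch. The helper names are hypothetical, the window of shuffled utterances is chosen by the caller, and only the target y_p is computed here; the form of the loss itself is not assumed.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shuffle_window(utterances: List[str], start: int, size: int,
                   rng: random.Random) -> Tuple[List[str], List[int]]:
    """Shuffle utterances[start:start + size].

    Returns the new utterance list and, for each shuffled slot, the
    1-based position that its utterance had inside the original window.
    """
    order = list(range(size))
    rng.shuffle(order)
    window = utterances[start:start + size]
    shuffled = list(utterances)
    for slot, src in enumerate(order):
        shuffled[start + slot] = window[src]
    return shuffled, [src + 1 for src in order]


# Target for the example in the text: correct relative position [2, 1, 3].
target = softmax([2, 1, 3])  # approximately [0.245, 0.090, 0.665]

utterances = ["[USR] hi", "[SYS] hello", "[USR] book a taxi", "[SYS] where to?"]
shuffled, positions = shuffle_window(utterances, 1, 3, random.Random(0))
y_p = softmax(positions)
```

Because the softmax is monotone, the largest entry of y_p always marks the utterance that belongs last in the window.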

Task Specific Fine-tuning
After further pre-training, we fine-tune our models on each downstream task individually. These downstream tasks are modeled in different forms, following Wu et al. (2020).
Intent Recognition (INT). The task is a multi-class classification problem: the input of the model is a single utterance U, and the model predicts a single intent type.
The task is trained with the cross-entropy loss.
Dialogue Act Prediction (DA). The task is modeled as a multi-label classification problem, since a system response may contain multiple dialogue acts. The model's input is the whole dialogue history S, and the model outputs a binary prediction for each possible dialogue act.
It is trained with the binary cross-entropy loss.
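The multi-label formulation of dialogue act prediction can be sketched as below. The act names, the 0.5 decision threshold, and the function names are assumptions made for illustration; the logits stand in for the model's per-act outputs.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    # Branch on the sign so that exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(logits: Sequence[float], targets: Sequence[int]) -> float:
    """Mean binary cross-entropy over all dialogue acts (stable form)."""
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict_acts(logits: Sequence[float], act_names: Sequence[str],
                 threshold: float = 0.5) -> List[str]:
    """Every act whose probability reaches the threshold is predicted."""
    return [name for z, name in zip(logits, act_names) if sigmoid(z) >= threshold]


acts = ["inform", "request", "confirm"]  # hypothetical act inventory
logits = [2.0, -1.0, 0.5]
gold = [1, 0, 1]                         # the response carries two acts
loss = bce_with_logits(logits, gold)
predicted = predict_acts(logits, acts)
```

Each act gets its own independent binary decision, so a single response can be assigned several acts at once, unlike in the multi-class intent recognition task.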
Response Selection (RS). The model selects the most proper system response from multiple candidates. We utilize a siamese structure and compute similarity scores between dialogue history H and a candidate response R i .
Here s_i is the cosine similarity between the representations of H and R_i. The negative candidates are randomly sampled from the corpus.
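A minimal sketch of cosine-similarity response selection with randomly sampled negatives follows. The toy vectors stand in for the encodings that a shared (siamese) encoder would produce for the history and for each candidate; the function names and example responses are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_response(history_vec: Sequence[float],
                    candidate_vecs: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index of the best-scoring candidate and all scores."""
    scores = [cosine(history_vec, cand) for cand in candidate_vecs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


def sample_negatives(gold: str, corpus: Sequence[str], k: int,
                     rng: random.Random) -> List[str]:
    """Draw k distinct negative responses, never the gold one."""
    pool = [resp for resp in corpus if resp != gold]
    return rng.sample(pool, k)


history = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
best, scores = select_response(history, candidates)

corpus = ["Sure.", "It is 5 pounds.", "Which area?", "Goodbye."]
negatives = sample_negatives("Which area?", corpus, 2, random.Random(0))
```

Because cosine similarity ignores vector length, the second candidate wins even though it is twice as long as the history vector.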
Dialogue State Tracking (DST) is modeled as a multi-class classification task based on a predefined ontology. The model's input is the whole dialogue history S, and the model predicts the value of the slot for each (domain, slot) pair. We define v_i^j as the i-th value of the j-th (domain, slot) pair, and use BERT to obtain its representation, which is kept fixed during the whole fine-tuning stage.
Here Sim is the cosine similarity function, and S^j is the probability distribution of the j-th (domain, slot) pair over its possible values. G_j is the slot projection layer of the j-th (domain, slot) pair, and the number of layers |G| is equal to the number of (domain, slot) pairs. The task is trained with the cross-entropy loss summed over all the pairs.

Table 1: Comparison between the further pre-training tasks and the downstream tasks from the ability and structure perspectives. The upper four tasks are downstream tasks, and the lower five are further pre-training tasks. ※ indicates that the task has the ability or belongs to the model structure.
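The slot-value scoring described for dialogue state tracking can be sketched as follows. This is one reading of the description, assuming that each slot projection layer G_j is a single linear map and that a softmax over the cosine similarities gives the distribution over values; the toy vectors replace the BERT encodings, and all names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(scores: Sequence[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: Matrix, vec: Vector) -> List[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def dst_loss(context_vec: Vector, projections: Sequence[Matrix],
             value_vecs: Sequence[Sequence[Vector]], gold: Sequence[int]) -> float:
    """Cross-entropy summed over all (domain, slot) pairs.

    projections[j] is the projection for pair j, value_vecs[j] holds the
    fixed embeddings of its candidate values, gold[j] is the true index.
    """
    total = 0.0
    for g_j, values_j, gold_j in zip(projections, value_vecs, gold):
        projected = matvec(g_j, context_vec)
        probs = softmax([cosine(projected, v) for v in values_j])
        total -= math.log(probs[gold_j])
    return total


identity = [[1.0, 0.0], [0.0, 1.0]]
context = [1.0, 0.0]
values = [
    [[1.0, 0.0], [0.0, 1.0]],  # candidate values of pair 0, gold index 0
    [[0.0, 1.0], [1.0, 0.0]],  # candidate values of pair 1, gold index 1
]
loss = dst_loss(context, [identity, identity], values, [0, 1])
```

Keeping the value embeddings fixed means that only the encoder and the per-pair projections would receive gradients in a real training setup.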
All of the proposed tasks are trained jointly with the masked language model in a multi-task paradigm. In addition, these tasks are optional; we focus on investigating their relations with each downstream task.
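As a rough illustration of this multi-task setup, the sketch below applies BERT-style token masking (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged) and adds the masked language model loss to an auxiliary task loss. Whether these exact masking rates and an unweighted sum are used here is an assumption, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[USR]", "[SYS]"}  # never masked in this sketch


def mask_tokens(tokens: Sequence[str], vocab: Sequence[str], rng: random.Random,
                mask_prob: float = 0.15) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels).

    labels[i] is the original token at a selected position, else None.
    """
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


def joint_loss(task_loss: float, mlm_loss: float, mlm_weight: float = 1.0) -> float:
    """Combine the auxiliary task loss with the MLM loss (weight is assumed)."""
    return task_loss + mlm_weight * mlm_loss


tokens = ["[CLS]", "[USR]", "i", "need", "a", "taxi", "[SEP]"]
vocab = ["i", "need", "a", "taxi", "hotel"]
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
total = joint_loss(task_loss=0.7, mlm_loss=1.3)
```

Only positions with a non-None label would contribute to the MLM loss.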

Heuristic Analysis on Task Relations between Further Pre-training and Fine-tuning
We analyse the task relations from two perspectives: model ability and structure. Ability refers to the information or knowledge the model learns, for example, the ability to represent a single turn or knowledge about entities. Structure refers to the model's network structure and its objective function, for example, the siamese structure and the list-wise ranking loss function. The details are shown in Table 1. We suggest that if a further pre-training task learns similar abilities to, or has a similar model structure to, the downstream task, then further pre-training will be more effective for fine-tuning.

Evaluation Datasets
We select four datasets, OOS, DSTC2, GSIM, and MWOZ, for downstream evaluation. Details of each evaluation dataset are discussed below.
OOS (Larson et al., 2019). It contains 151 intent types across ten domains, including 150 in-scope intents and one out-of-scope intent.
DSTC2 (Henderson et al., 2014). It is a machine-human task-oriented dataset. We follow Wu et al. (2020) in mapping the original dialogue act labels to universal dialogue acts, resulting in 19 acts.
MWOZ (Budzianowski et al., 2018). It is a popular benchmark for task-oriented dialogues, with 30 (domain, slot) pairs across seven different domains. We use the revised version, MWOZ 2.1.
GSIM (Shah et al., 2018). It is a human-rewritten task-oriented dataset. Following Wu et al. (2020), we combine the movie and restaurant domains into one single corpus and map its dialogue act labels to universal dialogue acts, resulting in 13 acts.

Training Setting
For further pre-training, we set the learning rate to 5e-5, the batch size to 32, and the maximum sequence length to 512. For fine-tuning, we set the learning rate to 5e-5 (except for the dialog state tracking task, where it is 3e-5) and use the batch size that maximizes GPU usage. We train our models using the Adam optimizer. Models are early-stopped using the loss on a validation set. We train each downstream task three times with different seeds. We use four NVIDIA V100 GPUs for further pre-training and one for fine-tuning. Our code is based on Transformers.

Table 2: Results of the experiment investigating the effect of data-level further pre-training. BERT_2 does not contain a further pre-training stage; MLM_3 uses the masked language model to further pre-train BERT on the unlabeled dialogue corpus before fine-tuning. MLM_3 does not surpass BERT_2 on all metrics and datasets.
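The training settings stated above can be collected in a small configuration sketch. The class and function names are hypothetical, and the early-stopping patience is an assumption, since only the use of validation loss is stated.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FurtherPretrainConfig:
    learning_rate: float = 5e-5
    batch_size: int = 32
    max_seq_length: int = 512
    num_gpus: int = 4  # NVIDIA V100


def finetune_learning_rate(task: str) -> float:
    """3e-5 for dialog state tracking, 5e-5 for the other downstream tasks."""
    return 3e-5 if task == "DST" else 5e-5


def early_stop_epoch(val_losses: Sequence[float], patience: int = 3) -> int:
    """Index of the best epoch seen before training would be stopped.

    Training stops once `patience` epochs pass without a new best
    validation loss; the patience value is an assumption of this sketch.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


config = FurtherPretrainConfig()
best = early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.95, 0.1])
```

In the example, training stops at the fifth epoch, so the much lower loss in the last position is never reached and the second epoch (index 1) is kept.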

Results and Discussion
In this section, we collect experimental results and analyse the effects of different further pre-training tasks on different downstream tasks.

Effect of the Data-level Further Pre-training
To investigate the effect of data-level further pre-training, we first further pre-train BERT with the masked language model (MLM) on the unlabeled task-oriented dialogue corpus, and then fine-tune the model on each downstream task. We denote this experiment as MLM_3. For comparison, we also directly fine-tune BERT on the downstream tasks, and denote this experiment as BERT_2. The results are shown in Table 2. MLM_3 outperforms BERT_2 on the response selection and dialog state tracking tasks; on the dialog act prediction and intent recognition tasks, however, MLM_3 does not surpass BERT_2 on all metrics and datasets. From these results, we conclude that further pre-training with the MLM objective does not bring performance improvements on all downstream tasks, so it is necessary to design special further pre-training tasks according to the characteristics of dialogue data.

Effect of Various Further Pre-training Tasks
To investigate the effects of different further pre-training tasks on different downstream tasks, we compare three further pre-training tasks, dialogue speaker prediction (DSP), context response matching (CRM), and dialogue coherence verification (DCV), each of which has special characteristics. As the results in Table 3 show, DSP, CRM, and DCV are better than MLM_3 on most of the metrics, which indicates the effectiveness of these auxiliary pre-training tasks. We also observe that different pre-training tasks benefit different downstream tasks: DSP is more beneficial to the downstream intent recognition task than the others, CRM mainly benefits response selection, and DCV benefits dialogue act prediction and dialogue state tracking. We conclude that different pre-training tasks do have different effects on different downstream tasks, so there is a need to design a specific pre-training task for a given downstream task.

Table 4: Experiments investigating the effect of ability. ENP is designed to learn more of the abilities needed by the downstream INT and DST tasks, and its performance on these two tasks is consistently higher than that of DSP.

Table 5: Experiments investigating the effect of structure. The model structure of CRM is more similar to that of the downstream task RS, and its performance on this task is consistently higher than that of DUR.

Empirical Analysis on Task Relations between Further Pre-training and Fine-tuning
In Section 3.4, we provided a heuristic analysis of the task relations between further pre-training and fine-tuning. We suggested that ability and structure are two key factors that influence the effectiveness of further pre-training for fine-tuning.
We call a (further pre-training task, downstream task) pair a nice pair if the further pre-training task is effective for the downstream task. From Table 3 we find that DSP is more beneficial for INT, CRM for RS, and DCV for DA and DST, so there are four nice pairs: (DSP, INT), (CRM, RS), (DCV, DA), and (DCV, DST). These four nice pairs have one thing in common: the further pre-training task and the downstream task in the same nice pair share almost the same ability and model structure. Take the (CRM, RS) pair as an example: both CRM and RS mainly learn the ability of dialogue coherence and use the siamese structure.
To further investigate the effect of ability, we compare dialogue speaker prediction (DSP) and entity number prediction (ENP). Their structures are the same, namely single-turn classification, but the abilities they learn differ: DSP mainly learns the ability of single-turn representation, while ENP also learns entity information. The results in Table 4 show that ENP outperforms DSP on the intent recognition and dialogue state tracking tasks across all metrics, because these two tasks also need the ability to model entity information. This indicates that ability is important for further pre-training.
To further investigate the effect of the structure, we compare context response matching (CRM) and dialogue utterances reordering (DUR). Both of them mainly learn the ability about dialogue coherence, but their structures are different. Results in Table 5 show that CRM surpasses DUR on the response selection task because the CRM model is a siamese structure which is the same as the response selection task. This indicates the structure is also a crucial factor for the effectiveness of further pre-training.

Effect of Combining Further Pre-training Tasks
We jointly further pre-train entity number prediction (ENP) and context response matching (CRM) in the multi-task paradigm, and denote this experiment as Joint. We expect the joint model to combine the advantages of ENP and CRM and to bring improvements on the downstream INT, RS, and DST tasks. The results in Table 6 are not fully consistent with our expectation: on intent recognition, the joint model's performance drops significantly, and on the other three downstream tasks, its performance lies between that of ENP and CRM.

Effect of Combining Data-level and Task-level Further Pre-training
In the previous experiments, each proposed further pre-training task is trained together with the masked language model (MLM); we regard MLM as data-level adaptation and the proposed task as task-level adaptation. In this section, we investigate the effect of MLM by removing the MLM objective from the further pre-training stage. This experiment is denoted as w.o. mlm, and the results are shown in Table 7.
Removing MLM leads to a performance drop across almost all downstream tasks, indicating that MLM is important to the further pre-training stage.

Experiment Summary
Through all the experiments, we can conclude that, in the area of task-oriented dialogue: 1) The masked language model alone is not enough for further pre-training, but it still plays an important role in enhancing fine-tuning, and there is a need to design special further pre-training tasks according to the characteristics of dialogue data. 2) Different pre-training tasks do have different effects on different downstream tasks, and it is necessary to design a specific pre-training task for a specific downstream task. 3) The ability and structure of a further pre-training task are key factors influencing the performance of fine-tuning on a downstream task. 4) Training two further pre-training tasks in the multi-task paradigm does not lead to incremental performance improvement. From these conclusions, we obtain several empirical principles for designing further pre-training tasks: 1) the ability learned by the pre-training task should be similar to the ability required by the downstream task; 2) the modeling structure should also be similar; 3) the masked language model training objective should be kept.

Conclusion
In this work, we study how to make further pre-training more effective for downstream tasks in the area of task-oriented dialog. First, we notice that further pre-training with the MLM objective does not improve all downstream tasks. We then design multiple pre-training tasks for dialog data and find that different pre-training tasks benefit different downstream tasks. Further, we observe that ability and structure are key factors influencing whether a pre-training task is helpful to a downstream task. These findings can be used as empirical principles for designing pre-training tasks.