Domain-Adaptive Pretraining Methods for Dialogue Understanding

Language models like BERT and SpanBERT pretrained on open-domain data have obtained impressive gains on various NLP tasks. In this paper, we probe the effectiveness of domain-adaptive pretraining objectives on downstream tasks. In particular, three objectives, including a novel objective focusing on modeling predicate-argument relations, are evaluated on two challenging dialogue understanding tasks. Experimental results demonstrate that domain-adaptive pretraining with proper objectives can significantly improve the performance of a strong baseline on these tasks, achieving new state-of-the-art performance.


Introduction
Recent advances in pretraining methods (Devlin et al., 2019; Joshi et al., 2020; Yang et al., 2019) have achieved promising results on various natural language processing (NLP) tasks, including natural language understanding, text generation and question answering (Song et al., 2019; Reddy et al., 2019). In order to acquire general linguistic and semantic knowledge, these pretraining methods are usually performed on open-domain corpora, like Wikipedia and BooksCorpus. In light of the success of open-domain pretraining, a further question naturally arises: can downstream tasks also benefit from domain-adaptive pretraining?
To answer this question, later work (Baevski et al., 2019; Gururangan et al., 2020) has demonstrated that continued pretraining on unlabeled data in the target domain can further benefit the corresponding downstream task. However, these studies depend on additional data that can be unavailable in certain scenarios, and they only evaluate on relatively easy downstream tasks. For instance, Gururangan et al. (2020) perform continued pretraining with the masked language modeling loss on several relevant domains, and they obtain improvements on eight well-studied classification tasks, which are too simple to exhibit the strength of continued domain-adaptive pretraining. Besides, it is still unclear which pretraining objective is the most effective for each downstream task.
In this work, we give a deeper analysis of how various domain-adaptive pretraining methods can help downstream tasks. Specifically, we continue pretraining a BERT model (Devlin et al., 2019) with three different kinds of unsupervised pretraining objectives on the domain-specific training set of each target task. Two of them, Masked Language Model (MLM) (Gururangan et al., 2020) and Span Boundary Objective (SBO) (Joshi et al., 2020), have been explored in previous work. In addition, a novel pretraining objective, namely Perturbation Masking Objective (PMO), is proposed to better learn the correlation between arguments and predicates. After domain-adaptive pretraining, the adapted BERT is then tested on dialogue understanding tasks to probe the effectiveness of different pretraining objectives.
We evaluate on two challenging tasks that focus on dialogue understanding, i.e., Conversational Semantic Role Labeling (CSRL) and Spoken Language Understanding (SLU). CSRL (Xu et al., 2020, 2021) was recently proposed by extending standard semantic role labeling (SRL) (Palmer et al., 2010) with cross-utterance relations, which otherwise require coreference and anaphora resolution to be recognized. We follow previous work and treat this task as sequence labeling. SLU, on the other hand, includes intent detection and slot filling. For domain-adaptive pretraining, we only use the training set of each downstream task. In this way, the usefulness of each pretraining objective can be examined more accurately, as no additional data is used.
Experimental results show that domain-adaptive pretraining significantly helps both tasks. Besides, our novel objective achieves better performance than the existing ones, shedding more light for future work on pretraining.

Tasks
Conversational Semantic Role Labeling. Xu et al. (2021) first proposed the CSRL task, which extends standard SRL by explicitly annotating cross-turn predicate-argument structures inside a conversation. Compared with newswire documents, human conversations tend to contain more ellipsis and anaphora, causing more problems for standard NLU methods. Their motivation is that most dropped or referred components in the latest dialogue turn can actually be found in the dialogue history. As a result, CSRL allows arguments to be in different utterances from the predicate, while SRL can only work on each single utterance. Compared with standard SRL, CSRL can be more challenging due to the long-range dependencies. Similar to SRL, we view CSRL as a sequence labeling problem, where the goal is to label each token with a semantic role.
Spoken Language Understanding. Proposed by Zhu et al. (2020), the SLU task consists of two key components, i.e., intent detection and slot filling. Given a dialogue utterance, the goal is to predict its intents and to detect pre-defined slots. We treat them as sentence-level classification and sequence labeling, respectively.

Domain-Adaptive Pretraining Objectives
While previous works have shown the benefit of continued pretraining on domain-specific unlabeled data (e.g., Lee et al. (2020); Gururangan et al. (2020)), these methods only adopt the Masked Language Model (MLM) objective to train an adaptive language model on a single domain. It is not clear how the benefit of continued pretraining may vary with factors like the objective function.
In this paper, we use the dialogue understanding task as a testbed to investigate the impact of three pretraining objectives on the overall performance. In particular, we explore the MLM (Devlin et al., 2019) and Span Boundary Objective (SBO) (Joshi et al., 2020), and introduce a new objective, namely Perturbation Masking Objective (PMO), which is better suited to the dialogue NLU task.

Masked Language Model Objective
Masked Language Model (MLM) is the task of predicting missing tokens in a sequence from their placeholders. Specifically, given a sequence of tokens $X = (x_1, x_2, \ldots, x_n)$, a subset of tokens $Y \subseteq X$ is sampled and substituted with a different set of tokens. In BERT's implementation, $Y$ accounts for 15% of the tokens in $X$; of those, 80% are replaced with [MASK], 10% are replaced with a random token (according to the unigram distribution), and 10% are kept unchanged. Formally, the contextual vectors of the input tokens $X$ are denoted as $H = (h_1, h_2, \ldots, h_n)$. The task is to predict the original tokens in $Y$ from the modified input, and the objective function is:
$$\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{|Y|} \sum_{i=1}^{|Y|} \log p(x_i \mid H; \theta)$$
where $|Y|$ is the number of masked tokens, and $\theta$ represents the model parameters.
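The corruption step described above can be illustrated with a minimal sketch. The function names `mlm_corrupt` and `mlm_loss`, the toy vocabulary, and the uniform choice of the random replacement token are assumptions made for illustration only; they are not taken from BERT's released code.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted tokens, sorted target positions) under the 80/10/10 rule."""
    rng = random.Random(seed)
    n_targets = max(1, round(mask_rate * len(tokens)))
    targets = sorted(rng.sample(range(len(tokens)), n_targets))
    corrupted = list(tokens)
    for i in targets:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token (uniform here, an assumption)
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

def mlm_loss(log_probs, targets):
    """Average negative log-likelihood of the original tokens at the target positions.

    log_probs[i] is assumed to hold log p(x_i | H) for the original token at position i.
    """
    return -sum(log_probs[i] for i in targets) / len(targets)
```

Only the positions in `targets` contribute to the loss, including the 10% that are left unchanged in the input.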

Span Boundary Objective
Many NLP tasks, such as dialogue understanding, involve reasoning about relationships between two or more spans of text. Previous work (Joshi et al., 2020) has shown that SpanBERT is superior to BERT in learning span representations, which significantly improves the performance on those tasks. Conceptually, the differences between these two models are twofold.
First, unlike BERT, which selects the masked tokens in $Y$ independently, SpanBERT defines $Y$ by randomly selecting contiguous spans. In particular, SpanBERT selects a subset $Y \subseteq X$ by iteratively sampling spans until 15% of the tokens are masked. For each span, it selects the starting point uniformly at random.
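The span selection can be sketched as follows. The geometric length distribution with $p = 0.2$ clipped at 10 tokens follows the setting reported by Joshi et al. (2020); the function names are hypothetical, and the sketch ignores SpanBERT's whole-word constraint for brevity.

```python
import random

def sample_span_length(rng, p=0.2, max_len=10):
    """Draw a span length from a geometric distribution, clipped to max_len."""
    length = 1
    while rng.random() > p and length < max_len:
        length += 1
    return length

def span_mask_positions(n_tokens, mask_rate=0.15, seed=0):
    """Iteratively sample contiguous spans until the masking budget is met."""
    rng = random.Random(seed)
    budget = max(1, round(mask_rate * n_tokens))
    masked = set()
    while len(masked) < budget:
        # Never sample a span longer than what is left of the budget.
        length = min(sample_span_length(rng), budget - len(masked))
        start = rng.randrange(0, n_tokens - length + 1)  # uniform start position
        masked.update(range(start, start + length))
    return sorted(masked)
```

Spans that overlap previously masked positions simply add fewer new positions, so the loop continues until exactly the budgeted number of tokens is covered.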
Second, SpanBERT additionally introduces a span boundary objective that involves predicting each token of a masked span using only the representations of the observed tokens at the boundaries. For a masked span of tokens $(x_s, \ldots, x_e) \in Y$, where $(s, e)$ are the start and end positions of the span, it represents each token in the span using the boundary vectors and the position embedding:
$$y_i = f(h_{s-1}, h_{e+1}, p_i)$$
where $p_i$ marks the relative position of span token $x_i$ with respect to the left boundary token $x_{s-1}$, and $f(\cdot)$ is a 2-layer MLP with GeLU activations and layer normalization. SpanBERT sums the loss from both the regular MLM and the span boundary objectives for each token in the masked span:
$$\mathcal{L}(x_i) = \mathcal{L}_{\mathrm{MLM}}(x_i) + \mathcal{L}_{\mathrm{SBO}}(x_i) = -\log p(x_i \mid H; \theta) - \log p(x_i \mid y_i; \theta)$$
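To make the boundary-based representation concrete, the following sketch computes it for one span token with a 2-layer MLP (Linear, GeLU, layer normalization in each layer) over the concatenated boundary vectors and position embedding. It is written in plain Python with randomly initialized weights; the helper names, the tiny dimensions, and the omission of bias terms and learned layer-norm parameters are simplifying assumptions, not the SpanBERT implementation.

```python
import math
import random

def gelu(x):
    """Exact GeLU activation using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def random_matrix(rows, cols, rng):
    """Hypothetical random initialization, for illustration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def sbo_representation(h_left, h_right, p_rel, W1, W2):
    """Boundary-based representation of one span token.

    h_left and h_right are the vectors of the tokens just outside the span,
    and p_rel is the relative position embedding of the span token.
    """
    h0 = h_left + h_right + p_rel  # list concatenation of the three vectors
    h1 = layer_norm([gelu(a) for a in linear(W1, h0)])
    return layer_norm([gelu(a) for a in linear(W2, h1)])
```

In a full model, the resulting vector would be fed to the output layer to predict the span token, and the corresponding cross-entropy term would be added to the regular MLM loss.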

Perturbation Masking Objective
In dialogue understanding tasks like CSRL, the major goal is to capture semantic information such as the correlation between arguments and predicates. However, for the sake of generalization, existing pretraining models neither consider the semantic information of a word nor assess the impact a predicate has on the prediction of its arguments in their objectives. To address this, we use the perturbation masking technique to explicitly measure the correlation between arguments and predicates and incorporate it into our objective.
Perturbation masking was originally proposed to assess the impact one word has on the prediction of another in MLM. In particular, given a list of tokens $X$, we first use a pretrained language model $M$ to map each $x_i$ into a contextualized representation $H(X)_i$. Then, we use a two-stage approach to capture the impact word $x_j$ has on the prediction of another word $x_i$. First, we replace $x_i$ with the [MASK] token and feed the new sequence $X \setminus \{x_i\}$ into $M$. We use $H(X \setminus \{x_i\})_i$ to denote the representation of $x_i$. To calculate the impact $x_j \in X \setminus \{x_i\}$ has on $H(X)_i$, we further mask out $x_j$ to obtain the second corrupted sequence $X \setminus \{x_i, x_j\}$. Similarly, $H(X \setminus \{x_i, x_j\})_i$ denotes the new representation of token $x_i$. We define the impact function as:
$$f(x_i, x_j) = d\big(H(X \setminus \{x_i\})_i,\; H(X \setminus \{x_i, x_j\})_i\big)$$
where $d$ is a distance metric that captures the difference between two vectors. In experiments, we use the Euclidean distance as the distance metric.
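The two-stage masking procedure can be sketched as below. Since a real pretrained model is out of scope for a short example, `toy_encoder` is a hypothetical stand-in for $M$ that mixes deterministic pseudo-random token embeddings with distance-based weights; only the `impact` function mirrors the procedure described above.

```python
import math
import random

MASK = "[MASK]"

def embed(token, dim=8):
    """Deterministic pseudo-random embedding (seeding Random with a str is reproducible)."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def toy_encoder(tokens, dim=8):
    """Hypothetical stand-in for the model M: each position is a weighted
    average of all token embeddings, with nearer tokens weighted more."""
    vecs = [embed(t, dim) for t in tokens]
    out = []
    for i in range(len(tokens)):
        weights = [1.0 / (1 + abs(i - j)) for j in range(len(tokens))]
        total = sum(weights)
        out.append([sum(w * v[k] for w, v in zip(weights, vecs)) / total
                    for k in range(dim)])
    return out

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def impact(tokens, i, j, encoder=toy_encoder):
    """Impact of token j on the prediction of token i.

    Stage 1 masks position i; stage 2 additionally masks position j.
    The impact is the distance between the two representations at position i.
    """
    masked_i = list(tokens)
    masked_i[i] = MASK
    masked_ij = list(masked_i)
    masked_ij[j] = MASK
    return euclidean(encoder(masked_i)[i], encoder(masked_ij)[i])
```

For example, `impact(["I", "watched", "the", "movie", "yesterday"], 3, 1)` measures how much masking the verb changes the representation at the masked position of "movie". With a real pretrained encoder substituted for `toy_encoder`, the same function applies unchanged.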
Since our goal is to better learn the correlation between arguments and predicates, we introduce a perturbation masking objective that maximizes the impact of each predicate on the prediction of the argument span, where $p_0, \ldots, p_{m-1}$ are the $m$ predicates that occur in the sentence. In practice, we first follow SpanBERT to sample a subset of contiguous text spans and perform masking (i.e., span masking) on them. Then, we select verbs from $X$ as predicates and perform perturbation masking on those predicates.
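One way such an objective could be computed is sketched below. The exact functional form is an assumption: here the loss is taken to be the negated average impact over all pairs of masked span positions and predicate positions, so that minimizing the loss maximizes the impact. The name `pmo_loss`, the averaging, and the `impact_fn` callable are all hypothetical.

```python
def pmo_loss(span_positions, predicate_positions, impact_fn):
    """Negated mean impact of predicates on masked span tokens (assumed form).

    impact_fn(i, j) is assumed to return the impact of the token at position j
    on the prediction of the token at position i.
    """
    pairs = [(i, j)
             for i in span_positions
             for j in predicate_positions
             if i != j]  # a token has no impact on itself
    if not pairs:
        return 0.0
    return -sum(impact_fn(i, j) for i, j in pairs) / len(pairs)
```

In training, this term would be added to the other objectives; when the three objectives are combined, the individual losses are summed.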

Experiments
We evaluate pretraining objectives on three datasets: DuConv, NewsDialog and CrossWOZ. The first two datasets are annotated by Xu et al. (2021) for the CSRL task and the last one is provided by Zhu et al. (2020) for the SLU task. DuConv is a Chinese knowledge-driven dialogue dataset, focusing on the domain of movies and stars. NewsDialog is a dataset collected in a way that follows the setting for constructing general open-domain dialogues: two participants engage in chitchat, and during the conversation, the topic is allowed to change naturally. Xu et al. (2021) annotate 3K dialogue sessions of DuConv to train their CSRL parser, and directly test on 200 annotated dialogue sessions of NewsDialog. CrossWOZ is a Chinese Wizard-of-Oz task-oriented dataset, including 6K dialogue sessions and 102K utterances on five domains.
Since the state-of-the-art models on these tasks are all based on BERT, we use the same model architectures and simply replace the BERT base model with our domain-adaptive pretrained BERT. Note that we also experimented with other pretrained language models such as RoBERTa and XLNet. We observed similar results, but we only report the results based on BERT due to space limitations.
In particular, we perform domain-adaptive pretraining for the CSRL task using all dialogue sessions in the training sets of DuConv (Wu et al., 2019) and NewsDialog (Wang et al., 2021), which include 26K and 20K sessions, respectively; for the SLU task, we use the whole CrossWOZ training set.
The hyper-parameters used in our model are listed as follows. The network parameters of our model are initialized using the pretrained language model. The batch size is set to 128. We use Adam (Kingma and Ba, 2015) with learning rate 5e-5 to update parameters.
Results and Discussion. On the CSRL task, we follow Xu et al. (2021) and report the F1 score over all arguments (referred to as $F1_{all}$) and over those in the same and different dialogue turns as the predicate (referred to as $F1_{intra}$ and $F1_{cross}$). On the SLU task, we report results on $F1_{intent}$, $F1_{slot}$ and $F1_{all}$. Table 1 summarizes the results. The first row shows the performance of existing state-of-the-art models without domain-adaptive pretraining on each dataset. We can see that on both tasks, existing models benefit from domain-adaptive pretraining, achieving new state-of-the-art performance on these datasets.
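As an illustration of how the three CSRL scores relate, the sketch below computes F1 over all arguments and over the intra-turn and cross-turn subsets. The tuple layout `(role, turn, start, end)` and the exact-match criterion are assumptions made for this example; the official evaluation script of Xu et al. (2021) may differ in detail.

```python
def f1_score(gold, pred):
    """F1 between two collections of arguments under exact matching."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def csrl_f1(gold, pred, predicate_turn):
    """Return F1 over all, intra-turn and cross-turn arguments.

    Each argument is a hypothetical (role, turn, start, end) tuple; an argument
    is intra-turn when it lies in the same dialogue turn as the predicate.
    """
    def intra(args):
        return [a for a in args if a[1] == predicate_turn]

    def cross(args):
        return [a for a in args if a[1] != predicate_turn]

    return {
        "all": f1_score(gold, pred),
        "intra": f1_score(intra(gold), intra(pred)),
        "cross": f1_score(cross(gold), cross(pred)),
    }
```

With a predicate in turn 2, a correctly predicted argument in turn 0 counts toward the cross-turn score, while an argument in turn 2 with a wrong boundary lowers the intra-turn score.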
Let us first look at the CSRL task. Pretraining with the MLM objective slightly improves the performance, by 0.4 and 0.12 in terms of $F1_{all}$ on DuConv and NewsDialog, respectively. By additionally considering the span boundary objective, the overall performance, especially $F1_{cross}$, is further improved by at least 0.75 and 2.6, respectively. These results are expected, since arguments in the CSRL task are usually spans and SBO is better than MLM at learning span representations. We can also see that our proposed perturbation masking objective boosts the performance by a larger margin than SBO, indicating that learning correlations between arguments and predicates is more crucial to the NLU task. By summing the three objectives, the CSRL model achieves the best performance, significantly improving over the baseline without domain-adaptive pretraining by 1.05 and 3.2 $F1_{all}$ points, respectively.
From Table 1, we can see that similar findings are also observed on the SLU task. First, domain-adaptive pretraining on CrossWOZ also improves the performance. Second, adding either SBO or PMO further improves the F1 scores on intent and slot. Third, the best performance is achieved when all three objectives are considered. However, the gains on the SLU task are less substantial than those on the CSRL task. We think this is because the state-of-the-art performance on CrossWOZ is already relatively high, but it is still impressive to achieve absolute improvements of 0.81, 0.90 and 0.87 points in terms of $F1_{intent}$, $F1_{slot}$ and $F1_{all}$.
We also investigate the impact of the span masking scheme on the overall performance. Recall that, in span masking, we randomly sample the span length and the start position of the span. Joshi et al. (2020) showed that no significant performance gains are observed by using more linguistically-informed span masking strategies such as masking named entities or noun phrases. Specifically, they use spaCy's named entity recognizer and constituency parser to extract named entities and noun phrases, respectively. In this paper, we revisit these span masking schemes. Since there is no available constituency parser designed for dialogue, we use an unsupervised grammar induction method (Jin and Schuler, 2020) to extract grammars from the training data. Noun phrases from Viterbi parse trees from different grammars are tallied without labels, resulting in a posterior distribution over the spans, which is used in our span sampling. As shown in Table 1, we find the best choice is to combine random sampling and noun phrase sampling, i.e., sampling from the noun phrases α% of the time and from a geometric distribution the remaining (100 - α)% of the time. The performance on all three datasets consistently increases when more noun phrases are used in the span sampling.
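The mixed sampling scheme can be sketched as follows. Here `alpha` is expressed as a fraction in [0, 1] rather than a percentage, `np_spans` and `np_weights` stand for the tallied noun-phrase spans and their counts, and the geometric parameters follow the SpanBERT setting; the function name and the inclusive `(start, end)` span layout are assumptions for illustration.

```python
import random

def sample_mixed_span(n_tokens, np_spans, np_weights, alpha, rng,
                      p=0.2, max_len=10):
    """Sample one (start, end) span, end inclusive.

    With probability alpha, draw a noun-phrase span in proportion to its
    tallied weight; otherwise draw a random span whose length follows a
    clipped geometric distribution and whose start is uniform.
    """
    if np_spans and rng.random() < alpha:
        return rng.choices(np_spans, weights=np_weights, k=1)[0]
    length = 1
    while rng.random() > p and length < min(max_len, n_tokens):
        length += 1
    start = rng.randrange(0, n_tokens - length + 1)
    return (start, start + length - 1)
```

Setting `alpha` to 0 recovers purely random span masking, while setting it to 1 samples noun phrases only.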

Conclusion
In this paper, we probe the effectiveness of domain-adaptive pretraining on dialogue understanding tasks. Specifically, we study three domain-adaptive pretraining objectives, including a novel one, the perturbation masking objective, on three NLU datasets. Experimental results show that domain-adaptive pretraining with proper objectives is a simple yet effective way to boost dialogue understanding performance.