Structural Pre-training for Dialogue Comprehension

Pre-trained language models (PrLMs) have demonstrated superior performance due to their strong ability to learn universal language representations from self-supervised pre-training. However, even with the help of powerful PrLMs, it is still challenging to effectively capture task-related knowledge from dialogue texts, which are enriched with correlations among speaker-aware utterances. In this work, we present SPIDER (Structural Pre-traIned DialoguE Reader) to capture dialogue-exclusive features. To simulate dialogue-like features, we propose two training objectives in addition to the original LM objectives: 1) utterance order restoration, which predicts the order of permuted utterances in a dialogue context; 2) sentence backbone regularization, which regularizes the model to improve the factual correctness of summarized subject-verb-object triplets. Experimental results on widely used dialogue benchmarks verify the effectiveness of the newly introduced self-supervised tasks.


Introduction
Recent advances in large-scale pre-trained language models (PrLMs) have achieved remarkable success in a variety of natural language processing (NLP) tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019a; Clark et al., 2020; Zhang et al., 2020d). Providing fine-grained contextualized embeddings, these pre-trained models are widely employed as encoders for various downstream NLP tasks.

Figure 1: A multi-turn dialogue example. Different colors indicate the utterances from different speakers.

Although PrLMs demonstrate superior performance due to their strong representation ability from self-supervised pre-training, it is still challenging to effectively adapt task-related knowledge during task-specific training, which is usually done by fine-tuning (Gururangan et al., 2020). Generally, PrLMs handle the whole input text as a linear sequence of successive tokens and implicitly capture the contextualized representations of those tokens through self-attention. Such a fine-tuning paradigm can be suboptimal for modeling dialogue tasks, which hold exclusive text features that the plain text used for PrLM training may hardly embody. Therefore, we explore a fundamental way to alleviate this difficulty by improving the training of the PrLM itself. This work is devoted to designing a natural way of adapting language modeling to the dialogue scenario, motivated by the inherent characteristics of dialogue contexts.
As an active research topic in the NLP field, multi-turn dialogue modeling has attracted great interest. The typical task is response selection (Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018), which aims to select the appropriate response according to a given dialogue context containing a number of utterances, and which is the focus of this work. However, selecting a coherent and informative response for a given dialogue context remains a challenge. A multi-turn dialogue typically involves two or more speakers engaging in various conversation topics and intentions, so the utterances are rich in interactions, e.g., with criss-cross discourse structures (Bai and Zhao, 2018; Qin et al., 2016, 2017). A critical challenge is learning rich and robust context representations and the interactive relationships of dialogue utterances, so that the resulting model is capable of adequately capturing the semantics of each utterance and the relationships among all the utterances inside the dialogue.
Inspired by the effectiveness of PrLMs in learning universal language representations, an increasing number of studies employ PrLMs for conversation modeling (Mehri et al., 2019; Rothe et al., 2020; Whang et al., 2020; Han et al., 2021). These studies typically model response selection with only the context-response matching task and overlook many potential training signals contained in dialogue data. Although PrLMs have learned contextualized semantic representations from token-level or sentence-level pre-training tasks like MLM and NSP, none of them considers dialogue-related features such as speaker role, continuity, and consistency. One obvious issue with these approaches is that the relationships between utterances are harder to capture using word-level semantics alone. Besides, some latent features, such as user intent and conversation topic, are under-explored in existing works (Xu et al., 2021). Therefore, responses retrieved by existing dialogue systems supervised in the conventional way still face critical challenges, including incoherence and inconsistency.
In this work, we present SPIDER (Structural Pre-traIned DialoguE Reader), a structural language modeling method to capture dialogue-exclusive features. Motivated to efficiently and explicitly model the coherence among utterances and the key facts in each utterance, we propose two training objectives in analogy to the original BERT-like language model (LM) training: 1) utterance order restoration (UOR), which predicts the order of permuted utterances in a dialogue context; 2) sentence backbone regularization (SBR), which regularizes the model to improve the factual correctness of summarized subject-verb-object (SVO) triplets. Experimental results on widely used benchmarks show that SPIDER boosts model performance on various multi-turn dialogue comprehension tasks, including response selection and dialogue reasoning.

PrLMs have recently attracted broad attention (Zhou et al., 2020b,a; Xu et al., 2020a,b; Li et al., 2020b, 2021). Most PrLMs are based on the Transformer encoder, among which Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019b) is one of the most representative works. BERT uses multiple layers of stacked Transformer encoders to obtain contextualized representations of the language at different levels, and has helped achieve great performance improvements in a broad range of NLP tasks (Bai and Zhao, 2018; Zhang et al., 2020a; Luo and Zhao, 2020). Several subsequent variants have been proposed to further enhance the capacity of PrLMs, such as XLNet, RoBERTa, ALBERT (Lan et al., 2020), and ELECTRA (Clark et al., 2020). For simplicity and convenient comparison with public studies, we select the most widely used BERT as the backbone in this work.
There are two ways of training PrLMs for dialogue scenarios: open-domain pre-training and domain-adaptive post-training. Some studies perform training on open-domain conversational data like Reddit for response selection or generation tasks (Wolf et al., 2019; Zhang et al., 2020c; Henderson et al., 2020; Bao et al., 2020), but they are limited to the original pre-training tasks and ignore dialogue-related features. For domain-adaptive post-training, prior works have indicated that order information is important in text representation, and the well-known next sentence prediction (Devlin et al., 2019b) and sentence order prediction (Lan et al., 2020) can be viewed as special cases of order prediction. Especially in the dialogue scenario, predicting the word order of an utterance, as well as the utterance order in the context, has shown effectiveness in dialogue generation tasks (Kumar et al., 2020; Gu et al., 2020b), where the order information is well recognized. However, little attention has been paid to dialogue comprehension tasks such as response selection (Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018). The potential difficulty is that utterance order restoration involves many more ordering possibilities, since utterances may appear in quite flexible orders inside dialogue text, than NSP and SOP, which only handle a two-class ordering prediction.
Our work is also closely related to auxiliary multi-task learning, whose common theme is to guide the language modeling Transformers with explicit knowledge and complementary objectives (Sun et al., 2019b; Xu et al., 2020a). The most related work is Xu et al. (2020a), which introduces four self-supervised tasks: next session prediction, utterance restoration, incoherence detection, and consistency discrimination. Our work differs from Xu et al. (2020a) in three aspects. 1) Motivation: our method is designed as a general-purpose approach for broad dialogue comprehension tasks, whose goals may be either utterance-level discourse coherence or inner-utterance factual correctness, instead of being motivated only by downstream context-response matching, whose goal is to measure whether two sequences are related. 2) Technique: we propose both intra- and inter-utterance objectives. In contrast, the four objectives proposed in Xu et al. (2020a) are natural variants of NSP in BERT, which are all utterance-level. 3) Training: we empirically evaluate both domain-adaptive training and multi-task learning, instead of only employing multi-task learning, which requires much effort in optimizing the coefficients of the loss functions and can be time-consuming.
In terms of factual backbone modeling, compared with existing studies that enhance PrLMs by annotating named entities or incorporating external knowledge graphs, the SVO triplets extracted by our sentence backbone regularization (SBR) objective appear more widely in the text itself. Such triplets ensure the correctness of the SVO structure and enable our model to discover the salient facts in lengthy texts, capturing the intuition of "who did what".

Multi-turn Dialogue Comprehension
Multi-turn dialogue comprehension aims to teach machines to read dialogue contexts and solve tasks such as response selection (Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018) and question answering (Sun et al., 2019a; Cui et al., 2020), with the common application of building intelligent human-computer interactive systems (Chen et al., 2017a; Shum et al., 2018; Zhu et al., 2018b). Early studies mainly focus on the matching between the dialogue context and the question (Huang et al., 2019; Zhu et al., 2018a). Recently, inspired by the impressive performance of PrLMs, the mainstream approach is to employ PrLMs to handle the whole input text of context and question as a linear sequence of successive tokens, implicitly capturing the contextualized representations of those tokens through self-attention (Qu et al., 2019). Such modeling can be suboptimal for capturing the high-level relationships between utterances in the dialogue history. In this work, we are motivated to model the structural relationships between utterances via utterance order restoration, and the factual correctness inside each utterance, from the perspective of language modeling pre-training, instead of heuristically stacking deeper model architectures.

Approach
This section presents our proposed method SPIDER (Structural Pre-traIned DialoguE Reader). First, we present the standard dialogue comprehension model as the backbone. Then, we introduce our designed language modeling objectives for dialogue scenarios: utterance order restoration (UOR) and sentence backbone regularization (SBR). For model training, we employ two strategies: 1) domain-adaptive post-training, which first trains a language model with the newly proposed objectives and then fine-tunes it on the response selection task; 2) multi-task fine-tuning, which trains the model on downstream tasks along with the LM objectives.

Transformer Encoder
We first employ a pre-trained language model such as BERT (Devlin et al., 2019a) to obtain the initial word representations. The utterances and the response are concatenated and then fed into the encoder. Given the context C = {U_1, ..., U_n} and response R, we concatenate all utterances in the context and the response candidate as a single consecutive token sequence with special tokens separating them: X = {[CLS] U_1 [EOU] ... U_n [EOU] [SEP] R [SEP]}, where [CLS] and [SEP] are special tokens and [EOU] is the "End Of Utterance" tag designed for multi-turn contexts. X is then fed into the BERT encoder, a deep multi-layer bidirectional Transformer, to obtain a contextualized representation H.
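This input construction can be sketched as follows; the helper below is illustrative rather than the authors' code, and it splits on whitespace for clarity, whereas real BERT tokenization operates on WordPiece subwords:

```python
# Hypothetical sketch of the SPIDER input sequence: each context utterance
# is followed by an [EOU] tag, then [SEP] separates context from response.
def build_input(context_utterances, response):
    tokens = ["[CLS]"]
    for utt in context_utterances:
        tokens.extend(utt.split())
        tokens.append("[EOU]")
    tokens.append("[SEP]")
    tokens.extend(response.split())
    tokens.append("[SEP]")
    return tokens

# Example: two context utterances and one response candidate.
seq = build_input(["hi there", "how are you"], "fine thanks")
```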
In detail, let X = {x_1, ..., x_n} be the embedding sequence of the input, which encodes the sentence words of length n. The embedding sequence X is fed into a multi-layer bidirectional Transformer to learn contextualized representations through multi-head self-attention, defined as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where K, Q, V are packed from the input sequence representation X. As is common practice, we set K = Q = V in the implementation. In the following, we use H = {h_1, ..., h_n} to denote the last-layer hidden states of the input sequence.
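The core self-attention computation with K = Q = V = X can be sketched in a few lines (a single-head, single-layer simplification of the full multi-head, multi-layer encoder):

```python
import numpy as np

# Minimal sketch of scaled dot-product self-attention with K = Q = V = X,
# as stated above; a real Transformer layer adds learned projections,
# multiple heads, residual connections, and layer normalization.
def self_attention(X):
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)                          # (n, n) logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X                                       # contextualized states

np.random.seed(0)
H = self_attention(np.random.rand(5, 8))  # 5 tokens, hidden size 8
```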

SPIDER Training Objectives
To simulate the dialogue-like features, we propose two pre-training objectives in addition to the original LM objectives: 1) utterance order restoration, which predicts the order of permuted utterances in a dialogue context; 2) sentence backbone regularization, which regularizes the model to improve the factual correctness of summarized subject-verb-object triplets. The utterance manipulations are shown in Figure 2. The following subsections describe the objectives in turn.

Utterance Order Restoration
Coherence is an essential aspect of conversation modeling. In a coherent discourse, utterances should respect specific orders of relations and logic. The ordering of utterances in a dialogue context determines the semantics of the conversation. Therefore, learning to order a set of disordered utterances in a way that maximizes discourse coherence has a critical impact on learning the representation of dialogue contexts.
However, most previous studies focused on the semantic relevance between the context and a response candidate. Here we introduce utterance-level position modeling, i.e., utterance order restoration, to encourage the model to be aware of the semantic connections among utterances in the context. The idea is similar to autoencoding (AE), which aims to reconstruct the original data from corrupted input. Given permuted dialogue contexts comprising utterances in random order, we maximize the expected log-likelihood of the original ground-truth order.
The goal of utterance order restoration is to organize randomly shuffled utterances of a conversation into a coherent dialogue context. We extract the hidden states of the [EOU] tokens from H as the representations of the utterances. Formally, given the permuted utterance sequence, the model predicts the original order o = {o_1, ..., o_K}, where K is the maximum number of positions to be predicted. We expect the predicted order to recover the most coherent permutation of the utterances.
Since predicting permuted orders is a more challenging optimization problem than the NSP and SOP tasks, due to the large search space of permutations, and caused slow convergence in preliminary experiments, we choose to only predict the order of the last few permuted utterances, using a permutation ratio δ to control the maximum number of permutations: K = N · δ, where N is the number of utterances. The UOR training objective is then formed as L_UOR = -Σ_{k=1}^{K} log P(ô_k | H), where ô_k denotes the predicted order.

Sentence Backbone Regularization
The sentence backbone regularization objective is designed to guide the model to distinguish the internal relations of the fact triplets extracted from each utterance, which helps improve the model's ability to capture the key facts of each utterance as well as their correctness. First, we apply a fact extractor to conduct dependency parsing of each sentence. We then extract the subject, the root verb, and the object tokens as an SVO triplet for each utterance. Inspired by Bordes et al. (2013), where the embedding of the tail entity should be close to the embedding of the head entity plus some vector that depends on the relationship, we assume that, given the dialogue input, the sum of the subject and verb representations should be as close to the object representation as possible in the hidden space, i.e., h_subj + h_verb ≈ h_obj. Consequently, based on the sequence hidden states h_i (i = 1, ..., n), we introduce a regularization term for the extracted facts: L_SBR = (1/m) Σ_{k=1}^{m} ||h_{subj_k} + h_{verb_k} - h_{obj_k}||, where m is the total number of fact triplets extracted and k indicates the k-th triplet; subj_k, verb_k, and obj_k are the indexes of the k-th triplet's subject, verb, and object.
In our implementation, since PrLMs take subwords as input while SVO extraction is performed at the word level, we use the hidden state of the first subword token as the representation of the original word, following the practice of Devlin et al. (2019a) for named entity recognition.
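The regularization term itself reduces to a simple distance computation once the triplet indices are available (in practice they would come from a dependency parser, e.g. a tool like spaCy; the function and toy hidden states below are illustrative assumptions, not the authors' code):

```python
import numpy as np

# Hedged sketch of the SBR term: given last-layer hidden states H and the
# (subject, verb, object) token indices of each extracted triplet, the
# regularizer pulls h_subj + h_verb toward h_obj via an L2 distance,
# in the spirit of TransE (Bordes et al., 2013).
def sbr_loss(H, triplets):
    losses = [np.linalg.norm(H[s] + H[v] - H[o]) for s, v, o in triplets]
    return float(np.mean(losses))

# Toy example: object representation equals subject + verb, so the loss is 0.
H = np.zeros((6, 4))
H[0] = [1, 0, 0, 0]   # subject token
H[2] = [0, 1, 0, 0]   # verb token
H[4] = [1, 1, 0, 0]   # object token = subject + verb
loss = sbr_loss(H, [(0, 2, 4)])
```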

Use of SPIDER Objectives
In this section, we introduce two training methods to take the newly proposed language modeling objectives into account, namely domain-adaptive post-training and multi-task fine-tuning, as illustrated in Figure 3.

Domain Adaptive Post-training
Similar to BERT, we also adopt masked language modeling (MLM) and next sentence prediction (NSP) as LM training tasks to enable our model to capture lexical and syntactic information from tokens in text. More details of the LM training tasks can be found in Devlin et al. (2019a). The overall post-training loss is the sum of the MLM, NSP, UOR, and SBR losses.
Our full model is post-trained with a joint loss combining the objectives above: L_DAP = L_MLM + λ_1 L_NSP + λ_2 L_UOR + λ_3 L_SBR, where λ_1, λ_2, λ_3 are hyper-parameters.
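The loss combination amounts to a weighted sum; the sketch below uses placeholder scalar losses in place of the real MLM/NSP/UOR/SBR computations, with all weights defaulting to 1 as in the paper's experimental setting:

```python
# Illustrative sketch of the domain-adaptive post-training loss; the four
# loss values here are placeholders standing in for the real loss tensors.
def post_training_loss(l_mlm, l_nsp, l_uor, l_sbr,
                       lam1=1.0, lam2=1.0, lam3=1.0):
    return l_mlm + lam1 * l_nsp + lam2 * l_uor + lam3 * l_sbr

total = post_training_loss(0.5, 0.1, 0.2, 0.3)
```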
After post-training the language model on the dialogue corpus, we load the post-trained weights, in the same way as BERT (Devlin et al., 2019a) is used, and fine-tune on the downstream tasks, i.e., response selection and dialogue reasoning as focused on in this work (details in Section 5.1).

Multi-task Fine-tuning
Since our objectives share the same input as the downstream tasks, an efficient alternative is multi-task fine-tuning (MTF), which directly trains the task-specific models along with our SPIDER objectives. We feed the permuted context to the dialogue comprehension model and combine the three losses for training: L_MTF = β_1 L_Task + β_2 L_UOR + β_3 L_SBR, where β_1, β_2, β_3 are hyper-parameters. To train a task-specific model for dialogue comprehension, the hidden states H are fed into a classifier with a fully connected layer and a softmax layer. We learn the model g(·, ·) by minimizing the cross-entropy loss on dataset D. Let Θ denote the parameters. For binary classification, as in the response selection task, the objective function L(D, Θ) can be formulated as: L(D, Θ) = -Σ_{i=1}^{N} [y_i log g(c_i, r_i) + (1 - y_i) log(1 - g(c_i, r_i))], where N denotes the number of examples. For a multiple-choice task like MuTual, the loss function is: L(D, Θ) = -Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log g(c_i, r_{i,c}), where C is the number of choices.
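The multi-task loss and the binary cross-entropy for response selection can be sketched as follows; the scalar losses and probabilities are toy placeholders, and the weights default to 1 as in the paper's setting:

```python
import math

# Binary cross-entropy over matching probabilities g(c_i, r_i), as used for
# the response selection objective (probs must lie strictly in (0, 1)).
def binary_ce(probs, labels):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels))

# Illustrative sketch of the multi-task fine-tuning loss: downstream task
# loss combined with the UOR and SBR objectives.
def mtf_loss(l_task, l_uor, l_sbr, beta1=1.0, beta2=1.0, beta3=1.0):
    return beta1 * l_task + beta2 * l_uor + beta3 * l_sbr

task_loss = binary_ce([0.9, 0.2], [1, 0])   # two context-response pairs
total = mtf_loss(task_loss, 0.1, 0.1)
```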

Figure 3: Illustration of the two training methods combining the response selection (RS), UOR, and SBR objectives: (a) Domain-Adaptive Post-Training and (b) Multi-task Fine-tuning.

Ubuntu Dialogue Corpus
Ubuntu (Lowe et al., 2015) consists of English multi-turn conversations about technical support collected from chat logs of the Ubuntu forum. The dataset contains 1 million context-response pairs for training, 0.5 million for validation, and 0.5 million for testing. In the training set, each context has one positive response generated by humans and one negative response sampled randomly. In the validation and test sets, each context has 9 negative responses and 1 positive response.
1 Strictly speaking, MuTual is a retrieval-based dialogue corpus in form, but its theme is English listening comprehension exams, so we regard it as a reading comprehension corpus in this work. Because the test set of MuTual is not publicly available, we conduct the comparison with our baselines on the Dev set for convenience.

Douban Conversation Corpus
Douban (Wu et al., 2017) differs from Ubuntu in the following ways. First, it is open-domain, with dialogues extracted from Douban Group. Second, response candidates in the test set are collected by using the last turn as the query to retrieve 10 response candidates, which are then labeled by humans. Third, there can be more than one correct response for a context.

E-commerce Dialogue Corpus
ECD (Zhang et al., 2018) is extracted from conversations between customers and service staff on Taobao. It covers over 5 types of conversations based on over 20 commodities. There are 1 million context-response pairs in the training set, and 0.5 million each in the validation and test sets.

Multi-Turn Dialogue Reasoning
MuTual (Cui et al., 2020) consists of 8,860 manually annotated dialogues based on Chinese students' English listening comprehension exams. For each context, there is one positive response and three negative responses. The difference from the above three datasets is that only MuTual is reasoning-based, reflecting more than 6 types of reasoning abilities.

Implementation Details
For the sake of computational efficiency, the maximum number of utterances is set to 20. The concatenated context, response, [CLS], and [SEP] tokens in one sample are truncated according to the "longest first" rule or padded to a fixed length, which is 256 for MuTual and 384 for the other three datasets. For the hyper-parameters, we empirically set λ_1 = λ_2 = λ_3 = β_1 = β_2 = 1. Our model is implemented in PyTorch and based on the Transformers library.2 We use BERT (Devlin et al., 2019a) as our backbone model and AdamW (Loshchilov and Hutter, 2019) as the optimizer. The batch size is 24 for MuTual and 64 for the others. The initial learning rate is 4 × 10^-6 for MuTual and 3 × 10^-5 for the others. The permutation ratio δ is set to 0.4 by default. We run 3 epochs for MuTual and 2 epochs for the others, and select the model that achieves the best validation result. The number of training epochs is 3 for DAP.
Our domain-adaptive post-training for the corresponding response selection tasks is based on the three large-scale dialogue corpora, Ubuntu, Douban, and ECD, respectively.3 The data statistics are in Table 1. Since domain-adaptive post-training is time-consuming, following previous studies (Gu et al., 2020a), we use bert-base-uncased and bert-base-chinese for the English and Chinese datasets, respectively. Because there is no appropriate domain data for the small-scale MuTual dataset, we only report the multi-task fine-tuning results with our SPIDER objectives, and also present results with other PrLMs such as ELECTRA (Clark et al., 2020) for general comparison.

Baseline Models
We include the following models for comparison: • Multi-turn matching models: Sequential Matching Network (SMN) (Wu et al., 2017), Deep Attention Matching Network (DAM) (Zhou et al., 2018), Deep Utterance Aggregation (DUA) (Zhang et al., 2018), and Interaction-over-Interaction (IoI) (Tao et al., 2019b) have been described in Section 2.2. Besides, Multi-Representation Fusion Network (MRFN) (Tao et al., 2019a) matches context and response with multiple types of representations, and Multi-hop Selector Network (MSN) utilizes a multi-hop selector to filter the necessary utterances and matches among them.

Evaluation Metrics
Following Lowe et al. (2015) and Wu et al. (2017), we calculate the proportion of true positive responses among the top-k selected responses from the list of n available candidates for one context, denoted as R_n@k. Besides, additional conventional information retrieval metrics are employed on Douban: Mean Average Precision (MAP) (Baeza-Yates et al., 1999), Mean Reciprocal Rank (MRR), and Precision at position 1 (P@1).
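These ranking metrics can be sketched as below; the functions are illustrative implementations over per-context candidate scores and binary relevance labels, not the evaluation scripts used in the paper:

```python
# R_n@k: fraction of contexts whose top-k ranked candidates contain a
# positive response; MRR: mean reciprocal rank of the first positive.
def recall_at_k(scores, labels, k):
    hits = 0
    for s, l in zip(scores, labels):
        ranked = [y for _, y in sorted(zip(s, l), key=lambda p: -p[0])]
        hits += int(any(ranked[:k]))
    return hits / len(scores)

def mrr(scores, labels):
    total = 0.0
    for s, l in zip(scores, labels):
        ranked = [y for _, y in sorted(zip(s, l), key=lambda p: -p[0])]
        total += 1.0 / (ranked.index(1) + 1)
    return total / len(scores)

# Two contexts with three candidates each; the first ranks its positive
# at position 1, the second at position 3.
scores = [[0.9, 0.3, 0.5], [0.2, 0.8, 0.1]]
labels = [[1, 0, 0], [0, 0, 1]]
r1 = recall_at_k(scores, labels, 1)
m = mrr(scores, labels)
```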

Results
Tables 2-3 show the results on the four benchmark datasets. We have the following observations: 1) Generally, the previous models based on multi-turn matching networks perform worse than simple PrLM-based ones, illustrating the power of contextualized representations in context-sensitive dialogue modeling. The PrLM performs even better when equipped with our SPIDER objectives, verifying the effectiveness of dialogue-aware language modeling, where inter-utterance position information and inner-utterance key facts are better exploited. Compared with SA-BERT, which involves a more complex architecture and more parameters by injecting extra speaker-aware embeddings, SPIDER keeps the same model size as the backbone BERT and even surpasses SA-BERT on most of the metrics.
2) In terms of the training methods, DAP generally works better than MTF, owing to its two-step procedure, which includes pure LM-based post-training. According to the ablation study in Table 4, both dialogue-aware LM objectives are effective, and combining them (SPIDER) gives the best performance, which verifies the necessity of modeling utterance order and factual correctness. We also notice that UOR shows better performance than SBR in DAP, while it gives a relative drop in MTF. The most plausible reason is that UOR permutes the utterances in the dialogue context, which helps the language model learn utterance-level coherence during post-training. However, in MTF, the major objective is the downstream dialogue comprehension task, and permuting the context could introduce negative effects into the downstream task training.

Influence of Permutation Ratio
For the UOR objective, a hyper-parameter δ controls the maximum number of permutations (as described in Section 3.2.1), which could influence overall model performance. To investigate this effect, we vary the permutation ratio over {0, 20%, 40%, 60%, 80%, 100%}. The result is depicted in Figure 4: our model outperforms the baseline in general, showing that the permutation indeed strengthens the baseline.

Comparison with Different Context Length
Context length can be measured by the number of turns and by the average utterance length in a conversation. We split the test instances from the Ubuntu dataset into several buckets and compare SPIDER (with UOR) against the BERT baseline. According to the results depicted in Figure 5, SPIDER performs much better on contexts with long utterances, and it is also robust, being significantly and consistently superior to the baseline. The results indicate the benefits of modeling utterance order for dialogue comprehension.

Human Evaluation about Factual Correctness
To compare the improvement of SPIDER over the baseline in factual correctness, we extract the error cases of the BERT baseline on MuTual (102 in total) and find that 42 of them (41.2%) are correctly answered by SPIDER.

Conclusion
In this paper, we focus on the task-related adaptation of pre-trained language models and propose SPIDER (Structural Pre-traIned DialoguE Reader), a structural language modeling method to capture dialogue-exclusive features. To explicitly model the coherence among utterances and the key facts in each utterance, we introduce two novel dialogue-aware language modeling tasks: utterance order restoration and sentence backbone regularization. Experiments on widely used multi-turn dialogue comprehension benchmarks show its superiority over baseline methods. Our work reveals a way to make better use of structure learning in the contextualized representations of pre-trained language models, and offers insights on how to adapt language modeling training objectives to downstream tasks.