RAST: Domain-Robust Dialogue Rewriting as Sequence Tagging

The task of dialogue rewriting aims to reconstruct the latest dialogue utterance by copying the missing content from the dialogue context. Until now, the existing models for this task suffer from the robustness issue, i.e., performances drop dramatically when testing on a different dataset. We address this robustness issue by proposing a novel sequence-tagging-based model so that the search space is significantly reduced, yet the core of this task is still well covered. As a common issue of most tagging models for text generation, the model’s outputs may lack fluency. To alleviate this issue, we inject the loss signal from BLEU or GPT-2 under a REINFORCE framework. Experiments show huge improvements of our model over the current state-of-the-art systems when transferring to another dataset.


Introduction
Recent years have witnessed increasing attention to conversation-based tasks, such as conversational question answering (Choi et al., 2018; Reddy et al., 2019), dialogue response generation (Li et al., 2017; Zhang et al., 2018), dialogue state tracking (Eric et al., 2020; Zeng et al., 2021) and dialogue understanding, mainly due to increasing commercial demands. However, current models still face tremendous challenges in representing multi-turn dialogues, due to the frequent omission (a.k.a. ellipsis) and coreference that people naturally use in conversations for brevity. Specifically, recent work (Su et al., 2019) has shown that ellipsis and coreference can exist in more than 70% of dialogue utterances. To tackle this problem, people have proposed coreference resolution and zero pronoun recovery. However, the state-of-the-art performances on these tasks are still far from satisfactory, not to mention the situations they do not cover, such as when a whole verb phrase is omitted.

Table 1: An example dialogue including the context utterances ($u_1$ and $u_2$), the latest utterance ($u_3$) and the rewritten utterance ($u'_3$).

* Work done while J. Hao was interning and L. Wang was working at Tencent AI Lab. † Corresponding author.
Recently, the task of dialogue utterance rewriting (Su et al., 2019; Pan et al., 2019; Elgohary et al., 2019) was proposed for explicitly representing multi-turn dialogues. The task aims to reconstruct the latest dialogue utterance into a new utterance that is semantically equivalent to the original one and can be understood without referring to the context. From another point of view, it integrates the recovery of both coreference and omission. As shown in Table 1, the incomplete utterance $u_3$ omits "上海 (Shanghai)" and refers to "经常阴天下雨 (always raining)" with the pronoun "这样 (this)". By explicitly rewriting the dropped information into the latest utterance, the downstream dialogue model only needs to take the last utterance as input. Thus the burden of long-range reasoning can be largely relieved.
Most previous efforts (Su et al., 2019; Pan et al., 2019; Elgohary et al., 2019; Xu et al., 2020) consider this task as a standard text-generation problem, adopting a sequence-to-sequence model with a copy mechanism (Gulcehre et al., 2016; Gu et al., 2016; See et al., 2017). They have demonstrated almost ready-to-use performances on a test set drawn from the same data source as the training set. However, they are not robust: our experiments show that their performances can drop dramatically (by roughly 33 BLEU4 (Papineni et al., 2002) points and 44 percent of exact match) on another test set created from a different data source (not necessarily from a totally different domain). We argue that it may not be the best practice to model utterance rewriting as standard text generation. One main reason is that text generation introduces an overly large search space, while a rewriting output (e.g., $u'_3$ in Table 1) always keeps the core semantic meaning of its input (e.g., $u_3$). Besides, exposure bias (Wiseman and Rush, 2016) can further exacerbate the problem for test cases that are not similar to the training set, resulting in outputs that convey different semantic meanings from the inputs.
In this paper, we propose a novel solution that treats utterance rewriting as multi-task sequence tagging. In particular, for each input word, we decide whether to delete it or not, and at the same time, we choose which span from the dialogue context needs to be inserted in front of the current word. In this way, our solution enjoys a far smaller search space than the generation-based approaches.
Since our model does not directly take features from the word-to-word interactions of its output utterances, its outputs may lack fluency. To encourage more fluent outputs, we propose to inject additional supervision from two popular metrics, i.e., sentence-level BLEU (Chen and Cherry, 2014) and the perplexity of a pretrained GPT-2 (Radford et al., 2019) model, using the framework of "REINFORCE with a baseline" (Williams, 1992). Sentence-level BLEU is computationally efficient, but it requires references and thus may only provide domain-specific knowledge. Conversely, the perplexity from GPT-2 is reference-free, and it gives more guidance in open-domain scenarios thanks to the large-scale pretraining.
Experiments on two dialogue rewriting benchmarks show that our model can give huge improvements (14.6 in BLEU4 score and 18.9 percent of exact match) over the current state-of-the-art model for cross-dataset evaluation. More analysis shows that the outputs of our model keep more semantic information from the inputs. Our code is available at https://github.com/freesunshine0316/RaST-plus.

Related Work
Initial efforts (Su et al., 2019; Elgohary et al., 2019) treat dialogue utterance rewriting as a standard text generation problem, adopting sequence-to-sequence models with a copy mechanism to tackle this problem. Later work (Pan et al., 2019; Huang et al., 2021) explores task-specific features for additional gains in performance. For instance, Pan et al. (2019) adopt a pipeline-based method, where all context words that need to be inserted during rewriting are identified in the first step. The second step adopts a pointer generator that takes the outputs of the first step as additional features to produce the output. Xu et al. (2020) train a model of semantic role labeling (SRL) to highlight the core meaning (e.g., who did what to whom) of each input dialogue to prevent their rewriter from violating this information. To obtain an accurate SRL model on dialogues, they manually annotate SRL information for more than 27,000 dialogue turns, which is time-consuming and costly. RUN casts this task into a semantic segmentation problem, a major task in computer vision. In particular, their model generates a word-level matrix, which contains the operations of substitution and insertion, for each original utterance. They adopt a heavy model that takes 10 convolution layers in addition to the BERT encoder. None of the existing efforts mention the robustness issue, a critical aspect for the usability of this task. Besides, they only compare performances under automatic metrics (e.g., BLEU). We take the first step to address this severe robustness issue, and we adopt multiple measures for comprehensive evaluation. Besides, we propose a novel model based on sequence tagging for solving this task, and our model has a much smaller search space than previous models.
Sequence tagging for text generation. Typical text-generation problems (e.g., machine translation) have two intrinsic properties: (1) the number of predictions cannot be determined by the inputs, and (2) the candidate space for each prediction is usually very large. For these reasons, sequence tagging is not commonly adopted for text-generation tasks. Recently, Malmi et al. (2019) proposed a model based on sequence tagging for sentence fusion and sentence splitting, and they show that their model outperforms a vanilla sequence-to-sequence baseline. In particular, their model can decide whether to keep or delete each input word and what phrase needs to be inserted in front of it. As a result, they have to extract a large phrase table from the training data, which incurs unavoidable computation for choosing phrases from the table. Their approach also struggles on unseen cases where the phrase table has limited coverage. Though we also convert our original problem into a multi-task tagging problem, we predict what span is to be inserted, avoiding the issues caused by using a phrase table. Besides, we study injecting richer supervision signals to improve the fluency of outputs, which is a common issue for tagging-based approaches to text generation, as they do not directly model word-to-word dependencies. Finally, we are the first to apply sequence tagging to dialogue rewriting, showing much better performances than those of strong BERT-based baselines.

Baseline: TRANS-PG+BERT
Our baseline consists of a BERT (Devlin et al., 2019) encoder and a Transformer (Vaswani et al., 2017) decoder with a copy mechanism. The input tokens $X = (x_1, \dots, x_N)$ are the concatenation of the current dialogue context $c = (u_1, \dots, u_{i-1})$ and the latest utterance $u_i$. The BERT encoder is first adopted to represent the input with contextualized embeddings:
$$E = \mathrm{BERT}(X) \quad (1)$$
Next, the Transformer decoder with a copy mechanism is adopted to generate a rewriting output $u'_i = (y_1, \dots, y_M)$ one token at a time:
$$p^{attn}_t, s_t = \mathrm{TransDecoder}(y_{<t}, E)$$
where TransDecoder is the Transformer decoder that returns the attention probability distribution $p^{attn}_t$ over the encoder states $E$ and the latest decoder state $s_t$ for each step $t$. Following See et al. (2017), the generation probability $\theta_t$ for timestep $t$ is calculated from the weighted sum of the encoder-decoder cross-attention distribution and the encoder hidden states, where $w$ represents the model parameter. In this way, the copy mechanism encourages copying words from the input tokens. The TRANS-PG baseline is trained with the standard cross-entropy loss:
$$\mathcal{L} = -\sum_{t=1}^{M} \log p(y_t \mid y_{<t}, X; \theta) \quad (6)$$
where $\theta$ represents all model parameters.

RAST: Rewriting as Sequence Tagging
In this section, we describe how to convert the dialogue rewriting task into a multi-task sequence tagging problem.
Task description. Our analysis shows that dialogue rewriting mainly handles two linguistic phenomena: coreference and omission. To recover a coreference, the model has to replace a pronoun in the current utterance with the phrase it refers to in the dialogue context. To recover an omission, it needs to insert the corresponding phrase at the omission position. Accordingly, we cast dialogue rewriting as a sequence tagging task by introducing two types of tags for each word $x_n$:
• Deletion ∈ {0, 1}: the word $x_n$ is deleted (i.e. 1) or not (i.e. 0);
• Insertion: [start, end]: the span of the dialogue context to be inserted in front of $x_n$, or [-1, -1] if nothing is inserted.
Figure 1 shows an example, where the word "这样 (like this)" corresponds to a coreference, and the word "冬天 (winter)" corresponds to an omission in front of it.

Constructing annotated data. The gold tags for dialogue utterance rewriting are not naturally available. In response to this problem, we construct the annotated data based on the alignment between the input and reference utterances. Specifically, we employ the longest common sub-sequence (LCS) algorithm to generate the word alignments between the input utterance $u_i$ and the reference utterance $u'_i$ for each instance (the black lines in Figure 1). The LCS algorithm is based on dynamic programming and has a time complexity of $O(|u_i| \times |u'_i|)$.
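The LCS alignment step can be illustrated with a short sketch. This is a generic dynamic-programming LCS, not the authors' released code; the function name `lcs_alignment` and the English stand-in tokens are assumptions made for illustration only.

```python
from typing import List, Tuple


def lcs_alignment(src: List[str], ref: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with src[i] == ref[j] along one LCS.

    Hypothetical helper: runs in O(len(src) * len(ref)) time and space.
    """
    n, m = len(src), len(ref)
    # dp[i][j] holds the LCS length of the suffixes src[i:] and ref[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src[i] == ref[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk forward through the table to recover one optimal alignment.
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        if src[i] == ref[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# English stand-ins for the tokens of the running example (assumed segmentation).
source = ["winter", "is", "like_this", "."]
reference = ["Shanghai", "winter", "is", "always", "raining", "."]
alignment = lcs_alignment(source, reference)
```

In this toy case the unaligned reference tokens ("Shanghai" and "always raining") are the ones that would be looked up in the dialogue context, and the unaligned input token ("like_this") is the one that would be deleted.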
For the words in the reference utterance that are not aligned, we search for them in the dialogue context and obtain their span (e.g. the words highlighted in color). Given the aligned instance, we construct the annotation tags by traversing the alignments from left to right and comparing each alignment with its preceding one under the following rules:
R1. If the two alignments are adjacent in both utterances (e.g. "就是 (is)"), there is no change for the current word, which is assigned the tags {Deletion:0, Insertion:[-1, -1]}.
R2. If the two alignments are only adjacent in the input utterance (e.g. "冬天 (winter)"), this generally corresponds to an omission. We insert the reference words between the two alignments (i.e. "上海 (Shanghai)") in front of the current input word. Accordingly, we assign the current word "冬天 (winter)" the tags {Deletion:0, Insertion:[1, 1]}.
R3. If the two alignments are only adjacent in the reference utterance, we simply delete the input words between the two alignments, and assign them the tags {Deletion:1, Insertion:[-1, -1]}. This situation is rare in the task of dialogue utterance rewriting.
R4. If the two alignments are not adjacent in either utterance (e.g. "。 (.)"), this generally corresponds to a coreference that requires a replacement. We first delete the input words between the two alignments (i.e. "这样 (like this)"), then insert the corresponding target phrase (i.e. "经常 阴天 下雨 (always raining)") in front of the left-most deleted input word.
Both rules R2 and R4 require finding phrases from the dialogue context, as highlighted in color in Figure 1. If there are multiple candidates in the dialogue context, we choose the one that is closest to the input utterance to avoid long-range dependency. If no candidate can be found, we consider the instance not covered by our approach. In the REWRITE and RESTORATION datasets used in this work, we found that 6.0% and 6.5% of the instances, respectively, are not covered, and we deleted them from the datasets.

Model Architecture
Figure 2 shows the architecture of our model. For a fair comparison, it takes the same BERT-based encoder (Equation 1) as the baseline to represent each input. For simplicity, we directly apply classifiers to predict the corresponding tags for each input word. In particular, to determine whether each word $x_n$ in the current utterance $u_i$ should be kept or deleted, we use a binary classifier:
$$p(d_n \mid X, n) = \mathrm{softmax}(W_d e_n + b_d) \quad (7)$$
where $W_d$ and $b_d$ are learnable parameters, $d_n$ is the binary classification result, and $e_n$ is the BERT embedding for $x_n$.
Moreover, we cast span prediction as machine reading comprehension (MRC) (Rajpurkar et al., 2016), where a predicted span corresponds to an MRC target answer. For each input token $x_n \in u_i$, we follow previous work on MRC to predict the start position $s^{st}_n$ and end position $s^{ed}_n$ of the target span $s_n$, using a separate self-attention mechanism for each:
$$p(s^{st}_n \mid X, n) = \mathrm{Attn}_{start}(E, e_n) \quad (8)$$
$$p(s^{ed}_n \mid X, n) = \mathrm{Attn}_{end}(E, e_n) \quad (9)$$
where $\mathrm{Attn}_{start}$ and $\mathrm{Attn}_{end}$ are the self-attention layers for predicting the start and end positions of a span. We use the standard additive attention mechanism (Bahdanau et al., 2014) to perform the attention function. The probability for the whole span $s_n$ is:
$$p(s_n \mid X, n) = p(s^{st}_n \mid X, n)\, p(s^{ed}_n \mid X, n) \quad (10)$$
where $s^{st}_n$ is no greater than $s^{ed}_n$. Given an example pair $(c, u_i)$ with $X = [c; u_i]$, the overall loss function for the multi-task sequence labeling is defined as the standard cross-entropy loss over the gold tags:
$$\mathcal{L}_{tag} = -\sum_{x_n \in u_i} \big[ \log p(d_n \mid X, n) + \log p(s_n \mid X, n) \big] \quad (11)$$
where the terms are defined in Equations 7 and 10.
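To make the tagging formulation concrete, the sketch below shows two steps that follow from the description above: choosing a span under the constraint that the start is no greater than the end, and turning predicted tags into a rewritten utterance. It is an illustrative sketch rather than the released implementation; the names `best_span` and `apply_tags`, the inclusive span indexing into the context tokens, and the English stand-in tokens are all assumptions.

```python
from typing import List, Sequence, Tuple


def best_span(p_start: Sequence[float], p_end: Sequence[float]) -> Tuple[int, int]:
    """Pick (start, end) maximising p_start[start] * p_end[end] with start <= end."""
    best, best_score = (0, 0), -1.0
    for st, ps in enumerate(p_start):
        for ed in range(st, len(p_end)):
            score = ps * p_end[ed]
            if score > best_score:
                best, best_score = (st, ed), score
    return best


def apply_tags(
    context: List[str],
    utterance: List[str],
    deletions: List[int],
    insertions: List[Tuple[int, int]],
) -> List[str]:
    """Build the rewritten utterance from per-word tags.

    Assumed conventions: insertion spans are inclusive indices into `context`,
    and (-1, -1) means that nothing is inserted in front of the word.
    """
    output: List[str] = []
    for word, delete, (st, ed) in zip(utterance, deletions, insertions):
        if st >= 0:
            # Insert the context span in front of the current word.
            output.extend(context[st:ed + 1])
        if not delete:
            output.append(word)
    return output


# Toy example with hypothetical context tokens.
context = ["Shanghai", "is", "always", "raining", "lately"]
utterance = ["winter", "is", "like_this", "."]
rewritten = apply_tags(
    context,
    utterance,
    deletions=[0, 0, 1, 0],
    insertions=[(0, 0), (-1, -1), (2, 3), (-1, -1)],
)
span = best_span([0.1, 0.7, 0.2], [0.6, 0.1, 0.3])
```

Note that in the `best_span` call the unconstrained maximum would pair start 1 with end 0, which is not a valid span, so the constrained search returns a different answer. Because every output token is either kept from the input or copied from the context, the decoder cannot produce words outside these two sources, which is one way to see why the search space is small.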

Enhancing Fluency with Additional Supervision
By converting dialogue utterance rewriting into a sequence tagging task, our model enjoys better efficiency and a smaller search space. However, a potential side effect is that our outputs may lack fluency, because our approach does not directly model word-to-word dependencies. We explore sentence-level BLEU (Chen and Cherry, 2014) and GPT-2 (Radford et al., 2019) as additional training signals to improve the fluency of our generated outputs, adopting the framework of "REINFORCE with a baseline" to inject these supervision signals. In more detail, we first generate two candidate sentences: one by sampling the tags at each position of the input utterance according to the distributions in Equations 7 and 10, the other by greedily choosing the tags that the model considers best. Next, the RL objective for the sample $(c, u_i)$ is calculated from the rewards of these two candidates, where $\hat{u}^s_i$ and $\hat{u}^g_i$ represent the candidate sentences obtained by sampling and by greedy "argmax", respectively, and $r(\cdot, \cdot)$ is the reward function, which can correspond to either sentence-level BLEU or the perplexity under the GPT-2 model. Finally, we follow previous work by combining this additional loss with the tagging loss, where $\lambda$ is a constant weighting factor that is empirically set to 0.5.
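The following sketch illustrates one common way to realise "REINFORCE with a baseline" when the greedy output serves as the baseline. It is not taken from the paper: the exact form of the RL objective and of the loss combination is an assumption here (an additive combination weighted by λ), and the function names `self_critical_loss` and `combined_loss` are hypothetical.

```python
import math
from typing import Sequence


def self_critical_loss(
    sample_log_probs: Sequence[float],
    reward_sample: float,
    reward_greedy: float,
) -> float:
    """REINFORCE loss with the greedy candidate's reward as the baseline.

    sample_log_probs: log-probabilities of the sampled tags, one per decision.
    Assumed form: -(r_sample - r_greedy) * log p(sampled tags).
    """
    log_p = sum(sample_log_probs)
    advantage = reward_sample - reward_greedy
    return -advantage * log_p


def combined_loss(tagging_loss: float, rl_loss: float, lam: float = 0.5) -> float:
    """Assumed additive combination: L_tag + lambda * L_rl."""
    return tagging_loss + lam * rl_loss


# Toy numbers: the sampled rewrite scores higher than the greedy one,
# so minimising the loss increases the probability of the sampled tags.
rl = self_critical_loss([math.log(0.5), math.log(0.5)], reward_sample=0.8, reward_greedy=0.6)
total = combined_loss(tagging_loss=1.2, rl_loss=rl)
```

With a perplexity-based reward, lower perplexity is better, so the score would need to be negated or otherwise transformed before being used as `reward_sample` and `reward_greedy`; how this is done is not specified in the text above.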

Experiments
We study the robustness of our tagging-based model on two benchmarks for dialogue rewriting.

Setup
Datasets. We conduct experiments on two popular dialogue rewriting datasets: REWRITE (Su et al., 2019) and RESTORATION (Pan et al., 2019). Both datasets are created by first crawling multi-turn dialogues from popular Chinese social media platforms, before asking human annotators to generate the rewriting result for the last turn of each dialogue.

Model settings. We implement the baseline and our model on top of a BERT-base model (Devlin et al., 2019), and we use Adam (Kingma and Ba, 2015) as the optimizer, setting the learning rate to 3e-5 as determined by a development experiment. For the reinforcement learning stage, we use as the reward function either the sentence-level BLEU score with "Smoothing 3" (Chen and Cherry, 2014) or the perplexity score based on a Chinese GPT-2 model trained on massive dialogues (see footnote 4). It is worth noting that the GPT-2 model is not fine-tuned during the reinforcement learning stage.
Comparing models. In addition to the TRANS-PG+BERT baseline, we compare our approach with several state-of-the-art dialogue rewriting models that are also based on BERT. CSRL (Xu et al., 2020) leverages additional information on conversational semantic role labeling (CSRL) to enhance the BERT representation, which requires extra human effort for CSRL annotation. RUN treats this problem as semantic segmentation by predicting a word-level edit matrix for the input utterance. For a fair comparison, we either run their released model or ask the authors to generate their outputs on our data.

Evaluation
We use both automatic metrics and human evaluations to compare our proposed model with other approaches. For the automatic metrics, we follow previous work to use BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and the percent of sentence-level exact match (EM score).

Footnote 4: https://github.com/yangjianxin1/GPT2-chitchat

Main Results
Training on REWRITE
Table 3 shows the results when all comparing models are trained on the REWRITE dataset, before evaluating on the in-domain REWRITE and the RESTORATION test data for robustness examination. On the REWRITE test set, our tagging-based models (Rows 4-6) are much better than the TRANS-PG+BERT baseline, and they achieve comparable performances with RUN, the previous state-of-the-art model. RUN usually gets high numbers on BLEU1 without consistent improvements on higher-order BLEU scores. Our observation shows that it tends to insert context words into wrong places, hurting the number of higher-order n-gram matches.
Among our tagging-based models, we find that injecting additional training signals (Rows 5-6) does not help the in-domain performance, which is already very decent for practical use. The reason can be that optimizing with external rewards dilutes the main signal: the cross-entropy loss on the in-domain training data. This is especially evident for RAST+RL-GPT2, where the perplexity from an external GPT-2 model may not be fully aligned with the training data. Comparatively, sentence-level BLEU is more consistent with the main signal than GPT-2, explaining why RAST+RL-BLEU reports slightly higher in-domain numbers than RAST+RL-GPT2. Please note that our main focus is the robustness issue and that slight performance changes on in-domain data will not affect practical use. Our observation shows that both types of rewards can improve the fluency of model outputs, especially on out-of-domain datasets.
When switching from the in-domain REWRITE test set to the out-of-domain RESTORATION test set, all comparing systems (Rows 7-9) perform much worse than in the in-domain situation, where the drops are 27, 23 and 20 BLEU4 points for TRANS-PG+BERT, CSRL and RUN, respectively. Conversely, our models are much more robust to this change, resulting in large advantages of 14.6 points in BLEU4, 4.0 points in ROUGE-L and 9.2 points in exact match over RUN, the previous state-of-the-art model.

Training on RESTORATION
As shown in Table 4, we also conduct experiments in the opposite direction by training all models on the RESTORATION training set to further verify the above conclusions. Similarly, our models achieve comparable performances with all comparing systems on the in-domain test set, while they are more advantageous on the test set for robustness examination. The advantages (10.8 in BLEU4, 2.2 in exact match) are smaller than in the previous direction, because the RESTORATION training set is nearly 11 times larger than the REWRITE training set. This shows that our model is less data-hungry than the comparing systems, which matters because data annotation is usually very costly.
As another interesting fact, the advantage of RUN on BLEU1 still does not carry over to higher-order BLEU scores. This is consistent with the situation in Table 3, where RUN tends to insert words into wrong contexts. Comparing the two types of extra signals, sentence-level BLEU is better on the in-domain test set, while GPT-2 is better at helping the robustness on other datasets. This is intuitive as our GPT-2 model has been pretrained on massive data.

Human Evaluation
In addition to these automatic metrics, we also conduct a human evaluation for each system pair to further compare their rewriting quality, and we focus on the robustness examination scenario. Specifically, we first train the comparing systems on REWRITE, before using them to decode 500 randomly selected test examples from RESTORATION. Finally, we ask 3 graders to choose a winner from each pair of rewriting outputs. The evaluation criteria are based on fluency and adequacy, where adequacy mainly considers two aspects: 1) how much meaning is retained; and 2) how many coreference and omission situations are recovered. All graders agree on 88.8% of the cases. As shown in Table 5, the number of winning cases for RAST+RL-GPT2 is much larger than the number of losing cases when comparing with any other system. This further confirms the effectiveness of our model. Many losing cases are due to the lack of fluency caused by object fronting, although most situations of this type are still understandable by humans. Some examples are discussed in Section 6.5. Our model loses the fewest samples against RUN, because RUN also lacks fluency due to improper context word insertion (as mentioned in Section 6.2).

Evaluating with Semantic Role Labeling
We also compare our model with the baselines regarding the "semantic correctness" of their rewriting outputs, taking semantic role labeling (SRL) as the form of semantics on their outputs. By doing so, we can focus on comparing the core meaning, ignoring other functional words and phrases. Specifically, we choose a state-of-the-art SRL system (Che et al., 2020) to annotate rewriting outputs and references. Next, the precision, recall and F1 scores for each system are calculated by comparing the SRL results of its outputs with those of the references. Table 6 lists the performances on both transfer (robustness examination) scenarios, where our model reports consistently higher numbers than all other systems. This indicates that our model also makes improvements regarding the core semantic meaning. The relative differences on recall, precision and F1 are also consistent with the differences under automatic metrics (Tables 3 and 4) and human evaluation (Table 5).

Table 7 gives 3 test examples that indicate the representative situations we find. The first example illustrates the cases when RUN inserts context words (e.g. "意 (meaning)") into wrong places. This hurts the fluency and high-order BLEU scores as mentioned in Section 6.2. The third example shows the situations when the TRANS-PG+BERT baseline fails by repeating words (e.g. "考口语考口语 (it's oral test it's oral test)"). This is a common situation for generation-based models, especially on unseen data samples. Conversely, this situation rarely happens to our model, as it is based on sequence tagging. Lastly, the second example corresponds to the situation of referring to a complex concept (e.g. "西安到商洛的顺风车 (a free ride from Xi'an to Shangluo)"). For these cases, it is easier for our model to get the correct span. This is because our model directly predicts the span boundaries, so it has a smaller search space than previous approaches that generate the concept word by word.
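The SRL-based precision, recall and F1 described earlier in this section can be sketched as a set comparison. The details below are assumptions rather than the paper's exact protocol: each SRL analysis is flattened into (predicate, role, argument) triples, scores are micro-averaged over examples, and the name `srl_prf` is hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical flattening of an SRL analysis: (predicate, role, argument).
Triple = Tuple[str, str, str]


def srl_prf(
    pred_sets: List[Set[Triple]], gold_sets: List[Set[Triple]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over SRL triples."""
    tp = n_pred = n_gold = 0
    for pred, gold in zip(pred_sets, gold_sets):
        tp += len(pred & gold)  # triples found in both output and reference
        n_pred += len(pred)
        n_gold += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: one output shares one of its two triples with a
# reference that contains three triples.
outputs = [{("rain", "A0", "Shanghai"), ("rain", "TMP", "winter")}]
references = [
    {("rain", "A0", "Shanghai"), ("rain", "TMP", "always"), ("rain", "LOC", "here")}
]
precision, recall, f1 = srl_prf(outputs, references)
```

Because only predicate-argument triples are compared, function words and surface word order do not affect the score, which matches the stated goal of focusing on the core meaning.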

Evaluation on Uncovered Examples
As mentioned earlier, a few examples may not be covered by our model that treats rewriting as sequence tagging. To get a more comprehensive evaluation, we further compare the TRANS-PG+BERT baseline and our model on the uncovered test examples of both REWRITE and RESTORATION datasets.

Conclusion
In this paper, we addressed the robustness issue of dialogue utterance rewriting, which is crucial for its usability on real applications. We proposed a novel tagging-based approach that results in a significantly smaller search space than the existing methods on this task, and we introduced additional supervision (e.g. by GPT-2) to improve the fluency of model outputs. Experiments with automatic metrics, human evaluation and semantic matching show that our model is much more robust than the previous state-of-the-art system without sacrificing its in-domain performances. Future work includes evaluating this tagging framework on other English benchmarks, such as SMCalFlow (Andreas et al., 2020) and TreeDST (Cheng et al., 2020).