Incomplete Utterance Rewriting by a Two-Phase Locate-and-Fill Regime

Rewriting incomplete and ambiguous utterances can improve dialogue models' understanding of the context and help them generate better responses. However, existing end-to-end models suffer from an overly large search space, which degrades the quality of the rewritten results. We propose a two-phase rewriting framework which first predicts the empty slots in the utterance that need to be completed, and then generates the text to be filled into each position. Our framework is simple to implement, fast to run, and achieves state-of-the-art results on several public rewriting datasets.


Introduction
In multi-turn dialogues, speakers naturally tend to make heavy use of references or omit complex discourse to save effort. Thus natural language understanding models usually need the dialogue history to understand the true meaning of the current utterance. The existence of such incomplete utterances increases the difficulty of modeling dialogues. The sources of incompleteness of an utterance can be divided into two categories: coreference and ellipsis. The task of resolving these two kinds of incompleteness is called Incomplete Utterance
Rewriting (IUR). As shown in Figure 1, the third utterance of this multi-turn dialogue is incomplete. If this utterance is taken out alone, without context, we cannot tell what "one" means or where to buy it. The fourth utterance is a rewriting of the third one. We can see that "one" in the third utterance is replaced by "J.K. Rowling's new book". In addition, the place adverbial "from the book store in town" is inserted after "for me". In today's industrial-strength dialogue systems and applications, due to stringent requirements on running time and maintenance cost, single-turn models are much preferred over multi-turn models. If an incomplete single-turn utterance can be completed, it becomes understandable without the context, and the cost of downstream NLP tasks, such as intention extraction and response generation, is reduced.
Figure 1 shows that all the words added in the rewritten utterance except "from" come from the context. Inspired by this, many early rewriting works used pointer networks (Vinyals et al., 2015) or sequence-to-sequence models with a copy mechanism (Gu et al., 2016; See et al., 2017) to directly copy parts of the context into the target utterance. More recently, pre-trained language models such as T5 (Raffel et al., 2020) have succeeded in many NLP tasks, and T5 appears to be a plausible choice for utterance rewriting as well. However, the IUR task differs from other generation tasks in that new material typically only needs to be added at one or two specific locations in the original utterance. That is, the changes to the utterance are localized. For example, a typical operation is adding modifiers before or after a noun. On the contrary, end-to-end text generation models such as T5 may not preserve the syntactic structure of the input, which can cause the loss of important information and the introduction of wrong information into the output, as illustrated by the following two example outputs generated by T5.
• Can you buy J.K. Rowling's new book? (Losing the original structure)
• Can you publish new book for me ? (Introducing wrong information)

Another problem of end-to-end pre-trained models, which generate the rewritten utterance from scratch, is that they generally incur a large search space and are therefore not only imprecise but also inefficient. In order to address the large search space, Hao et al. (2021a) treated utterance rewriting as a sequence tagging task. For each input word, they predict whether it should be deleted and the span that should replace it. Liu et al. (2020) formulated IUR as a syntactic segmentation task. They predict the segmentation operations required on the utterance to be rewritten. However, these methods still do not take the important step of predicting the site of the rewrite, particularly the position within the syntactic structure of the input utterance. If a model can learn the syntactic structure of the target sentence, it can predict which parts of the sentence need to be modified, i.e., which words need to be replaced and where new words need to be inserted. After that, the model only needs to fill in these predicted positions. These two tasks are relatively simple to perform, and together they avoid the problems above. Our approach is based on this intuition.
In order to effectively utilize the syntactic structure of the sentence to be rewritten, we divide the IUR task into two phases. The first phase predicts which positions in the utterance need to be rewritten (covering both coreference and ellipsis). The second phase fills in the predicted positions. In the first phase, we use a sequence annotation method to predict the locations of coreference and ellipsis in the utterance. In the second phase, we take the utterance with blanks as input and directly predict the words required at each blank position. By separating the original rewriting task into two relatively simple phases, our model performs the best among recent state-of-the-art rewriting models.
Our main contributions are as follows.
• A two-phase framework for solving the incomplete utterance rewriting task is proposed.

It can complete the Incomplete Utterance Rewriting (IUR) task. (Section 2)
• An algorithm, based on the longest common subsequence (LCS) algorithm, for aligning the two sentences before and after rewriting. It succinctly and efficiently generates the two kinds of data needed for predicting the positions to be rewritten (the first phase) and for filling the blanks (the second phase). (Section 2.1.2)
• We carried out experiments on 5 datasets, and the results show that our two-phase framework achieves state-of-the-art results. (Section 3)


Approach

Our framework is divided into two phases: locating positions to rewrite and filling the blanks. Figure 2 is a brief schematic of the framework. Phase 1 can be done either by heuristic rules or by supervision. Phase 2 can be done with a seq2seq text generation model. We give the details of these phases next.

Locating Positions to Rewrite
We designed an unsupervised and a supervised method to locate positions to rewrite.The two methods are described below.

Unsupervised Rule-based Method
We first implement a rule-based method for the first phase of our problem, aiming to predict the blanks automatically. We looked through thousands of complete utterance examples in Elgohary et al. (2019). Based on our observations and experience, we define six rules for generating two kinds of blanks, which are used for resolving coreference and ellipsis in the second phase. The rules for generating blanks are summarized and explained below:

Personal Pronouns: We replace all the personal pronouns (except the first- and second-person pronouns) and their corresponding possessive pronouns with [MASK_r]. This indicates that we will replace these pronouns with specific noun phrases in the second phase.

Interrogatives: We insert [MASK_i] after the interrogative if the whole utterance contains only interrogatives such as what, how, why, when and so forth. [MASK_i] indicates that some additional text span shall be inserted at this location.

That, This: Words like "this", "that", "these" and "those" are commonly used in colloquial language and become a source of ambiguity. Therefore, we deal with these pronouns in the following ways:
- Not followed by a noun phrase: In this case, we simply replace the word with [MASK_r].
- Otherwise: We insert [MASK_i] after the noun phrase.

The + Noun Phrase: We insert [MASK_i] after the noun phrase.

Other, Another, Else: If the utterance contains these words, it usually indicates that there are people or things in addition to what has been mentioned before. Hence, we add a [MASK_i] at the head of the sentence.

Before, After: We insert [MASK_i] after a sentence ending with "before" or "after", which is considered an incompletion.
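Two of the rules above (the pronoun rule and the interrogative rule) can be sketched as follows. This is a minimal illustration, not the full six-rule system: the pronoun list and the length threshold for "bare interrogative" utterances are our own simplifications.

```python
# Third-person pronouns and their possessives (first/second person are kept,
# following the "Personal Pronouns" rule).
THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them",
                "his", "hers", "its", "their", "theirs"}
INTERROGATIVES = {"what", "how", "why", "when", "where", "which", "who"}

def add_blanks(utterance: str) -> str:
    """Replace third-person pronouns with [MASK_r]; append a [MASK_i]
    insertion slot when the utterance is a bare interrogative."""
    tokens = utterance.split()
    out = []
    for tok in tokens:
        if tok.lower().strip("?.,!") in THIRD_PERSON:
            out.append("[MASK_r]")
        else:
            out.append(tok)
    # Interrogative rule (simplified): a very short question led by an
    # interrogative word gets an insertion slot right after it.
    content = [t for t in tokens if t.strip("?.,!")]
    if content and content[0].lower() in INTERROGATIVES and len(content) <= 3:
        out.insert(1, "[MASK_i]")
    return " ".join(out)
```

A real implementation would also need noun-phrase chunking (e.g. from a parser) for the "That, This" and "The + Noun Phrase" rules, which this sketch omits.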

Supervised LCS-based Method
We also design an alignment algorithm based on the Longest Common Subsequence (LCS) algorithm. The sentence to be rewritten, X, and the sentence after rewriting, Y, are aligned to produce the training labels for a sequence labeling model. To obtain the common subsequence, the LCS algorithm returns a matrix M which stores the LCS length for each step of the calculation: the value of M_{i,j} is the LCS length of the prefixes X[0, i] and Y[0, j]. When we trace back from the maximum value at the bottom-right corner, each decrease in length indicates a token the two sentences have in common.
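The matrix M described here is the standard LCS dynamic program; as a minimal sketch:

```python
def lcs_matrix(x, y):
    """DP table M where M[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    return M

# A toy pair in the spirit of the paper's running example.
x = "can you buy that novel for me ?".split()
y = "can you buy J.K. Rowling's new novel for me from the store ?".split()
M = lcs_matrix(x, y)  # M[len(x)][len(y)] is the full LCS length
```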
Coreference and ellipsis in the original sentence are extracted through the LCS trace-back algorithm and labeled as COR and ELL respectively. Given the tokenized original sentence X and the ground truth Y as shown in Figure 3, the rules for labeling are specified as follows: • The labeling proceeds from the bottom-right to the top-left corner of the LCS matrix.
If the current tokens X_i and Y_j are equal, X_i is part of the LCS and is labeled as O, and we move diagonally up and left (shown in black). If not, we move up or left, depending on which cell has the higher value or the lower index j, until we find the next matched pair. • If the path traversed from the previously matched token pair to the newly matched pair is a straight up arrow, it indicates that the token(s) in the interval (Y_{j'}, Y_j) are inserted at the corresponding position i in X to complete the original sentence. In this case, token X_i is labeled as ELL (shown in orange).
• If two matched pairs in the LCS matrix are joined by a path with corners, the corresponding interval of X has been replaced by a span of Y, and these tokens of X are labeled as COR.

We then input the pre-processed training data into a BERT-CRF (Souza et al., 2019) model, treating the problem as a sequence annotation task. Using the method described in Section 2.1.2, we obtain the locations of the coreferences and ellipses of each utterance to be rewritten. As shown in Figure 3, we use the BIO format to annotate the sequence. The starting position of a coreference is marked as B-COR (Begin-Coreference), while the other positions of the coreference are marked as I-COR (Inside-Coreference). An ellipsis only appears between two tokens, so we mark the position of the latter token as B-ELL (Begin-Ellipsis), which means that words are missing between this token and the previous one, and the subsequent model is required to fill them in. We use T5-small (Raffel et al., 2020) and BART-base (Lewis et al., 2020) as the pre-trained language models (PLMs) in phase 2. In this section, we take T5 as an example to illustrate the blank filling process.
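Our reading of the traceback labeling can be sketched as follows. The tie-breaking and edge-case handling are our own simplifications; a production version would follow the exact conventions of Figure 3.

```python
def label_tokens(x, y):
    """Derive BIO labels on the incomplete utterance x by tracing back
    through the LCS matrix of x and the rewritten utterance y."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])

    labels = ["O"] * m
    deleted, inserted = [], 0

    def flush(pos):
        nonlocal deleted, inserted
        if deleted:                        # x-span replaced by a y-span -> COR
            labels[deleted[-1]] = "B-COR"  # deleted was collected right-to-left
            for k in deleted[:-1]:
                labels[k] = "I-COR"
        elif inserted and pos < m:         # pure insertion -> mark latter token
            labels[pos] = "B-ELL"
        deleted, inserted = [], 0

    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:           # match closes the gap after x[i-1]
            flush(i)
            i, j = i - 1, j - 1
        elif M[i - 1][j] >= M[i][j - 1]:   # x[i-1] has no counterpart in y
            deleted.append(i - 1)
            i -= 1
        else:                              # y[j-1] has no counterpart in x
            inserted += 1
            j -= 1
    deleted.extend(range(i - 1, -1, -1))   # leftover x-prefix was replaced
    inserted += j                          # leftover y-prefix was inserted
    flush(0)
    return labels
```

On the running example, "that" (replaced by "J.K. Rowling's new") receives B-COR, and "?" receives B-ELL because "from the store" is inserted before it.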

Blanks Filling
We use the T5 model to fill in the blanks, with two optimizations: adding hints and splitting the current utterance into sub-sentences. The latter ensures that there is only one blank in each sentence fed to the T5 model. The two optimizations are shown in Figure 4 and Figure 5. We transform each multi-turn dialogue into the format shown in Figure 6, and fine-tune the T5 model.
Input: I heard that J.K. Rowling's new book has been published. [SEP] Great. I'm going to the bookstore in town. [SEP] Can you buy <extra_id_0> (that book) for me ?
Output: J.K. Rowling's new book

Input: I heard that J.K. Rowling's new book has been published. [SEP] Great. I'm going to the bookstore in town. [SEP] Can you buy that book for me <extra_id_0> ( ) ?
Output: from the bookstore in town

After fine-tuning, we take the predicted results of the BERT-CRF model in Section 2.1.2 as input to get the final blank-filling results from the T5 model. Finally, the outputs of the T5 model are filled back into the blanks of the original sentence to obtain the rewritten utterance. The same applies to the rule-based method: the blank predictions obtained from it are directly input into the same T5 model (the two optimization methods described in Figure 4 and Figure 5 are also used) to obtain the T5 output.
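The bookkeeping around the PLM can be sketched as follows. This is a minimal illustration of the Figure 6 format: the helper names are ours, and the actual call to the fine-tuned T5 model (which maps the `<extra_id_0>` sentinel to the generated span) is omitted.

```python
def build_t5_input(context_turns, utterance, hint=None):
    """Replace the single blank with T5's <extra_id_0> sentinel, optionally
    followed by the original tokens in parentheses as a hint, and prepend
    the dialogue context joined by [SEP]."""
    sentinel = "<extra_id_0>" + (f" ({hint})" if hint is not None else " ( )")
    filled = utterance.replace("[MASK_r]", sentinel).replace("[MASK_i]", sentinel)
    return " [SEP] ".join(context_turns + [filled])

def fill_back(utterance, span):
    """Write the span generated by the PLM back into the blank."""
    blank = "[MASK_r]" if "[MASK_r]" in utterance else "[MASK_i]"
    return utterance.replace(blank, span)
```

Because the sentence is split so that each sub-sentence carries exactly one blank, a single sentinel per input suffices.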

Experiment
In this section, we will introduce our experiment setup and results.

Datasets
We tested the baselines and our framework on 3 public datasets in English and 2 in Chinese. The statistics are shown in Table 1. Examples are shown in the Appendix.
MuDoCo (Martin et al., 2020) has a lower rewriting rate, which makes the rule-based method less accurate in predicting the locations to be rewritten. CQR (Regan et al., 2019) contains everyday imperative dialogues (between people, or between people and intelligent agents). The sentence patterns are relatively simple, fixed and easy to understand. REWRITE (Su et al., 2019a) is a Chinese dataset in which each dialogue contains 3 turns. It is collected from several popular Chinese social media platforms. The task is to complete the last turn. RES (Restoration-200k) (Pan et al., 2019a) is a large-scale Chinese dataset in which 200K multi-turn conversations are collected and manually labeled with the explicit relations between an utterance and its context. Each dialogue is longer than in REWRITE. CANARD (Elgohary et al., 2019) contains a series of English dialogues about a certain topic or person, organized in the form of QA. It has the largest size and the longest context length. The sentence patterns in CANARD are complex, understanding is difficult, and the degree of rewriting is high.

Baselines
We choose the following strong baselines to compare with our framework. T5-small and T5-base (Raffel et al., 2020): we directly take the context and the current utterance as inputs, fine-tune the T5 model on the training set, and test its end-to-end output on the test set as the rewriting result. BART-base (Lewis et al., 2020): another pre-trained model we used, close in size to T5-small. Our model is tested on top of these 2 PLMs. Rewritten U-shaped Network (RUN) (Liu et al., 2020): the authors regard incomplete utterance rewriting as a dialogue editing task and propose a model using syntactic segmentation to solve it. Hierarchical Context Tagging (HCT) (Lisa et al., 2022): a sequence-tagging-based method proposed to address the robustness problem in the rewriting task.
Rewriting as Sequence Tagging (RAST) (Hao et al., 2021b): the authors propose a novel tagging-based approach that results in a significantly smaller search space than existing methods on the incomplete utterance rewriting task.

Evaluation Metrics
We use the BLEU-n score (Papineni et al., 2002) to measure the similarity between the generated rewritten utterance and the ground truth. Low-order n-gram BLEU scores measure precision, while high-order n-grams measure the fluency of the sentence. We also use the ROUGE-n score (Lin, 2004) to measure the recall of the rewritten utterance. The Rewriting F-score-n (Pan et al., 2019b) examines the words newly added to the current sentence. We calculate the Rewriting F-score by comparing the words added by the rewriting model with the words added in the ground truth. It is a widely accepted metric that can better measure the quality of rewriting. In addition to automatic evaluation, we also asked human annotators to conduct comparative tests on the rewriting results. (A fine-tuned version of T5-base is used: https://huggingface.co/castorini/t5-base-canard.)
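Under the description above, the Rewriting F-score can be sketched as follows. This is a simplified reading at the unigram level; the official metric of Pan et al. (2019b) may differ in tokenization and n-gram handling.

```python
from collections import Counter

def rewriting_f1(original, predicted, gold):
    """F1 over the multiset of words newly added relative to the original
    utterance: words the model added are compared against words added in
    the ground truth."""
    orig = Counter(original.split())
    added_pred = Counter(predicted.split()) - orig
    added_gold = Counter(gold.split()) - orig
    overlap = sum((added_pred & added_gold).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(added_pred.values())
    r = overlap / sum(added_gold.values())
    return 2 * p * r / (p + r)
```

For the Table 4 example, the model adds {yogi, berra, at, the, yankees} while the gold adds {yogi, berra, with, the, yankees}, giving precision = recall = 4/5.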

Implementation Details
All models are run and evaluated on 2 Intel(R) Xeon(R) Silver 4210 CPUs @ 2.20GHz with 4 NVIDIA GeForce RTX 2080 GPUs and 128GB of RAM. Due to the memory constraints of our experimental environment, we adopt the T5-small model in the second phase of our framework and fine-tune it for 20 epochs. All experiments are repeated 3 times and averaged.

Main Results
In the following section, "Ours-T5" denotes our model with T5-small in phase 2, and "Ours-BART" our framework with BART-base in phase 2. "Ours-rule" is a variant of our method which uses the rule-based method of Section 2.1.1 to generate blanks in phase 1 and T5-small in phase 2. "Gold-T5" is the result of directly inputting the sentence with the correct blanks into the T5-small model in phase 2, and "Gold-BART" does the same with the BART-base model.
Table 2 shows the results of our framework and the baselines on CQR and MuDoCo. Compared with CANARD, these two datasets are smaller in size and simpler in sentence structure. Our approach is significantly better than all baselines on all metrics. For the Rewriting F-score, our method is 6.37 and 6.63 percentage points higher, respectively, than the second-best end-to-end T5-small model. This metric strongly suggests that our method can introduce more of the new words provided in the ground truth (relative to the original sentence). The relatively larger advantages of our model over T5-small in BLEU and ROUGE show that our blank-prediction-and-filling method preserves the structure of the original sentence to the greatest extent, so more correct information is retained when these two metrics compare the two sequences. In contrast, the end-to-end T5 model generates the whole rewritten utterance directly, which may lose information from the original sentence.
The last part of Table 2 shows the results of our framework and the baselines on CANARD. Among the three English datasets we used, samples in CANARD are the most difficult and the most complex. Our model is superior to the other baseline methods on all experimental metrics. Especially in BLEU score, our method is significantly better than all baselines. As for the Rewriting F-score and ROUGE, we found that the performance of the end-to-end T5 model is close to our method. This is because the generative T5 model is very powerful and can generate fluent sentences. However, our two-phase framework can better predict which positions in the current sentence should be rewritten, which the end-to-end model cannot achieve. We analyze this point further below.
An important reason why our framework beats the baselines on CQR and MuDoCo is that CQR mainly contains dialogues in which users ask agents for help. The positions and forms of the words that can be added are relatively fixed, such as added place adverbials. Samples in MuDoCo are basically imperative dialogues in daily life and share the same feature, which makes our model easier to learn. The results in Section 3.7 also illustrate this point: the accuracy of the first phase of our framework is higher on CQR and MuDoCo.
Table 3 shows the results of our framework and the baselines on the Chinese datasets REWRITE and RES. Due to the better performance of BART on Chinese text, our model is mainly tested with BART-base rather than T5-small on these two datasets; the two PLMs have similar sizes. HCT, RUN and RAST perform well on these two datasets. Because these two datasets have few turns and simple contents, they have been studied extensively in previous work. However, their performance is not as good as that of BART-base, which shows the great potential of using PLMs directly in rewriting tasks. Compared with BART-base, our model improves in both BLEU and ROUGE scores. This shows that our method is also effective in Chinese, and that the results improve when different PLMs are used within the framework.

Current: how long was he there ?
Gold: how long was yogi berra with the yankees ?
Ours-sup: how long was yogi berra at the yankees ?
T5-small: how long was yogi berra there ?
Table 4: A typical example extracted from the prediction results on CANARD.
In Table 4, our model is compared with the end-to-end T5 model. It can be observed that the end-to-end model does not consider replacing the word "there", because the position to be rewritten is not explicitly predicted. Our two-phase framework makes up for this: the sequence annotation model indicates that "there" needs to be replaced, so the T5 model in the second phase can predict it correctly. This is our advantage over the end-to-end model. More case studies are shown in the appendix.

Table 5: Human evaluation on CANARD. "Rule" means our rule-based method combined with the T5-small of the 2nd phase, introduced in Section 2.1.1.

Human Evaluation
Table 5 shows the results of human evaluation on CANARD. For each pair of competing models, 50 pairs of rewriting results were randomly sampled from the test set for comparison. A total of 200 questions were evenly distributed at random among 5 human volunteers. Each person had to choose the better of the two models' predictions. As can be seen from the table, our method is significantly stronger than RUN, HCT, and the rule-based method of Section 2.1.1. When compared with the end-to-end T5-small model, our advantage is relatively small. From the feedback of the human annotators, we find that the end-to-end model has the advantage of direct generation and can produce more complete and fluent sentences. Our method only generates the words needed in each blank, which costs some sentence fluency. However, our two-phase framework can accurately predict the positions that need to be rewritten in the current sentence, which is beyond the reach of the end-to-end model (see the appendix for a detailed analysis). Taken together, our method still compares favorably.

Ablation Tests

Table 6 shows the results of the end-to-end ablation test on CANARD. We can see that replacing the LCS algorithm with a greedy algorithm degrades the experimental results to a certain extent, which shows the effectiveness of the LCS algorithm. On the other hand, due to the diversity of the experimental data, any matching algorithm can only approximate the correct alignment and cannot guarantee complete correctness; the greedy algorithm serves as a substitute. Our greedy algorithm is described as follows.

We use 2 pointers to traverse the current utterance and the ground truth utterance, each pointing to the current word in its utterance. If the words cannot be matched, the pointer of the ground truth advances to the next matching position and stops, and the scanned span is marked as an "ellipsis". If no match can be made before the end, the pointer of the current utterance moves forward by one and the previous position is added to the span of a "coreference".
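A minimal sketch of this greedy alignment, under our reading of the two-pointer description (the label names follow the BIO scheme of Section 2.1.2):

```python
def greedy_align(x, y):
    """Greedy substitute for LCS alignment: scan both utterances left to
    right; unmatched ground-truth tokens mark an ellipsis site, and
    unmatchable current-utterance tokens become coreference positions."""
    labels = ["O"] * len(x)
    i = j = 0
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            i += 1
            j += 1
            continue
        # Advance the ground-truth pointer to the next token matching x[i].
        k = j
        while k < len(y) and y[k] != x[i]:
            k += 1
        if k < len(y):     # skipped y[j:k] was inserted text -> ellipsis site
            labels[i] = "B-ELL"
            j = k
        else:              # x[i] never matches -> part of a coreference span
            labels[i] = ("B-COR" if i == 0
                         or labels[i - 1] not in ("B-COR", "I-COR")
                         else "I-COR")
            i += 1
    return labels
```

Note that on the running example the greedy pass spuriously marks "novel" as an ellipsis site, because the span replacing "that" is consumed as an insertion; this kind of conflation is one reason the greedy variant trails the LCS-based alignment.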
If we remove the two optimizations of splitting sentences according to the number of blanks and adding hints, there is a more obvious decline. The reason is that splitting sentences preserves more syntactic information, whereas multiple blanks make a sentence look "full of holes". Adding hints exposes the original words to the language model in phase 2, providing more information: for example, if the hint is "he", the model will not tend to fill in a female name or something else. Figure 7 shows the F1-scores of our LCS-based algorithm and the greedy algorithm in predicting the locations that need to be rewritten (that is, the first phase of the two-phase framework). They are trained and tested on the sequence annotation data generated by their respective methods. We can see that the LCS-based algorithm performs better.

Time Cost Evaluation
Table 7 shows the training and prediction times on CANARD. In Section 3.5, we found that our model has the smallest advantage over the end-to-end T5-small model, so in this section we compare their time consumption. In Table 7a, under the same configuration, we found that our method takes more time to fine-tune. This is understandable: although there are only 5,571 samples in the test set of the CANARD dataset, we segment sentences according to the number of blanks, and even counting sentences without any blanks, this optimization increases the number of samples to 6,569. Interestingly, Table 7b shows that our model takes less inference time. This may be because our model does not need to generate a whole sentence, but only fills in the blanks, which are much shorter than a complete utterance. Since BERT-CRF is fast, our method takes only 11.9% more time than the end-to-end T5 model overall, and the size of the model and the other training requirements are almost the same. Therefore, we believe that even a small increase in the results can illustrate the effectiveness of our method.

Comparison with ChatGPT
In this section, we present the results of a comparison with ChatGPT (https://chat.openai.com/). Dialogue systems are useful in many tasks and scenarios. Rewriting utterances is particularly useful when a light-weight dialogue model which only takes the last utterance as input is desirable. This is exactly where very large models such as ChatGPT cannot help, not to mention the various woes of current ChatGPT, such as deployment cost, slow inference speed, and privacy issues. Therefore, we believe it is not entirely fair to compare ChatGPT with the kind of rewriting technology we advocate in this paper, and the latter still has its merits.
The prompt we used (shown in Figure 8) is:

"Please complete the following incomplete sentence completion task. Given the context of the conversation and the incomplete sentence to be rewritten, you need to complete the sentence to be rewritten so that it can be understood out of context. Please do not change the words in the sentence to be rewritten or the structure of the sentence unless necessary. Do not use information that goes beyond the context. Your answer should be at most 10 words more than the sentence to be rewritten.

Give an example:
context: anna politkovskaya the murder remains unsolved , 2016
sentence to be rewritten: did they have any clues ?
answer: did investigators have any clues in the unresolved murder of anna politkovskaya ?

If you understand, I will give you some tasks."

The scale of ChatGPT is at least 3 orders of magnitude larger than the models we use in this paper, which means this is not a fair comparison. Nevertheless, we still conducted the following supplementary experiments on ChatGPT. The experimental results on 30 cases of CANARD are shown in Table 8, and some examples of the results are shown in Table 9. After repeated tries and with the best prompt we could find, ChatGPT is still worse than our method in terms of the automatic evaluation metrics. However, in the human evaluation, the testers considered the rewriting results of ChatGPT to be of higher quality (more fluent). This is no surprise given the tremendous parameter count of ChatGPT.

Related Work
Early work on rewriting often treats the problem as a standard text generation task, using pointer networks or sequence-to-sequence models with a copy mechanism (Su et al., 2019b; Elgohary et al., 2019; Quan et al., 2019) to fetch the relevant information from the context (Gu et al., 2016). Later, pre-trained models like T5 (Raffel et al., 2020) were fine-tuned on conversational query reformulation datasets to generate the rewritten utterance directly. Inoue et al. (2022) use a Picker that identifies the omitted tokens to optimize T5. In general, these generative approaches ignore a characteristic of the IUR problem: rewritten utterances often share the same syntactic structure as the original incomplete utterances.
Given that coreference is a major source of incompleteness of an utterance, another common approach is to utilize coreference resolution or corresponding features. Tseng et al. proposed a model which jointly learns coreference resolution and query rewriting with the GPT-2 architecture (Radford et al., 2019). By first predicting coreference links between the query and the context, the rewriting performance improves when the incompleteness is induced by coreference. However, this does not work for utterances with ellipsis. Besides, the performance of the rewriting model is limited by the coreference resolution model.
Recently, some work on incomplete utterance rewriting focuses on the "actions" taken to change the original incomplete utterance into a self-contained (target) utterance. Hao et al. (2021a) solve this problem with a sequence-tagging model: for each word in the input utterance, the model predicts whether to delete it, and the span of words to be inserted before the current word is chosen from the context. Liu et al. (2020) formulated the problem as a syntactic segmentation task by predicting segmentation operations for the rewritten utterance. Zhang et al. (2022) extract the coreference and omission relationships directly from the self-attention weight matrix of the transformer instead of from word embeddings. Compared with these methods, our framework separates the two phases of predicting the rewriting positions and filling in the blanks more thoroughly, and meanwhile reduces the difficulty of both phases through divide and conquer.

Conclusion
In this work, we present a new two-phase framework, consisting of locating positions to rewrite and filling the blanks, for solving the Incomplete Utterance Rewriting (IUR) task. We also propose an LCS-based method to align the original incomplete sentence with the ground truth utterance to obtain the positions of coreference and ellipsis. Results show that our model performs the best on several metrics. We also recognize two directions for further research. First, as the performance of our two-phase framework is often limited by the first phase, we will try to improve the accuracy of locating the rewriting positions. Second, it will be useful to study the best way to apply our rewriting model to other downstream NLP tasks.

Limitations
Our framework is a two-phase process, which has an inherent defect: the results of the second phase depend on the results of the first. Because the sequence annotation algorithm in the first phase cannot achieve 100% accuracy, it sometimes predicts the wrong positions to rewrite, and the second phase then propagates this error into the final result.
On the other hand, the T5 model is only used to predict the words that should fill each blank, rather than to generate the whole sentence, which may reduce the overall fluency of the sentence.

Figure 1 :
Figure 1: An example of utterance rewriting. The phrase in the first red box is a coreference, and the second is an ellipsis.

Figure 3 :
Figure 3: Example of generating sequence labeling data (based on LCS).

Figure 6 :
Figure 6: Format of fine-tune data of T5.
Table 6: Ablation test on CANARD. "w/o LCS" means replacing the LCS algorithm with a greedy algorithm. "w/o split" and "w/o hint" respectively denote removing the 2 kinds of optimizations in Section 2.2.

Figure 7 :
Figure 7: Different methods' results of predicting the locations to be rewritten in phase 1.

Figure 8 :
Figure 8: A prompt designed to allow ChatGPT to do rewriting task.

Figure 4: Split the sentence according to the number of blanks in the utterance.
Original sentence: Can you buy that novel for me ?
Sentence with blanks: Can you buy [MASK_r] for me [MASK_i] ?
After splitting: Can you buy [MASK_r] for me ? / Can you buy that novel for me [MASK_i] ?

Figure 5: Add hints.
Sentence with hints: Can you buy [MASK_r] (that novel) for me [MASK_i] ( ) ?

Table 1 :
Descriptions of the datasets.
"Ave Len" means the average length of context."% RW" denotes the percentage of samples whose current utterance is actually rewritten.

Table 2 :
Results on English datasets.

Table 3 :
Results on Chinese datasets.

Table 7 :
Time cost of our method and the end-to-end T5-small model on CANARD. "Total Time" is the total training time spent on all samples in the test set. "Ave Time" is the average inference time over all samples in the test set.

Table 8 :
Experimental results on 30 cases of CANARD.

Table 9 :
Examples of ChatGPT and ours on CANARD.
Ours-T5: did fsb get into trouble for the attack against the account annapolitovskaya@us provider1 ? / why did superstar billy graham return to the wwwf ?
ChatGPT: Did the perpetrators face consequences for the attack on Anna Politkovskaya's email? / What was the reason for Superstar Billy Graham's return to WWWF?