Enhance Incomplete Utterance Restoration by Joint Learning Token Extraction and Text Generation

This paper introduces a model for incomplete utterance restoration (IUR) called JET (\textbf{J}oint learning token \textbf{E}xtraction and \textbf{T}ext generation). Different from prior studies that only work on extraction or abstraction datasets, we design a simple but effective model, working for both scenarios of IUR. Our design simulates the nature of IUR, where omitted tokens from the context contribute to restoration. From this, we construct a Picker that identifies the omitted tokens. To support the picker, we design two label creation methods (soft and hard labels), which can work in cases of no annotation data for the omitted tokens. The restoration is done by using a Generator with the help of the Picker on joint learning. Promising results on four benchmark datasets in extraction and abstraction scenarios show that our model is better than the pretrained T5 and non-generative language model methods in both rich and limited training data settings.\footnote{The code is available at \url{https://github.com/shumpei19/JET}}


Introduction
Understanding conversational interactions through NLP has become important with increasing connectivity and range of capabilities. The applications using natural conversations cover a wide range of solutions including dialogue systems, information extraction, and summarization. For example, Adiwardana et al. (2020); Su et al. (2020) aimed to build dialogue systems in which an intelligent virtual agent answers human conversations and makes suggestions in an open/closed domain. Bak and Oh (2018); Karan et al. (2021) attempted to detect decision-related utterances from multi-party meeting recordings, while Tarnpradab et al. (2017) applied extractive summarization to online forum discussions. These features allow users to quickly catch up with the current situation, decisions, and next actions without having to follow a lengthy or comprehensive dialogue. However, utterances, the components of a conversation, are generally not self-contained and are difficult to understand on their own. This comes from the nature of multi-turn dialogue, where each utterance contains co-references, rephrases, and ellipses (Figure 1). Su et al. (2019) also showed that co-references and ellipses occur in over 70% of utterances in conversations. This is a ubiquitous problem in conversational AI, making it challenging to build practical systems over conversations.
Incomplete Utterance Restoration (IUR) (Pan et al., 2019) is one solution to restore semantically underspecified utterances (i.e., incomplete utterances) in conversations. Figure 1 shows an example of IUR, in which the model rewrites the incomplete utterance to the reference. IUR is a challenging task for two reasons. Firstly, the gold utterance (the reference) shares many tokens with the pre-restored, incomplete utterance, while it shares only a few tokens with utterances in the context. We observed that for CANARD (Elgohary et al., 2019), 85% of tokens in incomplete utterances were directly cited for rewriting, while only 17% of tokens in the context were cited for rewriting. Secondly, it is important to detect omitted tokens in incomplete utterances and to include them in the restoration process. In actual cases of IUR, no matter how fluent and grammatically correct the machine's generation is, it is useless as long as important tokens are left out.
Recent studies used several methods for IUR. These include the extraction of omitted tokens for restoration (PAC) (Pan et al., 2019), two-stage learning (Song et al., 2020), seq2seq fine-tuning (Bao et al., 2021), semantic segmentation (RUN-BERT), and a tagger that detects which tokens in incomplete utterances should be kept, deleted, or changed for restoration (SARG) (Huang et al., 2021).

Figure 1: The sample data from CANARD. IUR models rewrite the incomplete utterance to be as similar as possible to the reference. The blue tokens are omitted tokens (excluding stop words) in the incomplete utterance. The red tokens are defined by our hard labeling approach as important tokens.

However, we argue that each of these methods works well on either extractive or abstractive IUR datasets, but not both. For example, SARG and seq2seq achieve promising results on Restoration 200k (Pan et al., 2019), where omitted tokens can be directly extracted from the context (extraction). But they are not the best on CANARD (Elgohary et al., 2019), which requires more abstraction for restoration. In Figure 1, we can observe that the outputs of SARG and seq2seq are worse than that of our JET. The text editing strategy of SARG is limited in its ability to generate abstractive rewriting, while seq2seq has problems in picking omitted tokens. As a result, the generality of these methods is still an open question.
We introduce a simple but effective model, named JET (Joint learning token Extraction and Text generation), to address the generality of IUR methods. The model is designed to work widely from extractive to abstractive scenarios. To do that, we first address the problem of identifying omitted tokens from the dialogue context by introducing a picker. The picker uses a new matching method for dealing with various forms of tokens (Figure 1) in the extraction style. We next consider the abstraction aspect of restoration by offering a generator. The generator utilizes the power of the pre-trained T5 model to rewrite incomplete utterances. The picker and generator share T5's encoder and are jointly trained in a unified model for IUR. (Note that the performance of RUN-BERT is limited on CANARD.) This paper makes three main contributions:
• We propose JET, a simple but effective model based on T5 for utterance restoration in multi-turn conversations. Our model jointly optimizes two tasks: picking important tokens (the picker) and generating re-written utterances (the generator). To the best of our knowledge, we are the first to utilize T5 for the IUR task.
• We design a method for identifying important tokens for training the picker. The method facilitates IUR models in practical cases, in which there are no (or only a few) existing gold labels.
• We demonstrate the validity of the model by comparing it to strong baselines from multiple perspectives such as limited data setting (Section 5.2), human evaluation (Section 5.4) and output observation (Section 5.5).

Related Work
Sentence rewriting IUR can be considered similar to the sentence rewriting task (Xu and Veeramachaneni, 2021; Lin et al., 2021; Chen and Bansal, 2018; Cao et al., 2018). Recent studies have addressed the IUR task with various sophisticated methods. For example, Pan et al. (2019) introduced a pick-then-combine model for IUR. The model picks up omitted tokens, which are combined with incomplete utterances for restoration. Another line of work proposed a semantic segmentation method that segments tokens in an edit matrix and then applies edit operations to generate utterances. Huang et al. (2021) presented a complicated model which uses a tagger for detecting kept, deleted, or changed tokens for restoration. We share the idea of using a tagger with Huang et al. (2021) for IUR. However, we design a simpler but effective model which includes a picker (picking omitted tokens) and a generator for the restoration of incomplete utterances.
Text generation IUR can be formulated as text generation by using the seq2seq model (Pan et al., 2019; Huang et al., 2021). For the generation, several well-known pre-trained models have been applied (Lewis et al., 2020; Brown et al., 2020; Raffel et al., 2020) with promising results. We employ the T5 model (Raffel et al., 2020) as the main component to rewrite utterances. To address the problem of missing important tokens in the model's rewriting, we enhance T5 by introducing a Picker and two labeling methods (Section 3.2).

Problem Statement
This work focuses on the incomplete utterance restoration of conversations. Let H = {h_1, h_2, ..., h_m} be the history of the dialogue (the context), and let U = {u_1, u_2, ..., u_n} be the incomplete utterance that needs to be re-written. The task is to learn a mapping function f(H, U | Θ) = R, where R = {r_1, r_2, ..., r_k} is the re-written version of U. Θ is learned either by using utterance generation alone (the generator) or by combining two tasks: important token identification (the picker) and utterance generation (the generator).

The Proposed Model
Our model is shown in Figure 2. The Picker receives the context to identify omitted tokens. The Generator receives incomplete utterances for restoration. The model jointly learns to optimize the two tasks. Our model has three significant differences compared to PAC (Pan et al., 2019) and SARG (Huang et al., 2021). First, our model is based on a single pre-trained model for both the picker and the generator, while the other models (i.e., PAC and SARG) use different architectures for the two steps. This yields two advantages for our model. (i) Our design can be easily adapted to create a new unified model for different tasks by using a single generative LM (Paolini et al., 2021). (ii) Our model can work well in several scenarios: extraction vs. abstraction (data characteristics) and full vs. limited training data (Section 5). Second, we design a joint training process to implicitly take into account the suggestions from the picker to the generator, instead of using a two-step model as in PAC, which explicitly copies extracted tokens from the Picker for generation. Our joint training model can reduce the error accumulation of the two-step framework. Finally, we design a heuristic approach to build important tokens, which enables the model to work on a wider range of datasets and scenarios.

Input representation
As shown in Figure 2, the special separator tokens (e.g., [X2]) convey two pieces of useful information to the model: the signal indicating the switch of speakers and the cue to distinguish whether an utterance comes from the context or is the incomplete utterance. The embedding of each token in the entire input sequence S = {w_1, w_2, ..., w_l} was obtained as E_i = WE(w_i) + PE(w_i). Here, WE is the word embedding initialized from a pretrained model by using a wordpiece vocabulary. PE is the relative position embedding representing the position of each token in the sequence. These embeddings were fed into the L stacked encoder layers of T5; E^L is the contextual representation of the whole input used by the Picker and the Generator in the next sections.
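The input construction can be sketched as follows. This is a minimal illustration with tiny random tables standing in for T5's learned embeddings (the vocabulary, dimensions, and additive position embedding are all toy assumptions; T5 itself applies relative position biases inside attention rather than an additive table):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical wordpiece vocabulary including separator tokens
vocab = {"[X1]": 0, "[X2]": 1, "where": 2, "did": 3, "he": 4, "go": 5}
d_model, max_len = 8, 16

WE = rng.normal(size=(len(vocab), d_model))  # word embedding table (stand-in)
PE = rng.normal(size=(max_len, d_model))     # position embedding table (stand-in)

def embed(tokens):
    """E_i = WE(w_i) + PE(i) for each token of the flattened input sequence."""
    ids = [vocab[t] for t in tokens]
    return np.stack([WE[wid] + PE[pos] for pos, wid in enumerate(ids)])

# context and incomplete utterance flattened into one sequence with separators
S = ["[X1]", "where", "did", "he", "go", "[X2]"]
E = embed(S)
print(E.shape)  # (6, 8)
```

The resulting matrix plays the role of the encoder input; in the real model it would pass through T5's stacked encoder to produce E^L.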

The Picker
It is possible to directly use T5 (Raffel et al., 2019) for IUR. However, we empower T5 with a Picker to implicitly take into account information from important tokens. The idea of selecting important tokens was derived from Pan et al. (2019), in which the authors suggested that the use of important tokens contributes to the performance of utterance restoration. We extend this idea by designing an end-to-end model which includes important token identification and generation, instead of using the two-step framework of Pan et al. (2019).
Given the context and the incomplete utterance, the Picker identifies tokens that are included in context utterances but omitted in the incomplete utterance. We call these tokens important tokens. However, among the four datasets, important tokens are originally provided only for Restoration 200k (please refer to Table 1). Besides, the form of important tokens could change after restoration, such as from plural to singular or from nouns to verbs (Figure 1). To overcome this issue, we introduce a label creation method that automatically identifies important tokens from the context for restoration.

Important token identification
Since building a set of important tokens is time-consuming and important tokens are usually not defined in practical cases, we introduce a heuristic strategy to automatically construct important tokens. In the following processing, stop words in the context, incomplete utterances, and gold references are removed in advance, assuming that stop words are out of the scope of important tokens. In addition, we applied lemmatization and stemming, the process of converting tokens to their base or root form, to alleviate spelling variants.
First, we extracted tokens, called "clue tokens", that exist in the gold reference but not in the incomplete utterance. If some tokens in the context are semantically similar to some of the clue tokens, we can naturally presume that these context tokens are cited as important tokens for the rewriting. Therefore, we performed scoring by the similarity d_ij between the word representations of the i-th token in the context, h_i, and the j-th clue token, c_j: d_ij = cosine_sim(h_i, c_j), where cosine_sim() is the cosine similarity score. We used word representations of h_i and c_j from fastText (Bojanowski et al., 2017) trained on Wikipedia as a simple setting of our model.
According to the similarity d_ij, we introduce two types of labels for the Picker, soft_i as soft labels and hard_i as hard labels:

soft_i = max_j d_ij

hard_i = 1 if h_i exactly matches some clue token c_j after lemmatization and stemming, and 0 otherwise.

Here, the max operation was applied based on the assumption that at most one clue token corresponds to a token in the context.
Intuitively, the soft label method takes into consideration the cases that could not be handled by lemmatization and stemming, such as paraphrasing by synonyms, and reflects them as an importance score in the range of 0 to 1. On the other hand, the hard label is either 0 or 1, where an important token is defined only when there is an exact match between the context tokens and the clue tokens in the form after lemmatization and stemming. We provide the two methods to facilitate important token identification.

Important token selection
The Picker takes the encoded embeddings E^L = {E^L_1, ..., E^L_l} and predicts the score of the soft label or hard label corresponding to each input token: p_i = FNN(E^L_i), where FNN() is a vanilla feedforward neural network that projects the encoded embedding to the soft label or hard label space. Cross-entropy was then adopted as the loss function: L_picker = -Σ_i q_i log p_i, where q_i is the picker's label for the i-th input token. When the label is a soft label, optimizing the loss L_picker is equivalent to minimizing the KL divergence. In the hard labeling case, we assign three types of tags to tokens following the BIO tag format, as in a sequence tagging problem.

The Generator
We explore the restoration task by using the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2019), because T5 provides promising results for text generation tasks. We initialized the transformer modules from T5-base, which uses 12 layers, and fine-tuned it for our IUR task. For restoration, the encoder's representation E^L was fed into an L stacked decoder with cross-attention.
At decoding step i, the decoder consumes the previously generated tokens {<s>, r_1, ..., r_{i-1}}, where <s> is the SOS token. The probability p of a token t at time step i was obtained by feeding the decoder's output D^L into the softmax layer: p(t | r_{<i}, H, U) = v(t)^T softmax(W D^L_i). Here, v(t) is a one-hot vector of the token t with the dimension of the vocabulary size. The objective is to minimize the negative log-likelihood of the conditional probability between the predicted outputs from the model and the gold sequence R = {r_1, r_2, ..., r_k}: L_generator = -Σ_{i=1}^{k} log p(r_i | r_{<i}, H, U).
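The token-level negative log-likelihood can be sketched as follows, with random logits standing in for the projected decoder outputs W D^L_i and a toy vocabulary (all values here are illustrative, not from the model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 10, 4                      # toy vocabulary size and gold length

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# random logits stand in for the projected decoder outputs at each step
logits = rng.normal(size=(k, vocab_size))
probs = softmax(logits)                    # p(t | r_<i, H, U) at each step
gold = np.array([3, 1, 7, 2])              # gold token ids r_1 .. r_k

# L_generator = - sum_i log p(r_i | r_<i, H, U)
L_generator = -np.log(probs[np.arange(k), gold]).sum()
print(probs.shape, L_generator > 0)  # (4, 10) True
```

In the real model the logits at step i depend on the prefix r_{<i} through the decoder's self-attention; the loss computation itself is exactly this indexed log-sum.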

Joint learning
JET aims to optimize the Picker and the Generator jointly in a multi-task learning setting. Different from PAC (Pan et al., 2019), which directly copies extracted tokens into generation, JET can implicitly utilize knowledge from the Picker, in which the learned patterns of the Picker for identifying important tokens can be leveraged by the Generator. This can reduce the error accumulation of the two-step framework of PAC. The final loss of the proposed model is defined as L = α L_picker + L_generator, where the hyperparameter α balances the influence of the task-specific weight. Our simple setting enables us to run minimal experiments to evaluate how much important token extraction contributes to generation.
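The joint objective itself is a one-liner; the sketch below only makes the role of α explicit (α = 1 is the value used in the experiments, per the settings section):

```python
def joint_loss(l_picker, l_generator, alpha=1.0):
    """L = alpha * L_picker + L_generator; alpha = 1 in the paper's experiments."""
    return alpha * l_picker + l_generator

# with alpha = 0 the objective degenerates to plain T5 fine-tuning
print(joint_loss(0.4, 2.5))           # both terms contribute
print(joint_loss(0.4, 2.5, alpha=0))  # 2.5
```

Because both losses are computed from the shared encoder's output, every gradient step updates the encoder with signal from both tasks, which is how the Picker's knowledge reaches the Generator implicitly.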

Settings and Evaluation Metrics
Data We conducted all experiments on the four well-known utterance rewriting datasets in Table 1; CANARD, for example, was built from QuAC (Choi et al., 2018). The datasets range from extraction to abstraction, challenging IUR models.
Settings For running JET, we used AdamW with β_1 = 0.9, β_2 = 0.999, a weight decay of 0.01, a batch size of 12, and a learning rate of 5e-5. We used 3 FFN layers (dimensions of 768, 256, and 64) with ReLU as the activation function. The final dimension is 1 for soft labeling and 3 for hard labeling. We set α = 1 for the loss function. We applied beam search with a beam size of 8. For the picker's label creation, we used stop words from NLTK for English and from stopwordsiso for Chinese. For lemmatization and stemming, NLTK's WordNetLemmatizer and PorterStemmer were adopted for English, while lemmatization and stemming were skipped for Chinese. The pre-trained model was T5-base (English and Chinese). In the full training data setting (Section 5.1), an epoch size of 6 was used for Restoration 200k and CANARD, and 20 for REWRITE and TASK. In the limited training data setting (Section 5.2), an epoch size of 20 was used for all four datasets (Table 1). All models were trained on a single Tesla P100 GPU. Evaluation metrics We followed prior work (Pan et al., 2019; Elgohary et al., 2019; Huang et al., 2021) in using three different metrics for evaluation: ROUGE scores, BLEU scores, and F-scores.
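Beam-search decoding as used here can be sketched compactly. This toy version scores sequences against a fixed per-step log-probability table, which is a deliberate simplification: the real decoder re-computes the distribution conditioned on each beam's generated prefix at every step:

```python
import numpy as np

def beam_search(step_logprobs, beam_size=8):
    """Keep the beam_size highest-scoring partial sequences at each step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for step in step_logprobs:
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(step)]
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_size]
    return beams

rng = np.random.default_rng(0)
# 3 decoding steps over a vocabulary of 5, rows are valid log-distributions
table = np.log(rng.dirichlet(np.ones(5), size=3))
best_seq, best_score = beam_search(table, beam_size=8)[0]
print(len(best_seq), best_score < 0)  # 3 True
```

With a prefix-independent table the top beam coincides with the per-step argmax; the value of the beam only shows once scores genuinely depend on the prefix, as in the real T5 decoder.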

Full Training Data Setting
We provide two scenarios of comparison with full training data: comparison with T5 and comparison with non-generative LM models.

Comparison with T5
We first compare our model against the strong pre-trained T5 model used as the generator, as the first scenario. This scenario ensures a fair comparison among strong pre-trained models for text generation and also shows the contribution of the Picker. Results in Table 2 show that JET is consistently better than T5 across all metrics on all four datasets. This is because the picker can pick up important omitted tokens, which are beneficial for restoration. These results show that joint learning can implicitly support capturing the hidden relationship between the picker and the generator. Also, the promising results show that our labeling method works on both extraction and abstraction datasets. The results of T5 are also competitive. The reason is that T5 (Raffel et al., 2019) was trained with a huge amount of data by using a generative learning process, which mimics the text generation task. As a result, it is appropriate for restoration.
For other strong pre-trained models for text generation, we also tested our joint learning framework with ProphetNet (Qi et al., 2020), but the results were not good enough to report. We leave the comparison with UniLM (Dong et al., 2019) and ERNIE-GEN (Xiao et al., 2020) as minor future work because no pre-trained models are available for Chinese.

Comparison with non-generative LM models
We next challenge JET against strong baselines which do not directly use generative pre-trained LMs (e.g., T5) for restoration. This is the second scenario, which ensures the diversity of our evaluation. We leave the comparison of our model with BERT-like methods (e.g., SARG and RUN-BERT using the T5 encoder) as minor future work. For Restoration 200k and CANARD, we use the following baselines. Syntactic is a seq2seq model with attention (Kumar and Joshi, 2016). CopyNet is an LSTM-based seq2seq model with attention and the copy mechanism (Huang et al., 2021). T-Ptr employs transformer layers in its encoder-decoder for restoration (Su et al., 2019). PAC is a two-stage model for utterance restoration (Pan et al., 2019). s2s-ft leverages a specific attention mask with several fine-tuning methods (Bao et al., 2021). RUN-BERT is an IUR model using semantic segmentation. SARG is a semi-autoregressive model for multi-turn utterance restoration (Huang et al., 2021). Table 3 shows that JET outputs promising results compared to strong baselines. For Restoration 200k, JET is competitive with RUN-BERT, the SOTA for this dataset. For CANARD, JET is consistently better than the baselines. The improvements come from the combination of the picker and generator. It is important to note that RUN-BERT and SARG are significantly behind our model on the abstractive scenario (CANARD). This supports our statement in Section 1 that the current strong models for IUR are over-specialized for extractive datasets and their generality is limited.
We next report the comparison on REWRITE and TASK in another table due to their smaller number of evaluation metrics. Following prior work, we compare our model with RUN and two other methods: GECOR1 and GECOR2.
Results from Table 4 are consistent with the results in Tables 2 and 3. It indicates that our model outperforms the baselines on both TASK and REWRITE. For REWRITE, the EM (exact match) score of our model is much better than the baselines. It shows that the model can correctly restore incomplete utterances. These results confirm that our model can work well in the two scenarios over all four datasets.
Important token ratio We observed how many important tokens are included in predictions on Restoration 200k. To do that, we defined two metrics. Table 5 shows that JET with hard labeling achieves better results on both metrics compared to a single T5. This supports our hypothesis that the Picker helps the Generator on the IUR task.

Limited Training Data Setting
We challenge our model in the limited training data setting. This simulates practical cases in which only a small number of training samples is available. We trained three strong methods, SARG (Huang et al., 2021), T5, and JET, on 10% of the training data obtained by sampling. We could not run RUN-BERT due to errors in the original code.
As shown in Table 6, JET is consistently better than SARG by large margins. This is because JET is empowered by T5, which helps our model work with a small number of training samples. This point is essential in practical cases. JET is also better than T5, showing the contribution of the Picker. SARG is good at ROUGE scores and BLEU scores but worse at F-scores, e.g., on REWRITE. The reason is that SARG uses the pointer generator network, which directly copies input sequences for generation but learns little beyond copying.

Soft Labels vs. Hard Labels
We investigated the efficiency of our labeling methods described in Section 3.2.2. We ran JET with the soft and hard labeling methods. We also include the results of JET on the predefined labels of Restoration 200k because this dataset originally provides labels of important tokens. From Table 7, we can see the hard labeling method performs well on both datasets. Interestingly, the hard labeling method is even better than the one with predefined labels on Restoration 200k. Although the predefined labels were manually created, Restoration 200k defines at most one important token per sample even though some samples actually contain two or more omitted tokens. We found that the hard label method detects 164k omitted tokens, while the originally defined tokens number about 120k, and the tokens detected by hard labeling cover 42% of the defined tokens. This suggests the hard label method extensively picks up important tokens even though some important tokens are missed, and it can contribute to the enhancement of JET.
As for the soft labeling method, it contributes to the F-scores on CANARD (abstractive) while it degrades accuracy on Restoration 200k (extractive). This implies the soft label does not function well when the distinction between important and unimportant tokens is clear, as in Restoration 200k. The soft labeling method would need more exploration on abstractive scenarios that require more synonymous paraphrasing or creative summarization.

Human Evaluation
We report a human evaluation against strong methods on CANARD because it is much more challenging than the other datasets. We asked three annotators who are well skilled in English and data annotation from the annotation team in our company. For the evaluation, we randomly selected 300 outputs from four models. Each annotator read each output and gave a score (1: bad; 2: acceptable; 3: good). Following Kiyoumarsi (2015), we adopted text flow and understandability as our criteria. Text flow shows how grammatically correct and easy to understand the restored utterance is. Understandability shows how semantically similar the predictions are to the reference. As shown in Table 8, JET obtains the highest scores on both criteria over the other methods. This is consistent with the results of the automatic evaluation in Tables 2 and 3. This is because our model utilizes strong pre-trained weights, which provide the ability to generate unseen tokens, especially for abstractive data. The scores of JET also show the contribution of the Picker compared to plain T5 for restoration.

Output Observation
We observed the restoration outputs of different models in Figure 1. There exist 9 omitted tokens between the incomplete utterance and the reference. SARG and s2s-ft can restore only 2 important tokens. T5 can restore 8 of the 9 important tokens but generates unnecessary words. Our proposed model can also restore 8 important tokens and has the same semantic meaning as the gold utterance. This suggests our model learns to use only the tokens picked up by the Picker as additional tokens for rewriting. We also examined the ability of strong methods with different input lengths on CANARD. Results in Table 9 show that our model can deal with longer input sequences. Compared to SARG and seq2seq, the performance of our model is much better. This is because the implicit suggestions from the Picker, combined with T5's ability to deal with long sequences, increase the score.

Conclusion
This paper introduces a simple but effective model for incomplete utterance restoration. The model is designed based on the nature of conversational utterances, where important omitted tokens should be included in restored utterances. To do that, we introduce a picker with two labeling methods to support a generator for restoration. We found that the picker contributes to improving the generality of the model on four benchmark datasets. The model works well in English and Chinese, from extractive to abstractive scenarios, in both full and limited training data settings. Future work will investigate the behavior of the model in other domains and potential applications of JET, e.g., combining utterance extraction and utterance restoration for information extraction from dialogue.