Improving Bot Response Contradiction Detection via Utterance Rewriting

Though chatbots based on large neural models can often produce fluent responses in open-domain conversations, one salient error type is contradiction or inconsistency with the preceding conversation turns. Previous work has treated contradiction detection in bot responses as a task similar to natural language inference, e.g., detecting the contradiction between a pair of bot utterances. However, utterances in conversations may contain co-references or ellipsis, and using these utterances as-is may not always be sufficient for identifying contradictions. This work aims to improve contradiction detection by rewriting all bot utterances to restore co-references and ellipsis. We curated a new dataset for utterance rewriting and built a rewriting model on it. We empirically demonstrate that this model can produce satisfactory rewrites that make bot utterances more complete. Furthermore, using rewritten utterances improves contradiction detection performance significantly, e.g., the AUPR and joint accuracy scores (detecting contradiction along with evidence) increase by 6.5% and 4.5% (absolute), respectively.


Introduction
Latest chatbots powered by large pre-trained neural models have shown decent capabilities to maintain fluent and interesting conversations with human users (Paranjape et al., 2020; Roller et al., 2021; Bao et al., 2021; Konrád et al., 2021). However, they are still prone to various kinds of annoying mistakes (Xu et al., 2020; See and Manning, 2021). One such error is contradiction or inconsistency, as illustrated in Table 1. In order to reduce contradiction errors, one approach is to develop a detection model to identify such problems after a system produces response candidates. To this end, Welleck et al. (2019) characterized the modeling of persona-related consistency as a natural language inference (NLI) problem and constructed a dialog NLI dataset based on Persona-Chat. To cover a broader range of consistency types (e.g., persona, logic, causality, etc.), Nie et al. (2021) collected DECODE, a dataset containing human-written dialogues with self-contradictory utterances. Besides the in-distribution human-human dialogues test set, they collected an out-of-distribution set containing dialogues between humans and different chatbots. This human-bot test set can better evaluate models' performance in detecting contradictions in conversations between humans and chatbots, which is the focus of this work.
We find that one failure mode of the state-of-the-art (SOTA) contradiction detection model is due to the frequent anaphora and ellipses in chatbot utterances. One typical example is shown in Table 1, where the first bot utterance has an anaphor, "mine", and the last bot utterance misses an important entity, "Johnny Cash's concert". Such incomplete utterances prevent detection models from fully understanding the bot utterances in the dialog, thus leading to detection errors. Therefore, we propose to first rewrite the bot utterances to recover all the missing information and then perform the contradiction detection task. To support this goal, we first collect a new dataset for incomplete utterance rewriting, which is a widely studied task (Pan et al., 2019; Su et al., 2019; Hao et al., 2021) but still lacks supporting datasets for open-domain conversations in English (Quan et al., 2019). Then we propose a rewriting model trained on this data to rewrite the anaphors to their corresponding entities and restore any missing content. We conduct experiments on the DECODE dataset (Nie et al., 2021), and demonstrate substantial performance improvement in contradiction detection when the utterance rewriting module is applied. Overall, we have made the following contributions in this work:

Table 1: Examples of human-bot conversations with contradictory bot utterances marked by red color. We rewrite every bot utterance to restore co-references and ellipsis (the restored parts are highlighted by bold font).

• We curated a new dataset for incomplete utterance rewriting and built a rewriting model for utterance restoration.
• With bot utterance rewriting, we can improve the previous best contradiction detection model by 6.5% in AUPR and 4.5% in joint accuracy that considers both contradiction and evidence labels.
• We relabeled the human-bot test set of the benchmark DECODE dataset and corrected some annotations. 1

Task Definition
We formalize dialogue contradiction detection as an NLI task. Given a list of utterances x = {u^H_1, u^B_1, ..., u^H_n, u^B_n} representing a dialogue, the task is to determine whether the last bot utterance u^B_n contradicts any previously conveyed information contained in the past bot utterances {u^B_1, ..., u^B_{n-1}}. Note that we assume human and bot alternating turns here (referred to as H and B), but they can be human-human conversations too. In addition to the binary label y, with 0 or 1 corresponding to the non-contradiction and contradiction labels, respectively, we also output a set of indices I ⊆ {1, ..., n − 1} identifying the utterances in {u^B_1, ..., u^B_{n-1}} that are actually contradicted by the last utterance u^B_n.
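For concreteness, the inputs and outputs defined above can be represented with a minimal data structure; the class and field names here are illustrative, not from the paper's released code:

```python
from dataclasses import dataclass, field

@dataclass
class DetectionExample:
    """One dialogue contradiction detection example."""
    utterances: list        # alternating turns u^H_1, u^B_1, ..., u^H_n, u^B_n
    label: int = 0          # y: 1 if the last bot utterance is contradictory
    evidence: set = field(default_factory=set)  # I: 1-based indices of contradicted past bot turns

    def past_bot_utterances(self):
        # under the H/B alternation, past bot turns occupy odd 0-based
        # positions, excluding the final (bot) utterance itself
        return self.utterances[1:-1:2]
```

For example, a four-turn dialogue whose last bot utterance contradicts the first bot utterance would carry `label=1` and `evidence={1}`.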

Detection Models
Based on the benchmark DECODE dataset, Nie et al. (2021) proposed two approaches for contradiction detection: an unstructured approach and a structured utterance-based (SUB) approach. The former concatenates all the previous utterances in the dialogue history to form a single textual context. Then a classification model f_θ is applied to the context and the last utterance to infer the probability of contradiction. The latter SUB approach pairs every past bot utterance with the last one, and then feeds each pair to the classification model f^SUB_θ.

1 Code and data are released at: https://github.com/jind11/utterance-rewriting
The final contradiction probability is the maximum over all the pairwise outputs:

p_contra = max_{i ∈ {1, ..., n−1}} f^SUB_θ(u^B_i, u^B_n)

The supporting evidence (SE) for a contradiction decision contains the pairs whose contradiction probability is higher than a threshold η, i.e.,

I = { i : f^SUB_θ(u^B_i, u^B_n) > η }

Nie et al. (2021) demonstrated that the latter SUB approach significantly outperforms the former on the human-bot test set (by more than 10% in accuracy). This SUB method is the current SOTA model for contradiction detection, and we adopted it as one baseline.
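The SUB aggregation step can be sketched as follows; `score` stands in for the trained pairwise classifier f^SUB_θ (a fine-tuned RoBERTa in the paper), so the toy scorer below is only a stub:

```python
def sub_detect(past_bot_utts, last_utt, score, eta=0.5):
    """Pair each past bot utterance with the last one, score every pair,
    then take the max as the contradiction probability and the 1-based
    indices above the threshold eta as supporting evidence (the set I)."""
    probs = [score(u, last_utt) for u in past_bot_utts]
    contradiction_prob = max(probs)
    evidence = {i + 1 for i, p in enumerate(probs) if p > eta}
    return contradiction_prob, evidence

# Toy stand-in scorer: pretends pairs sharing a content word contradict.
def toy_score(premise, hypothesis):
    shared = (set(premise.split()) & set(hypothesis.split())) - {"i"}
    return 0.9 if shared else 0.1

p, ev = sub_detect(["i love johnny cash", "i like tea"],
                   "i hate johnny cash", toy_score)
# p == 0.9, ev == {1}
```

Only the aggregation (max over pairs, threshold for evidence) mirrors the SUB method; the pairwise scorer is where all of the modeling effort lies.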

Utterance Rewriting for Contradiction Detection
As discussed earlier, we noticed that many bot utterances contain co-references and ellipses and thus the baseline model fails to capture the semantic meaning or contradiction in the sentence pair. Therefore, we propose to first rewrite the bot utterances to restore co-references and ellipsis, and then feed the rewritten utterances (e.g., the dialogues on the right in Table 1) to the model. To this end, we first collect a new dataset specially for utterance rewriting and then develop a rewriting model.

Rewriting Data Collection
To obtain parallel training data for utterance rewriting in open-domain conversations, we sub-sampled 6,000 and 4,000 dialogues from the DailyDialog (Li et al., 2017) and BST (Smith et al., 2020) datasets, respectively, as the training set. Besides, we sub-sampled 400 dialogues each from DailyDialog and BST as the test set. We only use the first six utterances in each dialog. Specifically, we use the first two utterances (from both speakers) as leading context and ask annotators to check the remaining four utterances, following Pan et al. (2019). 2 Empirically we find that the context information needed to resolve co-references and ellipsis can always be found within 1-3 turns (Pan et al., 2019; Su et al., 2019). We ask annotators to identify whether an utterance is complete and can be understood without reading the context, and if not, to rewrite it to restore any missing information.
To ensure annotation quality, we hired three in-house professional data annotators, who were first trained via a pilot annotation session and then proceeded to the official annotation phase after passing our qualification set. In the official annotation phase, two of them first worked independently, and then the third annotator adjudicated over the two annotations, picking the better one or making revisions if needed. Besides, we periodically sampled 10% of the annotations from each annotator throughout the annotation process and provided feedback. An annotator's work was considered valid only when the accuracy of the examined results surpassed 95% (we deem rewrites that are both correct and complete as correct rewrites, and calculate the percentage of correct rewrites as the accuracy). Overall, we obtained 40,000 and 3,200 samples for training and testing, respectively.

Rewriting Model
We treat rewriting as a sequence-to-sequence (Seq2Seq) task and adopt two pre-trained Seq2Seq models, T5 (Raffel et al., 2019) and Pegasus (Zhang et al., 2020). The input is the concatenation of the context utterances and the original last utterance, with special tokens inserted before each utterance to indicate its speaker.
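The input sequence for the rewriter can be assembled as below; the particular speaker tokens (`<H>`, `<B>`) are our illustrative choice, not necessarily the ones used in the released code. The example dialogue comes from Table 1.

```python
def build_rewrite_input(context, last_speaker, last_utt):
    """Concatenate the context utterances and the utterance to rewrite,
    prefixing each with a token marking its speaker (human or bot)."""
    parts = [f"<{spk}> {utt}" for spk, utt in context]
    parts.append(f"<{last_speaker}> {last_utt}")
    return " ".join(parts)

src = build_rewrite_input(
    [("B", "Mine is johnny cash of course.")],
    "B", "I have not been since last year though.")
# src == "<B> Mine is johnny cash of course. <B> I have not been since last year though."
```

The Seq2Seq model is then trained to emit the restored utterance, e.g., "I have not been to Johnny Cash's concert since last year though."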

Contradiction Detection Data
We use the DECODE dataset (Nie et al., 2021) in this study. However, we found some issues with its human-bot test set: (1) Around one third of the non-contradiction dialogues contain only one human and one bot utterance, which over-simplifies the detection task since there are no previous bot utterances. (2) Not every bot utterance has been annotated for contradiction with respect to its history. (3) No evidence is labeled to indicate which history bot utterance contradicts the last one.
To resolve the above-mentioned issues, we curated new annotations using the dialogues in the original test set. Details of the annotation procedure are provided in Section A of the Appendix. Overall, we obtained 1,889 samples (453 positive samples and 1,436 negative ones), which we call the unbalanced set. Besides, we sub-sampled 453 negative samples and combined them with all the positive ones to form a balanced set. Table A.1 summarizes the data statistics. We will release this new test set.

Baselines
We compare the contradiction detection performance with and without rewriting bot utterances, all based on the same SUB model framework, which is the current SOTA model for contradiction detection. Another baseline we introduced is SUB-Concat, where each bot utterance is the concatenation of the original one and the preceding human utterance, so that the missing information (co-reference or ellipsis) can be recovered from the included previous utterance.
For rewriting, we compare our model against four strong baselines: one is the off-the-shelf SOTA co-reference resolution model trained on OntoNotes (named "Co-reference") (Toshniwal et al., 2021; Wu et al., 2020), and the other three are developed on three related rewriting datasets, named "CANARD" (Elgohary et al., 2019), "Gunrock" (Zhang et al., 2020), and "MuDoCo" (Tseng et al., 2021). Specifically, CANARD is a query rewriting dataset that aims to rewrite a query/question based on the previous consecutive QA pairs for conversational question answering. The Gunrock dataset focuses on resolving ellipsis while containing a small portion of co-reference cases; it consists of 1,745 samples where all dialogues are curated in-house following the Alexa Prize competition format. The MuDoCo dataset targets query rewriting for task-oriented dialogues covering six domains.

Evaluation Metrics
To evaluate incomplete utterance rewriting, we use both automatic and human evaluation. For human evaluation, we propose two metrics: (1) Correctness; (2) Completeness. The former checks whether the rewritten part is correct and consistent with the information in the dialogue context, while the latter checks whether the rewritten utterance is complete enough to be understood without reading the context. We use binary labels for both metrics and report the percentage of positive labels after human evaluation. For automatic evaluation, in addition to the widely used BLEU (Papineni et al., 2002), ROUGE-1 (R-1), and ROUGE-L (R-L) (Lin, 2004), we add two more metrics specially for evaluating text editing models: exact match (EM) accuracy, and the restoration F-score proposed in Pan et al. (2019), which focuses on n-grams that contain at least one restored word. Specifically, the n-gram restoration precision, recall, and F-score are calculated as:

P_n = |{restored n-grams} ∩ {n-grams in ref}| / |{restored n-grams}|
R_n = |{restored n-grams} ∩ {n-grams in ref}| / |{n-grams in ref}|
F_n = 2 · P_n · R_n / (P_n + R_n)

where "restored n-grams" refers to the n-grams in the restored utterance that contain at least one restored word, and "n-grams in ref" refers to the n-grams in the reference that contain at least one restored word.
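A stdlib sketch of the restoration F-score follows; it is our reading of the metric (whitespace tokenization, restored words taken to be words absent from the original utterance), and the official implementation may differ in such details:

```python
from collections import Counter

def restoration_fscore(original, candidate, reference, n=1):
    """F_n over n-grams containing at least one restored word, where
    restored words are words absent from the original utterance."""
    orig_words = set(original.split())

    def restored_ngrams(sentence):
        toks = sentence.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        # keep only n-grams containing at least one restored word
        return Counter(g for g in grams if any(w not in orig_words for w in g))

    cand, ref = restored_ngrams(candidate), restored_ngrams(reference)
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

A candidate identical to the reference scores 1.0, while a candidate that restores nothing (i.e., equals the original) scores 0.0.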
For contradiction detection, we first set the threshold η to 0.5, and report Precision/Recall/F1 for both the binary contradiction label and the supporting evidence labels, following Thorne et al. (2018). 3 Besides, we report Joint Accuracy, which measures how often both the 2-way contradiction detection and the supporting evidence retrieval are correct. Considering that these scores are sensitive to η, we also report the Area under the Precision-Recall Curve (AUPR) as a threshold-independent score.
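AUPR can be computed by sweeping the decision threshold over the sorted scores; this is a simplified step-wise version (it breaks score ties one example at a time rather than jointly):

```python
def aupr(labels, scores):
    """Area under the precision-recall curve for binary labels
    (1 = contradiction), given per-example contradiction scores."""
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    area = prev_recall = 0.0
    for _, label in pairs:
        tp += label
        fp += 1 - label
        precision = tp / (tp + fp)
        recall = tp / total_pos
        area += precision * (recall - prev_recall)  # rectangle under the step
        prev_recall = recall
    return area

print(aupr([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 5/6 ≈ 0.833
```

Because the whole curve is integrated, AUPR does not depend on the particular choice of η, unlike the Precision/Recall/F1 numbers.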

Experimental Setup
For utterance rewriting, we used three kinds of pre-trained models: T5-Base, T5-Large, and Pegasus-Large, whose parameter sizes are 220M, 770M, and 568M, respectively. Each model is trained for 4 epochs with a learning rate of 5e-5, and beam search (beam size of 5) is used for generation.
For contradiction detection, following Nie et al. (2021), we used the RoBERTa-Large model (330M parameters), which is trained for 3 epochs with a learning rate of 1e-5. We used the Huggingface Transformers code base 4 and all experiments were run on Nvidia V100 GPUs.

Utterance Rewriting
We performed both automatic and human evaluation for utterance rewriting (please refer to Section 3.3 for evaluation details). Table 3 summarizes the automatic evaluation results. As can be seen, the three models perform similarly overall, with T5-Large slightly outperforming the other two. We thus adopt it as the main rewriting model in later experiments.

3 https://github.com/sheffieldnlp/fever-scorer
4 https://github.com/huggingface/transformers/tree/master
We also sub-sample 100 rewritten utterances by T5-Large for human evaluation. As shown in Table 4, the correctness and completeness scores for both test sets are above 85%, validating the high quality of the rewriting model. We also report the change rate, defined as the percentage of rewritten utterances that differ from the original ones (differences only in punctuation and upper/lower case are not counted). The bottom block of Table 4 shows the percentage of utterances containing co-reference, ellipsis, or either, i.e., incomplete utterances. We see that co-reference and ellipsis occur almost equally frequently in incomplete utterances. Taken together, these numbers demonstrate that the rewriting model covers most of those incomplete utterances.

Table 2 compares the contradiction detection performance without rewriting and with rewriting by different rewriting models. First of all, the SUB-Concat method without rewriting does not yield any performance gain even though it includes the context utterances. More importantly, after rewriting all bot utterances for both training and test sets, only our rewriting model leads to significant improvements across all evaluation metrics, while the baseline rewriting models either maintain or degrade the performance (we provide the rewriting performance of these baselines in Section B of the Appendix for reference). The AUPR metric improves by around 2.8% and 3.2% absolute on the balanced and unbalanced sets with our model, respectively. We also implemented a model ensemble where we rewrite bot utterances with each of our three rewriting models (T5-Base/Large and Pegasus-Large), run contradiction detection on each version, and average the contradiction scores to obtain the final prediction. This further improves the detection performance over single models. Overall, we achieve substantial increases of 4.2% and 6.5% in AUPR and 4.5% and 3.4% in Joint-Acc for the balanced and unbalanced sets, respectively.
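The score-averaging ensemble described above can be sketched as follows; the per-model probabilities here are stand-ins for the outputs of detectors fed T5-Base, T5-Large, and Pegasus-Large rewrites:

```python
def ensemble_contradiction_prob(prob_per_model):
    """Average the contradiction probabilities produced by detectors
    run on dialogues rewritten by different rewriting models."""
    return sum(prob_per_model) / len(prob_per_model)

final = ensemble_contradiction_prob([0.8, 0.6, 0.7])  # ≈ 0.7
```

Averaging smooths out disagreements caused by differing rewrites of the same utterance, which is why the ensemble outperforms any single rewriting model.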

Error Analysis
We conducted additional error analysis to understand the performance gains and remaining errors. We first obtained 95 false negative samples from the "SUB-Bot only" model without rewriting, and then manually identified 28 samples whose last bot utterances are incomplete. We then manually rewrote those incomplete bot utterances. With such manual rewriting, we are able to correctly classify 18 out of 28 samples as positive (64.3% accuracy), whereas with the T5-Large rewriting model, 15 samples are correctly predicted (53.6% accuracy). This comparison indicates that our automatic rewriting pushes the performance improvement close to the upper bound achieved by manual rewriting. More error analysis is provided in Section C of the Appendix.

Why Does Utterance Rewriting Help?
As illustrated in Table 1, in order to infer the entailment relationship between the premise (i.e., "Mine is johnny cash of course.") and the hypothesis (i.e., "I have not been since last year though."), we need to resolve the anaphora and ellipsis so that the key information can be restored, e.g., "Mine" is replaced by "My favorite singer" in the premise, and the missing phrase "to Johnny Cash's concert" is restored in the hypothesis. Without restoring such key information from the dialogue context, the contradiction detection model cannot fully understand the premise and hypothesis sentences and thus cannot accurately detect contradictory cases.
One could argue that we could simply concatenate the context with the premise and hypothesis, respectively, so that the detection model could pick up the missing information from the context itself. However, the baseline method "SUB-Concat", which follows this setting, still under-performs the baseline without concatenated context (i.e., SUB-Bot only). This indicates that when the premise and hypothesis are organized as a multi-turn dialogue rather than single-turn sentences, the NLI-based detection model is no longer good at inferring their relationship. Therefore, we use the utterance rewriting model to extract the most necessary information from the context and insert it into the bot utterances, so that we can keep the single-turn format while restoring the missing information for entailment inference.

Future Work
We will keep improving the utterance rewriting model. Besides, we will showcase that utterance rewriting can also help improve other dialogue-related tasks, such as dialogue state tracking and response generation for task-oriented dialogues, response selection and generation for open-domain dialogues, etc.

Conclusion
In this work, we aim to improve contradiction detection in chatbot utterances via rewriting to restore anaphora and ellipsis. To develop such an utterance rewriting model, we curated a dataset by crowd-sourcing and demonstrated that the rewriting quality is satisfactory. With such a rewriting technique, we are able to significantly improve the contradiction detection performance.

A Contradiction Detection Data Collection
Considering that the original human-bot test set of the benchmark DECODE dataset is problematic, we curated new annotations based on the dialogues of the original test set via the following steps: (1) We first obtained 507 unique and full dialogues from the original human-bot test set by merging dialogues with overlaps and removing dialogues of only one turn. We then obtained 1,889 partial dialogues for annotation by cutting each full dialogue from the beginning to each bot utterance, so that we can annotate whether each bot utterance contradicts its context. (2) In the first round of annotation, we asked three Amazon Mechanical Turk workers (from English-speaking countries, including the USA, England, and Canada) to annotate both the binary contradiction label and the evidence indices that indicate which history bot utterances contradict the last one. When setting up the annotation interface, we provided one line of guidance warning annotators not to reveal any personal information during annotation. We keep the samples with three identical votes as finalized samples and pass those without three identical votes to the second round.
(3) In the second round, we provide the maximum set of evidence indices to another three AMT workers and let them verify it and write down new annotations if they disagree. Again, samples with three agreements are selected as finalized, and the rest are passed to the authors of this work for final adjudication. Finally, among all 1,889 samples, we obtained 453 positive samples and 1,436 negative ones, which we call the unbalanced set. Besides, we also sub-sampled 453 negative samples and combined them with all positive ones to form the balanced set.

B Rewriting Quality of Baselines

Table B.2 compares our rewriting model with the baselines developed on three related datasets for utterance rewriting (CANARD, Gunrock, and MuDoCo); we report performance on our rewriting test set. As expected, our rewriting model, which is trained on our own rewriting dataset, performs the best, and combining Table 2 and Table B.2 shows that better rewriting quality corresponds to better contradiction detection performance.

C Qualitative Error Analysis
Among all 95 false negative samples predicted by the baseline, we find that the last bot utterances of 28 samples are incomplete and need rewriting for restoration. After automatically rewriting all bot utterances, 15 of these samples are predicted correctly, while 13 remain false negatives. We analyze the error patterns of these 13 false negatives after rewriting and categorize the errors into four types: numerical reasoning, logical reasoning, common sense reasoning, and hard to judge. Table C.3 provides examples of each type. The detailed definitions of these four types are: (1) Numerical reasoning: the model needs to perform numerical calculation or comparison to make a decision; (2) Logical reasoning: logical reasoning is required for prediction; (3) Common sense reasoning: common sense knowledge is needed for reasoning; (4) Hard to judge: it is hard even for humans to judge whether a contradiction really exists. Table C.4 provides several examples that are false negatives before rewriting bot utterances but are correctly predicted after rewriting. As can be seen, the rewriting process supplies the critical information needed for detecting contradictions. Taking the first sample in Table C.4 as an example, there is ellipsis in the last two bot utterances of the original dialogue, which misleads the model. After rewriting, the last two bot utterances become complete, which makes the model's decision much easier.