I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling

To quantify how well natural language understanding models can capture consistency in a general conversation, we introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues. We show that: (i) our newly collected dataset is notably more effective at providing supervision for the dialogue contradiction detection task than existing NLI data, including those aimed at covering the dialogue domain; (ii) Transformer models that explicitly exploit utterance structure for dialogue contradiction detection are more robust than standard (unstructured) Transformers, generalizing better to both our analysis sets and out-of-distribution dialogues. We also show that our best contradiction detection model correlates well with human judgments, and we provide evidence for its usage in both automatically evaluating and improving the consistency of state-of-the-art generative chatbots.


Introduction
Recent progress on neural approaches to natural language processing (Devlin et al., 2019; Brown et al., 2020) and the availability of large amounts of conversational data (Lowe et al., 2015; Smith et al., 2020a) have triggered a resurgence of interest in building intelligent open-domain chatbots. Newly developed end-to-end neural bots (Zhang et al., 2020; Adiwardana et al., 2020; Roller et al., 2020) are claimed to be superior to their predecessors (Worsnick, 2018; Zhou et al., 2020) according to various human evaluation techniques (See et al., 2019; Li et al., 2019b; Adiwardana et al., 2020) that aim to give a more accurate measure of what makes a good conversation. While this success is indisputable, there is still a long way to go before we arrive at human-like open-domain chatbots. For example, it has been shown that open-domain chatbots frequently generate annoying errors (Adiwardana et al., 2020; Roller et al., 2020), and a notorious class among these is contradiction, or consistency, errors.
When interacting with chatbots, people carry over many of the same expectations as when interacting with humans (Nass and Moon, 2000). Self-contradictions (see examples in Figure 1) by these bots are often jarring, immediately disrupt the conversational flow, and lend support to arguments about whether generative models could ever really understand what they are saying at all (Marcus, 2018). From a listener's perspective, inconsistent bots fail to gain users' trust and long-term confidence in the interaction. From a speaker's perspective, contradiction violates the maxim of quality in Grice's cooperative principle (Grice, 1975): "Do not say what you believe to be false." Hence, efforts to reduce contradictory or inconsistent conversations by open-domain chatbots are imperative.
Historically, modularizing dialogue systems, i.e., assigning an aspect of conversational modeling to a specific component and then integrating it back into the dialogue system, can often help improve overall system satisfaction (Fang et al., 2017; Chen et al., 2018). Prior work (Welleck et al., 2019) characterized the modeling of persona-related consistency as a natural language inference (NLI) problem (Dagan et al., 2005; Bowman et al., 2015) and constructed a dialogue NLI dataset based on Persona-Chat (Zhang et al., 2018), but so far state-of-the-art chatbots (Roller et al., 2020) have not been able to make use of such techniques. Overall, the challenge remains that we are still unable to answer a simple yet important question: "how well can a natural language understanding module model the consistency (including persona, logic, causality, etc.) in a general conversation?" Without the ability to measure this, it is unclear to what degree building new modules or techniques can in turn help prevent contradictory responses during generation.
Seeking to answer this question, we introduce the DialoguE COntradiction DEtection task (DECODE) and collect a new conversational dataset containing human-written dialogues in which one of the speakers deliberately contradicts what they have previously said at a certain point during the conversation. We also collect an out-of-distribution (OOD) set of dialogues in a human-bot interactive setting, which contain human-labeled self-contradictions made by different chatbots.
We then compare a set of state-of-the-art systems, including a standard unstructured approach and a proposed structured approach for utilizing NLI models to detect contradictions. In the unstructured approach, a Transformer NLI model directly takes in the concatenation of all utterances of the input dialogue for prediction, following the mainstream paradigm of NLU modeling. In the structured approach, utterances are paired separately before being fed into Transformer NLI models, explicitly taking account of the natural dialogue structure. (Our DECODE dataset is publicly available at https://parl.ai/projects/contradiction.)
Results reveal that: (1) our newly collected dataset is notably more effective at providing supervision for the contradiction detection task than existing NLI data, including those aimed at covering the dialogue domain; (2) the structured utterance-based approach to dialogue consistency modeling is more robust in our analysis and transfers better to OOD human-bot conversations than the unstructured approach. This finding challenges the mainstream unstructured approach of simply applying pre-trained Transformer models and expecting them to learn the structure, especially in OOD scenarios, which are often the case when incorporating NLU modules into NLG systems, since intermediate in-domain data are scarce.
Finally, with these improvements on the contradiction detection task, we show that our best resulting contradiction detector correlates well with human judgments and is suitable for use as an automatic metric for checking dialogue consistency. We further provide evidence for its usage in improving the consistency of state-of-the-art generative chatbots.

Related Work
Several prior works on improving dialogue consistency have explored direct modeling of the dialogue context in generation algorithms. The modeling can be implicit, where dialogue consistency-related information such as style (Wang et al., 2017), topics, or personal facts is maintained in distributed embeddings (Li et al., 2016; Zhang et al., 2019a), neural long-term memories (Bang et al., 2015), hierarchical neural architectures (Serban et al., 2016), latent variables (Serban et al., 2017), topical attention (Dziri et al., 2019b), or even self-learned feature vectors (Zhang et al., 2019b). Some works have grounded generation models on explicit user input (Qian et al., 2018) or designated personas (Zhang et al., 2018). Although improvements on automatic generation metrics were often shown for guided response generation based on such consistency modeling, the issue of contradiction has never been resolved, nor have generally applicable methods to gauge consistency improvements been developed. Further, simply scaling models has not made the problem go away, as is evident in the largest chatbots trained, such as BlenderBot with up to 9.4B-parameter Transformers (Roller et al., 2020).
More similar to our work is the use of NLI models for dialogue consistency. Dziri et al. (2019a) attempted to use entailment models trained on synthetic datasets for dialogue topic coherence evaluation. In particular, Welleck et al. (2019) constructed the dialogue NLI dataset, and Li et al. (2020) utilized it to try to reduce inconsistency in generative models via unlikelihood training, in a preliminary study that reported perplexity results but did not measure actual generations or contradiction rates. We note that the dialogue NLI dataset is only semi-automatically generated, with coverage limited to Persona-Chat data (Zhang et al., 2018), whereas our DECODE is human-written and spans diverse domains. Our task also involves logical and context-related reasoning beyond personal facts; for example, the dialogue at the bottom of Figure 1 shows a non-persona-related contradiction. We show in our experiments that transfer from DECODE is subsequently more robust than from dialogue NLI on both human-human and human-bot chats.
Task and Data

Dialogue Contradiction Detection
We formalize dialogue contradiction detection as a supervised classification task. The input of the task is a list of utterances x = {u_0, u_1, u_2, ..., u_n} representing a dialogue or a dialogue snippet. The output is y, indicating whether the last utterance u_n contradicts any previously conversed information contained in the dialogue history {u_0, u_1, u_2, ..., u_{n-1}}, where y is 0 or 1, corresponding to the non-contradiction and contradiction labels respectively. Preferably, the output should also include a set of indices I ⊆ {0, 1, ..., n-1} identifying a subset of {u_0, u_1, u_2, ..., u_{n-1}} that contains information actually contradicted by the last utterance u_n. The extra indices I require models to pinpoint the evidence for the contradiction, providing an extra layer of explainability.
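Concretely, a DECODE example can be represented with a small data structure like the following (field names are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogueExample:
    """One contradiction-detection example: a dialogue plus its labels."""
    utterances: List[str]                # u_0, ..., u_n
    label: int                           # 1 = u_n contradicts the history, 0 = not
    evidence: List[int] = field(default_factory=list)  # I, a subset of {0, ..., n-1}

example = DialogueExample(
    utterances=[
        "I don't have any pets.",               # u_0
        "That's a shame, pets are great fun.",  # u_1
        "My dog loves playing fetch.",          # u_2 contradicts u_0
    ],
    label=1,
    evidence=[0],
)

# Evidence indices may only point into the history, never at u_n itself.
assert all(i < len(example.utterances) - 1 for i in example.evidence)
```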

Data Collection
Annotation Design. Our goal is first to collect training and evaluation data for this task. We thus collect dialogues in which the last utterance contradicts some previous utterances in the dialogue history. To obtain such dialogues, we give annotators dialogue snippets from pre-selected dialogue corpora and ask them to continue the conversation by writing one or two utterances such that the last utterance by the last speaker contradicts the dialogue history. We also ask annotators to mark all the utterances in the dialogue history that are involved in the contradiction as supporting evidence. Figure 2 shows the annotation user interface. We ask annotators to write contradicting utterances based partly on existing dialogues, rather than collecting new dialogues from scratch, because the provided dialogues often convey semantically rich contexts from different domains and inspire annotators to write more diverse examples. We crowdsource the continuation and annotation data with Amazon Mechanical Turk, and the collection is based on the ParlAI framework.
Quality Control. We apply the following mechanisms to ensure the quality of the collected data (see also Geva et al., 2019):
• Verification: This subtask ensures that the dialogue examples indeed contain an ending utterance that contradicts the dialogue history. We ask 3 additional annotators to verify some of the collected examples, select the ones where all three verifiers agree on the contradiction label, and use these for our resulting validation and test sets. This mechanism ensures that there is a clear, agreed-upon contradiction in the dialogue, preventing the subjectivity and ambiguity issues found in some NLU tasks (Nie et al., 2020b). See the appendix for statistics about the data verification.

Dataset
We collected 17,713 human-written contradicting dialogues, of which 4,121 are verified by 3 annotators. The pre-selected dialogue source corpora span several existing dialogue datasets. We balance the labels in the dataset following standard NLI evaluation practice (Bowman et al., 2015; Welleck et al., 2019).

To probe model robustness, we additionally construct auxiliary test sets by transforming verified dialogues:

• Add Two Turns (A2T): We insert a pair of randomly sampled utterances into the dialogue such that the inserted utterances fall between the two original contradicting utterances. This gives a new contradicting dialogue with a longer dialogue history.
• Remove Contradicting Turns (RCT): We remove all the turns (all pairs of utterances) marked as supporting evidence for the contradiction. (All the dialogues in the dataset involve two speakers that take turns speaking; to maintain this structure, for each marked utterance we remove the pair of utterances that represents one turn of conversation.) This removes the information that was involved in the contradiction, so the resulting label should be "non-contradiction". Note that these two data transformations rely on utterance-level evidence annotations and are therefore not applicable to DNLI and other NLI data.

Human-Bot Test Set. Our main collected dataset involves human-written dialogues containing contradicting utterances based on human-human dialogues from existing corpora. In practice, to evaluate the response quality of a machine rather than a human in terms of consistency, we care about how well a contradiction detector can perform in human-bot interactive conversations. To that end, we further collect human-bot dialogue data by employing workers on Amazon Mechanical Turk to interact with a diverse set of open-domain bots. These include Poly-encoder (Humeau et al., 2019) based retrieval models, generative models (Roller et al., 2020), unlikelihood-trained models (Li et al., 2019a), and retrieve-and-refine models (Weston et al., 2018; Roller et al., 2020), either pre-trained on a previously existing Reddit dataset extracted and obtained by a third party and hosted by pushshift.io (Baumgartner et al., 2020) or fine-tuned on the Blended Skill Talk (BST) dialogue tasks (Smith et al., 2020b), i.e., all the dialogue models compared in the study of Roller et al. (2020). During the collection, if the bot generates an utterance that contradicts itself, we ask the worker to mark the utterance. In some of the dialogues, workers are explicitly instructed to goad the bots into making contradicting utterances. The final human-bot test set we derive contains 764 dialogues, half of which end with a contradicting utterance by the bot. All the dialogues in the set,

Main (Human-Human):
Human1: I recently had a long weekend with some old friends. It was fantastic.
Human2: It must have been fun to catch up with them.
Human1: It was. We've spent a lot of time together and apart now, so it was good to catch up.
Human2: Well I hope you guys continue to stay in touch.
Human1: Sometimes friends grow apart over time though.
Human2: Yeah, that's true. You should forget about them and go on with your life.

Human-Bot:
Human1: It's a beautiful day. I'm going to take a walk later.
Bot: What do you do for a living?
Human1: I work in a doctor's office. How about yourself?
Bot: I am a stay at home mom.
Human1: That's great! How many kids do you have?
Bot: I don't have any kids yet, but I plan on getting married soon.

Add Two Turns (A2T):
Human1: i hate when ash from my cigarette drops in my work pants
Human2: oof that sucks really bad
Human1: yeah, i haave to wait till i get home to get the stain off, it is really embarrasing
Human2: yea i can imagine it is
Human1: Every time I look at it I remember the good times we had together.
Human2: well thats nice
Human1: I will have to wash the stain with soap and water.
Human2: Ash stains on your pants is not a big deal though.

Remove Contradicting Turns (RCT):
Human1: I was disgusted when I noticed the food on the table
Human2: What kind of food?
Human1: It was brussel sprouts and Liver
Human2: Oh, disgusting.
Human1: I couldn't even bear to take a single bite
Human2: Brussel sprouts and liver sounds delicious to me!

Table 3: Dialogue examples for different dataset types. Underline indicates that the pair of utterances is randomly added. Strikethrough text indicates that the pair of utterances is removed. The Human-Human, Human-Bot, and A2T examples end with a contradicting utterance, whereas the RCT example has an ending utterance whose original contradicting pair of utterances in the dialogue history has been removed.
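The A2T and RCT transformations can be sketched as simple list operations over a dialogue and its evidence annotations (a simplified sketch; the paper samples the inserted turn from other dialogues, which we take as a given pair here, and the alternating-speaker turn layout is our assumption):

```python
from typing import List, Tuple

def add_two_turns(dialogue: List[str], evidence: List[int],
                  sampled_turn: Tuple[str, str]) -> List[str]:
    """A2T: insert a sampled pair of utterances after the last evidence
    utterance, i.e. between the original contradicting utterances.
    The label stays 'contradiction'; the history just gets longer."""
    pos = max(evidence) + 1
    return dialogue[:pos] + list(sampled_turn) + dialogue[pos:]

def remove_contradicting_turns(dialogue: List[str], evidence: List[int]) -> List[str]:
    """RCT: drop each marked utterance together with its partner utterance in
    the same turn (speakers alternate, so turn k covers indices 2k and 2k+1).
    The contradicted information disappears, so the label flips to 0."""
    drop = set()
    for i in evidence:
        start = i - (i % 2)
        drop.update({start, start + 1})
    return [u for j, u in enumerate(dialogue) if j not in drop]

dialogue = ["I have no pets.", "Oh, why not?", "Allergies, sadly.",
            "That makes sense.", "Anyway, my dog says hi!"]
longer = add_two_turns(dialogue, evidence=[0], sampled_turn=("Nice day.", "It is."))
shorter = remove_contradicting_turns(dialogue, evidence=[0])
```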
The auxiliary and human-bot test sets are aimed at testing models' robustness and generalizability beyond accuracy on the collected human-written test set (Ribeiro et al., 2020; Gardner et al., 2020), and at giving a more comprehensive analysis of the task. Table 2 summarizes the final overall dataset. Table 3 gives one example for each dataset type.

Models
To model the dialogue consistency task, we first employ techniques from NLI sequence-to-label modeling, where the input is a pair of textual sequences and the output is a label. The benefit of such modeling is that we can directly make use of existing NLI datasets during training. However, unlike previous work (Welleck et al., 2019) that directly utilized NLI models with a 3-way output among "entailment", "contradiction", and "neutral", we modify the model to have a 2-way output between the "contradiction" and "non-contradiction" labels. This is because the task is, in essence, centered on the detection of inconsistency.
More formally, we denote the model as ŷ_pred = f_θ(C, u), where ŷ_pred is the prediction of the label y, i.e., whether the textual response u contradicts some textual context C, and where θ are the parameters of the model. We then explore two different approaches to utilizing f_θ for dialogue contradiction detection.

Dialogue Contradiction Detectors
As described in subsection 3.1, a detector is asked to determine whether the last utterance of the dialogue, u_n, contradicts the previous dialogue history {u_0, u_1, u_2, ..., u_{n-1}}. In what follows, we describe two approaches that propose differing f_θ for this detection problem.
Unstructured Approach. In this approach, we simply concatenate all the previous utterances in the dialogue history to form a single textual context. Then, we apply f_θ to the context and the last utterance to infer the probability of contradiction.
When concatenating the utterances, we insert special tokens before each utterance to indicate the speaker of that utterance. This is intended to provide a signal of the dialogue structure to the model. Still, this approach assumes that the model can use these features adequately, learning the underlying structure of the dialogue implicitly during training.
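For the unstructured approach, building the model input amounts to flattening the history with speaker markers and pairing it with the last utterance (the exact token format below is an assumption for illustration, not the paper's):

```python
def build_unstructured_input(utterances):
    """Concatenate the dialogue history with speaker tokens, then pair it
    with the last utterance for a sequence-pair NLI-style classifier."""
    speakers = ["<speaker1>", "<speaker2>"]  # illustrative special tokens
    history = " ".join(
        f"{speakers[i % 2]} {u}" for i, u in enumerate(utterances[:-1])
    )
    return history, utterances[-1]

context, last = build_unstructured_input(["Hi!", "Hello.", "I said hi already."])
```

The (context, last) pair then plays the role of (C, u) in f_θ(C, u).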
Structured Utterance-based Approach. Since the reasoning crucially depends on the last utterance, in this method we first choose all the utterances by the last speaker to form a set S. We then pair every utterance in the set with the last utterance and feed the pairs one by one into f^UB_θ. The final contradiction probability is the maximum over all the outputs.
Additionally, the utterance-based approach is able to give a set of utterances as supporting evidence for a contradiction decision by choosing the pairs having contradiction probability higher than a threshold η_e:

I = { i | u_i ∈ S, f^UB_θ(u_i, u_n) > η_e }    (Equation 3)

This not only gives explanations for its predictions but can also help diagnose the model itself; e.g., we can measure the model's ability to provide these explanations by comparing them against the gold supporting-evidence annotations from DECODE.
One downside of this modeling approach is that it cannot capture reasoning across speakers; for example, a pronoun used by one speaker might refer to something introduced by the other speaker. Nevertheless, the utterance-based approach explicitly adds an inductive structure bias to learning and inference, which, as we will see, aids its generalization capability.
Thresholding. For both the unstructured and utterance-based approaches, the detection of a contradiction is made by comparing ŷ_pred with a threshold τ; by default, τ is 0.5.
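The structured pipeline (max over same-speaker pairs, evidence via η_e, decision via τ) fits in a few lines; `score_pair` stands in for the fine-tuned f^UB_θ and is stubbed with a toy function here, and the assumption that speakers strictly alternate is ours:

```python
def utterance_based_detect(utterances, score_pair, tau=0.5, eta_e=0.5):
    """Pair each earlier utterance by the last speaker with the final
    utterance u_n, score each pair with a 2-way NLI model, take the max
    as the dialogue-level contradiction probability, and return pairs
    above eta_e as supporting evidence."""
    n = len(utterances) - 1
    last_speaker = n % 2                       # speakers alternate
    candidates = [i for i in range(n) if i % 2 == last_speaker]
    scores = {i: score_pair(utterances[i], utterances[n]) for i in candidates}
    y_pred = max(scores.values(), default=0.0)
    evidence = sorted(i for i, s in scores.items() if s > eta_e)
    return int(y_pred > tau), evidence

dialogue = ["I have no pets.", "Oh really?", "Dogs scare me.",
            "I see.", "My dog is my best friend."]
# toy stub: only flags the known contradicting premise
stub = lambda premise, hyp: 0.9 if premise == "I have no pets." else 0.1
label, evidence = utterance_based_detect(dialogue, stub)
```

In practice `score_pair` would be a call into the Transformer NLI classifier rather than a lambda.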

Experimental Setup
We study four base pre-trained model variants for f_θ: BERT (Devlin et al., 2019), Electra (Clark et al., 2019), RoBERTa (Liu et al., 2019), and BART (Lewis et al., 2020). They represent state-of-the-art language representation models and have yielded successes on many NLU tasks. The input format of f_θ follows how these models handle sequence-pair (C and u) classification tasks, with padding, separator and other special tokens, such as position embeddings and segment features, inserted at the designated locations accordingly.
We fine-tune f_θ on different combinations of NLI training data, including SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), ANLI-R3 (Nie et al., 2020a), and DNLI (Welleck et al., 2019), as well as our DECODE Main training set. We convert the 3-way labels of the examples in existing NLI datasets to 2-way labels, and θ is optimized using a cross-entropy loss. When training f^UB_θ in the utterance-based approach on the DECODE training set, the input sequences are sampled utterance pairs from the DECODE dialogues. In other scenarios, f_θ or f^UB_θ are trained with the data treated as in normal NLI training.
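The 3-way to 2-way conversion amounts to collapsing "entailment" and "neutral" into "non-contradiction", e.g.:

```python
def to_two_way(nli_label: str) -> int:
    """Map a 3-way NLI label to the 2-way contradiction-detection label:
    1 = contradiction, 0 = non-contradiction (entailment or neutral)."""
    return int(nli_label == "contradiction")

assert [to_two_way(x) for x in ("entailment", "neutral", "contradiction")] == [0, 0, 1]
```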
The models are evaluated on the test sets described in Sec. 3.3. For the utterance-based approach, which additionally provides supporting evidence utterances (Equation 3), we report Precision, Recall, and F1 on these evidence predictions. We also report a stricter score which evaluates whether both the 2-way contradiction detection and the supporting evidence retrieval exactly match the ground truth on our DECODE Main test set.

Performance on Constructed Dataset
We test different pre-trained models with both the unstructured and the structured utterance-based approaches. We explicitly investigate model performance when trained on DNLI or ANLI-R3 and compare it with DECODE, because these are recently published NLI datasets that contain examples in a dialogue setting. However, we also provide results comparing to other NLI datasets, as well as multi-tasking all datasets at once, in addition to various ablations. The results are shown in Table 4. We now describe our key observations.

DECODE is notably more effective than other existing NLI data in providing supervision for contradiction detection in dialogue. We found that models trained on DECODE achieve higher accuracy than those trained on DNLI or ANLI-R3 on all evaluation sets, in both the unstructured and utterance-based approaches. Among the pre-trained model variants, RoBERTa performs best, with better performance on supporting evidence retrieval as well. We speculate that this is because RoBERTa's pre-training data has broader coverage than Electra's and BART's. We hope future work on dialogue contradiction detection will explore pre-training models on more dialogue-focused corpora.
The unstructured approach gets higher accuracy on the in-domain test set. A direct comparison between unstructured RoBERTa and utterance-based RoBERTa trained on DECODE reveals that the unstructured approach more often than not achieves higher accuracy than its corresponding utterance-based approach when other experimental setups are kept identical. Noticeably, unstructured RoBERTa trained on all NLI data obtained a 97.46% score, whereas the utterance-based model yielded 94.19%. This seemingly indicates that training an unstructured model yields a good representation of the consistency of the dialogue. However, further analysis on the human-bot and auxiliary test sets shows that such high accuracy is an over-amplification of the model's real understanding ability, as we discuss next.
The structured utterance-based approach is more robust and more transferable. The utterance-based model is able to maintain satisfactory performance across all the sets, whereas the unstructured model underperforms on the human-bot and RCT auxiliary test sets, with a 34.4% accuracy on RCT compared to 78.4% for the utterance-based model, in stark contrast to the high performance of the unstructured method on the in-domain DECODE Main test set. This result indicates that the unstructured approach overfits on superficial patterns in the DECODE Main training data which are still present due to RCT's construction process. The fact that the utterance-based approach transfers well to the OOD human-bot test set indicates that injecting the correct inductive structure bias is beneficial for modeling dialogue consistency. We believe this is an interesting result for research using Transformers generally, where there is currently a belief amongst some practitioners that one can simply use a standard Transformer and it will learn all the structure correctly on its own. In our setting that is not the case, and we provide a method that can rectify that failing.
In general, there is still much room for improvement. The results in Table 4 also demonstrate that modeling dialogue consistency is a demanding task. On the contradiction detection task, the best score achieved by state-of-the-art pre-trained language models on DECODE (Test-Strict) is 80.86%, and the best human-bot test score is 84.69%. Considering that all the examples in the test sets are verified by at least 3 annotators, humans are able to swiftly identify such contradictions. This suggests a large ability gap between our best automatic detectors and humans; closing this gap is an important challenge for the community.

Performance in an Interactive Setting
The results discussed above evaluate models on constructed datasets with intentionally balanced labels. This facilitates comparison between models from an NLU evaluation perspective.
In practice, we would like to evaluate how well a model can detect contradicting utterances sampled naturally from interactive human-bot dialogue. To that end, we test our trained detection models on the raw interactive human-bot dialogue data, comprising 764 dialogues with a total of 8,933 utterances. Since the contradiction task in naturally sampled dialogue can be extremely unbalanced, the total number of contradicting utterances in the raw dialogue list is only 381. We apply our contradiction detectors to every bot-generated utterance and calculate the precision, recall, and F1 on contradiction detection. Since these scores are sensitive to the threshold τ, we also evaluate the threshold-invariant Area Under the ROC Curve (AUC) (Bradley, 1997).
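AUC can be computed threshold-free as the probability that a randomly chosen contradicting utterance receives a higher detector score than a randomly chosen non-contradicting one (a minimal stdlib-only sketch; a library routine such as scikit-learn's `roc_auc_score` gives the same result):

```python
def roc_auc(scores, labels):
    """Mann-Whitney formulation of ROC AUC; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# perfect separation -> 1.0; one inversion among 4 pos/neg pairs -> 0.75
assert roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]) == 1.0
assert roc_auc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1]) == 0.75
```

This pairwise definition is why AUC remains informative even with only 381 positives among 8,933 utterances.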
As shown in Table 5, model precision on this task is not satisfactory (23.94 at best). However, the best model achieves acceptable scores on both Recall and AUC. This indicates its potential usage for strictly blocking inconsistent utterances of a generative model (bot). The table also supports the same conclusion as Table 4: the structured utterance-based RoBERTa model trained on DECODE data is the best method for contradiction detection, compared to training on other NLI data or using an unstructured approach. In the following sections we thus use this best method as our detector for further experiments.

Model vs. Human Judgement
To further understand the detector predictions and how well they might align with human judgments, we conduct the following experiment. We first divide all the utterances into two categories based on whether they are generated by a human or a bot. Then, the bot-generated utterances that have been marked by annotators as contradicting utterances are categorized into three sets based on the number of annotators that agree on the contradiction label. By design, the more annotators that agree on the contradiction label, the more plausible it is that the utterance is a contradiction. We examine the detectors' fire rates on the utterances in these 5 different categories; results are shown in Figure 4. The fire rate of utterance-based RoBERTa trained on DECODE is 5.5% on human utterances, in contrast to 74.3% on 3-agreed contradicting utterances, whereas the fire rates of unstructured RoBERTa on the different categories are more clustered together. This finding demonstrates that all the models can discriminate between utterances of a distinct nature, and that the model predictions are aligned with human judgments. Moreover, the fire rate of a strong discriminative detector could be a useful quantity for stratifying utterances.

Using DECODE as an Automatic Metric. The results presented above indicate that the predictions of the detector can easily differentiate between the quality of utterances by humans and by bots. We further investigate whether it can differentiate the quality of the utterances by different bots and be used as an automatic metric for checking generation consistency. We compare the average contradiction score of the detector with the contradiction rate by human judgments on the utterances generated by different classes of model (bots). The bots are the same set of models described in subsection 5.2 from which we collected our human-bot dialogue examples. The trend in Figure 5 reveals that the scores are positively correlated with human judgments, with a Pearson correlation coefficient of 0.81. We would expect that improvement on the DECODE task will directly increase the correlation between the automatically produced detection score and human judgments, and use of such an automatic metric can ease the burden of laborious human evaluation of consistency.
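The reported correlation is a plain Pearson r between per-bot averages, computable as follows (the per-bot numbers below are hypothetical, not the paper's):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical per-bot (avg detector score, human contradiction rate) pairs
detector_scores = [0.10, 0.22, 0.35, 0.41]
human_rates     = [0.05, 0.12, 0.20, 0.26]
r = pearson_r(detector_scores, human_rates)
```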

Generation Re-ranking
Given a contradiction detector, an obvious question, beyond using it as an automatic metric, is: can it be used to improve the consistency of dialogue generation models? We consider a very simple way to do so with the state-of-the-art generative model, BlenderBot (BST 2.7B) (Roller et al., 2020). During the decoding phase, for decoding methods that can output multiple hypotheses, we simply re-rank the top-scoring hypotheses using the contradiction detection classifier. We use our best performing classifier, the utterance-based RoBERTa model with DECODE fine-tuning, and consider three methods of decoding: beam search, top-k sampling (Fan et al., 2018), and sample-and-rank (Adiwardana et al., 2020), comparing the standard and DECODE-reranked decoding methods to each other. For beam search we use the best parameters found by Roller et al. (2020): beam size 10, minimum beam length 20, and beam blocking of 3-grams. For top-k we use k = 40. For sample-and-rank we use k = 40 and 20 samples. We consider the same human-bot dialogue logs as before, but only between BlenderBot BST 2.7B and humans, equally sampled between contradicting and non-contradicting utterances. Table 6 presents the results.
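The re-ranking step itself can be sketched in a few lines: walk the generator's own ranking and return the first hypothesis the detector does not flag, falling back to the top hypothesis if every candidate is flagged (the fallback policy here is our assumption, not necessarily the paper's):

```python
def rerank(hypotheses, contradiction_score, tau=0.5):
    """hypotheses: candidate responses ordered best-first by the generator.
    contradiction_score: detector returning P(contradiction) for a candidate."""
    for hyp in hypotheses:
        if contradiction_score(hyp) <= tau:
            return hyp
    return hypotheses[0]  # every candidate flagged: keep the generator's choice

# toy detector scores for three beam hypotheses, best-first
scores = {"I have two dogs.": 0.9, "I don't own pets.": 0.2, "Maybe someday!": 0.1}
best = rerank(list(scores), scores.get)
```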
Automatic metric using DECODE. We use our same DECODE contradiction classifier as the automatic metric, as in Sec. 5.2. We observe that by re-ranking the beam of beam search (size 10) we can modestly improve the metric, but 22.7% of the time the detector still flags generations as contradictions. Upon inspection of the outputs, this appears to be because the beam in beam decoding tends not to be diverse enough (Vijayakumar et al., 2016): when the top-scoring utterance is flagged as contradicting, many of the other utterances in the beam are similar responses with slight rephrasings and are flagged as contradicting as well. Top-k sampling fares much better, where re-ranking in our test can very often find at least one of the k = 40 samples that does not trigger the classifier, leaving only a 1.1% contradiction firing rate.
Human Judgments. The last column of Table 6 presents human judgments of the various model generations, judged using the same approach as before with three human verifiers, reporting the percentage of contradictions. We observe results similar to the automatic metric findings: DECODE re-ranking reduces the number of contradictions for both types of generation methods that we attempted to re-rank.

Conclusion
We introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues. Training models on DECODE achieves better performance than training on other existing NLI data by a large margin. We further propose a structured utterance-based approach, in which utterances are paired before being fed into Transformer NLI models, to tackle the dialogue contradiction detection task. We show the superiority of this approach when transferring to out-of-distribution dialogues, compared to a standard unstructured approach representative of mainstream NLU modeling. This is a valuable property, since intermediate in-domain data are often scarce when integrating NLU modules into NLG systems. We further show that our best contradiction detector correlates with human judgments, and provide evidence for its usage in both automatically checking and improving the consistency of state-of-the-art generative chatbots. While this paper studies the contradiction detection problem in depth, we believe we have only scratched the surface of the non-contradiction generation problem, while obtaining promising first results in that setting. Future work should address this further by studying and analysing the results of these techniques more deeply, as well as considering methods other than simply rescoring during decoding. Going forward, we envision complementary progress on both the modeling of NLU and NLG and the integration of the two. We hope our work can facilitate and provide guidelines for future work on incorporating NLU modeling into dialogue systems.

Figure 1: Two dialogue examples demonstrating a state-of-the-art chatbot (B) (Roller et al., 2020) contradicting itself when talking to a human (A).

Figure 2: The collection interface. The task preview box (top right) gives a short description of the task before the annotator begins writing. The collection consists of two steps. In Step 1 (on the left), the annotators are asked to write one or two utterances such that the last utterance will contradict some previous utterances in the conversation. In Step 2 (on the right), the annotators are asked to pick the utterances in the conversation that are involved in the contradiction. We use the casual term "message" instead of "utterance" in the instructions.

Figure 3: Comparison between utterance-based and unstructured approaches of RoBERTa pre-trained, DECODE fine-tuned models on the DECODE Main (Test), Human-Bot, and auxiliary test sets.

Figure 5: Comparison between the average contradiction score by the detector (y-axis) and the human-identified contradiction rate (x-axis) on the utterances by different bots, averaged by type of bot. Each point in the plot is a bot which has conversed with humans and produced at least 180 utterances (with some identified as contradictions) in our interactive setting. The regression line shown yields a Pearson correlation coefficient of 0.81.

Table 4: Test performance of different models and approaches. "All" in the "Training Data" column stands for the combination of SNLI, MNLI, DNLI, ANLI-R3, and DECODE. "All - DNLI" denotes all the datasets with DNLI removed. "SE" stands for supporting evidence. The "Main (Test-Strict)" column indicates performance where both the 2-way contradiction detection and the supporting evidence retrieval exactly match the ground truth.

Table 5: RoBERTa performance on all the bot-generated utterances from the raw interactive human-bot dialogue. The threshold τ for prediction is 0.5.