Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

Consistency identification has achieved remarkable success in open-domain dialogue, where it can be used to prevent inconsistent response generation. However, in contrast to the rapid development in open-domain dialogue, few efforts have been made in the task-oriented dialogue direction. In this paper, we argue that the consistency problem is more urgent in the task-oriented domain. To facilitate the research, we introduce CI-ToD, a novel dataset for Consistency Identification in Task-oriented Dialog system. In addition, we not only annotate a single label that enables the model to judge whether the system response is contradictory, but also provide more fine-grained labels (i.e., Dialogue History Inconsistency, User Query Inconsistency and Knowledge Base Inconsistency) to encourage the model to identify which source causes the inconsistency. Empirical results show that state-of-the-art methods only achieve 51.3% overall accuracy, which is far behind the human performance of 93.2%, indicating that there is ample room for improving consistency identification ability. Finally, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide guidance for future directions. All datasets and models are publicly available at https://github.com/yizhen20133868/CI-ToD.


Introduction
Task-oriented dialogue systems (ToDs) (Young et al., 2013), which aim to achieve user goals such as hotel booking and restaurant reservation, have gained increasing attention recently in both academia and industry. Over the last few years, two promising research directions in ToDs have emerged. The first focuses on a pipeline approach, which consists of modularly connected components (Wu et al., 2019a; Takanobu et al., 2020; Peng et al., 2020). The second direction employs an end-to-end model, which directly uses a sequence-to-sequence (Seq2Seq) model to generate a response from a dialogue history and a corresponding knowledge base (KB) (Eric et al., 2017; Madotto et al., 2018; Wen et al., 2018; Qin et al., 2019b; Wu et al., 2019b; Qin et al., 2020b).
In recent years, with the burst of deep neural networks and the evolution of pre-trained language models, research on ToDs has achieved great success. While this success is indisputable, previous work has shown that neural models inevitably generate inconsistent responses, resulting in contradictions (Welleck et al., 2019; Nie et al., 2021). Such contradictions are often jarring and immediately disrupt the conversational flow. To address this issue, some work tries to improve consistency in dialogue by introducing a consistency identification task. Welleck et al. (2019) made an early step towards performing consistency identification in dialogue agents. Nie et al. (2021) proposed the dialogue contradiction detection task to prevent the system response from being inconsistent with the dialogue history. Later work on the KvPI dataset further proposed profile consistency identification, which considers whether a response is consistent with the corresponding profile.
Though achieving promising performance, the above work is limited to open-domain dialogue. In this paper, we highlight that inconsistent generation problems should also be considered in task-oriented dialogue.
Table 1: Comparison between CI-ToD and existing dialogue consistency identification datasets.
Dataset | Open Domain / Task-Oriented Dialogue System | External Knowledge | Multi-turn / Single-turn | Single Label / Fine-grained Labels
Dialogue NLI (Welleck et al., 2019) | Open domain |  | Single-Turn | Single Label
InferConvAI (Dziri et al., 2019) | Open domain |  | Multi-Turn | Single Label
KvPI | Open domain |  | Single-Turn | Single Label
DECODE (Nie et al., 2021) | Open domain |  | Multi-Turn | Single Label
CI-ToD | Task-Oriented |  | Multi-Turn | Fine-grained Labels (HI, QI and KBI)
For example, as shown in Figure 1, the system talks about the POI whole foods in the dialogue history. However, when we run the state-of-the-art model DF-Net (Qin et al., 2020b), the system generates the response "mandarin roots is located at 271 springer street.", which incorrectly introduces the irrelevant POI mandarin roots, resulting in a contradiction. This is because neural models are black boxes, which makes it hard to explicitly control neural dialogue systems so that they maintain consistent response generation. From the user's perspective, such inconsistent bots not only fail to meet the user's requirements but also mislead users with wrong feedback in the task-oriented domain. Therefore, it is promising to consider the consistency problem and detect in advance whether a generated response is consistent in task-oriented dialogue. Unfortunately, there has been relatively little research on consistency identification in task-oriented dialogue due to the lack of public benchmarks.
To fill this research gap, we introduce a novel human-annotated dataset, CI-ToD: Consistency Identification in Task-oriented Dialog system. Dialogue data for CI-ToD is collected from the public dialogue corpus KVRET (Eric et al., 2017). For each final system response in KVRET, we rewrite the utterance through crowdsourcing, deliberately contradicting the dialogue history, the user query or the corresponding knowledge base (KB). As shown in Table 1, compared to existing consistency identification datasets for dialogue, CI-ToD has the following characteristics: (1) Task-oriented Dialogue Domain. To the best of our knowledge, we are the first to consider dialogue consistency in task-oriented dialogue systems, while prior work mainly focuses on open-domain dialogue systems. We hope CI-ToD can fill the gap of consistency identification in the task-oriented dialogue domain; (2) Fine-grained Annotations. We provide not only a single annotation of whether each response is consistent, but also more fine-grained annotations, which help the model analyze which source causes the inconsistency.
To establish baseline performance on CI-ToD, we evaluate state-of-the-art pre-trained and non pre-trained models for consistency identification. Experimental results demonstrate a significant gap between machine and human performance, indicating that there is ample room for improving consistency identification ability. In addition, we show that our best consistency identification detector correlates well with human judgements, demonstrating that it is suitable for use as an automatic metric for checking task-oriented dialogue consistency. Finally, we perform exhaustive experiments and qualitative analysis to shed light on the challenges that current approaches face on CI-ToD.
In summary, our contributions are three-fold:
• We make the first attempt to consider consistency identification in task-oriented dialogue and introduce a novel human-annotated dataset, CI-ToD, to facilitate the research.
• We establish various baselines for future work and show that a well-trained consistency identification model can serve as an automatic metric for checking dialogue consistency.
• We conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide guidance for future CI-ToD work.

Problem Formulation
In this paper, consistency identification in task-oriented dialogue is formulated as a supervised multi-label classification task, which aims to judge whether the generated system response is inconsistent. To equip the model with the ability to analyze which source causes the inconsistency, we require the model to provide not only the final prediction but also the fine-grained sources, including the dialogue history, the knowledge base (KB) and the user's query. More specifically, given a task-oriented dialogue between a user (u) and a system (s), the n-turn dialogue snippet consists of the dialogue history $H = \{(u_1, s_1), (u_2, s_2), \ldots, (u_{n-1}, s_{n-1})\}$, the corresponding knowledge base $KB$, the user query $u_n$ and the system response $s_n$. The task can be defined as:
$$y = f(H, KB, u_n, s_n) \tag{1}$$
where $f$ denotes the trainable model and $y$ is a three-dimensional output vector indicating whether the last utterance $s_n$ contradicts the dialogue history $H$, the user query $u_n$ or the corresponding knowledge base $KB$.
[Figure 2: Construction of CI-ToD on a KVRET dialogue in four steps: (a) Data Pre-Processing, (b) KBI Construction, (c) QI and HI Construction, (d) Human Annotation.]
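To make the formulation above concrete, the following is a minimal sketch of how one instance and its label vector could be held in code. The class name, the field names and the (QI, HI, KBI) ordering of the label vector are assumptions made for illustration and do not reflect the released data format; the KB row is likewise illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ConsistencySample:
    """One CI-ToD-style instance (hypothetical container, not the released format)."""
    history: List[Tuple[str, str]]  # H: earlier (user, system) turns
    kb: List[Dict[str, str]]        # KB rows, each mapping column name -> cell value
    user_query: str                 # u_n
    system_response: str            # s_n
    labels: Tuple[int, int, int]    # y, assumed order (QI, HI, KBI); 1 = inconsistent


def is_inconsistent(sample: ConsistencySample) -> bool:
    """Assumed single-label view: inconsistent if any fine-grained label is set."""
    return any(sample.labels)


sample = ConsistencySample(
    history=[("where is the nearest gas station?", "there is a 76 4 miles away.")],
    kb=[{"poi": "76", "poi_type": "gas station", "address": "611 ames ave"}],
    user_query="what is the address?",
    system_response="the 76 grocery store is located at 611 ames ave.",
    labels=(0, 0, 1),  # only the KB is contradicted in this illustrative case
)
print(is_inconsistent(sample))  # True
```

A model f in this setting would map the first four fields to a prediction of the fifth.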

Dataset
We construct the CI-ToD dataset based on the KVRET dataset in four steps: (a) Data Pre-Processing, (b) KBI Construction, (c) QI and HI Construction and (d) Human Annotation, as illustrated in Figure 2. In the following, we first give the definitions of QI, HI and KBI, then describe the four construction steps in detail.

Inconsistency Types
As shown in Figure 3, we give an example to illustrate the different inconsistency types, which are described as follows:
User Query Inconsistency (QI). QI denotes that the dialogue system response is inconsistent with the current user query. Take the dialogue in Figure 3 as an example: in the last turn, the user asks about valero, while the final system response does not satisfy the user's requirement and instead shows a route to willows_market, which causes the user query inconsistency.
Dialogue History Inconsistency (HI). HI denotes that the dialogue system response is inconsistent with the dialogue history, excluding the current user query. In the dialogue in Figure 3, the previous system response talks about valero and the user does not change the topic of the dialogue. However, the final system response turns to discussing willows_market, causing the dialogue history inconsistency.
Knowledge Base Inconsistency (KBI). KBI denotes that the dialogue system response is inconsistent with the corresponding KB, which is a unique challenge in the task-oriented dialogue domain. In the dialogue in Figure 3, the final system response states that the traffic_info of willows_market is heavy_traffic, which conflicts with the corresponding KB (no_traffic for willows_market).

Step 2 KBI Annotation
Given the pre-processed dialogues, we first construct KBI for each dialogue. KBI denotes that the final system response is inconsistent with the corresponding KB. We construct KBI automatically by simply replacing knowledge entity values.
More specifically, for each knowledge value in the system response, we sample an entity from the whole KB to replace the selected slot and ensure that the sampled KB entity is different from the selected value. In this way, the constructed response is inconsistent with the corresponding KB. For example, as shown in Figure 2(b), we replace the entity "gas station" with "grocery store", which results in KBI (the corresponding KB lists gas station as the poi_type).
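The replacement procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the construction script used for the dataset: the function name `make_kbi_response` is hypothetical, KB values are located by plain substring matching, and the replacement is drawn from the same KB column as the original value (which matches the "gas station" to "grocery store" example but is otherwise an assumption).

```python
import random
from typing import Dict, List, Optional, Set


def make_kbi_response(response: str, kb: List[Dict[str, str]],
                      rng: random.Random) -> Optional[str]:
    """Swap one KB value mentioned in the response for a different KB value."""
    # Gather the distinct values of every KB column.
    columns: Dict[str, Set[str]] = {}
    for row in kb:
        for col, val in row.items():
            columns.setdefault(col, set()).add(val)

    # KB values that literally occur in the response (simplifying assumption).
    mentioned = [(col, val) for col in sorted(columns)
                 for val in sorted(columns[col]) if val in response]
    rng.shuffle(mentioned)

    for col, val in mentioned:
        # Candidates must differ from the selected value.
        candidates = sorted(columns[col] - {val})
        if candidates:
            return response.replace(val, rng.choice(candidates))
    return None  # nothing in the response could be replaced


kb = [
    {"poi": "76", "poi_type": "gas station"},
    {"poi": "whole foods", "poi_type": "grocery store"},
]
original = "the 76 gas station is located at 611 ames ave."
print(make_kbi_response(original, kb, random.Random(0)))
```

A replaced value can still happen to agree with another KB row, which is one reason a later human re-check of the labels remains useful.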

Step 3 QI and HI Annotation
In this section, we describe how we generate QI and HI. Since this requires a deep understanding of the corresponding user query and dialogue history, constructing a system response with QI or HI is non-trivial. We therefore rely on human effort. We hire a human annotation team to (1) randomly assign QI or HI to each sample and rewrite the response to make it inconsistent with the user query or the dialogue history, and (2) have three extra annotators check whether each rewritten response is fluent.

Step 4 Human Re-Check
In the final step, we re-check the fine-grained inconsistency labels, including QI, HI and KBI, with human effort. To ensure quality, each sample is annotated by three people; the annotation process lasted nearly three months. Figure 4 shows the annotation user interface.
The detailed statistics of CI-ToD are summarized in Table 2. The percentage of inconsistent samples exceeds 50%, indicating that CI-ToD is challenging.

Quality Control
To control the quality of the annotated dataset, we introduce two verification methods: (1) Onboarding Test: each annotator first takes an annotation test, in which the annotator labels 100 samples and 3 experts check the results. Only those who achieve 80% annotation performance can take part in the subsequent annotation work; (2) Double Check: we randomly sampled 1,000 samples from the final annotated dataset and asked two new annotators to annotate the inconsistency information. Following Bowman et al. (2015), we calculated Fleiss' Kappa among the previous labels and the two new labels and obtained a kappa of 0.812, which indicates almost perfect agreement (Landis and Koch, 1977).
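For readers unfamiliar with the agreement statistic used in the double check, the sketch below computes Fleiss' Kappa from a table of per-item category counts. It assumes every item receives the same number of ratings (three here: the previous label plus two new ones) and a binary category set; the counts in the usage example are made up for illustration and are not taken from CI-ToD.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' Kappa; counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumed identical for every item
    n_cats = len(counts[0])

    # Observed agreement: mean over items of the pairwise agreement rate.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)

    # Undefined (division by zero) if all ratings fall into a single category.
    return (p_bar - p_e) / (1.0 - p_e)


# Illustrative counts: columns are (consistent, inconsistent), three raters per item.
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(toy_counts), 3))  # 0.333
```

Under the Landis and Koch (1977) scale, values above 0.81 fall in the "almost perfect" band, which is how the reported 0.812 is read.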

Models
In this section, we establish several strong baseline methods using state-of-the-art non pre-trained models (§4.1) and pre-trained models (§4.2). Since multi-task frameworks have achieved remarkable success on various NLP tasks (Fan et al., 2021; Qin et al., 2019a, 2020a, 2021), we adopt a vanilla multi-task framework to simultaneously perform QI, HI and KBI prediction, which has the advantage of extracting the shared knowledge across the three tasks. For both pre-trained and non pre-trained models, we introduce the delimiter tokens [SOK], [USR] and [SYS] to signal the beginning of the KB, the user utterance and the system response, respectively, aiming to learn to distinguish the roles of the KB, the user and the system in multi-turn dialogues. Specifically, the input of the KB is denoted as $\hat{KB}$ = "[SOK] KB [EOK]", while the input of H is defined as $\hat{H}$ = "[USR] $u_1$ [SYS] $s_1$ ... [USR] $u_n$".

Non Pre-trained Models
In this approach, we simply concatenate all the previous utterances in the dialogue history and the corresponding KB to form a single textual context, as shown in Figure 5. For the KB representation, we format each knowledge entity into "column name, cell value" pairs instead of "subject, relation, object" triples to save input length. KB representation for ToDs is an important issue in its own right, which we discuss in the Challenges section. Then, we apply $f_{non}$, the non pre-trained model, to obtain the final prediction, which is defined as:
$$y = f_{non}(\hat{KB}, \hat{H}, s_n) \tag{2}$$
In our work, we explore state-of-the-art non pre-trained models including ESIM (Chen et al., 2017), InferSent (Conneau et al., 2017) and RE2 (Yang et al., 2019).
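The input construction described above (delimiter tokens plus a flattened KB) can be sketched as below. The helper names, the " ; " separator between KB pairs and the way the final response is appended after a [SYS] token are assumptions for illustration; only the delimiter tokens and the "column name, cell value" pair format come from the text.

```python
from typing import Dict, List, Tuple


def flatten_kb(kb: List[Dict[str, str]]) -> str:
    """Flatten KB rows into 'column name, cell value' pairs wrapped in [SOK] ... [EOK]."""
    pairs = [f"{col}, {val}" for row in kb for col, val in row.items()]
    return "[SOK] " + " ; ".join(pairs) + " [EOK]"  # ' ; ' separator is an assumption


def encode_history(history: List[Tuple[str, str]], user_query: str) -> str:
    """Build '[USR] u_1 [SYS] s_1 ... [USR] u_n' from earlier turns and the query."""
    parts = []
    for user_utt, sys_utt in history:
        parts.append(f"[USR] {user_utt} [SYS] {sys_utt}")
    parts.append(f"[USR] {user_query}")
    return " ".join(parts)


def build_context(kb, history, user_query, response) -> str:
    """Concatenate KB, history and the response to be judged into one text."""
    return f"{flatten_kb(kb)} {encode_history(history, user_query)} [SYS] {response}"


kb = [{"poi": "76", "poi_type": "gas station"}]
history = [("where is the nearest gas station?", "there is a 76 4 miles away.")]
print(build_context(kb, history, "what is the address?",
                    "the 76 is located at 611 ames ave."))
```

With a large KB the flattened string grows quickly, which is why the maximum input length matters for these models.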

Pre-trained Models
We investigate several state-of-the-art BERT-based and BART models, which are illustrated in Figure 5. Given a dialogue $\{(u_1, s_1), \ldots, (u_n, s_n), KB\}$, for BERT-based models the input can be denoted as a single sequence delimited by [CLS] and [SEP], the special symbols for the classification token and the separator token. After encoding with the pre-trained model, the last layer's hidden representation $h_{CLS}$ of the [CLS] token is used for classification, which can be defined as:
$$y = \mathrm{Sigmoid}(W h_{CLS} + b) \tag{3}$$
where $W$ and $b$ are the trainable parameters.
For BART, we feed the same sequence to both the encoder and the decoder, using the last hidden state for classification. The class that corresponds to the highest probability is chosen as the model prediction, which is illustrated in Figure 5.

Training Objective
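The training objective is the binary cross-entropy loss, which is defined as:
$$\mathcal{L} = -\sum_{i=1}^{3} \left[ \hat{y}_i \log y_i + (1 - \hat{y}_i) \log (1 - y_i) \right] \tag{4}$$
where $y_i$ is the predicted score between 0 and 1, while $\hat{y}_i$ is the gold label for the $i$-th inconsistency type.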
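As a worked illustration of the classification layer and the objective, the sketch below scores the three inconsistency types from a [CLS]-style vector and computes the binary cross-entropy against gold labels. It assumes a sigmoid output per type and a sum over the three types; the toy weights are placeholders, and a real model would compute this inside a deep-learning framework.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_scores(h_cls: Sequence[float], W: Sequence[Sequence[float]],
                   b: Sequence[float]) -> List[float]:
    """One score in (0, 1) per inconsistency type: sigmoid(W h_cls + b)."""
    return [sigmoid(sum(w * h for w, h in zip(row, h_cls)) + bias)
            for row, bias in zip(W, b)]


def bce_loss(scores: Sequence[float], gold: Sequence[int],
             eps: float = 1e-12) -> float:
    """Binary cross-entropy summed over the types (summing is an assumption)."""
    total = 0.0
    for y, y_hat in zip(scores, gold):
        y = min(max(y, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y_hat * math.log(y) + (1 - y_hat) * math.log(1.0 - y)
    return total


# Toy parameters: zero input gives a score of 0.5 for every type.
scores = predict_scores([0.0, 0.0], [[1.0, -1.0]] * 3, [0.0, 0.0, 0.0])
print(scores)                       # [0.5, 0.5, 0.5]
print(bce_loss(scores, [1, 0, 1]))  # equals 3 * ln(2)
```

Thresholding each score at 0.5 would turn the scores into the three binary predictions.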

Implementation Details
For pre-trained models, the batch size is selected from {4, 8} and the learning rate is selected from 5e-6 to 2e-5 with a step of 1e-6. We set the max length to 512 tokens for all models except Longformer, for which we use a max length of 3,000 tokens. For the non pre-trained models, we adopt the hyperparameters suggested in their open-sourced code. All experiments are conducted on TITAN Xp and Tesla V100 GPUs. For all experiments, we select the model that performs best on the development set and evaluate it on the test set.
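The search space described above is small enough to enumerate directly. The sketch below assumes that both endpoints of the learning-rate range are included and that every batch size is paired with every learning rate; whether the full grid was actually searched is not stated in the text.

```python
from itertools import product

batch_sizes = [4, 8]
# 5e-6, 6e-6, ..., 2e-5; built from integers to avoid accumulating float error.
learning_rates = [k * 1e-6 for k in range(5, 21)]

grid = list(product(batch_sizes, learning_rates))
print(len(learning_rates), len(grid))  # 16 32
```

Each configuration would then be trained and compared on the development set.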

Evaluation
We adopt overall accuracy (Overall Acc) to evaluate model performance, measuring the ratio of samples for which QI, HI and KBI are all predicted correctly. Furthermore, to give a more detailed analysis, we also calculate the F1 score on the QI, HI and KBI labels.
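The two metrics can be sketched as follows. Overall Acc counts a sample as correct only when all three labels match; the per-label F1 here is computed on the positive (inconsistent) class, which is an assumption since the averaging scheme is not specified. The predictions and gold labels in the example are made up, and the (QI, HI, KBI) ordering is likewise assumed.

```python
from typing import List, Tuple

Labels = Tuple[int, int, int]  # assumed order: (QI, HI, KBI)


def overall_accuracy(preds: List[Labels], golds: List[Labels]) -> float:
    """Fraction of samples whose three labels are all predicted correctly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def f1_for_label(preds: List[Labels], golds: List[Labels], idx: int) -> float:
    """F1 of the positive class for one label index."""
    tp = sum(p[idx] == 1 and g[idx] == 1 for p, g in zip(preds, golds))
    fp = sum(p[idx] == 1 and g[idx] == 0 for p, g in zip(preds, golds))
    fn = sum(p[idx] == 0 and g[idx] == 1 for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


golds = [(1, 0, 1), (0, 0, 0), (1, 1, 0), (0, 1, 1)]
preds = [(1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 1, 1)]
print(overall_accuracy(preds, golds))  # 0.5
print([round(f1_for_label(preds, golds, i), 3) for i in range(3)])  # [1.0, 0.667, 0.8]
```

Because a single wrong label makes the whole sample count as wrong, Overall Acc is stricter than any of the three F1 scores.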

Human Performance
To measure human performance on the CI-ToD dataset, we ask three experts to judge each sample from the dataset. We consider a sample correctly predicted by humans only if the judgements of the three experts are consistent. The human performance is shown in the last row of Table 3. Table 3 also shows the results of the models discussed in the previous section.

Main Results
From the results, we have two interesting observations: (1) The human performance is 93.2%. In contrast, all of the non pre-trained and pre-trained models perform significantly worse than humans, demonstrating that there is ample room for improving consistency identification ability in task-oriented dialogue; (2) Pre-trained models outperform all non pre-trained models on CI-ToD, which is consistent with results in other literature (Talmor et al., 2019). We think that knowledge learned from pre-training can be beneficial to consistency identification.

Performance Across Different Consistency Types
We compare human performance and model performance across different consistency types. The results are shown in Table 3. We observe that humans are good at judging all consistency types, indicating that it is easy for humans to detect whether a dialogue is consistent because humans have strong reasoning ability. In contrast, we find that the best pre-trained model (BART) obtains worse results on the HI type than on the other types (QI and KBI). This is because correctly detecting HI relies on dialogue context information, which faces the challenge of coreference resolution. We discuss this in detail in Section 5.7.

Context Ablation Study
In this section, we analyze the impact of context on the final performance. More specifically, we conduct experiments by removing the corresponding dialogue contextual information and only keeping the final user query. Figure 6 shows the results of BART without contextual information. We observe that the performance drops on all consistency types. This is because the dialogue context helps the model understand the overall dialogue topic, which is useful for consistency detection.

Multi-Task Training vs. Separate Task Training
In this section, we explore the effectiveness of the multi-task framework. In particular, we conduct a separate training setting where we use BART to perform each task prediction (QI, HI and KBI) separately. The comparison results are shown in Figure 7. We observe that the model with multi-task training outperforms the separate task training paradigm on all metrics, which indicates that the QI, HI and KBI tasks are correlated, and thus modeling the correlation across tasks can improve performance.

Using CI-ToD as an Automatic Metric
In this section, we further investigate whether the model trained on CI-ToD can judge the quality of the utterances generated by different task-oriented dialogue models and be used as an automatic metric for checking generation consistency. We compare the overall accuracy of the best well-trained model, BART, with the contradiction rate from human judgements on the utterances generated by different models. In particular, we explore state-of-the-art end-to-end task-oriented dialogue models (Mem2seq (Madotto et al., 2018), GLMP (Wu et al., 2019b), DF-Net (Qin et al., 2020b) and DDMN). The results are shown in Figure 8. We can see that the scores are positively correlated with human judgements, with a Pearson correlation coefficient of 0.9. This demonstrates that the proposed consistency identification model can be used as an automatic metric to evaluate consistency in task-oriented dialogues.
Table 3: Comparison of varying approaches on CI-ToD dataset.
Model | QI F1 | HI F1 | KBI F1 | Overall Acc
(Liu et al., 2019) | 0.715 | 0.472 | 0.715 | 0.500
XLNet | 0.725 | 0.487 | 0.736 | 0.509
Longformer (Beltagy et al., 2020) | 0.717 | 0.500 | 0.710 | 0.497
BART (Lewis et al., 2020) | 0.744 | 0.510 | 0.761 | 0.513
Human Performance | 0.962 | 0.805 | 0.920 | 0.932
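The correlation analysis above relies on the Pearson coefficient between the detector's scores and the human judgements, one pair per dialogue model. A minimal sketch of that computation is given below; the two score lists are invented placeholders for illustration and are not the values behind Figure 8.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; undefined if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Placeholder scores for four hypothetical dialogue models (not from the paper).
detector_scores = [0.20, 0.40, 0.50, 0.70]
human_scores = [0.25, 0.35, 0.55, 0.65]
print(pearson(detector_scores, human_scores))
```

With only four systems the coefficient is sensitive to any single point, so it is best read as a rough indication of agreement.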

Error Analysis
In this section, we empirically divide the error samples produced by BART into three categories, which are shown in Figure 9.
Long KB. When the KB is relatively large, it contains a lot of redundant information that is irrelevant to the current conversation. This redundant information becomes noise during model learning, and simply flattening the KB into a sequence cannot effectively model the relevant information. For example, as shown in Figure 9, when the KB is large (49 rows), BART incorrectly predicts KBI as 1.
Long Dialog History. When the dialogue history is too long, it may contain noisy information. As shown in Figure 9, the dialogue has three turns: the system says "palo_alto_medical_foundation is located 2_miles away" in the first turn but states "palo_alto_medical_foundation is located 3_miles away" in the last turn, causing HI that is hard to detect because of the irrelevant middle context.
Coreference Resolution. When there are implicit or explicit references in the dialogue, it is necessary to resolve them to recover the intention of the conversation, which greatly increases the difficulty of predicting the types of inconsistencies. For example, in Figure 9, the user's query in the last turn, "can i have the address", does not explicitly indicate a specific object, which confuses the model and leads it to incorrectly predict QI and HI as 0. By resolving the implicit reference according to the dialogue history, we know that the user's current question refers to "tai_pan restaurant", which would help the model obtain the correct result.

Challenges
Based on the above analysis, we summarize the current challenges faced by the consistency detection task:
KB Representation. The corresponding knowledge base is a relational database, which carries the high-order structure information present in the original knowledge graph. How to model the structure information in the relational knowledge base rather than simply flattening the KB is an interesting research question. In addition, since the KB is relatively large, how to effectively model the relevant KB information without injecting noise is another challenge to explore.
Effective Context Modeling. Since some dialogues have extremely long histories, not all context information has a positive influence on the final performance. How to effectively model the long-distance dialogue history and filter irrelevant information is an interesting research topic.
Coreference Resolution. A dialogue may contain multiple coreferences, which results in ambiguity in the user's query and makes it difficult for the model to predict the consistency label correctly. Thus, how to explicitly conduct coreference resolution to help consistency detection is an important research question.
Explicit Joint Learning. Though the multi-task training paradigm achieves promising performance, prior work does not "explicitly" model the relationships between the different tasks (HI, QI and KBI); instead, it adopts shared parameters to "implicitly" model the correlation. However, simply relying on a set of shared parameters cannot make a full interaction to achieve desirable results (Qin et al., 2019a, 2020a). Thus, how to explicitly model the correlation between HI, QI and KBI to directly control the information flow still deserves to be explored.

Related Work
This work is related to research on consistency in open-domain dialogue. In recent years, several personalized dialogue datasets have been introduced, such as PersonaChat (Zhang et al., 2018) and PersonalDialog (Zheng et al., 2020). These datasets implicitly consider consistency in dialogue generation, but fail to explicitly teach the model to judge whether the generated system response is consistent. Another line of related work explicitly improves consistency in dialogue. To this end, some benchmarks have been proposed to promote the research. Welleck et al. (2019) made an early step by reducing dialogue consistency identification to natural language inference (NLI). Dziri et al. (2019) presented a novel paradigm for evaluating the coherence of dialogue systems using state-of-the-art entailment techniques and built a synthesized dataset, InferConvAI, geared toward evaluating consistency in dialogue systems. Nie et al. (2021) introduced the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing contradictory dialogues, aiming to evaluate the ability to detect contradictions. The KvPI dataset and the profile consistency identification task were proposed for open-domain dialogue agents to further evaluate whether the system response is inconsistent with the corresponding profile information. Compared with this work, which mainly focuses on open-domain dialogue, we aim to fill the gap of consistency identification in task-oriented dialogue systems. Furthermore, we introduce a human-annotated dataset to this end. Besides, we provide some key challenges and future directions to facilitate further research.

Conclusion
We studied consistency identification in task-oriented dialogue and introduced a new human-annotated dataset, CI-ToD. Further, we analyzed the problems of CI-ToD through extensive experiments and highlighted the key challenges of the task. We hope CI-ToD can facilitate future research on consistency identification in task-oriented dialogue.