CDConv: A Benchmark for Contradiction Detection in Chinese Conversations

Dialogue contradiction is a critical issue in open-domain dialogue systems. The contextual nature of conversations makes dialogue contradiction detection particularly challenging. In this work, we propose a benchmark for Contradiction Detection in Chinese Conversations, namely CDConv. It contains 12K multi-turn conversations annotated with three typical contradiction categories: Intra-sentence Contradiction, Role Confusion, and History Contradiction. To construct the CDConv conversations efficiently, we devise a series of methods for automatic conversation generation, which simulate common user behaviors that trigger chatbots into making contradictions. We conduct careful manual quality screening of the constructed conversations and show that state-of-the-art Chinese chatbots can be easily goaded into making contradictions. Experiments on CDConv show that properly modeling contextual information is critical for dialogue contradiction detection, but there are still unresolved challenges that require future research.


Introduction
Large-scale pre-training for dialogue generation (Zhang et al., 2020; Freitas et al., 2020) has advanced the development of engaging and human-like dialogue systems. Unfortunately, state-of-the-art open-domain chatbots, such as BlenderBot (Roller et al., 2021), EVA (Zhou et al., 2021; Gu et al., 2022) and PLATO (Bao et al., 2021b), still often behave inconsistently with their role or identity and produce utterances that are self-contradictory.
Dialogue contradiction detection has been shown to be an effective means of improving the consistency of chatbots (Welleck et al., 2019; Nie et al., 2021), but it remains a challenging task. Specifically, the contextual nature of conversations requires considering and modeling contextual information. For instance, in the "Contradiction" example in Figure 1, b 2 does not explicitly contradict b 1 . However, given u 1 , the actual meaning of b 1 is "I like dogs, cats", so b 1 and b 2 are contradictory. In contrast, in the "Non-contradiction" example, while b 1 and b 2 seem inconsistent ("love" vs. "dislike"), b 2 actually means "I dislike noodles" considering the dialogue context. Hence, b 2 is compatible with b 1 and does not make a contradiction.
Despite the above challenge, existing datasets for contradiction detection (Dziri et al., 2019) largely follow the sentence-pair formulation of natural language inference (Dagan et al., 2005), which is insufficient for dialogue contradiction detection due to the neglect of contextual information. A recent work (Nie et al., 2021) crowd-sourced a dataset named DECODE that contains conversations whose last utterances contradict the dialogue histories. However, DECODE lacks wide coverage of typical contradiction categories, and most of its contradiction cases are written by humans, which creates a gap from the real scenario where users trigger chatbots into making contradictions.
In this work, we propose a benchmark for Contradiction Detection in Chinese Conversations, namely CDCONV. It contains 12K multi-turn conversations with human-annotated contradiction labels (§3). Different from previous work (e.g., Nie et al. 2021) that only considered contradiction to the dialogue history (i.e., History Contradiction), CDCONV covers two further typical categories: Intra-sentence Contradiction and Role Confusion, which refer to a reply that contradicts itself and a reply that confuses the speaker's role, respectively.
Since the cases of non-contradiction and contradiction in natural human-bot conversations are extremely unbalanced (§3, Nie et al. 2021), we automatically construct the CDCONV conversations combined with elaborate manual inspection (§4.1). Specifically, we first devise a series of automatic methods to generate conversations (§4.2), which simulate the common user behaviors that trigger chatbots into making contradictions. We then conduct careful human screening and annotation of the constructed conversations to ensure data quality (§4.3). We validate the effectiveness of the trigger methods and show that state-of-the-art Chinese open-domain chatbots (EVA and PLATO) can be easily goaded into making contradictions (§4.4).
Finally, we evaluate popular Chinese pre-trained models on CDCONV (§5). Results show that properly modeling contextual information is critical for dialogue contradiction detection. However, there is still much room for future research in dialogue modeling, integrating commonsense and world knowledge, and reasoning.
Our contributions are summarized as follows: • We propose CDCONV, a benchmark for contradiction detection in Chinese conversations. It contains 12K conversations annotated with three typical contradiction categories: Intra-sentence Contradiction, Role Confusion, and History Contradiction.
• We present a series of methods that simulate common user behaviors to automatically trigger chatbots into making contradictions. We demonstrate the effectiveness of these trigger methods through detailed human annotation.
• We evaluate popular Chinese pre-trained models on CDCONV. Results show the importance of properly modeling contextual information in dialogue contradiction detection, while this task is still far from solved and requires further study.

Related Work
Table 1 summarizes the comparison of CDCONV with related benchmarks / datasets for (dialogue) contradiction detection.

Contradiction Detection for Conversation
The contradictions in dialogue systems can be split into two major types: Extrinsic and Intrinsic (Dziri et al., 2021; Ji et al., 2022). The Extrinsic type refers to contradiction between a conversation and external information. For instance, the KvPI dataset (Song et al., 2020) focuses on contradiction to structured attribute profiles. The DIALFACT benchmark (Gupta et al., 2022) aims at detecting statements that contradict world facts and improving factual correctness. The CI-ToD dataset (Qin et al., 2021) involves inconsistency with knowledge bases in task-oriented dialogue. One potential limitation of Extrinsic dialogue contradiction detection is that it may rely on static and manually curated external information (e.g., profiles), which could be insufficient in open-domain dialogue.
Our work focuses on the Intrinsic type, which refers to contradiction inside a conversation and is more widespread and fundamental in open-domain dialogue. The DECODE dataset (Nie et al., 2021) is the work most relevant to ours; its contradiction cases are mostly collected by manually writing subsequent utterances that contradict the given dialogue histories. Besides the language difference, CDCONV is distinguished from DECODE in two aspects: (1) Apart from History Contradiction, CDCONV additionally covers two contradiction categories, Intra-sentence Contradiction and Role Confusion, which are also typical and common in human-bot conversations (§3). (2) Instead of being human-written, the contradiction cases in CDCONV are constructed by simulating the user behaviors that trigger chatbots into making contradictions (§4.2), which is closer to the real scenario of human-bot conversation.

Categories of Dialogue Contradiction
A conversation with n turns is formally denoted as u 1 , b 1 , . . . , u n , b n , where u k and b k denote the k-th-turn utterances from the user and the chatbot, respectively. We focus on whether b n makes a contradiction in the dialogue context.
In a preliminary study, we manually inspected 200 multi-turn human-bot conversations with two Chinese open-domain chatbots: EVA (Zhou et al., 2021; Gu et al., 2022) and PLATO (Bao et al., 2021a,b). On average, each conversation contains about 30 turns but only roughly 1 contradiction case. Based on the inspected contradiction cases, we identify three typical categories of dialogue contradiction according to the object that b n contradicts, as intuitively illustrated by Figure 3: • Intra-sentence Contradiction: b n is contradictory to itself. In other words, there exist two disjoint subsentences b n (1) , b n (2) ⊂ b n (usually separated by commas, periods or conjunctions) that are not compatible with each other.
• Role Confusion: b n confuses the speaker's role.
That is, b n is more likely to be a user's reply to b n−1 than a bot's reply to u n .
• History Contradiction: b n is contradictory to the dialogue history. The contradictions caused by mistaking or forgetting the dialogue history (Xu et al., 2022a,b) usually fall into History Contradiction, as in the last example in Figure 2.
Figure 2 provides examples of the above three contradiction categories. They account for 16%, 18%, and 54% of our inspected contradiction cases, respectively. The remaining cases (< 12%) mostly contradict time-sensitive information (e.g., the chat time) or facts (e.g., when the iPhone was released), which, as aforementioned (§2), are beyond the scope of this work. We note that Intra-sentence Contradiction and Role Confusion were less studied previously while actually being typical and common in human-bot conversations. CDCONV can serve as a good starting point for investigating them.
Data Collection

Collection Procedure
We automatically constructed the CDCONV conversations along with elaborate manual inspection. We narrow the conversations in CDCONV down to 2-turn ones (n = 2). The overall procedure is shown in Figure 4:
1. We took a human-written utterance as u 1 and obtained the chatbot's reply b 1 .
2. Using one of the trigger methods in §4.2, we automatically constructed u 2 based on u 1 or b 1 and generated the chatbot's next reply b 2 .
3. Human annotators were asked to inspect (1) whether b 1 , u 2 , b 2 are free of ethical risks (e.g., offensive language, hate speech, unethical suggestions) and are fluent and understandable, and (2) whether b 1 avoids Intra-sentence Contradiction (to ensure a valid dialogue history). Unqualified conversations were removed.
4. Considering the full contextual information, human annotators marked whether b 2 makes a contradiction based on the categories in §3. Specifically, we adopted single-label annotation: following the order in §3, once a contradiction of some category is recognized, the subsequent categories are not judged. Note that cases where b 2 does not answer the questioning u 2 and responds incoherently (e.g., with an unnatural topic transition) were additionally marked and filtered out.
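The two-turn construction loop above can be sketched as follows. Note that `chatbot_reply` and `make_trigger` are hypothetical stand-ins for the real chatbots (EVA/PLATO) and the trigger methods of §4.2, not the paper's actual implementation:

```python
# Sketch of the 2-turn CDConv construction loop (§4.1).
# Both functions below are toy placeholders for illustration only.

def chatbot_reply(context):
    """Placeholder chatbot: returns a canned reply regardless of context."""
    return "我喜欢狗。"  # "I like dogs."

def make_trigger(u1, b1, method="paraphrasing"):
    """Placeholder trigger: here, simply re-asks the first question."""
    return u1

def build_conversation(u1, trigger="paraphrasing"):
    b1 = chatbot_reply([u1])
    u2 = make_trigger(u1, b1, trigger)
    b2 = chatbot_reply([u1, b1, u2])
    # Human screening and annotation happen afterwards (steps 3-4),
    # so the label field starts empty.
    return {"u1": u1, "b1": b1, "u2": u2, "b2": b2, "label": None}

conv = build_conversation("你喜欢什么动物？")  # "What animals do you like?"
```

In the real pipeline, the unlabeled conversations produced this way are then passed to annotators for screening and contradiction labeling.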
Collecting u 1 We collected the human-written utterances from DuPersona, a crowd-sourced Chinese open-domain dialogue corpus. This is due to our observation that these crowd-sourced utterances are of higher quality than social media posts (e.g., from Weibo and Douban) and contain rich persona information, which is in line with the style and content of general chitchat. We used those utterances that contain second-person nouns and "?" as u 1 , since we noticed that such questioning utterances elicit chatbots to talk about specific information about themselves and avoid uninformative or meaningless replies.
Persona Labels To help understand which type of information was involved in History Contradiction, these b 2 were additionally annotated with one of four persona labels: attributes, opinions, experiences, and persona-unrelated. Their examples are shown in Figure 2 and their definitions are provided in §B. Note that we annotated persona information since related discussion usually occupies a large proportion of Chinese chitchat according to our observations on social media corpora. (DuPersona is available at https://www.luge.ai/#/luge/dataDetail?id=38.)
Chatbots We used two state-of-the-art Chinese open-domain chatbots, EVA (Zhou et al., 2021; Gu et al., 2022) and PLATO (Bao et al., 2021a,b). EVA is an Encoder-Decoder model with 24 encoder layers and 24 decoder layers, totaling 2.8B parameters. PLATO adopts a Unified Transformer architecture (Bao et al., 2020) with 32 layers and 1.6B parameters. Both are pre-trained on massive Chinese social media corpora.

Trigger Methods
Our inspection of contradiction cases (§3) also revealed that chatbots are more prone to making contradictions under several specific user behaviors: (1) the user input is short and uninformative, (2) the user inquires about the dialogue history (similarly noticed by Li et al. 2021), and (3) the user asks for similar information in the context. By simulating these user behaviors, we devise a series of methods to automatically construct u 2 . These methods are illustrated by the examples in Figures 1, 2 and 5. Note that the automatic construction of u 2 makes it necessary to inspect whether it is fluent and understandable, which is thus an important step to ensure data quality (§4.1).
Short Utterance u 2 is a short and uninformative utterance. It simulates a user's casual or perfunctory reply to the chatbot.
Inquiring History (Bot / User) u 2 is an inquiry about the dialogue history. It simulates a user's inquiry about the contents of the previous conversation.
We first extracted named entities in b 1 (about the bot) or u 1 (about the user) using HanLP (He and Choi, 2021). Then we leveraged an open-sourced question generation model to generate questions about the extracted entities, which were used as u 2 .
Note that when inquiring about the user, we used utterances that contain first-person nouns from DuPersona as u 1 . Since we noticed that the u 2 obtained in this way was sometimes not natural enough, we modified most of them using the pattern "Do you know...?", which we denote as Inquiring History (User-M), as illustrated in Figure 5.
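The Inquiring History trigger (entity extraction followed by question generation) can be sketched as below. The paper uses HanLP for NER and an open-sourced question-generation model; both are replaced here by toy stand-ins, and the entity vocabulary and question pattern are illustrative assumptions:

```python
# Sketch of the Inquiring History trigger (§4.2): extract an entity from
# b1 (about the bot) or u1 (about the user), then ask a question about it.

def extract_entities(utterance):
    """Toy NER: matches against a tiny hand-made entity list."""
    vocab = ["北京", "篮球", "金毛"]  # Beijing, basketball, golden retriever
    return [w for w in vocab if w in utterance]

def generate_question(entity):
    """Toy QG: wraps the entity in a 'Do you know ...?' pattern,
    mirroring the User-M variant."""
    return f"你知道{entity}吗？"

def inquiring_history(u1, b1, about="bot"):
    source = b1 if about == "bot" else u1
    entities = extract_entities(source)
    return generate_question(entities[0]) if entities else None

# Inquire about an entity the bot mentioned in b1.
u2 = inquiring_history("你养宠物吗？", "我养了一只金毛。", about="bot")
```

A real implementation would substitute a trained NER pipeline and question-generation model for the two toy functions.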
Paraphrasing u 2 expresses the same meaning as u 1 in a different way. It simulates a user's clarification of the previous question.
We paraphrased u 1 through back-translation to obtain u 2 : the Chinese u 1 was first translated to English and then back-translated to Chinese. We used the Baidu translation API and removed those u 2 that were identical to u 1 .
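The back-translation step can be sketched as follows. The `translate` function and its lookup table are toy stand-ins for the Baidu translation API used in the paper:

```python
# Sketch of the Paraphrasing trigger (§4.2) via zh -> en -> zh
# back-translation, with a toy translation table standing in for the API.

TOY_TABLE = {
    ("你喜欢狗吗？", "zh", "en"): "Do you like dogs?",
    ("Do you like dogs?", "en", "zh"): "你喜欢狗么？",
    ("你好。", "zh", "en"): "Hello.",
    ("Hello.", "en", "zh"): "你好。",
}

def translate(text, src, tgt):
    """Toy stand-in for a translation API call."""
    return TOY_TABLE[(text, src, tgt)]

def paraphrase(u1):
    en = translate(u1, "zh", "en")
    u2 = translate(en, "en", "zh")
    # Discard back-translations identical to the original, as in §4.2.
    return u2 if u2 != u1 else None

u2 = paraphrase("你喜欢狗吗？")
```

The round trip yields a slightly different surface form ("吗" vs. "么"), while an unchanged round trip ("你好。") is filtered out.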
Perturbation As an extension of Paraphrasing, we found that a u 2 obtained by perturbing u 1 , such that u 2 and u 1 have similar or opposite meanings, can also trigger contradictions. Different from the preceding methods, Perturbation more likely reflects users' "hacking" behaviors rather than general chitchat, which may stem from curiosity, probing, or malicious attacks.
We perturbed u 1 in three ways. (1) Synonym. We randomly replaced the nouns in u 1 with their synonyms using an open-sourced synonym dictionary. (2) Antonym. We randomly replaced the verbs or adjectives in u 1 with their antonyms using the antonym dictionary. For Synonym and Antonym, there are on average 2.3/3.7 words per u 1 that can be replaced with synonyms/antonyms. In practice, we randomly chose one replaceable word in u 1 at a time. (3) Negative. We randomly replaced words in u 1 with their negated forms using the negation dictionary, or inserted negation words before the verbs in u 1 . Since we noticed that negation would greatly impair the fluency of u 2 , we additionally applied back-translation to u 2 to improve its fluency.
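The three perturbation variants amount to dictionary-based word replacement with one replaceable word chosen at a time. The tiny dictionaries below are illustrative stand-ins for the open-sourced synonym/antonym/negation resources referenced in the paper:

```python
# Sketch of the Perturbation trigger (§4.2): Synonym / Antonym / Negative
# variants of u1 via single-word replacement.

import random

SYNONYMS = {"喜欢": "喜爱"}    # "like" -> "be fond of"
ANTONYMS = {"喜欢": "讨厌"}    # "like" -> "dislike"
NEGATION = {"喜欢": "不喜欢"}  # "like" -> "not like"

def perturb(u1, mode, rng=None):
    """Replace one randomly chosen replaceable word in u1."""
    rng = rng or random.Random(0)
    table = {"synonym": SYNONYMS, "antonym": ANTONYMS,
             "negative": NEGATION}[mode]
    candidates = [w for w in table if w in u1]
    if not candidates:
        return None
    word = rng.choice(candidates)
    return u1.replace(word, table[word], 1)

u1 = "你喜欢猫吗？"  # "Do you like cats?"
```

The paper additionally back-translates the Negative variant to smooth out fluency issues, a step omitted from this sketch.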

Quality Control
All the human annotators were hired from a reputable data annotation company. They were instructed in the annotation procedure and the definitions and examples of the contradiction categories. However, due to the characteristics of the Chinese language and differences in individual habits of language usage and communication, the annotation criteria of the annotators may vary somewhat and need to be calibrated with our assistance. We applied the following mechanisms for quality control: Annotator Training All the annotators were required to take a training tutorial, which consists of 50 conversations for pilot annotation. We provided feedback to help them calibrate the annotation criteria.

Multi-person Annotation
In the formal annotation, each conversation was annotated by two different annotators.If their results were inconsistent, a third annotator would be asked to re-annotate and discuss the case with the first two annotators to reach a consensus.
Spot Check To calibrate the annotation criteria more effectively, we conducted annotation batch by batch and randomly sampled 100 conversations per batch for spot checks. We provided feedback to the annotators and instructed them to amend their annotations. After each revision we conducted another spot check until the pass rate reached 95%. In total, we conducted five batches of annotation with incremental batch sizes (17K annotated conversations). Except for the first two batches, all subsequent batches directly passed their first spot checks.

Statistics and Annotation Analysis
Table 3 shows the statistics of CDCONV. It contains 11,660 conversations, where the average lengths of u 1 , b 1 , u 2 , b 2 are 16.4, 12.1, 11.1, 11.6, respectively. The ratio of positive to negative samples is 1.68 (7,309 / 4,351). Both positive and negative samples include conversations constructed using various trigger methods, which suggests a high diversity of CDCONV. Among the negative samples, History Contradiction occupies the largest proportion (70.1%), along with rich persona labels.
To shed light on the trigger methods and the chatbot behaviors, Table 2 shows comprehensive annotation statistics. All the trigger methods can effectively trigger dialogue contradictions. Notably, Short and Inquiring (User-M) are the most effective in triggering Role Confusion and History Contradiction, respectively. As for chatbot behaviors, EVA and PLATO both produce fluent replies with little ethical risk, but both can be easily goaded into making contradictions. EVA is more prone to making Intra-sentence Contradiction (b 1 / b 2 ) and History Contradiction, while PLATO makes more Role Confusion and incoherent b 2 . We speculate that their different behaviors may result from the gaps in model architectures and training corpora.

Setups
We randomly split CDCONV into training/validation/test sets with the ratio 6/2/2. The experiments were conducted under two settings. The 2-class setting detects whether b 2 makes a contradiction, while the 4-class setting recognizes the contradiction category (the three categories in §3 along with a non-contradiction one). We measure model performance using Accuracy and Macro-F1.
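The split and the Macro-F1 metric can be sketched framework-free as below; the labels are toy examples, not CDCONV data:

```python
# Sketch of the 6/2/2 random split and the Macro-F1 metric (§5).

import random

def split_622(items, seed=0):
    """Shuffle and split into 60% / 20% / 20%."""
    items = items[:]
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_valid = int(n * 0.6), int(n * 0.2)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

train, valid, test = split_622(list(range(100)))
```

Macro-F1 weights every class equally, which is why the rare Intra-sentence Contradiction and Role Confusion classes visibly drag down the 4-class scores reported later.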

Compared Methods
We experimented with three popular Chinese pre-trained models: BERT, RoBERTa (Cui et al., 2021) and ERNIE (Sun et al., 2019). They all contain 12 Transformer layers (Vaswani et al., 2017) with hidden size 768. BERT and RoBERTa are both pre-trained with whole word masking, while ERNIE uses different knowledge masking strategies. We compared three methods of contradiction detection: • Sentence Pair: The model input consists of the bot's utterances b 1 and b 2 . This method follows the NLI framework adopted in previous work (Williams et al., 2018; Welleck et al., 2019; Nie et al., 2021), where contradiction detection is performed between a pair of sentences.
• Flatten: The flattened whole conversation, that is, u 1 , b 1 , u 2 and b 2 , is taken as the model input. This method utilizes contextual information for contradiction detection in a naive way.
• Hierarchical: We note that the three contradiction categories are usually related to different levels of contextual information according to their definitions (§3). We thus design a hierarchical modeling method, which consists of three separately fine-tuned 2-class classifiers applied in sequential order (Figure 6). Each classifier targets a specific contradiction category, takes the corresponding level of contextual information as input, and is fine-tuned with 2-class samples: the samples of the targeted contradiction category vs. all other samples. Once some contradiction category is detected, it is directly output; otherwise, non-contradiction is finally output.
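The Hierarchical cascade can be sketched as follows. The three classifiers are hypothetical callables; the exact context windows per category are our reading of the definitions in §3 and of Figure 6:

```python
# Sketch of the Hierarchical detection cascade (§5): three 2-class
# classifiers applied in the category order of §3, each seeing only the
# level of context its category needs. A classifier returns True when
# its category is detected.

def hierarchical_detect(u1, b1, u2, b2,
                        intra_clf, role_clf, history_clf):
    # Intra-sentence Contradiction: only b2 itself is needed.
    if intra_clf([b2]):
        return "intra-sentence"
    # Role Confusion: is b2 a reply to u2 or a reply to b1?
    if role_clf([b1, u2, b2]):
        return "role-confusion"
    # History Contradiction: the full conversation.
    if history_clf([u1, b1, u2, b2]):
        return "history"
    return "non-contradiction"

always_no = lambda utterances: False
label = hierarchical_detect("u1", "b1", "u2", "b2",
                            always_no, always_no, always_no)
```

In practice each callable would wrap a fine-tuned 2-class Transformer classifier; the cascade's early exit is what lets each category see only its required context.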
Prior to fine-tuning, we pre-trained all the models on a Chinese NLI pre-training corpus, which includes two widely used Chinese NLI datasets: CMNLI (Xu et al., 2020) and OCNLI (Hu et al., 2020). We merged the "entailment" and "neutral" labels into a "non-contradiction" one. See Table 5 for more results of NLI pre-training.
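The label merge that converts 3-class NLI data into 2-class contradiction data is a one-line mapping:

```python
# Sketch of the NLI label merge (§5): "entailment" and "neutral"
# collapse into "non-contradiction", yielding 2-class pre-training data.

def merge_nli_label(label):
    return "contradiction" if label == "contradiction" else "non-contradiction"

merged = [merge_nli_label(l)
          for l in ["entailment", "neutral", "contradiction"]]
```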

Implementation Details
We implemented all experiments on the PaddlePaddle platform (Ma et al., 2019). We employed the AdamW optimizer (Loshchilov and Hutter, 2018) with batch size 32 and learning rate 5e-5, and used a linear learning rate scheduler with warmup proportion 0.1. Each model was fine-tuned for 5 epochs, and the checkpoint achieving the highest Macro-F1 was used for testing. We report the average results over four random seeds, where each run took about 3 minutes on a single Tesla V100 GPU.
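The linear warmup-then-decay schedule described above can be written framework-free; this is a generic sketch of such a scheduler under the stated hyperparameters (peak lr 5e-5, warmup proportion 0.1), not the PaddlePaddle implementation itself:

```python
# Sketch of a linear warmup + linear decay learning-rate schedule (§5).

def lr_at(step, total_steps, peak_lr=5e-5, warmup_prop=0.1):
    warmup_steps = int(total_steps * warmup_prop)
    if step < warmup_steps:
        # Linear ramp from 0 to the peak over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from the peak to 0 over the remaining steps.
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / max(1, remaining))

peak = lr_at(100, 1000)  # first step after warmup sits at the peak
```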

Results
Table 4 shows the results of the 2-class setting, the 4-class setting, and the fine-grained F1 scores of all categories in the 4-class setting. We have three major observations: (1) Sentence Pair performs worse than Flatten and Hierarchical. This is unsurprising since exploiting contextual information is critical for dialogue contradiction detection, as discussed in §1.
(2) Hierarchical consistently performs best and boosts all the fine-grained results. In particular, Intra-sentence Contradiction and Role Confusion cannot be improved by naively feeding the models with the flattened whole conversation, as shown by the marked score decreases. In contrast, Hierarchical boosts the performance in Intra-sentence Contradiction and Role Confusion while also performing well in Non-contradiction and History Contradiction. This is because Hierarchical fully considers the characteristics of the different contradiction categories and properly utilizes the required contextual information for detection. For instance, Role Confusion requires judging whether b 2 is a reply to u 2 or a reply to b 1 . It is sufficient for the Role Confusion classifier to use these three utterances, while further adding u 1 may instead introduce noise and impair performance.
(3) Even for Hierarchical, the performance on Intra-sentence Contradiction and Role Confusion is still poor. Their highest Macro-F1 scores are 33.0 and 49.5 respectively, far inferior to Non-contradiction (85.1) and History Contradiction (71.0). One potential cause is the imbalance between the samples of non-contradiction and the three contradiction categories (Table 3). Another important reason may be that these pre-trained models still lack strong dialogue representation ability, which may be alleviated by additional pre-training on dialogue corpora.

Error Analysis and Discussion
We manually inspected the cases misclassified by the four RoBERTa Hierarchical models (trained with four random seeds). Figure 7 shows the results of the error analysis. Besides proper dialogue modeling (e.g., the hierarchical approach), dialogue contradiction detection also requires abilities such as commonsense, knowledge grounding, and reasoning, which correspond to the cases in Figure 7. Though innate to humans, these capabilities are still largely lacking in even gigantic deep neural models (Marcus, 2018; Choi, 2022). These challenges of dialogue contradiction detection show that further exploration is worthwhile.

Conclusion
In this work, we present CDCONV, a benchmark for contradiction detection in Chinese conversations. By simulating common user behaviors that trigger chatbots into making contradictions, we collect 12K conversations annotated with three typical contradiction categories. Experiments show that contextual information plays an important role in dialogue contradiction detection. However, there are still unresolved challenges in CDCONV, such as dialogue modeling, commonsense, knowledge grounding and reasoning. We hope that CDCONV can inspire and facilitate future research in dialogue contradiction detection and consistent generation.

Ethical Considerations
Human Annotation The human inspection and annotation were conducted by a reputable data annotation company, and the annotators were compensated fairly based on the market price. We did not directly contact the annotators, so their privacy is well preserved. This work does not use any demographic or identity characteristics.
Data Disclaimer In the construction of the CDCONV conversations, the u 1 utterances use dialogue posts from the open-sourced, crowd-sourced corpus DuPersona (§4.1). The u 2 utterances either come from DuPersona or are constructed using publicly available resources (corpora, models, or APIs, §4.2). The b 1 and b 2 utterances are all produced by chatbots. Due to the potential ethical risks in these utterances, we have censored and filtered out conversations that contained unsafe or unethical contents through human inspection.

A Limitations
Data Coverage and Construction An ideal benchmark for dialogue contradiction detection would (1) cover as many diverse contradiction cases as possible, and (2) be close to the real scenario of human-bot conversation. However, the cases of non-contradiction and contradiction in natural human-bot conversations are extremely unbalanced, as stated in §3 and by Nie et al. (2021), which makes data collection very difficult. For this reason, we (1) focus on the three typical contradiction categories found in the manually inspected contradiction cases (§3), and (2) construct conversations by simulating common user behaviors that trigger contradictions.
We are explicitly aware that CDCONV has finite coverage of dialogue contradiction cases. In particular, the CDCONV conversations consist of only two turns, but (1) contradictions may occur after more than one turn, and (2) some contradiction cases, especially History Contradiction, may contradict multiple turns. Samples of type (1) can be obtained by applying data augmentation to the CDCONV conversations based on chatbots' self-chat (Gu et al., 2022; Bao et al., 2021b) or language models' completion (Zheng et al., 2022; Dai et al., 2022). Samples of type (2) are not covered by CDCONV but in fact rarely occur based on our observations. Future benchmarks for dialogue contradiction detection may consider these complex cases.

Fluency and Coherence of Conversations From Table 2, we observed that Inquiring (User) results in more incoherent b 2 . The three Perturbation methods also lead to more non-fluent u 2 . This indicates that these methods may somewhat impair the naturalness of conversations. To address this, we conducted elaborate manual inspection (the 3rd and 4th steps in §4.1) to filter out conversations containing non-fluent or incoherent replies.
Human Annotation Due to the subjectivity of human annotation, there may unavoidably exist mislabeled samples in CDCONV. To alleviate this, we adopted multi-person annotation, conducted spot checks for each annotation batch, and required the pass rates to reach 95% to ensure data quality (§4.3). We especially point out that, despite the multi-person annotation, there may still exist biases in the annotation results regarding "fluency" (§4.1). Due to the characteristics of the Chinese language and differences in individual habits of language usage and communication, the annotators' understanding of "fluency" may not be identical. Although we have tried our best to unify the annotation criteria through constant feedback and quality checks (§4.3), these biases may not be fully eliminated.

Figure 1: Dialogue contradiction detection requires the full contextual information (including u 1 and u 2 ) rather than only the bot's utterances (i.e., b 1 and b 2 ).

Figure 3: Diagram of contradiction categories. Combine with the definitions below for a clearer understanding.

Figure 4: The collection procedure of CDCONV. See Table 2 for detailed annotation statistics.

Figure 6: Overview of the Hierarchical method.

Table 1: Comparison of CDCONV with related benchmarks / datasets for (dialogue) contradiction detection. The Extrinsic type targets contradiction between a conversation and external information (e.g., profiles or facts), while Intrinsic targets contradiction inside a conversation. See §2 for detailed discussion.

Table 2: Annotation statistics for each trigger method. Each value is the proportion of the corresponding annotation label. The proportions for b 2 are calculated after the unqualified conversations were filtered out (in the 3rd step in §4.1). The proportions of ethical risk and non-fluent b 1 , b 2 are omitted since they are all close to 0.

Table 4: Experimental results. Performance increases and decreases compared to Sentence Pair are marked.