An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model

Multi-turn response selection models have recently shown performance comparable to humans on several benchmark datasets. However, in real environments, these models often have weaknesses, such as making incorrect predictions that rely heavily on superficial patterns without a comprehensive understanding of the context. For example, such models often give a high score to a wrong response candidate that contains several keywords related to the context but is written in a tense inconsistent with it. In this study, we analyze the weaknesses of open-domain Korean multi-turn response selection models and publish an adversarial dataset to evaluate these weaknesses. We also suggest a strategy for building a robust model in this adversarial environment.


Introduction
Multi-turn response selection is the task of selecting the best response among given candidates for a given dialogue context. Response selection models have recently shown performance comparable to humans (Cui et al., 2020) on several in-domain/held-out benchmarks (Lowe et al., 2015; Zhang et al., 2018a; Dinan et al., 2020). However, in actual service environments, these models are often found to have weaknesses. For example, a model may give the highest score to a wrong response that has high word overlap with the context (Yuan et al., 2019) or that is semantically similar to the context (Whang et al., 2021).
Held-out evaluation often overestimates the real-world performance of a model (Ribeiro et al., 2020), so adversarial datasets for evaluating weaknesses have been constructed for tasks such as NLI (Naik et al., 2018; McCoy et al., 2019) and MRC (Jia and Liang, 2017; Rajpurkar et al., 2018).
A framework for comprehensively evaluating the general linguistic abilities of the model was also studied (Ribeiro et al., 2020).
Several works evaluated adversarial cases for the response selection task (Yuan et al., 2019; Whang et al., 2021). However, they only generate adversarial responses automatically by copying words from the context. In this study, we analyze weaknesses of open-domain Korean multi-turn response selection models from various aspects and construct an adversarial dataset manually. A total of 2,220 test cases are constructed, and each test case is classified by type.
Neural networks do not generalize well to such adversarial settings because they tend to over-rely on superficial patterns and spurious correlations in the dataset, which makes models biased (Clark et al., 2019; Nam et al., 2020). Thus, various debiasing methods have been studied to alleviate this phenomenon (Utama et al., 2020). In this study, we show that a debiasing method is also effective in adversarial evaluation for the multi-turn response selection task.
In retrieval-based chatbot systems where response selection is used, response candidates are composed as follows: all utterances in the database are used as response candidates (Humeau et al., 2020), or a subset filtered through a search engine is used. To filter the candidates, machine-learning-based embeddings or word-level similarity algorithms (e.g., BM25) are used, and these also have weaknesses in an adversarial setting. Therefore, almost every time a response is selected by an actual system, adversarial cases are included among the candidates. Thus, robustness to adversarial cases is especially important for the response selection task. We also construct a real environment test set and show experimentally that a model robust to adversarial cases achieves high performance in the real environment.

Adversarial Test Dataset
We analyze the incorrect responses in the internal service log and categorize the types of frequent errors. There are a total of seven types, and details of each type are as follows.
Repetition An incorrect response repeating one of the utterances in the context.
Negation A negation is either added to or omitted from a correct response, generating an erroneous response with a reversed affirmative or negative meaning. The test set for negation errors intentionally generates a negated response by adding or removing '안' or '못', which are negative adverbs in Korean (short-form negation), or '-지 않다', '-지 못하다', or '-지 말다', which are negative auxiliary predicates in Korean (long-form negation), in order to test whether the model understands such semantic reversal.
Tense A morpheme or expression marking tense is added to or removed from a correct response, generating an erroneous response whose tense is inconsistent with the given context. The test set for tense errors adds or replaces morphemes or expressions marking the future tense, such as '-겠-', or ones marking the past tense, such as '-었-', to test whether the model fully understands the contextual disconnection triggered by such a tense change.
Subject-Object A test set for subject-object errors generates a response inconsistent with the context due to confusion of the subject and object for a certain action. In particular, since zero anaphora can be found frequently in Korean sentences, incorrect responses are often made because of a failure in identifying the hidden subject of the previous context. This test set uses a subject or an object differently from the ones used in a correct response to examine whether the model fully understands the context disconnection caused by such errors.
Lexical Contradiction A key lexicon of a correct response is replaced with one that holds either conflicting or opposite meaning against the said key lexicon, generating an incorrect response. A test set for lexical contradiction errors replaces a key lexicon in a sentence with an antonym (e.g. hot vs cold) or a word that cannot be used instead (e.g. rain vs snow) to check whether the model understands the precise meaning of such lexicon.
Interrogative Word A test set for interrogative word errors generates a response in a form of 5W1H questions to ask for information that has already been explicitly or implicitly shared in previous dialogues.
Topic A key sentence or vocabulary is replaced with another sentence or term that does not fit in the previous context even though they frequently appear together in the given topic. While this error is similar to the lexical contradiction error to a certain extent, the replacement words used in this test do not hold conflicting or opposite meanings but instead have less semantic relevance to the context of the previous dialogue (e.g. sunny vs umbrella). The test set assesses whether a model fully understands the fact that while the replacement vocabulary is the one that is frequently used in the same given topic, the response does not correctly reflect the context of the previous dialogue.
Five annotators generate a total of 200 dialogue sessions. For each session i, annotators create two correct responses and an arbitrary number (M_i) of incorrect responses following the instructions described above. All sessions and responses are reviewed and filtered by experts. We set up one test case to consist of a context, one correct response, and one incorrect response. Therefore, 2 × M_i test cases are extracted from each session, and a total of 2,220 test cases are constructed. Each test case evaluates whether the model gives the correct response a higher score than the incorrect one for the given context. Statistics and examples are described in Table 1. We release this dataset at https://github.com/kakaoenterprise/KorAdvMRSTestData.
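The pairwise evaluation protocol above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; `overlap_score` is a hypothetical toy scorer standing in for a trained response selection model.

```python
from typing import Callable, List, Tuple

def evaluate_pairwise(
    test_cases: List[Tuple[str, str, str]],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of (context, correct, incorrect) test cases where the
    model scores the correct response higher than the incorrect one."""
    passed = sum(
        1 for context, correct, incorrect in test_cases
        if score(context, correct) > score(context, incorrect)
    )
    return passed / len(test_cases)

# Toy scorer (hypothetical): counts word overlap with the context.
# A real model would be a trained neural scorer.
def overlap_score(context: str, response: str) -> float:
    c, r = set(context.split()), set(response.split())
    return len(c & r) / max(len(r), 1)

cases = [("I ate lunch already", "glad you ate", "will you eat lunch")]
print(evaluate_pairwise(cases, overlap_score))
```

Each test case contributes 1 when the correct response wins the pairwise comparison, so the metric is equivalent to Precision@1 over two candidates.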

Method
The training data is a set of triples D = {(c_i, r_i, y_i)}_{i=1}^{N}, where c_i denotes a dialogue context, r_i is a response utterance, and y_i ∈ {0, 1} is a label. The context c_i = {u_{i,1}, u_{i,2}, ..., u_{i,k_i}} consists of a sequence of k_i utterances. The label y_i = 1 means that r_i is a sensible response for context c_i.

Baseline: Fine-tuning BERT
We adopt fine-tuned BERT (Devlin et al., 2019) as a baseline. Similar to previous works that fine-tuned BERT for the multi-turn response selection task (Gu et al., 2020; Whang et al., 2020), the input token sequence x_i of BERT is composed as follows:

x_i = [CLS] u_{i,1} [EOU] u_{i,2} [EOU] ... u_{i,k_i} [EOU] [SEP] r_i [SEP]    (1)

[EOU] is a special token indicating that an utterance is over. The final output hidden vector of the [CLS] token in BERT is fed into a fully connected layer with softmax activation. Then, BERT is fine-tuned to minimize the cross-entropy loss between the target label and the output of this layer.
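The input composition can be sketched as plain string assembly. This is an illustrative sketch, assuming the [EOU] marker follows each utterance and standard [CLS]/[SEP] placement; the exact tokenizer handling in the paper may differ.

```python
from typing import List

def build_input(context: List[str], response: str,
                eou: str = "[EOU]", cls: str = "[CLS]",
                sep: str = "[SEP]") -> str:
    """Compose the BERT input sequence for response selection:
    [CLS] u_1 [EOU] u_2 [EOU] ... [EOU] [SEP] response [SEP]."""
    ctx = f" {eou} ".join(context) + f" {eou}"
    return f"{cls} {ctx} {sep} {response} {sep}"

print(build_input(["hi", "how are you?"], "i'm fine, thanks"))
```

In practice [EOU] would be registered as an additional special token so the tokenizer does not split it into subwords.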

Debiasing Strategy
In general, a correct dialogue response utilizes keywords or topics in the context. Neural networks tend to over-rely on such superficial patterns (e.g., keywords, topics), which makes models biased (Clark et al., 2019; Nam et al., 2020). We see this bias as the main cause of the response selection model's vulnerability to an adversarial environment. Thus, we experimented with various debiasing techniques applied to the response selection task, and DRiFt (He et al., 2019) was the most effective. The main idea of the debiasing strategy we use is to train a debiased model to fit the residual of a biased model, focusing on examples that cannot be predicted well from biased features alone. Details of the DRiFt-based method are as follows. First, we train an auxiliary biased model using only biased features. The biased model is a single fully connected layer with softmax activation, trained with cross-entropy loss. The biased feature vector φ_i used as its input is as follows:

φ_i = [JS_wp(c_i, r_i), JS_wp(u_{i,k_i}, r_i), JS_mo(c_i, r_i), JS_mo(u_{i,k_i}, r_i)]    (2)
We use the Jaccard similarity (JS) between the whole context (c_i) and the response (r_i) as an input feature. We also use the JS between the last utterance (u_{i,k_i}) and r_i, because the last utterance is the most important (Zhang et al., 2018b; Ma et al., 2019). We use two tokenizers: WordPiece (Wu et al., 2016) and a morpheme analyzer. We assume that these word-overlap features can capture keyword and topic bias. Second, we train a debiased model utilizing the biased model, as shown in Figure 1. The overall structure of the debiased model is the same as the baseline; only the learning scheme is different. Let b be the output hidden vector of the biased model, d the output hidden vector of the debiased model, p_b = softmax(b), and p_d = softmax(d). The DRiFt method minimizes the cross-entropy loss between p_a = softmax(b + d) and the target labels. Thus, the loss function is defined as follows.
ℓ_i = − Σ_{l=1}^{L} y_{i,l} log p_{a,l}    (3)

where y_{i,l} is the one-hot target label and L is the number of classification classes (2 for this task). The gradient is backpropagated only to the debiased model; the parameters of the biased model are kept frozen.
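A minimal sketch of the biased features and the DRiFt loss follows. It is illustrative only: it shows one tokenization rather than the two used in the paper, works on a single example, and omits the neural models themselves.

```python
import math
from typing import List

def jaccard(a: List[str], b: List[str]) -> float:
    """Word-level Jaccard similarity between two token sequences."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def biased_features(context_toks, last_utt_toks, response_toks):
    # phi_i: word-overlap features fed to the biased model.  The paper
    # computes these under two tokenizations (WordPiece and a morpheme
    # analyzer); a single tokenization is shown here for brevity.
    return [
        jaccard(context_toks, response_toks),
        jaccard(last_utt_toks, response_toks),
    ]

def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def drift_loss(b: List[float], d: List[float], y: int) -> float:
    """Eq. (3) for one example: cross entropy on p_a = softmax(b + d).
    During training, b comes from the frozen biased model, so gradients
    flow only into the debiased logits d; the debiased model thus fits
    the residual that the biased features cannot explain."""
    p_a = softmax([bi + di for bi, di in zip(b, d)])
    return -math.log(p_a[y])

# A response that merely repeats context words gets high biased features,
# so the debiased model is pushed to explain what word overlap cannot.
ctx = "did you eat lunch today".split()
print(biased_features(ctx, ctx[-2:], "you eat lunch today".split()))
```

When the biased logits b are already confident in the correct class, the loss on that example is small regardless of d, so such overlap-driven examples contribute little gradient to the debiased model.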

Combination with Multi-task Learning
Recently, self-supervised learning approaches have shown state-of-the-art performance on the response selection task (Whang et al., 2021; Xu et al., 2021). These works devise auxiliary tasks to better understand the dialogue and train the model in a multi-task manner. The final loss function in these methods is a weighted sum of the losses of the auxiliary tasks and the main task (i.e., determining whether a given response is sensible for the context). Thus, the debiasing strategy can easily be combined with these methods by replacing the loss function of the main task with Equation 3. We also experiment with the self-supervised learning approach UMS (Whang et al., 2021) and show that the combination is effective not only in-domain but also in adversarial and real environments.
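The combination amounts to a simple substitution in the multi-task objective. A sketch, where the weights are hypothetical hyperparameters rather than values from the paper:

```python
from typing import List

def combined_loss(main_loss: float,
                  aux_losses: List[float],
                  weights: List[float]) -> float:
    """Multi-task objective: the main response selection loss plus a
    weighted sum of auxiliary self-supervised task losses.  To combine
    with debiasing, main_loss is computed with the DRiFt loss (Eq. 3)
    instead of the plain cross-entropy loss."""
    return main_loss + sum(w * l for w, l in zip(weights, aux_losses))
```

Everything else in the multi-task training loop is unchanged; only the main-task term is swapped.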

Experiment Setup
We construct an experimental dataset using a corpus that we produced in-house and the public Korean dialogue corpus (https://corpus.korean.go.kr). We split these corpora into three parts for training, validation, and testing. Statistics of each dataset are described in Table 2: #pairs denotes the number of context-response pairs, #cands denotes the number of candidates per context, pos:neg denotes the ratio of positive and negative responses among the candidates, and #turns denotes the average number of turns per context. Details of the construction are as follows.
Train, valid, and in-domain test The last utterance of a dialogue session is used as the positive response and the preceding utterances as the context. Negative responses are randomly chosen from other dialogues.
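This construction can be sketched as follows. The sketch is an assumption-laden simplification: it draws negatives only from the final utterances of other sessions, whereas the actual sampling pool may differ.

```python
import random
from typing import List, Tuple

def make_pairs(sessions: List[List[str]], n_neg: int = 1,
               seed: int = 0) -> List[Tuple[List[str], str, int]]:
    """Build (context, response, label) pairs: the last utterance of a
    session is the positive response, the rest is the context, and
    negatives are randomly drawn from other sessions."""
    rng = random.Random(seed)
    pairs = []
    for i, sess in enumerate(sessions):
        context, positive = sess[:-1], sess[-1]
        pairs.append((context, positive, 1))
        others = [s[-1] for j, s in enumerate(sessions) if j != i]
        for negative in rng.sample(others, min(n_neg, len(others))):
            pairs.append((context, negative, 0))
    return pairs

sessions = [["hi", "hello"], ["bye", "see you"], ["thanks", "you're welcome"]]
print(len(make_pairs(sessions)))
```

With one negative per session, this yields a 1:1 pos:neg ratio; raising n_neg skews the ratio toward negatives.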
Adversarial test It is described in Section 2.
Real environment test In a real environment, response candidates are not sampled randomly but are retrieved through a search system, or all utterances are used as candidates without sampling (Humeau et al., 2020). Many adversarial negatives appear in this situation, as described in Section 1. We build a dataset simulating this situation in a way similar to previous works (Wu et al., 2017; Zhang et al., 2018b).
We take dialogue sessions from the test corpus and the internal service log as contexts. We train a bi-encoder-based context and response embedding model (Humeau et al., 2020) and index the embeddings of all utterances in the corpus. Then, we retrieve the top 10 utterances as response candidates based on the similarity between their embeddings and the context embedding. For each candidate response, three annotators label whether it is sensible for the context. Responses judged sensible by at least two annotators are selected as positive responses.
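The retrieval step can be sketched as nearest-neighbor search over precomputed embeddings. This is a toy sketch assuming cosine similarity and random vectors in place of a trained bi-encoder's outputs.

```python
import math
import random
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(context_emb: List[float],
          utterance_embs: List[List[float]], k: int = 10) -> List[int]:
    """Candidate retrieval: rank all indexed utterance embeddings by
    cosine similarity to the context embedding (bi-encoder setup) and
    return the indices of the k best."""
    order = sorted(range(len(utterance_embs)),
                   key=lambda i: cosine(context_emb, utterance_embs[i]),
                   reverse=True)
    return order[:k]

rng = random.Random(0)
utts = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]
# Context embedding close to utterance 42: it should be retrieved first.
ctx = [v + 0.01 * rng.gauss(0, 1) for v in utts[42]]
print(top_k(ctx, utts, k=3))
```

At corpus scale, the exhaustive scan shown here would be replaced by an approximate nearest-neighbor index, but the ranking logic is the same.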

Results
We measure performance ten times for each model and report the mean and standard deviation in Table 3. See Appendix B for training details. The baseline is the fine-tuned BERT described in Section 3.1. "deb" denotes the debiasing strategy described in Section 3.2. UMS denotes the self-supervised multi-task learning method described in Section 3.3. Precision@1 is used as the evaluation metric for all test sets.
The debiasing strategy significantly improves adversarial test performance for both the baseline and the UMS model, achieving absolute improvements of 4.1% and 4.3%, respectively. A decline is observed on the in-domain test (-1.0% and -0.5% for the baseline and UMS), consistent with the slight in-domain degradation reported for the DRiFt debiasing method. However, the strategy improves performance on the comprehensive real environment test (+0.3% and +1.6% for the baseline and UMS). This supports our argument that robustness to adversarial cases is important for the response selection task. Additionally, +UMS+deb outperforms +deb on all test sets, showing that the debiasing strategy and UMS have a synergistic effect.
The performance on each adversarial type is reported in Table 4. Since we used word-level Jaccard similarity as the biased feature, the debiasing strategy shows a large improvement on the Repetition type, which simply reuses word sequences from the context as the negative response. There is no improvement on the Interrogative Word type. We assume this type remains difficult because it requires understanding all 5W1H information from the context.

Conclusion
We analyze the weaknesses of open-domain Korean multi-turn response selection models and publish an adversarial dataset to evaluate these weaknesses. We also suggest a strategy for building a model robust to adversarial and real environments, supported by experimental results. We expect that this work and dataset will help improve response selection models.

Ethical Considerations
The adversarial dataset we publish is generated manually. All sessions and responses in the dataset are reviewed and filtered by the experts, and we also considered ethical issues in this process. Thus, there is no hate speech or privacy issue in our dataset.