Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation

Open-domain neural dialogue models have achieved high performance in response ranking and evaluation tasks. These tasks are formulated as a binary classification of responses given a dialogue context, and models generally learn to make predictions based on context-response content similarity. However, over-reliance on content similarity makes the models less sensitive to inconsistencies, incorrect time expressions and other factors important for response appropriateness and coherence. We propose mask-and-fill and keyword-guided approaches for automatically creating adversarial negative training data that help ranking and evaluation models learn features beyond content similarity. The generated adversarial responses have high content similarity with the contexts but are either incoherent, inappropriate or not fluent. Our approaches are fully data-driven and can be easily incorporated into existing models and datasets. Experiments on classification, ranking and evaluation tasks across multiple datasets demonstrate that our approaches outperform strong baselines in providing informative negative examples for training dialogue systems.


Introduction
Due to the growing availability of dialogue corpora (Li et al., 2017; Zhang et al., 2018; Smith et al., 2020) and the advancement of neural architectures (Radford et al., 2019; Brown et al., 2020; Devlin et al., 2019), dialogue systems have achieved considerable success. As typically formulated, dialogue models generate one or more candidate responses to a provided context, consisting of past dialogue turns. Dialogue ranking (Zhou et al., 2018) and evaluation models (Tao et al., 2018; Yi et al., 2019; Sato et al., 2020), in turn, are deployed to select and score candidate responses according to coherence and appropriateness. Code and data are publicly available at https://github.com/prakharguptaz/Adv_gen_dialogue.
Ranking and evaluation models are generally trained using true positive responses and randomly selected negative responses, which raises two issues. First, random negative candidates often have low content similarity with the context, and thus models learn to associate response coherence and appropriateness with content similarity (Yuan et al., 2019;Whang et al., 2021;Sai et al., 2020). In real systems, generated response candidates tend to be more similar in terms of content, and so other factors (e.g., time expressions, dialogue acts, inconsistencies) tend to be more important. Second, randomly selecting candidates as negative examples in an open domain context can result in false negatives, leading to misclassification of appropriate responses.
To make dialogue models more robust to the spurious pattern of content similarity, prior work proposed to leverage adversarial and counterfactual examples (Kaushik et al., 2020;Srivastava et al., 2020). A reliable method for creating counterfactual data is to collect human-written adversarial negative responses (Sai et al., 2020), but it is expensive, time-consuming, and difficult to scale. Our goal is to create reliable automatic methods for synthesizing adversarial negative responses.
Table 1: Error categories prevalent in inappropriate responses with high context-response semantic relatedness. We present 7 categories with their descriptions and sample context and response pairs. For each category we also indicate whether it is frequently observed in Retrieval (R) or Generation (G) models. Models which simply learn to associate response coherence with content similarity often ignore these errors. Our approaches create adversarial negative data for training dialogue models by introducing such errors in context-relevant utterances.

The most common approach to generating natural language adversarial examples is to paraphrase the inputs or to insert typos, synonyms, or words relevant to the context (Iyyer et al., 2018; Ebrahimi et al., 2018; Alzantot et al., 2018; Zhang et al., 2019). In open domain conversations, however, a context can have a wide range of possible responses with varied forms and semantics. Small lexical variations via substitutions and paraphrasing do not provide adequate coverage over the possible space of adversarial responses, and they can also lead to the generation of false negatives due to the open-ended nature of dialogues. Creating adversarial dialogue responses is thus different from, and can be more challenging than, adversarial example creation in other natural language domains.
We propose two approaches for adversarial response creation: 1) a mask-and-fill approach that corrupts gold responses related to the context but retains content similarity, and 2) a keyword-guided generative approach that uses concepts from the context to generate topically relevant but incoherent responses. These approaches do not require additional annotations, are black-box (do not need access to model parameters), and are easily adapted to new datasets and domains.
The main contributions of this paper are: 1) We identify and discuss error patterns present in retrieval and generation model outputs which are difficult to detect due to high content similarity; 2) To the best of our knowledge, we are the first to propose automatic approaches for creating adversarial responses for dialogue model training in a black-box setting; and 3) We demonstrate that our proposed approaches outperform strong baselines on dialogue classification, ranking and evaluation tasks on two datasets.

Properties of Adversarial Responses
Models trained using randomly sampled negative examples tend to assign high scores to responses with high content similarity to the context, and often ignore other important factors necessary for response appropriateness and coherence. Therefore, we aim to generate adversarial negative responses which have high content similarity with the context but which still possess factors rendering them inappropriate to the context. We present a categorization of such factors, or error types, which can make a response inappropriate in Table 1. For each category, we provide its description and sample context-response pairs. To create this categorization, we manually analyzed responses present in the outputs of generative models, candidates of retrieval sets, and human-written adversarial dialogue responses (Sai et al., 2020). Categories C-ent, C-time and C-cont are errors related to various inconsistencies and logical flaws in the responses and indicate poor response appropriateness. Categories C-speaker, C-follow and C-strat are error types specific to the dialogue setting and indicate poor response coherence. Category C-lang indicates poor response fluency. Our categorization of errors is inspired by the categorization suggested by Pagnoni et al. (2021) for factuality in summarization, and by Higashinaka et al. (2019), Ko et al. (2019) and Sato et al. (2020) for dialogue. These categories inform our approaches as well as our error analysis.

Methodology
For a given dialogue context C and its gold response R_g, our goal is to generate an adversarial response R_a that achieves high scores from dialogue ranking or evaluation models despite not being a valid response to the context C. Dialogue ranking and evaluation models trained with such hard synthetic negative responses should learn to associate response relevance with features beyond content similarity, and hence become robust against spurious features.
The adversarial responses should satisfy the following criteria: 1) have high content similarity with the input contexts; 2) have one or more errors (Table 1) which make the response inappropriate to the context; 3) be hard training examples, that is, they should likely be misclassified by current models as correct; and 4) sufficiently cover errors which occur naturally in model generated responses and retrieval candidates, and therefore be plausible and diverse. We propose two approaches for synthesizing adversarial negative examples: a mask-and-fill approach and a keyword-guided generation approach, which we discuss next.

Mask-and-fill Approach
This approach modifies and corrupts original utterances related to a context, as shown in Figure 1. It consists of two steps: 1) masking, where one or more tokens of an original utterance are masked out; and 2) infilling, where the masked out tokens are substituted with new tokens. For a context C, the set of original utterances consists of:
• The set of ground truth responses of the context, R_g.
• The set of utterances from the context, U_c.
• The set of retrieved responses based on the context, R_e.
Masking: We use the hierarchical masking function from Donahue et al. (2020), which selectively masks spans at the granularities of words, n-grams, and sentences. We apply the masking function to each utterance multiple times to get up to 3 masked versions per utterance. Each utterance is constrained to have at least two masked spans. The spans are selected randomly for masking following Donahue et al. (2020).
Infilling: We extend the Infilling Language Model (ILM) from Donahue et al. (2020) for dialogue response infilling (Figure 1). The ILM model is a GPT-2 (Radford et al., 2019) based language model. For any piece of text t with some spans masked with [blank] tokens, it is trained to predict the blanked spans in t as a sequence generation problem. Each blank is infilled with an n-gram which can consist of one or more tokens. For generating adversarial responses, infilling is done by conditioning on random contexts C_rand instead of the original context C to introduce various categories of errors (Table 1).

Figure 1: Mask-and-fill approach using the ILM model. ILM is trained to infill n-grams in place of blanks in a response. Tokens after [infill] replace the [blank] tokens. During training, Mask-and-fill learns to infill responses conditioned on the correct context. During testing, it infills the response conditioned on a random context, which introduces errors in the response.
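As a rough illustration (not the authors' exact implementation), the random span masking step can be sketched in a few lines. The function name, span-selection policy and defaults below are our own simplifications of the hierarchical masking function of Donahue et al. (2020):

```python
import random

def mask_spans(tokens, n_spans=2, max_span_len=3, rng=None):
    """Replace up to n_spans disjoint token spans with [blank] markers.

    Simplified stand-in for the hierarchical masking of Donahue et al.
    (2020): spans here are plain n-grams, not word/sentence granularities.
    """
    rng = rng or random.Random()
    tokens = list(tokens)
    starts = list(range(len(tokens)))
    spans = []
    for _ in range(n_spans):
        if not starts:
            break
        start = rng.choice(starts)
        end = min(start + rng.randint(1, max_span_len), len(tokens))
        spans.append((start, end))
        # drop candidate starts whose span could overlap the chosen one
        starts = [s for s in starts if s + max_span_len <= start or s >= end]
    out, i = [], 0
    for s, e in sorted(spans):
        out.extend(tokens[i:s])
        out.append("[blank]")
        i = e
    out.extend(tokens[i:])
    return out
```

Each masked version is then passed to the infilling model, which replaces every [blank] with a generated n-gram.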
For example, in Figure 1, conditioning on a random context leads to the infilling of "the marriage" in the response, introducing an error of type C-ent. For the context "Did you enjoy your stay at our hotel?" it generates the response "I enjoyed at lot at the marriage". By corrupting the three types of utterances R_g, U_c and R_e, this approach is able to introduce errors covering the 7 categories in Table 1.
Preventing false negatives: Accidentally incorporating false negatives during training can lead to the model learning to misclassify appropriate responses. However, due to the open-ended nature of dialogue responses, preventing the generation of false negatives is not trivial. In addition to conditioning on random contexts, we incorporate the following mechanisms during infilling to further reduce false negative generation:
• Semantics of substitution: We only select token substitutions which were not present in the tokens which were blanked. We also lower the generation probability of the blanked tokens' top 10 related words (based on GloVe embedding (Pennington et al., 2014) similarity) by a factor of 100. This ensures that the blanks are not infilled by the originally blanked tokens or any related words.
• Degree of substitution: To ensure that the generated negative response is sufficiently different from the original utterance, we filter out the original utterance if the number of words in the utterance after stop-word removal is less than 2. We also filter a generated response if the difference in count of non stop-words between the original and generated response is less than 2.

Figure 2: Keyword-guided approach for adversarial response generation. During training, the model learns to generate a response conditioned on its keywords and the correct context. During testing, it generates the response conditioned on a random context and keywords extracted from the correct context. The generated response thus shares content with the test context but does not directly address the context.

Improving fluency: The ILM model often generates responses with poor grammar or structure. To improve the fluency of the adversarial response sets, we first generate up to 4 different infilled variations of the masked original utterances, then score them using a GPT-2 based scorer named lm-scorer. We then select the desired number of responses from this larger set.
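The degree-of-substitution filters described above can be sketched as follows. The stop-word list is a toy one, and we interpret "difference in count of non stop-words" as the number of content words by which the two responses differ (an assumption on our part):

```python
# Toy stop-word list; a real implementation would use e.g. NLTK's.
STOPWORDS = {"the", "a", "an", "is", "was", "at", "our", "i", "to",
             "of", "and", "in", "it", "my"}

def content_words(text):
    """Lower-cased tokens with stop-words removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def passes_filters(original, generated, min_content=2, min_diff=2):
    """Keep a generated negative only if the original has at least
    min_content content words and the generated response differs from
    it in at least min_diff content words (our interpretation)."""
    orig, gen = content_words(original), content_words(generated)
    if len(orig) < min_content:
        return False
    return len(orig ^ gen) >= min_diff
```

Candidates failing either check are discarded before the fluency-based reranking step.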

Keyword-guided Approach
This approach generates adversarial responses using keywords from the context as guidance, as shown in Figure 2. The base generative architecture is a GPT-2 based dialogue model trained to generate responses conditioned on the context and the response keywords. For adversarial response generation, the generation is conditioned on a random context C_rand and keywords from the test context C. In Figure 2, for the context "How long did it take you to get your license?" the model generates the response "We will bring our license and documents." To create the keyword set K for a response, the model randomly selects n keywords from the set of all keywords extracted from the context C, where n is chosen randomly between 1 and 3 for every context. Keyword extraction is performed using Rake (Rose et al., 2010). We call this model Key-context. Since the generation is conditioned on keywords from context C, the generated response shares some content and semantics with the test context. However, since it is also conditioned on a random context C_rand, the generated response incorporates entities, time expressions, speaker roles, dialogue acts, and other details based on C_rand. Since the generation model is not perfect, it also introduces errors related to fluency. Hence, the model is able to introduce errors covering the 7 categories in Table 1.
Key-context only uses keywords from the context to induce content similarity with the context. However, responses can have high content similarity due to the presence of similar concepts rather than identical keywords. To introduce content similarity at the concept level, we expand the keyword set K with each keyword's top 10 most related words based on GloVe embeddings. We use the gensim library to find the most related words. For example, the related words for the keyword "christmas" are "holidays" and "easter". We replace a keyword in the keyword set K with one of its related words with a probability of 0.5. We call this variant Key-sem.
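A minimal sketch of the Key-sem keyword expansion follows; the relatedness map here is a hard-coded stand-in for GloVe nearest neighbours retrieved via gensim:

```python
import random

# Hard-coded stand-in for top GloVe neighbours found with gensim.
RELATED = {
    "christmas": ["holidays", "easter"],
    "license": ["permit", "registration"],
}

def expand_keywords(keywords, p=0.5, rng=None):
    """With probability p, swap each keyword for one of its semantically
    related words, keeping the keyword set the same size."""
    rng = rng or random.Random()
    out = []
    for kw in keywords:
        if kw in RELATED and rng.random() < p:
            out.append(rng.choice(RELATED[kw]))
        else:
            out.append(kw)
    return out
```

The expanded keyword set then conditions generation exactly as in Key-context.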

Classification Model
Our classification model architecture is based on the Speaker-Aware Bert (SA-Bert) model (Gu et al., 2020). Given a dialogue context C = {C_1, C_2, . . . , C_h} with C_k denoting the k-th utterance in the context, a response r and a label y ∈ {0, 1}, the goal of the dialogue model M is to learn a score s(C, r) by minimizing the cross-entropy loss for the binary classification task. To calculate s(C, r), C and r are concatenated, with a prepended [CLS] token. The output vector E_[CLS] ∈ R^H for the [CLS] token is used as the aggregated representation for context-response pair classification. The final prediction is made as ŷ = softmax(W E_[CLS]), where W ∈ R^{2×H}. The SA-Bert model incorporates speaker information in two ways. First, an additional speaker embedding indicating the speaker's identity for each utterance is added to the token representations. Second, an [EOT] token is added at the end of each speaker turn. Before fine-tuning the Bert model on the classification task, we first adapt it to the dataset using the standard masked language model objective (Devlin et al., 2019).
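The scoring head reduces to a softmax over a linear projection of the pooled [CLS] vector. A minimal numerical sketch (the encoder itself is abstracted away as a given H-dimensional vector):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def score_response(e_cls, W):
    """s(C, r): probability that the response is appropriate, computed
    as softmax(W @ E_[CLS])[1], with W of shape (2, H)."""
    return float(softmax(W @ e_cls)[1])
```

In the full model, e_cls comes from the SA-Bert encoder and W is learned jointly with it under cross-entropy loss.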
We test our approaches and baselines on dialogue classification, ranking and evaluation tasks.

Training Details
We use the base-uncased checkpoints for BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020) from the Hugging Face transformers library (Wolf et al., 2020). We trained the models with a maximum sequence length of 128, a maximum of 3 training epochs, and the Adam optimizer with an initial learning rate of 5e-5 and linear decay, using a batch size of 60 per GPU on machines with 4 Nvidia 2080Ti GPUs. For generation, we use a temperature of 0.9, nucleus sampling with p equal to 0.9, and a minimum length of 5. We repeat each experiment three times (five times for BERT-based models) with different random seeds, use the validation split to select the best model, and report the mean metric values. Validation was done every 200 batches.
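For concreteness, nucleus (top-p) sampling keeps only the smallest head of the sorted distribution whose cumulative mass exceeds p and renormalizes before sampling. A sketch over an explicit probability vector (in practice this is applied inside the generation library):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token index with top-p (nucleus) sampling
    (Holtzman et al., 2020). Temperature scaling of the logits
    would be applied before this step."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # tokens by descending prob
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]                    # smallest set with mass > p
    kept = probs[keep] / probs[keep].sum()   # renormalize
    return int(rng.choice(keep, p=kept))
```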

Datasets
We use two open-domain dialogue datasets: DailyDialog++ (Sai et al., 2020) and PersonaChat (Zhang et al., 2018). DailyDialog++ consists of 16900 dialogue contexts in the train set, 1028 in the validation set and 1142 in the test set. Each context contains 5 positive responses and 5 random negative responses. It also contains 5 adversarial responses per context collected through crowdsourcing, where annotators were instructed to create negative responses with high content similarity with the context. A subset of 9259 of the 16900 training contexts have 5 human-written adversarial negative responses. It has two test sets, an adversarial test set and a random test set, based on the type of the negative responses. PersonaChat (Zhang et al., 2018) is a corpus of human-human persona-conditioned conversations consisting of 8938 dialogues in the train set. We sample 2 random context-response pairs from each dialogue, for a total of 17876 training contexts. We prepend the persona utterances to the dialogue contexts in our experiments. Since there is no human-created adversarial test set available for the PersonaChat dataset, we construct an artificial adversarial dataset by randomly selecting an utterance from the dialogue context and inserting it into the set of candidate responses, following Jia and Liang (2017) and Whang et al. (2021). The adversarial test set for each context consists of the ground truth response, one utterance selected from the dialogue context, and 8 random negative responses. The random test set consists of 9 random negative responses.
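The PersonaChat candidate-set construction can be sketched as follows (function name and defaults are ours):

```python
import random

def build_candidate_sets(context, gold, random_pool, n_random=8, rng=None):
    """Build the two PersonaChat test candidate sets: the adversarial
    set (gold + one utterance copied from the context + n_random random
    negatives, following Jia and Liang (2017)) and the random set
    (gold + n_random + 1 random negatives)."""
    rng = rng or random.Random()
    context_utt = rng.choice(context)
    adversarial = [gold, context_utt] + rng.sample(random_pool, n_random)
    random_set = [gold] + rng.sample(random_pool, n_random + 1)
    return adversarial, random_set
```

Both sets contain 10 candidates, so ranking metrics are directly comparable across the two conditions.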

Metrics
For the classification task, we report accuracy following Sai et al. (2020). For the ranking task, we report standard ranking metrics: recall R_n@k and mean reciprocal rank (MRR). For DailyDialog++, n is 6 in recall, as candidate sets consist of one positive response and 5 negative responses. For PersonaChat, n is 10. For both classification and ranking tasks, we report results separately for the adversarial and the random test sets.
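For reference, R_n@k and MRR over ranked candidate lists can be computed as below (each inner list holds binary relevance labels in ranked order; this is a sketch, not the paper's evaluation code):

```python
def recall_at_k(ranked_labels, k):
    """R_n@k: fraction of examples whose positive response appears in
    the top k of the ranked candidate list."""
    return sum(any(labels[:k]) for labels in ranked_labels) / len(ranked_labels)

def mrr(ranked_labels):
    """Mean reciprocal rank of the first positive in each ranked list."""
    total = 0.0
    for labels in ranked_labels:
        rank = next(i for i, rel in enumerate(labels, 1) if rel)
        total += 1.0 / rank
    return total / len(ranked_labels)
```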
The dialogue evaluation task comprises scoring or rating a response for its quality. For this task, we report the correlation of model scores with human provided ratings. We leverage human ratings released by the following sources: 1) 600 ratings for response "sensibility", with inter-rater agreement > 0.6 (Krippendorff's α (Krippendorff, 2018)). The responses consist of outputs from a hierarchical recurrent encoder decoder (HRED) model with attention and a variational HRED (VHRED) model with attention; 2) 700 ratings for response quality. The responses are from 6 different generative models: Seq2Seq (Sutskever et al., 2014), attentional Seq2Seq, HRED, VHRED, GPT2-small, and GPT2-medium (Wolf et al., 2019), with greedy decoding, ancestral sampling, and nucleus sampling based decoding (Holtzman et al., 2020). The inter-rater agreement is 0.815 (Krippendorff's α); and 3) since the first two sources do not cover retrieval model outputs, we additionally collect quality ratings for 100 responses selected by a retrieval model (Poly-Encoder (Humeau et al., 2020)) and 100 human-written responses, with moderate inter-annotator agreement (Cohen's Kappa 0.45 (Cohen, 1968)). All data points belong to the DailyDialog dataset and ratings are scaled between 0 and 1. Combining these sources, we have a total of 1500 ratings for different context-response pairs.
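The reported correlations compare model scores against these ratings; for instance, a Pearson correlation can be computed as follows (the exact correlation variant used is not restated here):

```python
import numpy as np

def pearson(scores, ratings):
    """Pearson correlation between model scores and human ratings."""
    x = np.asarray(scores, dtype=float)
    y = np.asarray(ratings, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```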

Baselines
We compare the following approaches for creating adversarial negative response sets. Random: negative responses sampled randomly from other contexts. Semi-hard: negative responses chosen with a semi-hard sampling scheme based on Sentence-Bert similarity. Token-subs: negative responses created by applying corrupting transformations to tokens of positive responses. BM25: negative responses retrieved from the training set based on BM25 similarity with the context; this baseline follows Lin et al. (2020) and has shown strong performance in passage and response retrieval. Human: human-written adversarial negative responses from Sai et al. (2020).
Mask-and-fill: Our approach that infills utterances conditioned on random contexts. Key-context: Our approach that generates responses conditioned on test context keywords and a random context history. Key-sem: A variant of Key-context which additionally conditions on words semantically related to the keywords in the context. For each context, adversarial train sets are created by adding 5 random negative responses to the set of 5 negative responses created by the above approaches. If an approach creates more than 5 responses, we randomly select 5 of them.

Models
We experiment with the following architectures for ranking and evaluation models: 1) Bert (Devlin et al., 2019), for which we use the SA-Bert variant (Gu et al., 2020); 2) Electra (Clark et al., 2020), which is pre-trained with a replaced token detection objective in a generator-discriminator framework; and 3) Poly-encoders (Humeau et al., 2020), which allow fast real-time inference by precomputing each candidate response representation once, and then ranking candidate responses for retrieval by attending to the context.

Results and Discussion
In this section, we compare the performance of our approaches with the baselines on dialogue classification, ranking and evaluation tasks.
Performance on classification Our proposed approaches Mask-and-fill and Key-sem achieve the highest classification accuracy on the adversarial test set (Table 2), a few percentage points short of the Human baseline. The closest baseline is BM25, which trails our approaches by 3% in accuracy. Token-subs, which applies transformations to positive responses to corrupt them, does not fare well on this task. This indicates that simple transformations do not provide good coverage of the semantic variations present in the adversarial test responses. Our approaches achieve similar performance across different model architectures, demonstrating their generalizability. Unsurprisingly, the Human baseline performs strongly, as the training and test data were created in the same manner and have similar distributions. On the random test set, the performance of all approaches is either very close to or lower than that of the Random baseline. Since the similarity between correct responses and the context is generally much higher than between random responses and the context in the random test set, the Random baseline performs better because it associates coherence mostly with semantic similarity. Finally, our analysis shows that all baselines tend to assign low scores to valid responses which do not address a context directly. For example, for the context "Will you join us for the concert?", if the response is "It is supposed to rain this week.", models assign it a low score. Such scenarios require understanding of social and commonsense related factors. We leave addressing this limitation to future work.
Performance on ranking On the DailyDialog++ adversarial test set, the Mask-and-fill and Key-sem approaches achieve the best Recall and MRR, closely followed by the BM25 baseline (Table 2). The trends of the ranking metrics are similar to those observed for the accuracy metrics. Our approaches perform better than the Human baseline on the random test set. On the PersonaChat dataset, Mask-and-fill and Key-sem perform better than the baselines (Table 3), especially on the adversarial test set. This demonstrates the extensibility of our approaches across datasets. Mask-and-fill performs better than Key-sem because the keyword sets contain many keywords from the persona, so responses have high content similarity with the persona rather than with the context. The poor performance of the Random baseline provides evidence that training models using random negative candidates does not make the models robust against hard test cases. BM25 is a strong baseline for both datasets since retrieved responses also provide coverage over errors of various categories. However, retrieved response quality and diversity depend on the size of the retrieval pool, and a stronger retrieval mechanism can lead to more false negatives. While the variation in BM25 response sets is constrained by the size of the dataset, and they provide less coverage over categories C-cont, C-strat and C-lang (Table 1), our approaches have no such constraints.
Performance on dialogue evaluation To study the performance of various approaches on real systems, we compare them on the task of dialogue evaluation, or scoring. We measure the correlation between the scores predicted by the approaches in Table 4 and human provided ratings. Reference based metrics like BLEU-2, METEOR, SkipThought and Vec Extrema achieve very low correlations, similar to findings reported in prior art (Liu et al., 2016; Gupta et al., 2019). BERTScore and RUBER achieve moderate correlation. Our approach Key-sem achieves the best correlations, followed by Mask-and-fill. BM25's performance is lower than that of our approaches, but it is higher than the Random and Semi-hard approaches. Although Token-subs did not achieve high performance on the classification and ranking tasks, it performs well on this task. This is likely because real model outputs contain more of the factual inconsistencies and contradictions that this approach captures than the adversarial test sets do. Key-sem performs better than Mask-and-fill on evaluation since, while Mask-and-fill only modifies utterances related to the context, Key-sem can freely generate more diverse adversarial responses for training. Key-sem also achieves higher correlation than the Human baseline. This may be because it is difficult for humans to create erroneous responses with distributions similar to those in model generated or selected responses, especially for error types like C-speaker, C-strat and C-lang. In contrast, our approaches provide good coverage over all error types.
Analysis of error types We analyze the classification outputs of various approaches on the DailyDialog++ adversarial test set and report the types of misclassifications by each approach in Figure 3. We first select a subset of test data where at least one of the approaches misclassifies the adversarial response as positive. We then manually categorize the types of errors presented in Table 1 for 200 randomly selected contexts from this subset. Each response can have multiple error types. C-follow and C-extra are the dominant error types misclassified by the baselines Random, BM25 and Token-subs. The Key-sem and Mask-and-fill approaches achieve improvements on all error types compared to the baselines and have a more uniform error distribution. While Key-sem performs better on C-extra, Mask-and-fill is better on C-follow and C-speaker.
Adversarial response examples We present sample responses from our approaches along with Random and Human baseline responses in Table 5. The Random approach produces responses which are easily distinguishable from ground truth responses. The Mask-and-fill approach modifies either the ground truth response, utterances from the context, or BM25-retrieved responses. It modifies these utterances to introduce corruptions such as non-contextual tokens, extraneous entities, incorrect time expressions, affective words or contradictions, which make the response inappropriate or incoherent to the context while remaining topically similar to it. In Key-sem, the dialogue acts, some entities and other tokens of the generated response depend on the random context the response is conditioned on, which also makes the response inappropriate or incoherent to the context.

Related Work
Dialogue response ranking and evaluation are important tasks in the dialogue domain because even recent large pretrained language model based architectures (Zhang et al., 2020b; Humeau et al., 2020; Adiwardana et al., 2020; Roller et al., 2021; Gupta et al., 2021) have been shown to produce inconsistent, ungrammatical and incoherent responses (Roller et al., 2021). Traditional word-overlap based metrics like BLEU have been shown to be ineffective for dialogue response scoring (Liu et al., 2016; Gupta et al., 2019).
Recently, trainable metrics such as ADEM, RUBER (Ghazarian et al., 2019) and USR (Mehri and Eskenazi, 2020) have been proposed for these tasks. However, since they are trained using negative samples obtained from random contexts, they are also prone to the spurious pattern of content similarity. Adversarial and counterfactual data creation techniques have been proposed for applications such as evaluation (Zhang et al., 2019). While these approaches are optimized to change the predictions of a target model by perturbing the inputs, our approaches are more general and are not optimized towards any target model. Polyjuice (Wu et al., 2021) and FactCC (Kryscinski et al., 2020) proposed approaches for model-agnostic, general-purpose counterfactual generation. These approaches change the model's prediction by creating small edits through substitutions and insertions to the inputs. They are not applicable to our setting, where we aim to flip the gold label, that is, convert a valid response to an adversarial response, while the model prediction should ideally remain the same so as to create hard training examples. Furthermore, small perturbations do not provide good coverage over the adversarial response space and can create false negative responses. Adversarial semantic collisions (Song et al., 2020) are semantically unrelated texts that are nevertheless judged as similar by NLP models, generated to expose model vulnerabilities. However, outputs which are unrelated to the context are not useful for adversarial training, as they are easy to classify.
Finally, negative sampling strategies have also been studied for creating hard negative samples in the context of visual embeddings (Faghri et al., 2018; Guo et al., 2018), knowledge graphs (Kotnis and Nastase, 2017), document retrieval (Saeidi et al., 2017; Karpukhin et al., 2020) and response retrieval (Lin et al., 2020). In this work we compare with and build upon past work, and are the first to propose generative approaches for adversarial negative response creation in dialogue.

Conclusion
This paper introduces approaches for synthesizing adversarial negative responses for training more robust dialogue response ranking and evaluation models. To synthesize a rich and comprehensive set of responses, we present and analyze categories of errors which affect the models. Our proposed approaches do not require any manual annotation and achieve high performance on dialogue classification, ranking and evaluation tasks across two datasets. These results demonstrate the promise of synthetic negative examples for improving open domain dialogue. In future work, we will explore synthesizing adversarial test sets and methods for finer grained, controlled adversarial response generation.

B Experiments with Masking
We experiment with two procedures for masking in the Mask-and-fill approach: 1) Random masking, which masks contiguous chunks of tokens with some probability p. We leverage the masking function from Donahue et al. (2020), which can selectively mask spans at the granularities of words, n-grams, and sentences. 2) Importance masking, which keeps the tokens of a response most relevant to the context and masks the rest. For Importance masking, we leverage the matching model from Cai et al. (2019), which is trained to estimate the sequence-level quality s(q, r) of a response r for a given query q. It decomposes the sequence-level matching score between a context and a response into a set of token-level scores as follows: s(q, r) = x_q^T W_s x_r = Σ_{k=1}^{m} ω_k x_q^T W_s (r_k + e_{r_k}) = Σ_{k=1}^{m} ω_k s_k, where s_k = x_q^T W_s (r_k + e_{r_k}), and x_r = Σ_{k=1}^{m} ω_k (r_k + e_{r_k}) is the weighted sum of the Bert Transformer encoder outputs r_k and their initial vector representations e_{r_k}. The importance of each response token k to the context is estimated by its weight ω_k. We mask out any token with importance weight ω_k less than the average weight and only retain tokens highly relevant to the context, following Cai et al. (2019). In our initial experiments we found that the Importance masking procedure led to worse performance than Random masking: accuracy on the DailyDialog++ adversarial test set was 85.43%, compared to 87.45% with Random masking. Our analysis showed that Importance masking masked out about 50% of the response tokens, and the infills generated by the ILM model were mostly poor in fluency because the number of masked tokens was high. We therefore used Random masking for Mask-and-fill.
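The importance-based masking rule reduces to a mean-threshold filter over the token weights ω_k; a minimal sketch:

```python
def importance_mask(tokens, weights):
    """Mask tokens whose importance weight is below the mean weight,
    keeping only tokens highly relevant to the context."""
    mean_w = sum(weights) / len(weights)
    return [tok if w >= mean_w else "[blank]"
            for tok, w in zip(tokens, weights)]
```

In the full pipeline the weights come from the matching model of Cai et al. (2019); here they are simply given as input.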

C Sample Model Generated Responses
In continuation of the sample responses presented in Table 5 of the main paper, we present additional sample responses from the different approaches, along with Random and Human baseline responses, in Table 6.

D Additional Implementation Details
For the BM25 approach, we use the open source implementation from transformer-rankers. The DailyDialog++ dataset contains 16900 dialogue contexts, but only 9259 of those have adversarial negative responses for the Human baseline. For the results reported in Table 4, all approaches from Random downwards use the Bert architecture and are trained on DailyDialog domain data. Additionally, RUBER is also trained on the DailyDialog++ dataset. The approaches above Random in the table do not require training. Each approach predicts a score for the set of 1500 responses created using a set of generative and retrieval models, as detailed in section 4.2.2. The Sentence-Bert model used in the Semi-hard sampling scheme is fine-tuned on the datasets used in this paper.
For the Mask-and-fill approach, C = {C_1, C_2, . . . , C_h} represents a context with h utterances, r the response, and {b_l}_{l=1}^{B} the tokens blanked in the response. [eot] is used to indicate the end of a turn. To generate a set of 5 adversarial responses in the Mask-and-fill approach, we first create 4 masked versions of every utterance related to the context (R_g, U_c and R_e). The ILM model then generates 4 infills per masked utterance, so each utterance gets 16 different modified versions. All these modified utterances are then ranked using the lm-scorer library and we select the top 5. BM25 similarity is used to create the retrieved response set.

Table 6: Outputs from different approaches for negative response set creation. Random responses are unrelated to the contexts. Mask-and-fill and Key-sem approaches create responses which are highly similar to the content of the contexts, and hence the model needs to learn factors important for response coherence and appropriateness such as the presence of correct entities, time expressions, strategies and others.
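The 4 × 4 candidate pipeline described above can be sketched generically as below; the masking, infilling and scoring functions are passed in, and in the full system they would be the masking function, the ILM model and lm-scorer respectively:

```python
def generate_candidates(utterance, mask_fn, infill_fn, score_fn,
                        n_masks=4, n_infills=4, top_k=5):
    """Create n_masks masked versions of an utterance, infill each
    n_infills times (16 candidates by default), rank them with a
    fluency score and keep the top_k."""
    candidates = []
    for _ in range(n_masks):
        masked = mask_fn(utterance)
        for _ in range(n_infills):
            candidates.append(infill_fn(masked))
    candidates.sort(key=score_fn, reverse=True)
    return candidates[:top_k]
```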
For the Keyword-guided approaches, the model is given as input the context C, keywords from the ground truth response K, and the ground truth response r, as shown in Figure 2. For all approaches, during training, positive responses and negative responses are interleaved, i.e. each positive response is followed by one random and one adversarial response.