Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation

Multiple different responses are often plausible for a given open domain dialog context. Prior work has shown the importance of having multiple valid reference responses for meaningful and robust automated evaluation. In such cases, common practice has been to collect more human-written references. However, such collection can be expensive, time consuming, and not easily scalable. Instead, we propose a novel technique for automatically expanding a human-generated reference to a set of candidate references. We fetch plausible references from knowledge sources and adapt them so that they are more fluent in the context of the dialog instance in question. More specifically, we use (1) a commonsense knowledge base to elicit a large number of plausible reactions given the dialog history, and (2) relevant instances retrieved from a dialog corpus, using similar past as well as future contexts. We demonstrate that our automatically expanded reference sets lead to large improvements in correlations of automated metrics with human ratings of system outputs on the DailyDialog dataset.


Introduction
Evaluation by human annotators perhaps gives the best insight into the quality of machine-generated natural language outputs. However, it can be expensive and time consuming. Much focus has therefore been on automated evaluation methods which correlate well with human evaluations. Automated metrics such as BLEU (Papineni et al., 2002) work well for tasks such as machine translation, but often correlate poorly with human ratings in tasks such as open domain dialog, which admit a wide variety of valid responses for a given context, often due to the small number of human-written references (Zhao et al., 2017; Sai et al., 2020b). Prior work (Sugiyama et al., 2019; Gupta et al., 2019) has demonstrated that having multiple valid references for the same context leads to automated metrics that correlate better with human judgements of appropriateness. However, collecting human-written responses is difficult to scale, can be costly, and may fail to cover a large variety of correct responses (Celikyilmaz et al., 2020).

* VG and HJ contributed equally to this paper. Order decided by coin flip.
Code and data are available at https://github.com/harsh19/Diverse-Reference-Augmentation/

Figure 1: We propose automatic ways to collect references without any crowd-sourcing, through two types of knowledge sources: commonsense and retrieved instance knowledge, followed by automated adaptation to make them more fluent in the target contexts.
In this work, we automatically extract a large number of diverse references to be used with such reference-based metrics, without resorting to expensive crowd-sourcing. Intuitively, since open-domain dialog pertains to everyday life, its utterance text tends to re-instantiate from a large but limited pool of situations (Schank, 1972), e.g., friends debating politics, with variation only in some details, e.g., the country discussed. Hence, knowledge encapsulating a wide scope of situations can serve as one starting point to automatically seed a set of diverse references. We first fetch plausible candidates from two types of knowledge sources (Figure 1). Such knowledge sources provide ready and easy access to a large number of potentially appropriate and diverse references. However, not all retrieved instances may be directly useful. As such, to achieve more fluent references, we propose techniques to adapt the candidate references based on the context (e.g., change the country being discussed). Note that since we are interested in creating references only for evaluating the appropriateness of system outputs, our techniques can rely on broader data sources than dialog models can. For example, we use the future context and the human-written reference for retrieval, while a dialog model cannot.
Our contributions are as follows: (1) We propose a method for automated reference set augmentation for automated dialog evaluation. Compared to collecting more human-written responses, our approach is inexpensive and scalable, and fetches a diverse set of references.
(2) We observe high correlations of various automated metrics with human ratings when the proposed reference augmentation is applied to the test split of the DailyDialog dataset (Li et al., 2017). We additionally observe that paraphrasing, a popular data augmentation technique, performs much worse. (3) We employ a novel use of commonsense knowledge and dialog corpus instances, and unsupervised techniques for adapting retrieved references into more fluent forms.

Figure 1 shows an overview of our proposed methodology. We first fetch plausible candidates from two types of knowledge sources. Thereafter, the retrieved candidate references are adapted so that they are fluent in the target context. We refer to our proposed method as SCARCE (SCalable Automated Reference Construction for Evaluation).

Knowledge Sources
Pre-trained Commonsense Model Much open domain dialog is based on everyday matters. We posit that extracting inferences about a situation using a commonsense knowledge base could be useful in identifying a wide variety of plausible reactions for a given dialog context. For example, a person making arrangements for an event might receive thanks from others (Figure 1). We utilize COMET (Bosselut et al., 2019), an off-the-shelf commonsense knowledge model built on either the ATOMIC (Sap et al., 2019a) or the ConceptNet (Speer et al., 2017) corpus, which can be used to elicit commonsense inferences.
COMET-ATOMIC provides inferences on cause-effect interrelations between events, covering nine relation types such as oReact (the reaction of others to the event) and oWant (inferences about what the receiver of the event wants). Given an utterance from the previous speaker, we draw up to 5 inferences for each of the oEffect, oReact, and oWant relation types to construct plausible references for the target response. For example, for the utterance 'I will make the arrangements. It will be great.', one of the inferences corresponding to oEffect is 'feel excited', depicting a plausible state of the next dialog speaker. However, such outputs are typically phrases, and we discuss their transformation into fluent sentences under Context Adaptation below. Similarly, we use inferences pertaining to the 'CausesDesire' and 'HasFirstSubevent' relation types from COMET-ConceptNet.
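To make this step concrete, the sketch below illustrates how such per-relation inferences could be drawn. The `comet` object and its `generate` method are hypothetical stand-ins for an off-the-shelf COMET interface, not the exact API used in this work.

```python
# Illustrative sketch only: `comet` and `comet.generate(...)` are hypothetical
# stand-ins for an off-the-shelf COMET-ATOMIC interface.
ATOMIC_RELATIONS = ["oEffect", "oReact", "oWant"]   # inferences about the other speaker

def commonsense_candidates(previous_utterance, comet, num_per_relation=5):
    """Return (relation, phrase) pairs to be turned into candidate references."""
    candidates = []
    for relation in ATOMIC_RELATIONS:
        # Hypothetical call: decode up to `num_per_relation` inferences per relation.
        phrases = comet.generate(previous_utterance, relation,
                                 num_generate=num_per_relation)
        candidates.extend((relation, phrase) for phrase in phrases)
    return candidates

# e.g., for "I will make the arrangements. It will be great.", this might yield
# [("oEffect", "feel excited"), ("oReact", "thankful"), ("oWant", "to thank them"), ...]
```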
Dialog Corpus Retrieval For a test dialog context under consideration, one is likely to find similar contexts in some of the training dialogues, given a sufficient number of them. Using retrieval, we can identify such contexts and use their responses as pseudo-references for the test-time response. Specifically, for retrieval we use the BM25 similarity function $S_{bm25}(x, y)$ (Robertson et al., 1995), computed over the past context, the human-written reference, and the future context of the turn under evaluation (details in Appendix C). Our approach is related to Galley et al. (2015), who propose the ∆-BLEU measure, which uses retrieval to produce pseudo-references. However, unlike ours, their method requires annotator quality scores to weight the pseudo-references during evaluation. Moreover, though we utilize retrieval for evaluation, methods of this kind have found success in many generation setups (Li et al., 2018; Peng et al., 2019; Khandelwal et al., 2019). Besides being automatic, our method differs from the above in that it explores the added utility of future information for retrieval. For instance, for the dialog shown in Figure 1, besides matching "Great!" in the response, our retrieval benefits from matching "cool" in the future context.
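The following sketch shows one way such retrieval could be implemented. The paper uses gensim's BM25 (Appendix C); here the rank_bm25 package is an assumed substitute, and the data layout (dicts with 'past', 'response', and 'future' fields) is an illustrative assumption.

```python
# Minimal retrieval sketch; rank_bm25 stands in for gensim's BM25 implementation.
from rank_bm25 import BM25Okapi

def build_index(texts):
    return BM25Okapi([t.lower().split() for t in texts])

def retrieve_pseudo_references(train_turns, past, reference, future, top_k=5):
    """train_turns: list of dicts with 'past', 'response', and 'future' strings."""
    past_idx = build_index([t["past"] for t in train_turns])
    resp_idx = build_index([t["response"] for t in train_turns])
    fut_idx = build_index([t["future"] for t in train_turns])
    # Combined similarity: match the past context, the human-written reference,
    # and the future context of the turn under evaluation against each candidate.
    scores = (past_idx.get_scores(past.lower().split())
              + resp_idx.get_scores(reference.lower().split())
              + fut_idx.get_scores(future.lower().split()))
    ranked = sorted(range(len(train_turns)), key=lambda i: -scores[i])
    return [train_turns[i]["response"] for i in ranked[:top_k]]
```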

Context Adaptation
We note that commonsense knowledge outputs are incomplete sentences, and we use simple templates to convert them into fluent sentences, e.g., 'feels excited' gets transformed to 'i feel excited' (detailed templates in Appendix B). Further, we note that references from knowledge sources are often not fluent in the target context. For example, 'event' in the retrieved reference shown in Figure 1 can be updated to 'party' to construct a more apt reference. To adapt the retrieved text to better fit the target context, we employ an unsupervised decoding procedure based on the approach of Qin et al. (2020), which uses gradient ascent to search for output text that maximizes (1) fluency with the left context (approximated by the likelihood of the output text under a pretrained GPT-2 model) and (2) similarity to the original text from the knowledge source (approximated by the likelihood of the original text under the output text's token-level word distributions). The method uses a heuristic update procedure to iteratively refine a differentiable proxy for the output text (a sequence of token-level word distributions), while keeping the model parameters fixed. More details can be found in Qin et al. (2020) and in Appendix B.

Experiments
We investigate the extent to which automated metrics on an evaluation dataset correlate with human ratings of system outputs. We use the human ratings collected by Gupta et al. (2019), who gathered utterance-level human ratings using Amazon Mechanical Turk (AMT). They used a collection of 100 dialogue contexts randomly selected from the DailyDialog dataset. The generated responses from various methods were rated for appropriateness (from 1 to 5, with 5 denoting the best) by 5 different AMT workers. They collected and considered outputs from the following methods: CVAE (Zhao et al., 2017), HRED (Serban et al., 2016), Seq2Seq (Vinyals and Le, 2015), Dual-Encoder (Lowe et al., 2015), and human-written responses. We report Spearman rank correlation (Spearman, 1961) and Kendall Tau rank correlation (Kendall, 1938) of human ratings against n-gram overlap metrics such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004), and embedding-based metrics such as cosine similarity of averaged word embeddings (EmbeddingAvg) (Wieting et al., 2016), Skip-Thought embeddings (Kiros et al., 2015), and the precision (BERT-Prec) and recall (BERT-Rec) components of BERTScore (Zhang et al., 2020).
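As a concrete illustration of this evaluation protocol, the sketch below scores each system output against its (augmented) reference set and correlates the metric scores with human ratings. The metric shown is multi-reference BLEU via NLTK; the toy data are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of multi-reference scoring and rank correlation (toy data).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr, kendalltau

smooth = SmoothingFunction().method1

def multi_ref_bleu(references, hypothesis):
    """BLEU-4 of a hypothesis against a set of references."""
    refs = [r.lower().split() for r in references]
    return sentence_bleu(refs, hypothesis.lower().split(),
                         smoothing_function=smooth)

system_outputs = ["i feel excited about the party",
                  "sorry , i am busy that day",
                  "what do you mean"]
reference_sets = [["thank you for arranging it", "i feel excited"],
                  ["that is a pity", "maybe next time then"],
                  ["see you at the party", "sounds great"]]
human_ratings = [4.2, 3.1, 1.8]          # e.g., mean appropriateness from AMT

metric_scores = [multi_ref_bleu(refs, out)
                 for refs, out in zip(reference_sets, system_outputs)]
print("Spearman:", spearmanr(metric_scores, human_ratings).correlation)
print("Kendall tau:", kendalltau(metric_scores, human_ratings).correlation)
```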
We compare the correlations across the following reference settings: SINGLE (the single human-written reference), MULTI (up to 5 human-written references per context), PARAPHRASE (automatic paraphrases of the human reference), and SCARCE-SINGLE (our automatic augmentation applied on top of the single human reference) (Table 1). Rank correlations across most of the metrics are better for SCARCE-SINGLE than for MULTI, even though the former uses only a single human-written reference while the latter uses up to 5 human-written references. Additionally, we observe that PARAPHRASE produces little or no improvement in correlations with human ratings (Table 1). We posit that for a given response, alternate responses constitute a strictly richer subspace than that of response paraphrases, which tend to be lexico-syntactically variant but semantically invariant.
Analyzing the impact of various components: To understand the impact of various components, we report Spearman rank correlation scores for the BLEU-4 and BERT-Prec metrics with several variants of SCARCE-SINGLE (Table 2); qualitative examples are listed in Table 6.

Quality of Auto-generated References:
We check the quality of SCARCE references by recruiting human annotators, showing them each reference along with the dialog context, and asking them to tag the reference as appropriate, neutral, or not appropriate with respect to the dialog context. We randomly select 150 responses each from SCARCE and MULTI for this purpose. We observe that about 29% of the references from SCARCE (fully automatically generated) were annotated as not appropriate, compared to 7% for MULTI, demonstrating the fair quality of augmented responses from SCARCE (additional details and results in the Appendix). We do note that the references marked as not relevant/appropriate can often be tweaked easily by a human to transform them into valid responses, demonstrating the possibility of exploring human-in-the-loop setups along with SCARCE to collect even better references.

Discussion
Transferability to more languages: Transferability of our approach to more languages is one aspect that merits discussion. While commonsense resources aren't readily available in all languages, a workaround can be to use off-the-shelf MT to translate into English before querying the English versions of the commonsense resources, and then translate the retrieved information back. Furthermore, we note that while commonsense knowledge was useful, removing the COMMONSENSE component and relying on retrieval alone causes only a relatively modest drop in performance (see Table 2). Thus, for languages lacking commonsense resources, one may still attain good gains in reference-based evaluation by retrieving and adapting from a dialog corpus alone.
Reference-less metrics: We note that while a comparison of the proposed approach against reference-free metrics (Lowe et al., 2017; Tao et al., 2017) would be interesting, the focus of the current work is on improving reference-based evaluation via unsupervised reference augmentation. While reference-less metrics offer the convenience of working with zero or a very small number of references, reference-based metrics can be advantageous on several fronts. Reference-based evaluation can be more interpretable in certain situations, by identifying the reference which matches a given system output most closely. Reference-based evaluations also allow for easy incorporation of additional references; in contrast, many learned model-based metrics require retraining if additional annotations become available.

Related Work
Prior work explores many ways to improve over single-reference evaluation without collecting multiple references. Fomicheva et al. (2020) obviate the need for multiple references in MT by generating many "alt-hypotheses" via test-time dropout from the same model. Sai et al. (2020a) and Gupta et al. (2019) collect additional manually annotated responses for dialog contexts. Compared to them, our method of automatically collecting additional references is more scalable.
Automatic data augmentation in NLP has largely been used for increasing training data (Feng et al., 2020; Wei and Zou, 2019; Feng et al., 2021). In this work, we use retrieved dialog instances and a commonsense knowledge base to augment the reference set for a given dialog context. ∆-BLEU (Galley et al., 2015) and uBLEU (Yuma et al., 2020) also use retrieval to produce pseudo-references for dialog response evaluation. Compared to ∆-BLEU and uBLEU, our work differs in that we utilize a commonsense knowledge base and perform contextual adaptation. Prior work in dialog response generation has explored the use of commonsense knowledge bases (Majumder et al., 2020) as well as retrieval (Song et al., 2016; Majumder et al., 2021); in contrast, our focus is on augmenting the reference set for improving evaluation.
Automatic model-based metrics like ADEM (Lowe et al., 2017) and RUBER (Tao et al., 2017) instead learn to score responses given the context; as discussed above, our focus is on improving reference-based evaluation rather than on such learned, reference-free metrics.

Conclusion
In this work, we demonstrate how existing knowledge sources can be used to construct a diverse set of references in an automated and scalable manner. Automated metrics computed with the resulting reference sets show high correlation with human ratings of system outputs.
In the future, we plan to incorporate other commonsense types into SCARCE, such as social (Sap et al., 2019b) and moral (Forbes et al., 2020) commonsense. We also hope to explore human-in-the-loop setups which build on SCARCE to collect even better references.

Ethics Statement
Our preference ratings are collected over source content from an already existing, publicly available, and widely used dataset, i.e., DailyDialog (Li et al., 2017). We neither solicit, record, nor request any kind of personal or identity information while collecting our ratings. Our work primarily performs experiments on dialog in English (Bender and Friedman, 2018). Dialog models are known to suffer from biases learnable from dialog training data, such as gender bias (Dinan et al., 2019). However, our work and contribution does not present or release any new models or model checkpoints, and is primarily focused on making existing evaluation setups better through automated collection of larger reference sets.

A Additional Results

A.1 Correlations with p-values
Table 4 shows Spearman rank correlation scores with p-values.

A.2 Quality Assessment based on RUBER
As a second, automated way of ascertaining response quality, we use the unreferenced part of the RUBER metric (Tao et al., 2017), which uses a pretrained model to score the quality of responses based on the context alone. Here, we use the RUBER checkpoint from Sai et al. (2020a) (available at tinyurl.com/ynqd54tt), which first pretrains on a large Reddit dataset, followed by finetuning on DailyDialog. SINGLE and MULTI have a quality of ≈ 0.72, while for RETRIEVAL the value is 0.63. COMMONSENSE is found to have the highest quality at 0.82, surpassing even MULTI.

A.3 Diversity of References
We investigate the diversity of the references by computing self-BLEU scores (Zhu et al., 2018) among references from PARAPHRASE vs. SCARCE. For a fair comparison, we randomly choose 4 references from the corresponding method. We observe a self-BLEU-4 score of 0.36 for PARAPHRASE compared to only 0.13 for SCARCE (note that lower self-BLEU denotes more diversity).
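A minimal sketch of the self-BLEU computation is shown below, assuming NLTK sentence-level BLEU-4 with smoothing; it mirrors the spirit of Zhu et al. (2018) rather than reproducing their exact implementation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(references):
    """Average BLEU of each reference against the remaining ones (lower = more diverse)."""
    smooth = SmoothingFunction().method1
    toks = [r.lower().split() for r in references]
    scores = [sentence_bleu(toks[:i] + toks[i + 1:], hyp, smoothing_function=smooth)
              for i, hyp in enumerate(toks)]
    return sum(scores) / len(scores)

print(self_bleu(["i feel excited", "thank you so much",
                 "that sounds great", "i am so happy for you"]))
```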

B Additional Details on Context Adaptation
B.1 Templates to Convert Knowledge Base Outputs to Full Sentences
Table 5 lists the set of templates and rules used to transform semi-structured COMET outputs to surface natural language forms.
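For illustration, a minimal sketch of such rule-based conversion is shown below. The mappings are assumptions in the spirit of Table 5 (e.g., 'feels excited' becomes 'i feel excited'), not the exact rules used.

```python
# Illustrative template rules; the actual templates are listed in Table 5.
def comet_phrase_to_sentence(relation, phrase):
    phrase = phrase.strip().lower()
    if relation == "oReact":                      # reaction of the next speaker
        phrase = phrase.replace("feels ", "").replace("feel ", "")
        return f"i feel {phrase} ."
    if relation == "oWant":                       # what the next speaker wants
        phrase = phrase[3:] if phrase.startswith("to ") else phrase
        return f"i want to {phrase} ."
    return f"i {phrase} ."                        # e.g., oEffect and other relations

print(comet_phrase_to_sentence("oReact", "feels excited"))   # -> "i feel excited ."
```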

B.2 Unsupervised Decoding Procedure For Context Adaptation
We use the authors' own implementation of their DELOREAN decoding algorithm (Qin et al., 2020), available at tinyurl.com/2lqp9z6s. We use the default hyperparameters from their implementation, which uses the non-finetuned gpt2-medium checkpoint as the LM atop which the unsupervised, gradient-based decoding procedure is run. Note that the model parameters are not updated in any way; the gradient computations and updates here happen w.r.t. the states, or more specifically, the state activations. The authors propose an iterative procedure wherein they alternately perform forward and backward passes.
In the forward pass, the current output Y is updated as per the likelihood of the underlying decoder.
In the backward pass, the output is updated to be as similar as possible to the sentence Z from the knowledge source using back-propagation. However, since Y is discrete, we maintain a soft representation $\tilde{Y}$ of the output Y, wherein $\tilde{Y}_i$ represents the logits at the $i$-th position as per the underlying decoder. The backward and forward passes of the iterative procedure are as follows: (1) In the backward pass, we update the logits based on the gradients $\nabla_{\tilde{Y}} \mathcal{L}(\tilde{Y}^{(t-1)}, Z)$ of a content-matching loss function, giving backward logits $\tilde{y}^{\,b}_t$. (2) Next, we perform a forward pass using the underlying decoder for steps 1 to N. During the forward pass at step t, we compute the logits $\tilde{y}^{\,f}_t$ based on the left context, i.e., X and $\tilde{Y}_{<t}$. We then take a weighted average of the forward and backward logits at step t to arrive at the final logits used for the next time step of the forward pass.
$\tilde{Y}$ is initialized by performing a forward pass conditioned only on X with greedy decoding. We alternately perform backward and forward passes till convergence. The final response is obtained from the resulting logits $\tilde{Y}$.
Specifically, we use their "counterfactual" setup, where an ending $e_{old}$ is adapted from its old context $c_{old}$ to an altered, new context $c_{new}$, generating a new, predicted ending $\hat{e}_{new}$. In our case, $c_{new}$ is the dialog context for the turn under evaluation, $d^{past}_t$.
In the RETRIEVAL case, $c_{old}$ is the context of the retrieved candidate turn, $x^{past}_t$. For the COMMONSENSE case, $c_{old}$ is also our current context, i.e., the same as $c_{new}$; we are simply attuning the already-drawn inference better to the current context.
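For concreteness, here is a minimal sketch of the alternating forward/backward refinement described above, built on Hugging Face GPT-2. The loss form, mixing weight, fixed output length, and iteration count are simplifying assumptions; this is not the authors' DELOREAN implementation.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

def adapt(context, source_text, n_iters=5, n_tokens=20, mix=0.5, step=1.0):
    """Adapt `source_text` (Z) so that it reads fluently after `context` (X)."""
    x_ids = tokenizer(context, return_tensors="pt").input_ids       # left context X
    z_ids = tokenizer(source_text, return_tensors="pt").input_ids   # knowledge text Z
    vocab = model.config.vocab_size

    # Initialize the soft output logits Y~ via a greedy forward pass conditioned on X.
    with torch.no_grad():
        out = model.generate(x_ids, max_new_tokens=n_tokens, do_sample=False)
    y_soft = 5.0 * F.one_hot(out[0, x_ids.size(1):], vocab).float().unsqueeze(0)
    y_soft.requires_grad_(True)

    for _ in range(n_iters):
        # Backward pass: push Y~ toward Z via a content-matching loss
        # (cross-entropy of Z's tokens under Y~'s distributions, truncated).
        L = min(y_soft.size(1), z_ids.size(1))
        loss = F.cross_entropy(y_soft[0, :L], z_ids[0, :L])
        grad, = torch.autograd.grad(loss, y_soft)
        y_backward = y_soft - step * grad

        # Forward pass: recompute each position's logits from GPT-2 given X and
        # the soft prefix, then mix forward and backward logits.
        with torch.no_grad():
            x_emb = model.transformer.wte(x_ids)
            y_emb = torch.softmax(y_backward, dim=-1) @ model.transformer.wte.weight
            logits = model(inputs_embeds=torch.cat([x_emb, y_emb], dim=1)).logits
        y_forward = logits[:, x_ids.size(1) - 1:-1, :]   # predictions for Y positions
        y_soft = (mix * y_forward + (1 - mix) * y_backward).detach().requires_grad_(True)

    return tokenizer.decode(y_soft.argmax(dim=-1)[0], skip_special_tokens=True)

# e.g., adapt("A: I will make the arrangements for the party. B:",
#             "Thank you for organizing the event.")
```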

C Retrieval Similarity Function - Details
Consider a dialog $d$, broken up by turns as $\{C_1, \ldots, C_t, C_{t+1} = d^{resp}_t, C_{t+2}, \ldots, C_T\}$, where $t+1$ denotes the turn currently under evaluation. For the context-response pair $(C_{1:t}, r_t)$ to be evaluated, we retrieve pseudo-references based on a combination of (a) the past context $d^{past}_t = \{C_{t-L_b}, \ldots, C_t\}$, (b) the reference response $d^{resp}_t = C_{t+1}$, and (c) the future context $d^{fut}_t = \{C_{t+2}, \ldots, C_{t+2+L_f}\}$, where $L_b$ and $L_f$ are the past and future context windows. Our retrieval similarity function is a sum of the log scores between each corresponding element of the turn under evaluation and the candidate turn.
We set $L_b = L_f = 2$ without specific tuning, as an intuitive tradeoff between enough specificity and enough possibility of relevant candidates. BM25 (Robertson et al., 1995), or "Best Match 25", is a tf-idf-like similarity. Its specific form is:
$$ S_{bm25}(q, d) = \sum_{i \in q} \log\left(\frac{N - df_i + 0.5}{df_i + 0.5}\right) \cdot \frac{tf_i \,(k_1 + 1)}{tf_i + k_1 \left(1 - b + b \,\frac{dl}{avdl}\right)} $$
Here, $tf_i$ and $df_i$ are the term frequency in the current document and the document frequency in the corpus. $N$ is the corpus size, while $dl$ and $avdl$ are the current and average document lengths. $b$ controls the extent of document length normalization, while $k_1$ controls the effect of term frequency. With $b = 0$ and $k_1 \rightarrow \infty$, this reduces to simple tf-idf. Here, we use the default gensim values, $b = 0.7$, $k_1 = 0.5$.
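A direct, illustrative implementation of this scoring function is sketched below; corpus statistics (document frequencies, corpus size, average document length) are assumed to be precomputed, and the defaults mirror the values stated above.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, df, N, avdl, k1=0.5, b=0.7):
    """Score `doc_tokens` against `query_tokens` using the BM25 formula above.
    df: dict of document frequencies; N: corpus size; avdl: average doc length."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        if term not in tf:
            continue
        idf = math.log((N - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
        sat = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avdl))
        score += idf * sat
    return score
```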

D Qualitative Examples
In Table 6, we list qualitative examples of references constructed by SCARCE.

E Human Annotation Details

The quality of the references was judged by two graduate students from a university where the medium of instruction is English. The annotators were requested to ignore minor grammar issues and focus more on the content of the response.

F Computing Details
The