Understanding and Improving the Exemplar-based Generation for Open-domain Conversation

Exemplar-based generative models for open-domain conversation produce responses based on exemplars provided by a retriever, taking advantage of both generative models and retrieval models. However, due to the one-to-many problem of open-domain conversation, they often ignore the retrieved exemplars while generating responses or produce responses over-fitted to the retrieved exemplars. To address these drawbacks, we introduce a training method that selects exemplars that are semantically relevant to the gold response but lexically distanced from it. In the training phase, our method first uses the gold response instead of the dialogue context as a query to select exemplars that are semantically relevant to the gold response. It then eliminates the exemplars that lexically resemble the gold response to alleviate the dependency of the generative model on those exemplars. The remaining exemplars could be irrelevant to the given context, since they are retrieved based on the gold response. Thus, our method further utilizes the relevance scores between the given context and the exemplars to penalize the irrelevant ones. Extensive experiments demonstrate that our proposed training method alleviates the drawbacks of existing exemplar-based generative models and significantly improves performance in terms of appropriateness and informativeness.


Introduction
Exemplar-based generative models (Wu et al., 2019; Cai et al., 2019b; Gupta et al., 2021) for open-domain conversation combine a retrieval model (Humeau et al., 2019; Mazare et al., 2018; Kim et al., 2021) and a generative model (Adiwardana et al., 2020; Roller et al., 2021; Zhang et al., 2020; Brown et al., 2020) into a single framework to generate responses in two steps: (1) the retriever searches for an exemplar using the given context as a query, and (2) the generator produces a response based on the given context and the retrieved exemplar. Exemplar-based generative models produce more specific responses than vanilla generative models while being more fluent than retrieval models.

[Figure 1: Responses generated by the three exemplar-based generative models. RetNRef ignores the exemplar during response generation, RetNRef_α generates a response highly over-fitted to the exemplar, and RetNRef trained with our training method (CORGE) utilizes the exemplar well to produce a more fluent response than the others.]
Despite their success, exemplar-based generative models have two major shortcomings. Primitive exemplar-based generative models (Cai et al., 2019a) tend to entirely ignore the exemplars and produce responses similar to those of vanilla generative models. This is due to the one-to-many problem (Li et al., 2016), where there are many possible responses for each dialogue context. During the training phase, the retrieved exemplar is not helpful for generating the gold response when the exemplar retrieved for the given context is significantly different from the gold response.
This leads exemplar-based generative models to ignore the exemplar while generating responses, as shown in Figure 1(a). To address this issue, recent exemplar-based generative models utilize the gold response (Roller et al., 2021) or a slightly perturbed gold response (Cai et al., 2019b) as the exemplar in the training phase. However, these training methods cause the generator to rely heavily on the retrieved exemplar, i.e., the generator resorts to copying the provided tokens, as shown in Figure 1(b). These two disadvantages of existing exemplar-based generative models can adversely affect the quality of the generated response.

Therefore, we introduce CORGE (COnnecting Retriever and GEnerator), a simple training method for exemplar-based generative models that accounts for the one-to-many problem of open-domain conversation. Inspired by Wu et al. (2019), CORGE first utilizes the gold response instead of the dialogue context as the query for the retriever to select exemplars that are similar to the gold response. These retrieved exemplars ensure that exemplar-based generative models utilize their semantics while generating the gold response in the training phase. Since the exemplars are retrieved with the gold response as the query, some of them are lexically identical or too similar to the gold response; such exemplars lead exemplar-based generative models to depend heavily on the exemplar. Thus, CORGE then eliminates exemplars based on their distance to the gold response to alleviate this dependency. Here, we employ Jaccard similarity to measure the distance (Guu et al., 2018; Cai et al., 2019a; Wu et al., 2019). However, as the selected exemplars depend solely on the gold response, some of them may be irrelevant to the given context, which results in exemplar-based generative models still ignoring the retrieved exemplar.
To solve this, CORGE utilizes the relevance scores between the context and the exemplars to reward relevant exemplars and penalize exemplars irrelevant to the given context. Extensive experiments show that CORGE is generally applicable to existing exemplar-based generative models and improves the quality of generated responses in terms of appropriateness and informativeness.
Our main contributions: (1) We analyze the shortcomings of existing exemplar-based generative models derived from the nature of open-domain conversation, the one-to-many problem.
(2) We introduce a training method (CORGE) to improve the quality of generated responses by selecting useful exemplars and weighting the exemplars by relevance scores assessed by the retriever.
(3) Through the human evaluation, we demonstrate that CORGE significantly improves the performance of exemplar-based generative models in terms of appropriateness and informativeness.
2 Related Work

Exemplar-based Generation
While generative models have shown remarkable performance on open-domain conversation, it is well known that they tend to yield uninformative and bland responses (Li et al., 2016; Liu et al., 2016; Serban et al., 2017; Li et al., 2020; Holtzman et al., 2019; Welleck et al., 2019). Exemplar-based generative models are introduced to overcome this problem. Wu et al. (2019) introduce an exemplar-based generative model for open-domain conversation, which retrieves a context-exemplar pair conditioned on the input context and encodes the lexical difference between the input context and the retrieved context into an edit vector. The response is produced by feeding the exemplar and the edit vector to the generator. Roller et al. (2021) also retrieve the exemplar using the given context as a query, concatenate the exemplar with the context, and then feed the concatenation into the generator to produce the final response for open-domain conversation. Cai et al. (2019a,b) propose a method that removes irrelevant information from the exemplar, then uses the masked exemplar to inform the generator while producing the response. Gupta et al. (2021) condition the generator on the retrieved exemplars and the extracted semantic frames of the exemplars, which improves the coherence of generated responses. We do not consider this model as a baseline because it requires an additional semantic frame extractor, and it can be mutually complemented with our proposed training method.

Knowledge-grounded Generation
Knowledge-grounded generative models that utilize retrieved results (e.g., relevant documents from Wikipedia) to generate informative responses have been proposed for knowledge-intensive NLP tasks (e.g., open-domain question answering). Knowledge-grounded generation has a similar form to exemplar-based generation. However, the main difference is that knowledge-grounded generative models extract knowledge from external resources to generate informative responses. Guu et al. (2020) show the effectiveness of pre-training a knowledge retriever with a large-scale language model for open-domain question answering, and Lewis et al. (2020) demonstrate that knowledge-grounded generative models produce more informative and diverse sentences than vanilla generative models on a wide range of knowledge-intensive NLP tasks. Fan et al. (2021) similarly propose a knowledge-grounded generative model for response generation, but they do not focus on open-domain conversation. In the Method section, we describe the difference between our approach and knowledge-grounded generative models, and in the Experiments section, we show that existing knowledge-grounded generative models are not directly applicable to open-domain conversation.

[Figure 2: Illustration of the drawbacks of existing exemplar-based generative models. The black dotted line indicates the boundary of the exemplars relevant to the given context.]

Exemplar-based Generation
Let D = {(c_i, r_i) | 1 ≤ i ≤ n} denote the dialogue dataset, which consists of n pairs of context c and response r. Exemplar-based generative models are composed of two components: a retriever R and a generator G. For a given context c_i, the retriever finds the top-scoring exemplar based on the relevance score S_R(z, c_i) of the exemplar z ∈ ℛ, where ℛ is a pre-defined response set. The generator computes the probability of the response for the context c_i while utilizing the exemplar z as P_G(r | c_i, z).
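The two-step procedure above can be sketched in code. The following is a minimal, self-contained illustration; `score_relevance`, `retrieve_top1`, and `generate` are hypothetical stand-ins for the retriever R and generator G (the actual models are a Bi-encoder and Blender), and the toy word-overlap score merely plays the role of S_R(z, c).

```python
# Sketch of the two-step exemplar-based generation loop.
# All names here are illustrative placeholders, not the paper's models.

def score_relevance(exemplar, context):
    """Toy relevance score S_R(z, c): word-overlap count (stand-in for a Bi-encoder)."""
    return len(set(exemplar.lower().split()) & set(context.lower().split()))

def retrieve_top1(response_set, context):
    """Step (1): the retriever picks the top-scoring exemplar z from the response set."""
    return max(response_set, key=lambda z: score_relevance(z, context))

def generate(context, exemplar):
    """Step (2): the generator conditions on both context and exemplar (stubbed here)."""
    return f"<response conditioned on context: {context!r} + exemplar: {exemplar!r}>"

response_set = ["I love hiking in the mountains.",
                "I work with lions at the zoo.",
                "My favorite food is pizza."]
context = "What animals do you work with at the zoo?"
exemplar = retrieve_top1(response_set, context)
response = generate(context, exemplar)
```

The same interface is reused throughout: CORGE only changes which exemplars the generator sees at training time, not this inference-time pipeline.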

Drawbacks of Existing Exemplar-based Generative Models

As mentioned in Roller et al. (2021), the primitive exemplar-based generative model tends to ignore the retrieved exemplar during response generation due to the one-to-many problem in open-domain conversation (Li et al., 2016). Since its retriever searches for an exemplar based on the given context, the retrieved exemplar is often significantly different from the gold response, although both the retrieved exemplar and the gold response are relevant to the given context, as shown in Figure 2(a). As the retrieved exemplar is not helpful for generating the gold response, the generator is trained to ignore it and to produce a response using only the given context. To induce the generator to utilize retrieved exemplars more actively, Roller et al. (2021) use the gold response, and Cai et al. (2019b) use a perturbed gold response, as the exemplar during model training. However, since the exemplar z_i and the gold response r_i are then too similar (as shown in Figure 2(b)), the exemplar-based generative model learns to over-rely on the exemplar. Eventually, the generator produces a response highly over-fitted to the exemplar by directly copying its tokens.

Method
We hypothesize that selecting exemplars that are semantically relevant to but lexically distanced from the gold response could resolve the drawbacks above. To validate this hypothesis, we introduce a training method for exemplar-based generative models, called CORGE. Our proposed training method is illustrated in Figure 3, and illustrative examples of the exemplars selected by CORGE are shown in Table 1.

Selecting Exemplars Semantically Relevant to but Lexically Distanced from the Gold Response
We describe how CORGE selects exemplars that are semantically relevant to but lexically distanced from the gold response. Conventionally, the retriever selects exemplars z based on the relevance score S_R(z, c_i) for the given context c_i. However, this search can return an exemplar z significantly different from the gold response r_i, which induces the generator G to ignore the retrieved exemplar during response generation. Therefore, inspired by Wu et al. (2019), we select exemplars based on the gold response r_i to ensure that the generator G utilizes them. We select the top-k scoring exemplars based on the score S_R'(z, r_i), which we call the k-Nearest Exemplars (kNE).¹ These kNE are more semantically related to the gold response r_i than the exemplar obtained using S_R(z, c_i). However, some of the selected kNE are unintentionally lexically identical or too close to the gold response, since the retriever searches for the exemplars based on the gold response. We observe that using these exemplars also causes the over-fitting problem, where the generator excessively copies tokens from the exemplars. This motivates us to filter out the exemplars that are lexically too close to the gold response and preserve the properly distanced ones. Here, we employ Jaccard similarity to measure the lexical similarity (Guu et al., 2018; Cai et al., 2019a; Wu et al., 2019) between the exemplar and the gold response. Exemplars are filtered out when their Jaccard similarity with the gold response r_i is larger than 0.6, and we replace them with randomly chosen responses from the pre-defined response set ℛ. The filtering threshold of 0.6 is chosen empirically. The set of final exemplars obtained through these steps is referred to as Z_i = {z_{i,1}, z_{i,2}, ..., z_{i,k}}.
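The Jaccard filtering step can be sketched as follows. This is an illustrative simplification assuming whitespace tokenization; `filter_knes` and the fixed random seed are our own hypothetical choices, not details from the paper.

```python
import random

def jaccard(a, b):
    """Token-level Jaccard similarity between two utterances."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def filter_knes(knes, gold_response, response_set, threshold=0.6, seed=0):
    """Replace exemplars that are lexically too close to the gold response
    (Jaccard similarity > threshold) with random responses from the response set,
    mirroring the filtering rule described above."""
    rng = random.Random(seed)
    filtered = []
    for z in knes:
        if jaccard(z, gold_response) > threshold:
            # Too similar to the gold response: swap in a random response.
            filtered.append(rng.choice(response_set))
        else:
            filtered.append(z)
    return filtered
```

Exemplars that survive the filter remain semantically close to the gold response (they were retrieved by S_R'(z, r_i)) but differ enough lexically that the generator cannot simply copy them.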

Weighting the Selected Exemplars based on the Relevance Score
As we select the exemplars based entirely on the gold response, some of the kNE could be relevant to the gold response r_i but irrelevant to the given context c_i. Therefore, we condition the generator on the relevance score of each kNE to reward relevant exemplars and penalize irrelevant ones. Using the retriever R, we calculate the relevance score S_R(z_{i,j}, c_i) for each selected exemplar z_{i,j}, then apply the softmax function to the relevance scores to obtain the normalized relevance scores P_R(z_{i,j}, c_i).

¹ Note that S_R(z, c) and S_R'(z, r_i) use the same retriever, but they are computed differently. Please refer to the Supplementary Materials for how we calculate S_R'(z, r_i) and S_R(z, c).
Then we replace the traditional likelihood with the weighted likelihood using the normalized score.
Our final training objective is to minimize the loss

L_i = − Σ_{j=1}^{k} P_R(z_{i,j}, c_i) · log P_G(r_i | c_i, z_{i,j}).

The gradient of the generator G is calculated as follows:

∇_G L_i = − Σ_{j=1}^{k} P_R(z_{i,j}, c_i) · ∇_G log P_G(r_i | c_i, z_{i,j}).

This equation demonstrates that the gradient of the generator G is scaled by the normalized relevance score P_R(z_{i,j}, c_i), which means the generator is updated less when the retrieved exemplar z is not relevant to the given context c_i. This procedure helps the model ignore irrelevant exemplars; thus, the generator learns to fetch tokens more easily from exemplars that are relevant to the gold response.

Difference between CORGE and Knowledge-grounded Generative Models

The way of leveraging relevance scores is already employed by knowledge-grounded generative models (Lewis et al., 2020; Sachan et al., 2021) in open-domain question answering. However, there is a significant difference between CORGE and knowledge-grounded generative models. CORGE uses the relevance score P_R(z, c_i) to penalize exemplars z that are irrelevant to the given context c_i, since the exemplars are retrieved by S_R'(z, r_i). Knowledge-grounded generative models instead use it as a latent variable to jointly train the retriever R and the generator G. Notably, knowledge-grounded generative models also tend to ignore the retrieved exemplars due to the one-to-many nature of open-domain conversation when the retriever and generator are jointly trained. On the other hand, we do not jointly train the retriever and the generator, but freeze the retriever while training the generator.

[Table 1: An example of exemplar selection. Input Context: "What kind of animals you take care of?" Gold Response: "I work with a variety of animals. I sometimes work with lions and monkeys." Context Retrieval indicates the exemplar retrieved by using the context as a query, and kNE shows the exemplars selected by using the gold response as a query. Sim measures the lexical similarity between the gold response and the exemplar, and P_R(z, c) indicates the normalized relevance score calculated by the retriever.]
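The weighted-likelihood objective in the Method section can be illustrated with a minimal numeric sketch. Here `log_likelihoods` stands for the per-exemplar terms log P_G(r_i | c_i, z_{i,j}) and `relevance_scores` for S_R(z_{i,j}, c_i); this is not the actual PyTorch training code, only a sketch of the weighting logic.

```python
import math

def softmax(scores):
    """Normalize relevance scores S_R(z_{i,j}, c_i) into P_R(z_{i,j}, c_i)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def corge_loss(log_likelihoods, relevance_scores):
    """Weighted negative log-likelihood: each exemplar's term
    -log P_G(r_i | c_i, z_{i,j}) is scaled by its normalized relevance score,
    so irrelevant exemplars contribute (and back-propagate) less."""
    weights = softmax(relevance_scores)
    return -sum(w * ll for w, ll in zip(weights, log_likelihoods))
```

With equal relevance scores the loss reduces to the mean NLL over exemplars; when one exemplar dominates the softmax, the loss is driven almost entirely by that exemplar's likelihood, which is exactly the gradient-scaling behavior described above.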

Dataset
We utilize the following four datasets used in Roller et al. (2021): Blended Skill Talk (BST) (Smith et al., 2020), ConvAI2 (Zhang et al., 2018), Empathetic Dialogues (ED) (Rashkin et al., 2019), and Wizard of Wikipedia (WoW) (Dinan et al., 2019). To simplify the notation, we denote the concatenation of these four datasets as BST+. We split BST+ into train, validation, and test sets following Smith et al. (2020).

Baselines
Retrieval and Generative Models Bi-encoder 256M (Mazare et al., 2018) and Blender 90M (Roller et al., 2021) serve as the baseline retrieval model and the baseline generative model, respectively. They are also employed as the retriever and the generator of the following exemplar-based generative baselines.
Exemplar-based Generative Models Since our proposed training method is for training exemplar-based generative models, we first consider recent exemplar-based generative models as baselines: RetNRef (Weston et al., 2018), RetNRef_α (Roller et al., 2021), and MatToGen (Cai et al., 2019b).
RetNRef concatenates the retrieved exemplar with the given context as the input of the generator to produce the response. RetNRef_α is the dialogue-retrieval version of RetNRef, which adopts α-blending to avoid simply ignoring the retrieved exemplars (α = 0.5). MatToGen extracts meaningful tokens from the exemplar and provides them to the generator.
To verify the effectiveness of our training method, we apply CORGE to RetNRef and MatToGen in place of their original training methods. These variants are denoted as RetNRef+CORGE and MatToGen+CORGE, respectively.
Knowledge-grounded Generative Models Although RAG (Lewis et al., 2020) and KIF (Fan et al., 2021) are proposed for knowledge-grounded generation tasks, we employ them as baselines since they have a similar form to exemplar-based generative models. Our experiments demonstrate that these knowledge-grounded generative models cannot be directly applied to open-domain conversation.

Evaluation Metrics
To verify the effectiveness of our training method CORGE, we conduct a pair-wise comparison through human evaluation, following prior work. We use two criteria: Appropriateness and Informativeness. Appropriateness measures how fluent, logical, and appropriate the generated response is for the given context. Informativeness measures how much meaningful information relevant to the given context the generated response contains. We use Amazon Mechanical Turk to collect the annotations; more details are described in the Supplementary Material.
We also employ automatic evaluation metrics, Perplexity (PPL), Dist-n, and BLEU (Papineni et al., 2002), to analyze the generated responses of each model. PPL measures how well the model predicts a response based on the given input context; lower PPL indicates that the model predicts the response better. To analyze how much the exemplar-based generative model leverages the retrieved exemplar, we introduce two variants of PPL that use the conditional probability when an exemplar is given: (1) PPL_gold uses the conditional probability P_G(r|c, r), which assumes the gold response is given as the exemplar, and (2) PPL_ret uses the conditional probability P_G(r|c, z), where z is the exemplar retrieved using S_R'(z, r). Lower PPL_gold denotes that the exemplar-based generative model predicts the gold response well when the gold response is given as the exemplar. Lower PPL_ret indicates that the exemplar-based generative model leverages the provided exemplar well to predict the gold response. Dist-n (Li et al., 2016) is the ratio of distinct n-grams to the total number of n-grams over all generated responses, which measures the diversity of the generated responses. BLEU_(z,r) measures the degree of token overlap between the provided exemplar and the generated response pair (z, r). A higher BLEU_(z,r) score indicates that the generator copies more from the provided exemplar while generating the response.
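Dist-n, for instance, can be computed in a few lines. The sketch below assumes whitespace tokenization and pools n-grams over all generated responses, as the definition above describes; the function name is our own.

```python
def dist_n(responses, n):
    """Dist-n: ratio of distinct n-grams to the total number of n-grams
    across all generated responses. Higher means more diverse output."""
    ngrams = []
    for r in responses:
        tokens = r.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

A model that repeats itself ("a b a b") scores 0.5 on Dist-1, while fully distinct output scores 1.0.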

Implementation Details
We provide the details of our implementation in the Supplementary Material. We will release the source code of CORGE for the reproducibility of the conducted experiments.
6 Experimental Results

These evaluation results demonstrate that CORGE leads the existing exemplar-based generative models to produce more fluent and informative responses.

Investigating the Exemplar-based Generative Models with Automatic Metrics
Through the automatic evaluation, we verify that existing exemplar-based generative models ignore the provided exemplar or generate responses over-fitted to it. As shown in Table 3, RetNRef+CORGE and MatToGen+CORGE show lower PPL_ret than Blender 90M, which means that the exemplar-based generative models trained with CORGE make better predictions of the gold response than Blender 90M by utilizing the provided exemplar. RetNRef+CORGE has lower PPL_gold and PPL_ret than RetNRef, which implies that RetNRef+CORGE leverages the provided exemplar better than RetNRef. RetNRef_α has lower PPL_gold than RetNRef+CORGE; however, RetNRef_α has higher PPL_ret than RetNRef+CORGE. This result demonstrates that RetNRef_α does not make good use of the retrieved exemplar except when the gold response is given as the retrieved exemplar. From this observation, we claim that RetNRef_α generates responses highly over-fitted to the selected exemplar, which is caused by utilizing the gold response as the exemplar in the training phase. The same goes for MatToGen, where applying CORGE mitigates the over-fitting issue.
The higher Dist-n of RetNRef+CORGE and MatToGen+CORGE compared to Blender 90M shows that our exemplar-based generative models produce more diverse responses than the vanilla generative model. Moreover, RetNRef+CORGE has higher Dist-n than RetNRef, which shows that utilizing the exemplars helps the generator diversify the responses. Although RetNRef_α is the only model that achieves Dist-n comparable to that of the vanilla retrieval model, Bi-encoder 256M, this stems from over-fitting to the exemplar, considering the gap between its PPL_gold and PPL_ret, and it results in degraded appropriateness and informativeness in the human evaluation.
Average BLEU_(z,r) scores implicitly measure the overlap between the retrieved exemplar and the generated response; thus, a higher BLEU_(z,r) indicates that the generator depends more on the retrieved exemplar. RetNRef shows a negligible BLEU_(z,r) score, which reaffirms that the model barely utilizes the retrieved exemplar. RetNRef_α and MatToGen have higher BLEU_(z,r) scores than RetNRef+CORGE and MatToGen+CORGE, respectively, which verifies that the former depend more on the retrieved exemplar than the latter.

[Figure 4: The standard deviation of the normalized relevance scores over training steps for Ours, Ours + joint, and RAG. The standard deviation gets smaller when the retriever is jointly trained for exemplar-based generative models. Ours stands for RetNRef+CORGE, and joint indicates jointly training the retriever with the generator.]

Incapability of Knowledge-grounded Generative Models in Open-domain Conversation
The automatic evaluation results in Table 3 confirm that knowledge-grounded generative models ignore the exemplar. PPL_gold, PPL_ret, and Dist-n of RAG and KIF are similar to those of Blender 90M, which implies that the exemplars provide no useful information while generating the response. The average BLEU_(z,r) score is also low, indicating almost no overlap between the retrieved exemplars and the generated responses. We attribute these results to the difference between open-domain conversation and knowledge-grounded generation tasks. While training knowledge-grounded generative models, they use P_R(z, c) to fetch external knowledge. However, the generator also ignores the retrieved exemplar due to the one-to-many nature of open-domain conversation.
In addition, we observe that jointly training the retriever with the generator causes the retriever to get stuck in a local minimum. As shown in Figure 4, the standard deviation of the normalized relevance scores P_R(z, c) computed by the retriever approaches zero when the retriever of RAG is jointly trained. A smaller standard deviation means the relevance scores become flattened. Although knowledge-grounded generative models have empirically shown that jointly training the retriever and generator improves performance on knowledge-intensive NLP tasks (Lewis et al., 2020), in open-domain conversation the retrieved exemplars are ignored. Thus, the retriever learns to produce uninformative relevance scores. As a result, the retriever collapses, which means it may return inappropriate exemplars to the generator (also shown in the examples of KIF and RAG in Table 4). Intriguingly, jointly training the retriever with CORGE also causes the retriever scores to be flattened, as shown in Figure 4, and we empirically observe a minor collapse of the retriever, as we experienced with RAG. Thus, CORGE does not jointly train the retriever.

Ablation Study
To verify the effectiveness of each component in CORGE, we conduct an ablation study. In Table 5, PPL_ret of RetNRef+CORGE is lower than that of any other ablation counterpart, which confirms that each component contributes to predicting the responses. RetNRef+CORGE−RS and RetNRef+CORGE−kNE have higher PPL_ret and PPL_gold, which indicates that RS and kNE help the generator utilize the exemplar while generating the response. RetNRef+CORGE−JF provides a strong signal of over-fitting, with extremely low PPL_gold but exceptionally high PPL_ret. Dist-n shows that our model produces the most diverse responses among the models except RetNRef+CORGE−JF, which excessively copies tokens from the retrieved exemplar. The average BLEU_(z,r) scores show the same trend, which reaffirms the effect of each component of CORGE.

Conclusion
In this paper, we introduce a generally applicable training method for exemplar-based generative models to alleviate their disadvantages derived from the one-to-many problem. Our training method selects exemplars that are semantically relevant to but lexically distanced from the gold response and weights those exemplars with the relevance scores measured by the retriever. Through extensive analysis, including a pair-wise human evaluation, we verify that our method improves the performance of existing exemplar-based generative models in terms of appropriateness and informativeness.

A.1 How the Retriever Calculates the Scores
Our retriever follows the architecture of the Bi-encoder (Mazare et al., 2018), and the scores S_R(z, c) and S_R'(z, r) are calculated as follows:

S_R(z, c) = d(z)^⊤ q(c),    S_R'(z, r) = d(z)^⊤ d(r),

where d(z) and d(r) are encoded vectors produced by the response encoder BERT_r, and q(c) is an encoded vector produced by the context encoder BERT_c. The notation R' indicates that only the response encoder is used, instead of both encoders together. CORGE is not limited to using a Bi-encoder as the retriever and can be applied to other types of retrievers (e.g., Poly-encoder (Humeau et al., 2019)).
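With toy vectors standing in for the BERT_r / BERT_c encodings, the two scores reduce to dot products. The function names below are illustrative, not from the paper's codebase.

```python
def dot(u, v):
    """Dot product between two encoded vectors."""
    return sum(a * b for a, b in zip(u, v))

def s_r(d_z, q_c):
    """S_R(z, c) = d(z)^T q(c): relevance of exemplar z to context c,
    using the response encoder for z and the context encoder for c."""
    return dot(d_z, q_c)

def s_r_prime(d_z, d_r):
    """S_R'(z, r) = d(z)^T d(r): similarity of exemplar z to gold response r,
    using only the response encoder for both sides."""
    return dot(d_z, d_r)
```

In practice d(·) and q(·) would be Bi-encoder outputs; the point is that both scores share the same retriever parameters but pair different encoders.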

A.2 Model Details
As mentioned in Section 5.2, we employ Bi-encoder 256M and Blender 90M as the retriever and the generator of each exemplar-based generative model, respectively. For MatToGen, additional MLP layers are added to the retriever, following the details in Cai et al. (2019b). When training the models, the weights of the retriever and the generator are initialized with the pre-trained Bi-encoder 256M and Blender 90M, respectively. For Blender 90M, we use the model released by ParlAI (Miller et al., 2017), which is fine-tuned on the BST+ dataset. For Bi-encoder 256M, we fine-tune the model released by ParlAI on the BST+ dataset, and we follow the hyperparameter settings of Humeau et al. (2019), which are implemented in the ParlAI library. The pre-defined response set is constructed from the BST+ training set, which contains about 400K responses. We use an NVIDIA DGX Station A100 for training the models.

A.3 Hyperparameters
When training exemplar-based generative models with CORGE, five (k=5) exemplars are utilized for each training instance. The exemplar-based generators are trained with a batch size of 32 and an initial learning rate of 7e-6, and the learning rate is halved when the training loss reaches a plateau. The model is trained until there is no progress in the validation PPL.
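The learning-rate schedule described above can be sketched as follows. The `patience` value is our own assumption for illustration; in practice a framework scheduler (e.g., PyTorch's ReduceLROnPlateau with factor 0.5) would serve the same purpose.

```python
class HalveOnPlateau:
    """Minimal sketch of the schedule above: halve the learning rate when the
    training loss stops improving for `patience` consecutive checks.
    Illustrative only, not the paper's actual training code."""
    def __init__(self, lr=7e-6, patience=3):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss          # loss improved: reset the plateau counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1      # no improvement this check
            if self.bad_checks >= self.patience:
                self.lr *= 0.5        # plateau reached: decay the LR in half
                self.bad_checks = 0
        return self.lr
```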

A.4 Generation Strategy
When we generate samples using the generative model, exemplar-based generative models, and knowledge-grounded generative models, we adopt a beam decoding strategy, which is widely used in generative models (Graves, 2012). Following Roller et al. (2021), we set the minimum beam length to 20 BPE tokens and the beam size to 10, and use tri-gram beam blocking on context and response blocks. During the inference phase, both exemplar-based generative models and knowledge-grounded generative models use the top-1 scoring candidate, chosen by the relevance score S_R(z, c), as the exemplar.

B Evaluation Details
We prepare dialogue cases that have three-turn input contexts and the gold response from the BST and evaluate them by human pair-wise comparison and automatic evaluation. There are 980 test cases, and we randomly choose 100 test cases for the human evaluation.

B.1 Pair-wise Human Evaluation
As described in Section 5.3, we use Amazon Mechanical Turk to collect the annotations. Each test case is rated by three annotators to improve the robustness of the evaluation results. We set a maximum number of annotations per worker in order to reduce potential bias. To control the quality of the annotations, we only allow annotators who satisfy the following requirements, following Li et al. (2018): (1) HIT approval rate greater than 95%; (2) location in Australia, Canada, New Zealand, the United Kingdom, or the United States; (3) lifetime number of approved HITs greater than 1000. Figure 5 shows the instructions and the interface for the human evaluation. To mitigate annotator bias, we randomly shuffle the order of the models and their corresponding responses.

B.2 Automatic Evaluation
For the automatic metrics, we calculate each metric per test case and take the average of those values. When calculating BLEU, we use the sentence_bleu function in the nltk Python package (Loper and Bird, 2002).

C Measuring Inference Time
We measure how much time is spent when the model generates responses. Blender 90M takes 0.481 seconds and RetNRef+CORGE takes 0.523 seconds per instance, an inference-time gap of only 8.7%. This tells us that exemplar-based generation can significantly improve the quality of responses in terms of appropriateness, informativeness, and diversity without substantially increasing the time needed to generate answers. We test our model on an NVIDIA DGX Station A100 with PyTorch 1.7.1, CUDA 11.0, and CuDNN 8.0, adopting the generation strategy described above. When measuring the inference time, we only use a single GPU (NVIDIA A100, 40GB memory), and the inference time is reported as the average over 100 response generations.
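A measurement protocol like the one above can be sketched as follows; `average_inference_time` and the warm-up handling are our own illustrative choices, not the paper's benchmarking code.

```python
import time

def average_inference_time(generate_fn, inputs, warmup=1):
    """Average wall-clock seconds per generation over a set of inputs,
    mirroring the protocol above (average over many response generations).
    A few warm-up calls are run first and excluded from the timing."""
    for x in inputs[:warmup]:
        generate_fn(x)
    start = time.perf_counter()
    for x in inputs:
        generate_fn(x)
    return (time.perf_counter() - start) / len(inputs)
```

Called with the model's generation function and 100 test contexts, this yields the per-instance figures reported above.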

D Additional Results
We provide additional samples for the retrieved exemplar and the model response from the baselines and our models in Table 6.