Reason first, then respond: Modular Generation for Knowledge-infused Dialogue

Large language models can produce fluent dialogue but often hallucinate factual inaccuracies. While retrieval-augmented models help alleviate this issue, they still face the difficult challenge of simultaneously reasoning to provide correct knowledge and generating conversation. In this work, we propose a modular model, Knowledge to Response (K2R), for incorporating knowledge into conversational agents, which breaks down this problem into two easier steps. K2R first generates a knowledge sequence, given a dialogue context, as an intermediate step. After this "reasoning step", the model then attends to its own generated knowledge sequence, as well as the dialogue context, to produce a final response. In detailed experiments, we find that such a model hallucinates less in knowledge-grounded dialogue tasks and has advantages in terms of interpretability and modularity. In particular, it can be used to fuse QA and dialogue systems together to enable dialogue agents to give knowledgeable answers, or QA models to give conversational responses in a zero-shot setting.


Introduction
To be regarded as successful, a conversational agent needs to generate utterances that are both knowledgeable and factually correct, as well as conversationally appropriate, fluent, and engaging. The pursuit of this goal has led to ever bigger models that store a large amount of knowledge in their parameters (Roller et al., 2021; Adiwardana et al., 2020; Zhang et al., 2020). However, hallucination, wherein a model generates factually inaccurate statements, has remained a problem no matter the size of the model (Shuster et al., 2021a).
Recent advances in neural retrieval models have made some inroads into this problem (Lee et al., 2019; Lewis et al., 2020b; Shuster et al., 2021a; Komeili et al., 2021) by generating responses based on both the dialogue context and learned retrieval of documents containing relevant knowledge. However, the conversational setting is challenging because these models are required to perform multiple duties all in one shot: to reason over the returned documents and dialogue history, find the relevant knowledge, and then finally combine this into a conversational form pertinent to the dialogue. Perhaps due to this complexity, it has been observed that failure cases include incorporating parts of multiple documents into one factually incorrect response, or failing to include knowledge at all and reverting instead to a generic response that uses the dialogue context only.
In this work, we instead propose to decompose this difficult problem into two easier steps: first generating pertinent intermediate knowledge explicitly, and then, conditioned on this prediction, generating the dialogue response. We call this model Knowledge to Response (K2R). Using this modular design, we can train and evaluate the reasoning performance of the model independently from its conversational abilities, increasing the interpretability of the model's output. It also allows us to plug external knowledge into dialogue systems without any requirement for retraining, for example, from question answering systems. The dialogue response model's task reduces to incorporating the predicted knowledge in an engaging and context-fitting conversational response.
We conduct extensive experiments across multiple tasks and datasets. We find that our K2R model effectively improves correct knowledge utilization and decreases hallucination (Shuster et al., 2021a) in knowledge-grounded dialogue (Dinan et al., 2019). In open-domain dialogue, the K2R model improves performance on automatic metrics compared to its seq2seq counterpart, along with the additional benefits of increased interpretability of the model's output and the possibility of knowledge injection. The modular design allows us to fuse state-of-the-art pre-trained QA models (without any fine-tuning) with dialogue models to generate answers that humans judge as both more knowledgeable and more engaging. Our modular system also outperforms multi-tasking approaches. Our code and generated dataset are made publicly available.

Related Work
Improving dialogue systems by increasing their knowledgeability has been tried in several different ways: from integrating knowledge bases (Zhu et al., 2017; Liu et al., 2018; Wang et al., 2020), to larger models that are pre-trained on more data (Roller et al., 2021; Adiwardana et al., 2020; Zhang et al., 2020), and recent neural retrieval models (Shuster et al., 2021a; Thulke et al., 2021). Knowledge-grounded open-domain dialogue datasets (Dinan et al., 2019; Komeili et al., 2021; Zhou et al., 2018; Gopalakrishnan et al., 2019) foster the research and development of knowledge-aware generative dialogue models. A known issue of such models, referred to as "hallucination", is that they mix up facts and generate factually inaccurate statements. Shuster et al. (2021a) try to alleviate hallucination by using recent advancements in retrieval-augmented generative models developed for open-domain QA tasks (Lewis et al., 2020b; Izacard and Grave, 2021). These methods still hallucinate to some degree, and their predictions (and hence errors) are not easily interpretable.
There is also recent work on modular or intermediate generation components for text generation. The approach of text modular networks promises more interpretable answers to multi-hop questions (Khot et al., 2020; Jiang and Bansal, 2019; Gupta et al., 2020). Khot et al. (2020) learn a generative model that decomposes the task into the language of existing QA models for HotpotQA (Yang et al., 2018) and DROP (Dua et al., 2019). Herzig et al. (2021) solve text-to-SQL tasks with intermediate text representations. For storytelling, hierarchical generation procedures have been proposed (Fan et al., 2018). In reinforcement learning settings, generating natural language has been used as an intermediate planning step (Sharma et al., 2021; Hu et al., 2019), in particular in goal-oriented dialogue (Yarats and Lewis, 2018) and open-domain QA (Adolphs et al., 2021). For summarization tasks, Baziotis et al. (2019) propose an intermediate autoencoder latent representation. Similarly, West et al. (2019) apply the information bottleneck principle to find an intermediate compressed sentence that can best predict the next sentence. For knowledge-grounded dialogue, an approach using internet search can also be seen as a modular intermediate step, where the search query is first generated (Komeili et al., 2021). In that sense, retrieval-based QA has also been treated as a modular technique in many studies (Chen et al., 2017; Yan et al., 2019).
Previous work has also explored the intersection of QA and dialogue models from multiple angles. The DREAM dataset (Sun et al., 2019) consists of multiple-choice questions about a conversation. Yang and Choi (2019) propose a question-answering task based on dialogue histories of the TV show Friends. The QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2019) datasets are designed to have the questions asked in the conversational flow, with possibly multiple follow-ups. However, while these datasets require a model to understand a dialogue's history, the target responses are short-form answers. Therefore, these tasks do not train a dialogue model that generates an engaging, conversationally appropriate response; instead, they result in a QA model that understands dialogue-structured context.

K2R Model
We propose a two-step model for generating dialogue responses called Knowledge to Response (K2R). Instead of directly mapping from dialogue history (context) to response, it generates an intermediate sequence output, which is the knowledge basis for the next utterance. Conceptually, our K2R model consists of two parts:
• A seq2seq knowledge model that maps from context to knowledge.
• A seq2seq response model that generates the final response given the predicted knowledge and the context.
The two models can potentially share parameters (or even be the same model), and the two steps would then be differentiated by context tokens in the input. Alternatively, the two models can be completely separate and trained on different resources, allowing plug-and-play modularity. We explore both these options in this work.
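To make the two-step procedure concrete, the following is a minimal inference sketch with two separate models; the Hugging Face checkpoints and the special knowledge tokens are illustrative assumptions, not the exact setup used in our experiments:

```python
# A minimal sketch of K2R inference with two separate seq2seq models.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
knowledge_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
response_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

def k2r_respond(context: str) -> tuple[str, str]:
    # Step 1 ("reasoning step"): generate a knowledge sequence from the context.
    ids = tokenizer(context, return_tensors="pt", truncation=True).input_ids
    knowledge = tokenizer.decode(
        knowledge_model.generate(ids, num_beams=3, max_length=64)[0],
        skip_special_tokens=True,
    )
    # Step 2: condition the response model on the context plus the predicted
    # knowledge, wrapped in special knowledge tokens (token names assumed).
    conditioned = f"{context} __knowledge__ {knowledge} __endknowledge__"
    ids = tokenizer(conditioned, return_tensors="pt", truncation=True).input_ids
    response = tokenizer.decode(
        response_model.generate(ids, num_beams=3, max_length=128)[0],
        skip_special_tokens=True,
    )
    return knowledge, response
```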

Supervised Training
We can train two separate models for our standard K2R: a knowledge model and a response model; both are encoder-decoder transformers (Vaswani et al., 2017). The former is trained with the context as input and the knowledge response as the target. We can perform standard supervised training using existing resources such as QA datasets and dialogue datasets with annotated knowledge (Dinan et al., 2019). The second part of K2R, the response model, receives as input the context appended with the gold knowledge (replaced by the predicted knowledge during inference), wrapped in special knowledge tokens.
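As a sketch of how training pairs for the two modules could be derived from a knowledge-annotated example (the special-token format is our assumption):

```python
# Hypothetical construction of training pairs for the two K2R modules
# from a WoW-style example with annotated gold knowledge.
def make_training_pairs(context: str, gold_knowledge: str, response: str):
    # Knowledge model: dialogue context -> knowledge sentence.
    knowledge_pair = (context, gold_knowledge)
    # Response model: context plus gold knowledge (inside special tokens)
    # -> dialogue response. At inference time, the predicted knowledge
    # takes the place of the gold knowledge.
    response_input = f"{context} __knowledge__ {gold_knowledge} __endknowledge__"
    response_pair = (response_input, response)
    return knowledge_pair, response_pair
```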

Unsupervised Training

Thanks to the modular design, the knowledge model does not have to be trained jointly with the response model: an off-the-shelf model, for example a pre-trained QA model, can be plugged in as the knowledge module without any fine-tuning, and the response model consumes its output as if it were predicted knowledge (see the zero-shot fusion experiments in Section 4.2).

Experiments

We run all our experiments using the ParlAI (Miller et al., 2017) framework.

Metrics Across the experiments, we use standard generation metrics computed against the ground truth, such as perplexity (PPL), F1, BLEU-4 (B4), and ROUGE-L (RL). Following recent literature (Shuster et al., 2021a), we additionally use the Rare F1 (RF1) metric, which only considers infrequent words in the dataset when computing the F1 score. For WoW, where ground-truth knowledge is provided, we calculate the Knowledge F1 (KF1) metric, i.e., the F1 score between the dialogue prediction and the gold knowledge sentence, as well as the Predicted Knowledge F1 (PKF1), the F1 score between the dialogue prediction and the model's own predicted knowledge. In the considered QA tasks, analogous to F1 and KF1, we measure whether the gold answer is present in the dialogue response (AP) and whether the generated answer is present (GAP); here, we opt for exact-match metrics (as opposed to F1) since the answer is usually a short span and not a full sentence as in the WoW experiments.
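For concreteness, the F1-based metrics above can be sketched as unigram overlap; this minimal version assumes lowercasing and whitespace tokenization, which may differ in detail from the exact ParlAI implementation:

```python
# A minimal sketch of the unigram-overlap F1 underlying F1, KF1, and
# PKF1; only the reference string changes between the three metrics.
from collections import Counter

def overlap_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# KF1:  overlap_f1(dialogue_prediction, gold_knowledge)
# PKF1: overlap_f1(dialogue_prediction, predicted_knowledge)
```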
Models The K2R always consists of two (possibly identical) seq2seq Transformers (Vaswani et al., 2017). While the response model is always a fine-tuned BART-Large (Lewis et al., 2020a) model (except when sharing parameters), the knowledge model varies across experiments to follow common setups from existing baselines: BART for open-domain dialogue, BART RAG DPR (Token) (Lewis et al., 2020b) with a Wikipedia index for knowledge-grounded dialogue, and Fusion-in-Decoder (FiD) (Izacard and Grave, 2021) for question answering. Note that all knowledge models are general seq2seq Transformer models; the main design difference is the neural-retriever-in-the-loop for knowledge-grounded tasks.

Wizard of Wikipedia (WoW)
WoW (Dinan et al., 2019) is a dataset of human-human dialogue that is grounded on Wikipedia articles. During data collection, one of the humans has access to a knowledge retrieval system and indicates which knowledge their response is based on. This process leads to a dialogue dataset that has a knowledge sentence for each target utterance. Hence, the setup for our K2R model is straightforward: first, (learn to) generate the knowledge sentence, and then, based on that prediction, generate the dialogue response. Table 2 shows an example episode with gold targets and model responses (including injected author knowledge).
We train three different variants of our K2R model as explained in Section 3. First, a standard two-model variant of K2R, consisting of a BART RAG DPR model for knowledge prediction and a BART model for the knowledge-conditioned response prediction. Second, a BART RAG DPR model with shared parameters, i.e., trained jointly on knowledge and response prediction. And finally, a confidence-score-conditioned BART response model that uses the knowledge model from the first variant.
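For the third variant, the conditioning can be sketched as prepending a bucketed confidence token to the response model's input; the token format below is an assumption, while the score buckets {0, 2, 6, 10} come from the evaluation reported in the next section:

```python
# Hedged sketch of confidence-score conditioning: a higher score tells
# the response model to trust the knowledge prediction more.
def confidence_conditioned_input(context: str, knowledge: str, confidence: int) -> str:
    assert confidence in {0, 2, 6, 10}
    return (f"{context} __confidence-{confidence}__ "
            f"__knowledge__ {knowledge} __endknowledge__")
```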

Quantitative Results
In Table 1, we compare our K2R approach on the WoW test set (seen split) against its dialogue-only counterparts: a BART model and a BART RAG DPR model with access to a Wikipedia index. We see that the standard K2R model performs roughly on par with the strong BART RAG DPR baseline on the F1 and RF1 scores while outperforming it on the Knowledge F1 metric (29.2% vs. 26.1%). As we will see later, this matches human evaluations, which show a large decrease in hallucination. To give an idea of the performance limits of K2R, we also evaluate it with an oracle knowledge model. Standard K2R model training leads to increased perplexity values, which we associate with the model being overly confident about its knowledge predictions, caused by always conditioning the model on correct knowledge during training. We therefore evaluate our confidence-score model by adding a fixed confidence score of {0, 2, 6, 10} to the input; the higher this value, the more confident the dialogue model should be about the knowledge model's prediction. The results show that this score can be used to trade off perplexity against knowledge utilization (see Appendix A.5).

Context (Topic: Husky)
Apprentice: I just got a husky puppy
Wizard: It sounds cute! Huskies are known amongst sled-dogs for their fast pulling style.
Apprentice: I guess in the north they are working dogs huh?

Gold Knowledge
Sled dogs were important for transportation in arctic areas, hauling supplies in areas that were inaccessible by other methods.

Gold Response
Sled dogs, including Huskies, are used for transportation in arctic areas.

BART
Yes, they are used for sled dog racing.

RAG DPR
Yes, they are used in sled dog racing. They are an ever-changing cross-breed of the fastest dogs.

K2R Knowledge Prediction
Huskies are used in sled dog racing.
Response Prediction
Yes, they are used for sled racing.

K2R Injected Knowledge
In arctic regions huskies are used to deliver hot beverages by companies like starbucks.

Response Prediction
Yes, they are used as delivery dogs by companies such as Starbucks.

Human Evaluation In human evaluations, K2R shows a large reduction in hallucination compared to RAG DPR, 16% vs. 7%, mirroring our results of improved KF1 from the automatic metrics. Notably, K2R hallucinates less than any model studied by Shuster et al. (2021a). However, K2R is rated as less engaging than BART RAG DPR, 54% vs. 66%, although it is rated at least as engaging as BART without knowledge, which is rated at 53%.

Natural Questions
We use the OpenQA-NQ dataset (Lee et al., 2019) of Google queries paired with answers extracted from Wikipedia. The answers in this dataset are short-form, e.g., the question "When did the Dallas Cowboys win their last playoff game?" is answered with "2014". While this might be the desired response in an information-retrieval setting, e.g., a Google search, it can appear laconic and unnatural in a long-form human conversation. We are interested in developing a model that generates knowledgeable but also engaging conversational responses to open-domain questions.
As baselines for this task, we employ two different dialogue models: (i) a standard generative model trained on open-domain dialogue (WoW), and (ii) a retrieval-augmented generative model trained on WoW. Additionally, we compare against a pure QA model trained on NQ. While the dialogue models trained on WoW generate appropriate dialogue responses, they are not fine-tuned to answer questions. On the other hand, the QA model excels at answering questions but is not able to provide an engaging, full-sentence response. Due to the modular architecture of our K2R model, we can combine these two types of models: without additional training, we use the QA model as our knowledge model inside K2R together with the response model trained on WoW (the exact same model as in the previous WoW experiments).

Quantitative Results
We do not have gold dialogue responses (i.e., conversational, full-sentence answers) available for this task, so we focus on the knowledgeable aspect of the models and evaluate in terms of AP and GAP, i.e., the exact match of the gold answer span (AP) or of the knowledge model's generated answer (GAP) in the dialogue response. Table 4 shows the results of the automatic evaluation. The BART baseline model trained on WoW only manages to answer 4.2% of the questions. Its retrieval-augmented variant, BART RAG DPR, improves this to 13.8%. The pure QA model, T5 FiD DPR, contains the gold answer for 46.7% of the questions in its response. For our K2R model, we stack together the T5 FiD DPR QA model as a knowledge model with BART, trained on WoW, as a response model. This K2R model has the gold answer in its dialogue response for 39% of the questions, and for 76% of the questions it incorporates the knowledge predicted by the QA model in the response. To improve the GAP metric, we increase the beam size from 3 to 30 and add a filtering step that chooses, if possible, the first beam that contains the predicted knowledge answer. This leads to a GAP of 96.8% and an AP of 46.3%, the latter being on par with the original QA model (46.7%), while still producing a conversational response. Note that the AP of K2R is limited by the QA model used as the knowledge model: with an oracle knowledge model, K2R can incorporate the correct answer in a dialogue response for 95.5% of the questions.
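The beam-filtering heuristic can be sketched as follows (a minimal sketch; `beams` is assumed to be the list of beam-search outputs ranked by model score):

```python
# Pick the highest-ranked beam that contains the knowledge model's
# predicted answer; fall back to the top beam if none of them does.
def pick_response(beams: list[str], knowledge_answer: str) -> str:
    for candidate in beams:
        if knowledge_answer.lower() in candidate.lower():
            return candidate
    return beams[0]
```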
Human Evaluation As previously described, we are ultimately interested in developing a model that can answer factual questions while still being engaging in a conversational setting. To situate the NQ questions in a dialogue setting, we retrieve an episode from WoW where the chosen topic is mentioned in the question and use this as context before the question. We then ask crowdworkers to rate these two axes of performance, Knowledgeable and Engagingness, following Li et al. (2019). More details about the evaluation setup, as well as examples, can be found in Appendix A.6. Table 5 shows the results of the study. The columns show the percentage of wins of the model against its opponent on a given row. Our K2R model beats all three baselines on both axes significantly (p < .01). Each rating has to be justified by an explanation from the human evaluator, of which we provide samples in Tables 21 and 22. These show that most evaluators rate the longer and more detailed answers of K2R (compared to the QA model) as both more knowledgeable and more engaging.

Qualitative Results
One interesting feature of the K2R model is that one has control over the knowledge used in the response. This offers great benefits for interpretability and allows us to inject knowledge that the model then picks up in the final response. Table 6 gives an example of this. Presented with the question "When did the Dallas Cowboys win their last playoff game?", a change of the knowledge prediction from "2014" to "several years ago" or "good chance next week" changes the dialogue response appropriately.
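Mechanically, injection just replaces the knowledge model's output with an author-chosen string before the response step; a usage sketch reusing the (assumed) conditioning format and objects from the earlier K2R sketch:

```python
# Injecting different knowledge strings changes the final response;
# `tokenizer` and `response_model` are the objects from the earlier sketch.
context = "When did the Dallas Cowboys win their last playoff game?"
for injected in ["2014", "several years ago", "good chance next week"]:
    conditioned = f"{context} __knowledge__ {injected} __endknowledge__"
    ids = tokenizer(conditioned, return_tensors="pt", truncation=True).input_ids
    out = response_model.generate(ids, num_beams=3, max_length=128)[0]
    print(injected, "->", tokenizer.decode(out, skip_special_tokens=True))
```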

LIGHT
In the following experiments, we focus on the text-based open-world adventure game dialogue setting of LIGHT (Urbanek et al., 2019). More specifically, we consider LightWild (Shuster et al., 2021c), a dataset of more than 40k episodes that are not specifically knowledge-grounded but instead require commonsense reasoning and attention to detail of the context. Hence, we do not consider retrieval-augmented models for this task. Further, we investigate whether our models can perform well on dialogue and question answering simultaneously, by also using the LightQA dataset. We show that even in such a setting, our K2R model can be beneficial in creating an intermediate output for the dialogue model to focus on. Moreover, the same models can do well at both dialogue (LightWild) and QA (LightQA) at the same time.

LightQA
LightQA is a task built from LightWild episodes that contain a factual question about the context as the last utterance, with typically short answers. Details about the construction of this dataset are provided in Appendix A.3.

Results Results are given in Table 7 for various metrics. K2R improves both F1 (16.6 vs. 15.5) and RF1 (10.4 vs. 9.6) compared to the best baseline model. This K2R model outperforms non-modular multi-tasking on both tasks (LightWild and LightQA) simultaneously. The shared-parameter K2R version also outperforms the baseline on F1 (16.3) and RF1 (10.2), showing that the performance gain is not due to increased model size. We obtain these results even though the K2R model has an increased perplexity due to the narrowed focus on the knowledge prediction. In Appendix A.5, we provide results of confidence-conditioned models, which can control the perplexity vs. GAP trade-off, similar to the WoW results in Section 4.1. Qualitative examples of K2R on this task are provided in Table 8. We note the strong ability of the response model to adapt to author-provided knowledge, even when it seems quite out of context: e.g., "truck" or "Facebook" are seamlessly blended into the conversation when provided as knowledge injections by the authors, even though they are seemingly quite unrelated. We believe this helps reinforce the argument that separating the knowledge and response modules, as proposed in this work, represents a good choice of structure, as both steps seem to be learnable for our models.

Conclusion
In this work, we presented K2R: a modular approach for knowledge-based dialogue models. We showed that by decomposing knowledge prediction and response generation into explicit sequence-to-sequence subtasks, we could improve dialogue systems by incorporating knowledge or turning short QA model answers into an appropriate conversational form. In detailed experiments, we showed that this modular system helps with hallucination in knowledge-grounded dialogue, is rated by humans as more knowledgeable and engaging when answering questions, and improves generation metrics on open-domain dialogue. Furthermore, it allows for more interpretable results and supports knowledge injection. Future work should continue to investigate methods with modular reasoning steps to help in difficult language tasks.

Limitations
It is well known that large language models have multiple serious shortcomings. On the technical side, they have a tendency to repeat (Welleck et al., 2019) and contradict themselves (Roller et al., 2021; Ouyang et al., 2022). Furthermore, they frequently mix up or invent new facts, commonly referred to as "hallucination" (Shuster et al., 2021b). On a more fundamental note, language models suffer from biases in the training data (Lu et al., 2020; Abid et al., 2021) and can generate unsafe or even toxic language when prompted with the wrong context (Roller et al., 2021). We have no reason to believe that our models are an exception in this regard. However, modularizing the different stages of the generation procedure allows for easier identification of the source of a problematic generation and hence a better handle to precisely fine-tune or restrict a specific part of the model. Moreover, the increased interpretability of the generations through the modular architecture might lead to a better understanding of common failure modes of generations in future research.

A.2 Qualitative Analysis of Knowledge Predictions

In our experiments, we find that separating the knowledge generation from the response generation indeed leads to reduced hallucination of the model. Table 13 compares knowledge predictions of the K2R model against the gold knowledge selected by the Wizard. In the first example, the knowledge generated by the K2R model seems to answer the posed question better by saying there is no precise definition of genius. In the second example, we see the gold knowledge drifting off completely by jumping from the topic of blue skies to the movie "Blue Skies". In the next example, we have the case where the K2R model generates the exact gold knowledge. This often happens when the conversation goes in a clear direction (here, Huskies as pets) and a very closely matching sentence about it exists in the Wikipedia article; the model then generates an exact copy of this sentence. The final example shows a failure mode of the K2R model: here, the knowledge model generates a general sentence about psychology when asked about the specific work of two psychologists.

A.2.1 Interpretability
The K2R architecture allows for more interpretable conversational agents since one can observe not only the final response but also the intermediate knowledge response it is conditioned on. This allows us to better understand which information the model is focusing on when generating a response, and where a mistake is made if one occurs (in the knowledge generation or the response generation). Our experimental results support this claim. In the Wizard of Wikipedia experiments of Section 4.1, we see in Table 1 that the F1 score between the conversational response and the predicted knowledge (PKF1) is up to 76.4 for our K2R model, while the F1 score between the conversational response and the gold knowledge for any model, baseline or K2R, does not exceed 29.2. Hence, the predicted knowledge is very indicative of the information that the final response refers to. Qualitatively, we see this behavior in the examples of Table 2, where an injection of knowledge, "Huskies are used to deliver hot beverages by companies like Starbucks", leads to a conversational response incorporating this information. As we argue above, the K2R architecture allows us to better locate where and why a mistake has been made that leads to a suboptimal response, a feature especially relevant for today's retrieval-based conversational agents. The last example of Table 13 shows such a failure mode: while the Apprentice asks for information about two specific psychologists, "Mehr and Meyer", the knowledge response model generates a generic sentence about the field of psychology. Due to the modular structure, we can conclude that the problem in this case is the retrieval (and generation) of the appropriate knowledge.
In the experiments on NQ in Section 4.2 and LIGHT in Section 4.3, we observe that the generated knowledge/answer is present in the conversational response (GAP) for the vast majority of the test examples (from 75.5% to 99.6%, depending on task and model). This highlights again that the knowledge response gives a good indication of the information content the conversational model is focused on. The examples of Tables 6 and 8 show, for NQ and LIGHT respectively, that a change of the knowledge prediction (by injecting knowledge) leads to major changes in the responses. Hence, the knowledge prediction helps us understand what the focus of the response model was when generating the next utterance.

A.3 LightQA
Our goal with LightQA is to have a task that requires a model to answer questions about the previous context. For example, in LIGHT, a player might ask another character where to find a certain key to complete their quest. Here, we would want a model, acting as the character, to answer appropriately if the knowledge is in the context description. With this goal in mind, we construct the dataset as follows. First, we take a LightWild episode and use an abstractive summarization model, trained on CNN/Daily Mail (Nallapati et al., 2016) and the SAMSum Corpus (Gliwa et al., 2019), to generate a summary. Then we identify all noun chunks, entities, and proper nouns and use them as possible answer candidates. For each answer candidate, we use a T5 question-generation model, trained on SQuAD (Rajpurkar et al., 2016), to generate a possible question given the summary as context. As the last step, we filter the generated questions with a QA model, trained on SQuAD, by checking that it would generate the used answer candidate given the summary and question. An episode of our dataset consists of the original LightWild episode (up to a certain turn) and the generated question as the last utterance. Hence, the labels in this dataset are not the usual dialogue responses but short answers.

Table 17:
Quantitative Evaluations on Wizard of Wikipedia Test (seen and unseen split). We compare against the ground-truth dialogue response in terms of perplexity (PPL), F1, Knowledge F1 (KF1), Rare F1 (RF1), BLEU-4 (B4), and ROUGE-L (RL).
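A rough sketch of the construction pipeline described in this section, using generic Hugging Face and spaCy stand-ins (the exact checkpoints, prompts, and matching rules are assumptions, not our precise setup):

```python
# Hedged sketch of the LightQA construction pipeline.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
summarizer = pipeline("summarization")           # stand-in for the CNN/DM + SAMSum model
question_gen = pipeline("text2text-generation")  # stand-in for the T5 question generator
qa_filter = pipeline("question-answering")       # stand-in for the SQuAD QA filter

def build_episodes(dialogue_history: str):
    summary = summarizer(dialogue_history)[0]["summary_text"]
    doc = nlp(summary)
    # Noun chunks, entities, and proper nouns serve as answer candidates.
    candidates = {c.text for c in doc.noun_chunks}
    candidates |= {e.text for e in doc.ents}
    candidates |= {t.text for t in doc if t.pos_ == "PROPN"}
    episodes = []
    for answer in candidates:
        prompt = f"answer: {answer} context: {summary}"  # assumed prompt format
        question = question_gen(prompt)[0]["generated_text"]
        # Keep the pair only if a QA model recovers the candidate answer.
        predicted = qa_filter(question=question, context=summary)["answer"]
        if predicted.strip().lower() == answer.strip().lower():
            episodes.append((dialogue_history, question, answer))
    return episodes
```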

A.6 NQ Acute Eval Details
We closely follow the human evaluation setup studied by Li et al. (2019) and set up a pairwise model comparison on Amazon MTurk. To situate the NQ questions in a dialogue setting, we retrieve an episode from WoW where the chosen topic is mentioned in the question and use this as context. To have a smooth transition between dialogue context and the question itself, we prefix the question with "By the way, ...". The human evaluators are presented with a side-by-side comparison of the same context and question but with different answers corresponding to the individual models. They are asked to read the dialogue and assess the final response according to one of the two following criteria, using the same wording as Li et al. (2019):
• If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable?
• Who would you prefer to talk to for a long conversation?
In Figures 2 and 3, we provide screenshot examples of the interface used for the human evaluation. To ensure a high quality of evaluations, we only select annotators who correctly solve two manually constructed onboarding examples.
Figure 2: Example interface for the human evaluation of knowledgeability. The first utterance is a knowledge paragraph that answers the final question, provided to give the reviewer the relevant information to assess the models' answers. Then, there is a random dialogue roughly matching the topic of the final NQ question, which is prefixed with "By the way, ...". The reviewer is asked to vote for the better response among the two models and provide a brief justification.

Figure 1: Two examples of modular Knowledge to Response (K2R) models, which condition a dialogue model on (a) the output of a (pretrained) QA model, or (b) the output of a general knowledge model.

Table 1: Quantitative Evaluations on Wizard of Wikipedia Test (seen split).

Table 2: Examples of model outputs on the Wizard of Wikipedia Test set. The K2R model appropriately changes its dialogue prediction when replacing the predicted answer with (author-chosen) injected knowledge.

Table 4: Quantitative Evaluations on Natural Questions Test set with different response models (RM), knowledge models (KM), and access to knowledge (Know.).

Table 5: Human evaluation results on Knowledgeable (top) and Engagingness (bottom) ratings for NQ Test. Bold win percentages indicate significance (p < .01).

Table 6: Examples of model outputs on the Natural Questions Test set. The K2R model appropriately changes its dialogue prediction when replacing the predicted answer with (author-chosen) injected knowledge.

Table 7: Quantitative Evaluations on LightWild and LightQA Test sets.

Context
Top of a tall tree, Somewhere; This is the most majestic tree in the jungle. It spans high into the sky, with brilliant emerald leaves and sturdy brown branches. It is very healthy and spreads very wide. It flowers with brilliant violet flowers.
Partner: Chameleon
Self: Farmer; I was born in a poor village. I eat what we grow. I love being close to the earth.
Farmer: The view is as mesmerizing as it always was, I can never get bored of this
Chameleon: How are you today, farmer?

K2R Injected Knowledge
truck
Response Prediction
I'm doing well, thank you for asking. I've been working hard on the farm, trying to grow enough to feed my family for the next few months, but it's been hard with the truck not running.

K2R Injected Knowledge
Facebook
Response Prediction
I'm doing well, thank you for asking. I've been working hard on my crops, and I hope to sell them on Facebook soon

Table 8: Examples of model outputs on the LightWild Test set. The K2R model appropriately changes its dialogue prediction when replacing the predicted answer with (author-chosen) injected knowledge.

Topic: Genius
Wizard: a genius is a person who displays exceptional intellectual ability.
Apprentice: What is an IQ of a genius?
Gold Knowledge
A genius is a person who displays exceptional intellectual ability, creative productivity, universality in genres or originality, typically to a degree that is associated with the achievement of new advances in a domain of knowledge.
Knowledge Response
There is no scientifically precise definition of genius, and the question of whether the notion itself has any real meaning has long been a subject of debate, although psychologists are converging on a definition that emphasizes creativity and eminent achievement.

Topic: Blue
Apprentice: Blue is always nice. I like royal blue.
Wizard: I once road on The Royal Blue train from New York to D.C
Apprentice: Oh that sounds really nice. I bet there was a lot of scenery and blue skies.
Gold Knowledge
Blue Skies is a 1946 American musical comedy film directed by Stuart Heisler and starring Bing Crosby, Fred Astaire, and Joan Caulfield.
Knowledge Response
The clear daytime sky and the deep sea appear blue because of an optical effect known as Rayleigh scattering.

Topic: Husky
Wizard: It sounds cute! Huskies are known amongst sled-dogs for their fast pulling style.
Apprentice: I guess in the north they are working dogs huh?
Wizard: Sled dogs, including Huskies, are used for transportation in arctic areas.
Apprentice: That is so cool and probably helpful but mine is just a pet
Gold Knowledge
Huskies are also today kept as pets, and groups work to find new pet homes for retired racing and adventure trekking dogs.
Knowledge Response
Huskies are also today kept as pets, and groups work to find new pet homes for retired racing and adventure trekking dogs.

Topic: Motivation
Wizard: What is your motivation for your actions related to education, work, etc.?
Apprentice: That's a tough question. I'd say my own progress. I like seeing how much I improve with something. You?
Wizard: I am retired now. Are you familiar with the work of Mehr and Meyer, well known psychologists?
Apprentice: I am not. Could you tell me about them?
Gold Knowledge
According to Maehr and Meyer, "Motivation is a word that is part of the popular culture as few other psychological concepts are."
Knowledge Response
Psychology is the science of behavior and mind, including conscious and unconscious phenomena, as well as thought.

Table 13: Examples of knowledge predictions of the K2R model against the gold knowledge selected by the Wizard. The examples show that it is often unclear what the proper knowledge is to support the next turn in open-domain dialogue.

Table 14: Quantitative Evaluations on Natural Questions Valid.