Knowledge-Grounded Dialogue Generation with a Unified Knowledge Representation

Knowledge-grounded dialogue systems are challenging to build due to the lack of training data and heterogeneous knowledge sources. Existing systems perform poorly on unseen topics due to the limited topics covered in the training data. In addition, it is challenging to generalize to domains that require different types of knowledge sources. To address the above challenges, we present PLUG, a language model that homogenizes different knowledge sources into a unified knowledge representation for knowledge-grounded dialogue generation tasks. We first retrieve relevant information from heterogeneous knowledge sources (e.g., wiki, dictionary, or knowledge graph); the retrieved knowledge is then transformed into text and concatenated with the dialogue history to feed into the language model for generating responses. PLUG is pre-trained on a large-scale knowledge-grounded dialogue corpus. The empirical evaluation on two benchmarks shows that PLUG generalizes well across different knowledge-grounded dialogue tasks. It achieves performance comparable to state-of-the-art methods in the fully-supervised setting and significantly outperforms other approaches in zero-shot and few-shot settings.


Introduction
Recent work has shown that conversational models can be trained in an end-to-end fashion (Zhang et al., 2019; Roller et al., 2020; Adiwardana et al., 2020). Though such models can generate coherent and natural responses consistent with the conversational history, there is still a clear gap between conversational AI and human conversations. The primary reason is that existing dialogue systems lack knowledge of the subject and thus cannot dive deeply into specific topics with humans. In order to better incorporate knowledge into dialogue, knowledge-grounded dialogue systems have become increasingly popular.

Table 1: Knowledge representation and topic coverage statistics of existing knowledge-grounded dialogue datasets. % Topics denotes the portion of topics or facts in the knowledge database covered by the dataset.
Knowledge-grounded dialogue generation is about generating informative and meaningful responses based on both the conversation context and external knowledge sources. Thus far, researchers have collected knowledge-grounded dialogues for various tasks using crowdsourcing platforms, for instance, open-domain dialogues (Dinan et al., 2019; Zhou et al., 2018) and conversational recommendation dialogues (Li et al., 2018; Moon et al., 2019; Hayati et al., 2020). Workers are asked to base their replies on knowledge from structured knowledge bases (Moon et al., 2019; Hayati et al., 2020; Tuan et al., 2019) or unstructured documents (Dinan et al., 2019; Zhou et al., 2018; Feng et al., 2020). Taking advantage of recent advances in large-scale language models (Raffel et al., 2019; Lewis et al., 2020a; Guu et al., 2020), researchers have also built knowledge-grounded dialogue systems by fine-tuning such language models in an end-to-end fashion (Shuster et al., 2021; Zhao et al., 2020b; Li et al., 2021).
However, there are two critical challenges in these existing methods. First, it is expensive and time-intensive to collect knowledge-grounded dialogues. As shown in Table 1, most of the datasets cover only a small portion of the knowledge base. Thus, systems that only fine-tune with small training sets generalize poorly to unseen topics in the same knowledge base. Additionally, the formats of knowledge sources vary across tasks, making the approaches unable to transfer to other domains with different knowledge sources. For example, REDIAL adopts a movie database as the knowledge source to recommend movies. Techniques on this task exploit the graph structure. It is difficult to adapt such techniques to document-grounded conversation tasks like Wizard of Wikipedia.
In this work, we present PLUG, a model that can unify different knowledge formats for knowledge-grounded dialogue generation. First, we convert different knowledge formats to unstructured text, and then we use a pre-trained language model to process them into a unified dense representation and incorporate the knowledge representations into dialogue generation. We pre-train PLUG on different knowledge-grounded dialogue corpora, including a large-scale open-domain conversation dataset from Reddit. This allows PLUG to learn knowledge in various formats from different tasks, and thus transfer to any knowledge-grounded dialogue task with few-shot learning techniques.
We evaluate the effectiveness of PLUG by applying it to a popular open-domain knowledge-grounded dialogue benchmark, Wizard of Wikipedia (Dinan et al., 2019), and a knowledge-grounded conversational recommendation benchmark, REDIAL (Li et al., 2018). PLUG achieves results comparable to the state-of-the-art method under the fully-supervised setting. It outperforms other methods on both tasks under zero-shot and few-shot settings, demonstrating that PLUG can be grounded on world knowledge from different knowledge sources and generalize to different downstream tasks.
Our contributions are three-fold: (1) We propose a novel knowledge-based pre-trained language model, PLUG, that can be applied to any knowledge-grounded dialogue task; (2) Our model achieves slightly better results than state-of-the-art models in fully-supervised settings and shows promising improvements over the current state-of-the-art in zero-shot and few-shot settings; (3) We present extensive experiments to explore the bottlenecks of the task and the future direction of knowledge-grounded dialogues.

Approach
We describe our approach in this section. Figure 1 gives a diagram of our proposed method. We first introduce the background of knowledge-grounded dialogues and the backbone language model in Section 2.1. Then, we formalize the task and introduce the details of PLUG in Section 2.2. Finally, we explain the training process of PLUG, including pre-training dataset selection and data pre-processing, in Section 2.3.

Background: Knowledge-Grounded Pre-training
A traditional knowledge-grounded dialogue system includes three main steps: information extraction, knowledge prediction, and response generation. Previous work focuses on developing separate modules for each step.
Inspired by the recent success of applying large-scale pre-trained language models to task-oriented dialogue systems (Peng et al., 2020; Hosseini-Asl et al., 2020), we explore the possibility of using a unified knowledge representation in a large-scale language model. In order to properly manage the task in a sequence-to-sequence setup, we choose T5 (Raffel et al., 2020) as our backbone.
T5 is a state-of-the-art sequence-to-sequence pre-trained Transformer (Vaswani et al., 2017) model for transfer learning. T5 is trained by converting various language tasks into a text-to-text format. After fine-tuning on a dialogue dataset, T5 can generate fluent and coherent responses. Nevertheless, the responses are often too generic because they are not grounded on specific knowledge. PLUG is trained from the vanilla T5 model but grounded on real-world knowledge during training, so it inherits T5's capability of producing human-like responses while including more knowledge.

PLUG
A knowledge-grounded dialogue can be formulated as D = {(C_i, R_i)}_{i=1}^{n}, where C_i is the dialogue context and R_i is the response at turn i of a dialogue with n turns, and S is the external knowledge source for task t. For each dialogue turn, we can formulate the knowledge-grounded dialogue generation task on a single domain d as p(R_i | C_i, S).
As shown in Figure 1, each task has its own knowledge source (e.g., documents, databases, and knowledge graphs). In order to fit all knowledge-grounded dialogue generation tasks into the text-to-text encoder-decoder framework, we follow T5 and feed each dialogue turn into the language model simply by concatenating the context C_i = {c_1, c_2, ..., c_m} and the essential knowledge triples K_i = {k_1, k_2, ..., k_n} as a single token sequence. The essential knowledge is extracted from the knowledge source S and represented as text triples. We train the model to predict the response token sequence R_i = {r_1, r_2, ..., r_k}. The probability of the response factorizes autoregressively as p(R_i | C_i, K_i) = ∏_{j=1}^{k} p(r_j | r_1, ..., r_{j-1}, C_i, K_i). We explain how we select and process the pre-training datasets in the following sections.
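As a concrete illustration, the concatenation step above can be sketched as follows. The paper does not specify the exact separator strings, so the `knowledge:`/`context:` markers and the `<turn>` delimiter below are illustrative assumptions only:

```python
def linearize_example(context_turns, triples, max_triples=3):
    """Flatten a dialogue context and essential-knowledge triples into a
    single text-to-text input string (separator tokens are illustrative)."""
    # Each triple is a (subject, relation, object) tuple rendered as text.
    knowledge = " | ".join(" , ".join(t) for t in triples[:max_triples])
    history = " <turn> ".join(context_turns)
    return f"knowledge: {knowledge} context: {history}"

example = linearize_example(
    ["Do you like Barack Obama?", "Yes! Where did he study?"],
    [("Barack Obama", "alma mater", "Columbia University")],
)
```

The decoder is then trained to emit the response tokens conditioned on this single flattened sequence.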

Model training process
We pre-trained the PLUG model using two datasets, Reddit Conversation (Galley et al., 2018) and OpenDialKG (Moon et al., 2019). We first present the three-step data cleaning process for Reddit Conversation in Section 2.3.1, then introduce OpenDialKG in Section 2.3.2.

Reddit Conversation
Reddit Conversation (Galley et al., 2018) is a large-scale open-domain conversation dataset. It extracts conversation threads that are grounded on a document from the Reddit data. We only keep the conversations grounded on Wikipedia passages for pre-training, to better identify the knowledge used in the dialogue. Since the vanilla document-based dialogues in Reddit Conversation do not have a knowledge label for each dialogue turn, we apply a hierarchical information extraction method to obtain the essential knowledge in each turn. Our information extraction method includes three steps: knowledge retrieval, statistical ranking, and semantic ranking.
Knowledge Retriever. We use a knowledge retriever to retrieve all knowledge relevant to a single turn's response. We first extract the title of the grounding Wikipedia passage in the dialogue. Then, we retrieve relevant knowledge triples from a large-scale knowledge graph, DBpedia (Lehmann et al., 2015). Specifically, we query the DBpedia dataset via a public SPARQL endpoint and collect triples whose subject or object is the Wikipedia passage in the dialogue. For example, we keep the triples <Barack Obama, alma mater, Columbia University> and <Michelle Obama, spouse, Barack Obama> for a dialogue about Barack Obama. In order to carry sufficient knowledge to refine in the next step, we retrieve 500 triples for every passage from the knowledge graph.
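The triple-collection query might be constructed as in the sketch below. The query shape (a UNION over the subject and object positions, limited to 500 results) follows the description above, but the exact SPARQL used in the original pipeline is not given, so this is an assumption. The resulting string could be sent to the public DBpedia endpoint with a client such as SPARQLWrapper:

```python
def build_dbpedia_query(resource: str, limit: int = 500) -> str:
    """Build a SPARQL query collecting triples whose subject OR object is
    the given Wikipedia passage title (as a DBpedia resource URI)."""
    uri = f"<http://dbpedia.org/resource/{resource.replace(' ', '_')}>"
    return (
        "SELECT ?s ?p ?o WHERE { "
        f"{{ {uri} ?p ?o . BIND({uri} AS ?s) }} "
        f"UNION {{ ?s ?p {uri} . BIND({uri} AS ?o) }} "
        f"}} LIMIT {limit}"
    )

query = build_dbpedia_query("Barack Obama")
```

In practice the query string would be executed against the endpoint (e.g., `SPARQLWrapper("https://dbpedia.org/sparql")`), with the JSON bindings converted back into text triples.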
Statistical Ranking. After retrieving adequate knowledge, we rank the retrieved triples to refine the knowledge. Specifically, we compute the TF-IDF (term frequency-inverse document frequency) value for all the retrieved triples. To find the triples related to the context, we concatenate the dialogue history and the response as the query. Then we compute the cosine similarity between the query and every triple. Because every triple has the Wikipedia passage name as its subject or object, a higher cosine similarity score means the query shares more text with the distinctive content of the triple. We rank the triples by this query-triple similarity score and keep only the top 50 in this step.
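A minimal, dependency-free sketch of this statistical ranking step is below. The paper presumably uses a standard TF-IDF implementation; the smoothing scheme and whitespace tokenization here are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_rank(query_tokens, triple_token_lists, top_k=50):
    """Rank knowledge triples by TF-IDF cosine similarity to the query
    (dialogue history + response). Returns triple indices, best first."""
    docs = [query_tokens] + triple_token_lists
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}

    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * idf[t] for t in tf}

    def cosine(u, v):
        num = sum(w * v.get(t, 0.0) for t, w in u.items())
        den = (math.sqrt(sum(w * w for w in u.values()))
               * math.sqrt(sum(w * w for w in v.values())))
        return num / den if den else 0.0

    q = vec(query_tokens)
    scores = [(cosine(q, vec(doc)), i) for i, doc in enumerate(triple_token_lists)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]

ranked = tfidf_rank(
    "where did barack obama study".split(),
    ["barack obama alma mater columbia university".split(),
     "michelle obama spouse barack obama".split(),
     "paris capital of france".split()],
)
```

Here the unrelated `paris` triple receives a zero score and falls to the bottom of the ranking, which is the filtering behavior the step relies on.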
Semantic Ranking. The TF-IDF-based cosine similarity score only counts word overlap between triples and the query. It can introduce triples whose overlapping words are not meaningful in the context and response. Additionally, the Reddit Conversation dataset is obtained from Reddit conversation threads, which involve many responses that are not grounded on any knowledge. In order to find the triples with the best semantic similarity to the response and to filter out ungrounded responses, in this step we estimate the semantic similarity score with Sentence-BERT (Reimers and Gurevych, 2019). We rerank the 50 triples from the second step based on this score. We also discard dialogue turns whose best semantic similarity is lower than a threshold, because such responses cannot be matched to proper knowledge, while we want to pre-train the model on knowledge-grounded turns.
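The rerank-and-filter logic can be sketched as below. The real pipeline scores with Sentence-BERT; here the encoder is a pluggable function, and the `toy_encode` bag-of-words encoder is purely an assumption for illustration. The default threshold of 0.35 matches the value reported in the implementation details:

```python
import math

def _cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def semantic_rerank(response, triples, encode, threshold=0.35):
    """Rerank triples by semantic similarity to the response and decide
    whether to keep this dialogue turn for pre-training. `encode` maps
    text to a vector; in the real pipeline it would be a Sentence-BERT
    model's encode method."""
    r_vec = encode(response)
    scored = sorted(((_cosine(r_vec, encode(t)), t) for t in triples),
                    key=lambda x: x[0], reverse=True)
    best_score = scored[0][0] if scored else 0.0
    return [t for _, t in scored], best_score >= threshold

# Toy bag-of-words encoder, an assumption for illustration only.
def toy_encode(text, vocab=("obama", "columbia", "paris")):
    toks = text.lower().split()
    return [toks.count(w) for w in vocab]

ranked, keep = semantic_rerank(
    "obama studied at columbia",
    ["obama alma mater columbia", "paris capital of france"],
    toy_encode,
)
```

A turn whose best-scoring triple falls below the threshold is dropped entirely, which is how ungrounded chit-chat turns are filtered out of the pre-training data.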

OpenDialKG
To generalize our model to various tasks, we also employ OpenDialKG to enrich our pre-training dataset. OpenDialKG consists of two types of tasks, recommendation and chit-chat, across four domains. Unlike the Reddit Conversation dataset, where the knowledge grounding must be found for every turn, the original OpenDialKG has a knowledge graph path label for each dialogue and a triple label for each dialogue turn. The response was grounded on the labeled triple during data collection. Thus, we use the triple in the dataset as the essential knowledge in our pre-training examples.

Experiments
We demonstrate our approach on two different downstream tasks: open-domain knowledge-grounded dialogue and conversational recommendation. Besides the fully-supervised learning setting, we also explore the performance of our approach in few-shot and zero-shot settings.

Datasets and Knowledge Sources
We test our approach on Wizard of Wikipedia (WoW; Dinan et al., 2019) and REDIAL (Li et al., 2018). Basic dataset statistics are listed in Table 2.

Wizard of Wikipedia. This dataset (Dinan et al., 2019) is collected on Amazon Mechanical Turk. Each conversation happens between a "wizard" who has access to knowledge about a specific topic and an "apprentice" who is interested in the topic. The wizard's response is grounded on a Wikipedia article in each turn. The data is split into a training set, a validation set, and a test set. The test set has two subsets: Test Seen and Test Unseen. Test Seen contains conversations whose topics are seen in the training set, while topics in Test Unseen appear in neither the training nor the validation set. To extract the essential knowledge in each dialogue turn, we first keep the top five passages retrieved by the TF-IDF retriever in Shuster et al. (2021). Then we use an Open Information Extraction (OpenIE) annotator to extract the top three triples from the passages as our essential knowledge. The pre-processing is conducted with the code published on ParlAI.

REDIAL. REDIAL (Li et al., 2018) is also collected on Amazon Mechanical Turk. Two crowdworkers, a "movie seeker" and a "movie recommender," are randomly paired. The recommender has access to a movie database and can recommend movies based on movie information, such as actors and movie genres. There are 6,924 different movies mentioned in 51,699 movie slots in the dataset. We follow Li et al. (2018) to split the dataset into training, validation, and test sets. Since recommenders use movie-related knowledge when they recommend movies to seekers, we use it as the essential knowledge for a given turn in this dataset.
We experiment with three knowledge sources: (1) We query the movie names mentioned in the dialogue context and retrieve similar movies from the knowledge graph DBpedia mentioned in Section 2.3, then input the similar movies in triple format as the essential knowledge.
(2) We query the movie names mentioned in the context and retrieve movie comments from MovieLens (https://grouplens.org/datasets/movielens/), then use the keywords in the comments as the essential knowledge.
(3) We use the output of the recommender module in KGSF (Zhou et al., 2020), which is the state-of-the-art system on this dataset.

Baselines
We compare against the best known models on each dataset in the following experiments. For the Wizard of Wikipedia dataset, we choose the retrieval-augmented generation (RAG) model from Shuster et al. (2021). It retrieves wiki documents and generates responses based on the documents. We compare PLUG with this document-based generation method to see the impact of our essential knowledge format. For a fair comparison, we choose the RAG variant that also uses T5 as its backbone.
For the REDIAL dataset, we choose the current state-of-the-art models, KBRD (Chen et al., 2019) and KGSF (Zhou et al., 2020), as our baselines. Both use a recommender module to predict the recommendation item in the next turn and a generation model to generate the response. All baseline results are from Zhou et al. (2021). To investigate the best performance of our approach, we also use the recommender from KGSF as a knowledge source in our system and compare it with the other knowledge sources mentioned in Section 3.1. As an ablation study, we also report the performance of vanilla T5 on both tasks to measure the gain brought by our pre-training process.

Metrics
For evaluation, we report performance with standard automatic metrics: BLEU-4 (B4) (Papineni et al., 2002), ROUGE-L (RL) (Lin, 2004), and unigram overlap (F1) of the generated responses. In addition, for the Wizard of Wikipedia dataset, we follow Shuster et al. (2021) to report the unigram overlap between the model's generation and the knowledge on which the human grounded during dataset collection (KF1), attempting to capture whether a model is speaking knowledgeably. For the REDIAL dataset, we follow previous work (Chen et al., 2019; Zhou et al., 2020; Wang et al., 2021) to report distinct-n (Dist-n) at the sentence level to evaluate the diversity of the model's generation. We also evaluate whether the ground-truth movie recommendation can be found in the generated response and report it as the recommendation item recall in responses (Rec).
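For reference, the unigram F1 (and KF1, which applies the same computation against the gold knowledge text) and Dist-n metrics can be sketched as follows; whitespace tokenization and lowercasing are simplifying assumptions:

```python
from collections import Counter

def unigram_f1(pred, ref):
    """Unigram overlap F1 between a generated response and a reference.
    The same computation yields KF1 when `ref` is the gold knowledge text."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def distinct_n(responses, n=2):
    """Ratio of distinct n-grams to total n-grams over a set of responses."""
    ngrams = [tuple(toks[i:i + n]) for r in responses
              for toks in [r.lower().split()] for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

For example, `unigram_f1("the cat sat", "the cat ran")` gives 2/3 (two overlapping unigrams against three on each side), and `distinct_n(["a b a b"], 2)` gives 2/3 (two distinct bigrams out of three).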

Implementation Details
We process the Reddit monthly submissions and comments dumps from 2011 to 2017, consisting of over 894k knowledge-grounded dialogue turns. As detailed in Section 2.3.1, we set the threshold to 0.35 in the semantic ranking. After filtering with our hierarchical information extraction method, over 321k dialogue turns remain. All dialogue turns in the OpenDialKG dataset are used in the pre-training. Each dialogue turn is processed to form a sequence of tokens consisting of three segments: dialogue context, essential knowledge, and response. We keep the top three triples/keywords as our essential knowledge in pre-training and downstream tasks. PLUG is implemented with Huggingface PyTorch Transformers (Wolf et al., 2020) and initialized with the 800M-parameter T5 model. We use Adam (Kingma and Ba, 2014) with weight decay for pre-training. Training examples are truncated to a maximum length of 512. Models are pre-trained on 8 Nvidia V100 GPUs until we observe no progress on validation data, or up to 20 epochs. The best hyper-parameter configuration is selected through cross-validated grid search.
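The three-segment sequence assembly under the 512-token budget might look like the sketch below; the policy of dropping the oldest context tokens first is an assumption, since the text does not state the truncation strategy:

```python
def build_training_sequence(context_tokens, knowledge_tokens,
                            response_tokens, max_len=512):
    """Assemble [knowledge; context; response] under a token budget,
    dropping the oldest context tokens first when over budget."""
    budget = max_len - len(knowledge_tokens) - len(response_tokens)
    kept = context_tokens[-budget:] if budget > 0 else []
    return knowledge_tokens + kept + response_tokens

seq = build_training_sequence(list(range(600)), ["k"] * 10, ["r"] * 10)
```

Truncating the context rather than the knowledge or response keeps the grounding signal and the prediction target intact, which seems the natural choice for this setup.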

Fully-Supervised Results
We first evaluate PLUG with all training examples in the training sets to compare its performance with other state-of-the-art systems. Additionally, we experiment with using golden knowledge in the input to explore the upper bound of our method.
Table 3 shows the results on the Wizard of Wikipedia Test Seen and Test Unseen sets. PLUG with retrieved knowledge achieves better BLEU-4, ROUGE-L, and F1 scores than the RAG method and than the model without knowledge in the input, on both seen and unseen topics. This suggests that our essential knowledge format helps the model ground its responses on knowledge. We also observe that PLUG outperforms the model without pre-training on all metrics, which means our pre-training boosts this task.
We list REDIAL's results in Table 4. We compare our approach to the state-of-the-art systems and to T5-Large models without pre-training. Additionally, we include a comparison of models with different knowledge sources as described in Section 3.1. Our best model (PLUG+KGSF) achieves new state-of-the-art results on the recommendation item recall metric and the distinct metrics. This is understandable given that our approach is built upon pre-trained language models. Similarly, we observe noticeable performance gains from pre-training on this task. However, compared to systems with currently available knowledge sources, it is immediately apparent that the system with golden knowledge vastly outperforms the current state-of-the-art on all metrics. This surprising improvement implies that current retrievers are the main bottleneck for the conversational recommendation task. We discuss more details in Section 3.8.
Overall, we observe a noticeable improvement from pre-training on both tasks, but it is less significant than expected. This implies that the knowledge grounding pattern in the responses is limited; a complete training set is more than enough for the T5-Large model to learn the generation task. We discuss the zero-shot and few-shot settings in more detail in the following subsections.

Zero-Shot and Few-Shot Results
We focus on zero-shot and few-shot settings because they are more realistic conditions under which to evaluate dialogue systems. Specifically, we randomly sample 10/50/500 dialogues with different topics from the training sets and evaluate on the complete test sets. Moreover, we also evaluate under a zero-shot setting. To keep the setting realistic, we experiment with knowledge retrieved by existing retrievers on both tasks. We compare our models to those without pre-training to explore how our pre-training benefits the model's few-shot learning capability. Results for Wizard of Wikipedia are in Figure 2, and results for REDIAL are in Figure 3. Note that for Wizard of Wikipedia, topics in the original Test Seen set may not be seen during training in this setting, since we only use a small portion of the original training set; we use the original Test Seen and Test Unseen sets to compare with the fully-supervised results. As can be seen in Figure 2 (a), 2 (b), 2 (c), 3 (a), and 3 (b), PLUG maintains higher BLEU-4, ROUGE-L, and F1 scores on both tasks when training with fewer than 500 dialogues. This means PLUG obtains knowledge-grounded generation ability from pre-training and can generalize to different tasks.
From Figure 2 (d), we observe that models without pre-training achieve a higher knowledge F1 score under the zero-shot setting for the Wizard of Wikipedia dataset. However, they achieve poor performance on the language-quality metrics, which implies that without training these models merely copy knowledge words while generating gibberish responses. In contrast, PLUG still generates knowledge-grounded responses out of the box, albeit with a lower knowledge F1 score. This result also suggests that knowledge F1 should only be considered when the model has decent scores on the language-quality metrics.
For the REDIAL dataset, from Figure 3 (d), we can see that pre-training brings less improvement in recommendation item recall than in BLEU-4 and ROUGE-L under the zero-shot setting. However, we observe a noticeable difference between PLUG and the T5 model, which means PLUG learns to generate responses grounded on knowledge faster than the T5 model. The unusually high DIST-4 of T5 in Figure 3 (d) is caused by diverse but irrelevant responses. This is also demonstrated by the low BLEU-4 and ROUGE-L scores in Figure 3 (a) and Figure 3 (b), and by the decrease of DIST-4 as we increase the training data size.

Human Evaluation
We conduct a human evaluation on Wizard of Wikipedia to assess the overall quality of our model's responses compared to T5 and RAG. Specifically, we randomly select 100 responses for each model with the same contexts from Test Seen and Test Unseen. For the few-shot setting, we use the models trained with 50 dialogues. We hire workers on Amazon Mechanical Turk to rate the models' responses on a 0 to 2 scale with three metrics: Fluency, Coherence, and Knowledge. The order of the systems shown to workers is shuffled to avoid confounding practice effects. Three different workers evaluate each dialogue turn. Table 5 reports the average metric scores. We observe that responses from our fully-supervised model are more fluent and coherent than those from RAG, which benefits from our simple but effective essential knowledge representation. We also see significant improvement on all metrics for PLUG under the zero-shot setting compared to the T5 model. The performance improvement under the few-shot setting is smaller than in the zero-shot setting, but PLUG still outperforms T5 on all metrics, which matches the automatic evaluation. Interestingly, responses from the model trained with only 50 dialogues are already very fluent and coherent, scoring even higher than those from the fully-supervised model. However, responses from the fully-supervised model contain the most appropriate knowledge, which suggests that the model learns how to generate high-quality responses in a few-shot setting but continues to learn how to ground on knowledge with more training samples.

Discussion and Analysis
In order to investigate the enormous performance gap between models with golden knowledge and retrieved knowledge in Table 4, we compare the performance of models with different knowledge sources. We find that the performance gain for both models is linear with respect to the quality of the knowledge source, whereas PLUG gets a larger boost on the BLEU-4 score and recommendation recall score. The curve with the higher slope shows the potential benefit of our pre-training method when better knowledge sources become available in the future. Furthermore, the performance gap on DIST-4 between PLUG and T5 stays almost constant as golden knowledge increases, but the DIST-4 of T5 surprisingly drops when no golden knowledge is available. This indicates that T5 requires a better knowledge source in the training set to generate diverse responses under a few-shot setting, while PLUG learned that ability during pre-training and generates diverse responses out of the box. We also note that the performance boost from a better knowledge source is much larger than that from the generation techniques in previous work. This massive gap may shed light on the research direction of knowledge-grounded dialogue tasks for future efforts.

Related Work
Knowledge-grounded dialogue is becoming an increasingly important topic, with datasets proposed to model it on different tasks. Dialogues in these datasets are based on various formats of knowledge, such as open-domain conversations based on documents (Ghazvininejad et al., 2018; Dinan et al., 2019; Gopalakrishnan et al., 2019), movie recommendation conversations based on a movie database (Li et al., 2018; Hayati et al., 2020), or recommendation conversations based on a knowledge graph (Moon et al., 2019; Liu et al., 2021b).
One of the principal challenges in knowledge-grounded conversations is to incorporate knowledge into dialogue systems. Recent work investigates different techniques for learning a better knowledge representation to fuse knowledge into the response generation process. Ghazvininejad et al. (2018) separately encoded the dialogue history and documents to infuse the response with external world facts. Chen et al. (2019); Wang et al. (2021); Zhou et al. (2020) joined a knowledge graph representation in a response generation module. Zhu et al. (2017) combined the knowledge from the database with the user intent and fed it into the decoder. Unlike these studies, we use a single encoder for both dialogue context and knowledge.

In order to improve systems' performance on unseen topics and to train knowledge-grounded dialogue in a low-resource setting, researchers have investigated pre-training methods for knowledge-grounded tasks. Zhao et al. (2020a) pre-train the dialogue generation model with ungrounded dialogues and pre-train the knowledge encoder with the Wikipedia dump separately. Li et al. (2020) proposed a pre-trained latent variable model to learn the way knowledge is expressed in the response. Liu et al. (2021a) built a document encoder and a dialogue context encoder, then pre-trained them separately in multiple stages. The knowledge encoder in these studies is pre-trained separately and only accepts a single knowledge format, while we pre-train our model with essential knowledge in text format, thus fitting different knowledge sources in the downstream tasks.

Inspired by the massive success of pre-trained language models for a variety of natural language processing tasks (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Zhang et al., 2019; Raffel et al., 2020), another line of work investigates learning knowledge through language models' parameters (Petroni et al., 2019; Rosset et al., 2020; Roberts et al., 2020). In our pre-training process, we aim to learn extra knowledge and, more importantly, to learn how to generate responses grounded on the essential knowledge.
Two recent studies are most closely related to our work. Chen et al. (2020) proposed a pre-trained model for the data-to-text task. They unify the knowledge format in the pre-training data and downstream tasks, but they depend on the graph structure and do not work on knowledge-grounded dialogues. Shuster et al. (2021) applied the document retrieval augmentation method (Lewis et al., 2020b) to open-domain knowledge-grounded dialogues, but they do not pre-train and rely on Wikipedia documents in the decoder, limiting their model to document-based dialogues. We use generalized essential knowledge instead of documents in our pre-training, making our model more generalizable. Our approach can be seen as generalizing both of these lines of work, showing for the first time that a pre-trained model is effective for various knowledge-grounded tasks with different knowledge formats.

Conclusion and Future Work
We present PLUG, a knowledge-grounded pre-trained language model that can be applied to any knowledge-grounded dialogue task. It subsumes different knowledge sources into a simple but effective unified essential knowledge representation. Evaluation results on two benchmarks indicate that our model performs better in zero-shot and few-shot settings and generalizes across different knowledge-grounded tasks.
Looking forward, future work could extend our pre-training datasets with more knowledge sources. We hope our model can transfer to more knowledge-grounded tasks, such as question answering. Another interesting direction is to develop better information retrievers, since our experiments show that the retriever is the main bottleneck in knowledge-grounded dialogues.

Figure 1: A diagram of PLUG. PLUG homogenizes different knowledge sources in different tasks into a unified knowledge representation. Then it learns to ground response generation on the unified knowledge representation.

Figure 3: Zero-shot and few-shot results on REDIAL.

Figure 4: Analysis of models with different knowledge sources on REDIAL.

Table 2: Dataset statistics and knowledge representations in Wizard of Wikipedia (WoW) and REDIAL.

Table 3: Fully-supervised results on Wizard of Wikipedia Test Seen and Test Unseen sets.

Figure 2: Zero-shot and few-shot results on Wizard of Wikipedia Test Seen and Test Unseen sets.

Table 5: Human evaluation results of different models on Wizard of Wikipedia.