A Model of Cross-Lingual Knowledge-Grounded Response Generation for Open-Domain Dialogue Systems

Research on open-domain dialogue systems that allow free topics is challenging in the field of natural language processing (NLP). Dialogue system performance has recently been improved by methods that utilize dialogue-related knowledge; however, non-English dialogue systems struggle to reproduce the performance of English dialogue systems because securing knowledge in the same language as the dialogue system is relatively difficult. Through experiments with a Korean dialogue system, this paper shows that the performance of a non-English dialogue system can be improved by utilizing English knowledge, that is, by using cross-lingual knowledge. For the experiments, we 1) constructed a Korean version of the Wizard of Wikipedia dataset, 2) built Korean-English T5 (KE-T5), a language model pre-trained on Korean and English corpora, and 3) developed a knowledge-grounded Korean dialogue model based on KE-T5. We observed a performance improvement in the open-domain Korean dialogue model even when only English knowledge was given. The experimental results show that the knowledge inherent in cross-lingual language models can be helpful for generating responses in open-domain dialogue systems.


Introduction
Large language models trained with a large-scale corpus (Radford et al., 2019; Lewis et al., 2020; Raffel et al., 2019; Adiwardana et al., 2020; Roller et al., 2020) have stirred considerable research interest by showing low perplexity in several text generation tasks, which correlates with high token accuracy on in-domain test data, and by providing linguistic fluency. However, when conditional text generation is performed with a large model, a "hallucination" problem (Maynez et al., 2020) arises while the model generates plausible text using the internal knowledge implicitly stored in its parameters together with the condition text. Owing to the hallucination problem in open-domain dialogue tasks, the model often produces a response containing false information. For example, if the token "1992" frequently appears after "was born in" in the pre-training corpus, that association is stored in the parameters of the model. When "When was Elvis Presley born?" is entered as the condition, false information such as "Elvis Presley was born in 1992" is often generated.
Knowledge-grounded dialogue tasks (Dinan et al., 2019; Zhou et al., 2018) were introduced so that dialogue models generate informative responses grounded in knowledge, initiating research on dialogue modeling with external knowledge. Because the responses of knowledge-grounded dialogue models are generated from both the dialogue history and external knowledge, these models mitigate the hallucination problem compared to models based only on dialogue history (Shuster et al., 2021).
For knowledge-grounded dialogue systems in languages other than English, data construction in the target language is required, since most published knowledge-grounded dialogue datasets are built in English. However, building knowledge-grounded data in the corresponding language takes considerable time and cost (Li et al., 2020). Even when translating existing English data, a high translation cost is incurred because of the large volume of knowledge included in the dataset. In this paper, to avoid the data construction overhead, we propose a cross-lingual knowledge-grounded dialogue model that generates responses in a language other than English using knowledge in English.
For the cross-lingual knowledge-grounded dialogue model, (1) we constructed the Korean Wizard of Wikipedia (KoWoW) dataset by translating the Wizard of Wikipedia dataset (Dinan et al., 2019), a knowledge-grounded dialogue benchmark, into Korean; (2) based on T5 (Raffel et al., 2019), we built Korean-English T5 (KE-T5), a pre-trained language model specialized for Korean and English; and (3) we developed a cross-lingual knowledge-grounded dialogue model that selects knowledge and generates responses based on the T5 architecture. We conducted experiments to show that a dialogue model generating responses with English knowledge alleviates the hallucination problem compared to a model without knowledge, and achieves performance comparable to a dialogue model with knowledge translated from English into Korean. In addition, by sharing insights from a qualitative analysis of the responses generated by the proposed model, we describe several research directions for future knowledge-grounded dialogue tasks.

Knowledge-Grounded Dialogue Data
Representative knowledge-grounded dialogue datasets include the CMU Document Grounded Conversations dataset (CMU_DoG) (Zhou et al., 2018) and the Wizard of Wikipedia dataset (WoW). CMU_DoG is suitable for generating conversations about a specific article, like a reading-group discussion, because it selects a specific document from Wikipedia and collects conversations about the contents of that document. WoW is a dataset whose conversations are collected by selecting knowledge from Wikipedia to generate a response for each turn, on the basis of the dialogue history. In every turn, a specific sentence is selected as knowledge from the articles returned by TF-IDF retrieval, and the conversation is conducted using that knowledge sentence; unlike CMU_DoG, the knowledge sentences for a conversation may come from several documents. Therefore, WoW can be applied to open-domain chit-chat engines that change topics according to the flow of the conversation.
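As a rough illustration of the TF-IDF retrieval step used to surface candidate articles, the following sketch scores tokenized documents against a query; this is a simplified stand-in (whitespace tokens, raw term frequency), not the retriever actually used to build WoW.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each tokenized document against a query with TF-IDF weights.

    query: list of query tokens; docs: list of token lists.
    """
    n_docs = len(docs)
    # document frequency of each term
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # sum of tf * idf over query terms present in the document
        s = sum(tf[t] * math.log(n_docs / df[t]) for t in query if t in tf)
        scores.append(s)
    return scores
```

A term appearing in every document gets an IDF of zero, so only discriminative terms contribute to the ranking.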
In WoW, there are two speakers, the Apprentice and the Wizard. The apprentice talks freely with the wizard, and the wizard discusses a given topic with the apprentice. The wizard selects appropriate knowledge for the next response and responds based on the selected knowledge and the dialogue history. When there is no appropriate knowledge, or when responding without knowledge is possible, such as when agreeing with the other party's opinion, the wizard responds based only on the dialogue history. The task is to generate the wizard's next utterance using knowledge and dialogue history. Therefore, the dialogue model is constructed to perform knowledge selection for the next utterance and to generate a response based on the selected knowledge and the dialogue history. The WoW dataset consists of train, validation, and test splits. The validation and test splits are further subdivided into seen and unseen splits, in which the conversation topics do and do not overlap with the train split, respectively.
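One wizard turn in this setup can be pictured as a small record like the following; the field names here are purely illustrative and are not the actual WoW JSON schema.

```python
# Hypothetical sketch of one wizard turn in a WoW-style dataset.
# Field names are illustrative, not the official WoW schema.
wizard_turn = {
    "topic": "Bowling",
    "dialogue_history": [
        "Apprentice: I love going bowling on weekends.",
    ],
    "knowledge_candidates": [
        "Bowling is a target sport and recreational activity.",
        "Ten-pin bowling is played on a wooden or synthetic lane.",
    ],
    "selected_knowledge": 0,  # index of the sentence the wizard chose
    "response": "Me too! Did you know bowling is a target sport?",
}
```

The model's two sub-tasks map directly onto this record: predict `selected_knowledge` from the history and candidates, then generate `response` from the history plus the chosen sentence.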

Pre-Trained Language Models
In most NLP tasks, including dialogue tasks, transfer learning from a language model trained on a large corpus to a downstream task has shown high performance. Among the various pre-trained language models, T5 (Text-to-Text Transfer Transformer) adopts an encoder-decoder architecture and was trained on the Colossal Clean Crawled Corpus (C4) (Raffel et al., 2019), a cleaned version of raw corpus obtained from the Web. Models trained on C4 with an auto-regressive objective (T5 AR) and with the span-corruption objective (T5 Span) have been published.
MT5 (Multilingual T5) (Xue et al., 2020), constructed and released to support cross-lingual downstream tasks, was trained with the span-corruption objective of T5 on a large corpus covering 101 languages. However, the multilingual corpus used to train MT5 contains only a small proportion of data for any individual non-English language, so high performance on non-English tasks is difficult to obtain. In this study, a Korean-English language model was built to analyze the performance improvement of a dialogue model in a lower-resource language when knowledge is injected in English.

Knowledge-Grounded Response Generation Models
To generate natural and correct responses in knowledge-grounded dialogue, various successful machine learning techniques have been applied, similar to the research trends in other NLP tasks.
WoW proposed a knowledge selection model using a transformer encoder and memory, and a generative model that produces the next utterance by concatenating the encoded vectors of the selected knowledge and the dialogue history. The proposed model achieved higher response generation performance than a model that generates responses without knowledge (Dinan et al., 2019). SKT (Kim et al., 2020) improved knowledge selection by tracking the prior and posterior distributions over knowledge, thereby improving response generation performance in knowledge-grounded dialogue. In DialoGPT (Zhao et al., 2020c), BART FK (Bruyn et al., 2020), and Knowledge GPT (Zhao et al., 2020b), generation performance was improved by using a pre-trained language model.

KoWoW: Korean Wizard of Wikipedia
We used a commercial machine translation API (MT) to build the KoWoW dataset, choosing the multi-stage translation strategy (Ham et al., 2020). In this strategy, the training and validation splits are machine-translated, and for the test splits, machine-translated drafts are corrected by human translators with reference to the original text. Because WoW's utterances are colloquial, whereas the machine translator is trained on written language, the human translators spent their effort on correcting the machine-translated text rather than directly translating the original English text into Korean.
To maintain the contextual and stylistic consistency of the training and evaluation data during the Koreanization of the WoW dataset, the same MT, Google's neural machine translation system (Wu et al., 2016), was used throughout the multi-stage translation strategy.
In the test split, if the content or meaning of an utterance translated by MT differed from the original text, the human translators corrected it manually while retaining the machine-translated text as much as possible. When idioms were translated in ways that changed their meaning, they were revised to the correct expressions. To ensure translation quality, two experts in English and Korean served as the human translators.

Language Combinations of KoWoW
For the experiment on the cross-lingual knowledge-grounded dialogue task, we constructed four datasets according to the language combinations of knowledge and utterances, using the constructed Korean-English parallel data. KoWoW En-En, whose knowledge and utterances are both in English, is the same dataset as WoW, and KoWoW Ko-Ko is the dataset in which both knowledge and utterances are in Korean. Therefore, the knowledge-grounded task on KoWoW En-En and KoWoW Ko-Ko performs knowledge selection and utterance generation in a monolingual environment. On the other hand, in the KoWoW Ko-En (Knowledge-Korean, Utterance-English) and KoWoW En-Ko (Knowledge-English, Utterance-Korean) datasets, where the languages of knowledge and utterances form cross-lingual combinations, two different languages are used for knowledge selection and utterance generation. For example, in the KoWoW En-Ko dataset, the knowledge sentence for generating the next utterance is selected from knowledge candidates in English using the dialogue history in Korean. The response is then generated in Korean, using the selected English knowledge sentence and the Korean dialogue history. Table 1 shows the statistics of the KoWoW dataset, which are the same as those of the WoW dataset.

KE-T5: Korean-English T5

The existing T5, pre-trained only on an English corpus, is difficult to apply to downstream tasks involving multiple languages. MT5 has a very large vocabulary (250,000 tokens), which requires large memory for training and inference and incurs a high computational cost. Despite this cost, high performance on NLP tasks involving only two languages is difficult to achieve because the vocabulary allocated to Korean is small. We therefore built Korean-English T5 (KE-T5), a T5-based model pre-trained on both English and Korean. KE-T5 uses Google's SentencePiece (Kudo and Richardson, 2018) as a tokenizer, and a 64,000 word/sub-word vocabulary was used for all experiments.
To support both Korean and English, the SentencePiece model was trained to cover 99.95% of a corpus consisting of Korean and English in a 7:3 ratio. We filtered a 60GB Korean corpus crawled from the web and secured a total of 92GB of raw Korean-English corpus, including the RealNews-like data of the English C4 dataset; the RealNews-like split of C4 is filtered to include only the web pages used in Zellers et al. (2019). The corpus used to train KE-T5 consists of 39 million examples. Using this corpus, we trained the model with the span-corruption objective of T5, as in MT5. We evaluated KE-T5 on several Korean and English downstream tasks, such as document summarization, extractive QA, and text classification; KE-T5 showed high performance in both Korean and English, and the downstream-task results are reported in Appendix A.
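As a sketch of the span-corruption objective used to train KE-T5, the following replaces pre-chosen token spans with T5-style sentinel tokens and builds the reconstruction target; real pre-training samples the spans randomly, which is omitted here for determinism.

```python
# T5-style span corruption with explicit, pre-chosen spans.
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]

def span_corrupt(tokens, spans):
    """Build (input, target) for span corruption.

    tokens: list of token strings.
    spans: sorted, non-overlapping (start, length) pairs to mask.
    """
    inp, tgt, cursor = [], [], 0
    for sid, (start, length) in enumerate(spans):
        inp.extend(tokens[cursor:start])   # keep tokens before the span
        inp.append(SENTINELS[sid])         # replace the span with a sentinel
        tgt.append(SENTINELS[sid])         # target repeats the sentinel...
        tgt.extend(tokens[start:start + length])  # ...followed by the span
        cursor = start + length
    inp.extend(tokens[cursor:])
    tgt.append(SENTINELS[len(spans)])      # final sentinel ends the target
    return inp, tgt
```

For the sentence "Thank you for inviting me to your party last week" with "you for" and "party" masked, the input becomes "Thank <extra_id_0> inviting me to your <extra_id_1> last week" and the target "<extra_id_0> you for <extra_id_1> party <extra_id_2>".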

Models for Conversation Generation
We developed a dialogue model based on KE-T5 for the cross-lingual knowledge-grounded dialogue task; the structure of the model is shown in Figure 1. In each dialogue, when the current dialogue turn is t and the token sequence of each turn is X_t, the current dialogue context is X_1, ..., X_t, the response to generate is X_{t+1}, and X_1 is the topic of the dialogue.
(1) The model selects the most appropriate knowledge to generate the next response among knowledge candidates, using dialogue context. (2) After that, the next utterance is generated using the selected knowledge and dialogue context.

Retrieval Transformer Network
The retrieval transformer network that selects knowledge uses the KE-T5 encoder as a base model, as shown in Figure 1. In the retrieval model, the knowledge candidates and the dialogue context are independently encoded by the encoder; the average vector is computed along the sequence dimension of each encoded vector sequence and then normalized to obtain a representation vector. The attention between the representation vectors of the knowledge candidates and that of the dialogue context is then computed, and the knowledge with the largest attention value is selected. Suppose the number of knowledge candidates is N. The tokens of the i-th knowledge are K_i, the knowledge candidates are K_1, ..., K_N, and the encoded knowledge vectors are enc(K_1), ..., enc(K_N). Averaging each encoded vector along the sequence dimension and normalizing yields the representation vectors repr(K_1), ..., repr(K_N). Similarly, the averaged and normalized representation of the current dialogue context is denoted repr(ctx). The retrieval model selects the knowledge index i_knowledge as depicted in Eq. 1:

i_knowledge = argmax_{i ∈ {1, ..., N}} repr(ctx) · repr(K_i)    (1)

During training, the knowledge candidates consist of the gold knowledge and knowledge that was not used to generate the response, and labels KL_1, ..., KL_N are generated such that the gold knowledge is labeled 1 and the others 0. Let A_i be the attention score between repr(K_i) and repr(ctx). The loss L_knowledge for the knowledge selection model is the cross-entropy defined in Eq. 2:

L_knowledge = − Σ_{i=1}^{N} KL_i · log( exp(A_i) / Σ_{j=1}^{N} exp(A_j) )    (2)
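The selection step described above (mean-pooling along the sequence dimension, normalization, dot-product attention, and cross-entropy against the gold-knowledge labels) can be sketched with NumPy; the encoder outputs below are small stand-ins for KE-T5 encoder states.

```python
import numpy as np

def repr_vec(encoded):
    """Mean-pool a (seq_len, dim) encoder output and L2-normalize it."""
    v = encoded.mean(axis=0)
    return v / np.linalg.norm(v)

def select_knowledge(ctx_enc, cand_encs):
    """Return the index of the best candidate and all attention scores."""
    ctx = repr_vec(ctx_enc)
    scores = np.array([repr_vec(k) @ ctx for k in cand_encs])
    return int(scores.argmax()), scores

def knowledge_loss(scores, labels):
    """Softmax cross-entropy against one-hot gold-knowledge labels."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -float((labels * np.log(p)).sum())
```

With a context representation aligned to the first candidate, `select_knowledge` picks index 0, and the loss shrinks as the gold candidate's score grows relative to the rest.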

Generative Transformer Network
For the generative transformer network, the selected knowledge and the dialogue context are concatenated and input into the model, and the model is trained to generate the next utterance X_{t+1}. Specifically, the model minimizes the negative log-likelihood loss (L_NLL) of the gold next utterance X_{t+1} under the distribution predicted by the model. The proposed model is similar to the generative transformer memory network of WoW (Dinan et al., 2019), but the input of the generative model is tokens rather than encoded vectors, and the generative model is based on an encoder-decoder structure. Similar to the end-to-end model in WoW (Dinan et al., 2019), the proposed model is trained to minimize a weighted sum of L_knowledge and L_NLL, the losses of the retrieval and generative models, respectively. The final loss of the proposed model is therefore determined by Eq. 3:

L = λ · L_knowledge + (1 − λ) · L_NLL    (3)

where λ is the knowledge weight.
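The weighted-sum training loss of Eq. 3, combined with a token-level NLL computed from decoder logits, might look like the following sketch; the logits are placeholders for KE-T5 decoder outputs, and λ = 0.95 follows the setting in Appendix B.

```python
import numpy as np

def nll_loss(logits, target_ids):
    """Mean negative log-likelihood of gold tokens under the model.

    logits: (seq_len, vocab) unnormalized scores; target_ids: gold token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)       # stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -float(np.mean(logp[np.arange(len(target_ids)), target_ids]))

def total_loss(l_knowledge, l_nll, lam=0.95):
    """Eq. 3: weighted sum of retrieval and generation losses."""
    return lam * l_knowledge + (1 - lam) * l_nll
```

With uniform logits over a vocabulary of 5, the per-token NLL is log 5, a handy sanity check for the implementation.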

Experimental Settings and Metrics
We used perplexity (PPL) and F1 score (unigram overlap), which are commonly used in knowledge-grounded dialogue tasks, as evaluation metrics for responses generated using predicted knowledge. In addition, knowledge selection accuracy was measured in the cross-lingual setting. The KE-T5 pre-trained models used for the retrieval and generative models have 60 million parameters (small) and 220 million parameters (base). All experiments were conducted through transfer learning from pre-trained models, and the model name in the results indicates the pre-trained model used for training. All experiments were trained for 10 epochs, and detailed training settings can be found in Appendix B.
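The unigram-F1 metric above computes precision and recall over overlapping unigrams between the generated and gold responses. A minimal sketch follows; it uses plain whitespace tokenization, whereas implementations of this metric typically also normalize case and punctuation.

```python
from collections import Counter

def unigram_f1(pred, gold):
    """Harmonic mean of unigram precision and recall between two strings."""
    p_toks, g_toks = pred.split(), gold.split()
    # multiset intersection counts shared unigrams
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, "the cat sat" against gold "the cat" gives precision 2/3 and recall 1, hence F1 = 0.8.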

Performance of T5 and KE-T5 on English Dataset
Because the KoWoW constructed in this study is newly released, it is difficult to judge the reliability of the claims in this paper from the KoWoW-based experimental results alone. Therefore, to establish the performance stability of the KE-T5 model built for this experiment, we evaluated a knowledge-grounded dialogue model with KE-T5 on WoW before evaluating the dialogue model on the cross-lingual data. In Table 2, T5 AR is a pre-trained model using an auto-regressive objective, and T5 Span is a pre-trained model using the span-corruption objective. T5 Span and KE-T5 are pre-trained identically, except for the data and vocabulary.
Comparing the results of T5 AR and T5 Span in Table 2, using a model trained with an auto-regressive objective as the pre-trained model yields lower perplexity than using a model trained with the span-corruption objective. As the perplexity of T5 AR is lowest when knowledge is not used, the perplexity of pre-trained models with auto-regressive objectives appears to be low because the auto-regressive objective directly reduces perplexity in generation. However, on the F1 score, the two models showed similar performance.
Comparing the performance of the T5 Span and KE-T5 based models, the performance is similar except that the F1 score of T5 Span is slightly higher on seen topics. Therefore, the relatively high perplexity of the proposed KE-T5 can be attributed to the objective of the pre-trained model. Comparing the proposed KE-T5 based model with other models shows that it is comparable to existing state-of-the-art models on the knowledge-grounded dialogue task, and is therefore sufficient to serve as a baseline model on KoWoW.

Performances on KoWoW

Table 3 shows the performance of the proposed model for all language combinations of Section 3.1. To compare various cross-lingual pre-trained models, we report results for both Multilingual T5 (MT5) and KE-T5, which support both Korean and English. First, for setting (1), the model using KE-T5 has a higher F1 score on both Test Seen and Test Unseen than the model using MT5, while the perplexity of MT5 is lower than that of KE-T5. This is because the Korean vocabulary of KE-T5 is 44K tokens, larger than the 12K tokens of MT5. In the KE-T5 model using Korean knowledge, the F1 scores on Test Seen and Test Unseen were 4.5 and 3.6 points higher, respectively, than those of the model using only dialogue history. This shows that Korean knowledge is of great help in generating Korean responses.
Comparing (2), in which knowledge and utterances are in different languages, with the monolingual setting (1), the F1 score of (2) is lower than that of (1). However, for KE-T5 the differences on Test Seen and Test Unseen are small, 0.4 and 0.8 points respectively, and performance improved by 4.1 and 2.8 points, respectively, compared to not using knowledge. This result shows that response generation performance improves when English knowledge is used for non-English knowledge-grounded dialogue tasks. In addition, both KE-T5 and MT5 showed high performance on the cross-lingual NLP task even though they were pre-trained on corpora collected independently per language, without English-Korean parallel data.
In the experimental results for the reversed knowledge-utterance combinations (3) and (4), the cross-lingual dataset (3) showed a slightly lower F1 score than (4). Compared to the case where knowledge was not used, the F1 score improved significantly, by 5.5 and 5.4 points, respectively. As shown in Table 3, although the knowledge selection accuracy of MT5 on Test Unseen was higher than that of KE-T5, its F1 score was lower. This means the MT5-based model is numerically weaker at generating knowledge-grounded responses than the KE-T5-based model.

Qualitative Analysis
As the experimental results in Section 5.3 show, the perplexity of generation is affected by the scale and vocabulary composition of the pre-trained model, and the F1-score-based evaluation may also diverge from quality as perceived by human evaluators. Therefore, we qualitatively analyzed the responses generated by the proposed model on KoWoW En-Ko and KoWoW Ko-En.
In Table 4, the model that did not use knowledge exhibited a hallucination problem, generating the false information that a band called Insane Clown Posse was formed in 1977 (orange box in the table). In contrast, the model using English knowledge generated the factual and informative response that Insane Clown Posse was formed in Detroit in 1987. When either Korean or English knowledge is used, the model generates a truthful response grounded in the selected knowledge.
Comparing the results of the knowledge-grounded dialogue models using KE-T5 and MT5 as pre-trained models, both generate responses based on the selected knowledge. However, the MT5-based model was often observed to generate responses using words irrelevant to the context, such as the orange box in Table 5. In addition, when generating responses without knowledge, the MT5-based model frequently produced phrases such as "I don't know much about it." and "I'll have to check it out." regardless of the context, and tended to repeat the same phrase. Table 6 shows response samples generated by the KE-T5 based model. (1), (5), and (6) are natural response samples, while (2), (3), (4), and (7) are unnatural ones. As in response (1), the proposed model generates informative responses in Korean in most turns even when only English knowledge is used. In the case of (2), the apprentice expressed a negative stance on the topic; in a natural conversation, the response would agree or empathize with this utterance. However, the model generates a response unrelated to the context, such as "Yes, it's a fun hobby," and gives a topic-related explanation based on the selected knowledge. Moreover, it shows the erroneous behavior of generating a response by simply copying and translating information from the given knowledge. (3) shows a case in which the response contradicts itself. This error can appear when the model fails to judge that the knowledge it selected cannot answer the question, and then generates an incidental response using that knowledge anyway. (4) is a case where the model does not answer the apprentice's question and instead says what it wants to say based on the selected knowledge; the model simply paraphrases the chosen knowledge and delivers it without understanding what the other person is asking.
In the dataset, knowledge may not be used when agreeing with or reacting to the other person's words. (5) reacts to the apprentice's words, and (6) is an example of answering a question and asking one in return. In cases (5) and (6), the proposed model generates a natural response. However, the response in (7) is contradictory and generates the false information that 'Mick' was replaced by 'Brian Jones'. Because the Rolling Stones is a seen topic, the association between 'Brian Jones' and the Rolling Stones may have been implicitly stored in the model parameters during training.
It seems that the model first generated the pattern 'I'm not sure, but I do know', and then generated a response using 'replaced' from the apprentice's utterance and 'Brian Jones', which is learned internal knowledge. Even in the proposed model, if knowledge is not provided, a hallucination problem still exists.

Knowledge-English Utterances-Korean

(1) Seen Topic: Purple
Apprentice: Oh that's cool, do you know if purple had any historical uses?
Predicted Knowledge: Purple was the color worn by Roman magistrates; it became the imperial color worn by the rulers of the Byzantine Empire and the Holy Roman Empire, and later by Roman Catholic bishops.
Response: Roman magistrates wore purple, and purple was worn by the rulers of the Byzantine and Holy Roman Empires.

(2)
Apprentice: Yeah, its sort of an older persons activity like bingo, it's not so much that younger people can't do it, its just not the primary audience
Predicted Knowledge: The most common types of pin bowling include ten-pin, nine-pin, candlepin, duckpin and five-pin bowling, while in target bowling, bowls, skittles, kegel, bocce, carpet bowls, pétanque, ...
Response: Yes, it's a fun hobby, it can include ten-pin, nine-pin, kegel, skittles, and five-pin bowling.

(3) Unseen Topic: Guns N' Roses
Apprentice: Have you heard of the band Guns N' Roses? They are my favorite band ever
Predicted Knowledge: Guns N' Roses, often abbreviated as GNR, is an American hard rock band from Los Angeles, California, formed in 1985.
Response: No, but I know they are a respected American hard rock band.

Conclusion
In this work, we studied whether appropriate knowledge is selected and whether a knowledge-grounded response is generated well even when the languages of knowledge and utterances differ in response generation using external knowledge. Through experiments, we showed that even if the languages of knowledge and utterance are different, performance comparable to that of the monolingual model can be achieved as long as the pre-trained model supports both languages. In addition, the qualitative analysis showed that the proposed model generates more informative responses than the model without knowledge in most cases, and, because it is grounded in external knowledge, the hallucination problem of generating factually inaccurate responses from internal knowledge is alleviated. However, there were cases in which the selected knowledge was simply translated without answering the other person's question, the generated response contradicted itself, or false information was generated when no knowledge was selected. Future work could investigate generating responses that use knowledge while understanding the other speaker's intentions and questions, rather than simply conveying knowledge, as well as preventing contradictions in generated responses. Finally, when no external knowledge is available, research on reducing the hallucination problem and on classifying whether a generated response is true or false would help build a natural dialogue model.

A Performance of KE-T5 on Korean/English downstream tasks
The KE-T5 was trained on a 90GB Korean-English corpus with a mini-batch size of 256 for over 1.5M steps. Although the final number of training steps differs for each model size, the performance in this section was measured at 1M steps (small) and 600K steps (base, large). For the large model, it took 2 months to train 2.2M steps using 8 TPU-v3 cores.

A.1 Extractive Question Answering(QA)
SQuAD (Rajpurkar et al., 2016) and KorQuAD (Lim et al., 2019) were used to evaluate extractive QA performance. The Stanford Question Answering Dataset (SQuAD) is a Wikipedia-based QA benchmark, and the Korean Question Answering Dataset (KorQuAD) is a Korean Wikipedia-based QA benchmark. Version 1 of each dataset, in which the answer to a query always exists in the given context, was used for evaluation. The results are shown in Table 7.

A.2 Neural Machine Translation
TED multilingual data (Qi et al., 2018) is multilingual subtitle data for TED videos created by TED's open translation project. The translation performance between Korean and English was measured using this data. Table 8 shows that translation from Korean to English achieves higher performance than translation from English to Korean.

Korean Natural Language Inference (KorNLI) (Ham et al., 2020) and Korean Semantic Textual Similarity (KorSTS) (Ham et al., 2020) are datasets released by Kakao Brain. KorNLI was translated into Korean from SNLI (Bowman et al., 2015), XNLI (Conneau et al., 2018), and MNLI, and KorSTS was translated from Semantic Textual Similarity (STS) (Cer et al., 2017). Hate Speech (Moon et al., 2020) is a dataset for classifying whether a given sentence is hate speech and, if so, the type of hate speech.

A.6 Korean Summarization tasks

The NIKL summarization data is summarization data published by the National Institute of Korean Language (NIKL), Republic of Korea. It is divided into a summary split and a topic split. The summary split was built by humans summarizing articles, and the topic split concatenates the topic sentences selected by a person in each article. The English summarization task (See et al., 2017) summarizes a given document. As shown in Table 13, KE-T5 performs well on the English summarization task.

B Detailed settings for experimentation
All experiments were trained and validated with the same hyperparameter settings. Knowledge was truncated so that the number of tokens did not exceed 64, and the dialogue context was truncated to 256 tokens. Due to GPU memory limitations, knowledge candidates were divided into mini-batches of size 32. In Eq. 3, the knowledge weight λ was set to 0.95. The Adam optimizer was used for training, with epsilon set to 5e-4, beta1 to 0.9, and beta2 to 0.98. The learning rate was 5e-4, with an inverse-square-root learning-rate schedule; the scheduler decay was set to 0.5 and the warm-up steps to 5000. One NVIDIA V100 32GB GPU was used for training, which took about one day. Beam search was used for inference, with beam size 4 and length penalty 0.65.
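One common formulation of an inverse-square-root schedule with linear warm-up is sketched below; this is an assumed form for illustration, as the text does not spell out the exact formula or how the decay factor of 0.5 enters it.

```python
def inverse_sqrt_lr(step, base_lr=5e-4, warmup_steps=5000):
    """Linear warm-up to base_lr, then decay proportional to 1/sqrt(step).

    The two branches meet at step == warmup_steps, where lr == base_lr.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return base_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)
```

Under this form, the learning rate rises linearly to 5e-4 over the first 5000 steps and then halves every time the step count quadruples.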

C Additional Samples
The tables below show samples generated by the proposed model on the KoWoW dataset. Table 14 and Table 15 show samples generated by the proposed model for four topics. Table 14 shows responses generated using gold knowledge, and Table 15 shows responses generated using predicted knowledge.