More is Better: Enhancing Open-Domain Dialogue Generation via Multi-Source Heterogeneous Knowledge

Despite achieving remarkable performance, previous knowledge-enhanced works usually use only a single-source homogeneous knowledge base with limited knowledge coverage. Thus, they often degenerate into traditional methods because not all dialogues can be linked with knowledge entries. This paper proposes a novel dialogue generation model, MSKE-Dialog, to solve this issue with three unique advantages: (1) Rather than only one, MSKE-Dialog can simultaneously leverage multiple heterogeneous knowledge sources (including but not limited to commonsense knowledge facts, text knowledge, and infobox knowledge) to improve the knowledge coverage; (2) To avoid topic conflicts among the context and the different knowledge sources, we propose a Multi-Reference Selection mechanism to better select context/knowledge; (3) We propose a Multi-Reference Generation mechanism to generate informative responses by referring to multiple generation references at the same time. Extensive evaluations on a Chinese dataset show the superior performance of this work against various state-of-the-art approaches. To the best of our knowledge, this work is the first to use multi-source heterogeneous knowledge in open-domain knowledge-enhanced dialogue generation.


Introduction
The rapid development of knowledge-enhanced techniques has enabled machines to better understand the intrinsic semantics of human conversations and generate informative responses (Yu et al., 2020). External knowledge, such as commonsense bases (Speer et al., 2017), documents, and tables (Wu et al., 2021), can bridge the gap between machines and humans in conversation by generously providing knowledge that is hard to learn from a conversational corpus (Ghazvininejad et al., 2018).

* Corresponding author: Ying Li, li.ying@pku.edu.cn. The email of the first author: wusixing@pku.edu.cn

[Figure 1: An example of multi-source heterogeneous knowledge. Commonsense: (iPhone, related to, smartphone), (smartphone, has context, mobile phones), (Android, has context, mobile phones), ...; Text: "The iPhone is a line of smartphones designed and marketed by Apple Inc. that use Apple's iOS mobile operating system. ..."]

However, previous knowledge-enhanced works are still far from satisfactory because they usually rely solely on a single-source homogeneous knowledge base: (1) Conversations are diverse because humans are free to talk about whatever topics they like (Hu et al., 2020), but the knowledge coverage of a single knowledge base is limited. Thus, only a finite portion of dialogues can benefit from external knowledge; the rest can rely only on the given query because no knowledge can be matched. Given the long-tail issue and the massive labor cost, it is not wise to improve coverage by expanding the number of entries in a single-source knowledge base. (2) Each knowledge source has its own advantages and disadvantages; for example, plain text carries richer information than a knowledge graph, but it performs worse at logical modeling. No knowledge type always performs best; the most suitable knowledge depends on the case.
Human beings can use various kinds of knowledge learned from different sources. Therefore, as shown in Figure 1, we believe that using multiple knowledge sources can improve knowledge coverage and leave more room to select appropriate knowledge. However, every coin has two sides; using multi-source heterogeneous knowledge is more challenging because of the following two conflicts: (1) Topic Conflict: given a dialogue, knowledge entries are usually retrieved by entity name matching (Wu et al., 2020a). Thus, knowledge entries retrieved from one source may be irrelevant to the dialogue context and have different topics compared to entries retrieved from other sources. Blindly using such irrelevant/conflicting knowledge entries can confuse the model. (2) Generation Conflict: Although dialogue utterances and different knowledge bases are all made of words, the word distributions vary among them. This can affect the generation if a model tries to improve informativeness by copying words from knowledge entries. For example, if the word 'apple' appears in both the dialogue context and the commonsense knowledge, there exist two tokens of 'apple', one in the dialogue vocab and one in the commonsense vocab. Then, the two 'apple' tokens will have two different probabilities when predicting the next word, making it difficult for a model to judge which one should be the objective. This issue is severe when using multi-source heterogeneous knowledge: with more knowledge sources, there are more chances for conflicts, and the more conflicts, the lower the response quality.

This paper proposes a novel multi-source heterogeneous knowledge-enhanced dialogue generation model, MSKE-Dialog. MSKE-Dialog can improve knowledge coverage by integrating more knowledge sources. In this paper, we use commonsense knowledge, text knowledge, and infobox knowledge at the same time.
Compared to using only one of them, simultaneously using these three knowledge sources improves the coverage by 63~200% on our dataset. To alleviate the impact of topic conflict, we propose a Multi-Reference Selection mechanism. It uses a global relevance gate and a dynamic selection gate to select relevant knowledge from different sources. We also propose a Multi-Reference Generation mechanism, which constructs a unified dynamic vocab and comprehensively refers to all inputs (i.e., the context and the multi-source knowledge) during decoding. As a result, MSKE-Dialog can avoid the impact of generation conflict as much as possible and generate informative responses.
Our experiments are conducted on a Chinese Weibo dataset. In both automatic and human evaluations, MSKE-Dialog outperforms various state-of-the-art knowledge-enhanced methods by notable margins, and even surpasses the fine-tuned pre-training system CDial-GPT (GPT & GPT2) (Wang et al., 2020b) with fewer parameters and less training corpus. Extensive deep analyses also demonstrate: (1) Compared to simply integrating multiple knowledge bases, MSKE-Dialog performs better because it can alleviate the two mentioned challenging conflicts; (2) Even when MSKE-Dialog uses only single-source knowledge, our model can still achieve promising results, demonstrating that the performance gain comes not only from the multi-source knowledge but also from the approach itself.

Problem Statement and Overview
The goal is to generate the dialogue response given a set of references that guide the generation. R_X = (r_{X,1}, ..., r_{X,l_X}) represents the dialogue context (history), and {R_{K_i}} represents a set of multi-source heterogeneous knowledge, where the i-th R_{K_i} = (r_{K_i,1}, ..., r_{K_i,l_{K_i}}) represents the relevant entries retrieved from the i-th knowledge source. Since both R_X and {R_{K_i}} serve as references in the response generation stage, we call R_X the dialogue reference and {R_{K_i}} the knowledge references; together, R is called the reference set.
As shown in Figure 2, MSKE-Dialog employs three heterogeneous knowledge sources; in other words, {R_{K_i}} contains the commonsense knowledge R_{K_C}, the text knowledge R_{K_T}, and the infobox knowledge R_{K_I}. The high-level architecture of MSKE-Dialog consists of three parts.
(1) Reference Encoding: We propose four different encoders to encode the given references. (2) Reference Selection: In the decoding stage, we update the decoder with not only the last predicted token but also the context-aware readouts gathered from the encoded reference set R. To obtain conflict-free readouts from the encoded R, we propose a Multi-Reference Selection mechanism.
(3) Multi-Reference Guided Generation: MSKE-Dialog can not only generate a word from the fixed vocabulary but also copy a word from R. To avoid conflicts during generation, we propose a Multi-Reference Generation mechanism and a Dynamic Copy mechanism.

Reference Encoding
Dialogue Reference: Each word r_{X,t} ∈ R_X is first embedded as r^w_{X,t} using the fixed vocab V_{R_X}.
Then, a bi-directional GRU network g (Cho et al., 2014) is adopted to encode R_X into hidden states R_X = (r_{X,1}, ..., r_{X,l_X}), where r_{X,t} = [r^←_{X,t}; r^→_{X,t}] and [·;·] denotes the concatenation operation.

Commonsense Reference: Each entry r_{K_C,n} ∈ R_{K_C} is a fact triplet r_{K_C,n} = (e_{h,n}, e_{r,n}, e_{t,n}), where e_{h/t} is the head/tail entity and e_r is the relation. We adopt TransE^1 (Bordes et al., 2013) to learn the embeddings e_{h/r/t,n} with the vocab V_{R_{K_C}}; TransE learns translation-based embeddings such that e_h + e_r ≈ e_t. Thus, r_{K_C,n} = [e_{h,n}; e_{r,n}; e_{t,n}] is the encoded entry, and the encoded commonsense knowledge entry set is denoted as R_{K_C} = {r_{K_C,n}}.

^1 TransE is not the SOTA method; however, this paper does not focus on embedding learning. To compare models accurately, we use TransE as previous works do.
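The translation principle behind the TransE embeddings can be sketched as follows (a minimal toy illustration, not the paper's trained model; the embedding values are made-up assumptions):

```python
# Minimal sketch of the TransE translation principle used to embed
# commonsense triplets (head, relation, tail): a triplet is plausible
# when e_h + e_r is close to e_t. Toy 3-dim embeddings, not trained values.

def transe_score(e_h, e_r, e_t):
    """L1 distance ||e_h + e_r - e_t||; lower means more plausible."""
    return sum(abs(h + r - t) for h, r, t in zip(e_h, e_r, e_t))

# hypothetical embeddings for (iPhone, related_to, smartphone)
e_iphone = [0.2, 0.1, 0.4]
e_related_to = [0.1, 0.3, -0.1]
e_smartphone = [0.3, 0.4, 0.3]

plausible = transe_score(e_iphone, e_related_to, e_smartphone)
implausible = transe_score(e_iphone, e_related_to, [9.0, 9.0, 9.0])
assert plausible < implausible

# The encoded commonsense entry concatenates the three embeddings:
# r_{K_C,n} = [e_h; e_r; e_t]
r_kc = e_iphone + e_related_to + e_smartphone
```

Under this principle, a well-trained embedding makes plausible triplets score near zero while implausible ones score high.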
Text Reference: Each text reference is a word sequence R_{K_T} = (r_{K_T,1}, ..., r_{K_T,l_{K_T}}). Each token r_{K_T,n} is first embedded as r^w_{K_T,n} with the vocab V_{R_{K_T}}. Since R_{K_T} is a long text paragraph, we use a 2-layer Transformer (Vaswani et al., 2017) to encode the sequence efficiently.

Infobox Reference: Following Liu et al. (2018), each infobox table R_{K_I} is first regarded as a set of key-value attributes {(a^k_n, a^v_n)}, where each key a^k_n is a noun phrase and each value a^v_n = (a^w_{n,1}, ..., a^w_{n,|a^v_n|}) is a short text. R_{K_I} is then decomposed into a set of key-word pairs {a^{kw}_{n,m}}, where each key-word pair a^{kw}_{n,m} consists of the n-th key a^k_n and the m-th word of the n-th value, a^w_{n,m}. Then, a^{kw}_{n,m} is embedded as a^{kw}_{n,m} = [a^k_n; a^w_{n,m}; pos_{n,m}], where the attribute key embedding a^k_n uses the vocab V_{R_{K_I,K}}, the attribute word embedding a^w_{n,m} uses the vocab V_{R_{K_I}}, and the positional embedding pos_{n,m} indicates the position (i.e., n, m). Decomposing key-value pairs into key-word pairs significantly increases the number of pairs; therefore, for efficiency, we use a 2-layer Transformer to encode the key-word pairs.
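The decomposition of an infobox table into positioned key-word pairs described above can be sketched as follows (the attribute names and values are illustrative toy examples, not entries from the dataset):

```python
# Sketch: decompose key-value infobox attributes into key-word pairs
# (a_k_n, a_w_{n,m}) with positional indices (n, m), as described above.

def decompose_infobox(attributes):
    """attributes: list of (key, value_words) pairs.
    Returns a list of (key, word, (n, m)) key-word pairs."""
    pairs = []
    for n, (key, value_words) in enumerate(attributes, start=1):
        for m, word in enumerate(value_words, start=1):
            pairs.append((key, word, (n, m)))
    return pairs

# toy infobox for illustration
infobox = [
    ("developer", ["Apple", "Inc."]),
    ("os", ["iOS"]),
]
pairs = decompose_infobox(infobox)
# Each pair keeps its key plus a (row, column) position, which is what
# the positional embedding pos_{n,m} encodes.
assert pairs[0] == ("developer", "Apple", (1, 1))
assert len(pairs) == 3
```

Note that one long value expands into many pairs, which is why the paper opts for an efficient Transformer encoder over the pair set.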

Scalability
Now, the reference set R has been encoded as R = (R_X, {R_{K_i}}), where each encoded R_j can be similarly regarded as a set of embeddings {r_{j,n}}. The remaining part of MSKE-Dialog has no knowledge-type-specific module. Consequently, MSKE-Dialog has superior scalability: a knowledge source R_{K_i} can be removed by simply removing it from R, and a new knowledge source can be added to R by adding a corresponding encoder.

Reference Selection
State Updating: At each decoding step t, the decoder state z_t is first updated by a GRU unit g_d, taking the embedding of the last generated token y_{t-1} and the context-aware reference readout c_t.

Multi-Reference Selection: The reference readout c_t is obtained by fusing the local reference readouts: the dialogue reference readout r^c_{X,t} is gathered from the encoded R_X = (r_{X,1}, ..., r_{X,l_X}) by attention, and each knowledge reference readout r^c_{K_i,t} ∈ {r^c_{K_C,t}, r^c_{K_T,t}, r^c_{K_I,t}} is gathered from the encoded R_{K_C/T/I}, respectively. (The shape of vectors/matrices is defined as R^{n×1}/R^{n×m}.)

Relevance Gate: Each reference R_j ∈ R may have different importance and may conflict with other references. Thus, we employ a global Relevance Gate α^{rel}_{R_j} ∈ (0, 1) to control the participation of each reference. Each relevance gate α^{rel}_{R_j} is computed before decoding from the reference summary s_{R_j}, using the sigmoid activation σ and the ELU activation (Clevert et al., 2016). Each reference summary s_{R_j} is given by attention, taking the last dialogue reference state r_{X,l_X} as the query and the encoded reference R_j = {r_{j,n}} as the keys/values.

Selection Gate: At each decoding step, we employ a dynamic context-aware Selection Gate α^{sel}_{R_j,t} to control the fine-grained usage of R_j: the gate vector a^{sel}_t ∈ R^{|R|} is normalized with a softmax, and each local selection gate α^{sel}_{R_j,t} is its entry for R_j.
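The attention readout and gating idea above can be sketched in a few lines of pure Python (a toy illustration with made-up vectors; the paper's exact MLP/ELU parameterization of the gate is omitted):

```python
import math

# Sketch of the Multi-Reference Selection idea: an attention readout
# gathers a context vector from an encoded reference, and a sigmoid
# gate in (0, 1) scales that reference's participation.
# All vectors/weights below are toy values, not trained parameters.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_readout(query, keys):
    """Dot-product attention: weighted sum of keys by relevance to query."""
    scores = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(scores, keys)) for d in range(dim)]

def relevance_gate(summary_score):
    """Global gate alpha_rel in (0, 1) controlling a reference's participation."""
    return 1.0 / (1.0 + math.exp(-summary_score))

query = [1.0, 0.0]                      # e.g. last dialogue state r_{X,l_X}
encoded_reference = [[0.9, 0.1], [0.0, 1.0]]
readout = attention_readout(query, encoded_reference)
gate = relevance_gate(sum(q * r for q, r in zip(query, readout)))
gated_readout = [gate * r for r in readout]
assert 0.0 < gate < 1.0
```

An irrelevant reference yields a small gate, so its readout contributes little to c_t, which is the intended defense against topic conflict.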

Multi-Reference Guided Generation
Copying words in addition to generating from the fixed vocabulary has shown great potential in promoting OOV-free, informative, and diverse responses (Lin et al., 2020). However, the token distributions vary among the multiple references in R, which poses a great challenge in avoiding conflicts. We propose a Multi-Reference Generation mechanism to address this issue.
Word Prediction: To predict the next token y_t, we first compute a generation probability distribution over the fixed vocab V_{R_X} with a two-layer MLP f^{gen}. Then, for each reference R_j ∈ R, we compute a copy distribution that estimates the probability of copying a token from the corresponding reference, where each f^{copy}_{R_j} is a General attention function (Luong et al., 2015), [z_t; c_t; y_{t-1}] is the attention query, and the encoded R_j serves as the attention keys.
Dynamic Vocab: To eliminate the conflicts brought by the different word distributions of the given references, we build a dynamic vocab V_d, which consists of all words that appear in the reference set R and the fixed vocab V_{R_X}. A projection matrix M_{V_{R_X}} ∈ R^{|V_d|×|V_{R_X}|} maps the computed generation distribution p^{gen}_t to the dynamic vocab space. Similarly, for each copy distribution P^{copy}_{R_j,t} of reference R_j, we construct a projection matrix M_{R_j} ∈ R^{|V_d|×|R_j|} that maps the copy distribution of R_j to the dynamic vocab space.
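The idea of mapping the generation and copy distributions into one shared dynamic-vocab space can be sketched as follows (toy vocabularies and probabilities; the fixed mixing weights stand in for the learned gates):

```python
# Sketch: build a unified dynamic vocab from the fixed vocab and all
# references, then fuse the generation distribution and per-reference
# copy distributions in that shared space, so two occurrences of the
# same word (e.g. 'apple') no longer compete as different tokens.
# All probabilities and mixing weights below are toy values.

def fuse_distributions(gen_dist, copy_dists, gen_gate, copy_gates):
    """gen_dist: {word: prob} over the fixed vocab.
    copy_dists: list of {word: prob}, one per reference.
    Gates must sum to 1. Returns {word: prob} over the dynamic vocab."""
    dynamic_vocab = set(gen_dist)
    for d in copy_dists:
        dynamic_vocab |= set(d)
    fused = {w: gen_gate * gen_dist.get(w, 0.0) for w in dynamic_vocab}
    for gate, d in zip(copy_gates, copy_dists):
        for w, p in d.items():
            fused[w] += gate * p
    return fused

gen = {"apple": 0.5, "is": 0.5}
copy_commonsense = {"apple": 1.0}       # 'apple' also appears here
copy_text = {"smartphone": 1.0}
fused = fuse_distributions(gen, [copy_commonsense, copy_text],
                           gen_gate=0.5, copy_gates=[0.3, 0.2])
# 'apple' mass from generating and copying is merged into one entry
assert abs(fused["apple"] - (0.5 * 0.5 + 0.3 * 1.0)) < 1e-9
assert abs(sum(fused.values()) - 1.0) < 1e-9
```

Merging the probability mass of repeated words into a single dynamic-vocab entry is what lets the model judge one unambiguous objective per word.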
Multi-Reference Generation: The probability of the next word y_t is given by fusing all distributions with a generation gate γ^{gen}_t and several copy gates γ^{copy}_{R_j,t}: we not only use a mode weight α^{mode}_{*,t} to control the participation of each distribution, but also reuse the previous relevance gate α^{rel}_{R_j} to help fuse the copy distributions.

Training: Finally, P_t ∈ R^{|V_d|} is used to predict the next token. The model is then optimized by minimizing the negative log-likelihood of the ground-truth response.


Experiment

Experiment Methodology
Dataset: It is built upon three openly released Chinese Weibo corpora (Shang et al., 2015; Ke et al., 2018; Cai et al., 2019). We adopt the ConceptNet (Speer et al., 2017) base released by (Wu et al., 2020b) as the commonsense knowledge; it contains about 696K triplets, 27K entities, and 26 relations. For the text knowledge, we collect the introduction paragraphs of 1,663K entities from Chinese Wikipedia. Besides, we also collect the infobox tables of 1,581K entities from Chinese Wikipedia. All texts are tokenized by Jieba. Following (Wu et al., 2020b), entity words in R_X are used as queries to retrieve knowledge entries from the knowledge bases.
For each dialogue, we retrieve up to 200 most relevant commonsense triplets, up to 1 relevant text paragraph, and up to 1 infobox table. After the pre-processing and the dialogue-knowledge alignment, the statistics are reported in Table 1.
As reported in Table 1, using all three knowledge sources can improve the coverage by 63∼200%.
Models: There are 5 groups: (1) None: the widely-used attentive Seq2Seq (Luong et al., 2015) and its variant Copy, which can copy words from the query (See et al., 2017); (2) Commonsense: CCM, the first model to leverage commonsense knowledge with two graph attention mechanisms (Zhou et al., 2018), plus ConceptFlow and ConKADI (Wu et al., 2020b), two recent SOTA commonsense knowledge-enhanced methods; (3) Text: RefNet, one of the latest SOTA text knowledge-enhanced methods, which proposes a reference-aware network to access the background text (Meng et al., 2020a); (4) Infobox: we adapt two data-to-text works to dialogue models by adding dialogue encoding/attention/copy modules (from Copy): SA-S2S, a structure-aware seq2seq for infobox knowledge (Liu et al., 2018), and TransInfo, one of the latest SOTA infobox knowledge-aware text generation approaches; (5) Pre-training: the fine-tuned CDial-GPT models (GPT & GPT2) (Wang et al., 2020b).

Metrics: We use the embedding-based metric Embed (average/greedy/extreme) and the word overlap-based BLEU-1~4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004) to evaluate the relevance to the ground-truth responses. We use DIST-Uni/Bi (the ratio of distinct 1/2-grams among all generated tokens) and the 4-gram entropy Ent-4 to evaluate diversity. In addition, we use the entity score (i.e., the number of generated entity/knowledge words per sentence) to measure knowledge utilization. We count the entity score on each type of knowledge (CSK, TXT, IBT: commonsense, text, infobox) and compute the averaged entity score (AVG). Finally, to compare the overall performance fairly, we report the overall geometric mean scores relative to Seq2Seq; Appendix B elaborates the details. When comparing different approaches, we do not use perplexity because its definitions and computations vary among approaches; we report perplexity in the ablation study, where all model variants share the same computation.

Experimental Results
Automatic Evaluation: As reported in Table 2, MSKE-Dialog achieves the best results on 12 metrics, second or third place on 3 metrics, and the best overall performance. In terms of relevance, MSKE-Dialog beats the baselines on all related metrics, indicating that responses generated by MSKE-Dialog are closer to the topic. Thanks to the proposed Multi-Reference Generation mechanism, MSKE-Dialog has the best performance on DIST-Uni/Bi and the second-best on Ent-4, showing that it can generate diverse and informative responses. MSKE-Dialog slightly loses to GPT_base on Ent-4; we think the reason is that GPT_base has already been pre-trained on a large number of dialogues. Moving to the aspect of knowledge, MSKE-Dialog clearly beats the other baselines on the overall score with the cooperation of three heterogeneous knowledge sources. MSKE-Dialog loses to RefNet/CCM in terms of the text/commonsense entity score; the reason is that these two baselines each use only one knowledge source, while our approach uses three and therefore does not focus on a single source.
Human Evaluation: We conduct a pair-wise evaluation. The baselines include ConKADI, TransInfo, RefNet, GPT_base (the best in each corresponding group), and the naive Seq2Seq. We employ 3 well-educated native speakers to annotate the sampled 200 test cases. There are two criteria: (1) Appropriateness evaluates the fluency and the relevance to the context; (2) Informativeness evaluates how much new knowledge is provided.
As reported in Table 3, MSKE-Dialog also outperforms the baselines in the human evaluation. Compared with the automatic results, Seq2Seq performs better in the human evaluation; this is because humans have a high tolerance for boring/generic but fluent responses. The remaining results are roughly in line with the automatic results. It is worth noting that MSKE-Dialog outperforms GPT_base despite using far less training data (700K dialogues vs. 1.3B words + 6.8M dialogues of pre-training + 700K fine-tuning). This verifies the advantage of using multi-source heterogeneous knowledge and the effectiveness of our model.

Ablation Analysis
To investigate what makes the most contribution to MSKE-Dialog, we conduct extensive studies.
Knowledge Contribution: We design a set of single-source variants of MSKE-Dialog to explore which knowledge brings the most improvement. As reported in Table 4, compared to Base, which neither uses external knowledge nor copies words, all three single-source variants improve both the overall performance and the perplexity. Previous works (Gu et al., 2016; Vinyals et al., 2015) have shown that copying words from the context R_X can significantly improve performance; our models also benefit from this factor, as Context+* outperforms Base+* by notable margins. Evidently, among the three knowledge sources, commonsense knowledge and text knowledge bring more contributions. The perplexity of Context/Base+IBT improves notably, but the improvement of the overall score (i.e., the quality of the generated responses) is not notable; we suspect the employed beam-search decoding may be a bottleneck and leave it as future work. It is also worth noting that our approach can beat the best knowledge-enhanced baselines without using more knowledge sources: the best commonsense/text/infobox knowledge-enhanced baselines achieve overall scores of ConKADI/1.43, RefNet/1.32, and TransInfo/1.29, which are lower than our single-source variants Context+CSK/1.66, Context+TXT/1.60, and Context+IBT/1.60.

Model Contribution: In this part, all variants use all three knowledge sources. We check the performance contribution by removing a module from the Full model, i.e., MSKE-Dialog. As reported in Table 5, we first remove the use of the dynamic vocab V_d. While the knowledge score increases, the relevance and diversity scores decrease sharply; this is because -V_d tends to copy words from external knowledge without considering the context. We proposed the Multi-Reference Selection mechanism to solve the topic conflict and the Multi-Reference Generation mechanism to generate informative responses without the impact of generation conflict; -Multi R.Gen./R.Select show that both mechanisms are effective, especially the Multi-Reference Generation. Comparing -K.Copy and -K.Attn, -K.Copy degrades more, indicating that copying knowledge words brings more improvement.
Full does not achieve the best in every aspect but has the best overall performance and perplexity, which indicates that using multi-source knowledge is quite challenging: it is crucial to fuse the knowledge sources into the context without the impact of possible conflicts.

[Figure 3: Low-resource evaluation. We test the full model and a variant that does not use multi-source knowledge. We report the overall score and the geomean relative score in the aspect of relevance; compared to diversity/knowledge, relevance is a more representative aspect in the low-resource evaluation.]

More Studies
Low Resource Evaluation: We train MSKE-Dialog and a non-knowledge-enhanced variant on only a part of the dataset. As illustrated in Figure 3, with the incorporation of multi-source knowledge, MSKE-Dialog trained with only 1/2 to 1/4 of the conversational data can achieve performance comparable to the non-knowledge-enhanced variant. This indicates that multi-source knowledge can indeed help dialogue generation when the conversational data is insufficient, which can be quite useful when constructing a system for a low-resource language/scenario.

Case Study:
We show three cases generated by our MSKE-Dialog and the two strongest baselines from the human evaluation in Table 6. In Case #1, only MSKE-Dialog provides new information, demonstrating that our multi-source heterogeneous knowledge-enhanced approach can generate more informative responses thanks to the improved knowledge coverage. In Case #2, although ConKADI also provides new information, it fails to generate a fluent response, indicating that it is crucial to alleviate the conflict between knowledge and context. In Case #3, although all three models generate fluent and informative responses, GPT_base generates a more natural response with more information, which can be attributed to GPT_base being trained with more data. This suggests the potential of investigating pre-training methods; we leave it as future work. This work focuses on non-pre-training methods because pre-training models are expensive to train and use.

Related Work
The vanilla Seq2Seq tends to generate generic responses, such as 'I don't know' (Chen et al., 2017). Many efforts have been devoted to diversifying the generation (Gao et al., 2019). One crucial factor leading to this issue is the lack of sufficient knowledge: during the conversation, the vanilla Seq2Seq model can only access the given query, which contains limited knowledge (Ghazvininejad et al., 2018). The insufficient knowledge makes it hard for a model to understand the context and generate an informative response. To this end, knowledge-enhanced approaches have been proposed and have demonstrated promising performance (Yu et al., 2020). The knowledge can be texts (Ren et al., 2020; Kim et al., 2020; Tam, 2020), structured graphs/tables/bases (Zhou et al., 2018; Qin et al., 2019; Wu et al., 2020b,c), semi-structured infoboxes (Wu et al., 2021), pre-trained models (Devlin et al., 2019; Radford et al., 2019; Moghe et al., 2020), and many other external knowledge components (Xu et al., 2019).
However, most previous works can only use single-source homogeneous knowledge, and relying solely on one type of knowledge greatly limits performance in real scenarios. Some previous works have noticed this issue, for example, augmenting the knowledge graph with an external text comprehension module or a KBQA module (Wang et al., 2020a), or introducing multi-modal visual features (Liang et al., 2020) for emotional or visual conversation (Meng et al., 2020b). Our work differs from them because we focus on open-domain knowledge-enhanced dialogue response generation rather than emotional/visual conversation. To the best of our knowledge, few works have studied this topic in this area. In addition, MSKE-Dialog has a scalable framework: a new knowledge source can easily be integrated by simply adding a knowledge encoder.

Conclusion & Future Work
This paper proposes a novel multi-source heterogeneous knowledge-enhanced dialogue generation approach, MSKE-Dialog, which outperforms competitive knowledge-enhanced baselines and pretraining models. It verifies the advantages of using multi-source heterogeneous knowledge and the advantages of our approach.
We will continue to investigate the advantages of knowledge-enhanced dialogue generation. We have noticed that the current decoding strategy may be a bottleneck for knowledge-enhanced works, and that combining multi-source knowledge with pre-training has potential; we will pay more attention to these topics.
Ethical Considerations

In this work, the dataset involves both conversational data and knowledge data. All three involved Chinese Weibo (weibo.com, an open social network in China) conversational datasets were openly released by previous works for research (Shang et al., 2015; Ke et al., 2018; Cai et al., 2019). Including but not limited to these three datasets, conversational data crawled from Weibo is widely used for training/evaluation in Chinese dialogue generation research and other NLP research, for example (Wang et al., 2020b; Su et al., 2020). All data crawled from Weibo are openly accessible posts/responses that everyone can see; no privacy-related data (such as gender, nickname, or birthday) is used. However, commercial use may require additional permission from the original author/copyright owner. We use the commonsense knowledge from ConceptNet (Speer et al., 2017); according to its description, reuse in research is allowed (see conceptnet.io). We also collect text knowledge and infobox knowledge from Wikipedia (under the license CC BY-SA 3.0), which allows reuse in both research and commercial settings. To summarize, as a research work, this work raises no concerns regarding the dataset or other aspects.

A Model Implementations
We re-implement Seq2Seq, Copy, SA-S2S, and TransInfo using PyTorch; the remaining models use the official implementations and decoding strategies.
Hyper-Parameters: The word embedding dimension is 200, the commonsense entity/relation dimension is 100, and the GRU dimension is 512. We use the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 32. After each training epoch, we check the model's perplexity on the validation set; if the perplexity starts to increase, the learning rate is halved, and if the epoch number reaches 20 or the perplexity increases in two successive epochs, training is stopped. During inference, we select the checkpoint with the lowest perplexity on the validation set. The official implementations of CCM, ConceptFlow, RefNet, and the GPTs use greedy-search decoding and do not support beam search; thus, we keep the official settings. For the remaining approaches, we apply beam-search decoding (beam width = 10). Under these settings, training on a single Nvidia GeForce RTX Titan takes roughly 2 days. In addition, we use a pre-trained Chinese word embedding released by Song et al. (2018) for initialization (where supported) and evaluation.
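The validation-perplexity schedule described above (halve the learning rate when perplexity rises; stop at 20 epochs or after two successive increases) can be sketched as follows — the perplexity sequence is a toy stand-in for real validation results, and "successive increases" is interpreted as increases relative to the best perplexity so far, one plausible reading of the paper's description:

```python
# Sketch of the training schedule described above: after each epoch,
# halve the learning rate if validation perplexity increases, and stop
# at 20 epochs or after two consecutive increases.

def run_schedule(perplexities, lr=1e-4, max_epochs=20):
    """perplexities[i]: validation perplexity after epoch i+1
    (hypothetical values). Returns (epochs_run, final_lr)."""
    best = float("inf")
    consecutive_increases = 0
    for epoch, ppl in enumerate(perplexities[:max_epochs], start=1):
        if ppl > best:
            lr /= 2                      # halve LR on an increase
            consecutive_increases += 1
            if consecutive_increases == 2:
                return epoch, lr         # early stop
        else:
            best = ppl
            consecutive_increases = 0
    return min(len(perplexities), max_epochs), lr

# Perplexity improves twice, then rises twice -> stop at epoch 4.
epochs, final_lr = run_schedule([50.0, 45.0, 46.0, 47.0])
assert epochs == 4 and final_lr == 2.5e-05
```

This kind of schedule trades a little extra validation cost per epoch for robustness against overfitting without hand-tuning a decay curve.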

B Overall Score
We evaluate models with more than 10 metrics, so it is hard to judge the overall performance by checking the discrete scores alone. To compare the overall performance, we report the overall geometric mean relative scores, with Seq2Seq as the performance baseline (i.e., its relative score is defined as 1.0). In detail, each metric is denoted M_{i,j,k}, where i indexes the evaluation aspect (Relevance, Diversity, and Knowledge), j indexes the evaluation method within the i-th aspect (for example, the aspect Relevance includes Embed, ROUGE, and BLEU), and k indexes the specific metric variant of M_{i,j} (for example, the embedding-based metric Embed has three settings: average, greedy, and extreme). The computation of the overall geometric mean relative score can be described as: • 1. For each M_{i,j,k}, we first compute the performance ratio relative to Seq2Seq. For example, if Seq2Seq achieves 10.0 on metric M_{i,j,k} and MSKE-Dialog achieves 15.0, then the relative performance ratio of MSKE-Dialog is 1.50. The relative performance ratio of M_{i,j,k} is denoted R_{i,j,k}.
• 2. Each evaluation method M i,j may have different metric variants, but the number of metric variants should not affect the overall score, so for each evaluation method M i,j ∈ (Embed, ROUGE, BLEU, DIST), we compute the geometric mean relative score among its variants: R i,j = GeoM ean({R i,j,k }). Meanwhile, in this part, we use the averaged entity score AVG instead of the geomean of three sub entity scores.
• 3. For each aspect M i , we compute the geometric mean relative score among its evaluation methods: R i = GeoM ean({R i,j }).
• 4. The overall geometric mean relative score is given by R = GeoMean({R_i}). The computed R can be used to compare different models easily.
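The four steps above can be sketched as follows (the metric values are toy numbers, not real results):

```python
import math

# Sketch of the overall geometric-mean relative score (Appendix B):
# per-metric ratios vs. Seq2Seq, then geomean within each evaluation
# method, then within each aspect, then across aspects.

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def overall_relative_score(model, baseline):
    """model/baseline: {aspect: {method: [metric values]}}."""
    aspect_scores = []
    for aspect, methods in model.items():
        method_scores = []
        for method, values in methods.items():
            ratios = [v / b for v, b in zip(values, baseline[aspect][method])]
            method_scores.append(geomean(ratios))     # step 2
        aspect_scores.append(geomean(method_scores))  # step 3
    return geomean(aspect_scores)                     # step 4

# toy scores: two BLEU variants and one DIST variant
seq2seq = {"relevance": {"BLEU": [10.0, 5.0]}, "diversity": {"DIST": [2.0]}}
model = {"relevance": {"BLEU": [15.0, 7.5]}, "diversity": {"DIST": [4.0]}}
score = overall_relative_score(model, seq2seq)
# relevance = geomean(1.5, 1.5) = 1.5; diversity = 2.0
# overall = geomean(1.5, 2.0) = sqrt(3.0)
assert abs(score - math.sqrt(3.0)) < 1e-9
```

Nesting the geometric means this way keeps an evaluation method with many variants (e.g. BLEU-1~4) from dominating the overall score.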

C Human Evaluation
Following (Wu et al., 2020b), we conduct a pair-wise evaluation. The competitors include ConKADI, TransInfo, RefNet, GPT_base (the best baselines in the corresponding groups), and the widely-used Seq2Seq. We employ three well-educated native speakers to annotate the sampled 200 test cases (1,200 pairs in total). There are two criteria: (1) Appropriateness evaluates the relevance to the dialogue context and the fluency; (2) Informativeness evaluates how much new knowledge is provided in a generated response.

Table 7: Human annotation results. A marked score means our approach significantly outperforms the baseline (sign test, p-value < 0.05, ties are removed).
As reported in Table 7, MSKE-Dialog also outperforms the baselines in the human evaluation. Compared with the automatic results, Seq2Seq performs better in the human evaluation; this is because humans have a high tolerance for boring/generic but fluent responses. The remaining results are roughly in line with the automatic results. It is worth noting that, compared to GPT_base, although MSKE-Dialog uses only 62% of the parameters (59.14M vs. 95.5M) and less than 10% of the training data (700K vs. 1.3B words + 6.8M pre-training + 700K fine-tuning), it still outperforms GPT_base. This demonstrates the advantage of using multi-source heterogeneous knowledge and the effectiveness of our model design.
Following (Wu et al., 2020b), we count the agreement among the volunteers. For appropriateness, the 2/3 agreement (the percentage of cases where at least 2 volunteers give the same label) is 94.2% and the 3/3 agreement is 53.7%; for informativeness, the 2/3 agreement is 94.4% and the 3/3 agreement is 52.4%.

D Knowledge Contribution
We design a set of single-source variants of MSKE-Dialog to explore which knowledge brings the most improvement. As reported in Table 8, compared to Base, which neither uses external knowledge nor copies words, all three single-source variants improve both the overall performance and the perplexity. Previous works (Gu et al., 2016; Vinyals et al., 2015) have shown that copying words from the context R_X can significantly improve performance; our models also benefit from this factor, as Context+* outperforms Base+* by notable margins. Evidently, among the three knowledge sources, commonsense knowledge and text knowledge bring more contributions. The perplexity of Context/Base+IBT improves notably, but the improvement of the overall score (i.e., the quality of the generated responses) is not notable; we suspect the employed beam-search decoding may be a bottleneck and leave it as future work.
It is worth noting that our approach can also beat the best knowledge-enhanced baselines without using more knowledge sources. The best commonsense/text/infobox knowledge-enhanced baselines and their overall scores are ConKADI/1.43, RefNet/1.32, and TransInfo/1.29, which are lower than our single-source variants: Context+CSK/1.66, Context+TXT/1.60, and Context+IBT/1.60. This proves that MSKE-Dialog not only has the ability to use multi-source heterogeneous knowledge, but also has a more efficient model design.

E Model Contribution
In this part, all variants use all three knowledge sources. We check the performance contribution by removing a module from the Full model, i.e., MSKE-Dialog. As reported in Table 9, we first remove the use of the dynamic vocab V_d. While the knowledge score increases, the relevance and diversity scores decrease sharply; this is because -V_d tends to copy words from external knowledge without considering the context. We proposed the Multi-Reference Selection mechanism to solve the topic conflict and the Multi-Reference Generation mechanism to generate informative responses without the impact of generation conflict.
-Multi R.Gen./R.Select show that the two mechanisms are effective, especially the Multi-Reference Generation. Comparing -K.Copy and -K.Attn, -K.Copy degrades more, indicating that copying knowledge words brings more improvement.
Full does not achieve the best in every aspect but has the best overall performance and perplexity, which indicates that using multi-source knowledge is quite challenging: it is crucial to fuse the knowledge sources into the context without the impact of possible conflicts.