EARL: Informative Knowledge-Grounded Conversation Generation with Entity-Agnostic Representation Learning

Generating informative and appropriate responses is challenging but important for building human-like dialogue systems. Although various knowledge-grounded conversation models have been proposed, these models have limitations in utilizing knowledge that occurs infrequently in the training data, not to mention integrating unseen knowledge into conversation generation. In this paper, we propose an Entity-Agnostic Representation Learning (EARL) method to introduce knowledge graphs into informative conversation generation. Unlike traditional approaches that parameterize a specific representation for each entity, EARL utilizes the context of conversations and the relational structure of knowledge graphs to learn category representations for entities, which generalizes to incorporating unseen entities in knowledge graphs into conversation generation. Automatic and manual evaluations demonstrate that our model can generate more informative, coherent, and natural responses than baseline models.


Introduction
Generating informative and appropriate responses is vital for the success of human-like dialogue systems. To this end, there has recently been a rising tendency to enhance conversation models with external knowledge, well known as the knowledge-grounded conversation model (Ghazvininejad et al., 2018; Zhou et al., 2018; Dinan et al., 2019). Several studies incorporate unstructured texts, such as web pages (Ghazvininejad et al., 2018) and Wikipedia articles (Dinan et al., 2019), as the external knowledge to generate informative responses. Other work introduces structured knowledge, e.g. the knowledge graph (Zhou et al., 2018). Prior studies adopt either pre-trained knowledge graph embeddings (Zhou et al., 2018), e.g. TransE (Bordes et al., 2013), word embeddings (Wu et al., 2019), or adjacency matrices (Tuan et al., 2019) to model entities and relations in knowledge graphs and incorporate them into conversation generation. These models face two major challenges when applied to large-scale knowledge graphs. First, there is a significant gap between the representations of knowledge and text (Buitelaar and Cimiano, 2008; Zhou et al., 2018), which requires model training to apply knowledge in conversation generation based on different knowledge representations. However, the training corpus of knowledge-grounded conversations only contains a small subset of entities, while the large set of untrained entities is difficult to utilize due to this representation gap. Second, it is extremely challenging in practice to represent the millions of entities and triples of large-scale knowledge graphs (see Table 1) with these methods; for instance, the adjacency matrix requires |V| × |L| × |V| computational resources (V and L denote the sets of entities and relations, respectively).
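The scale argument above can be made concrete with a back-of-envelope calculation; the sketch below uses ConceptNet-scale numbers cited later in the paper (roughly 8 million entities, 36 relations) and is purely illustrative:

```python
# Rough cost of a dense |V| x |L| x |V| adjacency tensor versus storing
# one embedding vector per relation, as EARL does. Illustrative estimate,
# not a measurement.
def adjacency_cells(num_entities, num_relations):
    # one cell per (subject, relation, object) combination
    return num_entities * num_relations * num_entities

def relation_embedding_floats(num_relations, dim=100):
    # EARL only needs one dim-sized embedding per relation
    return num_relations * dim

V, L = 8_000_000, 36
print(adjacency_cells(V, L))         # ~2.3e15 cells: infeasible to store densely
print(relation_embedding_floats(L))  # 3,600 floats: trivially small
```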
To address these issues, we propose EARL, an Entity-Agnostic Representation Learning method that incorporates knowledge graphs into informative conversation generation and can be easily integrated into existing conversation frameworks, such as Seq2Seq (Sutskever et al., 2014), HRED (Serban et al., 2015), and Transformer (Vaswani et al., 2017). The intuition is that knowledge graphs have sparse entities but dense relations, e.g. ConceptNet (Speer et al., 2017) contains over 8 million entities but only 36 relations, as shown in Table 1. EARL learns entity-agnostic representations for nodes in the knowledge graph (see Section 3.2 for more details) based on the context information of the conversation and the structure information of the knowledge graph, and does not parameterize a specific representation for each entity as prior methods do. It thus alleviates the problems mentioned above and is more suitable for applying large-scale knowledge graphs.
Specifically, EARL addresses the issues mentioned above in three ways: (1) A delexicalization step replaces the entities in the conversation history with mask tokens, making the model entity-agnostic to the conversation context and thus generalizable to unseen entities. (2) A knowledge interpreter is proposed to model the generalized representation of an entity from the structure information of knowledge graphs and the context information of the conversation, which allows our method to generate informative responses with unseen knowledge graphs during inference. (3) EARL learns relation embeddings for conversation generation and does not need to store representations for millions of entities, making it scalable to large-scale knowledge graphs. Figure 1 shows conversation samples generated by EARL and a prior knowledge-grounded baseline, where EARL (the last line) injects the unseen entities (the white nodes) coherently. In contrast, the baseline model wrongly utilizes the seen knowledge graph (the grey nodes) in conversation generation when given the unseen entities as input.
To summarize, our contributions are as follows: • This work is the first attempt to utilize knowledge graphs without parameterizing specific entity representations in conversation generation, and the proposed method can be easily integrated into existing conversation frameworks.
• Automatic and manual evaluations show that EARL can generate informative responses with both seen and unseen knowledge graphs on two benchmark datasets. Ablation studies demonstrate the influence of different mechanisms and conversation frameworks.


Related Work

Open-Domain Conversation Models

Recently, Sequence-to-Sequence (Seq2Seq) models (Sutskever et al., 2014) have been applied to large-scale open-domain conversation generation, including the neural responding machine (Shang et al., 2015), hierarchical recurrent models (Serban et al., 2015), and many others (Li et al., 2016; Shao et al., 2017). Some models improve the content quality of generated responses via copy mechanisms, diversified beam search algorithms, and various other techniques (Shao et al., 2017; Li et al., 2016; Mou et al., 2016; Gu et al., 2016). However, the lack of background information or related knowledge results in significantly degenerated conversations, where the text is bland and strangely repetitive (Holtzman et al., 2020). Other studies, aiming to generate informative responses, incorporate external knowledge into conversation generation, including unstructured texts (Ghazvininejad et al., 2018; Long et al., 2017) and structured knowledge graphs (Han et al., 2015; Zhou et al., 2018).

Knowledge Graph Enhanced Conversation Models
Some prior works introduce high-quality structured knowledge graphs for conversation generation. Zhu et al. (2017) presented an end-to-end knowledge-grounded conversation model using a copy network (Gu et al., 2016). A large-scale commonsense knowledge graph was introduced to open-domain conversation generation with graph attention mechanisms in (Zhou et al., 2018). Moon et al. (2019) proposed a knowledge graph walker that selects relevant entities of the knowledge graph to improve the performance of retrieval-based conversation models. The adjacency matrix (Tuan et al., 2019) was introduced to model the dynamic knowledge graph in conversation generation. However, these studies adopt pre-trained knowledge graph embeddings (Zhou et al., 2018), word embeddings (Wu et al., 2019), or the adjacency matrix (Tuan et al., 2019) to represent knowledge triples, making them inapplicable to large-scale and unseen knowledge graphs. By contrast, our model addresses this issue by representing knowledge entities based on the context and the structure information of the knowledge graph, making it entity-agnostic and able to incorporate large-scale and unseen knowledge graphs into conversation generation.

Task Definition
Our problem is formulated as follows: Given a context X = (x_1, x_2, ..., x_n), which is the word sequence of a conversation history H = (U_1, U_2, ..., U_c), and knowledge graphs G = {g_1, g_2, ..., g_|G|}, the goal is to generate the response Y = (y_1, y_2, ..., y_m) by estimating the probability P(Y|X, G) = ∏_{t=1}^{m} P(y_t | y_{<t}, X, G). The graphs are retrieved from a knowledge base using the words in the context as queries. As in Zhou et al. (2018), each graph contains one-hop triples, g_i = {τ_{i1}, τ_{i2}, ..., τ_{i|g_i|}}, and each triple (subject, relation, object) is represented as τ_{ij} = (subj_i, rel_{ij}, obj_{ij}).
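The retrieval step just described can be sketched as follows; the toy knowledge base, the entity names, and the `retrieve_graphs` helper are illustrative assumptions, not the authors' code:

```python
# Hypothetical sketch of one-hop graph retrieval: context words are used
# as queries against a knowledge base, and each matched subject entity
# subj_i yields a graph g_i of its one-hop (subj, rel, obj) triples.
def retrieve_graphs(context_tokens, kb):
    """kb maps a subject entity to its list of (subj, rel, obj) triples."""
    graphs = []
    for token in context_tokens:
        if token in kb:
            graphs.append(kb[token])  # g_i: all one-hop triples of subj_i
    return graphs

kb = {"Chuck Palahniuk": [("Chuck Palahniuk", "Write", "Pygmy"),
                          ("Chuck Palahniuk", "Write", "Tell-All")]}
graphs = retrieve_graphs(["any", "books", "by", "Chuck Palahniuk", "?"], kb)
```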

Entity-Agnostic Representation Learning
EARL consists of three modules: an encoder to convert the context to hidden representations, a knowledge interpreter to represent each subject and object entity based on the context and structure information, and a decoder to generate a token or select an entity from the knowledge graph as determined by a knowledge selector. The overview of EARL is presented in Figure 3.

Figure 3: Overview of EARL. The blue and red content denote two subject entities in the context, and the grey content represents the object entities in the knowledge graph. Entities are represented by the knowledge interpreter and stored in the memory module of the decoder, where subj_i and obj_{ij} denote the i-th subject entity and the j-th object entity corresponding to the i-th subject entity, respectively.

Instead of parameterizing specific representations for entities of knowledge graphs as in prior studies (Zhou et al., 2018; Wu et al., 2019; Tuan et al., 2019), EARL learns entity-agnostic representations conditioned on the context information of the conversation and the structure information of the knowledge graph. Entity-agnostic representations are defined as category representations for entities sharing the same context and structure information, covering two major circumstances. One is caused by the one-to-many mapping property of knowledge graphs (Fan et al., 2014; Xiao et al., 2016), where a subject has multiple objects with the same relation. As shown in the left knowledge graph in Figure 1, (Chuck Palahniuk, Write, Pygmy) and (Chuck Palahniuk, Write, Tell-All) have the one-to-many mapping property, and EARL learns the same category representation for Pygmy and Tell-All, which is suitable for a dialogue context inquiring about books by Chuck Palahniuk. The other is the circumstance in which the same dialogue context (after delexicalization) is grounded on different knowledge graphs, where the object is connected to the subject with the same relation.
For instance, the conversations in Figure 1 share the same context but different knowledge graphs, where The Selfish Gene in the right graph, Pygmy and Tell-All in the left graph possess the same relational structure (subj, Write, obj) in knowledge graphs, leading to the same category representations for these entities.
The visualizations of the entity embeddings of the two knowledge graphs in Figure 1, produced by EARL and TransE, are provided in Figure 2. EARL learns different representations for entities in different conversation contexts or with different knowledge graph structures, but the same representation for entities sharing the same context and relational structure information (Pygmy, Tell-All, and The Selfish Gene). In contrast, there is a gap between the TransE embeddings of these entities, making it difficult to utilize unseen entities.

Encoder
In the encoder, we apply a delexicalization step before encoding the context, which replaces entities in the context with tokens [MASKi], where i denotes the reverse order of entities in the context, designed to let our model concentrate on the most recently mentioned entities. The delexicalization process makes EARL entity-agnostic with respect to the conversation context, which enables our model to extend to unseen entities in knowledge graphs.
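The delexicalization step can be sketched as follows; the tokenization and the `delexicalize` helper are illustrative assumptions, with i assigned in reverse order of mention so the newest entity receives [MASK1]:

```python
# Minimal sketch of delexicalization: entities in the context are
# replaced by [MASKi], where i is the reverse order of appearance.
def delexicalize(tokens, entity_set):
    positions = [i for i, tok in enumerate(tokens) if tok in entity_set]
    masked = list(tokens)
    # reverse order: the last-mentioned entity gets the smallest index
    for rank, pos in enumerate(reversed(positions), start=1):
        masked[pos] = f"[MASK{rank}]"
    return masked

tokens = ["I", "love", "Fight Club", "and", "Pygmy"]
print(delexicalize(tokens, {"Fight Club", "Pygmy"}))
# -> ['I', 'love', '[MASK2]', 'and', '[MASK1]']
```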
After the delexicalization step, the context X is fed to a bi-directional encoder f_θ to obtain the hidden representations H = (h_1, h_2, ..., h_n) and the context representation h_X, defined as follows:

(h_1, h_2, ..., h_n) = f_θ(x_1, x_2, ..., x_n),  h_X = h_n,

where f_θ can be implemented by a Transformer (Vaswani et al., 2017) or the gated recurrent unit (GRU).

Knowledge Interpreter
After obtaining the hidden representations of the context, the knowledge interpreter is designed to represent each entity in the knowledge graph based on the context and the structure information. For each subject entity subj_i mentioned in the context, we retrieve the corresponding knowledge graph g_i, in which each object entity obj_{ij} is connected to the central entity (the subject) with relation rel_{ij}. To keep our model agnostic to entities, we do not learn embeddings for each entity. Instead, we represent the mentioned entity subj_i with the hidden representations of the context, and model the related object entity obj_{ij} by reasoning through the structure information of the knowledge graph g_i [1]. This process is defined as follows:

e(subj_i) = MLP(h_{s_i}),  e(obj_{ij}) = MLP([e(subj_i); e(rel_{ij})]),

where MLP denotes a multi-layer perceptron, h_{s_i} is the hidden representation at the position of subj_i, and e(subj_i), e(rel_{ij}), e(obj_{ij}) denote the embeddings of the subject entity, the relation, and the object entity, respectively.
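A minimal NumPy sketch of the knowledge interpreter, assuming single-layer MLPs, toy dimensions, and that a subject is read off the context hidden state at its mask position (all illustrative choices, not the paper's exact configuration):

```python
import numpy as np

# e(subj_i) = MLP(h at the [MASKi] position)
# e(obj_ij) = MLP([e(subj_i); e(rel_ij)])
rng = np.random.default_rng(0)
d = 8  # toy hidden / embedding size

W_s = rng.standard_normal((d, d))       # MLP weights for subject entities
W_o = rng.standard_normal((d, 2 * d))   # MLP weights for object entities

def relu(x):
    return np.maximum(x, 0.0)

def subject_embedding(h_mask):
    # represent the mentioned entity from the context hidden state
    return relu(W_s @ h_mask)

def object_embedding(e_subj, e_rel):
    # reason one hop through the graph: concatenate subject and relation
    return relu(W_o @ np.concatenate([e_subj, e_rel]))

h_mask = rng.standard_normal(d)
e_subj = subject_embedding(h_mask)
e_obj = object_embedding(e_subj, rng.standard_normal(d))
```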
Although the aforementioned method is able to represent the relevant entities related to the context, it cannot represent entities that are not mentioned in the context or not connected to any subject entity in the context by a path in the knowledge graph. In this case, we represent the entity i with the N_{r_i} relations connected to it by graph attention based on the hidden state h_X of the context, formulated as follows:

α_n = softmax(h_X^T e(rel_{in})),  e(subj_i) = Σ_{n=1}^{N_{r_i}} α_n e(rel_{in}),  e(obj_i) = MLP(e(subj_i)),

where e(rel_{in}) denotes the embedding of the n-th relation connected to the entity i, and e(subj_i) and e(obj_i) are two representations of the same entity i, serving as the subject and object entity embeddings used in the decoding process.
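The fallback graph-attention representation can be sketched as follows, assuming dot-product attention scores between h_X and the relation embeddings (an illustrative instantiation of the formulation above):

```python
import numpy as np

# An entity not linked to the context is represented by attending over
# the embeddings of its N_ri connected relations, conditioned on h_X.
rng = np.random.default_rng(1)
d = 8

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def fallback_subject_embedding(h_X, rel_embs):
    # alpha_n ∝ exp(h_X · e(rel_in)); e(subj_i) = sum_n alpha_n e(rel_in)
    scores = rel_embs @ h_X
    alpha = softmax(scores)
    return alpha @ rel_embs

h_X = rng.standard_normal(d)
rel_embs = rng.standard_normal((5, d))  # 5 relations connected to entity i
e_subj = fallback_subject_embedding(h_X, rel_embs)
```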

Decoder
The decoder g_θ is a unidirectional neural network with an attention mechanism (Vaswani et al., 2017) conditioned on the hidden representations of the context H, which updates its state as follows:

s_t = g_θ(s_{t−1}, e(y_{t−1}), c_t),

where c_t is the attention context over H. In order to generate related entities from knowledge graphs during decoding, a knowledge selector is designed to allow the decoder to select object entities from knowledge graphs or words from the vocabulary. Inspired by Tu et al. (2016), we also introduce a coverage mechanism to help the decoder avoid generating repetitive entities. The decoding process is formulated as follows:

g_t = sigmoid(W_g s_t),  P(y_t) = g_t P_e(y_t = obj_{ij}) + (1 − g_t) P_g(y_t = w_g),

where g_t ∈ [0, 1] is a scalar balancing the choice between an entity obj_i and a generic word w_g, P_g / P_e is the distribution over generic words / entities respectively, and P(y_t) is the final word decoding distribution.

[1] This method can be straightforwardly extended to represent an object entity connected to the subject entity in L hops as path_j = (subj_i, rel_{j1}, rel_{j2}, ..., rel_{jL}, obj_j). Due to the length limit, we leave it as future work.
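The gating step of the knowledge selector can be sketched with toy distributions; the helper names and the probabilities below are illustrative (the real model computes P_e and P_g from the decoder state):

```python
import math

# A sigmoid gate g_t mixes the entity distribution P_e with the generic
# vocabulary distribution P_g: P(y_t) = g_t * P_e + (1 - g_t) * P_g.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix(gate_logit, p_entity, p_generic):
    g = sigmoid(gate_logit)
    return {tok: g * p_entity.get(tok, 0.0) + (1 - g) * p_generic.get(tok, 0.0)
            for tok in set(p_entity) | set(p_generic)}

p_e = {"Pygmy": 0.9, "Tell-All": 0.1}   # toy distribution over entities
p_g = {"the": 0.6, "book": 0.4}          # toy distribution over generic words
p = mix(2.0, p_e, p_g)                   # high gate logit favors entities
```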

Loss Function
The loss function is the cross entropy between the predicted token distribution P(y_t) and the ground-truth distribution p_t in the training corpus. Additionally, we apply supervised signals on the knowledge selector to teacher-force the selection of entities or generic words. The loss on one sample <X = (x_1, x_2, ..., x_n), Y = (y_1, y_2, ..., y_m)> is defined as:

L = − Σ_{t=1}^{m} p_t log P(y_t) − λ ( (1/α) Σ_{t=1}^{m} q_t log g_t + (1/β) Σ_{t=1}^{m} (1 − q_t) log(1 − g_t) ),

where p_t is the one-hot vector of the ground-truth y_t, g_t is the probability of selecting an entity word over a generic word, q_t ∈ {0, 1} is the true choice of an entity word or a generic word in Y, and α and β are the numbers of entity words and generic words in a batch, respectively. The second term supervises the probability of selecting an entity word or a generic word, weighted by λ.
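A sketch of this loss under simplified assumptions (per-sample rather than per-batch normalization of α and β, purely for illustration):

```python
import math

# Token-level cross entropy plus the supervised selector term,
# weighted by lambda as described in the implementation details.
def earl_loss(token_probs, gates, selector_labels, lam=0.1):
    # token_probs: probability assigned to each gold token (p_t · P(y_t))
    ce = -sum(math.log(p) for p in token_probs)
    alpha = sum(selector_labels) or 1                       # entity words
    beta = (len(selector_labels) - sum(selector_labels)) or 1  # generic words
    sel = -sum(q * math.log(g) / alpha + (1 - q) * math.log(1 - g) / beta
               for g, q in zip(gates, selector_labels))
    return ce + lam * sel

# two decoding steps: one entity word (q=1), one generic word (q=0)
loss = earl_loss([0.5, 0.9], [0.8, 0.2], [1, 0])
```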

Datasets
We adopt two knowledge graph enhanced conversation generation datasets in our experiments. The DuConv dataset: a knowledge graph enhanced conversation dataset in Chinese proposed by Wu et al. (2019). It has 29,858 dialogues and 270,399 utterances in the domain of movies. DuConv constructs its knowledge graph from information crawled from a movie website as the external knowledge, containing 3,598,246 fact triples over 143,627 entities and 45 relations. However, only the training data is released with the knowledge information, which contains 19,858 dialogues. After filtering the noisy data, we randomly split the corpus into train (80%), validation (10%), and test (10%) sets. The test set consists of the seen test set (5%) and the unseen test set (5%): the former contains knowledge graphs that appeared during training, and the latter contains knowledge graphs whose subject entities and most object entities are unseen during training. The statistics are shown in Table 2.
The OpenDialKG dataset: a knowledge graph enhanced conversation dataset in English proposed by Moon et al. (2019). It has 15,673 dialogues and 91,209 utterances in four domains: Movies, Books, Sports, and Music. OpenDialKG uses the Freebase (Bast et al., 2014) knowledge graph as the external knowledge, which contains 1,190,658 fact triples over the top 100,813 entities and 1,358 relations. However, the released data only consists of 13,776 dialogues and contains some noisy data, e.g. empty utterances in a dialogue. After filtering the noisy data, we randomly split the corpus in the same way as DuConv. The statistics are presented in Table 2.
Baselines

We compare EARL with the following baselines:

• DIALOGPT: a pre-trained dialogue model (Zhang et al., 2020; Wang et al., 2020) based on Transformers, which is widely adopted in dialogue generation.
• MemNet: a knowledge-grounded model adapted from (Ghazvininejad et al., 2018), of which the memory units store word embeddings of knowledge triples.
• PostKS: a knowledge-grounded model selecting knowledge by prior and posterior distributions proposed by Wu et al. (2019), where we adopt word embeddings, instead of the RNN knowledge encoder, to represent knowledge triples.
• CopyNet: a copy network model (Zhu et al., 2017), which represents knowledge triples by word embeddings, and can copy words from knowledge triples or generate words from the vocabulary.
• CCM: a knowledge graph enhanced conversation model proposed by Zhou et al. (2018), which represents knowledge graphs using graph attention mechanisms based on the pretrained TransE (Bordes et al., 2013) embeddings.

Implementation Details
We used TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017) to implement our model and the baselines. We chose the RNN, implemented with GRUs, as the framework for EARL to make a fair comparison with baseline models, as most of them (Zhou et al., 2018; Wu et al., 2019) are implemented with GRUs. The encoder/decoder, f_θ/g_θ, has a 2-layer BiGRU/GRU structure with 512 hidden cells per layer, with separate parameters. DIALOGPT is initialized with pre-trained parameters (Zhang et al., 2020; Wang et al., 2020) and fine-tuned on the downstream datasets. Following prior studies (Zhou et al., 2018; Wu et al., 2019), we adopted greedy search as the decoding strategy. The λ in the loss function is set to 0.1 by manual tuning. The word embedding size is set to 300. The vocabulary size is limited to 19,000/30,000 for EARL and 24,000/56,000 for the baselines on the OpenDialKG/DuConv datasets, respectively. The TransE embedding size of entities and relations is set to 100. We used the stochastic gradient descent (SGD) algorithm with mini-batches. The batch size and learning rate are set to 100 and 0.5, respectively. Each model was trained for at most 20 epochs, and the training stage of each model took about one day on a GPU machine. We selected the model performing best on the validation set for evaluation on the test sets. Our code is available at: https://github.com/thu-coai/earl.

Automatic Evaluation
Metrics: We chose Entity (Zhou et al., 2018) to evaluate the ability to generate informative responses by calculating the number of entities per response. Distinct-n and Perplexity (PPL) (Serban et al., 2015) are adopted to evaluate the diversity of generated responses and the probability of generating the ground-truth responses. We computed Precision, Recall, and F1 scores between generated entities and ground-truth entities per response to evaluate the relevance of generated entities.

Results: As shown in Table 3, EARL obtains the best performance on most metrics on all the test sets, demonstrating that EARL can generate more informative, relevant, and diverse responses than baseline models based on both trained and untrained knowledge graphs. Specifically, EARL achieves the highest number of entities per generated response, nearly two times higher than the second-highest score obtained by CCM, indicating that EARL is able to generate more informative responses. Besides, EARL outperforms all the baselines on the Precision, Recall, and F1 metrics, showing that entities selected by EARL are more relevant to the ground-truth entities. Furthermore, the Distinct-3/4 scores of EARL are also higher than the baselines', demonstrating that EARL can generate more diverse responses.
DIALOGPT achieves the best performance in Perplexity, due to its pre-training process and large-scale parameters. However, it performs worse than the knowledge-grounded models on all metrics except Perplexity, as it lacks the ability to utilize the relevant knowledge. For Perplexity, EARL outperforms most knowledge-grounded baselines except CopyNet, as CopyNet learns the embeddings of each entity during training, while EARL does not parameterize any entities.
Compared to the seen test set, most of the baselines perform worse in Precision, Recall, and F1 scores on the unseen test set, leading to irrelevant generated entities, as this set contains knowledge graphs and entities that do not appear during training. The decrease in Precision, Recall, and F1 becomes larger from DuConv to OpenDialKG as the size of the knowledge graph increases (see Table 2). However, EARL achieves comparable performance on the unseen test set, where most scores are even slightly higher than those on the seen test set, indicating that EARL can utilize unseen entities in knowledge graphs during inference. As we provide CCM with the pre-trained TransE embeddings of the knowledge graphs in the unseen test set to build a strong baseline, its performance on the unseen test set does not decrease as much as the other baselines', albeit still worse than that of EARL.
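For reference, Distinct-n as used above can be computed roughly as follows (one common formulation; normalization details vary across papers):

```python
# Distinct-n: ratio of unique n-grams to all n-grams across responses.
def distinct_n(responses, n):
    ngrams = []
    for tokens in responses:
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

resp = [["the", "movie", "is", "great"], ["the", "movie", "is", "fun"]]
print(distinct_n(resp, 1))  # 5 unique unigrams out of 8 total
```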

Manual Evaluation
In order to better understand the quality of generated responses from the content and knowledge perspectives, we resorted to manual evaluation through crowdsourcing. 400 contexts were randomly sampled from the four test sets (100 samples for each test set) for manual annotation. We conducted pairwise comparisons between the response generated by EARL and the one generated by a baseline for the same context. In total, there are 1,200 pairs, since we chose the three baselines that achieve top performance in the automatic evaluation. For each response pair, three judges were hired to give a preference between the two responses in terms of the following two metrics. Ties were allowed. Note that system identifiers were masked during annotation.
Metrics: We adopted two widely used metrics, Appropriateness and Informativeness, as proposed in (Zhou et al., 2018). Appropriateness measures the quality of the generated response at the content level (whether the response is appropriate in relevance, coherence, and adequacy). Informativeness measures the quality of the generated response at the knowledge level (whether the response provides new information and relevant knowledge in response to the context).
Annotation Statistics: We calculated Fleiss' kappa (Fleiss, 1971) to measure inter-rater consistency. Fleiss' kappa for Appropriateness and Informativeness is 0.57 and 0.49, respectively, denoting "moderate agreement" among the annotations. We also calculated the agreement of the human annotators. For Appropriateness, the percentage of pairs for which at least two judges gave the same label (2/3 agreement) amounts to 97.5%, and the percentage for 3/3 agreement is 58.3%. For Informativeness, the percentage for at least 2/3 agreement is 95.7%, and that for 3/3 agreement is 51.0%.
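Fleiss' kappa as reported above can be computed from a per-item category-count matrix; a minimal sketch with three judges per pair (the data is illustrative):

```python
# Fleiss' kappa: ratings[i][j] counts how many of the n judges assigned
# item i to category j (here n = 3 judges per response pair).
def fleiss_kappa(ratings):
    n = sum(ratings[0])                       # raters per item
    N = len(ratings)                          # number of items
    # per-item agreement P_i, then mean observed agreement P_bar
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(p_i) / N
    # chance agreement P_e from the marginal category proportions
    totals = [sum(col) for col in zip(*ratings)]
    P_e = sum((t / (N * n)) ** 2 for t in totals)
    return (P_bar - P_e) / (1 - P_e)

# perfect agreement across three items and two labels gives kappa == 1
kappa = fleiss_kappa([[3, 0], [0, 3], [3, 0]])
```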

Results:
The results are shown in Table 4. The score is the percentage of cases in which EARL wins over a baseline after removing "Tie" pairs. EARL outperforms all the baselines in terms of both metrics on all the test sets, and achieves significantly better performance (sign test, p-value < 0.05) in most cases. EARL has an over 90% chance to win over MemNet in Informativeness on all the test sets, as MemNet cannot efficiently utilize the knowledge triples stored in its memory, leading to generic or irrelevant responses. For Appropriateness, the probabilities that EARL wins over MemNet are a little lower than those for Informativeness, as the generic or high-frequency responses generated by MemNet are usually fluent in grammar. Compared to MemNet, CopyNet performs slightly better in Informativeness but worse in Appropriateness, since generating more informative entities brings difficulties in fluent and coherent conversation generation. CCM performs best among all the baselines because it can introduce the knowledge graph information by taking the pre-trained TransE embeddings as input. However, its performance is still worse than EARL's, especially on the unseen test sets, as its usage of knowledge graphs and entities is not fine-tuned during training, which leads to the performance gap between the seen and unseen test sets.
Noticeably, the probabilities that EARL wins over the baselines are higher on the unseen test sets, as utilizing untrained knowledge graphs is relatively difficult for the baselines. However, this problem is alleviated by the entity-agnostic knowledge interpreter of EARL, which learns the representations of entities based on the context information and the knowledge graph structure. EARL's better performance on the unseen test sets demonstrates that EARL can utilize unseen entities in knowledge graphs and is suitable for informative knowledge-grounded conversation generation.

Conclusion and Future Work
In this paper, we present an entity-agnostic representation learning method to incorporate knowledge graphs into informative conversation generation. It learns to represent entities using the relational structure of the knowledge graph instead of parameterizing millions of entities directly, making it more suitable for applying large-scale unseen graphs. Automatic and manual evaluations show that EARL can generate appropriate and informative responses with both seen and unseen knowledge graphs as input.
In future work, we will explore the pre-trained knowledge-grounded conversation model based on EARL, which can incorporate the large-scale knowledge graphs with entities in multiple hops into conversation generation.

A.1 Ablation Study
In order to investigate the influence of the coverage and delexicalization mechanisms, we conducted ablation tests in which one of these mechanisms was removed from EARL at a time, as shown in Table 5. As we can see, EARL performs best on the Precision and Perplexity metrics, indicating that EARL, equipped with the coverage and delexicalization mechanisms, is able to generate more relevant entities and responses than the other alternatives.

Impact of Coverage Mechanism

EARL without the coverage mechanism achieves the best Distinct-n scores. However, the improvement in diversity is caused by repetitive entities in the generated responses, as the proportion of repetitive entities per response rises from 0.5%/4.2% to 23.9%/44.9% on the DuConv/OpenDialKG datasets after removing the coverage mechanism. Thus, adopting the coverage mechanism in the decoding process helps alleviate entity repetition and generate informative responses.

Impact of Delexicalization Mechanism
After removing the delexicalization mechanism, the performance in Precision and Perplexity decreases on all four test sets, though the Entity score increases on the OpenDialKG dataset. The reason is that EARL without delexicalization introduces noise in the encoding process, as the word embeddings of unseen entities are not fine-tuned during training. Besides, it causes a gap between the entity-agnostic representations of trained entities and unseen entities, as shown in Figure 2, leading to the decrease in Precision.

Impact of Conversation Frameworks

The Transformer-based EARL also generates informative responses and outperforms the baselines, including the large-scale pre-trained model DIALOGPT (see Table 3). However, it performs worse on the Precision, F1, and Perplexity metrics than the RNN-based EARL, which may be caused by the small datasets and model sizes (Zeyer et al., 2019; Chen et al., 2018). To make a fair comparison with baseline models, we implemented the Transformer-based EARL with 3 Transformer blocks, approximately equal to the baseline models in model size. We believe a larger corpus and deeper networks may further improve the performance of the Transformer-based EARL in future work.

A.2 Case Study
Sample conversations are shown in Figure 4. The text in red/blue denotes the entity of the provided knowledge, which appeared in the context/response. For the first conversation, a movie from the provided knowledge, Our Meal For Tomorrow, is recommended in the human response. However, Seq2Seq, DIALOGPT, and Transformer generated irrelevant movies, Demonic Toys, Journeys to the Bottom of the Sea, and Where's the Dragon?, without access to the provided knowledge. Although MemNet, PostKS, and CopyNet can take the knowledge as input, they also generated undesired entities, as they cannot learn a meaningful representation of the entity Our Meal For Tomorrow, which did not appear during training. CCM and our model EARL generated Our Meal For Tomorrow as the human did, since they can represent the entity with the relational structure. Notably, after removing the coverage mechanism, EARL w/o coverage generated Our Meal For Tomorrow twice; the repetitive text undermines the quality of the responses.
For the second conversation, the baselines generated irrelevant content as before. Although CCM can represent entities with the relational structure, it still generated undesired content such as "Seth Gordon is a great movie", because of the noise introduced by the word embeddings and knowledge representations of unseen entities. EARL utilized the unseen knowledge more effectively and generated "Seth Gordon directed Freakonomics" according to the knowledge (Seth Gordon, Direct, Freakonomics), as it learns entity-agnostic representations, which generalize better to unseen entities. After removing the delexicalization mechanism, EARL w/o delexical generated irrelevant content, due to the noise introduced by the word embeddings of the entity Seth Gordon, which causes a gap in entity-agnostic representations between seen and unseen entities.

Figure 4: Sample responses generated by all the models on the unseen test sets of DuConv (upper) and OpenDialKG (lower). The text in italics denotes the entity, and the text in red/blue denotes the entity of the provided knowledge, which appeared in the context/response. The original Chinese text of DuConv is presented in parentheses.